APPENDIX A
Installing Microsoft Visual Studio
Figure 1 Installing Visual Studio
Figure 3 Installation Path
In this step, specify where Visual Studio is to be installed. Then click Next and wait for the installation process to finish.
Figure 5 Restarting the Computer
Figure 7 Installation Process Complete
Installing MPICH2
Figure 9 Installation Process and Finishing the Setup
Follow the Next prompts after the setup window appears until the installation path window is shown, then specify where MPICH2 is to be installed. Click Next to start the installation, wait until it finishes, and click Finish.
Figure 10 Installing smpd and Validating MPI
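As an illustration (the exact options can differ between MPICH2 versions), the smpd service and the stored MPI credentials are typically checked from a command prompt with:
smpd -status
mpiexec -register
mpiexec -validate
smpd -status reports whether the process manager service is running, mpiexec -register stores the user name and password that MPI will use, and mpiexec -validate checks that the stored credentials are accepted.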
Configuring MPI in Visual Studio
Figure 11 Additional Include Directories.
Right-click the project in the Solution Explorer, then choose Properties. Under Configuration Properties, expand C/C++ and select General, then in the Additional Include Directories field enter the path to the OpenMPI include folder so that the MPI header can be found by the system.
Figure 13 Additional Dependencies.
Setting Up the Cluster Connection
Firewall Configuration
The firewall on each user's computer must be open, so that the MPI connections sent from the cluster computers are not blocked by the other computers.
Figure 14 Locating the Firewall via the search box.
Figure 16 Firewall Properties.
Then set the Firewall State to Off, so that inbound and outbound connections do not block the MPI connection when data is sent to the cluster or received from it.
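As an alternative to the dialog shown above, and assuming an elevated command prompt on each machine, the firewall state can usually also be changed with the built-in Windows command (re-enable it, or add specific MPI rules, once testing is done):
netsh advfirewall set allprofiles state off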
IP and User Credential Configuration
Figure 18 Network and Sharing Center.
Select Change adapter settings, then right-click Local Area Connection and choose Properties.
Figure 20 IPv4 Properties.
Figure 3.19 User Account on the host (PC 1) and the client (PC 2).
The user name and the password on PC 1 and PC 2 must be identical, so that PC 2 is detected when OpenMPI is executed and MPI can transfer data between PC 1 and PC 2.
Configuring Component Services
In the Start menu search box, type dcomcnfg.exe and press Enter, select Component Services, then open the Computers folder, right-click My Computer and choose Properties.
Figure 22 COM Security limits in My Computer Properties.
Click COM Security and choose Edit Limits. Here the users' connections to the main computer are configured, so that the PC's security grants Allow status to the users connected to the main computer. First add the users that will be given permission to access the main computer.
Figure 23 Select User search dialog.
Figure 24 Advanced Select User. Click Find Now to list the available users, select Everyone, then click OK.
Figure 25 Edit Permission for the selected user.
Connection Test and Running the MPI Application
Figure 26 Ping Test
Use the ping command followed by the IP address of a cluster computer to verify that the cluster connection has been established.
Figure 27 Running MPI from the Command Prompt
An application implemented with MPI is run from the command prompt with the following commands:
Local:
mpirun -np 2 file.exe
The number 2 in this command sets how many processes are simulated virtually on the local host; it can be replaced by any number of the form 2^n.
Cluster:
mpirun -np 2 -host host1,host2 file.exe
Figure 28 Task Manager on a Cluster Computer
Make sure that during execution with MPI the CPU usage on the cluster computers shows processing activity; this indicates that data is being processed on the cluster computers.
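To confirm that every host really joins the computation, a minimal MPI test program can be used; the sketch below is only an illustration (it is not one of the benchmark applications) and uses only standard MPI calls.

#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &namelen); /* host this process runs on */
    printf("Process %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

Compiled as, say, hello.exe, it can be launched with the same commands shown above, e.g. mpirun -np 2 -host host1,host2 hello.exe; each cluster computer should then appear once in the output.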
Setting Up Nvidia Nsight
The first step in using Nvidia Nsight is to have Visual Studio already installed on the user's PC, so that when the Nvidia Toolkit is installed the Nsight templates are integrated into Visual Studio's New Project dialog and can be used immediately. After the installation has succeeded, check whether the GPU hardware is compatible, i.e. whether or not it supports programming and running CUDA.
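One way to check this programmatically is a small device query; the sketch below is only an illustration using the standard CUDA runtime API and is not part of the thesis applications.

#include <stdio.h>
#include "cuda_runtime.h"
int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);            /* number of CUDA-capable GPUs */
    if (count == 0) {
        printf("No CUDA-capable GPU found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i); /* name and compute capability */
        printf("GPU %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}

The CUDA Toolkit also ships a deviceQuery sample that reports the same information.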
Figure 30 Searching the Code Samples for a GPU test.
To find out whether the GPU installed in the PC supports CUDA, open the NVIDIA CUDA Samples Browser, search with the keyword particles, and click Run on the Smoke Particles sample.
Figure 31 Smoke Particles code sample.
Figure 32 CUDA templates integrated into Visual Studio.
After the installation finishes, the installation summary lists the CUDA Nsight features and components that were successfully integrated into Visual Studio and onto the user's PC, and Visual Studio now provides the CUDA Runtime project template.
Figure 33 CUDA path in the environment variables.
Running a CUDA Application
When the CUDA application is executed, make sure the GPU is actually working by using GPU-Z or CUDA-Z; these applications read the GPU processor sensors and show the activity to the user.
Figure 34 Running a CUDA Application
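For reference, and assuming nvcc from the CUDA Toolkit is on the PATH, a CUDA source file is typically compiled from the command prompt before execution, for example:
nvcc file.cu -o file.exe
file.exe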
APPENDIX B
Source Code CPU Computing
Sorting
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h>
void quicksort(float [10],int,int); int main() { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; QueryPerformanceFrequency(&frequency); int size,i; float *x;
float aa = 100.0;
printf("Enter size of the array: "); scanf("%d",&size);
x = (float *)malloc( (size+1)*sizeof(float) ); for(i=0;i<size;i++)
{
x[i]=((float)rand()/(float)(RAND_MAX)) * aa; }
QueryPerformanceCounter(&t1); quicksort(x,0,size-1);
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n%f ms\n",elapsedTime); system("pause");
return 0; }
if(i<j) { temp=x[i]; x[i]=x[j]; x[j]=temp; } } temp=x[pivot]; x[pivot]=x[j]; x[j]=temp; quicksort(x,first,j-1); quicksort(x,j+1,last); } }
Binary Search
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> int main() { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; int c,n;
int first, last, middle; float search;
double *array; float c2=1.25;
printf("number of elements\n"); scanf("%d",&n);
array = (double *)malloc((n+1) * sizeof(double)); //printf("Enter %d integers\n", n);
QueryPerformanceFrequency(&frequency); for ( c = 0 ; c < n ; c++ )
{
array[c]=c2; c2=c2+1.25; }
printf("\nvalue to find\n"); scanf("%f",&search);
first = 0;
last = n - 1;
middle = (first+last)/2; QueryPerformanceCounter(&t1); while( first <= last )
{
if ( array[middle] < search ){ first = middle + 1;}
else if ( array[middle] == search ){
break;} else
{
last = middle - 1; }
middle = (first + last)/2; }
if ( first > last )
{ printf("Not found! %d is not present in the list.\n",
search); }
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart; printf("\n\n\n%f ms\n",elapsedTime); system("pause"); return 0; }
Matrix Multiplication
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> int main()
{ //FLOATING
int i, j, k;
double **mat1, **mat2, **res; long n;
float aa = 5.0;
LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime;
// get the order of the matrix from the user printf("Size of matrix:");
scanf("%d", &n);
QueryPerformanceFrequency(&frequency);
// dyamically allocate memory to store elements mat1 = (double **)malloc(sizeof(double) * n); mat2 = (double **)malloc(sizeof(double) * n); res = (double **) malloc(sizeof(double) * n);
for (i = 0; i < n; i++) {
mat1[i] = (double *)malloc(sizeof(double) * n); mat2[i] = (double *)malloc(sizeof(double) * n); res[i] = (double *)malloc(sizeof(double) * n); }
// get the input matrix printf("\n");
for (i = 0; i < n; i++) {
//mat1[i][j] = rand() % 10 +1; mat1[i][j] =
((float)rand()/(float)(RAND_MAX)) * aa; }
}
printf("matrix 1:\n"); for(int aa=0; aa<n ; aa++) {
for(int bb=0; bb<n ;bb++) { printf("%.2f ",mat1[aa][bb]); } printf("\n"); } printf("\n");
// get the input for second matrix from the user printf("matrix 2:\n");
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
//mat2[i][j] = rand() % 10 +1;
mat2[i][j]=((float)rand()/(float)(RAND_MAX)) * aa; }
}
for(int aa=0; aa<n ; aa++) {
for(int bb=0; bb<n ;bb++) { printf("%.2f ",mat2[aa][bb]); } printf("\n"); } QueryPerformanceCounter(&t1); // multiply first and second matrix for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) { *(*(res + i) + j) = 0;
for (k = 0; k < n; k++) {
*(*(res + i) + j) = *(*(res + i) + j) + (*(*(mat1 + i) + k) * *(*(mat2 + k) + j)); }
} }
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n\n%f ms\n",elapsedTime); // print the result
printf("\nResult :\n"); for (i = 0; i < n; i++) { for (j = 0; j < n; j++) {
printf("%.2f ", *(*(res + i) + j)); }
}
free(mat1); free(mat2); free(res);
system("pause"); return 0;
}
Gauss Jordan Elimination
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> #include <math.h> #include <malloc.h>
int main() {
int i, j, n;
double **a, *b, *x; LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime;
void gauss_jordan(int n, double **a, double *b, double *x); printf("\nNumber of equations: ");
scanf("%d", &n); float aa = 10.0;
QueryPerformanceFrequency(&frequency);
x = (double *)malloc( (n+1)*sizeof(double) ); b = (double *)malloc( (n+1)*sizeof(double) ); a = (double **)malloc( (n+1)*sizeof(double *) ); for(i = 1; i <= n; i++)
a[i] = (double *)malloc( (n+1)*sizeof(double) );
for(i = 1; i <= n; i++) {
for(j = 1; j <= n; j++) {
//a[i][j]=rand()%10 + 1;
a[i][j]=((float)rand()/(float)(RAND_MAX)) * aa; }
//b[i]=rand()%10 + 1;
b[i]=((float)rand()/(float)(RAND_MAX)) * aa;
}
for(int bb = 1 ; bb<=n ; bb++) {
printf("%.1f ",a[aa][bb]); }
printf(" %.1f ",b[aa]); printf("\n");
}
printf("\n\n");
QueryPerformanceCounter(&t1); gauss_jordan(n, a, b, x); QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n\n%f ms\n",elapsedTime);
printf("\nSolution\n");
printf("---\n"); printf("x = (");
for(i = 1; i <= n-1; i++) printf("%lf, ", x[i]); printf("%lf)\n\n", x[n]);
system("pause"); return(0);
}
void gauss_jordan(int n, double **a, double *b, double *x)
{
int i, j, k; int p;
double factor; double big, dummy;
for(k = 1; k <= n; k++) {
// pivoting if(k < n) {
p = k;
big = fabs(a[k][k]); for(i = k+1; i <= n; i++) {
if(big < fabs(a[i][k])) {
big = fabs(a[i][k]); p = i;
} }
if(p != k) {
for(j = 1; j <= n; j++) {
a[p][j] = a[k][j]; a[k][j] = dummy; }
dummy = b[p]; b[p] = b[k]; b[k] = dummy; }
}
// Gauss-Jordan elimination factor = a[k][k];
for(j = 1; j <= n; j++) a[k][j] /= factor;
b[k] /= factor;
for(i = 1; i <= n; i++) {
if(i == k) continue;
factor = a[i][k];
for(j = 1; j <= n; j++) a[i][j] -=
a[k][j]*factor;
b[i] -= b[k]*factor; }
}
for(i = 1; i <= n; i++) x[i] = b[i];
return; }
Source Code GPU Computing
Sorting
#include "cuda_runtime.h"
#include "device_launch_parameters.h" #include <iostream>
#include <windows.h> using namespace std; #include <cuda.h> #include <stdio.h> #include <stdlib.h> #include <conio.h>
#include <cuda_runtime_api.h> //#define NUM 8
__device__ inline void swap(float & a, float & b) {
float tmp = a; a = b; b = tmp; }
__global__ void bitonicSort(float * values, float N) {
extern __shared__ float shared[]; const unsigned int tid = threadIdx.x; shared[tid] = values[tid];
for (unsigned int k = 2; k <= N; k *= 2) {
for (unsigned int j = k / 2; j>0; j /= 2) {
unsigned int ixj = tid ^ j; if (ixj > tid)
{
if ((tid & k) == 0) {
if (shared[tid] > shared[ixj]) { swap(shared[tid], shared[ixj]); } } else {
if (shared[tid] < shared[ixj]) { swap(shared[tid], shared[ixj]); } } } } }
values[tid] = shared[tid]; }
int main(void) {
cudaEvent_t start, stop; float time;
float * dvalues; float * values; double NUM; float aa = 5.0; scanf("%d",&NUM);
values = (float *)malloc( (NUM+1)*sizeof(float) ); size_t size = NUM * sizeof(int);
for(int i = 0; i < NUM; i++) {
//values[i]=rand()%10 + 1;
values[i] = ((float)rand()/(float)(RAND_MAX)) * aa;
}
/*printf("\n nilai awal: ");
cudaMemcpy(dvalues, values, size , cudaMemcpyHostToDevice); cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord(start,0);
bitonicSort<<<1, NUM, size >>>(dvalues,NUM); cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaMemcpy(values, dvalues, size, cudaMemcpyDeviceToHost); cudaFree(dvalues);
/*printf("\n hasil pengurutan: ");
for (int i=0; i<NUM; i++) printf(" %i",values[i]);*/ printf("%f ms\n",time); printf("\n"); system("pause"); } Binary Search #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> #include <assert.h>
__device__ int get_index_to_check(int thread, int num_threads, int set_size, int offset) {
return (((set_size + num_threads) / num_threads) * thread) + offset;
}
__global__ void p_ary_search(float search, int array_length, int *arr, int *ret_val ) {
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x; int set_size = array_length;
while(set_size != 0){
int offset = ret_val[1];
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
if (index_to_check < array_length){ int next_index_to_check =
get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length){ next_index_to_check = array_length - 1; }
if (search > arr[index_to_check] && (search < arr[next_index_to_check])) {
else if (search == arr[index_to_check]) {
ret_val[0] = index_to_check; }
}
set_size = set_size / num_threads;
} }
float chop_position(float search, float *search_array, int array_length)
{
float time;
cudaEvent_t start, stop;
int array_size = array_length * sizeof(int); if (array_size == 0) return -1;
int *dev_arr;
cudaMalloc((void**)&dev_arr, array_size); cudaMemcpy(dev_arr, search_array, array_size, cudaMemcpyHostToDevice);
int *ret_val = (int*)malloc(sizeof(int) * 2);
ret_val[0] = -1; // return value ret_val[1] = 0; // offset
array_length = array_length % 2 == 0 ? array_length : array_length - 1; // array size
int *dev_ret_val;
cudaMalloc((void**)&dev_ret_val, sizeof(int) * 2); cudaMemcpy(dev_ret_val, ret_val, sizeof(int) * 2, cudaMemcpyHostToDevice);
// Launch kernel
cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start,0);
p_ary_search<<<16, 64>>>(search, array_length, dev_arr, dev_ret_val);
cudaEventRecord(stop,0); cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop); // Get results
cudaMemcpy(ret_val, dev_ret_val, 2 * sizeof(int), cudaMemcpyDeviceToHost);
int ret = ret_val[0];
printf("\nFound %i\n",ret_val[1]);
printf("\nElapsed Time : %f ms",time); // Free memory on device
free(ret_val);
return ret; }
static float * build_array(int length) {
float *ret_val = (float*)malloc(length * sizeof(float));
for (int i = 0; i < length; i++) {
ret_val[i] = (i * 2 + 0.5) - 1; //ret_val[i] = i;
printf("%.2f ",ret_val[i]); } return ret_val; }
static void test_array(int length, float search, float index) {
printf("Length %i Search %.2f\n", length, search);
assert(index == chop_position(search, build_array(length), length) && "test_small_array()");
}
static void test_arrays() {
int length; float search;
scanf("%d",&length); scanf("%f",&search);
Matrix Multiplication
#include <cuda_runtime_api.h>
#define BLOCK_SIZE 100
__global__ void gpuMM(float *A, float *B, float *C, int N) {
int row = blockIdx.y*blockDim.y + threadIdx.y; int col = blockIdx.x*blockDim.x + threadIdx.x;
float sum = 0.f;
for (int n = 0; n < N; ++n)
sum += A[row*N+n]*B[n*N+col];
C[row*N+col] = sum; }
int main(int argc, char *argv[]) { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; int N,K,L; awal: scanf("%d",&L); if(L < 1000) {
printf("Input must be greater than 1000\n"); goto awal;
}
K = L/100;
N = K*BLOCK_SIZE; float time;
cudaEvent_t start, stop; float *hA,*hB,*hC;
hA = new float[N*N]; hB = new float[N*N]; hC = new float[N*N]; float aa=5.0;
for (int j=0; j<N; j++){ for (int i=0; i<N; i++){
hA[j*N+i] = ((float)rand()/(float)(RAND_MAX)) * aa; hB[j*N+i] = ((float)rand()/(float)(RAND_MAX)) * aa;
} }
int size = N*N*sizeof(float); // Size of the memory in
// Copy matrices from the host to device
cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice); cudaMemcpy(dB,hB,size,cudaMemcpyHostToDevice);
//Execute the matrix multiplication kernel cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord(start,0);
gpuMM<<<grid,threadBlock>>>(dA,dB,dC,N);
cudaEventRecord(stop,0); cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
float *C;
C = new float[N*N];
cudaMemcpy(C,dC,size,cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);
printf("%f ms\n",time); system("pause");
}
Gauss Jordan Elimination
main.cpp
#include<stdio.h> #include<conio.h> #include<stdlib.h> #include "Common.h"
int main(int argc , char **argv) {
float *a_h = NULL ; float *b_h = NULL ;
float *result , sum ,rvalue ; int numvar ,j ;
float aa = 5.0; numvar = 0;
scanf("%d",&numvar);
a_h = (float*)malloc(sizeof(float)*numvar*(numvar+1)); b_h = (float*)malloc(sizeof(float)*numvar*(numvar+1)); int ii=0;
for(int i = 1; i <= numvar; i++) {
{
//a_h[ii]=rand()%10 + 1;
a_h[ii]=((float)rand()/(float)(RAND_MAX)) * aa; ii++;
}
}
//Calling device function to copy data to device DeviceFunc(a_h , numvar , b_h);
//Showing the data printf("\n\n");
/*for(int i =0 ; i< numvar ;i++) {
for(int j =0 ; j< numvar+1; j++) {
printf("%.2f ",b_h[i*(numvar+1) + j]); }
printf("\n"); } */
//Using Back substitution method
result = (float*)malloc(sizeof(float)*(numvar)); for(int i = 0; i< numvar;i++)
{
result[i] = 1.0; }
for(int i=numvar-1 ; i>=0 ; i--) {
sum = 0.0 ;
for( j=numvar-1 ; j>i ;j--) {
sum = sum + result[j]*b_h[i*(numvar+1) + j]; }
rvalue = b_h[i*(numvar+1) + numvar] - sum ; result[i] = rvalue / b_h[i *(numvar+1) + j]; }
//Tampil hasil
/*for(int i =0;i<numvar;i++) {
#include <cuda.h> #include "Common.h" #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h>
__global__ void Kernel(float *, float * ,int );
void DeviceFunc(float *temp_h , int numvar , float *temp1_h) {
float time;
float *a_d , *b_d; LARGE_INTEGER frequency;
LARGE_INTEGER t1,t2; double elapsedTime; cudaEvent_t start, stop;
//Memory allocation on the device
cudaMalloc(&a_d,sizeof(float)*(numvar)*(numvar+1)); cudaMalloc(&b_d,sizeof(float)*(numvar)*(numvar+1));
//Copying data to device from host cudaMemcpy(a_d, temp_h,
sizeof(float)*numvar*(numvar+1),cudaMemcpyHostToDevice);
//Defining size of Thread Block dim3 dimBlock(numvar+1,numvar,1); dim3 dimGrid(1,1,1); //Kernel call cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start,0);
Kernel<<<dimGrid , dimBlock>>>(a_d , b_d , numvar); cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
//Coping data to host from device
cudaMemcpy(temp1_h,b_d,sizeof(float)*numvar*(numvar+1),cudaMemcpyD eviceToHost);
__global__ void Kernel(float *a_d , float *b_d ,int size) {
int idx = threadIdx.x ; int idy = threadIdx.y ; //int width = size ; //int height = size ;
//Allocating memory in the share memory of the device __shared__ float temp[16][16];
//Copying the data to the shared memory
temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
for(int i =1 ; i<size ;i++) {
if((idy + i) < size) {
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]); temp[i+idy][idx] = temp[i-1][idx] +((var1) *
(temp[i+idy ][idx])); }
}
b_d[idy*(size+1) + idx] = temp[idy][idx]; }
Common.h
#ifndef __Common_H
#define __Common_H
void getvalue(float ** ,int *);
void DeviceFunc(float * , int , float *);
#endif
Source Code Cluster Computing
Sorting
#include <stdio.h> #include <stdlib.h> #include <mpi.h> #define DEBUG #define ROOT 0
#define ISPOWER2(x) (!((x)&((x)-1)))
float *merge(float array1[], float array2[], float size) { float *result = (float *)malloc(2*size*sizeof(float)); int i=0, j=0, k=0;
while ((i < size) && (j < size))
result[k++] = array1[i++]; while (j < size)
result[k++] = array2[j++]; return result;
}
float sorted(float array[], float size) { int i;
for (i=1; i<size; i++) if (array[i-1] > array[i]) return 0;
return 1; }
int compare(const void *p1, const void *p2) { return *(float *)p1 - *(float *)p2;
}
int main(int argc, char** argv) { int i, b=1, npes, myrank;
long datasize;
float localsize, *localdata, *otherdata, *data = NULL; int active = 1;
MPI_Status status;
double start, finish, p, s; MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &npes);
datasize = strtol(argv[1], argv, 10);
if (!ISPOWER2(npes)) {
if (myrank == ROOT) printf("Processor number must be power of two.\n");
return MPI_Finalize(); }
if (datasize%npes != 0) {
if (myrank == ROOT) printf("Datasize must be divisible by processor number.\n");
return MPI_Finalize(); }
if (myrank == ROOT) {
data = (float *)malloc(datasize * sizeof(float)); for (i = 0; i < datasize; i++)
data[i] = rand()%99 + 1; }
start = MPI_Wtime();
localsize = datasize / npes;
localdata = (float *) malloc(localsize * sizeof(float)); MPI_Scatter(data, localsize, MPI_INT, localdata, localsize, MPI_INT,
ROOT, MPI_COMM_WORLD);
while (b < npes) { if (active) {
if ((myrank/b)%2 == 1) {
MPI_Send(localdata, b * localsize, MPI_INT, myrank - b, 1, MPI_COMM_WORLD);
free(localdata); active = 0; } else {
otherdata = (float *) malloc(b * localsize * sizeof(float)); MPI_Recv(otherdata, b * localsize, MPI_INT, myrank + b, 1, MPI_COMM_WORLD, &status);
localdata = merge(localdata, otherdata, b * localsize); free(otherdata);
} }
b <<= 1; }
finish = MPI_Wtime();
if (myrank == ROOT) { #ifdef DEBUG
if (sorted(localdata, npes*localsize)) { printf("\nParallel sorting succeed.\n\n"); } else {
printf("\nParallel sorting failed.\n\n"); }
#endif
free(localdata); p = finish - start;
printf(" Parallel : %.8f\n", p);
/*start = MPI_Wtime();
qsort(data, datasize, sizeof(float), compare); finish = MPI_Wtime();*/
free(data); } return MPI_Finalize(); }
Binary Search
#include "mpi.h" #include <iostream> #include <math.h>
using namespace std;
int main(int argc,char **argv) {
const int Tag_Max=3; int max;
double MaxInAll; int MyId, P;
double* A;
int ArrSize, Target; int n, Start;
int i, x;
int Source, dest, Tag; int WorkersDone = 0 ;
double start, finish, p; MPI_Status RecvStatus;
MPI_Init(&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &MyId); MPI_Comm_size (MPI_COMM_WORLD, &P);
start = MPI_Wtime(); //start working.. if (MyId == Master) {
cout<<"This is the master process on "<<P<<" Processes\n"; MaxInAll=0;
int GlobIndx;
cout<<"Enter the number of elements you want to generate..";
cin>> ArrSize;
A = new double[ArrSize];
srand ( P ); /* initialize random seed: */ for ( i= 0; i<ArrSize; i++)
{
A[i] = i+1.25;
}
n = ArrSize/(P-1);
for( i = 1; i < P; i++) {
dest = i; if (i == P-1)
n = ArrSize - (n*(P-2)); Tag = Tag_Size;
MPI_Send(&n, 1, MPI_DOUBLE, dest, Tag, MPI_COMM_WORLD);
Start = (i - 1) * ( ArrSize/(P-1) ); MPI_Send(A+Start, n, MPI_DOUBLE, dest, Tag, MPI_COMM_WORLD);
}
WorkersDone = 0; int MaxIndex = 0;
while (WorkersDone < P-1 ) {
MPI_Recv(&x, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &RecvStatus);
Source = RecvStatus.MPI_SOURCE; Tag = RecvStatus.MPI_TAG; if (Tag == Tag_Max)
{
GlobIndx = (Source - 1)*(ArrSize/(P-1) ) + x; if ( A[GlobIndx] > MaxInAll)
{
MaxInAll = A[GlobIndx]; MaxIndex = GlobIndx; } WorkersDone++; } } if(WorkersDone==P-1)
cout << "Process "<<Source<<" found the max of the array "<< MaxInAll<<" at index "<<MaxIndex;
delete [] A; }
else {
max=0;
cout<<"Process "<<MyId<<" is alive...\n"; Source = Master;
Tag = Tag_Size;
MPI_Recv(&n, 1, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
A = new double[n]; Tag = Tag_Data;
MPI_Recv(A, n, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
cout<<"Process "<<MyId<< "Received "<<n<<" data elements\n";
int max_i; i = 0;
while (i<n ) {
}
dest = Master; Tag = Tag_Max;
cout<<"Process "<<MyId<< " has max equals "<<max<<endl; MPI_Send(&max_i, 1, MPI_DOUBLE, dest, Tag,
MPI_COMM_WORLD); delete [] A; }
finish = MPI_Wtime(); if (MyId == 0)
{
p = finish - start;
printf(" Parallel : %.8f\n", p); }
MPI_Finalize(); return 0;
}
Matrix Multiplication
#include <stdio.h> #include "mpi.h"
#define N 5000 /* number of rows and columns in matrix */
MPI_Status status;
double a[N][N],b[N][N],c[N][N]; int main(int argc, char **argv) {
double start, finish, p; int
numtasks,taskid,numworkers,source,dest,rows,offset,i,j,k,remainPar t,originalRows;
//struct timeval start, stop; MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); numworkers = numtasks-1;
start = MPI_Wtime(); if (taskid == 0) { for (i=0; i<N; i++) { for (j=0; j<N; j++) { a[i][j]= 1.25;
}
//gettimeofday(&start, 0);
/* send matrix data to the worker tasks */ rows = N/numworkers;
offset = 0;
remainPart = N%numworkers;
for (dest=1; dest<=numworkers; dest++) {
if (remainPart > 0) {
originalRows = rows; ++rows;
remainPart--;
MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1, MPI_COMM_WORLD);
MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD); offset = offset + rows;
rows = originalRows; }
else {
MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1, MPI_COMM_WORLD);
MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD); offset = offset + rows;
} }
/* wait for results from all worker tasks */ for (i=1; i<=numworkers; i++)
{ source = i;
MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&c[offset][0], rows*N, MPI_DOUBLE, source, 2, MPI_COMM_WORLD, &status); } }
if (taskid > 0) { source = 0;
MPI_Recv(&offset, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD, &status);
/* Matrix multiplication */ for (k=0; k<N; k++)
for (i=0; i<rows; i++) { c[i][k] = 0.0;
for (j=0; j<N; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k]; }
MPI_Send(&offset, 1, MPI_INT, 0, 2, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, 0, 2, MPI_COMM_WORLD); MPI_Send(&c, rows*N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); }
finish = MPI_Wtime(); if (taskid == 0) {
p = finish - start;
printf(" Parallel : %.8f\n", p); }
MPI_Finalize();
}
Gauss Jordan Elimination
#include <stdlib.h> #include <stdio.h> #include <iostream> #include "mpi.h"
double serial_gaussian( double *A, double *b, double *y, int n ) {
int i, j, k;
double tstart = MPI_Wtime();
for( k=0; k<n; k++ ) { for( j=k+1; j<n; j++ ) { if( A[k*n+k] != 0)
A[k*n+j] = A[k*n+j] / A[k*n+k]; else
A[k*n+j] = 0; }
if( A[k*n+k] != 0 )
y[k] = b[k] / A[k*n+k]; else
A[k*n+k] = 1.0;
for( i=k+1; i<n; i++ ) { for( j=k+1; j<n; j++ )
A[i*n+j] -= A[i*n+k] * A[k*n+j];
b[i] -= A[i*n+k] * y[k]; A[i*n+k] = 0.0;
} }
return tstart; }
void print_equations( double *A, double *y, int n ) {
int i, j;
for( i=0; i<n; i++ ) { for( j=0; j<n; j++ ) { if( A[i*n+j] != 0 ) {
std::cout << A[i*n+j] << "x" << j; if( j<n-1 ) std::cout << " + "; }
else
std::cout << " "; }
std::cout << " = " << y[i] << std::endl; }
}
int main( int argc, char *argv[] ) {
double *A, *b, *y, *a, *tmp, *final_y; // var decls
int i, j, n, row, r;
double tstart, tfinish, TotalTime; // timing decls
float aa = 5.0; if( argc < 2 ) {
std::cout << "Usage\n";
std::cout << " Arg1 = number of equations / unkowns\n"; return -1;
}
n = atoi(argv[1]);
A = new double[n*n]; // space for matricies
b = new double[n]; y = new double[n];
for( i=0; i<n; i++ ) { // creates a matrix of random
b[i] = 0.0;
for( j=0; j<n; j++ ) {
r = ((float)rand()/(float)(RAND_MAX)) * aa; A[i*n+j] = r;
} }
MPI_Init (&argc,&argv); // Initialize MPI
MPI_Comm com = MPI_COMM_WORLD;
int size,rank; // Get rank/size info MPI_Comm_size(com,&size);
MPI_Comm_rank(com,&rank);
int manager = (rank == 0); if (size == 1)
tstart = serial_gaussian ( A, b, y, n); else
{
if ( ( n % size ) != 0 ) {
std::cout << "Unknowns must be multiple of processors." << std::endl;
return -1; }
int np = (int) n/size; a = new double[n*np]; tmp = new double[n*np];
if ( manager ) {
tstart = MPI_Wtime(); final_y = new double[n];
}
MPI_Scatter(A,n*np,MPI_INT,a,n*np,MPI_INT,0,com);
for ( i=0; i < (rank*np); i++ ) {
MPI_Bcast(tmp,n,MPI_INT,i/np,com); MPI_Bcast(&(y[i]),1,MPI_INT,i/np,com);
for (row=0; row<np; row++) {
for ( j=i+1; j<n; j++ )
a[row*n+j] = a[row*n+j] - a[row*n+i]*tmp[j]; b[rank*np+row] = b[rank*np+row] - a[row*n+i]*y[i]; a[row*n+i] = 0;
} }
for (row=0; row<np; row++) {
{
a[row*n+j] = a[row*n+j] / a[row*n+np*rank+row]; }
y[rank*np+row] = b[rank*np+row] / a[row*n+rank*np+row]; a[row*n+rank*np+row] = 1;
for ( i=0; i<n ; i++ ) tmp[i] = a[row*n+i];
MPI_Bcast (tmp,n,MPI_INT,rank,com);
MPI_Bcast (&(y[rank*np+row]),1,MPI_INT,rank,com);
for ( i=row+1; i<np; i++) {
for ( j=rank*np+row+1; j<n; j++ )
a[i*n+j] = a[i*n+j] - a[i*n+row+rank*np]*tmp[j]; b[rank*np+i] = b[rank*np+i] -
a[i*n+row+rank*np]*y[rank*np+row]; a[i*n+row+rank*np] = 0;
} }
for (i=(rank+1)*np ; i<n ; i++) {
MPI_Bcast (tmp,n,MPI_INT,i/np,com); MPI_Bcast (&(y[i]),1,MPI_INT,i/np,com); }
MPI_Barrier(com);
MPI_Gather(a,n*np,MPI_INT,A,n*np,MPI_INT,0,com);
MPI_Gather(&(y[rank*np]),np,MPI_INT,final_y,np,MPI_INT,0,com);
y = final_y;
}
if (manager || (size==1) ) {
tfinish = MPI_Wtime();
TotalTime = tfinish - tstart; printf("%f",TotalTime);
std::cout << std::endl;
}