APPENDIX A
Installing Microsoft Visual Studio
Figure 1 Installing Visual Studio
Figure 3 Installation Path
In this step, specify where Visual Studio is to be installed. Then click Next and wait for the installation process to finish.
Figure 5 Restarting the Computer
Figure 7 Installation Process Complete
Installing MPICH2
Figure 9 Installation Process and Finishing the Setup
Follow the Next prompts after the setup window appears until the installation path window is shown, then specify where MPICH2 is to be installed. Click Next to start the installation, wait until it finishes, and click Finish.
Figure 10 Installing smpd and Validating MPI
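As an illustration (the exact options can differ between MPICH2 versions), the smpd service and the stored MPI credentials are typically checked from a command prompt with:
smpd -status
mpiexec -register
mpiexec -validate
smpd -status reports whether the process manager service is running, mpiexec -register stores the user name and password that MPI will use, and mpiexec -validate checks that the stored credentials are accepted.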
Configuring MPI in Visual Studio
Figure 11 Additional Include Directories.
Right-click the project in the Solution Explorer, then choose Properties. Under Configuration Properties, expand C/C++ and select General, then in the Additional Include Directories field enter the path to the OpenMPI include folder so that the MPI header can be found by the system.
Figure 13 Additional Dependencies.
Setting Up the Cluster Connection
Firewall Configuration
The firewall on each user's computer must be open, so that the MPI connections sent from the cluster computers are not blocked by the other computers.
Figure 14 Locating the Firewall via the search box.
Figure 16 Firewall Properties.
Then set the Firewall State to Off, so that inbound and outbound connections do not block the MPI connection when data is sent to the cluster or received from it.
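As an alternative to the dialog shown above, and assuming an elevated command prompt on each machine, the firewall state can usually also be changed with the built-in Windows command (re-enable it, or add specific MPI rules, once testing is done):
netsh advfirewall set allprofiles state off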
IP and User Credential Configuration
Figure 18 Network and Sharing Center.
Select Change adapter settings, then right-click Local Area Connection and choose Properties.
Figure 20 IPv4 Properties.
Figure 3.19 User Account on the host (PC 1) and the client (PC 2).
The user name and the password on PC 1 and PC 2 must be identical, so that PC 2 is detected when OpenMPI is executed and MPI can transfer data between PC 1 and PC 2.
Configuring Component Services
In the Start menu search box, type dcomcnfg.exe and press Enter, select Component Services, then open the Computers folder, right-click My Computer and choose Properties.
Figure 22 COM Security limits in My Computer Properties.
Click COM Security and choose Edit Limits. Here the users' connections to the main computer are configured, so that the PC's security grants Allow status to the users connected to the main computer. First add the users that will be given permission to access the main computer.
Figure 23 Select User search dialog.
Figure 24 Advanced Select User. Click Find Now to list the available users, select Everyone, then click OK.
Figure 25 Edit Permission for the selected user.
Connection Test and Running the MPI Application
Figure 26 Ping Test
Use the ping command followed by the IP address of a cluster computer to verify that the cluster connection has been established.
Figure 27 Running MPI from the Command Prompt
An application implemented with MPI is run from the command prompt with the following commands:
Local:
mpirun -np 2 file.exe
The number 2 in this command sets how many processes are simulated virtually on the local host; it can be replaced by any number of the form 2^n.
Cluster:
mpirun -np 2 -host host1,host2 file.exe
Figure 28 Task Manager on a Cluster Computer
Make sure that during execution with MPI the CPU usage on the cluster computers shows processing activity; this indicates that data is being processed on the cluster computers.
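To confirm that every host really joins the computation, a minimal MPI test program can be used; the sketch below is only an illustration (it is not one of the benchmark applications) and uses only standard MPI calls.

#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &namelen); /* host this process runs on */
    printf("Process %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

Compiled as, say, hello.exe, it can be launched with the same commands shown above, e.g. mpirun -np 2 -host host1,host2 hello.exe; each cluster computer should then appear once in the output.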
Setting Up Nvidia Nsight
The first step in using Nvidia Nsight is to have Visual Studio already installed on the user's PC, so that when the Nvidia Toolkit is installed the Nsight templates are integrated into Visual Studio's New Project dialog and can be used immediately. After the installation has succeeded, check whether the GPU hardware is compatible, i.e. whether or not it supports programming and running CUDA.
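One way to check this programmatically is a small device query; the sketch below is only an illustration using the standard CUDA runtime API and is not part of the thesis applications.

#include <stdio.h>
#include "cuda_runtime.h"
int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);            /* number of CUDA-capable GPUs */
    if (count == 0) {
        printf("No CUDA-capable GPU found\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i); /* name and compute capability */
        printf("GPU %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}

The CUDA Toolkit also ships a deviceQuery sample that reports the same information.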
Figure 30 Searching the Code Samples for a GPU test.
To find out whether the GPU installed in the PC supports CUDA, open the NVIDIA CUDA Samples Browser, search with the keyword particles, and click Run on the Smoke Particles sample.
Figure 31 Smoke Particles code sample.
Figure 32 CUDA templates integrated into Visual Studio.
After the installation finishes, the installation summary lists the CUDA Nsight features and components that were successfully integrated into Visual Studio and onto the user's PC, and Visual Studio now provides the CUDA Runtime project template.
Figure 33 CUDA path in the environment variables.
Running a CUDA Application
When the CUDA application is executed, make sure the GPU is actually working by using GPU-Z or CUDA-Z; these applications read the GPU processor sensors and show the activity to the user.
Figure 34 Running a CUDA Application
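For reference, and assuming nvcc from the CUDA Toolkit is on the PATH, a CUDA source file is typically compiled from the command prompt before execution, for example:
nvcc file.cu -o file.exe
file.exe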
APPENDIX B
Source Code CPU Computing
Sorting
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h>
void quicksort(float [10],int,int); int main() { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; QueryPerformanceFrequency(&frequency); int size,i; float *x;
float aa = 100.0;
printf("Enter size of the array: "); scanf("%d",&size);
x = (float *)malloc( (size+1)*sizeof(float) ); for(i=0;i<size;i++)
{
x[i]=((float)rand()/(float)(RAND_MAX)) * aa; }
QueryPerformanceCounter(&t1); quicksort(x,0,size-1);
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n%f ms\n",elapsedTime); system("pause");
return 0; }
if(i<j) { temp=x[i]; x[i]=x[j]; x[j]=temp; } } temp=x[pivot]; x[pivot]=x[j]; x[j]=temp; quicksort(x,first,j-1); quicksort(x,j+1,last); } }
Binary Search
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> int main() { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; int c,n;
int first, last, middle; float search;
double *array; float c2=1.25;
printf("number of elements\n"); scanf("%d",&n);
array = (double *)malloc((n+1) * sizeof(double)); //printf("Enter %d integers\n", n);
QueryPerformanceFrequency(&frequency); for ( c = 0 ; c < n ; c++ )
{
array[c]=c2; c2=c2+1.25; }
printf("\nvalue to find\n"); scanf("%f",&search);
first = 0;
last = n - 1;
middle = (first+last)/2; QueryPerformanceCounter(&t1); while( first <= last )
{
if ( array[middle] < search ){ first = middle + 1;}
else if ( array[middle] == search ){
break;} else
{
last = middle - 1; }
middle = (first + last)/2; }
if ( first > last )
{ printf("Not found! %d is not present in the list.\n",
search); }
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart; printf("\n\n\n%f ms\n",elapsedTime); system("pause"); return 0; }
Matrix Multiplication
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> int main()
{ //FLOATING
int i, j, k;
double **mat1, **mat2, **res; long n;
float aa = 5.0;
LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime;
// get the order of the matrix from the user printf("Size of matrix:");
scanf("%d", &n);
QueryPerformanceFrequency(&frequency);
// dyamically allocate memory to store elements mat1 = (double **)malloc(sizeof(double) * n); mat2 = (double **)malloc(sizeof(double) * n); res = (double **) malloc(sizeof(double) * n);
for (i = 0; i < n; i++) {
mat1[i] = (double *)malloc(sizeof(double) * n); mat2[i] = (double *)malloc(sizeof(double) * n); res[i] = (double *)malloc(sizeof(double) * n); }
// get the input matrix printf("\n");
for (i = 0; i < n; i++) {
//mat1[i][j] = rand() % 10 +1; mat1[i][j] =
((float)rand()/(float)(RAND_MAX)) * aa; }
}
printf("matrix 1:\n"); for(int aa=0; aa<n ; aa++) {
for(int bb=0; bb<n ;bb++) { printf("%.2f ",mat1[aa][bb]); } printf("\n"); } printf("\n");
// get the input for second matrix from the user printf("matrix 2:\n");
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
//mat2[i][j] = rand() % 10 +1;
mat2[i][j]=((float)rand()/(float)(RAND_MAX)) * aa; }
}
for(int aa=0; aa<n ; aa++) {
for(int bb=0; bb<n ;bb++) { printf("%.2f ",mat2[aa][bb]); } printf("\n"); } QueryPerformanceCounter(&t1); // multiply first and second matrix for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) { *(*(res + i) + j) = 0;
for (k = 0; k < n; k++) {
*(*(res + i) + j) = *(*(res + i) + j) + (*(*(mat1 + i) + k) * *(*(mat2 + k) + j)); }
} }
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n\n%f ms\n",elapsedTime); // print the result
printf("\nResult :\n"); for (i = 0; i < n; i++) { for (j = 0; j < n; j++) {
printf("%.2f ", *(*(res + i) + j)); }
}
free(mat1); free(mat2); free(res);
system("pause"); return 0;
}
Gauss Jordan Elimination
#include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> #include <math.h> #include <malloc.h>
int main() {
int i, j, n;
double **a, *b, *x; LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime;
void gauss_jordan(int n, double **a, double *b, double *x); printf("\nNumber of equations: ");
scanf("%d", &n); float aa = 10.0;
QueryPerformanceFrequency(&frequency);
x = (double *)malloc( (n+1)*sizeof(double) ); b = (double *)malloc( (n+1)*sizeof(double) ); a = (double **)malloc( (n+1)*sizeof(double *) ); for(i = 1; i <= n; i++)
a[i] = (double *)malloc( (n+1)*sizeof(double) );
for(i = 1; i <= n; i++) {
for(j = 1; j <= n; j++) {
//a[i][j]=rand()%10 + 1;
a[i][j]=((float)rand()/(float)(RAND_MAX)) * aa; }
//b[i]=rand()%10 + 1;
b[i]=((float)rand()/(float)(RAND_MAX)) * aa;
}
for(int bb = 1 ; bb<=n ; bb++) {
printf("%.1f ",a[aa][bb]); }
printf(" %.1f ",b[aa]); printf("\n");
}
printf("\n\n");
QueryPerformanceCounter(&t1); gauss_jordan(n, a, b, x); QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/ frequency.QuadPart;
printf("\n\n\n%f ms\n",elapsedTime);
printf("\nSolution\n");
printf("---\n"); printf("x = (");
for(i = 1; i <= n-1; i++) printf("%lf, ", x[i]); printf("%lf)\n\n", x[n]);
system("pause"); return(0);
}
void gauss_jordan(int n, double **a, double *b, double *x)
{
int i, j, k; int p;
double factor; double big, dummy;
for(k = 1; k <= n; k++) {
// pivoting if(k < n) {
p = k;
big = fabs(a[k][k]); for(i = k+1; i <= n; i++) {
if(big < fabs(a[i][k])) {
big = fabs(a[i][k]); p = i;
} }
if(p != k) {
for(j = 1; j <= n; j++) {
a[p][j] = a[k][j]; a[k][j] = dummy; }
dummy = b[p]; b[p] = b[k]; b[k] = dummy; }
}
// Gauss-Jordan elimination factor = a[k][k];
for(j = 1; j <= n; j++) a[k][j] /= factor;
b[k] /= factor;
for(i = 1; i <= n; i++) {
if(i == k) continue;
factor = a[i][k];
for(j = 1; j <= n; j++) a[i][j] -=
a[k][j]*factor;
b[i] -= b[k]*factor; }
}
for(i = 1; i <= n; i++) x[i] = b[i];
return; }
Source Code GPU Computing
Sorting
#include "cuda_runtime.h"
#include "device_launch_parameters.h" #include <iostream>
#include <windows.h> using namespace std; #include <cuda.h> #include <stdio.h> #include <stdlib.h> #include <conio.h>
#include <cuda_runtime_api.h> //#define NUM 8
__device__ inline void swap(float & a, float & b) {
float tmp = a; a = b; b = tmp; }
__global__ void bitonicSort(float * values, float N) {
extern __shared__ float shared[]; const unsigned int tid = threadIdx.x; shared[tid] = values[tid];
for (unsigned int k = 2; k <= N; k *= 2) {
for (unsigned int j = k / 2; j>0; j /= 2) {
unsigned int ixj = tid ^ j; if (ixj > tid)
{
if ((tid & k) == 0) {
if (shared[tid] > shared[ixj]) { swap(shared[tid], shared[ixj]); } } else {
if (shared[tid] < shared[ixj]) { swap(shared[tid], shared[ixj]); } } } } }
values[tid] = shared[tid]; }
int main(void) {
cudaEvent_t start, stop; float time;
float * dvalues; float * values; double NUM; float aa = 5.0; scanf("%d",&NUM);
values = (float *)malloc( (NUM+1)*sizeof(float) ); size_t size = NUM * sizeof(int);
for(int i = 0; i < NUM; i++) {
//values[i]=rand()%10 + 1;
values[i] = ((float)rand()/(float)(RAND_MAX)) * aa;
}
/*printf("\n nilai awal: ");
cudaMemcpy(dvalues, values, size , cudaMemcpyHostToDevice); cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord(start,0);
bitonicSort<<<1, NUM, size >>>(dvalues,NUM); cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaMemcpy(values, dvalues, size, cudaMemcpyDeviceToHost); cudaFree(dvalues);
/*printf("\n hasil pengurutan: ");
for (int i=0; i<NUM; i++) printf(" %i",values[i]);*/ printf("%f ms\n",time); printf("\n"); system("pause"); } Binary Search #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h> #include <assert.h>
__device__ int get_index_to_check(int thread, int num_threads, int set_size, int offset) {
return (((set_size + num_threads) / num_threads) * thread) + offset;
}
__global__ void p_ary_search(float search, int array_length, int *arr, int *ret_val ) {
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x; int set_size = array_length;
while(set_size != 0){
int offset = ret_val[1];
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
if (index_to_check < array_length){ int next_index_to_check =
get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length){ next_index_to_check = array_length - 1; }
if (search > arr[index_to_check] && (search < arr[next_index_to_check])) {
else if (search == arr[index_to_check]) {
ret_val[0] = index_to_check; }
}
set_size = set_size / num_threads;
} }
float chop_position(float search, float *search_array, int array_length)
{
float time;
cudaEvent_t start, stop;
int array_size = array_length * sizeof(int); if (array_size == 0) return -1;
int *dev_arr;
cudaMalloc((void**)&dev_arr, array_size); cudaMemcpy(dev_arr, search_array, array_size, cudaMemcpyHostToDevice);
int *ret_val = (int*)malloc(sizeof(int) * 2);
ret_val[0] = -1; // return value ret_val[1] = 0; // offset
array_length = array_length % 2 == 0 ? array_length : array_length - 1; // array size
int *dev_ret_val;
cudaMalloc((void**)&dev_ret_val, sizeof(int) * 2); cudaMemcpy(dev_ret_val, ret_val, sizeof(int) * 2, cudaMemcpyHostToDevice);
// Launch kernel
cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start,0);
p_ary_search<<<16, 64>>>(search, array_length, dev_arr, dev_ret_val);
cudaEventRecord(stop,0); cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop); // Get results
cudaMemcpy(ret_val, dev_ret_val, 2 * sizeof(int), cudaMemcpyDeviceToHost);
int ret = ret_val[0];
printf("\nFound %i\n",ret_val[1]);
printf("\nElapsed Time : %f ms",time); // Free memory on device
free(ret_val);
return ret; }
static float * build_array(int length) {
float *ret_val = (float*)malloc(length * sizeof(float));
for (int i = 0; i < length; i++) {
ret_val[i] = (i * 2 + 0.5) - 1; //ret_val[i] = i;
printf("%.2f ",ret_val[i]); } return ret_val; }
static void test_array(int length, float search, float index) {
printf("Length %i Search %.2f\n", length, search);
assert(index == chop_position(search, build_array(length), length) && "test_small_array()");
}
static void test_arrays() {
int length; float search;
scanf("%d",&length); scanf("%f",&search);
Matrix Multiplication
#include <cuda_runtime_api.h>
#define BLOCK_SIZE 100
__global__ void gpuMM(float *A, float *B, float *C, int N) {
int row = blockIdx.y*blockDim.y + threadIdx.y; int col = blockIdx.x*blockDim.x + threadIdx.x;
float sum = 0.f;
for (int n = 0; n < N; ++n)
sum += A[row*N+n]*B[n*N+col];
C[row*N+col] = sum; }
int main(int argc, char *argv[]) { LARGE_INTEGER frequency; LARGE_INTEGER t1,t2; double elapsedTime; int N,K,L; awal: scanf("%d",&L); if(L < 1000) {
printf("Input must be greater than 1000\n"); goto awal;
}
K = L/100;
N = K*BLOCK_SIZE; float time;
cudaEvent_t start, stop; float *hA,*hB,*hC;
hA = new float[N*N]; hB = new float[N*N]; hC = new float[N*N]; float aa=5.0;
for (int j=0; j<N; j++){ for (int i=0; i<N; i++){
hA[j*N+i] = ((float)rand()/(float)(RAND_MAX)) * aa; hB[j*N+i] = ((float)rand()/(float)(RAND_MAX)) * aa;
} }
int size = N*N*sizeof(float); // Size of the memory in
// Copy matrices from the host to device
cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice); cudaMemcpy(dB,hB,size,cudaMemcpyHostToDevice);
//Execute the matrix multiplication kernel cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord(start,0);
gpuMM<<<grid,threadBlock>>>(dA,dB,dC,N);
cudaEventRecord(stop,0); cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
float *C;
C = new float[N*N];
cudaMemcpy(C,dC,size,cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);
printf("%f ms\n",time); system("pause");
}
Gauss Jordan Elimination
main.cpp
#include<stdio.h> #include<conio.h> #include<stdlib.h> #include "Common.h"
int main(int argc , char **argv) {
float *a_h = NULL ; float *b_h = NULL ;
float *result , sum ,rvalue ; int numvar ,j ;
float aa = 5.0; numvar = 0;
scanf("%d",&numvar);
a_h = (float*)malloc(sizeof(float)*numvar*(numvar+1)); b_h = (float*)malloc(sizeof(float)*numvar*(numvar+1)); int ii=0;
for(int i = 1; i <= numvar; i++) {
{
//a_h[ii]=rand()%10 + 1;
a_h[ii]=((float)rand()/(float)(RAND_MAX)) * aa; ii++;
}
}
//Calling device function to copy data to device DeviceFunc(a_h , numvar , b_h);
//Showing the data printf("\n\n");
/*for(int i =0 ; i< numvar ;i++) {
for(int j =0 ; j< numvar+1; j++) {
printf("%.2f ",b_h[i*(numvar+1) + j]); }
printf("\n"); } */
//Using Back substitution method
result = (float*)malloc(sizeof(float)*(numvar)); for(int i = 0; i< numvar;i++)
{
result[i] = 1.0; }
for(int i=numvar-1 ; i>=0 ; i--) {
sum = 0.0 ;
for( j=numvar-1 ; j>i ;j--) {
sum = sum + result[j]*b_h[i*(numvar+1) + j]; }
rvalue = b_h[i*(numvar+1) + numvar] - sum ; result[i] = rvalue / b_h[i *(numvar+1) + j]; }
//Tampil hasil
/*for(int i =0;i<numvar;i++) {
#include <cuda.h> #include "Common.h" #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <conio.h> #include <stdlib.h> #include <iostream> #include <windows.h>
__global__ void Kernel(float *, float * ,int );
void DeviceFunc(float *temp_h , int numvar , float *temp1_h) {
float time;
float *a_d , *b_d; LARGE_INTEGER frequency;
LARGE_INTEGER t1,t2; double elapsedTime; cudaEvent_t start, stop;
//Memory allocation on the device
cudaMalloc(&a_d,sizeof(float)*(numvar)*(numvar+1)); cudaMalloc(&b_d,sizeof(float)*(numvar)*(numvar+1));
//Copying data to device from host cudaMemcpy(a_d, temp_h,
sizeof(float)*numvar*(numvar+1),cudaMemcpyHostToDevice);
//Defining size of Thread Block dim3 dimBlock(numvar+1,numvar,1); dim3 dimGrid(1,1,1); //Kernel call cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start,0);
Kernel<<<dimGrid , dimBlock>>>(a_d , b_d , numvar); cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
//Coping data to host from device
cudaMemcpy(temp1_h,b_d,sizeof(float)*numvar*(numvar+1),cudaMemcpyD eviceToHost);
__global__ void Kernel(float *a_d , float *b_d ,int size) {
int idx = threadIdx.x ; int idy = threadIdx.y ; //int width = size ; //int height = size ;
//Allocating memory in the share memory of the device __shared__ float temp[16][16];
//Copying the data to the shared memory
temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
for(int i =1 ; i<size ;i++) {
if((idy + i) < size) {
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]); temp[i+idy][idx] = temp[i-1][idx] +((var1) *
(temp[i+idy ][idx])); }
}
b_d[idy*(size+1) + idx] = temp[idy][idx]; }
Common.h
#ifndef __Common_H
#define __Common_H
void getvalue(float ** ,int *);
void DeviceFunc(float * , int , float *);
#endif
Source Code Cluster Computing
Sorting
#include <stdio.h> #include <stdlib.h> #include <mpi.h> #define DEBUG #define ROOT 0
#define ISPOWER2(x) (!((x)&((x)-1)))
float *merge(float array1[], float array2[], float size) { float *result = (float *)malloc(2*size*sizeof(float)); int i=0, j=0, k=0;
while ((i < size) && (j < size))
result[k++] = array1[i++]; while (j < size)
result[k++] = array2[j++]; return result;
}
float sorted(float array[], float size) { int i;
for (i=1; i<size; i++) if (array[i-1] > array[i]) return 0;
return 1; }
int compare(const void *p1, const void *p2) { return *(float *)p1 - *(float *)p2;
}
int main(int argc, char** argv) { int i, b=1, npes, myrank;
long datasize;
float localsize, *localdata, *otherdata, *data = NULL; int active = 1;
MPI_Status status;
double start, finish, p, s; MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &npes);
datasize = strtol(argv[1], argv, 10);
if (!ISPOWER2(npes)) {
if (myrank == ROOT) printf("Processor number must be power of two.\n");
return MPI_Finalize(); }
if (datasize%npes != 0) {
if (myrank == ROOT) printf("Datasize must be divisible by processor number.\n");
return MPI_Finalize(); }
if (myrank == ROOT) {
data = (float *)malloc(datasize * sizeof(float)); for (i = 0; i < datasize; i++)
data[i] = rand()%99 + 1; }
start = MPI_Wtime();
localsize = datasize / npes;
localdata = (float *) malloc(localsize * sizeof(float)); MPI_Scatter(data, localsize, MPI_INT, localdata, localsize, MPI_INT,
ROOT, MPI_COMM_WORLD);
while (b < npes) { if (active) {
if ((myrank/b)%2 == 1) {
MPI_Send(localdata, b * localsize, MPI_INT, myrank - b, 1, MPI_COMM_WORLD);
free(localdata); active = 0; } else {
otherdata = (float *) malloc(b * localsize * sizeof(float)); MPI_Recv(otherdata, b * localsize, MPI_INT, myrank + b, 1, MPI_COMM_WORLD, &status);
localdata = merge(localdata, otherdata, b * localsize); free(otherdata);
} }
b <<= 1; }
finish = MPI_Wtime();
if (myrank == ROOT) { #ifdef DEBUG
if (sorted(localdata, npes*localsize)) { printf("\nParallel sorting succeed.\n\n"); } else {
printf("\nParallel sorting failed.\n\n"); }
#endif
free(localdata); p = finish - start;
printf(" Parallel : %.8f\n", p);
/*start = MPI_Wtime();
qsort(data, datasize, sizeof(float), compare); finish = MPI_Wtime();*/
free(data); } return MPI_Finalize(); }
Binary Search
#include "mpi.h" #include <iostream> #include <math.h>
using namespace std;
int main(int argc,char **argv) {
const int Tag_Max=3; int max;
double MaxInAll; int MyId, P;
double* A;
int ArrSize, Target; int n, Start;
int i, x;
int Source, dest, Tag; int WorkersDone = 0 ;
double start, finish, p; MPI_Status RecvStatus;
MPI_Init(&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &MyId); MPI_Comm_size (MPI_COMM_WORLD, &P);
start = MPI_Wtime(); //start working.. if (MyId == Master) {
cout<<"This is the master process on "<<P<<" Processes\n"; MaxInAll=0;
int GlobIndx;
cout<<"Enter the number of elements you want to generate..";
cin>> ArrSize;
A = new double[ArrSize];
srand ( P ); /* initialize random seed: */ for ( i= 0; i<ArrSize; i++)
{
A[i] = i+1.25;
}
n = ArrSize/(P-1);
for( i = 1; i < P; i++) {
dest = i; if (i == P-1)
n = ArrSize - (n*(P-2)); Tag = Tag_Size;
MPI_Send(&n, 1, MPI_DOUBLE, dest, Tag, MPI_COMM_WORLD);
Start = (i - 1) * ( ArrSize/(P-1) ); MPI_Send(A+Start, n, MPI_DOUBLE, dest, Tag, MPI_COMM_WORLD);
}
WorkersDone = 0; int MaxIndex = 0;
while (WorkersDone < P-1 ) {
MPI_Recv(&x, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &RecvStatus);
Source = RecvStatus.MPI_SOURCE; Tag = RecvStatus.MPI_TAG; if (Tag == Tag_Max)
{
GlobIndx = (Source - 1)*(ArrSize/(P-1) ) + x; if ( A[GlobIndx] > MaxInAll)
{
MaxInAll = A[GlobIndx]; MaxIndex = GlobIndx; } WorkersDone++; } } if(WorkersDone==P-1)
cout << "Process "<<Source<<" found the max of the array "<< MaxInAll<<" at index "<<MaxIndex;
delete [] A; }
else {
max=0;
cout<<"Process "<<MyId<<" is alive...\n"; Source = Master;
Tag = Tag_Size;
MPI_Recv(&n, 1, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
A = new double[n]; Tag = Tag_Data;
MPI_Recv(A, n, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
cout<<"Process "<<MyId<< "Received "<<n<<" data elements\n";
int max_i; i = 0;
while (i<n ) {
}
dest = Master; Tag = Tag_Max;
cout<<"Process "<<MyId<< " has max equals "<<max<<endl; MPI_Send(&max_i, 1, MPI_DOUBLE, dest, Tag,
MPI_COMM_WORLD); delete [] A; }
finish = MPI_Wtime(); if (MyId == 0)
{
p = finish - start;
printf(" Parallel : %.8f\n", p); }
MPI_Finalize(); return 0;
}
Matrix Multiplication
#include <stdio.h> #include "mpi.h"
#define N 5000 /* number of rows and columns in matrix */
MPI_Status status;
double a[N][N],b[N][N],c[N][N]; int main(int argc, char **argv) {
double start, finish, p; int
numtasks,taskid,numworkers,source,dest,rows,offset,i,j,k,remainPar t,originalRows;
//struct timeval start, stop; MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); numworkers = numtasks-1;
start = MPI_Wtime(); if (taskid == 0) { for (i=0; i<N; i++) { for (j=0; j<N; j++) { a[i][j]= 1.25;
}
//gettimeofday(&start, 0);
/* send matrix data to the worker tasks */ rows = N/numworkers;
offset = 0;
remainPart = N%numworkers;
for (dest=1; dest<=numworkers; dest++) {
if (remainPart > 0) {
originalRows = rows; ++rows;
remainPart--;
MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1, MPI_COMM_WORLD);
MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD); offset = offset + rows;
rows = originalRows; }
else {
MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD); MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1, MPI_COMM_WORLD);
MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD); offset = offset + rows;
} }
/* wait for results from all worker tasks */ for (i=1; i<=numworkers; i++)
{ source = i;
MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&c[offset][0], rows*N, MPI_DOUBLE, source, 2, MPI_COMM_WORLD, &status); } }
if (taskid > 0) { source = 0;
MPI_Recv(&offset, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD, &status);
/* Matrix multiplication */ for (k=0; k<N; k++)
for (i=0; i<rows; i++) { c[i][k] = 0.0;
for (j=0; j<N; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k]; }
MPI_Send(&offset, 1, MPI_INT, 0, 2, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, 0, 2, MPI_COMM_WORLD); MPI_Send(&c, rows*N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); }
finish = MPI_Wtime(); if (taskid == 0) {
p = finish - start;
printf(" Parallel : %.8f\n", p); }
MPI_Finalize();
}
Gauss Jordan Elimination
#include <stdlib.h> #include <stdio.h> #include <iostream> #include "mpi.h"
double serial_gaussian( double *A, double *b, double *y, int n ) {
int i, j, k;
double tstart = MPI_Wtime();
for( k=0; k<n; k++ ) { for( j=k+1; j<n; j++ ) { if( A[k*n+k] != 0)
A[k*n+j] = A[k*n+j] / A[k*n+k]; else
A[k*n+j] = 0; }
if( A[k*n+k] != 0 )
y[k] = b[k] / A[k*n+k]; else
A[k*n+k] = 1.0;
for( i=k+1; i<n; i++ ) { for( j=k+1; j<n; j++ )
A[i*n+j] -= A[i*n+k] * A[k*n+j];
b[i] -= A[i*n+k] * y[k]; A[i*n+k] = 0.0;
} }
return tstart; }
void print_equations( double *A, double *y, int n ) {
int i, j;
for( i=0; i<n; i++ ) { for( j=0; j<n; j++ ) { if( A[i*n+j] != 0 ) {
std::cout << A[i*n+j] << "x" << j; if( j<n-1 ) std::cout << " + "; }
else
std::cout << " "; }
std::cout << " = " << y[i] << std::endl; }
}
int main( int argc, char *argv[] ) {
double *A, *b, *y, *a, *tmp, *final_y; // var decls
int i, j, n, row, r;
double tstart, tfinish, TotalTime; // timing decls
float aa = 5.0; if( argc < 2 ) {
std::cout << "Usage\n";
std::cout << " Arg1 = number of equations / unkowns\n"; return -1;
}
n = atoi(argv[1]);
A = new double[n*n]; // space for matricies
b = new double[n]; y = new double[n];
for( i=0; i<n; i++ ) { // creates a matrix of random
b[i] = 0.0;
for( j=0; j<n; j++ ) {
r = ((float)rand()/(float)(RAND_MAX)) * aa; A[i*n+j] = r;
} }
MPI_Init (&argc,&argv); // Initialize MPI
MPI_Comm com = MPI_COMM_WORLD;
int size,rank; // Get rank/size info MPI_Comm_size(com,&size);
MPI_Comm_rank(com,&rank);
int manager = (rank == 0); if (size == 1)
tstart = serial_gaussian ( A, b, y, n); else
{
if ( ( n % size ) != 0 ) {
std::cout << "Unknowns must be multiple of processors." << std::endl;
return -1; }
int np = (int) n/size; a = new double[n*np]; tmp = new double[n*np];
if ( manager ) {
tstart = MPI_Wtime(); final_y = new double[n];
}
MPI_Scatter(A,n*np,MPI_INT,a,n*np,MPI_INT,0,com);
for ( i=0; i < (rank*np); i++ ) {
MPI_Bcast(tmp,n,MPI_INT,i/np,com); MPI_Bcast(&(y[i]),1,MPI_INT,i/np,com);
for (row=0; row<np; row++) {
for ( j=i+1; j<n; j++ )
a[row*n+j] = a[row*n+j] - a[row*n+i]*tmp[j]; b[rank*np+row] = b[rank*np+row] - a[row*n+i]*y[i]; a[row*n+i] = 0;
} }
for (row=0; row<np; row++) {
{
a[row*n+j] = a[row*n+j] / a[row*n+np*rank+row]; }
y[rank*np+row] = b[rank*np+row] / a[row*n+rank*np+row]; a[row*n+rank*np+row] = 1;
for ( i=0; i<n ; i++ ) tmp[i] = a[row*n+i];
MPI_Bcast (tmp,n,MPI_INT,rank,com);
MPI_Bcast (&(y[rank*np+row]),1,MPI_INT,rank,com);
for ( i=row+1; i<np; i++) {
for ( j=rank*np+row+1; j<n; j++ )
a[i*n+j] = a[i*n+j] - a[i*n+row+rank*np]*tmp[j]; b[rank*np+i] = b[rank*np+i] -
a[i*n+row+rank*np]*y[rank*np+row]; a[i*n+row+rank*np] = 0;
} }
for (i=(rank+1)*np ; i<n ; i++) {
MPI_Bcast (tmp,n,MPI_INT,i/np,com); MPI_Bcast (&(y[i]),1,MPI_INT,i/np,com); }
MPI_Barrier(com);
MPI_Gather(a,n*np,MPI_INT,A,n*np,MPI_INT,0,com);
MPI_Gather(&(y[rank*np]),np,MPI_INT,final_y,np,MPI_INT,0,com);
y = final_y;
}
if (manager || (size==1) ) {
tfinish = MPI_Wtime();
TotalTime = tfinish - tstart; printf("%f",TotalTime);
std::cout << std::endl;
}