High Performance Computing using a
Parallella Board Cluster
By: Michael Kruger
Supervisor: Dr. Karen Bradshaw
Recap
Aim: To build a cluster of Parallella boards (x4)
Why?
● Parallella boards provide decent performance
● Use very little power to run
● Are relatively cheap
Recap: the Parallella
● 16-core Epiphany coprocessor (up to 1 GHz)
● <5 W typical power consumption
● R1250 per board
● 32 GFLOPS
What was Proposed
● Build a high performance computer using multiple Parallella boards connected over a network
● Install an operating system and software to set up the cluster
● Compare performance of the Parallella cluster with other similarly priced systems
● Discover the limitations of the Parallella cluster
The Building
The Building Cont.
Seizing Power without resorting to Butchery
Software
● Internet access and IP addresses (DHCP)
● SSH (remote login between boards)
● NFS (shared file system)
● OpenMPI (message passing across the cluster)
Creating Programs
Creating Programs Cont.
Creating Programs Cont.
First Benchmark
Multiply floats 100 000 000 times:
1 ARM core (667 MHz):
Time taken: 16.258456 seconds
1 Parallella board (600 MHz):
Iterations per core: 6 250 000
Time taken: 1.239766 seconds
First Benchmark Cont.
4 Parallella boards (600 MHz):
Iterations per core: 1 562 500
Time taken: 0.325880 seconds
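A minimal sketch of the kind of timing loop behind these numbers, here for a single ARM core (the loop body and variable names are assumptions; the slides only report the totals):

#include <stdio.h>
#include <sys/time.h>

#define ITERATIONS 100000000UL

int main(void)
{
    volatile float x = 1.000001f;   /* volatile keeps the multiply from being optimised away */
    struct timeval start, end;
    unsigned long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
        x = x * 1.000001f;          /* multiply floats 100 000 000 times */
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("Time taken %f seconds\n", elapsed);
    return 0;
}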
Problems & Limitations so far
● No hardware division: long division reigns
● rand() is slow (a lighter PRNG is sketched after this list)
● No nfs-kernel-server
● DHCP server was needed
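On the rand() point, a common workaround (not from the project code) is a tiny shift/XOR generator kept in core-local memory, which avoids the C library call entirely:

#include <stdint.h>

/* Minimal xorshift32 PRNG (Marsaglia): three shifts and XORs per call,
   no division, no modulo, no library state. Seed must be non-zero. */
static uint32_t rng_state = 2463534242u;

static inline uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}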
Division
1 ARM core:
Number of iterations: 1 000 000
Time taken: 0.162627 seconds
1 Parallella board:
Number of iterations: 1 000 000
Iterations per core: 62 500
Time taken: 10.762318 seconds
Division Cont.
Cluster (4 Parallella boards):
Number of iterations: 1 000 000
Iterations per core: 15 626
Time taken: 2.722688 seconds
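The Epiphany has no hardware divider, so every / in device code expands into a software long-division routine, which is why the Epiphany loses to the ARM here. When the divisor is fixed across a loop, a usual workaround (a sketch, not the project's code) is to divide once and multiply by the reciprocal:

/* Slow on the Epiphany: each iteration calls the software divide routine */
void divide_loop(const float *in, float *out, int n, float divisor)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] / divisor;
}

/* Faster: one division up front, then hardware FPU multiplies only */
void reciprocal_loop(const float *in, float *out, int n, float divisor)
{
    float recip = 1.0f / divisor;
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * recip;
}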
Epiphany III
Features
● 16 High Performance RISC CPU Cores
● 1 GHz Operating Frequency
● 32 GFLOPS Peak Performance
● 512 GB/s Local Memory Bandwidth
● 64 GB/s Network-On-Chip Bisection Bandwidth
● 8 GB/s Off-Chip Bandwidth
● 0.5 MB On-Chip Distributed Shared Memory
● 2 Watt Maximum Chip Power Consumption
● IEEE Floating Point Instruction Set
● Fully-featured ANSI-C/C++ programmable
Dodgy Flopage
1 ARM core:
Measured: ~6.1 Mflop/s
1 Parallella board (theoretical 32 GFLOPS):
Measured: ~80.6 Mflop/s
4 Parallella boards (theoretical 128 GFLOPS):
Measured: ~306.8 Mflop/s
Compiling
# EINCS, ELIBS and ELDF come from the Epiphany SDK setup
# (host include path, host libraries, and device linker script)

# Build HOST side application
mpicc FloaterP.c -o Debug/FloaterP.elf ${EINCS} ${ELIBS} -le-hal -le-loader -lpthread

# Build DEVICE side program
e-gcc -T ${ELDF} eFloater.c -o Debug/eFloater.elf -le-lib

# Convert the Epiphany binary to an SREC file
e-objcopy --srec-forceS3 --output-target srec Debug/eFloater.elf Debug/eFloater.srec
Epiphany Usage
e_platform_t platform;
e_epiphany_t dev;
e_init(NULL);                                       // initialise the Epiphany HAL
e_reset_system();                                   // reset the Epiphany chip
e_get_platform_info(&platform);                     // query the chip's rows/cols
e_open(&dev, 0, 0, platform.rows, platform.cols);   // open a workgroup spanning all cores

e_load_group("eFloater.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);
// or, for a single core:
e_load("e_hello_world.srec", &dev, 0, 0, E_TRUE);
MPI Usage
mpiexec -hostfile [] -x ESDK -x EHD ./[]
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
MPI_Finalize();
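Put together, a minimal self-contained host program using these calls looks roughly like this (the printf and the placement of per-board work are illustrative, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                               // start the MPI runtime
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);             // ranks in the whole job
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);                 // this process's rank
    MPI_Get_processor_name(processor_name, &namelen);     // which board we are on

    printf("Rank %d of %d on %s\n", rank, numprocs, processor_name);

    /* per-board work, e.g. offloading to the local Epiphany, goes here */

    MPI_Finalize();                                       // shut down MPI
    return 0;
}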