High Performance Computing using a
Parallella Board Cluster
By: Michael Kruger
Supervisor: Dr. Karen Bradshaw
Recap
Aim: To build a cluster of Parallella boards (x4)
Why?
● Parallella boards provide decent performance
● Use very little power to run
● Are relatively cheap
Recap: the Parallella
● 16-core Epiphany coprocessor (up to 1 GHz)
● <5 W typical power consumption
● R1250 per board
● 32 GFLOPS
What was Proposed
● Build a high performance computer using multiple Parallella boards connected over a network
● Install an operating system and software to set up the cluster
● Compare performance of the Parallella cluster with other similarly priced systems
● Discover the limitations of the Parallella cluster
The Building
The Building Cont.
Seizing Power without resorting to Butchery
Software
● Internet access and IP addresses (DHCP)
● SSH (remote login between boards)
● NFS (shared file system)
● OpenMPI (message passing across the cluster)
Creating Programs
Creating Programs Cont.
Creating Programs Cont.
First Benchmark
Multiply floats 100 000 000 times:
1 ARM core (667 MHz):
Time taken: 16.258456 seconds
1 Parallella board (600 MHz):
Iterations per core: 6 250 000
Time taken: 1.239766 seconds
First Benchmark Cont.
4 Parallella boards (600 MHz):
Iterations per core: 1 562 500
Time taken: 0.325880 seconds
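A minimal sketch of the kind of timing loop behind these numbers, here for a single ARM core (the loop body and variable names are assumptions; the slides only report the totals):

#include <stdio.h>
#include <sys/time.h>

#define ITERATIONS 100000000UL

int main(void)
{
    volatile float x = 1.000001f;   /* volatile keeps the multiply from being optimised away */
    struct timeval start, end;
    unsigned long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
        x = x * 1.000001f;          /* multiply floats 100 000 000 times */
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("Time taken %f seconds\n", elapsed);
    return 0;
}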
Problems & Limitations so far
● No hardware division: long division reigns
● rand() is slow (a lighter PRNG is sketched after this list)
● No nfs-kernel-server
● DHCP server was needed
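On the rand() point, a common workaround (not from the project code) is a tiny shift/XOR generator kept in core-local memory, which avoids the C library call entirely:

#include <stdint.h>

/* Minimal xorshift32 PRNG (Marsaglia): three shifts and XORs per call,
   no division, no modulo, no library state. Seed must be non-zero. */
static uint32_t rng_state = 2463534242u;

static inline uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}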
Division
1 ARM core:
Number of iterations: 1 000 000
Time taken: 0.162627 seconds
1 Parallella board:
Number of iterations: 1 000 000
Iterations per core: 62 500
Time taken: 10.762318 seconds
Division Cont.
Cluster (4 Parallella boards):
Number of iterations: 1 000 000
Iterations per core: 15 626
Time taken: 2.722688 seconds
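The Epiphany has no hardware divider, so every / in device code expands into a software long-division routine, which is why the Epiphany loses to the ARM here. When the divisor is fixed across a loop, a usual workaround (a sketch, not the project's code) is to divide once and multiply by the reciprocal:

/* Slow on the Epiphany: each iteration calls the software divide routine */
void divide_loop(const float *in, float *out, int n, float divisor)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] / divisor;
}

/* Faster: one division up front, then hardware FPU multiplies only */
void reciprocal_loop(const float *in, float *out, int n, float divisor)
{
    float recip = 1.0f / divisor;
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * recip;
}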
Epiphany III
Features
● 16 High Performance RISC CPU Cores
● 1 GHz Operating Frequency
● 32 GFLOPS Peak Performance
● 512 GB/s Local Memory Bandwidth
● 64 GB/s Network-On-Chip Bisection Bandwidth
● 8 GB/s Off-Chip Bandwidth
● 0.5 MB On-Chip Distributed Shared Memory
● 2 Watt Maximum Chip Power Consumption
● IEEE Floating Point Instruction Set
● Fully-featured ANSI-C/C++ programmable
Dodgy Flopage
1 ARM core:
Measured: ~6.1 Mflop/s
1 Parallella board (theoretical 32 GFLOPS):
Measured: ~80.6 Mflop/s
4 Parallella boards (theoretical 128 GFLOPS):
Measured: ~306.8 Mflop/s
Compiling
# EINCS, ELIBS and ELDF come from the Epiphany SDK setup
# (host include path, host libraries, and device linker script)

# Build HOST side application
mpicc FloaterP.c -o Debug/FloaterP.elf ${EINCS} ${ELIBS} -le-hal -le-loader -lpthread

# Build DEVICE side program
e-gcc -T ${ELDF} eFloater.c -o Debug/eFloater.elf -le-lib

# Convert the Epiphany binary to an SREC file
e-objcopy --srec-forceS3 --output-target srec Debug/eFloater.elf Debug/eFloater.srec
Epiphany Usage
e_platform_t platform;
e_epiphany_t dev;
e_init(NULL);                                       // initialise the Epiphany HAL
e_reset_system();                                   // reset the Epiphany chip
e_get_platform_info(&platform);                     // query the chip's rows/cols
e_open(&dev, 0, 0, platform.rows, platform.cols);   // open a workgroup spanning all cores

e_load_group("eFloater.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);
// or, for a single core:
e_load("e_hello_world.srec", &dev, 0, 0, E_TRUE);
MPI Usage
mpiexec -hostfile [] -x ESDK -x EHD ./[]
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
MPI_Finalize();
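Put together, a minimal self-contained host program using these calls looks roughly like this (the printf and the placement of per-board work are illustrative, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                               // start the MPI runtime
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);             // ranks in the whole job
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);                 // this process's rank
    MPI_Get_processor_name(processor_name, &namelen);     // which board we are on

    printf("Rank %d of %d on %s\n", rank, numprocs, processor_name);

    /* per-board work, e.g. offloading to the local Epiphany, goes here */

    MPI_Finalize();                                       // shut down MPI
    return 0;
}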