Parallel Program Scheduling in AMPs

(1)

Jyothi Krishna V S

Guides: Shankar Balachandran and Rupesh Nasre

IIT Madras, April 20, 2018

(2)

CPU: Power Wall

[Figure: a CPU with a single processor, and a CPU with four processors]

• Moore’s Law hit the Power Wall

• It is not practical to keep increasing the frequency

• Move to a multicore environment: multiple cores working at a lower frequency

• Move to parallel programming to keep reaping the benefits


(5)

High Static Cost of CPU

[Figure: static and dynamic energy consumption of two processors running CPU-friendly and memory-friendly tasks, before and after DVFS]

• CPU-friendly tasks: mostly ALU operations (active CPU cycles)

• Memory-friendly tasks: mostly memory/branch operations

• DVFS (dynamic voltage and frequency scaling) reduced the negative effect by lowering dynamic energy

• The static cost of keeping modules on remained high
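The split between static and dynamic energy can be illustrated with a small model. This is a minimal sketch that assumes the common CMOS approximation that dynamic power scales as C·V²·f, which the slide itself does not state; the function names and parameters are hypothetical.

```c
#include <assert.h>

/* Dynamic power under the common CMOS approximation P = C * V^2 * f.
 * Assumption: this formula is not taken from the slides. */
double dynamic_power(double capacitance, double voltage, double freq_hz) {
    assert(capacitance >= 0.0 && freq_hz >= 0.0);
    return capacitance * voltage * voltage * freq_hz;
}

/* Total energy over an interval: static power is paid for the whole
 * interval regardless of the voltage/frequency setting. */
double total_energy(double static_power, double dyn_power, double seconds) {
    assert(seconds >= 0.0);
    return (static_power + dyn_power) * seconds;
}
```

In this model, halving both voltage and frequency cuts dynamic power by a factor of eight, while the static term is unchanged, which is why DVFS alone leaves a high static cost.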


(7)

Asymmetric Multicore Processors (AMPs)

[Figure: static and dynamic energy of tasks on asymmetric processors]

• CPU-friendly tasks go to big, powerful cores

• Memory-friendly tasks go to small, power-efficient cores

e.g. big.LITTLE from ARM, Tegra from NVIDIA

(8)

AMP: Challenges with a multithreaded program

• What is the optimal hardware configuration to run a parallel program? (CHOAMP, SIAM*)

  • Which type of core should each thread run on?

  • Does an AMP environment help? (Use one type of core at a time)

  • How many resources of each type?

  • What is the optimal frequency configuration for each core type?

• Given an AMP environment, what is the optimal scheduling for a given parallel program? (CES, SIAM*)

• How to maintain fairness among parallel threads?

• How to efficiently manage the memory system (prefetching, variable cache sizes, etc.)?

* To be submitted


(11)

Compiler Enhanced Scheduling (CES): Recap

• Uniform parallel workload

  • Online Performance Evaluation Model (PEM)

[Figure: a parallel for with a shared worklist of 40 iterations, a big core and a LITTLE core, the PEM, iteration stealing, and re-entry points A, B, C]

• Non-uniform parallel workload

  • Offline affinity and normalization scheduling

[Figure: flowchart for OpenMP for loops and sections, with the PEM, an affinity allocation stage, and a normalization stage that selects a victim section by comparing wlim with wl⁰im]
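One way to picture how a performance model could divide a shared worklist between core types is a throughput-proportional split. The sketch below is only an illustration under that assumption; it is not the PEM used by CES, and `big_core_share` and its parameters are hypothetical.

```c
#include <assert.h>

/* Number of iterations handed to the big core when the worklist is divided
 * in proportion to measured throughput (iterations per second).
 * Hypothetical helper: the actual PEM in CES may work differently.
 * The LITTLE core receives the remainder, iterations - share. */
int big_core_share(int iterations, double big_rate, double little_rate) {
    assert(iterations >= 0 && big_rate > 0.0 && little_rate > 0.0);
    double fraction = big_rate / (big_rate + little_rate);
    /* Round to the nearest whole iteration. */
    return (int)(fraction * iterations + 0.5);
}
```

With equal rates, a 40-iteration worklist is split 20/20; if the big core is measured to be three times faster, it would receive 30 of the 40 iterations.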


(13)

CHOAMP: Cost Based Hardware Optimization for Asymmetric Multicore Processors


(14)

Optimal Hardware

Given an input program, a user-selected cost function, and the possible hardware configurations, can we find an optimal hardware configuration that runs the program with minimal cost?

• Static/compile time: Chen and John²

• Dynamic: execution-time decisions: HMP Scheduling³

• Hybrid: compile-time instrumentation and runtime decisions: PIE⁴, CES

² J. Chen and L. K. John. Efficient program scheduling for heterogeneous multi-core processors. In DAC ’09.

³ http://www.arm.com

⁴ K. Van Craeynest et al. Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE). ISCA ’12.
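The problem statement can be read as a minimization over configurations under a pluggable cost function. The following is a minimal sketch of that reading, using the three cost functions the later result slides report (execution time, energy, EDP); the `Measurement` struct and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Measured (or predicted) behaviour of one hardware configuration.
 * Hypothetical struct; the field names are not from the slides. */
typedef struct {
    double seconds; /* execution time */
    double joules;  /* energy consumed */
} Measurement;

typedef double (*CostFn)(const Measurement *);

double cost_time(const Measurement *m)   { return m->seconds; }
double cost_energy(const Measurement *m) { return m->joules; }
/* Energy-delay product: energy multiplied by execution time. */
double cost_edp(const Measurement *m)    { return m->joules * m->seconds; }

/* Index of the configuration with minimal cost under `cost`. */
size_t best_config(const Measurement *configs, size_t n, CostFn cost) {
    assert(configs != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (cost(&configs[i]) < cost(&configs[best])) best = i;
    }
    return best;
}
```

The same set of configurations can yield a different winner for each cost function, which is why the cost function is a user input rather than a fixed choice.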

(15)

Static v/s Dynamic Scheduling

• Static

  • Unknown variables

  • Unknown thread-to-core mapping

  • Hardware resource reservation

  • Scalable over large hardware configurations

• Dynamic

  • Runtime overhead

  • Scalability of the engine

  • Accurate knowledge of workload


(17)

Static Scheduling

• Unknown loop bounds: estimate unknown variables based on known variables.

• Unknown wasted cycles: estimate the expected delay by learning from a large set of examples.

[Figure: a cost function over ALU, memory, and control operations selects the optimal hardware configuration among HC1, HC2, HC3, and HC4]

A Predictor is trained with a set of program features (selected based on architectural differences in AMP cores and the choice of parallel program) and hardware configurations to identify the optimal hardware configuration.
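A linear-regression style Predictor can be sketched as one weight vector per hardware configuration, with the configuration of lowest predicted cost selected. This is an illustration only: the three-feature layout, the per-configuration weight vectors, and the function names are assumptions, not the trained model from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FEATURES 3 /* assumed: ALU, memory, control operation counts */

/* Predicted cost as a weighted sum of program features plus a bias term
 * stored in the last slot of `weights`. */
double predict_cost(const double weights[NUM_FEATURES + 1],
                    const double features[NUM_FEATURES]) {
    double cost = weights[NUM_FEATURES]; /* bias */
    for (size_t k = 0; k < NUM_FEATURES; k++) cost += weights[k] * features[k];
    return cost;
}

/* Index of the hardware configuration with the lowest predicted cost.
 * Assumption: one weight vector per hardware configuration. */
size_t predict_best_config(const double models[][NUM_FEATURES + 1],
                           size_t n_configs,
                           const double features[NUM_FEATURES]) {
    assert(n_configs > 0);
    size_t best = 0;
    double best_cost = predict_cost(models[0], features);
    for (size_t c = 1; c < n_configs; c++) {
        double cost = predict_cost(models[c], features);
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    return best;
}
```

Under this sketch, a program dominated by ALU operations and one dominated by memory operations can map to different configurations even though the same models are used.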


(19)

CHOAMP: Training Phase

[Figure: training phase. Microbenchmarks, hardware configurations, features, and cost functions produce feature vectors that are stored in the knowledge base.]

(20)

CHOAMP: Compilation Phase

[Figure: compilation phase. Feature vectors from the input program and the user input go to the Predictor, which uses the knowledge base to choose the O_bl config; load balancing then yields the transformed program.]

(21)

CHOAMP: The Big Picture

[Figure: the training and compilation phases combined. Training on microbenchmarks fills the knowledge base with feature vectors for each hardware configuration and cost function; at compilation, the Predictor uses it with the user input to choose the O_bl config.]

(22)

Test Environment : big.LITTLE

[Figure: big.LITTLE system design. Cortex-A15/A57 and Cortex-A7/A53 clusters, each with an L2 cache, joined by a cache coherent interconnect to the GPU, memory, and peripheral interconnect.]

Core Types        Cortex-A7                  Cortex-A15
Pipeline          Simple 8-stage in-order    Out-of-order, multi-issue
Frequency         600 - 1400 MHz             800 - 2000 MHz
Speed             1.9 DMIPS                  4.01 DMIPS
Instruction Set   Thumb-2

A15        Single Core       Four Cores
Freq       Idle    Peak      Idle    Peak
2.0 GHz    0.95    2.40      1.15    5.28
1.8 GHz    0.70    1.65      0.70    3.50
1.4 GHz    0.38    0.85      0.40    2.30
1.2 GHz    0.30    0.70      0.32    1.80

A7         Single Core       Four Cores
Freq       Idle    Peak      Idle    Peak
1.4 GHz    0.20    0.50      0.22    1.40

Parallel Program: OpenMP


(25)

Feature Selection

• Low level: based on hardware dissimilarities of big and LITTLE

  • ALU operations

  • Memory operations

  • Branch operations

  • False sharing

• High level: synchronization constructs provided by OpenMP

  • Barriers

  • Atomics

  • Critical sections

  • Flush operations

  • Reduction operations
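To make the high-level features concrete, the sketch below counts OpenMP synchronization constructs in a source string by plain substring matching. It is a deliberately naive illustration: the talk uses a compiler framework for analysis, and a real extractor would parse the program and account for implicit barriers and execution counts. The `SyncFeatures` struct and both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Count non-overlapping occurrences of `needle` in `text`. */
int count_occurrences(const char *text, const char *needle) {
    assert(text != NULL && needle != NULL && needle[0] != '\0');
    int count = 0;
    size_t step = strlen(needle);
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + step, needle)) {
        count++;
    }
    return count;
}

/* Static counts of the five high-level features listed on the slide.
 * Hypothetical struct; textual matching only, no parsing. */
typedef struct {
    int barriers, atomics, criticals, flushes, reductions;
} SyncFeatures;

SyncFeatures extract_sync_features(const char *source) {
    SyncFeatures f;
    f.barriers   = count_occurrences(source, "#pragma omp barrier");
    f.atomics    = count_occurrences(source, "#pragma omp atomic");
    f.criticals  = count_occurrences(source, "#pragma omp critical");
    f.flushes    = count_occurrences(source, "#pragma omp flush");
    f.reductions = count_occurrences(source, "reduction(");
    return f;
}
```

The resulting counts would form the high-level part of a program's feature vector in this simplified picture.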


(27)

Implementation

• Hardware configurations: Odroid XU3⁵

• Core configurations: 4L4b, 0L4b, 4L2b, 2L2b, 4L0b

• big frequencies: 2 GHz, 1.8 GHz, 1.6 GHz, 1.4 GHz, 1.2 GHz

• Micro-benchmark generation: Python 2.7

• Dynamic scheduling: HMP and CES⁶ for dynamic load balancing

• Analysis and transformation: IMOP⁷

• Regression tool: Java Scientific Library⁸

• Regression engines: linear, quadratic, and Gaussian

• Benchmark: NPB⁹

⁵ http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127

⁶ J. K. Viswakaran Sreelatha and S. Balachandran. 2016. Compiler Enhanced Scheduling for OpenMP for Heterogeneous Multiprocessors. EEHCO ’16.

⁷ Nougrahiya and Nandivada, IMOP: http://www.cse.iitm.ac.in/~amannoug/imop/

⁸ Flanagan M, http://www.ee.ucl.ac.uk/~mflanaga/java/index.html

⁹ Bailey et al. The NAS parallel benchmarks: summary and preliminary results.

(28)

Walk Through Example

#define M 50000

int f(int *s, int A[], int cumSum[], int L) {
    int MAX = 0, localSum = 0, temp = L / 128;
    int N = M - temp; /* Unknown Variable */
    #pragma omp parallel
    {
        int i, j;
        #pragma omp for reduction(+:localSum)
        for (i = 0; i < N; i++) {
            localSum += A[i];
            cumSum[i] = 0;
            for (j = 0; j < N; j++) {
                if (j <= i) cumSum[i] += A[j];
            }
            #pragma omp critical
            { if (MAX < A[i]) MAX = A[i]; }
        }
    }
    return MAX;
}


(31)

Walk Through Example

Op        Result range
i + j     range(MAX(R(i)) + MAX(R(j)))
i * j     range(MAX(R(i)) * MAX(R(j)))
i - j     range(MAX(R(i)) - MIN(R(j)))
i / j     range(MAX(R(i)) / MIN(R(j)))
i >> j    i - MAX(R(j))
i % j     R(j)
i = j     MAX(R(i), R(j))

MAX(R(i)): known maximum value in the range of i

R(k): range to which the value k belongs

• L is placed into the largest non-empty bucket.

• L ∈ R₄₋₅

• temp = 50000 / 128 = R₂₋₃

• N = 50000 - R₂₋₃ = R₄₋₅ ≈ 50000

[Figure: dependence graph. L and 128 feed temp; M and temp feed N.]
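The bucket arithmetic in this example can be reproduced with a small helper. The sketch assumes that a bucket R with subscript a-(a+1) covers the values from 10^a up to, but not including, 10^(a+1); the slides do not define the buckets explicitly, so this decade interpretation and the name `bucket_of` are assumptions.

```c
#include <assert.h>

/* Lower exponent a of the decade bucket [10^a, 10^(a+1)) containing v.
 * Assumption: buckets are powers of ten, so that R(2-3) holds values
 * from 100 to 999 and R(4-5) holds values from 10000 to 99999. */
int bucket_of(long v) {
    assert(v > 0);
    int exponent = 0;
    while (v >= 10) {
        v /= 10;
        exponent++;
    }
    return exponent;
}
```

Under that assumption, 50000 / 128 = 390 lands in the 2-3 bucket and 50000 - 390 = 49610 stays in the 4-5 bucket, matching the estimates for temp and N.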

(32)

Prediction v/s real world

Scaled prediction of energy consumption by the Predictor using different learning kernels for our example, compared to the actual consumption (bold blue). This shows how good the Predictor is.

(33)

CHOAMP: EXECUTION TIME

• On average, 28% improvement with the Linear Regression Oracle

(34)

CHOAMP: EXECUTION TIME

• BT: Selects the same configuration as the base case

• DC: Unknown parallel operations

• EP: Highly parallelizable; ↑ configuration, ↑ gain

• SP: High sync costs; ↓ threads, ↑ gain

(35)

CHOAMP: ENERGY

• On average, 65% improvement with the Linear Regression Predictor

• A lower configuration than the base mostly yields higher energy gains

(36)

CHOAMP: ENERGY

• EP: Highly parallel


(37)

CHOAMP: EDP

• On average, 54% improvement with the Linear Regression Oracle

• A mid configuration mostly yields good results

(38)

CHOAMP: EDP

• LU: Gain higher than Energy, Execution Time negative

• SP: Rounded off


(39)

CES + CHOAMP

• The Oracle is trained without knowledge of CES

• On average, 14% improvement in execution time

(40)

CES + CHOAMP

• BT: Base configuration selected

• EP, IS, LU: Loss Amplified, Better Load balancing

• SP: Gain reduced, Lower Sync costs


(41)

CHOAMP: Conclusions

• On an average 28% improvement in execution time

• Average 65% improvement in energy consumption

• Average 58% improvement in EDP

• Average 14% improvement in execution time with CES

• Works well in tandem with dynamic scheduling algorithms (i) HMP (ii) CES.

• Sliding window regression: No significant gains.


(42)

SIAM: Scheduling Irregular Workloads on Asymmetric Multicore processors


(43)

Irregular Parallel Workloads: Graph Algorithms

• Parallelism depends on the input graph.

• Include the graph in training the Oracle.

• How to represent the graph with a set of properties?

• The properties should capture:

  • Overall workload

  • Workload distribution

  • Memory accesses/atomics

• Training for each algorithm-graph pair
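As an illustration of representing a graph by a few properties, the sketch below derives two simple statistics from per-vertex degrees: the average degree as a proxy for overall workload and the degree variance as a proxy for workload distribution. These two properties are assumptions chosen for the example; the slides do not list the exact properties SIAM uses, and `GraphFeatures` and `graph_features` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical graph summary; not the feature set used by SIAM. */
typedef struct {
    double avg_degree;      /* proxy for overall workload per vertex */
    double degree_variance; /* proxy for workload distribution (skew) */
} GraphFeatures;

/* Compute mean and population variance of the vertex degrees. */
GraphFeatures graph_features(const int *degrees, size_t n_vertices) {
    assert(degrees != NULL && n_vertices > 0);
    double sum = 0.0, sum_sq = 0.0;
    for (size_t v = 0; v < n_vertices; v++) {
        sum += degrees[v];
        sum_sq += (double)degrees[v] * degrees[v];
    }
    GraphFeatures f;
    f.avg_degree = sum / (double)n_vertices;
    f.degree_variance = sum_sq / (double)n_vertices - f.avg_degree * f.avg_degree;
    return f;
}
```

A regular graph has zero degree variance, while a skewed graph with the same average degree has a larger one, so the pair separates inputs that a single workload number would treat as identical.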

(44)

SIAM: Workflow

[Figure: SIAM workflow. Training: microbenchmarks, training graphs, algorithm features, and hardware configurations produce feature vectors for the knowledge base. Compilation: the input graph is preprocessed (feature extraction of aed, s; compaction into aed´ and Input Graph´) and the Predictor outputs the O_bl config.]

(45)

Average EDP gain over Algorithms

[Figure: bar chart of percentage EDP improvement per algorithm (y-axis from -40 to 100) for the series Pred, pre, and Pred+pre. Higher is better.]

Percentage EDP improvement

            com      con      pR       sssp     tc       Avg
Pred        41.9     20.84    62.12    34.17    11.41    34.09
pre         21.17    22.9     39.85    29.08    17.37    26.07
Pred+pre    55.99    56.23    70.91    50.85    29.68    52.73

(46)

SIAM: Conclusion

• Graph properties are required to fully capture the workload.

• Compaction: a good representation of the input graph can reduce both energy consumption and execution time.

• The Predictor shows, on average, a 52% improvement in EDP.

• Add edge-weight properties for weight-based algorithms (SSSP, MST, max flow, etc.)