(1)

Parallel Program Scheduling in AMPs

Jyothi Krishna V S

Guides: Shankar Balachandran and Rupesh Nasre

IIT Madras, April 20, 2018

(2)

CPU: Power Wall

[Figure: a CPU with a single processor next to a CPU with four processors]

• Moore’s Law hit the Power Wall

• It is not practical to keep increasing the frequency

• Move to a multicore environment: multiple cores working at a lower frequency

• Move to parallel programming to keep reaping the benefit

1/32

(5)

High Static Cost of CPU

[Figure: static and dynamic energy consumption of four processors running CPU-friendly and memory-friendly tasks, shown without and with DVFS]

• CPU-friendly tasks: mostly ALU operations (active CPU cycles)

• Memory-friendly tasks: mostly memory/branch operations

• DVFS (dynamic voltage and frequency scaling) reduced the negative effect by lowering dynamic energy

• The static cost of keeping modules powered on remained high
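
The split into dynamic and static energy can be made concrete with the standard CMOS approximation: each active cycle costs roughly C·V² joules, while static energy accrues for as long as the core stays powered. The sketch below is illustrative only. The `op_point` struct, the function names, and the assumption that memory stall time does not depend on the core frequency are not from the slides.

```c
/* Hypothetical operating point of one core; all values are caller-supplied. */
typedef struct {
    double volts;         /* supply voltage V */
    double hertz;         /* clock frequency f */
    double static_watts;  /* power needed just to keep the core on */
} op_point;

/* Dynamic energy: every active cycle switches an effective capacitance
 * c_eff (farads) and therefore costs c_eff * V^2 joules. */
double dynamic_energy(double c_eff, op_point p, double active_cycles) {
    return c_eff * p.volts * p.volts * active_cycles;
}

/* Static energy: paid for the whole run time, memory stalls included.
 * Assumption: stall_seconds is independent of the core frequency. */
double static_energy(op_point p, double active_cycles, double stall_seconds) {
    double seconds = active_cycles / p.hertz + stall_seconds;
    return p.static_watts * seconds;
}

/* Total energy of one task at the given operating point. */
double total_energy(double c_eff, op_point p,
                    double active_cycles, double stall_seconds) {
    return dynamic_energy(c_eff, p, active_cycles)
         + static_energy(p, active_cycles, stall_seconds);
}
```

With made-up inputs (c_eff = 1e-9 F, 1e9 active cycles, 1 s of stalls, 0.5 W static power), moving from 1.0 V at 1 GHz to 0.8 V at 500 MHz cuts dynamic energy from 1.0 J to 0.64 J, but static energy rises from 1.0 J to 1.5 J because the run takes longer. That residual static cost is what DVFS alone cannot remove.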

2/32

(7)

Asymmetric Multicore Processors (AMPs)

[Figure: static and dynamic energy consumption of CPU-friendly and memory-friendly tasks on the processors of an AMP]

• CPU-friendly tasks go to big, powerful cores

• Memory-friendly tasks go to small, power-efficient cores

• Examples: big.LITTLE from ARM, Tegra from NVIDIA

3/32

(8)

AMP: Challenges with a multithreaded program

What would be the optimal hardware configuration to run a parallel program? (CHOAMP, SIAM*)

• Which type of core should each thread run on?

• Does an AMP environment help? (Use one type of core at a time)

• How many resources of each type?

• What is the optimal frequency configuration for each core type?

Given an AMP environment, what would be the optimal scheduling for a given parallel program? (CES, SIAM*)

• How to maintain fairness among parallel threads?

• How to efficiently manage the memory system (prefetching, variable cache sizes, etc.)?

* To be submitted

4/32

(11)

Compiler Enhanced Scheduling (CES): Recap

Uniform parallel workload: online Performance Evaluation Model (PEM)

[Figure: a parallel for with a shared worklist of 40 iterations; the PEM divides the iterations between the big core and the LITTLE core, with iteration stealing and re-entry]

Non-uniform parallel workload: offline affinity and normalization scheduling

[Figure: flowchart for OpenMP for and sections constructs with an affinity allocation stage and a normalization stage that compares workload limits (wlim) and picks a victim section]

5/32

(13)

CHOAMP: Cost Based Hardware Optimization for Asymmetric Multicore Processors

6/32

(14)

Optimal Hardware

Given an input program, a user-selected cost function, and the possible hardware configurations, can we find an optimal hardware configuration that runs the program at minimal cost?

• Static/compile time: Chen and John [2]

• Dynamic, execution-time decisions: HMP scheduling [3]

• Hybrid, compile-time instrumentation and runtime decision: PIE [4], CES

[2] J. Chen and L. K. John. Efficient program scheduling for heterogeneous multi-core processors. In DAC ’09.

[3] http://www.arm.com

[4] K. Van Craeynest et al. Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE). ISCA ’12.
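
The problem statement can be read as an argmin over hardware configurations under a chosen cost function. The sketch below is a minimal illustration, assuming that a time and an energy estimate are already available for each configuration. The type and function names are hypothetical, and the three cost kinds mirror the metrics reported later in the talk (execution time, energy, and EDP, the energy-delay product).

```c
#include <stddef.h>

/* User-selected cost function. */
typedef enum { COST_TIME, COST_ENERGY, COST_EDP } cost_kind;

/* Hypothetical per-configuration estimate (e.g. for "4L4b"). */
typedef struct {
    const char *name;  /* configuration label */
    double seconds;    /* estimated execution time */
    double joules;     /* estimated energy */
} config_estimate;

/* Evaluate one configuration under the selected cost function. */
double cost_of(const config_estimate *c, cost_kind kind) {
    switch (kind) {
    case COST_ENERGY: return c->joules;
    case COST_EDP:    return c->joules * c->seconds;  /* energy x delay */
    case COST_TIME:
    default:          return c->seconds;
    }
}

/* Return the index of the cheapest configuration; n must be at least 1. */
size_t best_config(const config_estimate *cfgs, size_t n, cost_kind kind) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (cost_of(&cfgs[i], kind) < cost_of(&cfgs[best], kind))
            best = i;
    }
    return best;
}
```

Different cost functions can select different configurations for the same program, which is why the cost function is a user input rather than a fixed choice.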

7/32

(15)

Static v/s Dynamic Scheduling

• Static

Unknown variables

Unknown Thread-to-Core Mapping

Hardware Resource Reservation

Scalable over large hardware configurations

• Dynamic

Runtime overhead

Scalability of the engine

Accurate knowledge of workload

8/32

(17)

Static Scheduling

Unknown loop bounds: estimate unknown variables based on known variables.

Unknown wasted cycles: estimate the expected delay by learning from a large set of examples.

[Figure: a cost function over ALU, memory, and control operations selects the optimal hardware configuration among HC1, HC2, HC3, and HC4]

A Predictor, trained with a set of program features (selected based on architectural differences between AMP cores and the choice of parallel program) and hardware configurations, identifies the optimal hardware configuration.
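
One way to picture such a Predictor is as one regression model per hardware configuration, evaluated on the program's feature vector. The sketch below assumes a plain linear model over three operation counts; the struct layout, the feature order, and the weights used by any caller are hypothetical and are not CHOAMP's trained models.

```c
#include <stddef.h>

#define N_FEATURES 3  /* illustrative: ALU, memory, control operation counts */

/* Hypothetical linear model for one hardware configuration. */
typedef struct {
    double bias;
    double weight[N_FEATURES];
} linear_model;

/* Predicted cost = bias + sum of weight[k] * features[k]. */
double predict_cost(const linear_model *m, const double *features) {
    double cost = m->bias;
    for (size_t k = 0; k < N_FEATURES; k++)
        cost += m->weight[k] * features[k];
    return cost;
}

/* Index of the configuration with the lowest predicted cost; n_cfg >= 1. */
size_t predict_best(const linear_model *models, size_t n_cfg,
                    const double *features) {
    size_t best = 0;
    for (size_t i = 1; i < n_cfg; i++) {
        if (predict_cost(&models[i], features) <
            predict_cost(&models[best], features))
            best = i;
    }
    return best;
}
```

The weights would come from the training phase; at compile time only the dot products are evaluated, so the selection itself is cheap.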

9/32

(19)

CHOAMP: Training Phase

[Figure: training-phase block diagram. Blocks: microbenchmarks, features, feature vectors, hardware configurations, cost functions, training, knowledge base]

10/32

(20)

CHOAMP: Compilation Phase

[Figure: compilation-phase block diagram. Blocks: input program, user input, feature vectors, knowledge base, Predictor, Obl config, load balance, transformed program]

11/32

(21)

CHOAMP: The Big Picture

[Figure: training and compilation phases combined. Training: microbenchmarks, features, feature vectors, hardware configurations, cost functions, knowledge base. Compilation: input program, user input, feature vectors, Predictor, Obl config, load balance, transformed program]

(22)

Test Environment : big.LITTLE

[Figure: big.LITTLE system design with Cortex-A15/A57 and Cortex-A7/A53 clusters, one L2 cache per cluster, a cache coherent interconnect, GPU, memory, and peripheral interconnect]

Core type        Cortex-A7                  Cortex-A15
Pipeline         simple 8-stage in-order    out-of-order, multi-issue
Frequency        600 - 1400 MHz             800 - 2000 MHz
Speed            1.9 DMIPS                  4.01 DMIPS
Instruction set  Thumb-2 (both)

                 Single core       Four cores
A15 frequency    Idle    Peak      Idle    Peak
2.0 GHz          0.95    2.40      1.15    5.28
1.8 GHz          0.70    1.65      0.70    3.50
1.4 GHz          0.38    0.85      0.40    2.30
1.2 GHz          0.30    0.70      0.32    1.80

A7 frequency     Idle    Peak      Idle    Peak
1.4 GHz          0.20    0.50      0.22    1.40

Parallel program: OpenMP

13/32

(25)

Feature Selection

• Low level: based on hardware dissimilarities between big and LITTLE cores

ALU operations

Memory operations

Branch operations

False sharing

• High Level: synchronization constructs provided by OpenMP:

Barriers

Atomics

Critical sections

Flush operations

Reduction operations
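
For readers less familiar with OpenMP, the high-level features listed above correspond to directives like the ones in this small sketch. The function and struct names are made up for illustration, the explicit barrier and flush are there only to show the syntax, and the code assumes non-negative input values. Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.

```c
/* Hypothetical result record for the example. */
typedef struct {
    long sum;     /* sum of all elements */
    int  max;     /* largest element (inputs assumed >= 0) */
    int  visited; /* number of elements processed */
} stats;

stats summarize(const int *data, int n) {
    long sum = 0;
    int max = 0, visited = 0;

    #pragma omp parallel
    {
        /* Reduction: each thread keeps a private sum, combined at the end. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += data[i];

            /* Atomic: a single shared-memory update without a full lock. */
            #pragma omp atomic
            visited++;

            /* Critical section: only one thread at a time runs this block. */
            #pragma omp critical
            { if (data[i] > max) max = data[i]; }
        }

        /* Barrier: all threads wait here (redundant after omp for, which
         * already ends with an implicit barrier). */
        #pragma omp barrier

        /* Flush: make this thread's view of shared memory consistent. */
        #pragma omp flush
    }

    stats out = { sum, max, visited };
    return out;
}
```

Each of these constructs forces threads to wait on or communicate with one another, so their frequency in a program is a plausible signal for how well it tolerates slower cores.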

14/32

(27)

Implementation

• Hardware configurations: Odroid XU3 [5]

Core configurations: 4L4b, 0L4b, 4L2b, 2L2b, 4L0b

big frequencies: 2 GHz, 1.8 GHz, 1.6 GHz, 1.4 GHz, 1.2 GHz

• Micro-benchmark generation: Python 2.7

• Dynamic scheduling: HMP and CES [6] for dynamic load balancing

• Analysis and transformation: IMOP [7]

• Regression tool: Java Scientific Library [8]

Regression engines: linear, quadratic, and Gaussian

• Benchmark: NPB [9]

[5] http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127

[6] J. K. Viswakaran Sreelatha and S. Balachandran. 2016. Compiler Enhanced Scheduling for OpenMP for Heterogeneous Multiprocessors. EEHCO ’16.

[7] Nougrahiya and Nandivada. IMOP: http://www.cse.iitm.ac.in/~amannoug/imop/

[8] M. Flanagan. http://www.ee.ucl.ac.uk/~mflanaga/java/index.html

[9] Bailey et al. The NAS parallel benchmarks: summary and preliminary results.

15/32

(28)

Walk Through Example

#define M 50000

int f(int *s, int A[], int cumSum[], int L) {
    int MAX = 0, localSum = 0, temp = L / 128;
    int N = M - temp; /* Unknown Variable */
    #pragma omp parallel
    {
        int i, j;
        #pragma omp for reduction(+:localSum)
        for (i = 0; i < N; i++) {
            localSum += A[i];
            cumSum[i] = 0;
            for (j = 0; j < N; j++) {
                if (j <= i) cumSum[i] += A[j];
            }
            #pragma omp critical
            { if (MAX < A[i]) MAX = A[i]; }
        }
    }
    return MAX;
}

16/32

(31)

Walk Through Example

Op        Result range
i + j     range(MAX(R(i)) + MAX(R(j)))
i * j     range(MAX(R(i)) * MAX(R(j)))
i - j     range(MAX(R(i)) - MIN(R(j)))
i / j     range(MAX(R(i)) / MIN(R(j)))
i >> j    i - MAX(R(j))
i % j     R(j)
i = j     MAX(R(i), R(j))

R(k): range to which the value k belongs

MAX(R(i)): known maximum value in the range of i

L is placed into the largest non-empty bucket: R4−5

temp = 50000/128 = R2−3

N = 50000 - R2−3 = R4−5 = 50000

[Figure: dependence graph in which L and 128 feed temp, and M and temp feed N]
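
The arithmetic rows of the table can be sketched in a few lines of C. This is an illustration under stated assumptions, not the CHOAMP implementation: it assumes that a bucket such as R2−3 is the decade [10^2, 10^3), that values are non-negative, and that divisors are non-zero. The `range_t` type and all function names are hypothetical.

```c
/* Estimate for one variable: the floor of its bucket and its known maximum. */
typedef struct {
    long lo;  /* MIN(R(x)): smallest value in the bucket */
    long hi;  /* MAX(R(x)): largest value known for x */
} range_t;

/* Assumed bucketing: the largest power of ten not exceeding k (1 for k < 10). */
static long decade_floor(long k) {
    long lo = 1;
    while (k / 10 >= lo)
        lo *= 10;
    return lo;
}

/* range(k): the bucket containing the estimated value k. */
static range_t range_of(long k) {
    range_t r = { decade_floor(k), k };
    return r;
}

/* A compile-time constant is its own range. */
range_t range_constant(long c) {
    range_t r = { c, c };
    return r;
}

/* Rules from the table: MAX op MAX for + and *, MAX op MIN for - and /. */
range_t range_add(range_t a, range_t b) { return range_of(a.hi + b.hi); }
range_t range_mul(range_t a, range_t b) { return range_of(a.hi * b.hi); }
range_t range_sub(range_t a, range_t b) { return range_of(a.hi - b.lo); }
range_t range_div(range_t a, range_t b) { return range_of(a.hi / b.lo); }
```

Replaying the walk-through with L in the bucket starting at 10000 and a known maximum of 50000: `range_div` gives temp an estimate of 390 in the decade starting at 100, and `range_sub` then puts N at 49900 in the decade starting at 10000. The slide reports N as 50000 in the same bucket, so how the representative value is rounded is evidently handled differently in the original.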

16/32

(32)

Prediction v/s real world

Scaled prediction of energy consumption by the Predictor using different learning kernels for our example, compared to the actual value (bold blue). This shows how good the Predictor is.

17/32

(33)

CHOAMP: EXECUTION TIME

On an average 28% improvement with Linear Regression Oracle

18/32

(34)

CHOAMP: EXECUTION TIME

BT: Selecting the same configuration as base case

DC: Unknown Parallel Operations.

EP: Highly parallelizable, Configuration Gain

SP: High Sync costs, Threads Gain

19/32

(35)

CHOAMP: ENERGY

On an average 65% improvement with the Linear Regression Predictor

Mostly, a lower configuration than the base yields higher energy gains

20/32

(36)

CHOAMP: ENERGY

• EP: Highly parallel

21/32

(37)

CHOAMP: EDP

• On an average 54% improvement with Linear Regression Oracle

• Mostly a mid configuration yields good results

22/32

(38)

CHOAMP: EDP

• LU: Gain higher than Energy, Execution Time negative

• SP: Rounded off

23/32

(39)

CES + CHOAMP

• The Oracle is trained without knowledge of CES

• On an average 14% improvement in execution time

24/32

(40)

CES + CHOAMP

• BT: Base configuration selected

• EP, IS, LU: Loss Amplified, Better Load balancing

• SP: Gain reduced, Lower Sync costs

25/32

(41)

CHOAMP: Conclusions

• On an average 28% improvement in execution time

• Average 65% improvement in energy consumption

• Average 58% improvement in EDP

• Average 14% improvement in execution time with CES

• Works well in tandem with dynamic scheduling algorithms (i) HMP (ii) CES.

• Sliding window regression: No significant gains.

26/32

(42)

SIAM: Scheduling Irregular Workloads on Asymmetric Multicore processors

27/32

(43)

Irregular Parallel Workloads: Graph Algorithms

• Parallelism depends on the input graphs.

• Include Graph into training the oracle.

• How to represent the graph with set of properties?

• Should capture:

Overall Workload

Workload distribution

Memory accesses/Atomics

• Training for Algorithm-Graph pair
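
As a rough illustration of what a property-based graph representation could look like, the sketch below derives a few degree statistics from a graph stored in CSR (compressed sparse row) form. The choice of properties is an assumption made for this example: edge count stands in for overall workload, and maximum degree and degree variance stand in for workload distribution. It does not cover memory accesses or atomics, and it is not SIAM's actual feature set.

```c
#include <stddef.h>

/* Hypothetical graph feature record. */
typedef struct {
    size_t edges;            /* overall workload */
    size_t max_degree;       /* heaviest single vertex */
    double avg_degree;       /* mean out-degree */
    double degree_variance;  /* spread of work across vertices */
} graph_features;

/* row_ptr has n_vertices + 1 entries; vertex v owns the edge slots
 * row_ptr[v] .. row_ptr[v + 1] - 1. */
graph_features extract_features(const size_t *row_ptr, size_t n_vertices) {
    graph_features f = { 0, 0, 0.0, 0.0 };
    if (n_vertices == 0)
        return f;

    f.edges = row_ptr[n_vertices] - row_ptr[0];
    f.avg_degree = (double)f.edges / (double)n_vertices;

    double sq_sum = 0.0;
    for (size_t v = 0; v < n_vertices; v++) {
        size_t deg = row_ptr[v + 1] - row_ptr[v];
        if (deg > f.max_degree)
            f.max_degree = deg;
        double d = (double)deg - f.avg_degree;
        sq_sum += d * d;
    }
    f.degree_variance = sq_sum / (double)n_vertices;
    return f;
}
```

A graph with high degree variance concentrates work on a few vertices, which is the kind of imbalance a scheduler for irregular workloads would want to know about before choosing a configuration.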

28/32

(44)

SIAM: Workflow

[Figure: SIAM workflow. Training: microbenchmarks, training graphs, algorithm features, hardware configurations, feature vectors, knowledge base. Compilation: input program, input graph, preprocessing, feature extraction (aed,s), compaction (aed´, Input Graph´), feature vectors, Predictor, Obl config, load balance, transformed program]

29/32

(45)

Average EDP gain over Algorithms

[Figure: bar chart of percentage EDP improvement per algorithm, averaged over graphs; y-axis from -40 to 100]

            com     con     pR      sssp    tc      Avg
Pred        41.9    20.84   62.12   34.17   11.41   34.09
pre         21.17   22.9    39.85   29.08   17.37   26.07
Pred+pre    55.99   56.23   70.91   50.85   29.68   52.73

Higher is better.

30/32

(46)

SIAM: Conclusion

• Graph properties are required to fully capture the workload.

• Compaction: Good representation of the input graph can reduce both energy consumption and execution time.

• The Predictor shows on an average 52% improvement in EDP.

• Add edge-weight properties for weight-based algorithms: SSSP, MST, max flow, etc.

31/32
