panel-plw-2007.ppt 340KB Jun 23 2011 12:05:46 PM

(1)

Lizy Kurian John, LCA, UT Austin


The University of Texas at Austin

What Programming Language/Compiler Researchers Should Know about Computer Architecture

Lizy Kurian John

Department of Electrical and Computer Engineering

(2)


Somebody once said

“Computers are dumb actors

(3)


Computer Architecture Basics

• ISAs
• RISC vs CISC
• Assembly language coding
• Datapath (ALU) and controller
• Pipelining
• Caches
• Out-of-order execution

(4)


Basics

• ILP (instruction-level parallelism)
• DLP (data-level parallelism)
• TLP (thread-level parallelism)
• Massive parallelism
• SIMD/MIMD
• VLIW
• Performance and power metrics

(5)


The Bottom Line

Programming language choice affects performance and power, e.g., Java

(6)


A Java Hardware Interpreter

• Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
• This technique is used by Nazomi Communications, Parthus (Chicory Systems)

[Figure: block diagram with the labels Java class file, Native executable, Fetch, Hardware bytecode translator, Decode, Execute, bytecodes.]

(7)


HardInt Performance

[Figure: bar chart, "4-way performance". Y-axis: execution cycles (millions), ticks 0 to 400. X-axis: db, javac, jess, mpeg, mtrt. Series: JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, Hard-Int. Bar labels, in the order extracted: 44.8, 109.3, 149.7, 934.1, 911.7, 60.4, 135.9, 85.2, 127.7, 492.2, 71.0, 133.7, 221.5, 989.4, 867.8, 59.8, 108.8, 146.2, 146.1, 321.9, 16.0, 27.7, 28.8, 250.2, 120.0.]

• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5

(8)


Compiler and Power

[Figure: a data dependence graph (DDG) over instructions A to F, with two schedules of the same instructions.]

Schedule 1: Peak Power = 3, Energy = 6

Schedule 2: Peak Power = 2, Energy = 6
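
The contrast between the two schedules can be sketched in Python. Only the node names A to F come from the slide; the dependence edges, the unit-latency and unit-energy model, and the helper names `list_schedule`, `peak_power`, and `energy` are assumptions made for illustration.

```python
# Hypothetical data dependence graph (DDG): node -> set of predecessors.
# The edges are assumed for illustration; the slide does not list them.
DDG = {
    "A": set(),
    "B": {"A"},
    "C": set(),
    "D": {"B"},
    "E": set(),
    "F": {"C", "D", "E"},
}


def list_schedule(ddg, width):
    """Greedy list scheduling: issue at most `width` ready ops per cycle.

    Assumes an acyclic graph and unit latency for every op.
    """
    done = set()
    remaining = sorted(ddg)
    schedule = []
    while remaining:
        # An op is ready once all of its predecessors finished in earlier cycles.
        ready = [op for op in remaining if ddg[op] <= done]
        issued = ready[:width]
        schedule.append(issued)
        done.update(issued)
        remaining = [op for op in remaining if op not in done]
    return schedule


def peak_power(schedule):
    # Assumed model: one unit of power per op active in a cycle.
    return max(len(cycle) for cycle in schedule)


def energy(schedule):
    # Assumed model: one unit of energy per op, so energy is the op count.
    return sum(len(cycle) for cycle in schedule)


wide = list_schedule(DDG, width=3)
narrow = list_schedule(DDG, width=2)
for name, sched in (("width 3", wide), ("width 2", narrow)):
    print(name, sched, "peak power =", peak_power(sched), "energy =", energy(sched))
```

Under these assumptions both schedules finish in four cycles and spend the same energy; capping the issue width only moves work out of the busiest cycle, which lowers the peak.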

(9)


Valluri et al., 2001 HPCA workshop

• Quantitative study
• Influence of state-of-the-art optimizations on energy and power of the processor examined
• Optimizations studied
  ◦ Standard -O1 to -O4 of DEC Alpha’s cc compiler
  ◦ Four individual optimizations – simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global

(10)


Standard Optimizations on Power

(all values normalized to O0 = 100)

Benchmark     Opt level   Energy   Exec Time   Insts    Avg Power   IPC
compress      O0          100      100         100      100         100
              O1          74.48    81.55       81.52    91.33       99.96
              O2          75.13    81.44       82.04    92.25       100.73
              O3          75.13    81.44       82.04    92.25       100.73
              O4          79.01    82.77       86.11    95.45       104.03
go            O0          100      100         100      100         100
              O1          66.2     64.13       68.94    103.23      107.5
              O2          62.62    61.31       63.01    102.14      102.78
              O3          62.62    61.31       63.01    102.14      102.78
              O4          63.67    62.19       63.75    102.38      102.51
(unlabeled)   O0          100      100         100      100         100
              O1          81.32    83.66       83.18    97.2        99.42
              O2          79.6     75.97       82.97    104.78      109.21
              O3          79.6     75.97       82.97    104.78      109.21
              O4          85.71    77.89       90.96    110.05      116.78
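
The columns of this table are tied together by two identities: average power is energy divided by execution time, and IPC is instructions divided by cycles. Since every column is normalized to O0 = 100, the same ratios should hold between the normalized values, assuming a fixed clock so that execution time tracks cycle count. A small Python check on three rows copied from the table (rows are named by position; `derived` and `max_error` are hypothetical helpers):

```python
# Rows copied from the table: (label, energy, exec_time, insts, avg_power, ipc),
# all normalized so that the O0 build is 100.
ROWS = [
    ("block 1, O1", 74.48, 81.55, 81.52, 91.33, 99.96),
    ("block 2, O1", 66.20, 64.13, 68.94, 103.23, 107.50),
    ("block 3, O4", 85.71, 77.89, 90.96, 110.05, 116.78),
]


def derived(energy, exec_time, insts):
    """Recompute normalized average power and IPC from the other columns."""
    avg_power = 100.0 * energy / exec_time   # power = energy / time
    ipc = 100.0 * insts / exec_time          # IPC = instructions / cycles
    return avg_power, ipc


def max_error(rows):
    """Largest gap between a reported value and its recomputed value."""
    worst = 0.0
    for label, energy, exec_time, insts, avg_power, ipc in rows:
        p, i = derived(energy, exec_time, insts)
        print(f"{label}: power {p:.2f} (table {avg_power}), IPC {i:.2f} (table {ipc})")
        worst = max(worst, abs(p - avg_power), abs(i - ipc))
    return worst


worst = max_error(ROWS)
```

The last row shows why energy and power can move in opposite directions: execution time falls faster than energy, so the run uses less energy overall while drawing more power on average.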

(11)


Somebody once said

“Computers are dumb actors

(12)


A large part of modern out-of-order processors

(13)


Let me get more arrogant

A large part of modern out-of-order processors was designed because computer architects thought

(14)


Value Prediction

Is a slap on your face

(15)


Value Locality



• Likelihood that an instruction’s computed result or a similar predictable result will occur soon
• Observation – a limited set of

(16)


(17)


Causes of value locality



• Data redundancy – many 0s, sparse matrices, white space in files, empty cells in spreadsheets
• Program constants –
• Computed branches – base address for jump tables is a run-time constant
• Virtual function calls – involve code to

(18)


Causes of value locality



• Memory alias resolution – compiler conservatively generates code – may contain stores that alias with loads
• Register spill code – stores and subsequent loads
• Convergent algorithms – convergence in parts of algorithms before global convergence
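
Value locality can be measured on a trace of (instruction address, result) pairs by counting how often an instruction produces the same value it produced the previous time, which is the simplest case of a history depth of one. The Python sketch below does that. The trace is made up for illustration (it mimics a spill reload, a jump-table base, and a loop counter), and `last_value_locality` is a hypothetical helper.

```python
def last_value_locality(trace):
    """Fraction of results equal to the previous result of the same instruction.

    `trace` is an iterable of (pc, value) pairs. The first execution of each
    pc has no history, so it is counted as a miss.
    """
    last = {}          # pc -> most recent result seen for that instruction
    hits = total = 0
    for pc, value in trace:
        if pc in last and last[pc] == value:
            hits += 1
        last[pc] = value
        total += 1
    return hits / total if total else 0.0


# Made-up trace: pc 0x10 reloads a spilled register (always 7),
# pc 0x14 loads a jump-table base (a run-time constant),
# pc 0x18 produces a loop counter (never repeats).
trace = []
for i in range(8):
    trace += [(0x10, 7), (0x14, 0x4000), (0x18, i)]

print(f"last-value locality: {last_value_locality(trace):.2f}")
```

In this trace 14 of the 24 results repeat the previous one: the spill reload and the jump-table base hit on every execution after the first, while the counter never does.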

(19)


2 Extremist Views

Anything that can be done in hardware should be done in hardware.

(20)


What do we need?

The Dumb actor

Or the

(21)


Challenging all compiler writers

The last 15 years were the defiant actor’s era.

What about the next 15? TLP, multithreading, parallelizing compilers – it’s time for a lot more dumb acting from the architect’s side.

(22)


(23)


Compiler Optimizations

• cc – Native C compiler on DEC Alpha 21064 running the OSF1 operating system
• gcc – Used to study the effect of

(24)


Standard Optimization Levels on cc

-O0 – No optimizations performed
-O1 – Local optimizations such as CSE (common subexpression elimination), copy propagation, IVE (induction variable elimination), etc.
-O2 – Inline expansion of static procedures and global optimizations such as loop unrolling, instruction scheduling
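
Local optimizations of the kind listed under -O1 work within a single basic block. As an illustration, here is a minimal Python sketch of local common subexpression elimination over three-address tuples. The tuple format and the `local_cse` name are assumptions for this example; it says nothing about how cc implements the pass.

```python
def local_cse(block):
    """Replace recomputed expressions in one basic block with copies.

    `block` is a list of (dest, op, src1, src2) tuples. Operand order is
    significant here (no commutativity handling).
    """
    available = {}  # (op, src1, src2) -> name currently holding that value
    out = []
    for dest, op, a, b in block:
        key = (op, a, b)
        holder = available.get(key)
        if holder is not None:
            # The value is already in `holder`: emit a copy instead.
            out.append((dest, "copy", holder, None))
        else:
            out.append((dest, op, a, b))
        # Writing dest kills every expression that reads dest or is held in dest.
        available = {k: v for k, v in available.items()
                     if dest not in k[1:] and v != dest}
        # Record the new expression unless dest overwrote one of its own operands.
        if holder is None and dest not in (a, b):
            available[key] = dest
    return out


block = [
    ("t1", "add", "a", "b"),
    ("t2", "add", "a", "b"),   # same expression, a and b unchanged
    ("a", "mul", "t1", "c"),   # redefines a
    ("t3", "add", "a", "b"),   # must be recomputed: a has changed
]
optimized = local_cse(block)
for instr in optimized:
    print(instr)
```

Only the second `add` becomes a copy. The third is kept because `a` was redefined in between, which is the invalidation step that keeps the pass safe.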

(25)


Standard Optimization Levels on gcc

-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, dead-code elimination, etc.
-O2 – Aggressive instruction scheduling
-O3 – Inlining of procedures

• Almost the same optimizations in each level of cc and gcc
• In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4
• cc used wherever possible, gcc used where specific hooks are required

(26)


Individual Optimizations



• Four gcc optimizations, all applied on top of -O1
• -fschedule-insns – local register allocation followed by basic-block list scheduling
• -fschedule-insns2 – postpass scheduling done
• -finline-functions – integrated all simple functions into their callers
• -funroll-loops – perform the optimization

(27)


Some observations



• Energy consumption falls when the number of instructions is reduced, i.e., when the total work done is less, energy is less
• Power dissipation is directly

(28)


Observations (contd.)



• Function inlining was found to be good for both power and energy
• Unrolling was found to be good for

(29)


MMX/SIMD

(30)


Standard Optimizations on Power (Contd)

(all values normalized to O0 = 100)

Benchmark     Opt level   Energy   Exec Time   Insts    Avg Power   IPC
su2cor        O0          100      100         100      100         100
              O1          97.38    100.24      92.49    97.15       92.27
              O2          97.69    99.38       92.49    98.3        93.07
              O3          97.69    99.38       92.49    98.3        93.07
              O4          98.31    99.27       92.84    99.02       93.51
(unlabeled)   O0          100      100         100      100         100
              O1          42.09    51.04       33.21    82.46       65.06
              O2          40.99    47.52       33.1     86.28       69.67
              O3          40.99    46.37       33.1     87.65       71.38
(unlabeled)   O0          100      100         100      100         100
              O1          30.1     36.64       20.01    82.15       54.63
              O2          28.93    34.01       19.05    85.06       56.01
              O3          28.93    34.01       19.05    85.06       56.01