Lizy Kurian John, LCA, UT Austin
The University of Texas at Austin
What Programming
Language/Compiler
Researchers should Know
about Computer Architecture
Lizy Kurian John
Department of Electrical and Computer Engineering
Somebody once said
“Computers are dumb actors
Computer Architecture
Basics
ISAs
RISC vs CISC
Assembly language coding
Datapath (ALU) and controller
Pipelining
Caches
Out of order execution
Basics
ILP
DLP
TLP
Massive parallelism
SIMD/MIMD
VLIW
Performance and Power metrics
The Bottomline
Programming Language choice
affects performance and power
eg: Java
A Java Hardware Interpreter
Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
This technique is used by Nazomi Communications and Parthus (Chicory Systems)
[Figure: pipeline diagram with the labels Java class file, Native executable, Fetch, Hardware bytecode translator, Decode, Execute, bytecodes]
HardInt Performance
[Figure: bar chart, 4-way performance. Y-axis: execution cycles (millions), 0 to 400. X-axis: db, javac, jess, mpeg, mtrt. Series: JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, Hard-Int. Bar labels in extraction order: 44.8, 109.3, 149.7, 934.1, 911.7, 60.4, 135.9, 85.2, 127.7, 492.2, 71.0, 133.7, 221.5, 989.4, 867.8, 59.8, 108.8, 146.2, 146.1, 321.9, 16.0, 27.7, 28.8, 250.2, 120.0]
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5
Compiler and Power
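[Figure: a data dependence graph (DDG) over instructions A, B, C, D, E, F, and two schedules of the same six instructions]
First schedule: Peak Power = 3, Energy = 6
Second schedule: Peak Power = 2, Energy = 6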
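The contrast between the two schedules can be sketched in a few lines. This is only an illustration: the cycle-by-cycle layouts below are hypothetical (the slide's exact schedules and DDG edges are not recoverable here), dependences are not modeled, and every instruction is assumed to cost one unit of power in the cycle it issues.

```python
# Hypothetical schedules: each inner list is one cycle, each letter is one
# instruction. Assumption: every instruction costs one unit of energy.
schedule_a = [["A"], ["B", "C", "D"], ["E"], ["F"]]
schedule_b = [["A"], ["B", "C"], ["D", "E"], ["F"]]


def peak_power(schedule):
    """Largest number of instructions issued in any single cycle."""
    return max(len(cycle) for cycle in schedule)


def energy(schedule):
    """Total number of instructions executed, at one unit each."""
    return sum(len(cycle) for cycle in schedule)


for name, sched in (("schedule_a", schedule_a), ("schedule_b", schedule_b)):
    print(name, "peak power =", peak_power(sched), "energy =", energy(sched))
```

Both layouts execute the same six instructions, so the energy is identical; only the way the work is spread across cycles changes the peak.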
Valluri et al., 2001 HPCA workshop
Quantitative Study
Influence of state-of-the-art optimizations on energy and power of the processor examined
Optimizations studied
Standard -O1 to -O4 of DEC Alpha’s cc compiler
Four individual optimizations – simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global
Standard Optimizations on
Power
Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
           O0         100     100        100     100        100
           O1         74.48   81.55      81.52   91.33      99.96
           O2         75.13   81.44      82.04   92.25      100.73
           O3         75.13   81.44      82.04   92.25      100.73
           O4         79.01   82.77      86.11   95.45      104.03

           O0         100     100        100     100        100
           O1         66.2    64.13      68.94   103.23     107.5
           O2         62.62   61.31      63.01   102.14     102.78
           O3         62.62   61.31      63.01   102.14     102.78
           O4         63.67   62.19      63.75   102.38     102.51

           O0         100     100        100     100        100
           O1         81.32   83.66      83.18   97.2       99.42
           O2         79.6    75.97      82.97   104.78     109.21
           O3         79.6    75.97      82.97   104.78     109.21
           O4         85.71   77.89      90.96   110.05     116.78
compress
go
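The columns of this table are not independent, which a short check makes visible. The sketch below assumes every column is normalized so that -O0 = 100 and that the clock frequency is fixed, so cycles scale with execution time; under those assumptions Energy ≈ Avg Power × Exec Time / 100 and IPC ≈ Insts / Exec Time × 100. The helper names are illustrative, and only three rows are copied from the table.

```python
# Rows copied from the table: (energy, exec_time, insts, avg_power, ipc),
# each assumed to be normalized so that -O0 = 100.
rows = [
    (74.48, 81.55, 81.52, 91.33, 99.96),
    (66.20, 64.13, 68.94, 103.23, 107.50),
    (85.71, 77.89, 90.96, 110.05, 116.78),
]


def derived(exec_time, insts, avg_power):
    """Recompute normalized energy and IPC from the other three columns."""
    energy = avg_power * exec_time / 100.0
    ipc = insts / exec_time * 100.0
    return energy, ipc


def consistent(row, tol=0.05):
    """True if the reported energy and IPC match the recomputed values."""
    energy, exec_time, insts, avg_power, ipc = row
    calc_energy, calc_ipc = derived(exec_time, insts, avg_power)
    return abs(calc_energy - energy) <= tol and abs(calc_ipc - ipc) <= tol


for row in rows:
    print(row, "consistent:", consistent(row))
```

Read this way, a row where execution time falls but average power rises (as in the last row) can still come out ahead on energy, because energy is the product of the two.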
Somebody once said
“Computers are dumb actors
A large part of modern
out of order processors
Let me get more arrogant
A large part of modern out of
order processors was designed
because
computer architects thought
Value Prediction
Is a slap on your face
Value Locality
Likelihood that an instruction’s
computed result or a similar
predictable result will occur soon
Observation – a limited set of
Causes of value locality
Data redundancy – many 0s, sparse
matrices, white space in files, empty
cells in spreadsheets
Program constants –
Computed branches – base address for
jump tables is a run-time constant
Virtual function calls – involve code to
Causes of value locality
Memory alias resolution – compiler
conservatively generates code – may
contain stores that alias with loads
Register spill code – stores and
subsequent loads
Convergent algorithms – convergence in
parts of algorithms before global
convergence
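One way to see value locality in action is a last-value predictor, which guesses that an instruction will produce the same result it produced last time. The sketch below is an illustration only, not the scheme from any particular paper: the trace format, the instruction addresses, and the sample data are all hypothetical, chosen to mimic the "many 0s" case of data redundancy.

```python
def last_value_hit_rate(trace):
    """Fraction of results correctly guessed by repeating the previous
    value seen at the same instruction address.

    trace: iterable of (pc, value) pairs (a hypothetical format).
    """
    last = {}           # pc -> most recent value produced at that pc
    hits = total = 0
    for pc, value in trace:
        if pc in last:  # a prediction is only possible after a first sighting
            total += 1
            hits += last[pc] == value
        last[pc] = value
    return hits / total if total else 0.0


# Hypothetical trace: a load at pc 0x10 reads a mostly zero (sparse) array,
# while an add at pc 0x14 produces a loop counter that never repeats.
sparse = [0, 0, 0, 7, 0, 0, 0, 0]
load_trace = [(0x10, x) for x in sparse]
counter_trace = [(0x14, i) for i in range(len(sparse))]

print("sparse load :", last_value_hit_rate(load_trace))
print("loop counter:", last_value_hit_rate(counter_trace))
```

The sparse load is predictable most of the time, while the counter never is under this simple scheme; a counter would need a different predictor (for example one that tracks a stride).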
2 Extremist Views
Anything that can be done in
hardware should be done in
hardware.
What do we need?
The Dumb actor
Or the
Challenging all compiler
writers
The last 15 years was the defiant actor’s era
What about the next 15? TLP,
Multithreading, Parallelizing compilers –
It’s time for a lot more dumb acting from
the architect’s side.
The University of Texas at Austin
Compiler Optimizations
cc – Native C compiler on DEC Alpha 21064 running OSF1 operating system
gcc – Used to study the effect of
Std Optimization Levels on cc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE,
copy propagation, IVE etc
-O2 – Inline expansion of static procedures
and global optimizations such as loop
unrolling, instruction scheduling
Std Optimization Levels on gcc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, dead-code elimination etc
-O2 – Aggressive instruction scheduling
-O3 – Inlining of procedures
Almost the same optimizations in each level of cc and gcc
In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4
cc used wherever possible, gcc used where specific hooks are required
Individual Optimizations
Four gcc optimizations, all optimizations applied on top of -O1
-fschedule-insns – local register allocation followed by basic-block list scheduling
-fschedule-insns2 – Postpass scheduling done
-finline-functions – Integrates all simple functions into their callers
-funroll-loops – Perform the optimization
Some observations
Energy consumption reduces when
# of instructions is reduced, i.e.,
when the total work done is less,
energy is less
Power dissipation is directly
Observations (contd.)
Function inlining was found to be
good for both power and energy
Unrolling was found to be good for
MMX/SIMD
Standard Optimizations on
Power (Contd)
Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
           O0         100     100        100     100        100
           O1         97.38   100.24     92.49   97.15      92.27
           O2         97.69   99.38      92.49   98.3       93.07
           O3         97.69   99.38      92.49   98.3       93.07
           O4         98.31   99.27      92.84   99.02      93.51

           O0         100     100        100     100        100
           O1         42.09   51.04      33.21   82.46      65.06
           O2         40.99   47.52      33.1    86.28      69.67
           O3         40.99   46.37      33.1    87.65      71.38

           O0         100     100        100     100        100
           O1         30.1    36.64      20.01   82.15      54.63
           O2         28.93   34.01      19.05   85.06      56.01
           O3         28.93   34.01      19.05   85.06      56.01
su2cor