
Lizy Kurian John, LCA, UT Austin

The University of Texas at Austin

What Programming Language/Compiler Researchers Should Know about Computer Architecture

Lizy Kurian John

Department of Electrical and Computer Engineering


Somebody once said

Computers are dumb actors


Computer Architecture Basics

ISAs

RISC vs CISC

Assembly language coding

Datapath (ALU) and controller

Pipelining

Caches

Out of order execution


Basics

ILP

DLP

TLP

Massive parallelism

SIMD/MIMD

VLIW

Performance and Power metrics


The Bottom Line

Programming language choice affects performance and power, e.g., Java


A Java Hardware Interpreter

Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)

This technique is used by Nazomi Communications and Parthus (Chicory Systems)

[Figure: block diagram with the labels Java class file, native executable, bytecodes, Fetch, hardware bytecode translator, Decode, Execute]


HardInt Performance

[Figure: bar chart of 4-way performance in execution cycles (millions) for the benchmarks db, javac, jess, mpeg, and mtrt. Series: JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, Hard-Int]

• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5


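Compiler and Power

[Figure: a data dependence graph (DDG) over instructions A to F, shown with two schedules of the same six instructions]

Schedule 1: Peak Power = 3, Energy = 6

Schedule 2: Peak Power = 2, Energy = 6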
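The point of this slide, that two schedules of the same instructions can have equal energy but different peak power, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not stated on the slide: every instruction costs one unit of energy, and the power in a cycle is the number of instructions issued in that cycle. The two schedules (`wide`, `balanced`) are hypothetical and are not the ones drawn in the figure.

```python
from typing import List

# A schedule is a list of cycles; each cycle lists the instructions issued in it.
Schedule = List[List[str]]


def peak_power(schedule: Schedule) -> int:
    """Largest number of instructions issued in any one cycle (unit power each)."""
    return max(len(cycle) for cycle in schedule)


def energy(schedule: Schedule) -> int:
    """Total number of instructions executed (unit energy each)."""
    return sum(len(cycle) for cycle in schedule)


# Hypothetical schedules of the same six instructions A..F (assumed, not from the slide).
wide = [["A"], ["B", "C", "D"], ["E"], ["F"]]
balanced = [["A"], ["B", "C"], ["D", "E"], ["F"]]

for name, sched in (("wide", wide), ("balanced", balanced)):
    print(name, "peak power =", peak_power(sched), "energy =", energy(sched))
```

Under these assumptions the total work, and so the energy, is fixed by the instruction count, while the scheduler's placement of instructions decides the peak.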


Valluri et al., 2001 HPCA workshop

Quantitative Study

Influence of state-of-the-art optimizations on energy and power of the processor examined

Optimizations studied

• Standard -O1 to -O4 of DEC Alpha’s cc compiler
• Four individual optimizations: simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global


Standard Optimizations on Power

Each group of rows is one benchmark; values are normalized to O0 = 100.

Opt level   Energy   Exec Time   Insts    Avg Power   IPC
O0          100      100         100      100         100
O1          74.48    81.55       81.52    91.33       99.96
O2          75.13    81.44       82.04    92.25       100.73
O3          75.13    81.44       82.04    92.25       100.73
O4          79.01    82.77       86.11    95.45       104.03

O0          100      100         100      100         100
O1          66.2     64.13       68.94    103.23      107.5
O2          62.62    61.31       63.01    102.14      102.78
O3          62.62    61.31       63.01    102.14      102.78
O4          63.67    62.19       63.75    102.38      102.51

O0          100      100         100      100         100
O1          81.32    83.66       83.18    97.2        99.42
O2          79.6     75.97       82.97    104.78      109.21
O3          79.6     75.97       82.97    104.78      109.21
O4          85.71    77.89       90.96    110.05      116.78

compress

go
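The table's columns appear to be tied together by two simple ratios: normalized average power is normalized energy divided by normalized execution time, and normalized IPC is normalized instruction count divided by normalized execution time. This is an observation from the numbers, not something the slide states, and it holds only to within rounding. A small Python check, using the first O1 row of the table (the function names are illustrative):

```python
def avg_power(energy: float, exec_time: float) -> float:
    """Normalized average power (O0 = 100) from normalized energy and time."""
    return 100.0 * energy / exec_time


def ipc(insts: float, exec_time: float) -> float:
    """Normalized IPC (O0 = 100) from normalized instruction count and time."""
    return 100.0 * insts / exec_time


# First O1 row of the table: Energy 74.48, Exec Time 81.55, Insts 81.52.
o1_power = round(avg_power(74.48, 81.55), 2)
o1_ipc = round(ipc(81.52, 81.55), 2)
print(o1_power, o1_ipc)  # compare with the table's Avg Power and IPC columns
```

Read this way, an optimization level lowers energy when it cuts instructions and time together, but can still raise average power if time shrinks faster than energy.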


Somebody once said

Computers are dumb actors


A large part of modern out-of-order processors


Let me get more arrogant

A large part of modern out-of-order processors was designed because computer architects thought


Value Prediction

Is a slap on your face


Value Locality

Likelihood that an instruction’s computed result or a similar predictable result will occur soon

Observation: a limited set of


Causes of value locality

• Data redundancy: many 0s, sparse matrices, white space in files, empty cells in spreadsheets
• Program constants
• Computed branches: base address for jump tables is a run-time constant
• Virtual function calls: involve code to


Causes of value locality

• Memory alias resolution: compiler conservatively generates code, which may contain stores that alias with loads
• Register spill code: stores and subsequent loads
• Convergent algorithms: convergence in parts of algorithms before global convergence
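One way to make value locality concrete is to measure how often an instruction produces the same result it produced the last time it ran, which is what a simple last-value predictor would guess. The Python sketch below is a minimal illustration, not the mechanism from any particular paper or processor. The trace format, a sequence of `(pc, result)` pairs, and the sample trace itself are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def last_value_hit_rate(trace: Iterable[Tuple[int, int]]) -> float:
    """Fraction of dynamic instructions whose result equals the previous
    result of the same static instruction (identified by its PC)."""
    last: Dict[int, int] = {}  # PC -> most recent result seen at that PC
    hits = 0
    total = 0
    for pc, value in trace:
        if pc in last and last[pc] == value:
            hits += 1          # a last-value predictor would have been right
        last[pc] = value       # remember the newest result for this PC
        total += 1
    return hits / total if total else 0.0


# Hypothetical trace: the instruction at 0x10 keeps producing 0 (for example
# a load of a zero-filled cell), while the one at 0x14 produces new values.
trace = [(0x10, 0), (0x14, 7), (0x10, 0), (0x14, 8), (0x10, 0), (0x14, 9)]
rate = last_value_hit_rate(trace)
print(f"last-value hit rate: {rate:.2f}")
```

In this toy trace only the repeated zero results count as hits, which mirrors the data redundancy cause listed above; spill reloads and run-time constants would show up the same way.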


2 Extremist Views

Anything that can be done in hardware should be done in hardware.


What do we need?

The Dumb actor

Or the


Challenging all compiler writers

The last 15 years was the defiant actor’s era

What about the next 15? TLP, multithreading, parallelizing compilers: it’s time for a lot more dumb acting from the architect’s side.


Compiler Optimizations

• cc: native C compiler on DEC Alpha 21064 running the OSF1 operating system
• gcc: used to study the effect of


Standard Optimization Levels on cc

• -O0: no optimizations performed
• -O1: local optimizations such as CSE, copy propagation, IVE, etc.
• -O2: inline expansion of static procedures and global optimizations such as loop unrolling and instruction scheduling


Standard Optimization Levels on gcc

• -O0: no optimizations performed
• -O1: local optimizations such as CSE, copy propagation, dead-code elimination, etc.
• -O2: aggressive instruction scheduling
• -O3: inlining of procedures

Almost the same optimizations are in each level of cc and gcc

In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4

cc is used wherever possible; gcc is used where specific hooks are required
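Loop unrolling is one of the ILP-increasing optimizations this deck names. As a rough, hand-written illustration of the transformation's shape (not the output of cc or gcc), the Python sketch below unrolls a summation loop by a factor of four with four independent accumulators. The function names and the unroll factor are choices made for the example. Python will not show a speedup; the point is that the four additions in the unrolled body do not depend on each other, which is what gives an out-of-order or wide-issue core more work to overlap.

```python
from typing import Sequence


def sum_rolled(xs: Sequence[float]) -> float:
    """Baseline loop: every addition depends on the previous one."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_unrolled4(xs: Sequence[float]) -> float:
    """Same sum, unrolled by 4 with independent accumulators."""
    s0 = s1 = s2 = s3 = 0.0
    n = len(xs)
    main = n - n % 4                 # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        s0 += xs[i]                  # these four additions are independent
        s1 += xs[i + 1]
        s2 += xs[i + 2]
        s3 += xs[i + 3]
    total = (s0 + s1) + (s2 + s3)
    for i in range(main, n):         # leftover elements when n % 4 != 0
        total += xs[i]
    return total


data = [float(i) for i in range(10)]
print(sum_rolled(data), sum_unrolled4(data))
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input the two results can differ in the last bits; the unrolled body is also larger, which is the code-size side of the trade-off.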


Individual Optimizations

Four gcc optimizations, all applied on top of -O1

• -fschedule-insns: local register allocation followed by basic-block list scheduling
• -fschedule-insns2: postpass scheduling done
• -finline-functions: integrated all simple functions into their callers
• -funroll-loops: perform the optimization


Some observations

Energy consumption reduces when the number of instructions is reduced, i.e., when the total work done is less, energy is less

Power dissipation is directly


Observations (contd.)

Function inlining was found to be good for both power and energy

Unrolling was found to be good for


MMX/SIMD


Standard Optimizations on Power (Contd)

Each group of rows is one benchmark; values are normalized to O0 = 100.

Opt level   Energy   Exec Time   Insts    Avg Power   IPC
O0          100      100         100      100         100
O1          97.38    100.24      92.49    97.15       92.27
O2          97.69    99.38       92.49    98.3        93.07
O3          97.69    99.38       92.49    98.3        93.07
O4          98.31    99.27       92.84    99.02       93.51

O0          100      100         100      100         100
O1          42.09    51.04       33.21    82.46       65.06
O2          40.99    47.52       33.1     86.28       69.67
O3          40.99    46.37       33.1     87.65       71.38

O0          100      100         100      100         100
O1          30.1     36.64       20.01    82.15       54.63
O2          28.93    34.01       19.05    85.06       56.01
O3          28.93    34.01       19.05    85.06       56.01

su2cor
