William Stallings Computer Organization d ... - Direktori File UPI

(1)

William Stallings William Stallings

Computer Organization d A hit t

and Architecture

Chapter 13

Instruction Level Parallelism Instruction Level Parallelism and Superscalar Processors

(2)

What is Superscalar?

aCommon instructions (arithmetic, load/store, conditional branch) can be initiated and

t d i d d tl executed independently

aEqually applicable to RISC & CISC

aI ti ll RISC

aIn practice usually RISC

(3)

Why Superscalar?

aMost operations are on scalar quantities (see RISC notes)

aImprove these operations to get an overall improvement

(4)

General Superscalar Organization

Organization

(5)

Superpipelined Superpipelined

aMany pipeline stages need less than half a clock cycle

aDouble internal clock speed gets two tasks per external clock cycle

aS l ll ll l f t h t

aSuperscalar allows parallel fetch execute

(6)

Superscalar v

Superpipeline

(7)

Limitations Limitations

aInstruction level parallelism aInstruction level parallelism aCompiler based optimisation aHardware techniques

aHardware techniques aLimited by

`True data dependency

`Procedural dependency

`Resource conflicts

`Output dependency

`Antidependency

(8)

True Data Dependency True Data Dependency

aADD r1, r2 (r1 := r1+r2;) aMOVE r3,r1 (r3 := r1;)

aCan fetch and decode second instruction in parallel with first

aCan NOT execute second instruction until first is finished

(9)

Procedural Dependency Procedural Dependency

aCan not execute instructions after a branch in parallel with instructions before a branch

aAlso, if instruction length is not fixed,

instructions have to be decoded to find out how many fetches are needed

many fetches are needed

aThis prevents simultaneous fetches

(10)

Resource Conflict Resource Conflict

aTwo or more instructions requiring access to the same resource at the same time

h

`e.g. two arithmetic instructions

aCan duplicate resources

` h t ith ti it

`e.g. have two arithmetic units

(11)

D d i

Dependencies

(12)

Design Issues Design Issues

aInstruction level parallelism

`Instructions in a sequence are independent

`Execution can be overlapped

`Governed by data and procedural dependency

aMachine Pa allelism aMachine Parallelism

`Ability to take advantage of instruction level parallelism

parallelism

`Governed by number of parallel pipelines

(13)

Instruction Issue Policy Instruction Issue Policy

aOrder in which instructions are fetched aOrder in which instructions are executed

aOrder in which instructions change registers and memory

(14)

In-Order Issue

In Order Completion In-Order Completion

aIssue instructions in the order they occur aNot very efficient

aMay fetch >1 instruction

aInstructions must stall if necessary

(15)

In-Order Issue In-Order Completion (Diagram)

Completion (Diagram)

(16)

In-Order Issue

Out of Order Completion Out-of-Order Completion

aOutput dependency

`R3:= R3 + R5; (I1)

`R4:= R3 + 1; (I2)

`R3:= R5 + 1; (I3)

`I2 depends on result of I1 data dependency

`I2 depends on result of I1 - data dependency

`If I3 completes before I1, the result from I1 will be wrong - output (read-write) dependencyg p ( ) p y

(17)

In-Order Issue Out-of-Order Completion (Diagram)

Completion (Diagram)

(18)

Out-of-Order Issue

Out of Order Completion Out-of-Order Completion

aDecouple decode pipeline from execution pipeline

aCan continue to fetch and decode until this pipeline is full

aWh f ti l it b il bl

aWhen a functional unit becomes available an instruction can be executed

aSi i t ti h b d d d

aSince instructions have been decoded, processor can look ahead

(19)

Out-of-Order Issue Out-of-Order Completion (Diagram)

Completion (Diagram)

(20)

Antidependency Antidependency

aWrite-write dependency

`R3:=R3 + R5; (I1)

`R4:=R3 + 1; (I2)

`R3:=R5 + 1; (I3)

`R7:=R3 + R4; (I4)

`I3 can not complete before I2 starts as I2 needs a value in R3 and I3 changes R3g

(21)

Register Renaming Register Renaming

aOutput and antidependencies occur because register contents may not reflect the correct

d i f th

ordering from the program aMay result in a pipeline stall aR i t ll t d d i ll aRegisters allocated dynamically

`i.e. registers are not specifically named

(22)

Register Renaming example Register Renaming example

aR3b:=R3a + R5a (I1) aR3b: R3a + R5a (I1) aR4b:=R3b + 1 (I2) aR3c:=R5a + 1 (I3) aR3c:=R5a + 1 (I3) aR7b:=R3c + R4b (I4)

aWithout subscript refers to logical register in aWithout subscript refers to logical register in

instruction

aWith subscript is hardware register allocated aWith subscript is hardware register allocated aNote R3a R3b R3c

(23)

Machine Parallelism Machine Parallelism

aDuplication of Resources aOut of order issue

aRenaming

aNot worth duplication functions without register renaming

aNeed instruction window large enough (more than 8)

(24)

Branch Prediction Branch Prediction

a80486 fetches both next sequential instruction after branch and branch target instruction

aGives two cycle delay if branch taken

(25)

RISC Delayed Branch RISC - Delayed Branch

aCalculate result of branch before unusable instructions pre-fetched

aAlways execute single instruction immediately following branch

aK i li f ll hil f t hi i t ti

aKeeps pipeline full while fetching new instruction stream

aN t d f l aNot as good for superscalar

`Multiple instructions need to execute in delay slot

`Instruction dependence problems

aRevert to branch prediction

(26)

Superscalar Execution

(27)

Superscalar Implementation Superscalar Implementation

aSimultaneously fetch multiple instructions

aLogic to determine true dependencies involving register values

aMechanisms to communicate these values

aMechanisms to initiate multiple instructions in parallel

aResources for parallel execution of multiple instructions

aM h i f i i i

aMechanisms for committing process state in correct order

(28)

Required Reading Required Reading

aStallings chapter 13

aManufacturers web sites aIMPACT web site

`research on predicated execution