Advanced Computer Architecture – fall 2003, Technion

Multithreaded Architectures
Dr. Avi Mendelson
Overview
Multithreaded Architecture
Multithreaded Micro-Architecture
References
J. Silc, B. Robic, T. Ungerer. "Asynchrony in Parallel Computing: From Dataflow to Multithreading." Parallel and Distributed Computing Practices, Vol. 1, No. 1, March 1998. http://www-csd.ijs.si/silc/pdcp.html
R.S. Nikhil, G.M. Papadopoulos, Arvind. "*T: A Multithreaded Massively Parallel Architecture." Proc. 19th Annual International Symposium on Computer Architecture (ISCA), pp. 156-167, 1992.
A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, M. Parkin. "Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors." IEEE Micro, Vol. 13, No. 3, pp. 48-61, June 1993.
R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. "The Tera Computer System." Proc. 1990 International Conference on Supercomputing, pp. 1-6, June 1990.
H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, T. Nishizawa. "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads." Proc. 19th ISCA, pp. 136-145, 1992.
G.S. Sohi, S.E. Breach, T.N. Vijaykumar. "Multiscalar Processors." Proc. 22nd ISCA, 1995.
M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, W.S. Lee. "The M-Machine Multicomputer." Proc. 28th Annual International Symposium on Microarchitecture, pp. 146-156, 1995.
D.M. Tullsen, S.J. Eggers, H.M. Levy. "Simultaneous Multithreading: Maximizing On-Chip Parallelism." Proc. 22nd ISCA, 1995.
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor." Proc. 23rd ISCA, 1996.
J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen. "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997.
S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, D.M. Tullsen. "Simultaneous Multithreading: A Platform for Next-Generation Processors." IEEE Micro, October 1997.
J. Emer. "Simultaneous Multithreading: Multiplying Alpha Performance." Microprocessor Forum, 1999. http://www.alphapowered.com/simu-multi-thread.ppt
H. Akkary (Intel), M. Driscoll (Portland State Univ.). "A Dynamic Multithreading Processor." Proc. 31st Annual International Symposium on Microarchitecture (MICRO-31), Nov. 1998.
A. Roth, G.S. Sohi. "Speculative Data-Driven Multithreading." Submitted to ASPLOS 2000.
M. Slater. "MicroUnity Lifts Veil on MediaProcessor: New Architecture Designed for Broadband Communications." Microprocessor Report, 10/23/95. http://www.mpronline.com/mpr/h/19951023/091402.html
Goals of Multithreaded Architecture
Successful MTA must have:
Minimal impact on the conventional design
Improved throughput on multiple thread workloads
"Multiple threads" = a multithreaded or multiprogrammed workload
Good cost/throughput
Minimal impact on single-thread performance
Would also like
Performance gain on multithreaded applications
Kinds of Multithreaded Architectures
Two dimensions
Primary: Front-end interleaving
Fine-grain (cycle-by-cycle) vs. coarse-grain (longer intervals)
Secondary: Back-end interleaving
Time multiplexing vs. space multiplexing
Depends on front-end
Three valid combinations
Blocked MT: coarse-grain FE + time-multiplexed BE
Interleaved MT: fine-grain FE + time-multiplexed BE
Simultaneous MT: fine-grain FE + space-multiplexed BE
Throughput vs. Utilization
Not the same thing
Throughput: how many instructions complete per cycle
Utilization: how many resources busy per cycle
Can increase one without the other
Can increase one while decreasing the other
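To make the distinction concrete, here is a toy arithmetic sketch (all numbers are assumed for illustration): an 8-wide machine whose slots are all kept busy by predicated instructions, half of which are later squashed.

```python
# Assumed numbers: an 8-wide machine, all slots busy every cycle, half of
# the predicated instructions turn out false and are squashed.
WIDTH = 8
slots_busy_per_cycle = 8       # every slot occupied -> utilization is maximal
squashed_per_cycle = 4         # predicated-false work that is thrown away

utilization = slots_busy_per_cycle / WIDTH               # 1.0 (100% busy)
throughput = slots_busy_per_cycle - squashed_per_cycle   # 4 useful instrs/cycle

print(utilization, throughput)  # 1.0 4
```

Utilization is 100% while useful throughput is only half the machine width: resources busy and instructions completed are different quantities.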
Scalar Execution
Superscalar Execution
Predication
Generally increases utilization, but increases throughput much less:
much of the extra utilization is work that is later squashed
Blocked Multithreading
May increase utilization and throughput, but must switch threads when the
current thread enters a low-utilization/low-throughput phase (e.g. an L2 cache miss)
Fine Grained Multithreading
Simultaneous Multithreading
Blocked Multithreading
Critical decision: when to switch threads
Answer: when the current thread's utilization/throughput is about to drop
Primary example: L2 cache miss
Requirements for throughput:
Thread-switch + pipe-fill time << blocking latency
Would like to get some useful work done before the blocked thread returns
Fast thread switch: multiple register banks
Fast pipe fill: short pipe
Examples
Macro-dataflow machine
MIT Alewife
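The switch-on-miss requirement above can be sketched as a toy cycle-level model. All parameters (compute burst length, miss latency, switch cost) are illustrative assumptions, not taken from any real machine.

```python
# Toy model of blocked (switch-on-L2-miss) multithreading.
COMPUTE  = 20    # useful cycles a thread runs between L2 misses (assumed)
MISS_LAT = 100   # L2 miss latency in cycles (assumed)
SWITCH   = 4     # thread-switch + pipe-refill penalty in cycles (assumed)

def throughput(n_threads: int, horizon: int = 50_000) -> float:
    """Fraction of cycles spent on useful work (busy cycles / total cycles)."""
    ready_at = [0] * n_threads           # when each thread's miss resolves
    left     = [COMPUTE] * n_threads     # cycles left in the current burst
    cur, t, busy = 0, 0, 0
    while t < horizon:
        if ready_at[cur] <= t:           # current thread can run
            left[cur] -= 1
            busy += 1
            t += 1
            if left[cur] == 0:           # burst ends in an L2 miss
                ready_at[cur] = t + MISS_LAT
                left[cur] = COMPUTE      # work to do once the miss returns
                cur = (cur + 1) % n_threads
                t += SWITCH              # pay the switch + pipe-fill cost
        else:                            # current thread is blocked
            nxt = min(range(n_threads), key=ready_at.__getitem__)
            if nxt != cur:
                cur = nxt
                t += SWITCH
            t = max(t, ready_at[cur])    # idle until a thread is ready
    return busy / t
```

With one thread the model settles at COMPUTE / (COMPUTE + MISS_LAT) ≈ 0.17; with enough threads the miss latency is fully hidden and throughput approaches the COMPUTE / (COMPUTE + SWITCH) ≈ 0.83 bound, which is exactly why switch + pipe-fill time must be much smaller than the blocking latency.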
Interleaved Multithreading
Critical decision: none?
Requirements for throughput:
Enough threads to eliminate intra-thread hazards
Increasing the number of threads reduces single-thread performance
Examples:
Denelcor HEP: 8 threads (latencies were shorter then)
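The "enough threads" requirement can be illustrated with a toy issue model. The 4-cycle result latency and the assumption that every instruction depends on its predecessor in the same thread are made up for illustration.

```python
# Toy model of fine-grain (cycle-by-cycle) interleaving with round-robin
# fetch. Assumption: each instruction depends on the previous one in its
# own thread, and a result takes LATENCY cycles to become usable.
LATENCY = 4  # result latency in cycles (illustrative)

def cycles_for(n_threads: int, instrs_per_thread: int) -> int:
    """Cycles to issue all instructions, one issue slot per cycle."""
    last_issue = [-LATENCY] * n_threads       # when each thread last issued
    remaining = [instrs_per_thread] * n_threads
    t, cur = 0, 0
    while any(remaining):
        # A thread may issue only if its previous result is ready.
        if remaining[cur] and t - last_issue[cur] >= LATENCY:
            last_issue[cur] = t
            remaining[cur] -= 1
        t += 1                                # the slot passes either way
        cur = (cur + 1) % n_threads           # round-robin to the next thread
    return t
```

With 1 thread the model issues one instruction every 4 cycles; with 4 threads (= the latency) every slot issues; with 8 threads machine throughput stays at one instruction per cycle, but each thread only gets a slot every 8 cycles and takes roughly twice as long, illustrating the single-thread penalty.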
Simultaneous Multi-threading
Critical decision: fetch-interleaving policy
Requirements for throughput:
Enough threads to utilize resources
Notice: many fewer threads than are needed to cover dependence latencies
Examples:
Compaq Alpha EV8
SMT Case Study: EV8
8-issue OOO processor
SMT Support
Multiple sequencers (PC): 4
More physical registers
Thread tags on all sequential resources: ROB, LSQ, etc.
Process tags on all address-space resources: caches, TLBs, etc.
Notice: none of these things are in the core
SMT Performance Study (U. Wash.)
Execution resources
IQ size - 2x32
EU - 3 FP, 4 Int/Mem, 2 Int
Fetch/rename/retire bandwidth - 8 instructions
Speedup of 4 threads - 2.1
Experiments
Fetch 4 instructions each from 2 threads
Fetch priority to the thread with the fewest instructions in decoder/renamer/IQ
Influence of MT on branch prediction and caches
Where are the bottlenecks?
Performance Scalability
[Bar charts: relative performance for 1-4 threads (0%-300%) on multiprogrammed workloads (SpecInt, SpecFP, mixed Int/FP; Turb3d, Swm256, Tomcatv) and on decomposed SPEC95 applications (Barnes, Chess, Sort, TP).]
Fetch Interleaving on SMT
What if one thread gets “stuck”?
Round-robin: the stuck thread eventually fills up the machine (not good)
ICOUNT: thread with fewest instructions in pipe has priority
Translation: thread doesn’t get to fetch until it gets “unstuck”
Variation: what if one thread is spinning?
Not really stuck, so it gets to keep fetching (wasting slots on the spin loop)
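The ICOUNT heuristic itself is nearly a one-liner. A minimal sketch follows; the thread ids and in-flight counts are made up for illustration.

```python
# Sketch of the ICOUNT fetch policy (Tullsen et al., ISCA 1996): each cycle,
# the fetch slot goes to the thread with the fewest instructions in the
# front end (decoder/renamer/issue queue).
def icount_pick(in_flight: dict[int, int]) -> int:
    """Return the thread id that wins the fetch slot this cycle."""
    # Fewest in-flight instructions wins; ties broken by lowest thread id.
    return min(in_flight, key=lambda t: (in_flight[t], t))

# A "stuck" thread keeps its instructions queued, so its count stays high
# and it stops winning fetch slots until it drains (gets "unstuck").
counts = {0: 28, 1: 6, 2: 11, 3: 9}   # hypothetical per-thread counts
print(icount_pick(counts))            # 1
```

Note that the policy never explicitly detects the stuck thread; the occupancy count does it implicitly, which is what makes ICOUNT cheap to implement.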
Improving Performance on MT Apps
Shared memory apps:
Communicate through caches
Summary
Multithreaded Software
Multithreaded Architecture
Advantageous cost/throughput
Blocked MT
Good single thread performance
Good throughput
Needs fast thread switch and short pipe
Interleaved MT
Bad single thread performance
Good throughput
Needs many threads
Simultaneous MT
Good throughput
Good single thread performance
Good utilization