Advanced Computer Architecture – fall 2003, Technion

Multithreaded Architectures
Dr. Avi Mendelson
Overview
Multithreaded Architecture
Multithreaded Micro-Architecture
References
J. Silc, B. Robic, T. Ungerer. "Asynchrony in Parallel Computing: From Dataflow to Multithreading." Parallel and Distributed Computing Practices, Vol. 1, No. 1, March 1998. http://www-csd.ijs.si/silc/pdcp.html
R.S. Nikhil, G.M. Papadopoulos, Arvind. "*T: A Multithreaded Massively Parallel Architecture." Proc. 19th Annual International Symposium on Computer Architecture (ISCA), pp. 156-167, 1992.
A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, M. Parkin. "Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors." IEEE Micro, Vol. 13, No. 3, pp. 48-61, June 1993.
R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. "The Tera Computer System." Proc. 1990 International Conference on Supercomputing, pp. 1-6, June 1990.
H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, T. Nishizawa. "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads." Proc. 19th ISCA, pp. 136-145, 1992.
G.S. Sohi, S.E. Breach, T.N. Vijaykumar. "Multiscalar Processors." Proc. 22nd ISCA, 1995.
M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, W.S. Lee. "The M-Machine Multicomputer." Proc. 28th Annual International Symposium on Microarchitecture, pp. 146-156, 1995.
D.M. Tullsen, S.J. Eggers, H.M. Levy. "Simultaneous Multithreading: Maximizing On-Chip Parallelism." Proc. 22nd ISCA, 1995.
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor." Proc. 23rd ISCA, 1996.
J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen. "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997.
S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, D.M. Tullsen. "Simultaneous Multithreading: A Platform for Next-Generation Processors." IEEE Micro, October 1997.
J. Emer. "Simultaneous Multithreading: Multiplying Alpha Performance." Microprocessor Forum, 1999. http://www.alphapowered.com/simu-multi-thread.ppt
H. Akkary (Intel), M. Driscoll (Portland State Univ.). "A Dynamic Multithreading Processor." Proc. 31st Annual International Symposium on Microarchitecture (MICRO-31), Nov. 1998.
A. Roth, G.S. Sohi. "Speculative Data-Driven Multithreading." Submitted to ASPLOS 2000.
M. Slater. "MicroUnity Lifts Veil on MediaProcessor: New Architecture Designed for Broadband Communications." Microprocessor Report, 10/23/95. http://www.mpronline.com/mpr/h/19951023/091402.html
Goals of Multithreaded Architecture
Successful MTA must have:
Minimal impact on the conventional design
Improved throughput on multiple thread workloads
"Multiple threads" = a multithreaded or multiprogrammed workload
Good cost/throughput
Minimal impact on single-thread performance
Would also like
Performance gain on multithreaded applications
Kinds of Multithreaded Architectures
Two dimensions
Primary: Front-end interleaving
Fine-grain (cycle-by-cycle) vs. coarse-grain (longer intervals)
Secondary: Back-end interleaving
Time multiplexing vs. space multiplexing
Depends on front-end
Three valid combinations
Blocked MT: coarse-grain FE + time-multiplexed BE
Interleaved MT: fine-grain FE + time-multiplexed BE
Simultaneous MT: fine-grain FE + space-multiplexed BE
Throughput vs. Utilization
Not the same thing
Throughput: how many instructions complete per cycle
Utilization: how many resources busy per cycle
Can increase one without the other
Can increase one while decreasing the other
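To make the distinction concrete, here is a toy arithmetic sketch (all numbers are assumed for illustration): an 8-wide machine whose slots are all kept busy by predicated instructions, half of which are later squashed.

```python
# Assumed numbers: an 8-wide machine, all slots busy every cycle, half of
# the predicated instructions turn out false and are squashed.
WIDTH = 8
slots_busy_per_cycle = 8       # every slot occupied -> utilization is maximal
squashed_per_cycle = 4         # predicated-false work that is thrown away

utilization = slots_busy_per_cycle / WIDTH               # 1.0 (100% busy)
throughput = slots_busy_per_cycle - squashed_per_cycle   # 4 useful instrs/cycle

print(utilization, throughput)  # 1.0 4
```

Utilization is 100% while useful throughput is only half the machine width: resources busy and instructions completed are different quantities.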
Scalar Execution
Superscalar Execution
Predication
Generally increases utilization, but increases throughput much less:
much of the extra utilization is work that is later squashed
Blocked Multithreading
May increase utilization and throughput, but must switch threads when the
current thread enters a low-utilization/low-throughput phase (e.g. an L2 cache miss)
Fine Grained Multithreading
Simultaneous Multithreading
Blocked Multithreading
Critical decision: when to switch threads
Answer: when the current thread's utilization/throughput is about to drop
Primary example: L2 cache miss
Requirements for throughput:
Thread-switch + pipe-fill time << blocking latency
Would like to get some useful work done before the blocked thread returns
Fast thread switch: multiple register banks
Fast pipe fill: short pipe
Examples
Macro-dataflow machine
MIT Alewife
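The switch-on-miss requirement above can be sketched as a toy cycle-level model. All parameters (compute burst length, miss latency, switch cost) are illustrative assumptions, not taken from any real machine.

```python
# Toy model of blocked (switch-on-L2-miss) multithreading.
COMPUTE  = 20    # useful cycles a thread runs between L2 misses (assumed)
MISS_LAT = 100   # L2 miss latency in cycles (assumed)
SWITCH   = 4     # thread-switch + pipe-refill penalty in cycles (assumed)

def throughput(n_threads: int, horizon: int = 50_000) -> float:
    """Fraction of cycles spent on useful work (busy cycles / total cycles)."""
    ready_at = [0] * n_threads           # when each thread's miss resolves
    left     = [COMPUTE] * n_threads     # cycles left in the current burst
    cur, t, busy = 0, 0, 0
    while t < horizon:
        if ready_at[cur] <= t:           # current thread can run
            left[cur] -= 1
            busy += 1
            t += 1
            if left[cur] == 0:           # burst ends in an L2 miss
                ready_at[cur] = t + MISS_LAT
                left[cur] = COMPUTE      # work to do once the miss returns
                cur = (cur + 1) % n_threads
                t += SWITCH              # pay the switch + pipe-fill cost
        else:                            # current thread is blocked
            nxt = min(range(n_threads), key=ready_at.__getitem__)
            if nxt != cur:
                cur = nxt
                t += SWITCH
            t = max(t, ready_at[cur])    # idle until a thread is ready
    return busy / t
```

With one thread the model settles at COMPUTE / (COMPUTE + MISS_LAT) ≈ 0.17; with enough threads the miss latency is fully hidden and throughput approaches the COMPUTE / (COMPUTE + SWITCH) ≈ 0.83 bound, which is exactly why switch + pipe-fill time must be much smaller than the blocking latency.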
Interleaved Multithreading
Critical decision: none?
Requirements for throughput:
Enough threads to eliminate intra-thread hazards
Increasing the number of threads reduces single-thread performance
Examples:
Denelcor HEP: 8 threads (latencies were shorter then)
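The "enough threads" requirement can be illustrated with a toy issue model. The 4-cycle result latency and the assumption that every instruction depends on its predecessor in the same thread are made up for illustration.

```python
# Toy model of fine-grain (cycle-by-cycle) interleaving with round-robin
# fetch. Assumption: each instruction depends on the previous one in its
# own thread, and a result takes LATENCY cycles to become usable.
LATENCY = 4  # result latency in cycles (illustrative)

def cycles_for(n_threads: int, instrs_per_thread: int) -> int:
    """Cycles to issue all instructions, one issue slot per cycle."""
    last_issue = [-LATENCY] * n_threads       # when each thread last issued
    remaining = [instrs_per_thread] * n_threads
    t, cur = 0, 0
    while any(remaining):
        # A thread may issue only if its previous result is ready.
        if remaining[cur] and t - last_issue[cur] >= LATENCY:
            last_issue[cur] = t
            remaining[cur] -= 1
        t += 1                                # the slot passes either way
        cur = (cur + 1) % n_threads           # round-robin to the next thread
    return t
```

With 1 thread the model issues one instruction every 4 cycles; with 4 threads (= the latency) every slot issues; with 8 threads machine throughput stays at one instruction per cycle, but each thread only gets a slot every 8 cycles and takes roughly twice as long, illustrating the single-thread penalty.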
Simultaneous Multi-threading
Critical decision: fetch-interleaving policy
Requirements for throughput:
Enough threads to utilize resources
Notice: many fewer threads than are needed to cover dependence latencies
Examples:
Compaq Alpha EV8
SMT Case Study: EV8
8-issue OOO processor
SMT Support
Multiple sequencers (PC): 4
More physical registers
Thread tags on all sequential resources: ROB, LSQ, etc.
Process tags on all address-space resources: caches, TLBs, etc.
Notice: none of these things are in the core
SMT Performance Study (U. Wash.)
Execution resources
IQ size - 2x32
EU - 3 FP, 4 Int/Mem, 2 Int
Fetch/rename/retire bandwidth - 8 instructions
Speedup of 4 threads - 2.1
Experiments
Fetch 4 instructions each from 2 threads
Fetch priority to the thread with the fewest instructions in decoder/renamer/IQ
Influence of MT on branch prediction and caches
Where are the bottlenecks?
Performance Scalability
[Bar charts: relative performance for 1-4 threads (0%-300%) on multiprogrammed workloads (SpecInt, SpecFP, mixed Int/FP; Turb3d, Swm256, Tomcatv) and on decomposed SPEC95 applications (Barnes, Chess, Sort, TP).]
Fetch Interleaving on SMT
What if one thread gets “stuck”?
Round-robin: the stuck thread eventually fills up the machine (not good)
ICOUNT: thread with fewest instructions in pipe has priority
Translation: thread doesn’t get to fetch until it gets “unstuck”
Variation: what if one thread is spinning?
Not really stuck, so it gets to keep fetching (wasting slots on the spin loop)
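The ICOUNT heuristic itself is nearly a one-liner. A minimal sketch follows; the thread ids and in-flight counts are made up for illustration.

```python
# Sketch of the ICOUNT fetch policy (Tullsen et al., ISCA 1996): each cycle,
# the fetch slot goes to the thread with the fewest instructions in the
# front end (decoder/renamer/issue queue).
def icount_pick(in_flight: dict[int, int]) -> int:
    """Return the thread id that wins the fetch slot this cycle."""
    # Fewest in-flight instructions wins; ties broken by lowest thread id.
    return min(in_flight, key=lambda t: (in_flight[t], t))

# A "stuck" thread keeps its instructions queued, so its count stays high
# and it stops winning fetch slots until it drains (gets "unstuck").
counts = {0: 28, 1: 6, 2: 11, 3: 9}   # hypothetical per-thread counts
print(icount_pick(counts))            # 1
```

Note that the policy never explicitly detects the stuck thread; the occupancy count does it implicitly, which is what makes ICOUNT cheap to implement.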
Improving Performance on MT Apps
Shared memory apps:
Communicate through caches
Summary
Multithreaded Software
Multithreaded Architecture
Advantageous cost/throughput
Blocked MT
Good single thread performance
Good throughput
Needs fast thread switch and short pipe
Interleaved MT
Bad single thread performance
Good throughput
Needs many threads
Simultaneous MT
Good throughput
Good single thread performance
Good utilization