Advanced Computer Architecture – fall 2003, Technion

Multithreaded Architectures

Dr. Avi Mendelson

Overview

Multithreaded Architecture

Multithreaded Micro-Architecture

References

“Asynchrony in Parallel Computing: From Dataflow to Multithreading” by Jurij Silc, Borut Robic, and Theo Ungerer, Parallel and Distributed Computing Practices, Vol. 1, No. 1, March 1998. http://www-csd.ijs.si/silc/pdcp.html

R.S. Nikhil, G.M. Papadopoulos and Arvind. *T: A Multithreaded Massively Parallel Architecture. In Proc. 19th Annual International Symposium on Computer Architecture, pp. 156-167, 1992.

A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, M. Parkin. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors. IEEE Micro, vol. 13, no. 3, pp. 48-61, June 1993.

R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. The Tera Computer System. In Proc. 1990 International Conference on Supercomputing, pp. 1-6, June 1990.

H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. 19th ISCA, pp. 136-145, 1992.

G.S. Sohi, S.E. Breach, T.N. Vijaykumar. Multiscalar Processors. 22nd ISCA, 1995.

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, W.S. Lee. The M-Machine Multicomputer. In Proc. 28th Annual International Symposium on Microarchitecture, pp. 146-156, 1995.

“Simultaneous Multithreading: Maximizing On-Chip Parallelism” by Tullsen, Eggers and Levy. ISCA ’95.

D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. 23rd ISCA, 1996.

“Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading” by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.

“Simultaneous Multithreading: A Platform for Next-Generation Processors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October 1997.

“Simultaneous Multithreading: Multiplying Alpha Performance”, Joel Emer, Microprocessor Forum 1999. http://www.alphapowered.com/simu-multi-thread.ppt

“A Dynamic Multithreading Processor” by H. Akkary (Intel), M. Driscoll (Portland State Univ.). MICRO-31, Nov. 1998.

“Speculative Data-Driven Multithreading” by Amir Roth, Guri Sohi. Submitted to ASPLOS ’00.

“MicroUnity Lifts Veil on MediaProcessor: New Architecture Designed for Broadband Communications”, Michael Slater, Microprocessor Report, 10/23/95. http://www.mpronline.com/mpr/h/19951023/091402.html

Goals of Multithreaded Architecture

Successful MTA must have:

Minimal impact on the conventional design

Improved throughput on multiple thread workloads

Multiple thread = multithreaded or multiprogrammed workload

Good cost/throughput

Minimal impact on single-thread performance

Would also like

Performance gain on multithreaded applications

Kinds of Multithreaded Architectures

Two dimensions

Primary: Front-end interleaving

Fine-grain (cycle-by-cycle) vs. coarse-grain (longer intervals)

Secondary: Back-end interleaving

Time multiplexing vs. space multiplexing

Depends on front-end

Three valid combinations

Blocked MT: coarse-grain FE + time-multiplexed BE

Interleaved MT: fine-grain FE + time-multiplexed BE

Simultaneous MT: fine-grain FE + space-multiplexed BE
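
The two dimensions and the three valid combinations can be written down as a small lookup table. This is only an illustrative sketch: the names `FrontEnd`, `BackEnd`, `VALID` and `classify` are invented here and do not come from the slides.

```python
from enum import Enum


class FrontEnd(Enum):
    FINE = "fine-grain"      # switch threads cycle by cycle
    COARSE = "coarse-grain"  # switch threads at longer intervals


class BackEnd(Enum):
    TIME = "time-multiplexed"    # one thread uses the back end per cycle
    SPACE = "space-multiplexed"  # several threads share the back end in a cycle


# The three combinations listed on the slide.
VALID = {
    (FrontEnd.COARSE, BackEnd.TIME): "Blocked MT",
    (FrontEnd.FINE, BackEnd.TIME): "Interleaved MT",
    (FrontEnd.FINE, BackEnd.SPACE): "Simultaneous MT",
}


def classify(front_end, back_end):
    """Return the architecture name, or None for the combination not listed."""
    return VALID.get((front_end, back_end))
```

The fourth pairing (coarse-grain front end with a space-multiplexed back end) returns `None`, which reflects the slide's point that the back-end choice depends on the front end.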

Throughput vs. Utilization

Not the same thing

Throughput: how many instructions complete per cycle

Utilization: how many resources busy per cycle

Can increase one without the other

Can increase one while decreasing the other
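
A minimal sketch of the distinction, assuming a hypothetical 4-wide machine and a made-up per-cycle trace format (`Cycle`, `busy_units` and `retired` are names chosen for this example, not taken from the slides):

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    busy_units: int  # execution units doing any work this cycle
    retired: int     # instructions that completed usefully this cycle


def throughput(trace):
    """Average number of useful instructions completed per cycle."""
    return sum(c.retired for c in trace) / len(trace)


def utilization(trace, total_units):
    """Average fraction of execution units that were busy per cycle."""
    return sum(c.busy_units for c in trace) / (len(trace) * total_units)


# Assumed traces for a 4-wide machine. In the second one every unit is
# busy, but half of the work is discarded, so throughput does not move.
baseline = [Cycle(busy_units=2, retired=2)] * 4
wasteful = [Cycle(busy_units=4, retired=2)] * 4
```

Here `utilization` doubles from the first trace to the second while `throughput` stays the same, which is the "increase one without the other" case.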

Scalar Execution

Superscalar Execution

Predication

Generally increases utilization, increases throughput less (much of the utilization is thrown away)

Blocked Multithreading

May increase utilization and throughput, but must switch when the current thread enters a low utilization/throughput section (e.g. L2 cache miss)

Fine Grained Multithreading

Simultaneous Multithreading

Blocked Multithreading

Critical decision: when to switch threads

Answer: when the current thread’s utilization/throughput is about to drop

Primary example: L2 cache miss

Requirements for throughput:

Thread-switch + pipe-fill time << blocking latency

Would like to get some work done before other thread comes back

Fast thread-switch: multiple register banks

Fast pipe-fill: short pipe

Examples

Macro-dataflow machine

MIT Alewife
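
The throughput requirement above (thread-switch plus pipe-fill time much smaller than the blocking latency) can be checked with a back-of-the-envelope model. The sketch below is a simplified analytical model assumed for illustration, not something stated on the slides; the function name and all parameter values are hypothetical.

```python
def blocked_mt_efficiency(run, latency, switch, threads):
    """Fraction of cycles spent on useful work (simplified model).

    run     -- cycles a thread executes between long-latency misses
    latency -- blocking latency of one miss, in cycles
    switch  -- thread-switch plus pipe-fill cost, in cycles
    threads -- number of hardware thread contexts
    """
    if threads == 1:
        return run / (run + latency)          # nothing to switch to
    linear = threads * run / (run + latency)  # too few threads: misses still exposed
    saturated = run / (run + switch)          # enough threads: only switch cost remains
    return min(linear, saturated)
```

With `run=20`, `latency=100` and `switch=4`, adding threads raises efficiency until it saturates at `20 / 24`. If `switch` is raised to equal `latency`, extra threads give no gain over a single thread, which is why fast thread switch and a short pipe matter.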

Interleaved Multithreading

Critical decision: none?

Requirements for throughput:

Enough threads to eliminate intra-thread hazards

Increasing the number of threads reduces single-thread performance

Examples:

Denelcor HEP: 8 threads (latencies were shorter then)
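
A toy simulation of the trade-off, under assumptions made up for this sketch: strict round-robin issue of one instruction per cycle, and every instruction depending on the previous instruction of its own thread with a fixed latency `dep_latency`. The function name and parameters are hypothetical.

```python
def interleaved_ipc(threads, dep_latency, cycles=1000):
    """Total instructions issued per cycle under strict round-robin issue."""
    ready_at = [0] * threads  # earliest cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        t = cycle % threads   # fixed cycle-by-cycle rotation
        if cycle >= ready_at[t]:
            issued += 1
            ready_at[t] = cycle + dep_latency
        # otherwise this issue slot is lost as a bubble
    return issued / cycles
```

With `dep_latency=4`, one or two threads leave bubbles, and four threads fill every slot. Going to eight threads adds no total throughput, but each thread now issues only once every eight cycles, which is the single-thread cost the slide points out.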

Simultaneous Multithreading

Critical decision: fetch-interleaving policy

Requirements for throughput:

Enough threads to utilize resources

Note: far fewer than interleaved MT needs to stretch dependences

Examples:

Compaq Alpha EV8

SMT Case Study: EV8

8-issue OOO processor

SMT Support

Multiple sequencers (PC): 4

More physical registers

Thread tags on all sequential resources: ROB, LSQ, etc.

Process tags on all address space resources: caches, TLBs, etc.

Notice: none of these things are in the core
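
To make the idea of thread tags concrete, here is a hedged sketch of a shared reorder buffer whose entries carry a thread tag, so that squashing one thread leaves the others untouched. It does not describe the actual EV8 design; `RobEntry`, its fields and `squash` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    thread: int  # thread tag: which hardware context owns this entry
    seq: int     # position in that thread's program order
    op: str


def squash(rob, thread, after_seq):
    """Drop the entries of one thread that are younger than after_seq.

    Entries belonging to other threads are kept as they are.
    """
    return [e for e in rob if not (e.thread == thread and e.seq > after_seq)]


# Assumed interleaving of two threads in one shared buffer.
rob = [
    RobEntry(0, 1, "add"),
    RobEntry(1, 1, "load"),
    RobEntry(0, 2, "branch"),
    RobEntry(0, 3, "mul"),
    RobEntry(1, 2, "store"),
]
```

Calling `squash(rob, thread=0, after_seq=2)` removes only thread 0's `mul`, as would be needed after thread 0's branch is mispredicted.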

SMT Performance Study (U. Wash.)

Execution resources

IQ size - 2x32

EU - 3 FP, 4 Int/Mem, 2 Int

Fetch/rename/retire bandwidth - 8 instructions

Speedup of 4 threads - 2.1

Experiments

fetch 4 instructions from 2 threads

fetch priority to threads with the fewest instructions in Decoder/Renamer/IQ

Influence of MT on branch prediction and caches

Where are the bottlenecks

Performance Scalability

[Figure: three bar-chart panels of performance (%) by thread count. Benchmark groups: SpecInt, SpecFP, Mixed Int/FP (1T to 4T, axis 0% to 250%); Turb3d, Swm256, Tomcatv (1T to 4T, axis 0% to 250%); Barnes, Chess, Sort, TP (1T, 2T, 4T, axis 0% to 300%). Panel titles: Decomposed SPEC95 Applications; Multiprogrammed Workload.]

Fetch Interleaving on SMT

What if one thread gets “stuck”?

Round-robin: eventually it will fill up the machine (not good)

ICOUNT: thread with fewest instructions in pipe has priority

Translation: thread doesn’t get to fetch until it gets “unstuck”

Variation: what if one thread is spinning?

Not really stuck, gets to keep fetching
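
The difference between the two fetch policies can be sketched with a toy model. Everything here is an assumption made for illustration: two threads, a shared queue of 32 entries, a fetch width of 4, thread 0 permanently stuck and thread 1 draining freely. The function names are invented and this is not the simulator used in the study.

```python
def pick_round_robin(last, in_flight):
    """Next thread in strict rotation, regardless of its state."""
    return (last + 1) % len(in_flight)


def pick_icount(last, in_flight):
    """Thread with the fewest instructions in the front end and queues."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])


def simulate(pick, cycles=200, capacity=32, width=4):
    """Toy model: thread 0 never drains (stuck), thread 1 drains freely."""
    in_flight = [0, 0]
    last = len(in_flight) - 1
    completed = 0
    for _ in range(cycles):
        t = pick(last, in_flight)
        last = t
        free = capacity - sum(in_flight)
        in_flight[t] += min(width, free)    # fetch into the shared queue
        drained = min(width, in_flight[1])  # only thread 1 makes progress
        in_flight[1] -= drained
        completed += drained
    return completed
```

Under `pick_round_robin` the stuck thread keeps receiving fetch slots until it occupies the whole queue, after which thread 1 can no longer fetch. Under `pick_icount` the stuck thread's high instruction count keeps it from being selected, so thread 1 continues to make progress.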

Improving Performance on MT Apps

Shared memory apps:

Communicate through caches

Summary

Multithreaded Software

Multithreaded Architecture

Advantageous cost/throughput

Blocked MT

Good single thread performance

Good throughput

Needs fast thread switch and short pipe

Interleaved MT

Bad single thread performance

Good throughput

Needs many threads

Simultaneous MT

Good throughput

Good single thread performance

Good utilization
