podc00_architecture.ppt 228KB Jun 23 2011 12:32:04 PM

(1)

How Computer Architecture Trends May Affect Future Distributed Systems

Mark D. Hill

Computer Sciences Department

University of Wisconsin--Madison

(2)

Three Questions

• What is a System Area Network (SAN) and how will it affect clusters?

– E.g., InfiniBand

• How fat will multiprocessor servers be and how do we build larger ones?

– E.g. Wisconsin Multifacet’s Multicast & Timestamp Snooping

• Future of multiprocessor servers & clusters?

(3)

Outline

• Motivation

• System Area Networks

(4)

Technology Push: Moore’s Law

• What do the following intervals have in common?

– Prehistory to 2000

– 2001 to 2002

• Answer: Equal progress in absolute processor speed (and more doubling 2003-4, 2005-6, etc.)

– Consider salary doubling

• Corollary: Cost halves every two years
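The "equal progress" answer is plain arithmetic once speed is assumed to double every two years. A minimal sketch, assuming speed is normalized to 1.0 at the end of 2000 (the function name and normalization are illustrative, not from the talk):

```python
def speed(year, base_year=2000, doubling_period=2.0):
    """Relative processor speed, normalized to 1.0 at the end of base_year."""
    return 2.0 ** ((year - base_year) / doubling_period)

# Absolute progress from prehistory (speed ~ 0) through 2000.
gain_before = speed(2000) - 0.0
# Absolute progress made during 2001-2002 and during 2003-2004.
gain_2001_2002 = speed(2002) - speed(2000)
gain_2003_2004 = speed(2004) - speed(2002)

print(gain_before, gain_2001_2002, gain_2003_2004)
```

Under this assumption the first two gains are equal, and each later two-year interval doubles the gain again.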

(5)

Application Pull

• Should use computers in currently wasteful ways

– Already computers in electric razors & greeting cards

• New business models

– B2C, B2B, C2B, C2C – Mass customization

• More proactive (beyond interactive) [Tennenhouse]

– Today: P2C where P==Person & C==Computer

– More C2P: mattress adjusts to save your back

– More C2C: Agents surf the web for optimal deal

– More sensors (physical/logic worlds coupled)

– More hidden computers (cf. electric motors)

(6)

The Internet Iceberg

• Internet Components

– Clients -- mobile, wireless

– “On Ramp” -- LANs/DSL/Cable Modems

– WAN Backbone -- IPv6, massive BW

– and ...

• SERVICES

(7)

Outline

• Motivation

• System Area Networks

– What is a SAN?

– InfiniBand

– Virtualizing I/O with Queue Pairs

– Predictions

(8)

Regarding Storage/Bandwidth

• Currently resides on I/O Bus (PCI)

– HW & SW protocol stacks

– Must add hosts to add storage/bandwidth

[Figure labels: proc, proc, memory interconnect, bridge, i/o bus, i/o slot 0 ... i/o slot n-1]

(9)

Want System Area Network (SAN)

• SAN vs. Local Area Network (LAN)

– Higher bandwidth (10 Gbps)

– Lower latency (few microseconds or less)

– More limited size

– Other (e.g., single administrative domain, short distance)

– Examples: Tandem Servernet & Myricom Myrinet

• Emerging Standard: InfiniBand

– www.infinibandTA.org w/ spec 1.0 Summer 2000

– Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, & others

(10)

InfiniBand Model (from website)

[Figure labels: HCA (host channel adapter), switch, link, TCA, target (disks)]

(11)

InfiniBand Advantages

• Storage/Network made orthogonal from Computation

• Reduce “hardware” stack -- no i/o bridge

• Reduce “software” stack; hardware support for

– Connected Reliable

– Connected Unreliable

– Datagram

– Reliable Datagram

– Raw Datagram

(12)

Virtualizing InfiniBand

• I/O traditionally virtualized with system call

– System enforces isolation

– System permits authorized sharing

• Memory virtualized

– System trap/call for setup

– Virtual memory hardware for common-case translation

• InfiniBand exploits “queue pairs” (QPs) in memory

– Cf. Intel Virtual Interface Architecture (VIA) [IEEE Micro, Mar/Apr ‘98]

(13)

Queue Pair

[Figure labels: proc, HCA, main memory]

• QP setup system call

– Connect with process

– Connect with remote QP (not shown here)

• QP placed in “pinned” virtual memory

• User directly accesses QP

– E.g., sends, receives & remote DMA reads/writes
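The split between a one-time setup call and direct user access can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyKernel`, `QueuePair`, `post_send`, and `hca_poll` are hypothetical names, not the InfiniBand API, and pinning, protection, and remote DMA are not modeled.

```python
from collections import deque

class QueuePair:
    """Toy queue pair: a send queue and a receive queue living in memory."""
    def __init__(self, owner_pid):
        self.owner_pid = owner_pid
        self.send_queue = deque()
        self.recv_queue = deque()
        self.remote = None  # set when connected to a remote QP

class ToyKernel:
    """Stands in for the OS: only QP setup goes through it."""
    def __init__(self):
        self.syscalls = 0

    def create_qp(self, pid):
        self.syscalls += 1          # setup: connect QP with a process
        return QueuePair(pid)

    def connect(self, qp_a, qp_b):
        self.syscalls += 1          # setup: connect QP with a remote QP
        qp_a.remote, qp_b.remote = qp_b, qp_a

def post_send(qp, payload):
    """User-level operation: enqueue a send request, no system call."""
    qp.send_queue.append(payload)

def hca_poll(qp):
    """Toy HCA: drain the send queue into the remote QP's receive queue."""
    while qp.send_queue:
        qp.remote.recv_queue.append(qp.send_queue.popleft())

kernel = ToyKernel()
a = kernel.create_qp(pid=1)
b = kernel.create_qp(pid=2)
kernel.connect(a, b)
setup_calls = kernel.syscalls

for msg in ("hello", "world"):
    post_send(a, msg)               # common case bypasses the kernel
hca_poll(a)
received = list(b.recv_queue)
```

In this sketch the system-call count stops growing after setup, which is the point of placing the queues in user-accessible memory.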

(14)

InfiniBand, cont.

• Roadmap

– NGIO/FIO merger in ‘99

– Spec in ‘00

– Products in ‘03-’10

• My Assessment

– PCI needs successor

– InfiniBand has the necessary features (but also many others)

– InfiniBand has considerable industry buy-in (but it is recent)

– Gigabit Ethernet will be only competitor

• Good name with backing from Cisco et al.

• But TCP/IP is a killer

(15)

InfiniBand Research Issues

• Software Wide Open

– Industry will do local optimization (e.g., still have device driver virtualized with system calls)

– But what is the “right” way to do software?

– Is there a theoretical model for this software?

• Other SAN Issues

– A theoretical model of a service provider’s site?

– How to trade performance and availability?

(16)

Outline

• Motivation

• Designing Multiprocessor Servers

– How Fat?

– Coherence for Servers

– E.g., Multicast Snooping

– E.g., Timestamp Snooping

(17)

How Fat Should Servers Be?

• Use

– PCs -- cheap but small

– Workgroup servers -- medium cost; medium size

– Large servers -- premium cost & size

• One answer: “yes”

[Figure labels: PCs w/ “soft” state; Servers running databases for]

(18)

How Do We Build the Big Servers?

• (Industry knows how to build the small ones)

• A key problem is the memory system

– Memory Wall: E.g., 100ns memory access = 400 instruction opportunities for 4-way 1GHz processor

• Use per-processor caches to reduce

– Effective Latency

– Effective Bandwidth Used
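The memory-wall figure is a direct product of latency, clock rate, and issue width. A small worked example follows; the cache hit rate and hit latency in the second part are hypothetical values chosen only to illustrate how caches lower effective latency.

```python
# Numbers from the slide.
memory_latency_ns = 100
clock_ghz = 1.0          # 1 GHz = 1 cycle per nanosecond
issue_width = 4          # 4-way processor

cycles_stalled = memory_latency_ns * clock_ghz
instruction_opportunities = cycles_stalled * issue_width

def effective_latency_ns(hit_rate, hit_ns, miss_ns):
    """Average access time seen by the processor with a cache in front of memory."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

# Hypothetical cache: 95% hits at 2 ns, misses pay the full 100 ns.
with_cache = effective_latency_ns(hit_rate=0.95, hit_ns=2, miss_ns=memory_latency_ns)

print(instruction_opportunities, with_cache)
```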

(19)

Coherence 101

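[Figure: processors P0, P1, ..., Pn-1, each with a cache, connected to memory by an interconnection network. Memory holds 100 : 4. Operations shown: r0<-m[100], r1<-m[100], m[100]<-5, r2<-m[100], r3<-m[100]. Annotations: “4”, X 5, “?”]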
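The problem in the figure can be reproduced with a toy model of private caches. This is a sketch under assumptions: which processor issues each operation is not recoverable from the slide, so the assignment below (P0 writes, P1 reads stale data) is illustrative, and the invalidate-on-write fix is one simple policy rather than the talk's protocol.

```python
memory = {100: 4}
caches = {"P0": {}, "P1": {}}

def load(proc, addr):
    """Read through the processor's private cache, filling it on a miss."""
    cache = caches[proc]
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def store_no_coherence(proc, addr, value):
    """Update the writer's cache and memory, but leave other caches alone."""
    caches[proc][addr] = value
    memory[addr] = value

def store_with_invalidate(proc, addr, value):
    """Same write, but first invalidate every other cached copy."""
    for other, cache in caches.items():
        if other != proc:
            cache.pop(addr, None)
    caches[proc][addr] = value
    memory[addr] = value

r0 = load("P0", 100)              # both processors cache the value 4
r1 = load("P1", 100)
store_no_coherence("P0", 100, 5)  # m[100] <- 5
r2 = load("P0", 100)              # the writer sees 5
r3 = load("P1", 100)              # P1 still sees its stale copy
store_with_invalidate("P0", 100, 6)
r4 = load("P1", 100)              # after invalidation P1 refetches
```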

(20)

Broadcast Snooping

[Figure: Mem, P0, P1, and P2 attached to a Data Network and an Ordered Address Network. P1:GETX and P2:GETX appear on the address network and at each node; data moves on the data network.]
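A minimal sketch of why an ordered address network suffices for snooping: if every node observes the same sequence of requests, every node reaches the same conclusion about ownership without acknowledgments. The class names and the `owner_view` bookkeeping are assumptions for illustration.

```python
class Node:
    """A processor or memory controller that snoops every address transaction."""
    def __init__(self, name):
        self.name = name
        self.owner_view = {}   # block -> node this snooper believes owns it
        self.log = []

    def snoop(self, requester, op, block):
        self.log.append((requester, op, block))
        if op == "GETX":       # request for exclusive ownership
            self.owner_view[block] = requester

class OrderedAddressNetwork:
    """Delivers each broadcast to all nodes before starting the next one."""
    def __init__(self, nodes):
        self.nodes = nodes

    def broadcast(self, requester, op, block):
        for node in self.nodes:
            node.snoop(requester, op, block)

nodes = [Node(n) for n in ("Mem", "P0", "P1", "P2")]
net = OrderedAddressNetwork(nodes)
net.broadcast("P1", "GETX", 100)
net.broadcast("P2", "GETX", 100)

views = {n.name: n.owner_view[100] for n in nodes}
```

Here every node ends up agreeing that P2 owns block 100 because all of them saw P1's request before P2's.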

(21)

Broadcast Snooping

• Symmetric Multiprocessor (SMP)

– Most commercially-successful parallel computer architecture

– Performs well by finding data directly

– Scales poorly

• Improvements, e.g., Sun E10000

– Split address & data transactions

– Split address & data network (e.g., bus & crossbar)

– Multiple address buses (e.g., four multiplexed by address)

– Address bus is broadcast tree (not shared wires)

• But…

(22)

Directories

[Figure: Dir/Mem, P0, P1, and P2 attached to a Data Network and an Address Network. P1:GETX and P2:GETX requests and a data reply are shown.]

(23)

Directories

• Directory Based Cache Coherence

– E.g., SGI/Cray Origin2000

– Allows arbitrary point-to-point interconnection network

– Scales up well

• But

– Cache-to-cache transfers common in demanding apps (55-62% sharing misses for OLTP [Barroso ISCA ‘98])

– Many applications can’t use 100s of processors
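The cost of cache-to-cache transfers under a directory can be seen by counting point-to-point messages. The sketch below is a simplified model, not the Origin2000 protocol: it tracks only an exclusive owner per block, and the message names are hypothetical.

```python
class Directory:
    """Toy directory at memory: records the exclusive owner of each block."""
    def __init__(self):
        self.owner = {}   # block -> processor holding it exclusively

    def getx(self, requester, block):
        """Return the point-to-point messages needed to satisfy a GETX."""
        messages = [(requester, "Dir", "GETX")]
        previous = self.owner.get(block)
        if previous is None or previous == requester:
            # Memory has the data: reply directly.
            messages.append(("Dir", requester, "data"))
        else:
            # Indirection: the directory forwards to the previous owner,
            # which then sends the data cache-to-cache.
            messages.append(("Dir", previous, "forward GETX"))
            messages.append((previous, requester, "data"))
        self.owner[block] = requester
        return messages

directory = Directory()
first = directory.getx("P1", 100)    # served from memory
second = directory.getx("P2", 100)   # served by P1 through the directory
```

In this model a miss served by another cache takes one more message than a miss served by memory, which matters when sharing misses dominate.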

(24)

Wisconsin Multifacet: Big Picture

• Build Servers For Internet economy

– Moderate multiprocessor sizes: 2-8 then 16-64, but not 1K

– Optimize for these workloads (e.g. cache-to-cache transfers)

• Key Tool: Multiprocessor Prediction & Speculation

– Make a guess... verify it later

– Uniprocessor predecessors: branch & set predictors

– Recent multiprocessor work: [Mukherjee/Hill ISCA98], [Kaxiras/Goodman HPCA99] & [Lai/Falsafi ISCA99]

– Multicast Snooping

(25)

Comparison of Coherence Methods

Coherence Attribute             Snooping   Directories   Multicast Snooping
Find previous owner directly?   Yes        Sometimes     Usually (good)
Always broadcast?               Yes        No            No (good)
Ordering w/o acks?              Yes        No            Yes (good)
Stateless at memory?            Yes        No            No but simpler
Ordered network?                Yes        No            Yes, a challenge

(26)

Multicast Snooping

• On cache miss

– Predict "multicast mask" (e.g., bit vector of processors)

– Issue transaction on multicast address network

• Networks

– Address network that totally-orders address multicasts

– Separate point-to-point data network

• Processors snoop all incoming transactions

– If it's your own, it "occurs" now

– If another's, then invalidate and/or respond

• Simplified directory (at memory)
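The snooping rules above can be sketched as a small simulation. This is an illustration under assumptions: the two-state cache ("M" for modified, "S" for shared), the event strings, and the `multicast` helper are hypothetical, and the simplified directory at memory is not modeled.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.cache = {}    # block -> "M" (modified) or "S" (shared)
        self.events = []

    def snoop(self, transaction):
        requester, op, block = transaction
        if requester == self.pid:
            # Our own transaction "occurs" when we see it in the total order.
            self.events.append("own %s occurs" % op)
            self.cache[block] = "M" if op == "GETX" else "S"
        elif block in self.cache:
            if self.cache[block] == "M":
                self.events.append("respond with data")
            if op == "GETX":
                del self.cache[block]
                self.events.append("invalidate")
            else:
                self.cache[block] = "S"

def multicast(transaction, mask, processors):
    """Deliver one transaction only to the processors named in the mask."""
    for pid in sorted(mask):
        processors[pid].snoop(transaction)

procs = {pid: Processor(pid) for pid in range(4)}
procs[0].cache[100] = "M"                   # P0 currently owns block 100
multicast((2, "GETX", 100), {0, 2}, procs)  # P2 predicts mask {P0, P2}
```

Processors outside the mask (P1 and P3 here) are never disturbed, which is where the bandwidth saving over broadcast comes from.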

(27)

Predicting Masks

• Performed at Requesting Processor

– Include owner (GETS/GETX) & all sharers (GETX only)

– Exclude most other processors

• Techniques

– Many straightforward cases (e.g., stack, code, space-sharing)

– Many options (network load, PC, software, local/global)

[Figure labels: Mask Predictor, block address, feedback]
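One very simple predictor consistent with the figure (block address in, feedback in, mask out) is a table that remembers the last correct mask per block. This is a hypothetical sketch, not the predictor evaluated in the ISCA '99 work; `LastMaskPredictor` and `mask_is_sufficient` are illustrative names, and the sufficiency rule encodes the "include owner and, for GETX, all sharers" requirement above.

```python
class LastMaskPredictor:
    """Predict the mask that was correct the last time this block missed."""
    def __init__(self, self_id):
        self.self_id = self_id
        self.table = {}    # block address -> set of processor ids

    def predict(self, block):
        # Always include the requester so it sees its own transaction.
        return {self.self_id} | self.table.get(block, set())

    def feedback(self, block, correct_mask):
        self.table[block] = set(correct_mask)

def mask_is_sufficient(op, mask, owner, sharers):
    """True if the mask reaches the owner and, for GETX, every sharer."""
    required = set()
    if owner is not None:
        required.add(owner)
    if op == "GETX":
        required |= set(sharers)
    return required <= set(mask)

predictor = LastMaskPredictor(self_id=2)
first_mask = predictor.predict(100)
first_ok = mask_is_sufficient("GETX", first_mask, owner=0, sharers={1})

predictor.feedback(100, {0, 1})      # learn who actually held the block
second_mask = predictor.predict(100)
second_ok = mask_is_sufficient("GETX", second_mask, owner=0, sharers={1})
```

The first prediction is too small and would have to be retried; after feedback the second prediction covers the owner and the sharer.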

(28)

Implementing an Ordered Multicast Network

• Address Network

– Must create the illusion of total order of multicasts

– May deliver a multicast to destinations at different times

• Wish List

– High throughput for multicasts

– No centralized bottlenecks

– Low latency and cost (~ pipelined broadcast tree)

– ...

• Sample Solutions

– Isotach Networks [Reynolds et al., IEEE TPDS 4/97]

– Indirect Fat Tree [ISCA ‘99]

(29)

Indirect Fat Tree [ISCA ‘99]

(30)

Indirect Fat Tree, cont.

• Basic Idea

– Processors send transactions up to roots

– Roots send transactions down with logical timestamp

– Switches stall transactions to keep in order

– Null transaction sent to avoid deadlock

• Assessment

– Viable & high cross-section bandwidth

– Many "backplane" ASICs means higher cost

– Often stalls transactions

• Want

– Lower cost of direct connections

(31)

Direct 2-D Torus (work in progress)

• Features

– Each processor is switch

– Switches directly connected

– E.g., network of Compaq 21364

• Network order?

– Broadcasts unordered

– Snooping needs total order

• Solution

– Create order with logical timestamps instead of network delivery order

– Called Timestamp Snooping [ASPLOS ‘00]

(32)

Timestamp Snooping

• Timestamp Snooping

– Snooping with order determined by logical timestamps

– Broadcast (not multicast) in ASPLOS ‘00

• Basic Idea

– Assign timestamp to coherence transactions at sender

– Broadcast transactions over unordered network ASAP

– Transactions carry timestamp (2 bits)
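The basic idea can be sketched as follows: receivers buffer transactions that arrive in arbitrary order and process them in timestamp order once the network guarantees that nothing older can still arrive. This is a simplified illustration, not the ASPLOS '00 design: it uses unbounded integer timestamps rather than the 2-bit encoding, breaks ties by sender name, and the `guaranteed_time` parameter is a hypothetical stand-in for the network's guarantee.

```python
import heapq
import random

class Receiver:
    def __init__(self):
        self.pending = []      # min-heap of (timestamp, sender, op)
        self.processed = []

    def deliver(self, transaction):
        """Called by the unordered network, in any arrival order."""
        heapq.heappush(self.pending, transaction)

    def advance(self, guaranteed_time):
        """Process every buffered transaction older than guaranteed_time."""
        while self.pending and self.pending[0][0] < guaranteed_time:
            self.processed.append(heapq.heappop(self.pending))

transactions = [
    (1, "P1", "GETX"),
    (1, "P2", "GETX"),
    (2, "P0", "GETS"),
    (3, "P2", "GETX"),
]

rng = random.Random(0)
receivers = [Receiver() for _ in range(3)]
for receiver in receivers:
    arrival_order = transactions[:]
    rng.shuffle(arrival_order)         # each receiver sees a different order
    for transaction in arrival_order:
        receiver.deliver(transaction)
    receiver.advance(guaranteed_time=4)
```

Every receiver processes the same sequence regardless of arrival order, which recreates the total order that snooping needs.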

(33)

Timestamp Snooping Issues

• More address bandwidth

– For 16 processors, 4-ary butterfly, 64-byte blocks

– Directory: 3*8 + 3*72 + more = 240 + more

– Timestamp Snooping: 21*8 + 3*72 = 384 (< 60% more)

• Network must guarantee timestamps

– Assert future transactions will have greater timestamps (so processor can process older transactions)

– Isotach [Reynolds IEEE TPDS 4/97] more aggressively

• Other
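The bandwidth comparison above can be checked directly. The slide does not state units, so the variable names below are neutral; reading 8 and 72 as message sizes and 3 and 21 as link counts is an assumption.

```python
# Terms exactly as given on the slide.
directory_cost = 3 * 8 + 3 * 72          # the slide adds "+ more" on top
timestamp_snooping_cost = 21 * 8 + 3 * 72

extra_fraction = timestamp_snooping_cost / directory_cost - 1.0

print(directory_cost, timestamp_snooping_cost, extra_fraction)
```

The ratio is 60% when the directory's unspecified "+ more" term is ignored, so the true overhead is somewhat below 60%, matching the "< 60% more" on the slide.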

(34)

Initial Multifacet Results

• Multicast Snooping [ISCA ‘99]

– Ordered multicast of coherence transactions

– Find data directly from memory or caches

– Reduce bandwidth to permit some scaling

– 32-processor results show 2-6 destinations per multicast

• Timestamp Snooping [ASPLOS ‘00]

– Broadcast snooping with “order” determined by logical timestamps carried by coherence transactions

– No bus: Allows arbitrary memory interconnects

– No directory or directory indirection

(35)

Selected Issues

• Multicast Snooping

– What program property are mask predictors exploiting?

– Why is there no good model of locality or the “90-10” rule in general?

– How does one build multicast networks?

– What about fault tolerance?

• Timestamp Snooping

– What is an optimal network topology?

– What about buffering, deadlock, etc.?

(36)

Outline

• Motivation

• Designing Multiprocessor Servers

• Server & Cluster Trends

– Out-of-box and highly-available servers

(37)

Multiprocessor Servers

• High-Performance Communication “within box”

– SMPs (e.g., Intel PentiumPro Quads)

– Directory-based (SGI Origin2000)

• Trend toward hierarchical “out of box” solutions

– Build bigger servers from smaller ones

– Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)

[Figure label: SMP]

(38)

Multiprocessor Servers, cont.

• Traditionally had poor error isolation

– Double-bit ECC error crashes everything

– Kernel error crashes everything

– Poor match for highly available Internet infrastructure

• Improve error isolation

(39)

Clusters

• Traditionally

– Good error isolation

– Poor communication performance (especially latency)

– LANs are not optimized for clusters

• Enter Early SANs

– Berkeley NOW w/ Myricom Myrinet

– IBM SP w/ proprietary network

(40)

A Prediction

• Blurring of cluster & server boundaries

• Clusters

– High communication performance

• Servers

– Better error isolation

– Multi-box solutions

• Use same hardware & configure in the field

• Issues

– How do we model these hybrids?

(41)

Three Questions

• What is a System Area Network (SAN) and how will it affect clusters?

– E.g., InfiniBand

– Make computation, storage, & network orthogonal

• How fat will multiprocessor servers be and how do we build larger ones?

– Varying sizes for soft & hard state

– E.g., Multicast Snooping & Timestamp Snooping

• Future of multiprocessor servers & clusters?