Willia m St a llings Com put e r Orga niza t ion a nd Arc hit e c t ure 7 t h Edit ion
Cha pt e r 4 Ca c he M e m ory
Cha ra c t e rist ic s
• Locat ion
• Capacit y
• Unit of t r ansfer
• Access m et hod
• Per for m ance
• Physical t ype
• Physical char act er ist ics
• Or ganisat ion
Loc a t ion
• CPU
• I nt er nal
• Ext er nal
Ca pa c it y
• Wor d size
—The nat ural unit of organisat ion • Num ber of w or ds
—or Byt es
U nit of T ra nsfe r
• I nt er nal
—Usually governed by dat a bus w idt h • Ext er nal
—Usually a block w hich is m uch larger t han a w ord
• Addr essable unit
—Sm allest locat ion w hich can be uniquely addressed
—Word int ernally —Clust er on M$ disks
Ac c e ss M e t hods (1 )
• Sequent ial
—St art at t he beginning and read t hrough in order
—Access t im e depends on locat ion of dat a and previous locat ion
—e. g. t ape • Dir ect
—I ndividual blocks have unique address —Access is by j um ping t o vicinit y plus
sequent ial search
Ac c e ss M e t hods (2 )
• Random
—I ndividual addresses ident ify locat ions exact ly —Access t im e is independent of locat ion or
previous access —e. g. RAM • Associat ive
—Dat a is locat ed by a com parison w it h cont ent s of a port ion of t he st ore
—Access t im e is independent of locat ion or previous access
—e. g. cache
M e m ory H ie ra rc hy
• Regist er s
—I n CPU
• I nt er nal or Main m em or y
—May include one or m ore levels of cache —“ RAM”
• Ext er nal m em or y
—Backing st ore
M e m ory H ie ra rc hy - Dia gra m Pe rform a nc e
• Access t im e
—Tim e bet w een present ing t he address and get t ing t he valid dat a
• Mem or y Cycle t im e
—Tim e m ay be required for t he m em ory t o “ recover” before next access
—Cycle t im e is access + recovery • Tr ansfer Rat e
—Rat e at w hich dat a can be m oved
Physic a l T ype s
• Sem iconduct or
—RAM
• Magnet ic
—Disk & Tape • Opt ical
—CD & DVD
Physic a l Cha ra c t e rist ic s
• Decay
• Volat ilit y
• Er asable
Orga nisa t ion
• Physical ar r angem ent of bit s int o w or ds
• Not alw ays obvious
• e.g. int er leaved
T he Bot t om Line
• How m uch?
—Capacit y • How fast ?
—Tim e is m oney • How expensive?
H ie ra rc hy List
• Regist er s
• L1 Cache
• L2 Cache
• Main m em or y
• Disk cache
• Disk
• Opt ical
• Tape
So you w a nt fa st ?
• I t is possible t o build a com put er w hich uses only st at ic RAM ( see lat er )
• This w ould be ver y fast
• This w ould need no cache
—How can you cache cache?
• This w ould cost a ver y lar ge am ount
Loc a lit y of Re fe re nc e
• Dur ing t he cour se of t he execut ion of a pr ogr am , m em or y r efer ences t end t o clust er
• e.g. loops
Ca c he
• Sm all am ount of fast m em or y
• Sit s bet w een nor m al m ain m em or y and CPU
Ca c he /M a in M e m ory St ruc t ure Ca c he ope ra t ion – ove rvie w
• CPU r equest s cont ent s of m em or y locat ion
• Check cache for t his dat a
• I f pr esent , get fr om cache ( fast )
• I f not pr esent , r ead r equir ed block fr om m ain m em or y t o cache
• Then deliver fr om cache t o CPU
• Cache includes t ags t o ident ify w hich block of m ain m em or y is in each cache slot
Ca c he Re a d Ope ra t ion - Flow c ha rt Ca c he De sign
• Size
• Mapping Funct ion
• Replacem ent Algor it hm
• Wr it e Policy
• Block Size
• Num ber of Caches
Size doe s m a t t e r
• Cost
—More cache is expensive • Speed
—More cache is fast er ( up t o a point ) —Checking cache for dat a t akes t im e
Com pa rison of Ca c he Size s
Processor Type IntroductionYear of L1 cachea L2 cache L3 cache
IBM 360/85 Mainframe 1968 16 to 32 KB — —
PDP-11/70 Minicomputer 1975 1 KB — —
VAX 11/780 Minicomputer 1978 16 KB — —
IBM 3033 Mainframe 1978 64 KB — —
IBM 3090 Mainframe 1985 128 to 256 KB — —
Intel 80486 PC 1989 8 KB — —
Pentium PC 1993 8 KB/8 KB 256 to 512 KB —
PowerPC 601 PC 1993 32 KB — —
PowerPC 620 PC 1996 32 KB/32 KB — —
PowerPC G4 PC/server 1999 32 KB/32 KB 256 KB to 1 MB 2 MB IBM S/390 G4 Mainframe 1997 32 KB 256 KB 2 MB
IBM S/390 G6 Mainframe 1999 256 KB 8 MB —
Pentium 4 PC/server 2000 8 KB/8 KB 256 KB — IBM SP High-end server/ supercomputer 2000 64 KB/32 KB 8 MB —
CRAY MTAb Supercomputer 2000 8 KB 2 MB —
Itanium PC/server 2001 16 KB/16 KB 96 KB 4 MB SGI Origin 2001 High-end server 2001 32 KB/32 KB 4 MB —
Itanium 2 PC/server 2002 32 KB 256 KB 6 MB
IBM POWER5 High-end server 2003 64 KB 1.9 MB 36 MB CRAY XD-1 Supercomputer 2004 64 KB/64 KB 1MB —
M a pping Func t ion
• Cache of 64kByt e
• Cache block of 4 byt es
—i. e. cache is 16k ( 214) lines of 4 byt es
• 16MByt es m ain m em or y
• 24 bit addr ess
—( 224= 16M)
Dire c t M a pping
• Each block of m ain m em or y m aps t o only one cache line
—i. e. if a block is in cache, it m ust be in one specific place
• Addr ess is in t w o par t s
• Least Significant w bit s ident ify unique w or d
• Most Significant s bit s specify one m em or y block
• The MSBs ar e split int o a cache line field r and a t ag of s- r ( m ost significant )
Dire c t M a pping Addre ss St ruc t ure
Tag s-r Line or Slot r Word w
8 14 2
• 24 bit address
• 2 bit w ord identifier ( 4 byte block)
• 22 bit block identifier
—8 bit t ag ( = 22- 14)
—14 bit slot or line
• No tw o blocks in the sam e line have the sam e Tag field
• Check contents of cache by finding line and checking Tag
Dire c t M a pping Ca c he Line T a ble
• Cache line Main Mem ory blocks held
• 0 0, m , 2m , 3m …2s- m
• 1 1, m + 1, 2m + 1…2s- m + 1
• m - 1 m - 1, 2m - 1, 3m - 1…2s- 1
Dire c t M a pping
Ex a m ple Dire c t M a pping Sum m a ry
• Addr ess lengt h = ( s + w ) bit s
• Num ber of addr essable unit s = 2s+ w w or ds or byt es
• Block size = line size = 2w w or ds or byt es
• Num ber of blocks in m ain m em or y = 2s+ w / 2w = 2s
• Num ber of lines in cache = m = 2r
• Size of t ag = ( s – r ) bit s
Dire c t M a pping pros & c ons
• Sim ple
• I nexpensive
• Fixed locat ion for given block
—I f a program accesses 2 blocks t hat m ap t o t he sam e line repeat edly, cache m isses are very high
Assoc ia t ive M a pping
• A m ain m em or y block can load int o any line of cache
• Mem or y addr ess is int er pr et ed as t ag and w or d
• Tag uniquely ident ifies block of m em or y
• Ever y line’s t ag is exam ined for a m at ch
• Cache sear ching get s expensive
Fully Assoc ia t ive Ca c he Orga niza t ion
Tag 22 bit Word2 bit Assoc ia t ive M a pping
Addre ss St ruc t ure
• 22 bit t ag st ored w it h each 32 bit block of dat a • Com pare t ag field w it h t ag ent ry in cache t o
check for hit
• Least significant 2 bit s of address ident ify w hich 16 bit w ord is required from 32 bit dat a block • e. g.
—Address Tag Data Cache line
—FFFFFC FFFFFC24682468 3FFF
Assoc ia t ive M a pping Sum m a ry
• Addr ess lengt h = ( s + w ) bit s
• Num ber of addr essable unit s = 2s+ w w or ds or byt es
• Block size = line size = 2w w or ds or byt es
• Num ber of blocks in m ain m em or y = 2s+ w / 2w = 2s
• Num ber of lines in cache = undet er m ined
• Size of t ag = s bit s
Se t Assoc ia t ive M a pping
• Cache is divided int o a num ber of set s
• Each set cont ains a num ber of lines
• A given block m aps t o any line in a given set
—e. g. Block B can be in any line of set i • e.g. 2 lines per set
—2 w ay associat ive m apping
—A given block can be in one of 2 lines in only one set
Se t Assoc ia t ive M a pping Ex a m ple
• 13 bit set num ber
• Block num ber in m ain m em or y is m odulo 213
• 000000, 00A000, 00B000, 00C000 … m ap t o sam e set
T w o Wa y Se t Assoc ia t ive Ca c he Orga niza t ion
Se t Assoc ia t ive M a pping Addre ss St ruc t ure
• Use set field t o det er m ine cache set t o look in
• Com par e t ag field t o see if w e have a hit
• e.g
—Address Tag Dat a Set
num ber
—1FF 7FFC 1FF 12345678 1FFF
T w o Wa y Se t
Assoc ia t ive M a pping Ex a m ple
Se t Assoc ia t ive M a pping Sum m a ry
• Addr ess lengt h = ( s + w ) bit s
• Num ber of addr essable unit s = 2s+ w w or ds or byt es
• Block size = line size = 2w w or ds or byt es
• Num ber of blocks in m ain m em or y = 2d
• Num ber of lines in set = k
• Num ber of set s = v = 2d
• Num ber of lines in cache = kv = k * 2d
• Size of t ag = ( s – d) bit s
Re pla c e m e nt Algorit hm s (1 ) Dire c t m a pping
• No choice
• Each block only m aps t o one line
• Replace t hat line
Re pla c e m e nt Algorit hm s (2 ) Assoc ia t ive & Se t Assoc ia t ive
• Har dw ar e im plem ent ed algor it hm ( speed)
• Least Recent ly used ( LRU)
• e.g. in 2 w ay set associat ive
—Which of t he 2 block is lru? • Fir st in fir st out ( FI FO)
—replace block t hat has been in cache longest • Least fr equent ly used
—replace block w hich has had few est hit s • Random
Writ e Polic y
• Must not over w r it e a cache block unless m ain m em or y is up t o dat e
• Mult iple CPUs m ay have individual caches
• I / O m ay addr ess m ain m em or y dir ect ly
Writ e t hrough
• All w r it es go t o m ain m em or y as w ell as cache
• Mult iple CPUs can m onit or m ain m em or y t r affic t o keep local ( t o CPU) cache up t o dat e
Writ e ba c k
• Updat es init ially m ade in cache only
• Updat e bit for cache slot is set w hen updat e occur s
• I f block is t o be r eplaced, w r it e t o m ain m em or y only if updat e bit is set
• Ot her caches get out of sync
• I / O m ust access m ain m em or y t hr ough cache
• N.B. 15% of m em or y r efer ences ar e w r it es
Pe nt ium 4 Ca c he • 80386 – no on chip cache
• 80486 – 8k using 16 byt e lines and four w ay set associat ive organizat ion
• Pent ium ( all versions) – t w o on chip L1 caches —Data & instructions
• Pent ium I I I – L3 cache added off chip • Pent ium 4
—L1 caches
–8k byt es
–64 byt e lines
–four way set associat ive
—L2 cache
–Feeding bot h L1 caches
–256k
–128 byt e lines
–8 way set associat ive
—L3 cache on chip
I nt e l Ca c he Evolut ion
Problem Solution
Processor on which feature first appears
External memory slower than the system bus. Add external cache using faster memory technology.
386 Increased processor speed results in external bus
becoming a bottleneck for cache access. Move external cache on-chip, operating at the same speed as the processor.
486
Internal cache is rather small, due to limited space on
chip Add external L2 cache using faster technology than main memory
486 Contention occurs when both the Instruction Prefetcher
and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place.
Create separate data and instruction caches.
Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache.
Pentium Pro
Move L2 cache on to the processor chip.
Pentium II Some applications deal with massive databases and
must have rapid access to large amounts of data. The on-chip caches are too small.
Add external L3 cache. Pentium III Move L3 cache on-chip. Pentium 4
Pe nt ium 4 Bloc k Dia gra m
Pe nt ium 4 Core Proc e ssor • Fet ch/ Decode Unit
—Fetches instructions from L2 cache
—Decode into m icro- ops
—Store m icro- ops in L1 cache
• Out of order execut ion logic —Schedules m icro- ops
—Based on data dependence and resources
—May speculativ ely execute
• Execut ion unit s —Execute m icro- ops
—Data from L1 cache
—Results in registers
Pe nt ium 4 De sign Re a soning
• Decodes instructions into RI SC like m icro- ops before L1 cache
• Micro- ops fixed length
—Superscalar pipelining and scheduling
• Pentium instructions long & com plex
• Perform ance im proved by separatin g decoding from scheduling & pipelining
—( More lat er – ch14)
• Data cache is w rite back
—Can be configured t o writ e t hrough
• L1 cache controlled by 2 bits in register
—CD = cache disable
—NW = not writ e t hrough
Pow e rPC Ca c he Orga niza t ion
• 601 – single 32kb 8 w ay set associat ive
• 603 – 16kb ( 2 x 8kb) t w o w ay set associat ive
• 604 – 32kb
• 620 – 64kb
• G3 & G4
—64kb L1 cache –8 w ay set associative
—256k, 512k or 1M L2 cache –tw o w ay set associative
• G5
—32kB inst ruct ion cache —64kB dat a cache
Pow e rPC G5 Bloc k Dia gra m
I nt e rne t Sourc e s
• Manufact ur er sit es