david.ruslingdigital.om
REVIEW,Version0.8-3
January25, 1999
Thisbook isforLinuxenthusiastswhowanttoknowhowtheLinux kernelworks. Itis
notaninternals manual. Rather itdesribes the priniples andmehanismsthatLinux
uses;how andwhytheLinuxkernelworksthewaythatitdoes. Linuxisa moving
tar-get;thisbookisbasedupontheurrent, stable,2.0.33souresasthosearewhatmost
individualsandompaniesarenowusing.
Thisbookisfreelydistributable,youmayopyandredistributeitunderertainonditions.
Univel.
Copyright 1996,1997,1998,19 99 DavidARusling
3FoxgloveClose,Wokingham,BerkshireRG413NF,UK
david.ruslingarm.om
Thisbook (\TheLinuxKernel")maybereproduedanddistributedinwholeor
inpart,withoutfee,subjettothefollowing onditions:
The opyright notie above and this permission notie must be preserved
ompleteonallompleteorpartialopies.
Anytranslationorderivedworkmustbeapprovedbytheauthorinwriting
beforedistribution.
If youdistribute thisworkinpart, instrutionsfor obtainingtheomplete
versionofthismanualmustbeinluded,andameansfor obtaininga
om-pleteversionprovided.
Smallportions may bereprodued asillustrations for reviewsor quotesin
otherworkswithoutthispermissionnotieifproperitationisgiven.
Exeptions to these rules may be granted for aademi purposes: Write to the
authorandask. Theserestritionsareheretoprotetusasauthors,nottorestrit
youaslearnersandeduators.
AllsoureodeinthisdoumentisplaedundertheGNUGeneralPubliLiense,
availableviaanonymousFTPfromprep.ai.mit.edu:/pub/gnu/COPYI NG.Itisalso
LinuxisaphenomenonoftheInternet. Bornoutofthehobbyprojetofastudentit
hasgrowntobeomemorepopularthananyotherfreelyavailableoperatingsystem.
Tomany Linux is an enigma. How ansomething that is free be worthwhile? In
aworlddominated byahandfulof largesoftwareorporations,howansomething
that has been written by a bunh of \hakers" (si) hope to ompete? How an
softwareontributedtobymanydierentpeopleinmanydierentountriesaround
the world have ahope of being stable and eetive? Yet stable and eetiveit is
andompeteitdoes. ManyUniversitiesandresearhestablishmentsuseitfortheir
everyday omputing needs. People are running iton their home PCsand I would
wagerthatmostompaniesareusingitsomewhereeveniftheydonotalwaysrealize
that they do. Linux is used to browse the web, host web sites, write theses, send
eletronimailand,asalwayswithomputers,toplaygames. Linuxisemphatially
notatoy;itisafullydevelopedandprofessionallywrittenoperatingsystemusedby
enthusiastsallovertheworld.
The roots of Linux an be traed bak to the origins of Unix TM
. In 1969, Ken
Thompson of the Researh Group at Bell Laboratories began experimenting on a
multi-user, multi-taskingoperating systemusing anotherwiseidle PDP-7. He was
soonjoinedbyDennisRihieandthetwoofthem,alongwithothermembersofthe
ResearhGroupproduedtheearlyversionsofUnix TM
. Rihiewasstronglyinuened
by anearlier projet, MULTICS and thename Unix TM
isitself a punon thename
MULTICS. Earlyversionswere writtenin assemblyode,butthethird versionwas
rewrittenin anewprogramminglanguage,C.CwasdesignedandwrittenbyRihie
expressly as a programminglanguage for writing operating systems. This rewrite
allowed Unix TM
to moveonto the morepowerfulPDP-11/45and 11/70omputers
then beingprodued byDIGITAL.Therest, astheysay,is history. Unix TM
moved
outofthelaboratoryandintomainstreamomputingandsoonmostmajoromputer
manufaturerswereproduingtheirownversions.
Linux was the solution to a simpleneed. The only software that Linus Torvalds,
Linux's author and priniplemaintainer wasableto aord wasMinix. Minixis a
simple,Unix TM
like,operatingsystemwidelyused asateahingaid. Linus wasless
thanimpressedwithitsfeatures,hissolutionwastowritehisownsoftware. Hetook
Unix TM
ashismodelasthatwasanoperatingsystemthathewasfamiliarwithinhis
daytodaystudentlife. HestartedwithanIntel386basedPCandstartedtowrite.
Progresswas rapid and, exited bythis, Linus oeredhis eortsto other students
viatheemergingworldwideomputernetworks,thenmainly usedbytheaademi
ommunity. Others saw the software and started ontributing. Muh of this new
softwarewasitselfthesolutiontoaproblemthatoneoftheontributorshad. Before
Mostpeople useLinuxasasimpletool,oftenjust installing oneofthe manygood
CD ROM-based distributions. A lot of Linux users useit to write appliations or
torunappliationswrittenbyothers. ManyLinuxusersreadtheHOWTOs 1
avidly
andfeel boththethrill ofsuess whensomepartof thesystemhasbeenorretly
onguredandthefrustrationoffailurewhenithasnot. Aminorityareboldenough
to write devie driversand oer kernel pathesto Linus Torvalds, the reatorand
maintainer of the Linux kernel. Linus aepts additions and modiations to the
kernelsouresfrom anyone,anywhere. This mightsound like areipe foranarhy
but Linus exerisesstrit quality ontrol and merges all new ode into the kernel
himself. At any onetime though, there are only ahandful of people ontributing
sourestotheLinuxkernel.
The majority of Linux users do notlook at how the operating systemworks, how
it ts together. This is a shame beause looking at Linux is a verygood way to
learn moreabout how an operating system funtions. Not only is it well written,
allthe souresarefreely availablefor youto look at. Thisis beausealthough the
authors retain the opyrights to theirsoftware, they allowthe souresto be freely
redistributableunder theFree SoftwareFoundation'sGNUPubliLiense. Atrst
glanethough,thesouresan beonfusing; youwill see diretoriesalled kernel,
mmandnetbutwhatdotheyontainandhowdoesthatodework? Whatisneeded
is a broader understanding of the overall struture and aims of Linux. This, in
short,is theaimof thisbook: topromote alearunderstandingof howLinux,the
operatingsystem,works. Toprovideamind modelthat allowsyoutopiturewhat
ishappeningwithinthesystemasyouopyalefromoneplaetoanotherorread
eletronimail. Iwellremembertheexitementthat Ifelt whenIrstrealizedjust
howanoperatingsystematually worked. ItisthatexitementthatI wantto pass
ontothereadersofthis book.
MyinvolvementwithLinuxstartedlatein1994whenIvisitedJimParadiswhowas
workingonaportofLinuxtotheAlphaAXPproessorbasedsystems. Ihadworked
for Digital Equipment Co. Limited sine 1984, mostly in networks and
ommuni-ationsand in 1992I startedworking forthe newlyformed Digital Semiondutor
division. Thisdivision'sgoalwastoenterfullyintothemerhanthipvendormarket
and sellhips,and in partiular theAlpha AXP rangeof miroproessorsbut also
Alpha AXP system boards outside of Digital. When I rst heard about Linux I
immediatelysawanopportunityto havefun. Jim's enthusiasmwasathingand I
startedtohelpontheport. AsIworkedonthis,Ibeganmoreandmoretoappreiate
notonlytheoperatingsystembutalsotheommunityofengineersthatproduesit.
However,Alpha AXPis onlyoneofthe many hardwareplatforms that Linux runs
on. MostLinuxkernelsarerunningonIntelproessorbasedsystemsbut agrowing
numberofnon-IntelLinuxsystemsarebeomingmoreommonlyavailable. Amongst
these areAlpha AXP, ARM,MIPS, SparandPowerPC.I ouldhavewritten this
bookusinganyoneofthoseplatformsbutmybakgroundandtehnialexperienes
withLinuxarewith Linuxon theAlphaAXPand, toalesserextentontheARM.
Thisiswhythisbooksometimesusesnon-Intelhardwareasanexampletoillustrate
1
ommonto allof thehardwareplatforms that itruns on. Likewise,around95%of
thisbookisaboutthemahineindependentpartsoftheLinux kernel.
Reader Prole
This book does not make any assumptions about the knowledge or experiene of
thereader. Ibelievethat interestin thesubjetmatterwill enourageaproessof
self eduationwhere neessary. Thatsaid, adegreeoffamiliaritywith omputers,
preferablythePCwill helpthe readerderiverealbenetfrom thematerial,aswill
someknowledgeoftheCprogramminglanguage.
Organisation of this Book
This book is not intended to be used as an internals manual for Linux. Instead
it is an introdution to operating systems in general and to Linux in partiular.
The hapters eah follow my rule of \workingfrom thegeneral to the partiular".
Theyrstgiveanoverviewofthe kernelsubsystemthat theyaredesribingbefore
launhingintoitsgorydetails.
Ihavedeliberatelynotdesribedthekernel'salgorithms,itsmethodsofdoingthings,
intermsofroutine X()allsroutineY()whihinrementsthefooeldofthebar
datastruture. Youanreadtheodetondthesethingsout. WheneverIneedto
understandapieeofodeordesribeittosomeoneelseI oftenstartwithdrawing
its data strutures on the white-board. So, I havedesribed many of the relevant
kerneldatastruturesandtheirinterrelationshipsinafairamountofdetail.
Eah hapter isfairly independent, liketheLinux kernelsubsystemthat they eah
desribe. Sometimes,though,there arelinkages; forexampleyouannotdesribea
proesswithoutunderstandinghowvirtualmemoryworks.
TheHardwareBasishapter (Chapter1) givesabriefintrodutionto themodern
PC. An operating systemhas to work losely with the hardware system that ats
as its foundations. The operating system needs ertain servies that an only be
providedbythehardware. InordertofullyunderstandtheLinuxoperatingsystem,
youneedtounderstandthebasisoftheunderlying hardware.
The Software Basis hapter (Chapter 2) introdues basisoftware priniples and
looks at assemblyand Cprograminglanguages. Itlooksat thetoolsthat areused
to build an operating system like Linux and it gives an overview of the aims and
funtionsofanoperatingsystem.
TheMemoryManagementhapter(Chapter3)desribesthewaythatLinuxhandles
thephysialandvirtualmemoryinthesystem.
The Proesseshapter (Chapter 4) desribes what aproess is andhowthe Linux
kernelreates,managesanddeletestheproessesin thesystem.
Proessesommuniatewitheahotherandwiththekerneltooordinatetheir
ativ-ities. LinuxsupportsanumberofInter-ProessCommuniation(IPC)mehanisms.
SignalsandpipesaretwoofthembutLinuxalsosupportstheSystemVIPC
meha-nismsnamedaftertheUnix TM
system.
TheInterruptsandInterruptHandlinghapter (Chapter7)looksat howtheLinux
kernelhandles interrupts. Whilstthe kernelhasgenerimehanismsand interfaes
for handling interrupts, some of the interrupt handling details are hardware and
arhiteturespei.
OneofLinux'sstrengths isitssupport forthemanyavailablehardwaredeviesfor
the modern PC. The Devie Drivers hapter (Chapter 8) desribeshow the Linux
kernelontrolsthephysialdevies inthesystem.
TheFilesystemhapter(Chapter 9)desribeshowtheLinuxkernelmaintainsthe
lesinthelesystemsthatitsupports. ItdesribestheVirtualFileSystem(VFS)
andhowtheLinux kernel'sreallesystemsaresupported.
Networkingand Linux aretermsthat are almostsynonymous. Inaveryreal sense
LinuxisaprodutoftheInternetorWorldWideWeb(WWW).Itsdevelopersand
usersusethewebtoexhangeinformationideas,odeandLinuxitselfisoftenused
to support thenetworking needsof organizations. Chapter 10desribeshowLinux
supportsthenetworkprotoolsknownolletivelyasTCP/IP.
TheKernelMehanismshapter(Chapter11)looksatsomeofthegeneraltasksand
mehanismsthat theLinuxkernelneedstosupplyso thatother partsofthekernel
work eetivelytogether.
TheModules hapter(Chapter12)desribeshowtheLinuxkernelandynamially
loadfuntions,forexamplelesystems,onlywhentheyareneeded.
TheProessorshapter(Chapter13)givesabriefdesriptionofsomeofthe
proes-sorsthatLinuxhasbeenportedto.
TheSoures hapter (Chapter 14)desribeswherein the Linux kernelsouresyou
shouldstartlooking forpartiularkernelfuntions.
Conventions used in this Book
Thefollowingisalistofthetypographialonventionsusedinthis book.
seriffont identiesommandsorothertextthatistobetyped
literallybytheuser.
type font referstodatastruturesorelds
withindatastrutures.
ThroughoutthetexttherereferenestopieesofodewithintheLinuxkernelsoure
tree (for example the boxed margin note adjaent to this text ). These are given Seefoo()in
foo/bar.
in aseyou wish to look at the soure ode itself and all of the le referenesare
relative to /usr/sr/linux. Taking foo/bar. as an example, the full lename
would be /usr/sr/linux/foo/bar.Ifyou arerunning Linux (and you should),
ARMisatrademarkofARMHoldingsPLC.
Caldera,OpenLinux andthe\C"logoaretrademarksofCaldera,In.
CalderaOpenDOS1997Caldera,In.
DECisatrademarkofDigitalEquipmentCorporation.
DIGITALisatrademarkofDigitalEquipmentCorporation.
Linuxis atrademarkofLinusTorvalds.
MotifisatrademarkofTheOpenSystemFoundation,In.
MSDOSisatrademarkofMirosoftCorporation.
RedHat,glintandtheRedHatlogoaretrademarksofRedHat Software,In.
UNIXisaregisteredtrademarkof X/Open.
XFree86isatrademarkofXFree86Projet,In.
X WindowSystem is atrademarkof theX Consortiumand theMassahusetts
In-stituteofTehnology.
The Author
Iwasbornin1957,afewweeksbeforeSputnikwaslaunhed,inthenorthofEngland.
IrstmetUnixatUniversity,wherealetureruseditasanexamplewhenteahing
thenotionsofkernels,shedulingandotheroperatingsystemsgoodies. Ilovedusing
thenewlydeliveredPDP-11formynalyearprojet. Aftergraduating(in1982with
aFirstClass Honoursdegreein ComputerSiene) I worked forPrime Computers
(Primos) and then after aouple of years for Digital (VMS, Ultrix). At Digital I
workedonmanythingsbutforthelast5yearsthere,Iworkedforthesemiondutor
grouponAlphaandStrongARMevaluationboards. In1998ImovedtoARMwhere
I haveasmall groupof engineerswriting lowlevelrmwareand portingoperating
systems. Myhildren(EstherandStephen)desribemeasageek.
People often ask me aboutLinux at work and at home and I am only toohappy
to oblige. Themorethat I useLinuxin bothmyprofessional and personal life the
morethat IbeomeaLinuxzealot. You maynote that I usetheterm`zealot'and
not `bigot'; I dene a Linux zealot to be an enthusiast that reognizes that there
are other operating systems but prefers not to use them. As my wife, Gill, who
uses Windows95 one remarked \I neverrealized that wewould havehis and her
operating systems". For me, as an engineer, Linux suits my needsperfetly. It is
a superb, exible and adaptableengineering tool that I use at work and at home.
MostfreelyavailablesoftwareeasilybuildsonLinuxandIanoftensimplydownload
pre-builtexeutablelesorinstallthemfromaCDROM.Whatelse ouldIuseto
learnto programinC++,PerlorlearnaboutJavaforfree?
Aknowledgements
e-this book in order to teah omputing. Myansweris an emphati yes; this is one
useofthebookthatIpartiularlywanted. Whoknows,theremaybeanotherLinus
Torvaldssatinthelass.
SpeialthanksmustgotoJohnRigbyandMihaelBauerwhogavemefull,detailed
reviewnotes ofthewhole book. Not aneasytask. Alan CoxandStephen Tweedie
havepatientlyansweredmy questions - thanks. I used LarryEwing's penguins to
brightenup the hapters abit. Finally, thank you to GregHankins for aepting
Prefae iii
1 Hardware Basis 1
1.1 TheCPU . . . 2
1.2 Memory . . . 4
1.3 Buses . . . 4
1.4 ControllersandPeripherals . . . 5
1.5 AddressSpaes . . . 5
1.6 Timers. . . 6
2 Software Basis 7 2.1 ComputerLanguages. . . 7
2.1.1 Assembly Languages . . . 7
2.1.2 TheCProgrammingLanguageandCompiler . . . 8
2.1.3 Linkers . . . 9
2.2 WhatisanOperatingSystem? . . . 9
2.2.1 Memorymanagement . . . 10
2.2.2 Proesses . . . 10
2.2.3 Deviedrivers. . . 11
2.2.4 TheFilesystems . . . 11
2.3 KernelDataStrutures . . . 11
2.3.1 LinkedLists. . . 12
2.3.2 HashTables. . . 12
2.3.3 AbstratInterfaes . . . 13
3 MemoryManagement 15 3.1 AnAbstratModelofVirtualMemory . . . 16
3.1.1 DemandPaging. . . 18
3.1.2 Swapping . . . 19
3.1.3 SharedVirtualMemory . . . 19
3.1.4 PhysialandVirtualAddressingModes . . . 19
3.4.1 PageAlloation. . . 24
3.4.2 PageDealloation . . . 24
3.5 MemoryMapping. . . 25
3.6 DemandPaging. . . 26
3.7 TheLinuxPageCahe . . . 27
3.8 SwappingOutandDisardingPages . . . 28
3.8.1 ReduingtheSizeofthePageandBuerCahes . . . 29
3.8.2 SwappingOutSystemVSharedMemoryPages. . . 30
3.8.3 SwappingOutandDisardingPages . . . 30
3.9 TheSwapCahe . . . 31
3.10 SwappingPagesIn . . . 32
4 Proesses 35 4.1 LinuxProesses. . . 36
4.2 Identiers . . . 38
4.3 Sheduling. . . 39
4.3.1 ShedulinginMultiproessorSystems . . . 41
4.4 Files . . . 42
4.5 VirtualMemory . . . 43
4.6 CreatingaProess . . . 45
4.7 TimesandTimers . . . 46
4.8 ExeutingPrograms . . . 47
4.8.1 ELF . . . 48
4.8.2 SriptFiles . . . 50
5 Interproess CommuniationMehanisms 51 5.1 Signals. . . 51
5.2 Pipes. . . 53
5.3 Sokets . . . 55
5.3.1 SystemVIPCMehanisms . . . 55
5.3.2 MessageQueues . . . 55
5.3.3 Semaphores . . . 56
5.3.4 SharedMemory. . . 58
6 PCI 61 6.1 PCIAddressSpaes . . . 61
6.5 PCI-PCIBridges . . . 65
6.5.1 PCI-PCIBridges: PCII/OandPCIMemoryWindows . . . . 65
6.5.2 PCI-PCI Bridges: PCI Conguration Cyles and PCI Bus Numbering . . . 65
6.6 LinuxPCI Initialization . . . 66
6.6.1 TheLinuxKernelPCIDataStrutures . . . 67
6.6.2 ThePCI DevieDriver . . . 68
6.6.3 PCIBIOSFuntions . . . 70
6.6.4 PCIFixup . . . 72
7 Interrupts and Interrupt Handling 75 7.1 ProgrammableInterruptControllers . . . 77
7.2 InitializingtheInterruptHandling DataStrutures . . . 77
7.3 InterruptHandling . . . 78
8 Devie Drivers 81 8.1 PollingandInterrupts . . . 82
8.2 DiretMemoryAess(DMA) . . . 84
8.3 Memory . . . 85
8.4 InterfaingDevieDriverswiththeKernel . . . 85
8.4.1 CharaterDevies . . . 86
8.4.2 BlokDevies . . . 87
8.5 HardDisks . . . 88
8.5.1 IDEDisks . . . 90
8.5.2 InitializingtheIDESubsystem . . . 91
8.5.3 SCSIDisks . . . 91
8.6 Network Devies . . . 95
8.6.1 InitializingNetwork Devies. . . 96
9 The Filesystem 99 9.1 TheSeondExtendedFilesystem(EXT2) . . . 101
9.1.1 TheEXT2Inode . . . 102
9.1.2 TheEXT2Superblok . . . 103
9.1.3 TheEXT2GroupDesriptor . . . 104
9.1.4 EXT2Diretories. . . 104
9.1.5 FindingaFilein anEXT2FileSystem . . . 105
9.1.6 ChangingtheSizeofaFilein anEXT2FileSystem . . . 106
9.2 TheVirtualFileSystem(VFS) . . . 107
9.2.6 CreatingaFilein theVirtualFileSystem . . . 112
9.2.7 UnmountingaFileSystem . . . 112
9.2.8 TheVFS InodeCahe . . . 113
9.2.9 TheDiretoryCahe . . . 113
9.3 TheBuer Cahe . . . 114
9.3.1 ThebdflushKernelDaemon . . . 116
9.3.2 TheupdateProess. . . 116
9.4 The/proFileSystem . . . 116
9.5 DevieSpeialFiles . . . 117
10 Networks 119 10.1 AnOverviewofTCP/IPNetworking . . . 119
10.2 TheLinuxTCP/IPNetworkingLayers. . . 122
10.3 TheBSD SoketInterfae . . . 124
10.4 TheINETSoketLayer . . . 125
10.4.1 CreatingaBSDSoket . . . 127
10.4.2 BindinganAddresstoanINETBSD Soket . . . 127
10.4.3 MakingaConnetiononanINETBSDSoket . . . 128
10.4.4 ListeningonanINETBSDSoket . . . 129
10.4.5 AeptingConnetionRequests . . . 130
10.5 TheIPLayer . . . 130
10.5.1 SoketBuers . . . 130
10.5.2 ReeivingIPPakets . . . 131
10.5.3 SendingIPPakets. . . 132
10.5.4 DataFragmentation . . . 133
10.6 TheAddress ResolutionProtool(ARP) . . . 133
10.7 IPRouting . . . 135
10.7.1 TheRouteCahe . . . 135
10.7.2 TheForwardingInformationDatabase . . . 136
11 KernelMehanisms 139 11.1 BottomHalf Handling . . . 139
11.2 Task Queues . . . 140
11.3 Timers. . . 141
11.4 WaitQueues . . . 142
12.1 LoadingaModule . . . 146
12.2 UnloadingaModule . . . 148
13Proessors 151 13.1 X86 . . . 151
13.2 ARM. . . 151
13.3 AlphaAXPProessor . . . 152
14The LinuxKernelSoures 153 A LinuxData Strutures 159 B UsefulWeb and FTP Sites 177 C LinuxDoumentationProjet Manifesto 179 C.1 Overview . . . 179
C.2 GettingInvolved . . . 180
C.3 CurrentProjets . . . 180
C.4 FTPsitesforLDPworks . . . 180
C.5 DoumentationConventions . . . 180
C.6 CopyrightandLiense . . . 181
C.7 PublishingLDPManuals . . . 181
D The GNU GeneralPubli Liense 183 D.1 Preamble . . . 183
D.2 TermsandConditions . . . 184
D.3 HowtoApplyTheseTerms . . . 188
Glossary 191
1.1 AtypialPCmotherboard. . . 2
3.1 AbstratmodelofVirtualtoPhysialaddressmapping . . . 16
3.2 AlphaAXPPageTableEntry. . . 20
3.3 ThreeLevelPageTables . . . 23
3.4 Thefree areadatastruture. . . 25
3.5 AreasofVirtualMemory . . . 26
3.6 TheLinuxPageCahe . . . 27
4.1 AProess'sFiles . . . 42
4.2 AProess'sVirtualMemory. . . 44
4.3 RegisteredBinaryFormats . . . 47
4.4 ELFExeutableFileFormat . . . 48
5.1 Pipes. . . 54
5.2 SystemVIPCMessageQueues . . . 56
5.3 SystemVIPCSemaphores . . . 57
5.4 SystemVIPCSharedMemory . . . 59
6.1 ExamplePCIBasedSystem . . . 62
6.2 ThePCI CongurationHeader . . . 63
6.3 Type0PCICongurationCyle . . . 65
6.4 Type1PCICongurationCyle . . . 65
6.5 LinuxKernel PCIDataStrutures . . . 67
6.6 ConguringaPCISystem: Part1 . . . 69
6.7 ConguringaPCISystem: Part2 . . . 70
6.8 ConguringaPCISystem: Part3 . . . 71
6.9 ConguringaPCISystem: Part4 . . . 71
6.10 PCICongurationHeader: BaseAddressRegisters . . . 72
7.1 ALogialDiagramofInterruptRouting . . . 76
7.2 LinuxInterruptHandling DataStrutures . . . 79
9.1 PhysialLayoutoftheEXT2Filesystem . . . 101
9.2 EXT2Inode. . . 102
9.3 EXT2Diretory . . . 105
9.4 ALogialDiagramoftheVirtualFileSystem . . . 107
9.5 RegisteredFileSystems . . . 110
9.6 AMountedFileSystem . . . 112
9.7 TheBuer Cahe . . . 114
10.1 TCP/IPProtoolLayers. . . 121
10.2 LinuxNetworkingLayers . . . 123
10.3 LinuxBSD SoketDataStrutures . . . 126
10.4 TheSoketBuer(skbu) . . . 131
10.5 TheForwardingInformationDatabase . . . 136
11.1 BottomHalf HandlingDataStrutures . . . 139
11.2 ATask Queue. . . 140
11.3 SystemTimers . . . 142
11.4 WaitQueue . . . 143
Hardware Basis
Anoperating systemhas to work loselywith the hardware systemthat
ats as itsfoundations. The operatingsystemneedsertain serviesthat
an only be provided by the hardware. In order to fully understand
the Linux operating system, you need to understand the basis of the
underlying hardware. This hapter gives a brief introdution to that
hardware: the modernPC.
When the \Popular Eletronis" magazine for January 1975 was printed with an
illustration of the Altair 8080 on its front over, arevolution started. The Altair
8080,namedafterthedestinationofanearlyStarTrekepisode,ouldbeassembled
byhomeeletronis enthusiastsforamere$397. With itsIntel8080proessorand
256bytes ofmemory but nosreen orkeyboard it waspuny bytoday's standards.
Itsinventor,EdRoberts, oinedtheterm\personalomputer"to desribehis new
invention, but the term PCis nowused to referto almost anyomputer that you
anpikupwithoutneedinghelp. Bythisdenition,evensomeoftheverypowerful
AlphaAXPsystemsarePCs.
Enthusiasti hakers saw the Altair's potential and started to write software and
build hardwarefor it. To these earlypioneers it represented freedom; thefreedom
from hugebath proessingmainframe systemsrun andguardedbyanelite
priest-hood. Overnight fortunes were made by ollege dropouts fasinated by this new
phenomenon,aomputerthat yououldhaveathomeonyourkithen table. Alot
ofhardwareappeared,alldierentto somedegreeand softwarehakerswerehappy
towritesoftwareforthesenewmahines. ParadoxiallyitwasIBMwhormlyast
themouldofthemodernPCbyannouningtheIBMPCin1981andshippingitto
ustomersearlyin1982. With itsIntel8088proessor,64Kofmemory(expandable
to256K),twooppydisksandan80haraterby25linesColourGraphisAdapter
(CGA) it was not verypowerfulby today's standardsbut it sold well. It was
fol-lowed,in 1983,bytheIBMPC-XTwhih hadtheluxury ofa10Mbyte harddrive.
ItwasnotlongbeforeIBMPCloneswerebeingproduedbyahostofompanies
CPU
parallel port
COM1
COM2
power
power
ISA Slots
Memory SIMM Slots
PCI Slots
Figure1.1: A typialPC motherboard.
de-fatostandardhelpedamultitudeofhardwareompaniestoompetetogetherin
agrowingmarketwhih,happilyforonsumers,keptprieslow. Manyofthesystem
arhiteturalfeaturesoftheseearlyPCshavearriedoverintothemodernPC.For
example,eventhemostpowerfulIntelPentiumProbasedsystemstartsrunningin
the Intel 8086's addressingmode. WhenLinus Torvalds started writingwhat was
to beomeLinux, hepiked themostplentiful and reasonablypriedhardware, an
Intel80386PC.
Looking ataPCfromtheoutside,themostobviousomponentsareasystembox,
akeyboard,amouseandavideomonitor. Onthefrontofthesystemboxaresome
buttons, a little display showing some numbersand a oppy drive. Most systems
thesedayshaveaCDROMandifyoufeelthatyouhavetoprotetyourdata,then
there willalsobeatapedriveforbakups. These deviesareolletivelyknownas
theperipherals.
Although theCPU is in overall ontrol of thesystem, itis notthe only intelligent
devie. All oftheperipheralontrollers, forexamplethe IDEontroller,havesome
levelofintelligene. InsidethePC(Figure1.1)youwillseeamotherboardontaining
theCPU ormiroproessor,thememoryand anumberof slotsfor theISAorPCI
peripheralontrollers. Some oftheontrollers,forexampletheIDE diskontroller
maybebuiltdiretlyontothesystemboard.
1.1 The CPU
TheCPU,orrathermiroproessor,istheheartofanyomputersystem. The
large)units. ThisiswhenthetermCentralProessingUnitwasoined. Themodern
miroproessorombines these omponents onto an integrated iruit ethed onto
averysmall piee of silion. ThetermsCPU, miroproessor andproessor are all
usedinterhangeablyinthisbook.
Miroproessorsoperate onbinary data; that is data omposed of ones and zeros.
These ones and zerosorrespond to eletrial swithes being either onoro. Just
as42isadeimalnumbermeaning\410sand2units",abinarynumberisaseries
ofbinarydigitseahonerepresentingapowerof2. Inthisontext,apowermeans
thenumberoftimesthatanumberismultipliedbyitself. 10to thepower1(10 1
is10x10x10and soon. Binary0001is
deimal1,binary0010isdeimal2,binary0011is3,binary0100is4andsoon. So,
42deimalis101010binaryor(2+8+32or2
to represent numbers in omputer programs, another base, hexadeimal is usually
used. Inthis base,eah digitalrepresentsapowerof16. As deimalnumbersonly
gofrom 0to9thenumbers10to 15arerepresentedasasingledigitbythe letters
A,B,C,D,EandF.Forexample,hexadeimalEisdeimal14andhexadeimal2A
isdeimal42(two16s)+10). UsingtheCprogramminglanguagenotation(asI do
throughoutthis book)hexadeimal numbersareprefaedby\0x";hexadeimal 2A
iswrittenas0x2A.
Miroproessorsanperformarithmetioperationssuhasadd,multiplyanddivide
andlogialoperationssuhas\isXgreaterthanY?".
Theproessor'sexeution isgovernedby anexternallok. This lok,the system
lok, generatesregular lok pulses to theproessor and, at eahlok pulse, the
proessor does somework. For example, a proessor ould exeute an instrution
everylokpulse. Aproessor'sspeedisdesribedintermsoftherateofthesystem
loktiks. A100Mhzproessorwillreeive100,000,000loktikseveryseond. It
ismisleadingtodesribethepowerofaCPUbyitslokrateasdierentproessors
performdierentamountsofworkperloktik. However,allthingsbeingequal,a
fasterlokspeedmeansamorepowerfulproessor.Theinstrutionsexeutedbythe
proessorareverysimple; forexample\readtheontentsofmemoryat loationX
intoregisterY".Registersarethemiroproessor'sinternalstorage,usedforstoring
data and performing operations on it. The operations performed may ause the
proessortostopwhatitisdoingandjumptoanotherinstrutionsomewhereelsein
memory. Thesetinybuildingbloksgivethemodernmiroproessoralmostlimitless
powerasitanexeutemillionsorevenbillionsofinstrutionsaseond.
Theinstrutionshavetobefethedfrommemoryastheyareexeuted. Instrutions
maythemselvesreferenedata withinmemory andthat datamustbefethed from
memoryandsavedtherewhenappropriate.
Thesize,numberandtypeofregisterwithin amiroproessorisentirelydependent
on itstype. An Intel 4086 proessorhas adierent registerset to anAlpha AXP
proessor;forastart, theIntel'sare 32bits wideandtheAlpha AXP'sare 64bits
wide. Ingeneral,though,anygivenproessorwillhaveanumberofgeneralpurpose
registers and a smaller number of dediated registers. Most proessors have the
followingspeialpurpose,dediated, registers:
ProgramCounter(PC) Thisregisterontainstheaddressofthenextinstrution
Stak Pointer (SP) Proessorshaveto have aess to largeamounts of external
read/writerandomaessmemory(RAM)whihfailitatestemporarystorage
ofdata. Thestakis awayof easilysavingandrestoringtemporaryvaluesin
externalmemory. Usually,proessorshavespeialinstrutionswhihallowyou
topushvaluesontothestakandtopopthemoagainlater. Thestakworks
on alast in rstout(LIFO) basis. Inother words, ifyou pushtwovalues,x
andy,ontoastakandthenpopavalueoofthestakthenyouwillgetbak
thevalueofy.
Someproessor'sstaksgrowupwardstowardsthetopofmemorywhilstothers
grow downwardstowardsthe bottom, or base,of memory. Some proessor's
supportbothtypes,forexampleARM.
ProessorStatus (PS) Instrutionsmayyieldresults;forexample\istheontent
ofregisterXgreaterthantheontentofregisterY?"willyieldtrueorfalseas
a result. ThePS registerholdsthis and other informationabouttheurrent
stateof theproessor. Forexample,mostproessorshaveatleast twomodes
of operation, kernel (or supervisor) and user. The PS register would hold
informationidentifying theurrentmode.
1.2 Memory
All systemshaveamemoryhierarhywith memoryatdierent speedsandsizes at
dierentpointsinthehierarhy. Thefastestmemoryisknownasahememoryand
iswhatitsounds like-memorythatisusedto temporarilyhold, orahe,ontents
ofthemainmemory. Thissortofmemoryisveryfastbutexpensive,thereforemost
proessorshaveasmallamountofon-hipahememoryandmoresystembased
(on-board)ahememory. Someproessorshaveoneahetoontainbothinstrutions
and data, but others have two, one for instrutions and the other for data. The
Alpha AXPproessorhastwointernal memoryahes;onefordata (the D-Cahe)
and onefor instrutions(the I-Cahe). Theexternalahe(or B-Cahe)mixesthe
twotogether. Finallythereisthemainmemorywhihrelativetotheexternalahe
memory is very slow. Relative to the on-CPU ahe, main memory is positively
rawling.
Theahe andmain memoriesmust be kept in step(oherent). In other words, if
aword ofmain memoryis held in oneormoreloationsin ahe, thenthe system
must makesure that the ontents of ahe and memory arethe same. The job of
ahe oherenyis done partially by the hardware and partially by the operating
system. This is also truefor anumber of major systemtasks where the hardware
andsoftwaremustooperateloselytoahievetheiraims.
1.3 Buses
Theindividualomponentsofthesystemboardareinteronnetedbymultiple
on-netion systemsknownasbuses. The systembus is dividedintothree logial
fun-tions;theaddressbus,thedata busand theontrol bus. Theaddressbusspeies
thememoryloations(addresses)forthedatatransfers. Thedatabusholdsthedata
andontrol signalsthroughoutthesystem. Manyavoursofbusexist,forexample
ISAandPCIbusesarepopularwaysofonnetingperipheralsto thesystem.
1.4 Controllers and Peripherals
Peripheralsarereal devies,suhasgraphis ardsordisksontrolledbyontroller
hipsonthesystemboardoronardspluggedintoit. TheIDEdisksareontrolled
bytheIDEontrollerhipandtheSCSIdisksbytheSCSIdiskontrollerhipsand
soon. These ontrollersare onneted to theCPU and to eahother by avariety
of buses. Mostsystems built now usePCI and ISA buses to onnet together the
main system omponents. The ontrollersare proessorslikethe CPU itself, they
anbeviewedasintelligenthelperstotheCPU.TheCPUisinoverallontrolofthe
system.
All ontrollers are dierent, but they usually have registers whih ontrol them.
Software running on the CPU must be able to read and write those ontrolling
registers. Oneregistermightontainstatusdesribing anerror. Anothermightbe
used forontrol purposes; hangingthemodeof theontroller. Eah ontroller on
abusan beindividuallyaddressedbytheCPU,thisis sothatthesoftwaredevie
driveranwritetoitsregistersandthusontrolit. TheIDEribbonisagoodexample,
asitgivesyoutheabilitytoaess eahdriveonthebusseparately. Anothergood
exampleisthePCIbuswhihallowseahdevie(forexampleagraphisard)tobe
aessedindependently.
1.5 Address Spaes
ThesystembusonnetstheCPUwiththemain memoryandisseparatefrom the
busesonnetingtheCPUwiththesystem'shardwareperipherals. Colletivelythe
memory spae that the hardwareperipheralsexist in is known asI/O spae. I/O
spaemayitself be furthersubdivided,but wewillnotworrytoomuh aboutthat
forthe moment. TheCPU an aessboththesystemspae memoryand theI/O
spae memory, whereasthe ontrollers themselves anonly aess system memory
indiretly and then only withthe help of theCPU. From thepointof view of the
devie, say the oppy disk ontroller, it will see only the address spae that its
ontrol registers are in (ISA),and notthe system memory. Typially a CPU will
haveseparate instrutionsfor aessing the memoryand I/Ospae. For example,
theremightbeaninstrutionthatmeans\readabyte fromI/Oaddress0x3f0 into
registerX".ThisisexatlyhowtheCPUontrolsthesystem'shardwareperipherals,
by reading and writing to their registers in I/O spae. Where in I/O spae the
ommon peripherals(IDE ontroller, serial port, oppy disk ontroller and so on)
havetheirregistershasbeensetbyonventionovertheyearsasthePCarhiteture
hasdeveloped. TheI/Ospaeaddress0x3f0 just happensto be theaddressof one
oftheserialport's(COM1)ontrolregisters.
Therearetimeswhenontrollersneedtoreadorwritelargeamountsofdatadiretly
to or from system memory. For example when user data is being written to the
harddisk. Inthis ase,DiretMemoryAess(DMA)ontrollersareusedto allow
1.6 Timers
AlloperatingsystemsneedtoknowthetimeandsothemodernPCinludesaspeial
peripheralalled theReal Time Clok(RTC). This provides twothings: areliable
timeof dayand anauratetiming interval. TheRTChasitsownbatterysothat
itontinuestorunevenwhenthePCisnotpoweredon,thisishowyourPCalways
\knows"theorretdate andtime. Theintervaltimerallowstheoperatingsystem
Software Basis
Aprogramisasetofomputerinstrutionsthatperformapartiulartask.
That program an be written in assembler, a very low level omputer
language, or in a high level, mahine independent language suh as the
C programming language. An operating system is a speial program
whihallows the user to run appliations suhas spreadsheets and word
proessors. This hapter introdues basi programming priniples and
givesan overview ofthe aimsand funtions of anoperating system.
2.1 Computer Languages
2.1.1 Assembly Languages
The instrutionsthat aCPU fethes from memoryand exeutesare notat all
un-derstandable to human beings. They are mahine odes whih tell the omputer
preiselywhattodo. Thehexadeimalnumber0x89E5 isanIntel80486instrution
whihopies theontentsof theESPregistertotheEBP register. Oneoftherst
softwaretoolsinventedfortheearliestomputerswasanassembler,aprogramwhih
takesa humanreadable soure le and assembles it into mahine ode. Assembly
languagesexpliitlyhandle registersandoperationsondataandtheyarespeito
apartiularmiroproessor. TheassemblylanguageforanIntelX86miroproessor
is very dierentto theassembly languagefor an Alpha AXPmiroproessor. The
followingAlphaAXPassemblyodeshowsthesortofoperationsthataprograman
perform:
ldr r16, (r15) ; Line 1
ldr r17, 4(r15) ; Line 2
beq r16,r17,100 ; Line 3
str r17, (r15) ; Line 4
The rst statement (on line 1) loads register 16 from the addressheld in register
15. Thenextinstrutionloadsregister17fromthenextloationinmemory. Line3
omparestheontentsofregister16withthat of register17 and,iftheyare equal,
branhestolabel100. Iftheregistersdonotontainthesamevaluethentheprogram
ontinuestoline4where theontentsofr17aresavedintomemory. Iftheregisters
doontainthesamevaluethennodataneedstobesaved. Assemblylevelprograms
aretediousandtrikytowriteandpronetoerrors. VerylittleoftheLinuxkernelis
writteninassemblylanguageandthosepartsthatarearewrittenonlyforeÆieny
andtheyarespei topartiularmiroproessors.
2.1.2 The C Programming Language and Compiler
Writinglargeprogramsin assemblylanguageisadiÆultandtimeonsumingtask.
It is prone to error and the resulting program is not portable, being tied to one
partiularproessorfamily. It isfar better to useamahine independentlanguage
likeC[7,TheCProgrammingLanguage℄. Callowsyoutodesribeprogramsinterms
oftheirlogialalgorithmsandthedatathattheyoperateon. Speialprogramsalled
ompilersread the Cprogram and translateit into assembly language, generating
mahine speiodefrom it. A good ompilerangenerateassembly instrutions
that are verynearlyaseÆientas those written bya good assembly programmer.
MostoftheLinuxkerneliswritten intheClanguage. ThefollowingCfragment:
if (x != y)
x = y ;
performsexatlythesameoperationsasthepreviousexampleassemblyode. Ifthe
ontents of the variable x are notthe same asthe ontents of variable y then the
ontents of ywill beopiedto x. Codeis organizedinto routines, eah of whih
performatask. RoutinesmayreturnanyvalueordatatypesupportedbyC.Large
programsliketheLinuxkernelomprisemanyseparateCsouremoduleseahwith
itsown routinesand datastrutures. These Csoureode modules grouptogether
logialfuntions suh aslesystemhandlingode.
Csupportsmanytypesofvariables,avariableisaloationinmemorywhihanbe
referenedbyasymboliname. IntheaboveCfragmentxandyreferto loations
in memory. Theprogrammerdoesnotarewherein memorythevariables areput,
it is the linker (see below) that has to worry about that. Some variables ontain
dierentsortsofdata,integerandoatingpointandothersarepointers.
Pointers are variables that ontain the address, the loation in memory of other
data. Consideravariablealled x, itmightlivein memoryat address0x80010000.
You ould have a pointer, alled px, whih points at x. px might live at address
0x80010030. Thevalueofpxwouldbe0x80010000: theaddressofthevariablex.
Callowsyoutobundletogetherrelatedvariablesintodatastrutures. Forexample,
strut {
int i ;
isadatastruturealledmystrutwhihontainstwoelements,aninteger(32bits
ofdatastorage)allediandaharater(8 bitsofdata)alledb.
2.1.3 Linkers
Linkersareprogramsthatlinktogetherseveralobjetmodulesandlibrariestoform
asingle,oherent,program. Objetmodules arethemahineodeoutput from an
assemblerorompilerandontainexeutablemahineodeanddatatogetherwith
information that allowsthelinkerto ombine the modules together toform a
pro-gram. Forexampleonemodule mightontainallofaprogram'sdatabasefuntions
andanother module itsommandline argumenthandlingfuntions. Linkersx up
referenes between these objet modules, where a routine or data struture
refer-enedinonemoduleatuallyexistsinanothermodule. TheLinuxkernelisasingle,
largeprogramlinkedtogetherfrom itsmanyonstituentobjetmodules.
2.2 What is an Operating System?
Without softwareaomputerisjust apileof eletronis thatgivesoheat. If the
hardwareistheheartofaomputerthenthesoftwareisitssoul. Anoperatingsystem
isaolletionofsystemprogramswhih allowtheuserto runappliationsoftware.
The operating system abstrats the real hardware of the system and presents the
system's users and its appliations with a virtual mahine. In a very real sense
thesoftware provides theharaterof thesystem. Most PCsan run oneormore
operating systems and eah one an have a very dierent look and feel. Linux is
made up of anumber of funtionally separate piees that, together, omprise the
operatingsystem. OneobviouspartofLinuxisthekernelitself;buteventhatwould
beuselesswithoutlibrariesorshells.
Inordertostartunderstandingwhatanoperatingsystemis,onsiderwhathappens
whenyoutypeanapparentlysimpleommand:
$ ls
Mail images perl
dos tl
$
The$is aprompt put outbyaloginshell (in thisase bash). This meansthat it
iswaitingforyou,theuser,to typesomeommand. Typingls ausesthekeyboard
driver to reognize that haraters have been typed. The keyboard driver passes
themtotheshellwhihproessesthatommandbylookingforanexeutableimage
ofthesamename. Itndsthatimage,in/bin/ls. Kernelserviesarealledtopull
the ls exeutable image into virtual memory and start exeuting it. The ls image
makesalls to the le subsystem of the kernel to nd out what les are available.
The lesystem might make use of ahed lesystem information or use the disk
deviedrivertoreadthis informationfromthedisk. It mightevenauseanetwork
NetworkedFileSystemorNFS).Whiheverwaytheinformationisloated,lswrites
thatinformationoutandthevideodriverdisplaysitonthesreen.
Allof theaboveseemsratherompliatedbut itshowsthatevenmostsimple
om-mandsrevealthatanoperatingsystemisinfatao-operatingsetoffuntionsthat
togethergiveyou,theuser,aoherentviewofthesystem.
2.2.1 Memory management
With inniteresoures,forexamplememory,manyofthethings thatanoperating
system has to do would be redundant. One of the basi triks of any operating
systemistheabilitytomakeasmallamountofphysialmemorybehavelikerather
morememory. Thisapparentlylargememoryisknownasvirtualmemory. Theidea
isthat thesoftwarerunningin thesystemisfooledinto believing thatit isrunning
in alot ofmemory. Thesystemdivides thememoryinto easily handled pagesand
swapsthesepagesontoaharddiskasthesystemruns. Thesoftwaredoesnotnotie
beauseofanothertrik,multi-proessing.
2.2.2 Proesses
A proess ould be thought of as a program in ation, eah proess is a separate
entity that is running a partiular program. If you look at the proesses on your
Linuxsystem,youwillseethatthereareratheralot. Forexample,typingpsshows
thefollowingproessesonmysystem:
$ ps
PID TTY STAT TIME COMMAND
158 pRe 1 0:00 -bash
174 pRe 1 0:00 sh /usr/X11R6/bin/startx
175 pRe 1 0:00 xinit /usr/X11R6/lib/X11/xinit/xinitr
--178 pRe 1 N 0:00 bowman
182 pRe 1 N 0:01 rxvt -geometry 120x35 -fg white -bg blak
184 pRe 1 < 0:00 xlok -bg grey -geometry -1500-1500 -padding 0
185 pRe 1 < 0:00 xload -bg grey -geometry -0-0 -label xload
187 pp6 1 9:26 /bin/bash
202 pRe 1 N 0:00 rxvt -geometry 120x35 -fg white -bg blak
203 pp 2 0:00 /bin/bash
1796 pRe 1 N 0:00 rxvt -geometry 120x35 -fg white -bg blak
1797 v06 1 0:00 /bin/bash
3056 pp6 3 < 0:02 emas intro/introdution.tex
3270 pp6 3 0:00 ps
$
If my systemhad many CPUs then eah proessould (theoretially at least) run
onadierentCPU. Unfortunately,there isonly oneso againtheoperatingsystem
resortstotrikerybyrunningeahproessinturnforashortperiod. Thisperiodof
timeisknownasatime-slie. Thistrikisknownasmulti-proessingorsheduling
and it fools eah proess into thinking that it is the only proess. Proesses are
protetedfromoneanothersothatifoneproessrashesormalfuntionsthenitwill
Devie driversmakeupthemajorpartoftheLinuxkernel. Likeother partsof the
operating system, they operate in a highly privileged environment and an ause
disasteriftheygetthingswrong. Deviedriversontroltheinterationbetweenthe
operating systemand thehardwaredevie that they are ontrolling. Forexample,
thelesystemmakesuseof ageneralblok devie interfaewhenwritingbloksto
an IDEdisk. Thedrivertakesare ofthe details and makes devie spei things
happen. Deviedriversarespeitotheontrollerhipthattheyaredrivingwhih
iswhy,forexample,youneedtheNCR810SCSIdriverifyoursystemhasanNCR810
SCSIontroller.
2.2.4 The Filesystems
In Linux, as it is for Unix TM
, the separate lesystems that the system may use
arenotaessedbydevie identiers(suh asadrivenumberoradrivename)but
insteadtheyareombinedintoasinglehierarhialtreestruturethatrepresentsthe
lesystemasasingleentity. Linuxaddseahnewlesystemintothissinglelesystem
treeas theyaremountedontoamountdiretory,forexample/mnt/drom. Oneof
the mostimportant features of Linux is its support for many dierent lesystems.
Thismakesitveryexibleandwellabletooexistwithotheroperatingsystems. The
most popular lesystemfor Linux is the EXT2lesystem and this is the lesystem
supportedbymostoftheLinuxdistributions.
A lesystemgivesthe userasensible viewofles and diretoriesheld onthehard
disks of the system regardless of the lesystem type or the harateristis of the
underlyingphysial devie. Linuxtransparentlysupportsmanydierentlesystems
(forexampleMS-DOSandEXT2)andpresentsallofthemountedlesandlesystems
asoneintegratedvirtuallesystem. So,in general,usersandproessesdonotneed
toknowwhat sortoflesystemthat anyleispartof,theyjust usethem.
Theblokdeviedrivershidethedierenesbetweenthephysialblokdevietypes
(forexample,IDEandSCSI)and,sofaraseahlesystemisonerned,thephysial
deviesarejustlinearolletionsofbloksofdata. Thebloksizesmayvarybetween
devies, for example512bytes is ommonfor oppy devieswhereas 1024bytes is
ommonforIDEdeviesand,again,thisishiddenfromtheusersofthesystem. An
EXT2lesystemlooksthesamenomatterwhat devieholdsit.
2.3 Kernel Data Strutures
Theoperatingsystemmustkeepalot ofinformationabouttheurrentstateofthe
system. Asthingshappenwithinthesystemthesedatastruturesmustbehanged
to reet the urrent reality. For example, a new proess might be reatedwhen
auser logs onto the system. The kernelmust reate adata struture representing
the new proess and link it with the data strutures representingall of the other
proessesinthesystem.
Mostly these data strutures exist in physial memory and are aessible only by
thekernelanditssubsystems. Datastruturesontaindataandpointers;addresses
has apurpose and although some are used by several kernelsubsystems, theyare
moresimplethantheyappearatrstsight.
UnderstandingtheLinuxkernelhingesonunderstandingitsdatastruturesandthe
use that the various funtions within the Linux kernelmakes of them. This book
bases itsdesriptionoftheLinuxkernelonitsdatastrutures. It talks abouteah
kernelsubsystemintermsofitsalgorithms,itsmethodsofgettingthingsdone,and
theirusageofthekernel'sdatastrutures.
2.3.1 Linked Lists
Linux uses a number of software engineering tehniques to link together its data
strutures. On alot of oasionsit uses linked orhained data strutures. If eah
datastruture desribesasingleinstaneorouraneof something,forexamplea
proessoranetworkdevie,thekernelmustbeabletondalloftheinstanes. Ina
linkedlistarootpointerontainstheaddressoftherstdatastruture,orelement,
inthelistandeahdatastrutureontainsapointertothenextelementinthelist.
Thelastelement'snextpointerwouldbe0orNULLtoshowthatitistheendofthe
list. Inadoubly linkedlisteahelementontainsbothapointertothenextelement
inthelistbutalsoapointertothepreviouselementinthelist. Usingdoublylinked
listsmakesiteasiertoaddorremoveelementsfromthemiddleoflistalthoughyou
doneedmorememoryaesses. Thisisatypialoperatingsystemtradeo: memory
aessesversusCPUyles.
2.3.2 Hash Tables
Linked listsarehandywaysoftying datastruturestogetherbut navigatinglinked
lists an be ineÆient. If you were searhing for a partiular element, you might
easily haveto look at thewhole list before you nd theonethat you need. Linux
uses another tehnique, hashing to get around this restrition. A hash table is an
arrayorvetorofpointers. Anarray,orvetor,issimplyasetofthingsomingone
afteranotherinmemory. Abookshelfouldbesaidtobeanarrayofbooks. Arrays
areaessedbyanindex,theindexisanosetintothearray. Takingthebookshelf
analogya littlefurther, youould desribe eah book byits position on the shelf;
youmightaskforthe5thbook.
A hash table is an array of pointers to data strutures and its index is derived
from information in those data strutures. If you had data strutures desribing
thepopulationof avillage then you oulduse aperson's ageasanindex. Tond
apartiular person's data youould usetheir ageas anindex into thepopulation
hashtableandthenfollowthepointertothedatastrutureontainingtheperson's
details. Unfortunately many people in the village are likely to havethe same age
andsothehashtable pointerbeomesapointertoahainorlistofdatastrutures
eahdesribing people of thesameage. However,searhing these shorterhainsis
stillfasterthansearhingallofthedatastrutures.
As a hash table speeds up aess to ommonly used data strutures, Linux often
useshash tablesto implement ahes. Cahesarehandy informationthat needsto
beaessedquiklyand areusuallyasubsetofthefullsetof informationavailable.
in theahe(thisisknownasaahe hit, thenallwelland good. Ifitannotthen
alloftherelevantdatastruturesmustbesearhedand,ifthedatastrutureexists
atall,itmustbeaddedintotheahe. Inaddingnewdatastruturesintotheahe
an old ahe entry may need disarding. Linux must deide whih oneto disard,
thedangerbeingthatthedisardeddatastruturemaybethenextonethat Linux
needs.
2.3.3 Abstrat Interfaes
TheLinuxkerneloftenabstratsitsinterfaes.Aninterfaeisaolletionofroutines
and data strutures whih operate in a partiular way. For example all network
devie drivershaveto provide ertain routinesin whih partiular data strutures
are operated on. This way there an be generi layers of ode using the servies
(interfaes) of lower layersof spei ode. The network layeris generi and it is
supportedbydeviespeiodethatonformstoastandardinterfae.
Oftentheselowerlayersregisterthemselveswiththeupperlayeratboottime. This
registration usually involvesadding a data struture to a linked list. For example
eah lesystem built into the kernel registers itself with the kernel at boot time
or, ifyou areusing modules, when the lesystem is rst used. You ansee whih
lesystems have registered themselves by looking at the le /pro/filesystems.
Theregistration data strutureoften inludes pointers to funtions. These are the
addressesofsoftwarefuntionsthatperformpartiulartasks. Again,usinglesystem
registrationasanexample,thedatastruturethateahlesystempassestotheLinux
kernelasitregistersinludes theaddressofalesystemspe routinewhihmust
Memory Management
Thememorymanagementsubsystemisoneofthemostimportantpartsof
the operatingsystem. Sine the early days ofomputing, there hasbeen
a need for more memory than exists physially in a system. Strategies
have beendeveloped to overome thislimitationand the mostsuessful
of theseisvirtual memory. Virtualmemorymakesthe systemappear to
havemore memorythanitatually hasby sharingitbetween ompeting
proessesas they need it.
Virtualmemorydoesmorethanjustmakeyouromputer'smemorygofurther. The
memorymanagementsubsystemprovides:
Large AddressSpaes Theoperatingsystemmakesthesystemappearasifithas
alargeramountofmemory thanit atuallyhas. Thevirtualmemoryanbe
manytimeslargerthanthephysialmemoryin thesystem,
Protetion Eah proess in thesystem has its own virtual addressspae. These
virtualaddressspaesareompletelyseparatefromeahotherandsoaproess
running one appliation annot aet another. Also, the hardware virtual
memory mehanisms allow areasof memory to be protetedagainst writing.
Thisprotetsodeanddatafrom beingoverwrittenbyrogueappliations.
MemoryMapping Memory mappingis used to map image and data les into a
proessesaddressspae. Inmemorymapping,theontentsofalearelinked
diretlyintothevirtualaddressspaeofaproess.
Fair Physial MemoryAlloation Thememory managementsubsystem allows
eahrunningproessin thesystemafairshareofthephysial memoryofthe
system,
Shared Virtual Memory Althoughvirtualmemoryallowsproessestohave
sep-arate(virtual)addressspaes,therearetimeswhenyouneedproessestoshare
VPFN 0
Figure3.1: AbstratmodelofVirtualto Physialaddressmapping
thebash ommandshell. Ratherthanhaveseveralopiesof bash,onein eah
proessesvirtualaddressspae,it isbetterto haveonly oneopyin physial
memoryand allofthe proessesrunningbash shareit. Dynamilibrariesare
anotherommonexampleof exeutingodesharedbetweenseveralproesses.
Shared memory analso be used asan Inter Proess Communiation (IPC)
mehanism, with twoor moreproessesexhanging information via memory
ommonto allofthem. Linuxsupports theUnix TM
System Vsharedmemory
IPC.
3.1 An Abstrat Model of Virtual Memory
Before onsidering the methods that Linux uses to support virtual memory it is
usefultoonsideranabstratmodelthatisnotlutteredbytoomuh detail.
Astheproessorexeutesaprogramitreadsaninstrutionfrommemoryanddeodes
it. Indeodingtheinstrutionitmayneedtofethorstoretheontentsofaloation
in memory. The proessorthen exeutes the instrutionand movesonto the next
instrution in the program. In this waythe proessor is alwaysaessing memory
eitherto fethinstrutionsortofethandstoredata.
In a virtual memory system all of these addresses are virtual addresses and not
physialaddresses. Thesevirtualaddressesareonvertedintophysialaddressesby
theproessorbasedoninformationheldinasetoftablesmaintainedbytheoperating
system.
Tomakethistranslationeasier,virtualandphysialmemoryaredividedintohandy
sizedhunksalledpages. Thesepagesareallthesamesize,theyneednotbebut if
theywere not,thesystemwouldbeveryhardtoadminister. LinuxonAlpha AXP
systemsuses 8Kbyte pages andon Intelx86 systemsituses 4Kbyte pages. Eah
of these pages is given a unique number; the page frame number (PFN). In this
theosetandbits12 andabovearethevirtualpageframenumber. Eahtime the
proessorenountersavirtualaddressitmustextrattheosetandthevirtualpage
frame number. The proessor must translate the virtual page frame number into
aphysialone and then aess theloation at theorret oset into that physial
page. Todothistheproessorusespage tables.
Figure3.1showsthevirtualaddressspaesoftwoproesses,proessXandproess
Y, eahwith their own page tables. These page tables map eah proesses virtual
pagesintophysialpagesinmemory. ThisshowsthatproessX'svirtualpageframe
number0ismappedintomemoryinphysialpageframenumber1andthatproess
Y'svirtualpageframenumber1ismappedintophysialpageframenumber4. Eah
entryin thetheoretialpagetableontainsthefollowinginformation:
Validag. This indiatesifthispagetableentryisvalid,
Thephysialpageframenumberthatthisentryisdesribing,
Aessontrol information. Thisdesribeshowthepagemaybeused. Canit
bewrittento? Doesitontainexeutableode?
Thepagetableisaessedusingthevirtualpageframenumberasanoset. Virtual
pageframe5wouldbethe6thelementofthetable (0istherstelement).
Totranslateavirtualaddressintoaphysialone,theproessormustrstworkout
thevirtualaddressespageframenumberandtheosetwithinthatvirtualpage. By
makingthepagesize apowerof2thisanbeeasilydonebymaskingandshifting.
Looking again at Figures 3.1 and assuming a page size of 0x2000 bytes (whih is
deimal 8192) and an addressof 0x2194 in proess Y's virtual addressspae then
theproessorwouldtranslatethataddressintooset0x194into virtualpageframe
number1.
The proessor uses the virtual page frame number as an index into the proesses
page table to retrieveitspage table entry. If thepage table entryat that osetis
valid, the proessor takes the physial page frame number from this entry. If the
entryis invalid, theproesshasaessed anon-existentareaofits virtualmemory.
Inthis ase,the proessorannot resolvetheaddressandmust passontrol tothe
operatingsystemsothatitan xthingsup.
Just how the proessor notiesthe operating system that the orret proess has
attemptedtoaessavirtualaddressforwhihthereisnovalidtranslationisspei
totheproessor. Howevertheproessordeliversit,thisisknownasapage faultand
theoperatingsystemisnotiedofthefaultingvirtualaddressandthereasonforthe
pagefault.
Assumingthatthisisavalidpagetableentry,theproessortakesthatphysialpage
framenumberandmultipliesitbythepagesizetogettheaddressofthebaseofthe
pageinphysialmemory. Finally,theproessoraddsintheosettotheinstrution
ordatathat itneeds.
Usingtheaboveexampleagain,proessY'svirtualpageframenumber1ismapped
tophysialpageframenumber4whihstartsat0x8000(4x0x2000). Addinginthe
0x194byte osetgivesusanal physialaddressof0x8194.
By mapping virtual to physial addresses this way, the virtual memory an be
1whereas virtualpage frame number 7is mapped to physial page frame number
0 even though it is higher in virtual memory than virtual page frame number 0.
Thisdemonstratesaninterestingbyprodutofvirtualmemory;thepagesofvirtual
memorydonothavetobepresentin physialmemoryinanypartiularorder.
3.1.1 Demand Paging
As there is muh lessphysial memorythan virtual memorythe operating system
mustbearefulthat it doesnotusethephysialmemory ineÆiently. Onewayto
savephysialmemoryisto onlyloadvirtualpagesthatareurrentlybeingused by
the exeuting program. For example, a databaseprogram may be run to query a
database. Inthisasenotallof thedatabaseneedsto beloadedinto memory, just
thosedatareordsthat arebeingexamined. Ifthedatabasequeryisasearhquery
thenitdoesnotmakesenseto loadtheodefromthedatabaseprogramthat deals
withaddingnewreords. Thistehniqueofonlyloadingvirtualpagesinto memory
astheyareaessedisknownasdemandpaging.
Whenaproessattemptstoaessavirtualaddressthatisnoturrentlyinmemory
the proessor annot nd a page table entry for the virtual page referened. For
example, in Figure 3.1 there is noentryin proess X'spage table for virtual page
framenumber2andsoifproessXattemptstoreadfromanaddresswithinvirtual
page frame number 2 the proessor annot translate the address into a physial
one. Atthis pointtheproessornotiestheoperatingsystemthatapagefault has
ourred.
Ifthe faulting virtualaddressisinvalid this meansthat theproess hasattempted
toaessavirtualaddressthatitshould nothave. Maybetheappliation hasgone
wronginsomeway,forexamplewritingtorandomaddressesinmemory. Inthisase
theoperatingsystemwill terminateit,protetingtheotherproessesin thesystem
fromthisrogueproess.
Ifthefaultingvirtualaddresswasvalidbutthepagethatitreferstoisnoturrently
inmemory,theoperatingsystemmustbringtheappropriatepageintomemoryfrom
the image on disk. Disk aess takesa long time, relatively speaking, and so the
proess mustwaitquite awhileuntil thepage hasbeenfethed. Ifthere areother
proessesthat ould runthen theoperatingsystem willselet oneof them to run.
The fethed page is written into a free physial page frame and an entry for the
virtualpageframenumberisaddedtotheproessespagetable. Theproessisthen
restartedat the mahine instrution where the memory fault ourred. This time
thevirtual memoryaess is made, theproessoranmakethe virtualto physial
addresstranslationandsotheproessontinuestorun.
Linuxusesdemandpagingtoloadexeutableimagesintoaproessesvirtualmemory.
Wheneveraommand isexeuted, thele ontainingit isopenedand itsontents
aremappedinto theproessesvirtualmemory. This isdonebymodifyingthedata
struturesdesribingthisproessesmemorymapandisknownasmemorymapping.
However,onlytherstpartoftheimage isatuallybroughtinto physial memory.
Therestoftheimageisleftondisk. Astheimageexeutes,itgeneratespagefaults
3.1.2 Swapping
If a proess needs to bring a virtualpage into physial memory and there are no
freephysialpagesavailable,theoperatingsystemmustmakeroomforthispageby
disardinganotherpagefrom physialmemory.
Ifthe pageto be disardedfromphysialmemory amefrom animage ordatale
andhasnotbeenwrittentothenthepagedoesnotneedtobesaved. Insteaditan
bedisarded and ifthe proess needsthat pageagain it anbebroughtbak into
memoryfromtheimageordatale.
However, if the page has been modied, the operating system must preserve the
ontentsofthatpagesothatitanbeaessedatalatertime. Thistypeofpageis
known asadirty pageandwhenitis removedfrom memoryitissavedin aspeial
sortoflealledtheswaple. Aessestotheswapleareverylongrelativetothe
speedofthe proessorandphysial memory andthe operatingsystemmust juggle
theneedtowrite pagestodiskwith theneedtoretainthem inmemoryto beused
again.
Ifthealgorithmusedto deidewhih pagesto disardorswap(the swap algorithm
is noteÆient then aonditionknown asthrashing ours. In thisase, pagesare
onstantlybeingwrittentodiskandthenbeingreadbakandtheoperatingsystem
is too busy to allow muh real work to be performed. If, for example, physial
pageframenumber1inFigure 3.1isbeingregularlyaessed thenitisnotagood
andidate for swapping to hard disk. The set of pages that a proess is urrently
using isalled the working set. An eÆientswapshemewould makesure that all
proesseshavetheirworkingset inphysialmemory.
LinuxusesaLeastReentlyUsed(LRU)pageagingtehniquetofairlyhoosepages
whih mightberemovedfrom the system. This shemeinvolveseverypagein the
systemhavinganagewhihhangesasthepageis aessed. Themorethat apage
isaessed,theyoungeritis; thelessthatitisaessed theolderand morestaleit
beomes. Oldpagesaregood andidatesforswapping.
3.1.3 Shared Virtual Memory
Virtual memorymakesit easyfor several proessesto sharememory. All memory
aess are madevia page tables and eah proess hasits own separate page table.
Fortwoproessessharingaphysialpageofmemory,itsphysialpageframenumber
mustappearin apagetable entryinbothoftheirpagetables.
Figure 3.1showstwoproessesthat eah share physialpage framenumber4. For
proess Xthis isvirtualpage framenumber4whereasforproess Y thisis virtual
pageframenumber6. Thisillustratesaninterestingpointaboutsharingpages: the
sharedphysialpagedoesnothavetoexistatthesameplaeinvirtualmemoryfor
anyoralloftheproessessharingit.
3.1.4 Physial and Virtual Addressing Modes
Itdoesnotmakemuhsensefortheoperatingsystemitselftoruninvirtualmemory.
Thiswouldbeanightmaresituationwheretheoperatingsystemmustmaintainpage
A
Figure3.2: AlphaAXPPageTableEntry
pagetablesandtheproessordoesnotattempttoperformanyaddresstranslations
inthis mode. TheLinuxkernelislinkedtoruninphysial addressspae.
TheAlphaAXPproessordoesnothaveaspeialphysialaddressingmode. Instead,
it divides up the memory spae into several areas and designates two of them as
physiallymapped addresses. ThiskerneladdressspaeisknownasKSEGaddress
spaeanditenompassesalladdressesupwardsfrom0xf0000000000. Inorderto
exeutefromodelinkedin KSEG(bydenition,kernelode)oraessdatathere,
theodemustbeexeutingin kernelmode. TheLinuxkernelonAlphaislinkedto
exeutefrom address0xf0000310000.
3.1.5 Aess Control
Thepage table entries alsoontainaess ontrol information. As theproessoris
alreadyusing thepagetable entrytomap aproessesvirtualaddressto aphysial
one,itaneasilyusetheaess ontrolinformationtohekthat theproessisnot
aessingmemoryinawaythatitshould not.
Thereare manyreasonswhyyouwouldwanttorestritaessto areasof memory.
Somememory,suhasthatontainingexeutableode,isnaturallyreadonly
mem-ory;theoperatingsystemshouldnotallowaproesstowritedataoveritsexeutable
ode. Byontrast,pagesontainingdata anbewrittentobut attemptstoexeute
that memory asinstrutions should fail. Mostproessorshaveat least twomodes
ofexeution: kernel anduser. Youwouldnotwantkernelodeexeutingbyauser
orkernel data strutures to be aessibleexept when the proessor is running in
kernelmode.
TheaessontrolinformationisheldinthePTEandisproessorspei;gure3.2
showsthePTEforAlphaAXP. Thebit eldshavethefollowingmeanings:
V Valid, ifset thisPTEisvalid,
FOE \FaultonExeute",Wheneveranattempttoexeuteinstrutionsinthispage
page,
FOR \Fault on Read", as abovebut page fault on an attempt to read from this
page,
ASM AddressSpaeMath. Thisisusedwhentheoperatingsystemwishestolear
onlysomeoftheentriesfromtheTranslationBuer,
KRE Coderunningin kernelmodeanreadthispage,
URE Coderunningin usermodeanreadthispage,
GH Granularityhintusedwhenmappinganentire blokwithasingleTranslation
Buerentryratherthanmany,
KWE Coderunningin kernelmodeanwrite tothispage,
UWE Coderunninginusermodeanwritetothispage,
page frame number ForPTEswiththeVbitset,thiseldontainsthephysial
PageFrameNumber(pageframenumber) forthisPTE. Forinvalid PTEs, if
this eld is not zero, it ontainsinformation aboutwhere the page is in the
swaple.
Thefollowingtwobitsaredened andusedbyLinux:
PAGE DIRTY ifset,thepageneedstobewrittenouttotheswaple,
PAGE ACCESSED UsedbyLinuxtomarkapageashavingbeenaessed.
3.2 Cahes
Ifyouwereto implementasystemusingtheabovetheoretialmodelthen itwould
work,butnotpartiularlyeÆiently. Bothoperatingsystemandproessordesigners
try hard to extrat more performane from the system. Apart from making the
proessors, memory and so on faster the best approah is to maintain ahes of
usefulinformationanddatathatmakesomeoperationsfaster. Linuxusesanumber
ofmemorymanagementrelatedahes:
Buer Cahe The buer ahe ontains data buers that are used by the blok
devie drivers. These buers are of xed sizes (for example 512 bytes) and Seefs/buffer.
ontainbloks of informationthat have either been read from ablokdevie
or arebeingwritten to it. Ablokdevie isonethat anonlybeaessed by
readingandwritingxedsizedbloksofdata. Allharddisksareblokdevies.
The buer ahe is indexed via the devie identier and the desired blok
numberandisusedtoquiklyndablokofdata. Blokdeviesareonlyever
aessed viathebuerahe. Ifdataanbefoundin thebuerahethenit
doesnotneed to beread from thephysial blok devie, for exampleahard
disk, andaesstoitismuhfaster.
Page Cahe Thisisusedtospeedupaesstoimagesanddataondisk. Itisused See
mm/filemap.
to ahe thelogialontentsofaleapage atatimeand isaessed viathe
Swap Cahe Only modied (or dirty) pages are saved in the swap le. So long Seeswap.h,
mm/swapstate.
mm/swapfile.
as these pages arenotmodiedafter theyhavebeenwritten to theswaple
thenthenexttimethepageisswappedoutthereisnoneedtowriteittothe
swapleas the pageisalready in theswaple. Insteadthepage ansimply
bedisarded. Inaheavilyswappingsystemthis savesmanyunneessaryand
ostlydiskoperations.
Hardware Cahes Oneommonlyimplementedhardwareaheisintheproessor;
aaheofPageTableEntries.Inthisase,theproessordoesnotalwaysread
the page table diretly but instead ahes translations for pages as it needs
them. ThesearetheTranslationLook-asideBuersandontainahedopies
ofthepagetableentriesfrom oneormoreproessesin thesystem.
Whenthereferenetothevirtualaddressismade,theproessorwillattemptto
ndamathingTLBentry. Ifitndsone,itandiretlytranslatethevirtual
addressinto aphysialoneandperformtheorretoperationon thedata. If
theproessorannotndamathingTLBentrythenitmustgettheoperating
systemtohelp. ItdoesthisbysignallingtheoperatingsystemthataTLBmiss
has ourred. A systemspei mehanism is used to deliverthat exeption
to the operating system ode that an x things up. The operating system
generatesanewTLBentryfortheaddressmapping. Whentheexeptionhas
beenleared,theproessorwill makeanotherattemptto translatethevirtual
address. Thistimeitwill workbeausethereisnowavalidentryin theTLB
forthat address.
Thedrawbakofusingahes,hardwareorotherwise,is thatinorderto saveeort
Linux must use more time and spae maintaining these ahes and, if the ahes
beomeorrupted,thesystemwillrash.
3.3 Linux Page Tables
Linux assumesthatthere arethreelevelsofpagetables. EahPageTableaessed
ontainsthe page framenumberof the nextlevel ofPage Table. Figure 3.3shows
howavirtualaddressanbebrokenintoanumberofelds;eaheldprovidingan
oset into a partiular Page Table. To translate avirtual address into a physial
one,theproessormusttaketheontentsofeahleveleld,onvertitintoanoset
into thephysialpage ontainingthe PageTable andread the pageframe number
of the nextlevel of PageTable. This is repeated three times until thepage frame
numberofthephysialpage ontainingthevirtualaddressisfound. Nowthenal
eldin thevirtualaddress,thebyte oset,isusedto ndthedatainside thepage.
Eah platform that Linux runs on must providetranslation maros that allow the
kerneltotraversethepagetablesforapartiularproess. Thisway,thekerneldoes
notneedtoknowtheformatofthepagetableentriesorhowtheyarearranged. This See
inlude/-asm/pgtable.h
issosuessfulthatLinuxusesthesamepagetablemanipulationodefortheAlpha