Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
An energy-efficient system on a programmable chip platform for cloud applications
Xu Wang (a), Yongxin Zhu (a,f,∗), Yajun Ha (b), Meikang Qiu (c), Tian Huang (a), Xueming Si (d), Jiangxing Wu (e)

a School of Microelectronics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, P.R. China
b Department of Electrical and Computer Engineering, National University of Singapore, 21 Lower Kent Ridge Road, Singapore
c Department of Computer Science, Pace University, 1 Pace Plaza, Manhattan, New York City, NY 10038, USA
d Shanghai Key Laboratory of Data Science, Fudan University, 825 Zhangheng Road, Shanghai 201203, P.R. China
e PLA Information Engineering University, P.R. China
f School of Computing, National University of Singapore, Singapore
Article info
Article history:
Received 30 January 2016
Revised 8 October 2016
Accepted 22 November 2016
Available online 30 November 2016
Keywords: Cloud computing; ECG classification; FPGA; Performance analysis; Reconfigurable architectures; Web server

Abstract
Traditional cloud service providers build large data-centers with a huge number of connected commodity computers to meet the ever-growing demand on performance. However, the growth potential of these data-centers is limited by their corresponding energy consumption and thermal issues. Energy efficiency becomes a key issue of building large-scale cloud computing centers. To solve this issue, we propose a standalone SOPC (System on a Programmable Chip) based platform for cloud applications. We improve the energy efficiency for cloud computing platforms with two techniques. First, we propose a massive-sessions optimized TCP/IP hardware stack using a macro-pipeline architecture. It enables the hardware acceleration of pipelining execution of network packet offloading and application level data processing. This achieves higher energy efficiency while maintaining peak performance. Second, we propose an online dynamic scheduling strategy. It can reconfigure or shut down FPGA nodes according to workload variance to reduce the runtime energy consumption in a standalone SOPC based reconfigurable cluster system. Two case studies including a web server application and a cloud based ECG (electrocardiogram) classification application are developed to validate the effectiveness of the proposed platform. Evaluation results show that our SOPC based cloud computing platform can achieve up to 418X improvement in terms of energy efficiency over commercial cloud systems.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
We are living in an era of cloud computing that is the backbone of embedded ubiquitous computing [1] and big data [2]. People can access powerful IT resources (e.g., computation and storage) and convenient services in the cloud via mobile devices at any time and any place. The shift of computing infrastructure from local desktop to the remote cloud improves the utilization of overall computing resources and reduces the costs associated with management of hardware and software resources for individual users.
Many services on the cloud are web-based applications, which incur a large number of concurrent requests, and each of the requests involves a light-weight task that often desires short latencies in response from energy-hungry servers. For existing servers, the performance is typically limited by the overheads of network packet processing and connection management in the NIC and OS kernel [3]. Although traditional cloud service providers such as Google and Amazon build data centers with a huge number of connected commodity computers to meet the demand of cloud computing, the ever-growing energy consumption limits the scale-out of these machines due to the limited energy budget. Moreover, the expenses related to energy consumption and heat dissipation dominate the operating costs in such high-density computing environments. Therefore, energy efficiency (performance per joule) becomes the key issue of building large-scale cloud computing centers [4].

∗ Corresponding author.
E-mail addresses: [email protected] (X. Wang), [email protected], [email protected] (Y. Zhu), [email protected] (Y. Ha), [email protected] (M. Qiu), [email protected] (T. Huang), [email protected] (X. Si).
Reconfigurable computing based on Field Programmable Gate Array (FPGA) technologies possesses the capability of parallel and specialized computation that can effectively exploit the task-level parallelism inherent in cloud computing with relatively low energy consumption [5]. The current trend to take advantage of FPGA in cloud computing lies in two ways: (1) to improve data transfer performance within data centers as a TCP/IP offload engine (TOE) [6]; (2) to accelerate computation-intensive applications as a hardware accelerator [7,8]. In both methods, FPGAs are used as slave devices hosted by commodity CPUs on the master side. However, such a master-slave based architecture is not suitable for web-based cloud applications, since the high communication latency between the master CPU and the slave FPGA limits the throughput of the system.

http://dx.doi.org/10.1016/j.sysarc.2016.11.009
1383-7621/© 2016 Elsevier B.V. All rights reserved.
In this paper, we analyze the inefficiency of master-slave based reconfigurable architectures in dealing with highly concurrent cloud applications by proposing a performance model, and indicate that a standalone SOPC based architecture, which tightly couples network IO handling and data processing in a single FPGA, is a more promising solution in cloud computing with both high energy efficiency and high performance catering to the needs of cloud applications.
Having identified a promising standalone SOPC based architecture, we then present the design and implementation of the SOPC based architecture for cloud computing in detail. The major challenge to implement such an SOPC based architecture is the design of a high performance TCP/IP hardware stack that is able to support high network throughput with highly concurrent connections. Since third party TCP/IP offload engines are primarily optimized for inter-server data transfer among a few TCP connections in data centers, they adopt a one-TCP/IP-session-per-pipeline architecture to achieve high throughput and extremely low latency. However, such TCP/IP engines are hard to scale due to resource limitations in FPGA, which is not suitable for highly concurrent cloud applications. We address this problem by adopting a macro-pipeline architecture, where all TCP sessions share a centralized memory and coarse grained pipeline. Processing modules in the macro-pipeline are operated in asynchronous mode to achieve high system level throughput, while an external SRAM is used to minimize the access latency of TCP connection information that is randomly accessed and modified by different modules concurrently.
One advantage of our design by placing FPGAs on the cloud as standalone entities to provide service is its high energy efficiency at its peak performance. In other words, it supports higher performance within a fixed energy-budget constraint in data centers. Besides the efforts in hardware design, we further propose an online dynamic scheduling scheme to reduce the runtime energy consumption of the whole cluster system. By taking advantage of fast configuration ability, a load balancing node is used to shut down or power up FPGAs according to realtime workload variance in the cloud environment, saving energy without compromising the quality of service to users.

To verify our architecture design and scheduling scheme, two case studies including a web server cluster and a cloud based ECG (electrocardiogram) classification cluster are presented to evaluate the effectiveness of the proposed architecture. Design considerations and implementation details are given to show how this architecture meets the demand of cloud computing in high throughput, high concurrency, low latency, as well as low energy consumption.
The major contributions of this work include:
• Performance analysis of master-slave based reconfigurable architecture and standalone SOPC based reconfigurable architecture using an analytical model.
• A massive-sessions optimized TCP/IP offload engine (described in Section 3.A) that supports up to 100K TCP sessions under 10 Gbps line rate.
• An online dynamic scheduling method (described in Section 3.B) that can reconfigure or power off FPGA nodes according to workload variance to reduce the runtime energy consumption.
• Prototype of a reconfigurable cluster system based on standalone SOPC to provide cloud services, achieving up to 38X speedup in performance and 418X improvement in energy efficiency compared to the software based cloud systems.
• Prototype verification by implementing an I/O intensive web server cluster system with detailed comparison with alternate servers.
• Prototype verification by implementing a both I/O and computation intensive ECG classification cluster system, indicating support for practical cloud applications.
The rest of the paper is organized as follows. Section 2 analyzes related work briefly. Section 3 proposes a performance model to analyze the master-slave based architecture and the standalone SOPC based architecture. Section 4 describes the design details of standalone SOPC based cluster systems for cloud applications. Section 5 presents the design and implementation of two case studies including a web server cluster and a cloud based ECG classification cluster. Section 6 evaluates performance and energy efficiency of the two application cases. Our conclusions are presented in Section 7.
2. Related works
The issue of energy consumption in cloud computing has been receiving increasing attention in recent years [9]. Researchers have considered energy minimization by consolidating the workload into the minimum of physical resources and switching idle computing nodes off, with guaranteed throughput and response time. For example, Chase et al. [10] proposed an economic approach to managing shared server resources, which improves the energy efficiency of server clusters by dynamically resizing the active server set. Similar dynamic provisioning algorithms [11] are studied for long-lived TCP connections as in instant messaging and gaming. Moreover, a queuing model [12] for dynamic provisioning has also been studied to obtain the minimum number of servers, and a combination of predictive and reactive methods has been proposed to determine when to provision these resources. Although such scheduling methods reduce energy consumption by switching idle servers to power saving modes (e.g. sleep, hibernation or power off), the long switch latency, consisting of reinitialization of the system and restoring the context in commercial servers, prevents their usage in real cloud systems, especially when the workloads are unstable. A recent study [13] utilises the power gating technique for FPGA-based accelerators to efficiently reduce the energy consumption of tasks running on a single FPGA. However, as far as we know, there is no related work on dynamic scheduling of FPGA based cloud computing systems to reduce the energy consumption.
To ensure fast changing system configuration and sharing of system resources, Li et al. [14] proposed a rack scale composable system using PCIe switches, which is constructed as a collection of individual resources such as CPU, memory, GPU/FPGA accelerator, disks etc., and composed into workload execution units on demand. Although such a composable system is a promising architecture that can improve the utilization of overall system resources, the relatively high communication latency between master CPU and slave devices (memory, GPU/FPGA) makes it less efficient in dealing with latency-bounded applications such as web-based cloud services. Compared to it, our standalone SOPC based architecture tightly couples network IO handling and data processing in a single FPGA, which significantly reduces such communication latency. A recent study [15] shows that energy consumption in network processing can be a significant percentage of total energy consumption in cloud computing. Neither high-performance nor low-power cores provide a satisfactory energy-performance trade-off [16], since the performance of many cloud applications is typically limited by the per-packet processing overheads in the NIC and OS kernel. Therefore, there have been a few works on porting cloud workloads to FPGA with a tighter integration of the network interface. Sai et al. [16] proposed an FPGA based Memcached appliance, accelerating the low-latency access to large amounts of data for web services with a 1 Gbps network interface, and Michaela et al. [17] extended it to 10 Gbps line rate. They utilized a similar architecture to ours with a tight proximity of network and DRAM to achieve lower latency. However, they used UDP offload engines rather than TCP offload engines to handle network communications, for their minimal overhead to ensure maximum throughput over the network and the limited number of concurrent sessions in third party TCP offload engines. Compared to their work, our architecture is capable of supporting connection-oriented cloud applications by using a novel TCP/IP offload engine that supports 100K concurrent TCP sessions.
As an attractive solution to reduce the host CPU overhead from TCP/IP processing and improve network performance, the TCP/IP offload engine (TOE) has been extensively studied both in academia and industry. Wu et al. [18] implemented a TOE acceleration hardware block and associated TCP firmware, which provided TCP/IP transmission rates up to 296 Mbps as receiver and 239 Mbps as sender. Hankook et al. [19] designed a hybrid architecture for a TCP offload engine, where the transmission and reception paths of TCP/IP are completely separated with the aid of two general embedded processors to process data transmission and reception simultaneously. A complete TOE for FPGAs without processor assist was presented in [20], which supports Gigabit Ethernet in limited hardware resources. The closest related work in TOE is [21], which supports 10K concurrent connections in VC709 with 10 Gbps bandwidth. For each connection, this design allocates fixed-sized data buffers in the external SDRAM and maintains fixed-sized session information in on-chip block RAMs. Compared with it, our TOE avoids using private sending buffers for each connection, allowing sending data to be shared among multiple connections, which saves the memory space and time caused by multiple copies to private buffers. Also, we use an event-triggered connection management in external SRAM. As a result, our TOE can support more TCP sessions (100K compared with 10K in [21]) with similar performance and resource consumption. Recently, solutions from industry have already extended the throughput of TOE to 10 Gbps–40 Gbps [22,23]. Up until 2014, all of the available TOEs were ultra-low latency and low session count (below 256 concurrent sessions), with the key driving applications being high-frequency trading and inter-server data transfer in data centers. As a result, these third party solutions typically support one TCP/IP session per instantiated TOE, which limits the number of concurrent TCP sessions due to the FPGA resources. In 2014, Intilop announced a high session count variant supporting up to 16,000 sessions. However, as Intilop has not disclosed the design, it is hard for us to make a detailed architectural comparison at this stage.
3. Performance analysis of two reconfigurable architectures

3.1. Master-slave based architecture
Many reconfigurable systems using FPGA are built as master-slave systems [8,24,25]. In each computation node, a general purpose processor based computer works as master, which runs the operating system and handles communications and data transfer with the slave systems. Slave systems consist of I/O devices including hard drives and network interfaces, as well as reconfigurable logic based on FPGAs. Communications between master side and slave side are delivered via high speed onboard interconnections such as PCI-E.
According to Amdahl's law [26], an arbitrary algorithm can be divided into portions that can be made parallel and portions that cannot be parallelized, i.e., that remain serial, and the speedup of an algorithm in parallel computing is limited by the time needed for the sequential fraction of the algorithm. Modern reconfigurable devices can provide abundant computing resources to accelerate the parallel portion of an algorithm. However, performance speedup may not always be achieved when there are many long-latency control communications and I/O operations, which increase the sequential portion.
To reveal such architectural limitations in parallel implementation of an application in reconfigurable systems, we further divide the sequential portion into three parts: serial computation (S), control I/O operation (CIO), and data I/O operation (DIO), based on which we define the time consumed by each part as T_s, T_cio, and T_dio respectively. Also we define T_p as the time spent on parallel computation; then the overall executing time T in a complete synchronous system is given by:

T = T_s + T_p + T_cio + T_dio. (1)
Considering optimization techniques like overlapping I/O and processing, we can also define the overall executing time T in a complete asynchronous system as:

T = max{T_s, T_p, (T_cio + T_dio)}. (2)

However, real reconfigurable systems fall somewhere in between Eq. (1) and Eq. (2). For simplicity, we use Eq. (2) to analyze the two reconfigurable systems. Since the serial computation portion of an application is in most situations executed on a processor, we can calculate T_s as:

T_s = S / (f_s × Σ_i (N_i / CPI_i)), (3)
where S represents the workload of the serial computation portion, which is defined by the number of instructions. N_i is the number of instructions that can be issued simultaneously by functional unit type i, CPI_i is the average number of cycles per instruction for functional unit type i (such as DSP, ALU), and f_s is the operating frequency of the device. Since S is defined as the portion that cannot be parallelized, we simplify Eq. (3) as:

T_s = (S × CPI) / f_s. (4)
In reconfigurable systems, the parallel computation part of an application is most likely executed by hardware logic in FPGAs, so the value of T_p can be calculated as:

T_p = (P × CPP) / (R × f_p), (5)

where P represents the workload of the parallel computation portion, which is defined by the number of operations. R is the number of reconfigurable resources available for parallel computation, for example, the total number of Int32 IP cores of adders and multipliers that an FPGA can support. CPP is the average number of clock cycles per operation per reconfigurable resource and f_p is the operating frequency.
The process of control involves a sequence of message passing operations between master CPU and slave devices. Although modern high speed interconnections are able to provide an extremely high throughput for data transmission, they bring a relatively high latency. Since such control messages always contain a small volume of payload and a control process includes several rounds of message passing, the communication delay becomes the dominant factor in T_cio:

T_cio = N_c × D_c. (6)

Eq. (6) gives the definition of T_cio, where N_c is the number of communications for control operations and D_c is the delay of such communications.
We assume data I/O operations are all large data transmissions; small data transmissions can be treated as control operations. The throughput of the I/O interface, rather than communication latency, becomes the major factor. Since it takes more than one data copy operation via different I/O interfaces to feed data to the execution unit, the time consumed by I/O operations can be defined as:

T_dio = Σ_j (D / θ_j), (7)

where D is the size of the data to be transferred and θ_j is the maximum throughput of each interface j where a data copy/transfer operation occurs. To sum up, Eq. (2) is specified as:
T = max{ (S × CPI) / f_s, (P × CPP) / (R × f_p), N_c × D_c + Σ_j (D / θ_j) }. (8)
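As a quick numerical illustration, Eq. (8) can be evaluated directly. The sketch below is ours, not the paper's; every parameter value (instruction counts, frequencies, latencies, throughputs) is an illustrative assumption chosen only to contrast a high-D_c master-slave node with a low-D_c standalone SOPC node.

```python
# Sketch of the asynchronous performance model of Eq. (8).
# All parameter values are illustrative assumptions, not measurements.

def exec_time(S, CPI, fs, P, CPP, R, fp, Nc, Dc, D, thetas):
    """Overall executing time T per Eq. (8): in a fully asynchronous system
    the slowest of the serial, parallel and I/O portions dominates."""
    Ts = S * CPI / fs                            # serial computation, Eq. (4)
    Tp = P * CPP / (R * fp)                      # parallel computation, Eq. (5)
    Tio = Nc * Dc + sum(D / t for t in thetas)   # control + data I/O, Eqs. (6)-(7)
    return max(Ts, Tp, Tio)

# Hypothetical workload: master-slave (many control messages, high Dc,
# two data-copy hops) vs. standalone SOPC (few messages, on-chip latency).
common = dict(S=1e6, CPI=1.2, fs=2e9, P=1e9, CPP=1.0, R=512, fp=2e8, D=1e6)
t_ms   = exec_time(Nc=10000, Dc=5e-6, thetas=[1e9, 5e8], **common)   # I/O-bound
t_sopc = exec_time(Nc=100,   Dc=1e-7, thetas=[1e9],      **common)   # compute-bound
print(t_ms, t_sopc)
```

Under these assumed numbers the master-slave node is dominated by the N_c × D_c control term, while the SOPC node is limited only by its parallel computation term, mirroring the argument of Section 3.1.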
With the definition given by Eq. (8) and considering the energy consumption of the whole system including both general purpose processors and FPGAs, we can infer that the applications that perfectly match the master-slave reconfigurable system, thus achieving high energy efficiency, must have the following properties: 1) a comparative portion of serial computation workload that can keep the general purpose processors in high utilization; 2) few control interactions between master and slaves; 3) the data from the I/O interface must satisfy the needs of parallel processing in FPGAs.

However, many services on the cloud are web-based applications, which include a large number of concurrent tasks, each of which is light-weight. In other words, these applications contain a very small portion of serial computation-intensive workload but rather a large part of parallel computation coupled with complex control and I/O operations. In this case, the master-slave based reconfigurable systems become inefficient. Since D_c is large in master-slave systems, when N_c increases to a certain threshold, the time spent in message interaction between host and FPGA devices may consume much more time than that for parallel computation in FPGAs. Also, when there is a bottleneck in the chain of transferring data from I/O to FPGA, or large I/O delays caused by too many data copy operations, poor performance would be the consequence because of the insufficient data supply for the parallel computational cores.
3.2. SOPC based architecture
The SOPC based reconfigurable architecture integrates the functionality provided by the master side in a master-slave system into FPGAs. General purpose processors with high performance and high energy consumption are replaced by embedded microprocessors, while off-chip interconnects are replaced by on-chip interconnects with low latency. I/O facilities are directly connected to FPGAs, where controller circuits of such I/O interfaces are implemented as IP blocks. As a result, it is more suitable for cloud based applications for the following reasons:

1. It narrows the distance between control unit and computational unit by using direct message passing interfaces such as FIFOs or shared memories, so that it decreases the control latency (D_c).
2. Data from I/O interfaces are directly fed to computational logic via raw interfaces with extremely high throughput. As such, the architecture decreases the number of data copy operations and eliminates the data transmission bottleneck that may exist in the master-slave architecture. Also, direct I/O access with hardware control logic can offload some of the control workload that would otherwise be done by processors. As a result, it reduces N_c.
Fig. 1. SOPC based architecture for Cloud applications in single FPGA. (Block diagram: embedded microprocessor and AXI interconnect; memory subsystem with two DDR3 controllers and an SRAM controller; I/O subsystem with 10G Ethernet XAUI interface, TCP/IP stack, SATA SSD storage, and slow I/O peripherals TIMER, INTC, UART, MDM; user defined application logic.)
3. Using embedded microprocessors instead of general purpose processors in applications that incur little serial computational workload can dramatically reduce the overall energy consumption of the system, while it still provides identical functionalities with acceptable performance.
4. Design details of SOPC based reconfigurable clusters for cloud applications
The inherent nature of cloud computing imposes more demand on the cloud-side system for high throughput network I/O, high performance computing capability and fast storage. In order to meet the demand in a standalone FPGA, we designed a standalone SOPC based architecture for cloud service that tightly couples network IO handling and data processing in a single FPGA, which is illustrated in Fig. 1. It includes an I/O subsystem, a memory subsystem, an embedded microprocessor and a user defined application logic. In the I/O subsystem, network protocols from physical layer to transport layer are integrated in a hardware pipeline to achieve high throughput and low latency. It supports 10Gb optical ethernet via a XAUI interface as well as the entire TCP/IP stack. We also integrate a SATA controller to support non-volatile mass storage devices. In the memory subsystem, we use two types of external memory for different purposes: two DDR3 SDRAMs, which provide 16GB memory space in total, are used as data memory to store bulk data such as upstream and downstream data, for their high storage density and high burst throughput; one 72Mb SRAM is used to store data structures of TCP sessions for its low access latency. Moreover, an embedded microprocessor is incorporated with AMBA AXI4 interconnects to provide 3 major functions as follows: (1) initializing the whole system by configuring registers of system modules and peripherals; (2) providing an AHCI based driver for the SATA controller; (3) maintaining a file system based on SSD. Unlike the former-described common components, the user defined application logic is unique for different cloud applications, which implements the application level algorithm in cloud service in specialized hardware pipelines.
4.1. Massive-sessions optimized TCP/IP stack
In the scenario of cloud applications, the major disadvantage of third party TCP/IP offload engines that would outweigh their advantages is that they are primarily optimized for massive data volume transmission rather than the number of concurrent connections to boast their support for bandwidths of 10+ Gbps, where each TOE only handles no more than hundreds of TCP connections concurrently. As a result, they adopt a one-TCP/IP-session-per-instantiated-pipeline architecture [22,23], where each session includes full TCP function as well as private TX and RX buffers using on-chip BRAMs.

Fig. 2. Micro-architecture of TCP/IP stack. (Block diagram: 10G Ethernet controller feeding an IP datagram receive engine and TCP packet classifier; dedicated engines for connection request, data acknowledgment, and disconnection request/accept processing; connection management, retransmission, data fragment, TCP packaging and IP encapsulation engines with FIFO groups between stages; TCP sessions in SRAM; upstream/downstream data in DDR3; user defined application logic.)
Multi-sessions can be supported by instantiating multiple copies of such a pipeline. However, web based cloud computing is quite a different application scenario in that the TOE needs to maintain more than ten thousand concurrent connections, where each connection only contributes a small portion of the total throughput. Thus these third party TOEs no longer satisfy such requirements due to the limited on-chip resources available to support more TCP sessions. To address the problem, we propose a massive-sessions optimized TOE. We use a macro-pipeline architecture, where all TCP sessions share a coarse-grained pipeline with carefully organized TX and RX buffers. Also, we develop a hardware based connection management to support rapid scheduling of up to 100K TCP sessions. We show our architecture in Fig. 2 with the detailed explanations given in the following subsections.
4.1.1. Macro-pipeline architecture
In order to achieve high system throughput, all processing modules in the macro-pipeline are operated in an asynchronous way, each of which handles a portion of a processing stage or directly responds to a specific type of TCP packet. These processing modules are event triggered, that is, they continuously read the events or messages from the input FIFO, process the events, and send messages to the next stage without any synchronizations. Specifically, for incoming data processing, data from the network are first offloaded from 10G Ethernet and unpacked by the IP datagram receive engine to obtain TCP packets. Then these TCP packets are divided into different categories based on the control bits in their TCP headers and delivered to different processing engines respectively. Each processing engine deals with a particular type of TCP packet such as a SYN packet, FIN packet or ACK packet and directly generates feedback packets in response. The connection management engine deals with the connection and disconnection of TCP links and schedules the data transmission order among all established links. For outgoing data processing, the data fragment engine segments data that is to be transmitted into small data packages. Then these data packages are encapsulated with TCP header and IP header successively by the TCP packaging engine and IP encapsulation engine. When a timeout occurs, the retransmission engine retransmits the lost TCP packets based on the information from the most recent data ACK number.

Fig. 3. Memory allocation and scheduling of TCP connections.
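The event-triggered flow above can be mimicked in software: each stage owns an input FIFO, pops one event, and pushes its result to the next stage with no global handshake. A minimal Python analogy of the outgoing path (stage names follow Fig. 2; the payloads, the 1460-byte cut and the round-robin stepping loop are illustrative assumptions, not the hardware design):

```python
# Software analogy of the event-triggered macro-pipeline stages.
from collections import deque

class Stage:
    """One macro-pipeline stage: an input FIFO plus an event handler."""
    def __init__(self, name, handler, downstream=None):
        self.name, self.handler, self.downstream = name, handler, downstream
        self.fifo = deque()                 # input FIFO of this stage

    def step(self):
        """Consume one pending event, if any, and forward the result
        downstream -- no synchronization between stages."""
        if self.fifo:
            out = self.handler(self.fifo.popleft())
            if out is not None and self.downstream is not None:
                self.downstream.fifo.append(out)

# Outgoing path of Fig. 2: data fragment -> TCP packaging -> IP encapsulation.
sink = Stage("ethernet_tx", lambda pkt: None)
encap = Stage("ip_encapsulation", lambda seg: ("ip_pkt", seg), downstream=sink)
tcp_pkg = Stage("tcp_packaging", lambda f: ("tcp_seg", f), downstream=encap)
frag = Stage("data_fragment", lambda data: ("frag", data[:1460]),  # MSS-sized cut
             downstream=tcp_pkg)

frag.fifo.append(b"x" * 4000)               # data-ready event from user logic
for _ in range(3):                          # each stage advances independently
    frag.step(); tcp_pkg.step(); encap.step()
```

In hardware the three stages run concurrently on different packets, which is what gives the macro-pipeline its system-level throughput.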
4.1.2. Connection management
Unlike conventional TOEs that support only no more than hundreds of TCP sessions, supporting more than ten thousand TCP sessions in our TOE is no trivial task. Due to the limited on-chip memory resources, connection information can only be stored in external memory, while the shared memory may become the bottleneck of the system when it incurs random accesses from different processing modules concurrently. We mitigate the impact of the problem in two ways: by reducing access latency and by reducing the number of memory accesses. We first choose SRAM instead of DRAM to store TCP sessions for its much lower access latency. Then, as described in Fig. 3, the memory space of the external SRAM is segmented into a number of consecutive TCP connection slots, where we use a hash mechanism with source IP address and port number as the key for fast lookup of a dedicated TCP connection. We avoid using the round-robin scheduling scheme, which is widely used in TOEs that support a small number of TCP sessions, to schedule the execution of data transmission among established TCP connections, because in a web-based cloud computing scenario with a large number of concurrent connections, it is time consuming to check every connection slot to find an active connection that is able to send data. Instead, we maintain two linked lists to quickly locate active connections and timeout connections. The active linked list is used to store TCP connections that have an event to be processed. The connection management engine continuously obtains the connection at the head, executes the event, then puts the connection at the tail of the timeout linked list. When an event (new ACK received, new data received or new data to send) for a certain connection occurs, the connection is inserted at the tail of the active linked list. On the other hand, the timeout linked list is used for monitoring the transmission timeout of idle connections. Idle connections are inserted at the tail of the timeout linked list with a timeout point that adds a fixed timeout value to the current time. Since the timeout linked list is naturally sorted by timeout point, the retransmission engine only needs to check the head of the list to obtain the most timed-out connections, avoiding recurrently traversing the whole set of established TCP connection slots, which would impose access burdens on the shared memory.
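The hashed-slot lookup and the two-list scheduling above can be sketched in software. This is an illustrative analogy only: Python dicts stand in for the SRAM connection slots and hardware linked lists, and the slot count, field names and timeout value are assumptions.

```python
# Sketch of hashed connection slots plus active/timeout linked lists.
from collections import OrderedDict
import time

NUM_SLOTS = 1 << 17                      # illustrative: room for ~100K sessions

def slot_index(src_ip: str, src_port: int) -> int:
    """Hash the (source IP, port) key to a connection slot, as in Fig. 3."""
    return hash((src_ip, src_port)) % NUM_SLOTS

class ConnectionManager:
    def __init__(self, timeout=5.0):
        self.slots = {}                  # slot index -> connection state
        self.active = OrderedDict()      # connections with a pending event
        self.idle = OrderedDict()        # connections awaiting timeout
        self.timeout = timeout

    def on_event(self, key):
        """New ACK / new data / data-to-send: move to tail of active list."""
        self.idle.pop(key, None)
        self.active[key] = None
        self.active.move_to_end(key)

    def service_one(self, now=None):
        """Pop the head of the active list, then park it on the idle list
        with deadline = now + fixed timeout (list stays deadline-sorted)."""
        if not self.active:
            return None
        key, _ = self.active.popitem(last=False)
        now = time.monotonic() if now is None else now
        self.idle[key] = now + self.timeout
        return key

    def most_overdue(self, now):
        """The retransmission engine only peeks at the idle-list head."""
        if self.idle:
            key, deadline = next(iter(self.idle.items()))
            if deadline <= now:
                return key
        return None

cm = ConnectionManager(timeout=5.0)
idx = slot_index("10.0.0.1", 40000)
cm.slots[idx] = {"state": "ESTABLISHED"}
cm.on_event(idx)
served = cm.service_one(now=100.0)       # served, parked until t = 105.0
```

Because both lists are only ever read at the head and appended at the tail, each scheduling decision touches a constant number of memory words regardless of how many of the 100K slots are occupied.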
4.1.3. TX, RX buffer allocation
Due to limited on-chip resources, it is impossible to offer each TCP session a private TX and RX buffer using BRAMs. Instead, we directly store the data in DDR3 external memory. However, the organization of upstream data and downstream data for each TCP connection is different. For upstream data, we maintain a buffer pool (a continuous memory space in DDR3 SDRAM) that is divided equally into segments. The size of each segment can be programmed according to the needs of the application scenario. A buffer allocation stack is used to manage the allocation and reclaim of the buffer resources. When a new connection is established, the connection management engine acquires a new buffer from the buffer pool, and registers it in the connection information. Then upstream data can be offloaded to the buffer, where it can be directly accessed by user defined logic. During TCP disconnection, the buffer is recycled back to the buffer pool. For downstream data, however, our TCP stack does not assign a specific TX buffer to each established TCP connection. Rather, we leave the user defined logic to organize the data to be sent. Specifically, the user defined logic can trigger the TCP stack to send data by registering one or more data ready events, each of which contains the starting address and the length of prepared data in DDR3 SDRAM. This leads to a series of benefits for performance: (1) the user defined logic need not wait in case the send buffer is full; (2) it eliminates the data copy operations from application level to transport level; (3) it supports multiple data transmissions without the involvement of user defined logic; (4) many TCP connections can share the same data to be sent (e.g. users' requests for the same web page in the web server application), which saves the memory space and time caused by multiple copies to private send buffers.
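The upstream buffer pool and its allocation stack can be sketched as follows. The base address, segment size and pool depth below are illustrative assumptions; in the real design the pool lives in DDR3 SDRAM and the stack in hardware.

```python
# Sketch of the upstream buffer pool: a contiguous region divided into
# equal segments, with a free-stack managing allocation and reclaim.
class BufferPool:
    def __init__(self, base_addr=0x0000_0000, seg_size=64 * 1024, num_segs=1024):
        self.seg_size = seg_size
        # Allocation stack holds the start address of every free segment.
        self.free_stack = [base_addr + i * seg_size for i in range(num_segs)]

    def acquire(self):
        """On connection setup: pop a free segment; its address is then
        registered in the connection information."""
        return self.free_stack.pop() if self.free_stack else None

    def release(self, addr):
        """On disconnection: recycle the segment back to the pool."""
        self.free_stack.append(addr)

pool = BufferPool(num_segs=4)
buf = pool.acquire()        # bound to a newly established connection
pool.release(buf)           # reclaimed when the connection closes
```

A stack keeps allocation and reclaim at O(1) with a single pointer update, which is what makes it cheap to implement in control logic.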
4.2. Online dynamic scheduling
Data centers are typically designed to provide processing capacity for potential peak workload. However, the actual request load imposed on data centers changes over time. It is a waste of energy to activate all of the processing power when the system is at a low load. Fortunately, the standalone SOPC based architecture, where every FPGA works as an independent system, offers high flexibility in scheduling, which can be used to address the problem. Since the configuration and shutdown time of an FPGA is negligible compared to that of a general-purpose computer, it is possible to adjust the number of activated computing nodes according to realtime workload by directly powering up or down FPGAs in the cloud environment to save energy, whereas a general-purpose cluster must keep most of its machines awake.

Fig. 4. Online dynamic scheduling in standalone SOPC based reconfigurable clusters.
Dueto ourimpressively shortreconfiguration time inseconds (lessthantwoseconds)ratherthanminutes,whichmostofFPGA userstaketoreconfigureFPGAviaJTAGcables,aswestorerecon- figuration bits in NOR flash connectedto FPGA via BPI interface, we can exploit the advantages of a lot of dynamic load balanc- ing methods on web-based systems [27], many ofwhich can be adoptedinourarchitecture. WechoseHTTPredirectiontechnique to route incoming requests into our FPGA based cloud servers.
Fig. 4 demonstrates the process of such dynamic scheduling. First of all, the user on the client side connects and sends an HTTP GET request to the load balancing node. The load balancing node picks a running FPGA that has the least workload and sends its IP address back to the client side as an HTTP 302 redirection response. Then, the user sends the request to the allocated FPGA to receive services on the cloud side.
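The redirection step can be sketched as follows; the workload map, IP addresses, and response format are hypothetical stand-ins for the load balancing node's internal state:

```python
def pick_least_loaded(workloads):
    """Pick the running FPGA with the least workload.
    `workloads` maps FPGA IP -> current throughput (hypothetical monitoring data)."""
    return min(workloads, key=workloads.get)

def redirect_response(fpga_ip, path="/"):
    """Build the HTTP 302 response that redirects the client to the chosen FPGA."""
    return (
        "HTTP/1.1 302 Found\r\n"
        f"Location: http://{fpga_ip}{path}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )
```

After receiving the 302 response, the client re-issues its GET request directly to the returned FPGA address, so the load balancing node never touches the file payload.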
The load balancing node periodically (a five-second interval, for instance) gathers the running status of each FPGA to record the workload distribution of the whole system. Note that adding communication support between the load balancing node and the FPGA would cause additional chip area usage. However, in our system, we can reuse the network data path we have already built to support online dynamic scheduling on the FPGA side. The polling action is simply done by sending a unique HTTP GET request to each FPGA, which is processed by our TCP/IP stack. The on-chip microprocessor then collects the running status data and sends it back as the HTTP response. All that is needed are several registers to store the running status and a small piece of code running on the on-chip microprocessor to deal with the unique HTTP GET request. The running status and scheduling algorithm can vary for different applications. In our implementation, the average system throughput of each FPGA during the last time interval is obtained to monitor the loading condition. We define Ti as the current throughput of FPGA i, Tm as the maximum throughput that a single FPGA can afford while providing services under a certain QoS (Quality of Service) requirement, and N as the current number of total active FPGAs. The system level load rate is then defined as:
\[ L_{current} = \frac{\sum_{i=1}^{N} T_i}{N \cdot T_m} \tag{9} \]
Common practice (defined as the basic algorithm) uses an upper bound Lu and a lower bound Ll of the system level load rate to decide whether to increase or decrease the number of activated FPGAs. When the system load rate L_current drops below the lower bound Ll, the load balancing node first stops routing user requests to the FPGA that carries the least workload, and informs the onboard configuration module to power off that FPGA when it becomes idle. On the contrary, when the system load rate L_current rises over the upper bound Lu, the load balancing node informs the onboard configuration module to power up and configure a resting FPGA to provide service.
However, the parameter setting of the scheduling upper bound Lu and lower bound Ll is not trivial. A higher Lu increases the possibility of service failure, while a lower Lu leads to a waste of performance. Likewise, a higher Ll may cause oscillation between wake-up and power-off, and a lower Ll gives rise to a waste of energy. To address this problem, we introduce the following two optimizations to the basic algorithm:
4.2.1. Predictive scheduling
At every scheduling point, we first predict the system load of the next scheduling period (L_next) by quadratically fitting the system load data of the last three samples. Specifically, we define S(−j) = Σ_{i∈ActiveList(−j)} T_i(−j) as the total throughput of all FPGAs at the last j-th scheduling point, where S(0) stands for the current total throughput. Then we can compute the system load of the next scheduling period as

\[ L_{next} = \frac{S(-2) - 3S(-1) + 3S(0)}{|ActiveList| \cdot T_m}. \]

We use the maximum of L_next and L_current to decide whether to activate more FPGAs. Compared to the basic algorithm, which tends to choose a moderate Lu to avoid service failure when the system encounters explosive growth of load, our optimization can predict the trend of load variation and make adjustments in advance. As a result, we can use a higher Lu to make the whole system more energy efficient without risking service failure.
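The extrapolation formula follows from fitting a parabola through three equally spaced samples and evaluating it one step ahead; a minimal sketch (function names are ours):

```python
def predict_next(s_minus2, s_minus1, s_0):
    """Quadratic extrapolation: fit a parabola through S(-2), S(-1), S(0)
    (equally spaced) and evaluate at the next scheduling point, which
    simplifies to S(-2) - 3*S(-1) + 3*S(0)."""
    return s_minus2 - 3 * s_minus1 + 3 * s_0

def load_next(s_minus2, s_minus1, s_0, n_active, t_m):
    """L_next as defined above: predicted total throughput over capacity."""
    return predict_next(s_minus2, s_minus1, s_0) / (n_active * t_m)
```

For samples lying exactly on a parabola, e.g. S(t) = t^2 giving S(-2)=4, S(-1)=1, S(0)=0, the prediction 4 - 3 + 0 = 1 equals the true next value.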
4.2.2. Sliding window
Since the overall load on the cloud side is the sum of massive individual requests, it tends to change smoothly and remain stable with small fluctuations most of the time. Therefore, it is important to improve the energy efficiency when the system is in such a stable state. However, the basic algorithm can hardly guarantee high energy efficiency during the stable period, as it tends to choose a low Ll to avoid oscillation between wake-up and power-off. For example, if Ll is set to 50% and Lu to 80% while the system load fluctuates around 60% ± 8%, the basic algorithm would never decrease the number of activated FPGAs to improve system utilization.
To address this problem, we set up a sliding window of size k and keep track of the maximum value of the whole system throughput, S_max = max{S(0), S(−1), ..., S(−k+1)}, among the last k samples. By assuming that, with high probability, the load during the next scheduling period is no more than S_max, we can reduce the number of activated FPGAs to the value that keeps the system at the highest utilization under the upper bound Lu, which is calculated as ⌈S_max / (T_m · L_u)⌉. Although this optimization increases the response latency of powering off FPGAs when the system load rate L_current drops, it significantly improves the energy efficiency when the system is in a stable state.
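Under these assumptions, the number of FPGAs needed during a stable period can be computed as follows (variable names are illustrative; keeping at least one node active is our added safeguard):

```python
import math

def target_active_count(history, t_m, l_u, k):
    """Size the active set for the peak total throughput seen in the last k
    samples, so utilization stays as close to (but under) Lu as possible."""
    s_max = max(history[-k:])
    # ceil(S_max / (Tm * Lu)); never drop below one active FPGA
    return max(1, math.ceil(s_max / (t_m * l_u)))
```

With T_m = 8.3 Gbps and Lu = 0.8, each FPGA may carry about 6.64 Gbps, so a recent peak of 35 Gbps requires six active nodes.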
More specifically, we describe our algorithm in pseudo-code as shown in Algorithm 1.
5. Case studies
In this section, we use two case studies to verify the effectiveness of our design and implementation of cloud based applications using the proposed architecture. The web server application, which plays a fundamental role in the current cloud computing infrastructure, is chosen as the base case study. Since the web server is the major channel connecting users and cloud services, we use this case study to evaluate how the architecture meets the requirements of low delay, high bandwidth and high-concurrency I/O performance in a cloud environment. The second case study is a cloud based ECG (electrocardiogram) classification, which is an important bio-cloud application [28]. Unlike the web server application, which is only I/O-intensive,
Algorithm 1 Online dynamic scheduling.
Input: ActiveList — set of activated FPGAs
Input: IdleList — set of ready-to-power-off FPGAs
Input: PowerOffList — set of powered-off FPGAs
Input: interval — scheduling interval
Input: k — window size
Input: Lu — scheduling upper bound
1:  loop
2:    for all FPGA i in ActiveList ∪ IdleList do
3:      gather current throughput Ti
4:    end for
5:    S(0) ← Σ_{i∈ActiveList} Ti
6:    Lcurrent ← S(0) / (|ActiveList| · Tm)
7:    Lnext ← (S(−2) − 3S(−1) + 3S(0)) / (|ActiveList| · Tm)    ▷ quadratic fitting
8:    Ltarget ← max{Lcurrent, Lnext}
9:    if Ltarget > Lu then    ▷ activate new FPGAs
10:     repeat
11:       if IdleList is not empty then
12:         node ← IdleList.maxT_pop()
13:         ActiveList.push(node)
14:       else if PowerOffList is not empty then
15:         node ← PowerOffList.pop()
16:         reconfigure FPGA node
17:         ActiveList.push(node)
18:       end if
19:       Ltarget ← (|ActiveList| / (|ActiveList| + 1)) · Ltarget
20:     until Ltarget < Lu
21:   else
22:     Smax ← max{S(0), S(−1), ..., S(−k+1)}
23:     Nt ← ⌈Smax / (Tm · Lu)⌉
24:     while |ActiveList| > Nt do    ▷ deactivate FPGAs
25:       node ← ActiveList.minT_pop()
26:       IdleList.push(node)
27:     end while
28:   end if
29:   for all FPGA i in IdleList do    ▷ power off FPGAs
30:     if Ti == 0 then
31:       IdleList.delete(i)
32:       power off FPGA i
33:       PowerOffList.push(i)
34:     end if
35:   end for
36:   wait(interval)
37: end loop

the cloud based ECG classification is both I/O-intensive and computation-intensive. We use this case study to demonstrate how the architecture handles realistic cloud-based workloads. Since we have already presented the common modules of the SOPC based architecture in detail, in this section we mainly focus on the design of the user defined application logic.
5.1. Web server
The primary function of a web server is to deliver web pages or files to clients. The communication between client and server takes place using the Hypertext Transfer Protocol (HTTP). A user agent, commonly a web browser, initiates communication by making a request for a specific resource via HTTP, and the server responds with the content of that resource. In our design (Fig. 5), the user defined application logic contains two main modules: a URL parser and a data locating module. To narrow the bandwidth gap between network I/O and SSD storage I/O, a cache is maintained in
Fig. 5. Architecture of the user defined application logic for webserver.
the DDR3 memory. For simplicity, we use the embedded microprocessor to assist cache management, while a hash table in on-chip block RAM is used by the hardware to quickly locate the cached address of a requested file. When the TCP/IP stack offloads the upstream data, the URL parser reads the HTTP GET request from the DDR3 memory and parses it to obtain a file name. Then the data locating module checks whether the requested file is cached in DDR3 memory using the hash table. If so, the data locating module directly signals the data fragment engine by writing the address and length information of the cached data to the corresponding data send message, without copying the data to a buffer; otherwise, it notifies the embedded microprocessor to move the data from the SSD to the cache and then repeats the previous operations. Due to the limited space of the paper, we do not elaborate further on the detailed hardware implementation of the two modules, which is not the core issue of this case.
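In software terms, the two modules behave roughly as below; the cache dictionary and the CPU notification callback are stand-ins for the BRAM hash table and the microprocessor mailbox:

```python
def parse_get(request: bytes):
    """URL parser: extract the path from the HTTP GET request line."""
    line = request.split(b"\r\n", 1)[0]
    method, path, _version = line.split(b" ")
    return path if method == b"GET" else None

def locate(path, cache, notify_cpu):
    """Data locating module: a hash-table hit yields (DDR3 address, length)
    for the data-ready event; a miss asks the embedded CPU to stage the
    file from SSD into the cache first."""
    if path in cache:
        return cache[path]          # cached file: signal data fragment engine
    notify_cpu(path)                # CPU copies SSD -> DDR3 cache, then retry
    return None
```

Note that on a hit only the (address, length) pair travels to the TCP stack; the payload itself is never copied, matching the zero-copy send path described in Section 4.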
5.2. Cloud based ECG classification
The electrocardiogram (ECG) is a graphic representation of the electrical activity of the heart, which is a useful tool to diagnose heart diseases. Since heart disease is still the most common cause of death in the modern world, the early diagnosis of such diseases is significantly important. Recent advances in mobile technology and cloud computing have inspired numerous designs of cloud-based healthcare services and devices [28,29]. Medical data such as ECG can be collected and transmitted automatically to the cloud based analysis system from anywhere, and timely feedback can be returned to patients through the network. The entire process of automated diagnosis can be generally subdivided into three procedures: beat detection, feature extraction, and classification [30].
Beat Detection: The QRS complex detection algorithm described in [31] is used for our design of beat detection. This algorithm is based on the wavelet transform (WT), in which a target signal is decomposed into an approximate signal and a set of relatively high-frequency wavelets. Each stage of the WT consists of a high-pass filter G[·], a low-pass filter H[·] and two down-samplers by 2 (↓2). The down-sampled outputs of the first high- and low-pass filters provide the detail, D1, and the approximation, A1, respectively. Following the first level of decomposition, the approximation A1 is further decomposed in the second level using the same filters, and so on. The detection of the QRS complex mainly focuses on the detection of R peaks in the raw ECG data. In D2, every special wave where a negative minimum point is followed by a positive maximum point corresponds to an R peak. In other words, we can capture the position of R peaks by identifying the zero-crossing of extremum pairs in D2. Then the QRS complex is extracted from the
Fig. 6. Architecture of the user defined application logic for cloud based ECG classification.
raw ECG data by segmenting a fixed number of samples centered at R peaks.
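The zero-crossing rule can be sketched as follows; this is a simplification of [31] that omits adaptive thresholding and refractory-period handling:

```python
def find_r_peaks(d2, threshold=0.0):
    """R peaks from the D2 subband: report each sample where the signal
    crosses from below -threshold to above +threshold, i.e. the
    zero-crossing between a negative minimum and a positive maximum."""
    return [i for i in range(1, len(d2))
            if d2[i - 1] < -threshold and d2[i] > threshold]
```

In the hardware module, the same comparison runs on the streaming D2 output, and each detected index triggers extraction of the surrounding fixed-length QRS segment.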
Feature Extraction: We use the method presented in [32] to extract features from the QRS complex. The two-level discrete wavelet decomposition produces signal components in different subbands. In order to reduce the dimensionality of the extracted signal components, 11 statistics over the set of the wavelet coefficients were used to represent the time-frequency distribution of the ECG signals:
1. Variance of the original QRS complex signal;
2. Variance of the wavelet coefficients in each subband (D1, D2, A2);
3. Variance of the autocorrelation functions of the wavelet coefficients in each subband (D1, D2, A2);
4. Ratio of the minimum to the maximum of the wavelet coefficients in each subband (D1, D2, A2);
5. Instantaneous RR intervals.
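To make the 11-element feature vector concrete, here is a pure-Python sketch; the Haar-like filter pair is a placeholder, as the actual wavelet coefficients used in [32] are not reproduced here:

```python
from statistics import pvariance

def _conv(x, k):
    """Full 1-D convolution (pure Python)."""
    n, m = len(x), len(k)
    return [sum(x[i - j] * k[j] for j in range(m) if 0 <= i - j < n)
            for i in range(n + m - 1)]

def dwt_level(x, h, g):
    """One wavelet stage: low-pass h / high-pass g filtering, downsample by 2."""
    return _conv(x, h)[::2], _conv(x, g)[::2]

def ecg_features(qrs, rr_interval, h=(0.5, 0.5), g=(0.5, -0.5)):
    """The 11 statistics listed above for one QRS segment (placeholder filters)."""
    a1, d1 = dwt_level(qrs, h, g)
    a2, d2 = dwt_level(a1, h, g)
    feats = [pvariance(qrs)]                                       # statistic 1
    feats += [pvariance(b) for b in (d1, d2, a2)]                  # statistic 2
    feats += [pvariance(_conv(b, b[::-1])) for b in (d1, d2, a2)]  # statistic 3
    feats += [min(b) / max(b) for b in (d1, d2, a2)]               # statistic 4
    feats.append(rr_interval)                                      # statistic 5
    return feats
```

The hardware FE pipeline computes the same quantities with dedicated variance, autocorrelation and min/max modules fed by the shared two-level WT module described below.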
Classification: We use an artificial neural network (ANN) to classify the extracted features, which is a widely used method for classification in machine learning and pattern recognition.
The architecture of our cloud based ECG classification in FPGA is shown in Fig. 6. When users connect to the FPGA and upload their ECG data through the TCP/IP protocol, the TCP/IP stack engine handles these TCP connections and offloads the ECG data to the DDR3 memory. Then the TCP/IP stack engine sends the connection information messages to the task assignment module through a FIFO. The latter periodically polls all feature extraction (FE) pipelines to find an idle one, and then reads and dispatches the ECG data associated with the head-of-the-FIFO TCP connection to the idle FE pipeline. Since the feature extraction stage is the most computation-intensive and time-consuming stage, we instantiate multiple FE pipelines to meet the system level throughput requirement.
The input of each FE pipeline is the sequential samples of ECG data related to one user. After several processing steps, including R-peak detection, two-level wavelet transformation, autocorrelation computation, variance computation, and min/max computation, the FE pipeline outputs the 11 statistical features described above as well as its connection information message to a feature FIFO. The classification module contains a pre-trained ANN with 11 input neurons, 20 hidden neurons and 6 output neurons. It automatically reads features from the feature FIFO, classifies the type of the ECG data, writes the result to the DDR3 memory, and then notifies the TCP/IP engine to send the diagnostic result back to the user.
Since both the R peak detection and the feature extraction contain a wavelet transformation procedure, we share a two-level WT module between these two processes.

Fig. 7. Architecture of feature extraction pipeline.

Fig. 8. Architecture of light-weight neuron design in FPGA.

The architecture of the feature extraction pipeline is shown in Fig. 7. When ECG data is ready in the local block RAM, the WT module reads it out and executes a pipelined wavelet transformation. The D2 output is sent to the R peak detection module, where all R peak positions as well as the 64-point QRS segments centered at the R-peaks are extracted from the record. Also, the instantaneous RR intervals are computed and sent to the feature output module. After that, a second wavelet transformation is applied to the extracted QRS segments. The controller stores all subband data and controls the execution order to compute all statistical features via the different modules, including the Min/Max module, the autocorrelation module, and the variance module. In our design, the memory space in the block RAM is divided into two identical parts to support ping-pong operations, which can hide the latency related to data acquisition.
The architecture of a single neuron in the hidden layer and output layer is illustrated in Fig. 8. Since the most time-consuming stage is feature extraction rather than classification, we adopt a more light-weight neuron design to save on-chip resources. Each neuron contains a finite state machine to control the access of data from the outputs of the former neurons. During each cycle, the neuron executes one multiply-accumulate operation on one of its input data with its pre-trained weight. The implementation of the sigmoid function uses a look-up table (LUT) stored in the on-chip block RAM [33]. The use of the LUT reduces the resource requirement and improves the speed. In our design, a global 64K-depth and 16-bit-width LUT is used as a compromise between resource utilization and the resolution of the output.
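A behavioral model of such a neuron might look like the following; the [-8, 8] input range mapped onto the LUT is our assumption, as the paper does not state the quantization range:

```python
import math

def build_sigmoid_lut(depth=1 << 16, lo=-8.0, hi=8.0, width_bits=16):
    """Precompute the sigmoid into a 64K-depth, 16-bit-wide table (the sizes
    used above). The [lo, hi] input range is an assumed quantization choice."""
    scale = (1 << width_bits) - 1
    return [round(scale / (1.0 + math.exp(-(lo + (hi - lo) * i / (depth - 1)))))
            for i in range(depth)]

def neuron(inputs, weights, lut, lo=-8.0, hi=8.0, width_bits=16):
    """One multiply-accumulate per cycle over the inputs, then a single LUT
    read in place of computing the sigmoid directly."""
    acc = sum(x * w for x, w in zip(inputs, weights))   # MAC loop
    acc = min(max(acc, lo), hi)                         # saturate to LUT range
    idx = round((acc - lo) / (hi - lo) * (len(lut) - 1))
    return lut[idx] / ((1 << width_bits) - 1)
```

The trade-off noted in the text is visible here: a deeper or wider table costs more block RAM but reduces the quantization error of the activation.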
6. Experimental results and discussion

6.1. Building the experimental prototype platform
We have been involved in building the mimicry computer [34], and used this reconfigurable platform (Fig. 9) to evaluate our SOPC based architecture for cloud applications. The current prototype platform consists of 16 computing nodes, each of which is packaged in a standard 2U rack-mountable chassis, as shown in Fig. 10. Each computing node houses one motherboard and four daughter boards. Each daughter board contains: 1) an XC6VLX550T FPGA from Xilinx; 2) three SO-DIMM channels that support up to 16GB
Fig. 9. Experimental prototype platform. (a) Reconfigurable prototype platform containing 16 2U rack-mountable computing nodes; (b) one of its quad-FPGA reconfigurable computing nodes.
Fig. 10. Board architecture for single node in experimental prototype platform.
DDR3 SDRAM and 72Mb SRAM; 3) 10G Ethernet with a XAUI/SFP+ interface; 4) a SATA3 interface connecting an Intel 320GB SSD; 5) 32MB NOR flash on a BPI interface; 6) a PCI-E X8 interface. Data transfer between adjacent FPGAs can be carried out directly through point-to-point interconnections on the motherboard. All FPGAs are interconnected by a fiber channel switch from Cisco, with their IP addresses pre-defined in the flash attached to each FPGA.
For the purpose of remote configuration, an onboard configuration module based on an Intel Atom CPU is integrated on each motherboard. Tools are provided for users to perform all kinds of operations, such as starting up or shutting down a certain FPGA, downloading the bitstream, and even debugging remotely using Chipscope, when a PC host is connected to the Atom system through the network. By using the BPI configuration interface, it takes less than two seconds to configure the FPGA.
6.2. Resources utilization
As mentioned above, we implemented both applications on our reconfigurable prototype platform. Each instance of the SOPC based architecture was implemented in a Xilinx XC6VLX550T FPGA. The embedded microprocessor is implemented by MicroBlaze, a light-weight soft core provided by Xilinx. The system clock frequency of both implementations is 156.25 MHz, which is equal to the frequency of the 10Gbps PHY interface. Table 1 shows the device utilization summary of the main modules for a single core, as generated by Xilinx ISE 14.1. All common modules in total use only 6.4% of the slice registers, 13.5% of the slice LUTs, 25.1% of the BRAMs and less than 1% of the DSP48E1s, which leaves most of the on-chip resources for user defined application logic. Note that
Table 1
Device utilization summary for single core.

XC6VLX550T                          Slice Reg   LUTs      BRAM   DSP48
Available                           687,360     343,680   2528   864
Common Modules
  TCP/IP Stack                      21,312      22,604    126    6
  SATA Controller                   3108        4816      10     0
  Microblaze                        1620        2185      498    3
  DDR3 Controller                   10,028      8200      0      0
  AXI interconnect                  4155        4989      0      0
  Ethernet                          3704        3555      0      0
User Defined Application Logic
  web server                        9727        11,563    114    0
  ECG classification                87,242      72,972    162    450
The state-of-the-art TOE in academia
  TOE [21]                          20,611      19,026    279    –
our TCP/IP stack has a similar resource usage to the state-of-the-art TOE in academia [21], but supports more concurrent TCP sessions.
6.3. Performance evaluation of the SOPC based web server
The performance of the SOPC based web server was evaluated by a Spirent Avalanche 3100 [35], a popular network performance testing instrument that is able to evaluate OSI Layer 4–7 stateful traffic performance over multi-10Gbps physical layers. Three performance metrics are measured:
• Throughput, measured in bits per second (bps), which is the rate of successful data delivery in a given time period;
• Number of HTTP transactions per second, which refers to the number of complete HTTP transactions (including connection establishment, file transmission, and connection teardown) performed by the web server per second. This metric is highly related to the response time of a single transaction when the requested files are very small;
• Max number of concurrent connections, which refers to the maximum number of active TCP links processed concurrently.
Results were compared with two other mainstream industrial systems:
• Apache HTTP web server [36], which is the most widely used open-source HTTP server in industry. We launched Apache 2.2 HTTP web server instances on IBM x3635 M3 servers with an Intel X5650 CPU (2.66 GHz), 8GB DDR3 memory and an Intel X520-SR2 10GE network adapter.
• Master-slave based web server, where we connected our daughter board as a slave device to IBM x3635 M3 servers via the PCI-E interface. We integrated the QuickTCP [22] IP core, a third-party 10GE TCP/IP stack in FPGA, to offload network data processing, leaving the master side CPU to handle application level functions such as file management and lookup.
Files of different sizes were stored in the SSD for the SOPC based web server, while the master-slave based implementation held such files in the hard drive hosted on the master side.
6.3.1. Throughput
Fig. 11 shows that the peak throughput of the SOPC based web server is 8.3Gbps, which is slightly lower than the Apache and master-slave based web servers. However, when the size of the requested file became small, the standalone SOPC based web server performed much better than the other two systems, because in this situation the processing delay of a single complete HTTP transaction, including the setting up and tearing down of the TCP connection, turned out to be
Fig. 11. Throughput at different file sizes.
Fig. 12. HTTP transaction rate at different file sizes.
the bottleneck. For the master-slave based implementation, many message interactions and remote memory access operations via PCI-E were needed between the master side and the slave side, which added more delay to a single HTTP transaction. For the Apache based implementation, OS operations such as system calls, user space to kernel space memory copies and interrupt handling bring more delay. In our standalone SOPC based web server, however, latency is reduced due to the tightly-coupled architecture. Compared with [21], our TCP/IP stack also has a similar throughput (8.3Gbps vs. 8.6Gbps).
6.3.2. Number of HTTP transactions per second
Fig. 12 illustrates our explanation from another perspective. The HTTP transaction rate of the SOPC based web server can reach up to 120,000 TPS (Transactions per Second) when transmitting the same 1KB file, which is 6.8 times that of the master-slave based implementation and 9.2 times that of the Apache web server. However, as the file size increases, the performance gap narrows, and the throughput limit becomes the bottleneck.
We can use the performance model introduced in Section 2 to analyze the results. In the master-slave based implementation, the communication latency between the master side and the slave side via the PCI-E interface is nearly 4μs. During the process of a complete HTTP request-response operation, including the three-way handshake to establish the connection and the four-way handshake to disconnect it, there are 12 communication operations over the PCI-E interface. According to Eq. (6), the time consumed by control I/O operations Tcio is approximately 4μs × 12 = 48μs. On the other hand, according to Eq. (7), the time consumed by data I/O operations Tdio is almost equal to θ_PCI-E + θ_HardDisk × cache_miss_rate. When the file is small and cached in the memory of the master side, Tdio is nearly equal to the PCI-E latency. The HTTP transaction rate of the master-slave based implementation is 17,500 TPS when the file size is 1KB, which means it needs 1,000,000μs / 17,500 = 57μs for each complete transaction. From this computation, we can infer that most of the time is consumed by control I/O operations. This observation indicates that the master-slave based architecture is quite unsuitable for such situations. In our SOPC based implementation, however, the control I/O latency is small enough to be ignored, which dramatically reduces the processing time per HTTP transaction to only 1,000,000μs / 120,000 = 8.3μs. This result indicates that our SOPC based web server can offer a better user experience in terms of lower response time under highly concurrent requests.
6.3.3. Max number of concurrent connections
We continuously generated keep-alive HTTP requests to measure the max number of concurrent connections of the three systems. TCP connections established by such requests are not torn down after the requested file is completely transmitted. The number of max concurrent connections was recorded when the web server was no longer able to set up more TCP connections. Experimental results show that the max concurrent TCP connections supported by our standalone SOPC based web server, the Apache web server and the master-slave based web server are nearly 100K, 80K, and 128, respectively. Since the QuickTCP [22] IP core only supports one TCP/IP session per instantiated pipeline, and on-chip resource constraints limit the number of pipelines that can be integrated, the max number of concurrent connections of the master-slave system is limited. Also, compared with [21], our TCP/IP stack supports many more concurrent TCP connections (100K vs. 10K), with similar performance and resource consumption.
6.3.4. Energy efficiency under peak performance
We use the term million bits transmitted per joule (Mb/J) to define the energy efficiency of different systems. The power consumption of one instance of the SOPC based web server running under maximum workload was 40W (including the power of the off-chip devices and the cooling fan), so its peak energy efficiency was calculated as 8320Mbps ÷ 40W = 208Mb/J. In contrast, the Apache web server consumed 160W under full workload, for a peak energy efficiency of 58Mb/J, while the master-slave based implementation consumed 180W, leading to an even worse peak energy efficiency of 52Mb/J. In the master-slave based web server, most of the energy was consumed by the conventional PC on the master side. Since the application mapped to such an architecture does not have a large serial computation-intensive portion to be executed on the high performance general purpose processor, web servers based on the master-slave style architecture are obviously inefficient.
6.3.5. Energy efficiency in mimicry cluster
We measured the energy efficiency of our SOPC based web server in the Mimicry Cluster with online dynamic scheduling. We emulated dynamically changing requests where the peak workload is nearly 4 times the average workload. 40 files ranging from 1MB to 64MB were used as requested files and 12 IBM x3635 M3 servers were used as client machines. To balance QoS and dynamic energy efficiency, we empirically set the scheduling upper bound Lu to 85%, the window size to 15 and the scheduling interval to 5s. The experiment was carried out for 24 hours, and at the point of the request peak, 12 FPGAs were used to provide web services. A cluster containing 12 master-slave based web servers and a cluster containing 12 Apache web servers were tested under the same input. On account of the unsustainable cost of frequently powering up and shutting down conventional PCs, we kept all 12 nodes awake during the test.
Fig. 13. Comparison of energy efficiency among our standalone SOPC implementation, FPGA-in-slave-mode implementation and Apache implementation of web server. We evaluate the energy efficiency of both a single node under peak performance and a cluster under dynamic load.
Table 2
Latency of the user defined application logic in cloud based ECG classification application (cycles).

Data acquisition   Feature extraction   Neural network
4096               9728                 61
We measured the total energy consumption P (J) and the total file size S (MB) downloaded during the test period, then calculated the energy efficiency as 8S/P (Mb/J). Fig. 13 shows the energy efficiency of the three systems under the dynamically changing workload. The energy efficiency of our standalone SOPC based web server cluster was 152Mb/J, which was 73.8% of its peak energy efficiency (energy efficiency under maximum workload). However, the energy efficiency of the master-slave based web server was 12.8Mb/J, which was 24.7% of its peak energy efficiency. Moreover, the Apache cluster achieved 26.4% of its peak energy efficiency. The results indicate that, using online dynamic scheduling, our SOPC based system is able to effectively utilize the processing power of the activated computing nodes while powering off the rest of the nodes to save energy. Thus, it gains much more energy efficiency in a dynamic way, compared to the other two statically scheduled systems.
6.4. Performance evaluation of the cloud based ECG classification in FPGA
In our study of cloud based ECG classification, 23 ECG records were selected from the MIT-BIH arrhythmia database [37], which is widely used in the research community of ECG analysis. All records were digitized at 360 samples per second with 11-bit resolution for slightly over 30 minutes. These records include six ECG beat types: the normal beat (N), the left bundle branch block beat (LBBB), the right bundle branch block beat (RBBB), the atrial premature beat (APB), the premature ventricular contraction (PVC), and the paced beat (PB). For each beat type, we chose 50% of the beats for training and the other 50% for testing. The parameters related to the neural network were trained offline using software and then fixed in the FPGA implementation. Ten FE pipelines were integrated in the feature extraction module to maximize the throughput and make full use of the reconfigurable resources. The size of the block RAM in each FE pipeline is 16KB, which can store 8K points of ECG data at 16-bit width. Half of the memory space is used for current processing, while the other half is used for buffering new ECG data.
Table 2 illustrates the latency of the user defined application logic in our design. The latency of the FE pipeline to process 4K points of ECG data containing 12 heart beats is 9782 cycles. Al-