Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc
An energy-efficient system on a programmable chip platform for cloud applications
Xu Wang (a), Yongxin Zhu (a,f,∗), Yajun Ha (b), Meikang Qiu (c), Tian Huang (a), Xueming Si (d), Jiangxing Wu (e)

a School of Microelectronics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, P.R. China
b Department of Electrical and Computer Engineering, National University of Singapore, 21 Lower Kent Ridge Road, Singapore
c Department of Computer Science, Pace University, 1 Pace Plaza, Manhattan, New York City, NY 10038, USA
d Shanghai Key Laboratory of Data Science, Fudan University, 825 Zhangheng Road, Shanghai 201203, P.R. China
e PLA Information Engineering University, P.R. China
f School of Computing, National University of Singapore, Singapore
Article info
Article history:
Received 30 January 2016
Revised 8 October 2016
Accepted 22 November 2016
Available online 30 November 2016
Keywords: Cloud computing; ECG classification; FPGA; Performance analysis; Reconfigurable architectures; Web server

Abstract
Traditional cloud service providers build large data-centers with a huge number of connected commodity computers to meet the ever-growing demand on performance. However, the growth potential of these data-centers is limited by their corresponding energy consumption and thermal issues. Energy efficiency becomes a key issue of building large-scale cloud computing centers. To solve this issue, we propose a standalone SOPC (System on a Programmable Chip) based platform for cloud applications. We improve the energy efficiency for cloud computing platforms with two techniques. First, we propose a massive-sessions optimized TCP/IP hardware stack using a macro-pipeline architecture. It enables the hardware acceleration of pipelining execution of network packet offloading and application level data processing. This achieves higher energy efficiency while maintaining peak performance. Second, we propose an online dynamic scheduling strategy. It can reconfigure or shut down FPGA nodes according to workload variance to reduce the runtime energy consumption in a standalone SOPC based reconfigurable cluster system. Two case studies including a web server application and a cloud based ECG (electrocardiogram) classification application are developed to validate the effectiveness of the proposed platform. Evaluation results show that our SOPC based cloud computing platform can achieve up to 418X improvement in terms of energy efficiency over commercial cloud systems.
© 2016 Elsevier B.V. All rights reserved.
1. Introduction
We are living in an era of cloud computing that is the backbone of embedded ubiquitous computing [1] and big data [2]. People can access powerful IT resources (e.g., computation and storage) and convenient services in the cloud via mobile devices at any time and any place. The shift of computing infrastructure from local desktop to the remote cloud improves the utilization of overall computing resources and reduces the costs associated with management of hardware and software resources for individual users.
Many services on the cloud are web-based applications, which incur a large number of concurrent requests, and each of the requests involves a light-weight task that often desires short latencies in response from energy-hungry servers. For existing servers, the performance is typically limited by the overheads of network packet processing and connection management in the NIC and OS kernel [3]. Although traditional cloud service providers such as Google and Amazon build data centers with a huge number of connected commodity computers to meet the demand of cloud computing, the ever-growing energy consumption limits the scale-out of these machines due to the limited energy budget. Moreover, the expenses related to energy consumption and heat dissipation dominate the operating costs in such high-density computing environments. Therefore, energy efficiency (performance per joule) becomes the key issue of building large-scale cloud computing centers [4].

∗ Corresponding author.
E-mail addresses: [email protected] (X. Wang), [email protected], [email protected] (Y. Zhu), [email protected] (Y. Ha), [email protected] (M. Qiu), [email protected] (T. Huang), [email protected] (X. Si).
Reconfigurable computing based on Field Programmable Gate Array (FPGA) technologies possesses the capability of parallel and specialized computation that can effectively exploit the task-level parallelism inherent in cloud computing with relatively low energy consumption [5]. The current trend to take advantage of FPGA in cloud computing lies in two ways: (1) to improve data transfer performance within data centers as a TCP/IP offload engine (TOE) [6]; (2) to accelerate computation-intensive applications as a hardware accelerator [7,8]. In both methods, FPGAs are used as slave devices hosted by commodity CPUs on the master side. However, such a master-slave based architecture is not suitable for web-based cloud applications, since the high communication latency between the master CPU and the slave FPGA limits the throughput of the system.

http://dx.doi.org/10.1016/j.sysarc.2016.11.009
1383-7621/© 2016 Elsevier B.V. All rights reserved.
In this paper, we analyze the inefficiency of master-slave based reconfigurable architectures in dealing with highly concurrent cloud applications by proposing a performance model, and indicate that a standalone SOPC based architecture, which tightly couples network IO handling and data processing in a single FPGA, is a more promising solution in cloud computing with both high energy efficiency and high performance catering to the needs of cloud applications.
Having identified a promising standalone SOPC based architecture, we then present the design and implementation of the SOPC based architecture for cloud computing in detail. The major challenge to implement such an SOPC based architecture is the design of a high performance TCP/IP hardware stack that is able to support high network throughput with highly concurrent connections. Since third party TCP/IP offload engines are primarily optimized for inter-server data transfer among a few TCP connections in data centers, they adopt a one-TCP/IP-session-per-pipeline architecture to achieve high throughput and extremely low latency. However, such TCP/IP engines are hard to scale due to resource limitations in FPGA, which is not suitable for highly concurrent cloud applications. We address this problem by adopting a macro-pipeline architecture, where all TCP sessions share a centralized memory and coarse grained pipeline. Processing modules in the macro-pipeline are operated in asynchronous mode to achieve high system level throughput, while an external SRAM is used to minimize the access latency of TCP connection information that is randomly accessed and modified by different modules concurrently.
One advantage of our design by placing FPGAs on the cloud as standalone entities to provide service is its high energy efficiency at its peak performance. In other words, it supports higher performance within a fixed energy-budget constraint in data centers. Besides the efforts in hardware design, we further propose an online dynamic scheduling scheme to reduce the runtime energy consumption of the whole cluster system. By taking advantage of fast configuration ability, a load balancing node is used to shut down or power up FPGAs according to realtime workload variance in the cloud environment, saving energy without compromising the quality of service to users.

To verify our architecture design and scheduling scheme, two case studies including a web server cluster and a cloud based ECG (electrocardiogram) classification cluster are presented to evaluate the effectiveness of the proposed architecture. Design considerations and implementation details are given to show how this architecture meets the demand of cloud computing in high throughput, high concurrency, low latency, as well as low energy consumption.
The major contributions of this work include:
• Performance analysis of master-slave based reconfigurable architecture and standalone SOPC based reconfigurable architecture using an analytical model.
• A massive-sessions optimized TCP/IP offload engine (described in Section 3.A) that supports up to 100K TCP sessions under 10 Gbps line rate.
• An online dynamic scheduling method (described in Section 3.B) that can reconfigure or power off FPGA nodes according to workload variance to reduce the runtime energy consumption.
• Prototype of a reconfigurable cluster system based on standalone SOPC to provide cloud services, achieving up to 38X speedup in performance and 418X improvement in energy efficiency compared to the software based cloud systems.
• Prototype verification by implementing an I/O intensive web server cluster system with detailed comparison with alternate servers.
• Prototype verification by implementing a both I/O and computation intensive ECG classification cluster system, indicating support for practical cloud applications.
The rest of the paper is organized as follows. Section 2 analyzes related work briefly. Section 3 proposes a performance model to analyze the master-slave based architecture and the standalone SOPC based architecture. Section 4 describes the design details of standalone SOPC based cluster systems for cloud applications. Section 5 presents the design and implementation of two case studies including a web server cluster and a cloud based ECG classification cluster. Section 6 evaluates performance and energy efficiency of the two application cases. Our conclusions are presented in Section 7.
2. Related works
The issue of energy consumption in cloud computing has been receiving increasing attention in recent years [9]. Researchers have considered energy minimization by consolidating the workload into the minimum of physical resources and switching idle computing nodes off, with guaranteed throughput and response time. For example, Chase et al. [10] proposed an economic approach to managing shared server resources, which improves the energy efficiency of server clusters by dynamically resizing the active server set. Similar dynamic provisioning algorithms [11] are studied for long-lived TCP connections as in instant messaging and gaming. Moreover, a queuing model [12] for dynamic provisioning has also been studied to obtain the minimum number of servers, and a combination of predictive and reactive methods has been proposed to determine when to provision these resources. Although such scheduling methods reduce energy consumption by switching idle servers to power saving modes (e.g. sleep, hibernation or power off), the long switch latency, consisting of reinitialization of the system and restoring the context in commercial servers, prevents their usage in real cloud systems, especially when the workloads are unstable. A recent study [13] utilises the power gating technique for FPGA-based accelerators to efficiently reduce the energy consumption of tasks running on a single FPGA. However, as far as we know, there is no related work on dynamic scheduling of FPGA based cloud computing systems to reduce the energy consumption.
To ensure fast changing system configuration and sharing of system resources, Li et al. [14] proposed a rack scale composable system using PCIe switches, which is constructed as a collection of individual resources such as CPU, memory, GPU/FPGA accelerator, disks etc., and composed into workload execution units on demand. Although such a composable system is a promising architecture that can improve the utilization of overall system resources, the relatively high communication latency between master CPU and slave devices (memory, GPU/FPGA) makes it less efficient in dealing with latency-bounded applications such as web-based cloud services. Compared to it, our standalone SOPC based architecture tightly couples network IO handling and data processing in a single FPGA, which significantly reduces such communication latency. A recent study [15] shows that energy consumption in network processing can be a significant percentage of total energy consumption in cloud computing. Neither high-performance nor low-power cores provide a satisfactory energy-performance trade-off [16], since the performance of many cloud applications is typically limited by the per-packet processing overheads in the NIC and OS kernel. Therefore, there have been a few works on porting cloud workloads to FPGA with a tighter integration of the network interface. Sai et al. [16] proposed an FPGA based Memcached appliance, accelerating the low-latency access to large amounts of data for web services with a 1 Gbps network interface, and Michaela et al. [17] extended it to 10 Gbps line rate. They utilized a similar architecture to ours with a tight proximity of network and DRAM to achieve lower latency. However, they used UDP offload engines rather than TCP offload engines to handle network communications, for their minimal overhead to ensure maximum throughput over the network and the limited number of concurrent sessions in third party TCP offload engines. Compared to their work, our architecture is capable of supporting connection-oriented cloud applications by using a novel TCP/IP offload engine that supports 100K concurrent TCP sessions.
As an attractive solution to reduce the host CPU overhead from TCP/IP processing and improve network performance, the TCP/IP offload engine (TOE) has been extensively studied both in academia and industry. Wu et al. [18] implemented a TOE acceleration hardware block and associated TCP firmware, which provided TCP/IP transmission rates up to 296 Mbps as receiver and 239 Mbps as sender. Hankook et al. [19] designed a hybrid architecture for a TCP offload engine, where the transmission and reception paths of TCP/IP are completely separated with the aid of two general embedded processors to process data transmission and reception simultaneously. A complete TOE for FPGAs without processor assist was presented in [20], which supports Gigabit Ethernet in limited hardware resources. The closest related work in TOE is [21], which supports 10K concurrent connections in VC709 with 10 Gbps bandwidth. For each connection, this design allocates fixed-sized data buffers in the external SDRAM and maintains fixed-sized session information in on-chip block RAMs. Compared with it, our TOE avoids using private sending buffers for each connection, allowing sending data to be shared among multiple connections, which saves the memory space and time caused by multiple copies to private buffers. Also, we use an event-triggered connection management in external SRAM. As a result, our TOE can support more TCP sessions (100K compared with 10K in [21]) with similar performance and resource consumption. Recently, solutions from industry have already extended the throughput of TOE to 10 Gbps–40 Gbps [22,23]. Up until 2014, all of the available TOEs were ultra-low latency and low session count (below 256 concurrent sessions), with the key driving applications being high-frequency trading and inter-server data transfer in data centers. As a result, these third party solutions typically support one TCP/IP session per instantiated TOE, which limits the number of concurrent TCP sessions due to the FPGA resources. In 2014, Intilop announced a high session count variant supporting up to 16,000 sessions. However, as Intilop has not disclosed the design, it is hard for us to make a detailed architectural comparison at this stage.
3. Performance analysis of two reconfigurable architectures

3.1. Master-slave based architecture
Many reconfigurable systems using FPGA are built as master-slave systems [8,24,25]. In each computation node, a general purpose processor based computer works as master, which runs the operating system and handles communications and data transfer with the slave systems. Slave systems consist of I/O devices including hard drives and network interfaces, as well as reconfigurable logic based on FPGAs. Communications between master side and slave side are delivered via high speed onboard interconnections such as PCI-E.
According to Amdahl's law [26], an arbitrary algorithm can be divided into portions that can be made parallel and portions that cannot be parallelized, i.e., that remain serial, and the speedup of an algorithm in parallel computing is limited by the time needed for the sequential fraction of the algorithm. Modern reconfigurable devices can provide abundant computing resources to accelerate the parallel portion of an algorithm. However, performance speedup may not always be achieved when there are many long-latency control communications and I/O operations, which increase the sequential portion.
To reveal such architectural limitations in parallel implementation of an application in reconfigurable systems, we further divide the sequential portion into three parts: serial computation (S), control I/O operation (CIO), and data I/O operation (DIO), based on which we define the time consumed by each part as T_s, T_cio, and T_dio respectively. Also we define T_p as the time spent on parallel computation; then the overall executing time T in a complete synchronous system is given by:

T = T_s + T_p + T_cio + T_dio. (1)
Considering optimization techniques like overlapping I/O and processing, we can also define the overall executing time T in a complete asynchronous system as:

T = max{T_s, T_p, (T_cio + T_dio)}. (2)

However, real reconfigurable systems fall somewhere in between Eq. (1) and Eq. (2). For simplicity, we use Eq. (2) to analyze the two reconfigurable systems. Since the serial computation portion of an application is in most situations executed on a processor, we can calculate T_s as:

T_s = S / (f_s × Σ_i (N_i / CPI_i)), (3)
where S represents the workload of the serial computation portion, which is defined by the number of instructions. N_i is the number of instructions that can be issued simultaneously by functional unit type i, CPI_i is the average number of cycles per instruction for functional unit type i (such as DSP, ALU), and f_s is the operating frequency of the device. Since S is defined as the portion that cannot be parallelized, we simplify Eq. (3) as:

T_s = (S × CPI) / f_s. (4)
In reconfigurable systems, the parallel computation part of an application is most likely executed by hardware logic in FPGAs, so the value of T_p can be calculated as:

T_p = (P × CPP) / (R × f_p), (5)

where P represents the workload of the parallel computation portion, which is defined by the number of operations. R is the number of reconfigurable resources available for parallel computation, for example, the total number of Int32 IP cores of adders and multipliers that an FPGA can support. CPP is the average number of clock cycles per operation per reconfigurable resource and f_p is the operating frequency.
The process of control involves a sequence of message passing operations between master CPU and slave devices. Although modern high speed interconnections are able to provide an extremely high throughput for data transmission, they bring a relatively high latency. Since such control messages always contain a small volume of payload and a control process includes several rounds of message passing, the communication delay becomes the dominant factor in T_cio:

T_cio = N_c × D_c. (6)

Eq. (6) gives the definition of T_cio, where N_c is the number of communications for control operations and D_c is the delay of such communications.
We assume data I/O operations are all large data transmissions; small data transmissions can be treated as control operations. The throughput of the I/O interface, rather than communication latency, becomes the major factor. Since it takes more than one data copy operation via different I/O interfaces to feed data to the execution unit, the time consumed by I/O operations can be defined as:

T_dio = Σ_j (D / θ_j), (7)

where D is the size of the data to be transferred and θ_j is the maximum throughput of each interface j where a data copy/transfer operation occurs. To sum up, Eq. (2) is specified as:
T = max{ (S × CPI) / f_s, (P × CPP) / (R × f_p), N_c × D_c + Σ_j (D / θ_j) }. (8)
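As a quick numerical illustration, Eq. (8) can be evaluated directly. The sketch below is ours, not the paper's; every parameter value (instruction counts, frequencies, latencies, throughputs) is an illustrative assumption chosen only to contrast a high-D_c master-slave node with a low-D_c standalone SOPC node.

```python
# Sketch of the asynchronous performance model of Eq. (8).
# All parameter values are illustrative assumptions, not measurements.

def exec_time(S, CPI, fs, P, CPP, R, fp, Nc, Dc, D, thetas):
    """Overall executing time T per Eq. (8): in a fully asynchronous system
    the slowest of the serial, parallel and I/O portions dominates."""
    Ts = S * CPI / fs                            # serial computation, Eq. (4)
    Tp = P * CPP / (R * fp)                      # parallel computation, Eq. (5)
    Tio = Nc * Dc + sum(D / t for t in thetas)   # control + data I/O, Eqs. (6)-(7)
    return max(Ts, Tp, Tio)

# Hypothetical workload: master-slave (many control messages, high Dc,
# two data-copy hops) vs. standalone SOPC (few messages, on-chip latency).
common = dict(S=1e6, CPI=1.2, fs=2e9, P=1e9, CPP=1.0, R=512, fp=2e8, D=1e6)
t_ms   = exec_time(Nc=10000, Dc=5e-6, thetas=[1e9, 5e8], **common)   # I/O-bound
t_sopc = exec_time(Nc=100,   Dc=1e-7, thetas=[1e9],      **common)   # compute-bound
print(t_ms, t_sopc)
```

Under these assumed numbers the master-slave node is dominated by the N_c × D_c control term, while the SOPC node is limited only by its parallel computation term, mirroring the argument of Section 3.1.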
With the definition given by Eq. (8) and considering the energy consumption of the whole system including both general purpose processors and FPGAs, we can infer that the applications that perfectly match the master-slave reconfigurable system, thus achieving high energy efficiency, must have the following properties: 1) a comparative portion of serial computation workload that can keep the general purpose processors in high utilization; 2) few control interactions between master and slaves; 3) the data from the I/O interface must satisfy the needs of parallel processing in FPGAs.

However, many services on the cloud are web-based applications, which include a large number of concurrent tasks, each of which is light-weight. In other words, these applications contain a very small portion of serial computation-intensive workload but rather a large part of parallel computation coupled with complex control and I/O operations. In this case, the master-slave based reconfigurable systems become inefficient. Since D_c is large in master-slave systems, when N_c increases to a certain threshold, the time spent in message interaction between host and FPGA devices may consume much more time than that for parallel computation in FPGAs. Also, when there is a bottleneck in the chain of transferring data from I/O to FPGA, or large I/O delays caused by too many data copy operations, poor performance would be the consequence because of the insufficient data supply for the parallel computational cores.
3.2. SOPC based architecture
The SOPC based reconfigurable architecture integrates the functionality provided by the master side in a master-slave system into FPGAs. General purpose processors with high performance and high energy consumption are replaced by embedded microprocessors, while off-chip interconnects are replaced by on-chip interconnects with low latency. I/O facilities are directly connected to FPGAs, where controller circuits of such I/O interfaces are implemented as IP blocks. As a result, it is more suitable for cloud based applications for the following reasons:

1. It narrows the distance between control unit and computational unit by using direct message passing interfaces such as FIFOs or shared memories, so that it decreases the control latency (D_c).
2. Data from I/O interfaces are directly fed to computational logic via raw interfaces with extremely high throughput. As such, the architecture decreases the number of data copy operations and eliminates the data transmission bottleneck that may exist in the master-slave architecture. Also, direct I/O access with hardware control logic can offload some of the control workload that would otherwise be done by processors. As a result, it reduces N_c.
Fig. 1. SOPC based architecture for Cloud applications in single FPGA. (Block diagram: embedded microprocessor and AXI interconnect; memory subsystem with two DDR3 controllers and an SRAM controller; I/O subsystem with 10G Ethernet XAUI interface, TCP/IP stack, SATA SSD storage, and slow I/O peripherals TIMER, INTC, UART, MDM; user defined application logic.)
3. Using embedded microprocessors instead of general purpose processors in applications that incur little serial computational workload can dramatically reduce the overall energy consumption of the system, while it still provides identical functionalities with acceptable performance.
4. Design details of SOPC based reconfigurable clusters for cloud applications
The inherent nature of cloud computing imposes more demand on the cloud-side system for high throughput network I/O, high performance computing capability and fast storage. In order to meet the demand in a standalone FPGA, we designed a standalone SOPC based architecture for cloud service that tightly couples network IO handling and data processing in a single FPGA, which is illustrated in Fig. 1. It includes an I/O subsystem, a memory subsystem, an embedded microprocessor and a user defined application logic. In the I/O subsystem, network protocols from physical layer to transport layer are integrated in a hardware pipeline to achieve high throughput and low latency. It supports 10Gb optical ethernet via a XAUI interface as well as the entire TCP/IP stack. We also integrate a SATA controller to support non-volatile mass storage devices. In the memory subsystem, we use two types of external memory for different purposes: two DDR3 SDRAMs, which provide 16GB memory space in total, are used as data memory to store bulk data such as upstream and downstream data, for their high storage density and high burst throughput; one 72Mb SRAM is used to store data structures of TCP sessions for its low access latency. Moreover, an embedded microprocessor is incorporated with AMBA AXI4 interconnects to provide 3 major functions as follows: (1) initializing the whole system by configuring registers of system modules and peripherals; (2) providing an AHCI based driver for the SATA controller; (3) maintaining a file system based on SSD. Unlike the former-described common components, the user defined application logic is unique for different cloud applications, which implements the application level algorithm in cloud service in specialized hardware pipelines.
4.1. Massive-sessions optimized TCP/IP stack
In the scenario of cloud applications, the major disadvantage of third party TCP/IP offload engines that would outweigh their advantages is that they are primarily optimized for massive data volume transmission rather than the number of concurrent connections to boast their support for bandwidths of 10+ Gbps, where each TOE only handles no more than hundreds of TCP connections concurrently. As a result, they adopt a one-TCP/IP-session-per-instantiated-pipeline architecture [22,23], where each session includes full TCP function as well as private TX and RX buffers using on-chip BRAMs.

Fig. 2. Micro-architecture of TCP/IP stack. (Block diagram: 10G Ethernet controller feeding an IP datagram receive engine and TCP packet classifier; dedicated engines for connection request, data acknowledgment, and disconnection request/accept processing; connection management, retransmission, data fragment, TCP packaging and IP encapsulation engines with FIFO groups between stages; TCP sessions in SRAM; upstream/downstream data in DDR3; user defined application logic.)
Multi-sessions can be supported by instantiating multiple copies of such a pipeline. However, web based cloud computing is quite a different application scenario in that the TOE needs to maintain more than ten thousand concurrent connections, where each connection only contributes a small portion of the total throughput. Thus these third party TOEs no longer satisfy such requirements due to the limited on-chip resources available to support more TCP sessions. To address the problem, we propose a massive-sessions optimized TOE. We use a macro-pipeline architecture, where all TCP sessions share a coarse-grained pipeline with carefully organized TX and RX buffers. Also, we develop a hardware based connection management to support rapid scheduling of up to 100K TCP sessions. We show our architecture in Fig. 2 with the detailed explanations given in the following subsections.
4.1.1. Macro-pipeline architecture
In order to achieve high system throughput, all processing modules in the macro-pipeline are operated in an asynchronous way, each of which handles a portion of a processing stage or directly responds to a specific type of TCP packet. These processing modules are event triggered, that is, they continuously read the events or messages from the input FIFO, process the events, and send messages to the next stage without any synchronizations. Specifically, for incoming data processing, data from the network are first offloaded from 10G Ethernet and unpacked by the IP datagram receive engine to obtain TCP packets. Then these TCP packets are divided into different categories based on the control bits in their TCP headers and delivered to different processing engines respectively. Each processing engine deals with a particular type of TCP packet such as a SYN packet, FIN packet or ACK packet and directly generates feedback packets in response. The connection management engine deals with the connection and disconnection of TCP links and schedules the data transmission order among all established links. For outgoing data processing, the data fragment engine segments data that is to be transmitted into small data packages. Then these data packages are encapsulated with TCP header and IP header successively by the TCP packaging engine and IP encapsulation engine. When a timeout occurs, the retransmission engine retransmits the lost TCP packets based on the information from the most recent data ACK number.

Fig. 3. Memory allocation and scheduling of TCP connections.
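The event-triggered flow above can be mimicked in software: each stage owns an input FIFO, pops one event, and pushes its result to the next stage with no global handshake. A minimal Python analogy of the outgoing path (stage names follow Fig. 2; the payloads, the 1460-byte cut and the round-robin stepping loop are illustrative assumptions, not the hardware design):

```python
# Software analogy of the event-triggered macro-pipeline stages.
from collections import deque

class Stage:
    """One macro-pipeline stage: an input FIFO plus an event handler."""
    def __init__(self, name, handler, downstream=None):
        self.name, self.handler, self.downstream = name, handler, downstream
        self.fifo = deque()                 # input FIFO of this stage

    def step(self):
        """Consume one pending event, if any, and forward the result
        downstream -- no synchronization between stages."""
        if self.fifo:
            out = self.handler(self.fifo.popleft())
            if out is not None and self.downstream is not None:
                self.downstream.fifo.append(out)

# Outgoing path of Fig. 2: data fragment -> TCP packaging -> IP encapsulation.
sink = Stage("ethernet_tx", lambda pkt: None)
encap = Stage("ip_encapsulation", lambda seg: ("ip_pkt", seg), downstream=sink)
tcp_pkg = Stage("tcp_packaging", lambda f: ("tcp_seg", f), downstream=encap)
frag = Stage("data_fragment", lambda data: ("frag", data[:1460]),  # MSS-sized cut
             downstream=tcp_pkg)

frag.fifo.append(b"x" * 4000)               # data-ready event from user logic
for _ in range(3):                          # each stage advances independently
    frag.step(); tcp_pkg.step(); encap.step()
```

In hardware the three stages run concurrently on different packets, which is what gives the macro-pipeline its system-level throughput.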
4.1.2. Connection management
Unlike conventional TOEs that support only no more than hundreds of TCP sessions, supporting more than ten thousand TCP sessions in our TOE is no trivial task. Due to the limited on-chip memory resources, connection information can only be stored in external memory, while the shared memory may become the bottleneck of the system when it incurs random accesses from different processing modules concurrently. We mitigate the impact of the problem in two ways: by reducing access latency and by reducing the number of memory accesses. We first choose SRAM instead of DRAM to store TCP sessions for its much lower access latency. Then, as described in Fig. 3, the memory space of the external SRAM is segmented into a number of consecutive TCP connection slots, where we use a hash mechanism with source IP address and port number as the key for fast lookup of a dedicated TCP connection. We avoid using the round-robin scheduling scheme, which is widely used in TOEs that support a small number of TCP sessions, to schedule the execution of data transmission among established TCP connections, because in a web-based cloud computing scenario with a large number of concurrent connections, it is time consuming to check every connection slot to find an active connection that is able to send data. Instead, we maintain two linked lists to quickly locate active connections and timeout connections. The active linked list is used to store TCP connections that have an event to be processed. The connection management engine continuously obtains the connection at the head, executes the event, then puts the connection at the tail of the timeout linked list. When an event (new ACK received, new data received or new data to send) for a certain connection occurs, the connection is inserted at the tail of the active linked list. On the other hand, the timeout linked list is used for monitoring the transmission timeout of idle connections. Idle connections are inserted at the tail of the timeout linked list with a timeout point that adds a fixed timeout value to the current time. Since the timeout linked list is naturally sorted by timeout point, the retransmission engine only needs to check the head of the list to obtain the most timed-out connections, avoiding recurrently traversing the whole set of established TCP connection slots, which would impose access burdens on the shared memory.
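The hashed-slot lookup and the two-list scheduling above can be sketched in software. This is an illustrative analogy only: Python dicts stand in for the SRAM connection slots and hardware linked lists, and the slot count, field names and timeout value are assumptions.

```python
# Sketch of hashed connection slots plus active/timeout linked lists.
from collections import OrderedDict
import time

NUM_SLOTS = 1 << 17                      # illustrative: room for ~100K sessions

def slot_index(src_ip: str, src_port: int) -> int:
    """Hash the (source IP, port) key to a connection slot, as in Fig. 3."""
    return hash((src_ip, src_port)) % NUM_SLOTS

class ConnectionManager:
    def __init__(self, timeout=5.0):
        self.slots = {}                  # slot index -> connection state
        self.active = OrderedDict()      # connections with a pending event
        self.idle = OrderedDict()        # connections awaiting timeout
        self.timeout = timeout

    def on_event(self, key):
        """New ACK / new data / data-to-send: move to tail of active list."""
        self.idle.pop(key, None)
        self.active[key] = None
        self.active.move_to_end(key)

    def service_one(self, now=None):
        """Pop the head of the active list, then park it on the idle list
        with deadline = now + fixed timeout (list stays deadline-sorted)."""
        if not self.active:
            return None
        key, _ = self.active.popitem(last=False)
        now = time.monotonic() if now is None else now
        self.idle[key] = now + self.timeout
        return key

    def most_overdue(self, now):
        """The retransmission engine only peeks at the idle-list head."""
        if self.idle:
            key, deadline = next(iter(self.idle.items()))
            if deadline <= now:
                return key
        return None

cm = ConnectionManager(timeout=5.0)
idx = slot_index("10.0.0.1", 40000)
cm.slots[idx] = {"state": "ESTABLISHED"}
cm.on_event(idx)
served = cm.service_one(now=100.0)       # served, parked until t = 105.0
```

Because both lists are only ever read at the head and appended at the tail, each scheduling decision touches a constant number of memory words regardless of how many of the 100K slots are occupied.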
4.1.3. TX, RX buffer allocation
Due to limited on-chip resources, it is impossible to offer each TCP session a private TX and RX buffer using BRAMs. Instead, we directly store the data in DDR3 external memory. However, the organization of upstream data and downstream data for each TCP connection is different. For upstream data, we maintain a buffer pool (a continuous memory space in DDR3 SDRAM) that is divided equally into segments. The size of each segment can be programmed according to the needs of the application scenario. A buffer allocation stack is used to manage the allocation and reclaim of the buffer resources. When a new connection is established, the connection management engine acquires a new buffer from the buffer pool, and registers it in the connection information. Then upstream data can be offloaded to the buffer, where it can be directly accessed by user defined logic. During TCP disconnection, the buffer is recycled back to the buffer pool. For downstream data, however, our TCP stack does not assign a specific TX buffer to each established TCP connection. Rather, we leave the user defined logic to organize the data to be sent. Specifically, the user defined logic can trigger the TCP stack to send data by registering one or more data ready events, each of which contains the starting address and the length of prepared data in DDR3 SDRAM. This leads to a series of benefits for performance: (1) the user defined logic need not wait in case the send buffer is full; (2) it eliminates the data copy operations from application level to transport level; (3) it supports multiple data transmissions without the involvement of user defined logic; (4) many TCP connections can share the same data to be sent (e.g. users' requests for the same web page in the web server application), which saves the memory space and time caused by multiple copies to private send buffers.
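The upstream buffer pool and its allocation stack can be sketched as follows. The base address, segment size and pool depth below are illustrative assumptions; in the real design the pool lives in DDR3 SDRAM and the stack in hardware.

```python
# Sketch of the upstream buffer pool: a contiguous region divided into
# equal segments, with a free-stack managing allocation and reclaim.
class BufferPool:
    def __init__(self, base_addr=0x0000_0000, seg_size=64 * 1024, num_segs=1024):
        self.seg_size = seg_size
        # Allocation stack holds the start address of every free segment.
        self.free_stack = [base_addr + i * seg_size for i in range(num_segs)]

    def acquire(self):
        """On connection setup: pop a free segment; its address is then
        registered in the connection information."""
        return self.free_stack.pop() if self.free_stack else None

    def release(self, addr):
        """On disconnection: recycle the segment back to the pool."""
        self.free_stack.append(addr)

pool = BufferPool(num_segs=4)
buf = pool.acquire()        # bound to a newly established connection
pool.release(buf)           # reclaimed when the connection closes
```

A stack keeps allocation and reclaim at O(1) with a single pointer update, which is what makes it cheap to implement in control logic.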
4.2. Online dynamic scheduling
Data centers are typically designed to provide processing capacity for potential peak workload. However, the actual request load imposed on data centers changes over time. It is a waste of energy to activate all of the processing power when the system is at a low load. Fortunately, the standalone SOPC based architecture, where every FPGA works as an independent system, offers high flexibility in scheduling, which can be used to address the problem. Since the configuration and shutdown time of an FPGA is negligible compared to that of a general-purpose computer, it is possible to adjust the number of activated computing nodes according to realtime workload by directly powering up or down FPGAs in the cloud environment to save energy, whereas a general-purpose cluster must keep most of its machines awake.

Fig. 4. Online dynamic scheduling in standalone SOPC based reconfigurable clusters.
Dueto ourimpressively shortreconfiguration time inseconds (lessthantwoseconds)ratherthanminutes,whichmostofFPGA userstaketoreconfigureFPGAviaJTAGcables,aswestorerecon- figuration bits in NOR flash connectedto FPGA via BPI interface, we can exploit the advantages of a lot of dynamic load balanc- ing methods on web-based systems [27], many ofwhich can be adoptedinourarchitecture. WechoseHTTPredirectiontechnique to route incoming requests into our FPGA based cloud servers.
Fig. 4 demonstrates the process of such dynamic scheduling. First of all, the user on the client side connects and sends an HTTP GET request to the load balancing node. The load balancing node picks a running FPGA that has the least workload and sends its IP address back to the client side as an HTTP 302 redirection response. Then, the user sends the request to the allocated FPGA to receive services on the cloud side.
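The redirection step can be sketched as follows; the workload map, IP addresses, and response format are hypothetical stand-ins for the load balancing node's internal state:

```python
def pick_least_loaded(workloads):
    """Pick the running FPGA with the least workload.
    `workloads` maps FPGA IP -> current throughput (hypothetical monitoring data)."""
    return min(workloads, key=workloads.get)

def redirect_response(fpga_ip, path="/"):
    """Build the HTTP 302 response that redirects the client to the chosen FPGA."""
    return (
        "HTTP/1.1 302 Found\r\n"
        f"Location: http://{fpga_ip}{path}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )
```

After receiving the 302 response, the client re-issues its GET request directly to the returned FPGA address, so the load balancing node never touches the file payload.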
The load balancing node periodically (a five-second interval, for instance) gathers the running status of each FPGA to record the workload distribution of the whole system. Note that adding communication support between the load balancing node and the FPGA would cause additional chip area usage. However, in our system, we can reuse the network data path we have already built to support online dynamic scheduling on the FPGA side. The polling action is simply done by sending a unique HTTP GET request to each FPGA, which is processed by our TCP/IP stack. The on-chip microprocessor then collects the running status data and sends it back as the HTTP response. All that is needed are several registers to store the running status and a small piece of code running on the on-chip microprocessor to deal with the unique HTTP GET request. The running status and scheduling algorithm can vary for different applications. In our implementation, the average system throughput of each FPGA during the last time interval is obtained to monitor the loading condition. We define Ti as the current throughput of FPGA i, Tm as the maximum throughput that a single FPGA can afford while providing services under a certain QoS (Quality of Service) requirement, and N as the current number of total active FPGAs. The system level load rate is then defined as:
\[ L_{current} = \frac{\sum_{i=1}^{N} T_i}{N \cdot T_m} \tag{9} \]
Common practice (defined as the basic algorithm) uses an upper bound Lu and a lower bound Ll of the system level load rate to decide whether to increase or decrease the number of activated FPGAs. When the system load rate L_current drops below the lower bound Ll, the load balancing node first stops routing user requests to the FPGA that carries the least workload, and informs the onboard configuration module to power off that FPGA when it becomes idle. On the contrary, when the system load rate L_current rises over the upper bound Lu, the load balancing node informs the onboard configuration module to power up and configure a resting FPGA to provide service.
However, the parameter setting of the scheduling upper bound Lu and lower bound Ll is not trivial. A higher Lu increases the possibility of service failure, while a lower Lu leads to a waste of performance. Likewise, a higher Ll may cause oscillation between wake-up and power-off, and a lower Ll gives rise to a waste of energy. To address this problem, we introduce the following two optimizations to the basic algorithm:
4.2.1. Predictive scheduling
At every scheduling point, we first predict the system load of the next scheduling period (L_next) by quadratically fitting the system load data of the last three samples. Specifically, we define S(−j) = Σ_{i∈ActiveList(−j)} T_i(−j) as the total throughput of all FPGAs at the last j-th scheduling point, where S(0) stands for the current total throughput. Then we can compute the system load of the next scheduling period as

\[ L_{next} = \frac{S(-2) - 3S(-1) + 3S(0)}{|ActiveList| \cdot T_m}. \]

We use the maximum of L_next and L_current to decide whether to activate more FPGAs. Compared to the basic algorithm, which tends to choose a moderate Lu to avoid service failure when the system encounters explosive growth of load, our optimization can predict the trend of load variation and make adjustments in advance. As a result, we can use a higher Lu to make the whole system more energy efficient without risking service failure.
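The extrapolation formula follows from fitting a parabola through three equally spaced samples and evaluating it one step ahead; a minimal sketch (function names are ours):

```python
def predict_next(s_minus2, s_minus1, s_0):
    """Quadratic extrapolation: fit a parabola through S(-2), S(-1), S(0)
    (equally spaced) and evaluate at the next scheduling point, which
    simplifies to S(-2) - 3*S(-1) + 3*S(0)."""
    return s_minus2 - 3 * s_minus1 + 3 * s_0

def load_next(s_minus2, s_minus1, s_0, n_active, t_m):
    """L_next as defined above: predicted total throughput over capacity."""
    return predict_next(s_minus2, s_minus1, s_0) / (n_active * t_m)
```

For samples lying exactly on a parabola, e.g. S(t) = t^2 giving S(-2)=4, S(-1)=1, S(0)=0, the prediction 4 - 3 + 0 = 1 equals the true next value.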
4.2.2. Sliding window
Since the overall load on the cloud side is the sum of massive individual requests, it tends to change smoothly and remain stable with small fluctuations most of the time. Therefore, it is important to improve the energy efficiency when the system is in such a stable state. However, the basic algorithm can hardly guarantee high energy efficiency during the stable period, as it tends to choose a low Ll to avoid oscillation between wake-up and power-off. For example, if Ll is set to 50% and Lu to 80% while the system load fluctuates around 60% ± 8%, the basic algorithm would never decrease the number of activated FPGAs to improve system utilization.
To address this problem, we set up a sliding window of size k and keep track of the maximum value of the whole system throughput, S_max = max{S(0), S(−1), ..., S(−k+1)}, among the last k samples. By assuming that, with high probability, the load during the next scheduling period is no more than S_max, we can reduce the number of activated FPGAs to the value that keeps the system at the highest utilization under the upper bound Lu, which is calculated as ⌈S_max / (T_m · L_u)⌉. Although this optimization increases the response latency of powering off FPGAs when the system load rate L_current drops, it significantly improves the energy efficiency when the system is in a stable state.
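Under these assumptions, the number of FPGAs needed during a stable period can be computed as follows (variable names are illustrative; keeping at least one node active is our added safeguard):

```python
import math

def target_active_count(history, t_m, l_u, k):
    """Size the active set for the peak total throughput seen in the last k
    samples, so utilization stays as close to (but under) Lu as possible."""
    s_max = max(history[-k:])
    # ceil(S_max / (Tm * Lu)); never drop below one active FPGA
    return max(1, math.ceil(s_max / (t_m * l_u)))
```

With T_m = 8.3 Gbps and Lu = 0.8, each FPGA may carry about 6.64 Gbps, so a recent peak of 35 Gbps requires six active nodes.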
More specifically, we describe our algorithm in pseudo-code as shown in Algorithm 1.
5. Case studies
In this section, we use two case studies to verify the effectiveness of our design and implementation of cloud based applications using the proposed architecture. The web server application, which plays a fundamental role in the current cloud computing infrastructure, is chosen as the base case study. Since the web server is the major channel connecting users and cloud services, we use this case study to evaluate how the architecture meets the requirements of low delay, high bandwidth and high-concurrency I/O performance in a cloud environment. The second case study is a cloud based ECG (electrocardiogram) classification, which is an important bio-cloud application [28]. Unlike the web server application, which is only I/O-intensive,
Algorithm 1 Online dynamic scheduling.
Input: ActiveList — set of activated FPGAs
Input: IdleList — set of ready-to-power-off FPGAs
Input: PowerOffList — set of powered-off FPGAs
Input: interval — scheduling interval
Input: k — window size
Input: Lu — scheduling upper bound
1:  loop
2:    for all FPGA i in ActiveList ∪ IdleList do
3:      gather current throughput Ti
4:    end for
5:    S(0) ← Σ_{i∈ActiveList} Ti
6:    Lcurrent ← S(0) / (|ActiveList| · Tm)
7:    Lnext ← (S(−2) − 3S(−1) + 3S(0)) / (|ActiveList| · Tm)    ▷ quadratic fitting
8:    Ltarget ← max{Lcurrent, Lnext}
9:    if Ltarget > Lu then    ▷ activate new FPGAs
10:     repeat
11:       if IdleList is not empty then
12:         node ← IdleList.maxT_pop()
13:         ActiveList.push(node)
14:       else if PowerOffList is not empty then
15:         node ← PowerOffList.pop()
16:         reconfigure FPGA node
17:         ActiveList.push(node)
18:       end if
19:       Ltarget ← (|ActiveList| / (|ActiveList| + 1)) · Ltarget
20:     until Ltarget < Lu
21:   else
22:     Smax ← max{S(0), S(−1), ..., S(−k+1)}
23:     Nt ← ⌈Smax / (Tm · Lu)⌉
24:     while |ActiveList| > Nt do    ▷ deactivate FPGAs
25:       node ← ActiveList.minT_pop()
26:       IdleList.push(node)
27:     end while
28:   end if
29:   for all FPGA i in IdleList do    ▷ power off FPGAs
30:     if Ti == 0 then
31:       IdleList.delete(i)
32:       power off FPGA i
33:       PowerOffList.push(i)
34:     end if
35:   end for
36:   wait(interval)
37: end loop

the cloud based ECG classification is both I/O-intensive and computation-intensive. We use this case study to demonstrate how the architecture handles realistic cloud-based workloads. Since we have already presented the common modules of the SOPC based architecture in detail, in this section we mainly focus on the design of the user defined application logic.
5.1. Web server
The primary function of a web server is to deliver web pages or files to clients. The communication between client and server takes place using the Hypertext Transfer Protocol (HTTP). A user agent, commonly a web browser, initiates communication by making a request for a specific resource via HTTP, and the server responds with the content of that resource. In our design (Fig. 5), the user defined application logic contains two main modules: a URL parser and a data locating module. To narrow the bandwidth gap between network I/O and SSD storage I/O, a cache is maintained in
Fig. 5. Architecture of the user defined application logic for webserver.
the DDR3 memory. For simplicity, we use the embedded microprocessor to assist cache management, while a hash table in on-chip block RAM is used by the hardware to quickly locate the cached address of a requested file. When the TCP/IP stack offloads the upstream data, the URL parser reads the HTTP GET request from the DDR3 memory and parses it to obtain a file name. Then the data locating module checks whether the requested file is cached in DDR3 memory using the hash table. If so, the data locating module directly signals the data fragment engine by writing the address and length information of the cached data to the corresponding data send message, without copying the data to a buffer; otherwise, it notifies the embedded microprocessor to move the data from the SSD to the cache and then repeats the previous operations. Due to the limited space of the paper, we do not elaborate further on the detailed hardware implementation of the two modules, which is not the core issue of this case.
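In software terms, the two modules behave roughly as below; the cache dictionary and the CPU notification callback are stand-ins for the BRAM hash table and the microprocessor mailbox:

```python
def parse_get(request: bytes):
    """URL parser: extract the path from the HTTP GET request line."""
    line = request.split(b"\r\n", 1)[0]
    method, path, _version = line.split(b" ")
    return path if method == b"GET" else None

def locate(path, cache, notify_cpu):
    """Data locating module: a hash-table hit yields (DDR3 address, length)
    for the data-ready event; a miss asks the embedded CPU to stage the
    file from SSD into the cache first."""
    if path in cache:
        return cache[path]          # cached file: signal data fragment engine
    notify_cpu(path)                # CPU copies SSD -> DDR3 cache, then retry
    return None
```

Note that on a hit only the (address, length) pair travels to the TCP stack; the payload itself is never copied, matching the zero-copy send path described in Section 4.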
5.2. Cloud based ECG classification
The electrocardiogram (ECG) is a graphic representation of the electrical activity of the heart, which is a useful tool to diagnose heart diseases. Since heart disease is still the most common cause of death in the modern world, the early diagnosis of such diseases is significantly important. Recent advances in mobile technology and cloud computing have inspired numerous designs of cloud-based healthcare services and devices [28,29]. Medical data such as ECG can be collected and transmitted automatically to the cloud based analysis system from anywhere, and timely feedback can be returned to patients through the network. The entire process of automated diagnosis can be generally subdivided into three procedures: beat detection, feature extraction, and classification [30].
Beat Detection: The QRS complex detection algorithm described in [31] is used for our design of beat detection. This algorithm is based on the wavelet transform (WT), in which a target signal is decomposed into an approximate signal and a set of relatively high-frequency wavelets. Each stage of the WT consists of a high-pass filter G[·], a low-pass filter H[·] and two down-samplers by 2 (↓2). The down-sampled outputs of the first high- and low-pass filters provide the detail, D1, and the approximation, A1, respectively. Following the first level of decomposition, the approximation A1 is further decomposed in the second level using the same filters, and so on. The detection of the QRS complex mainly focuses on the detection of R peaks in the raw ECG data. In D2, every special wave where a negative minimum point is followed by a positive maximum point corresponds to an R peak. In other words, we can capture the position of R peaks by identifying the zero-crossing of extremum pairs in D2. Then the QRS complex is extracted from the
Fig. 6. Architecture of the user defined application logic for cloud based ECG classification.
raw ECG data by segmenting a fixed number of samples centered at R peaks.
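The zero-crossing rule can be sketched as follows; this is a simplification of [31] that omits adaptive thresholding and refractory-period handling:

```python
def find_r_peaks(d2, threshold=0.0):
    """R peaks from the D2 subband: report each sample where the signal
    crosses from below -threshold to above +threshold, i.e. the
    zero-crossing between a negative minimum and a positive maximum."""
    return [i for i in range(1, len(d2))
            if d2[i - 1] < -threshold and d2[i] > threshold]
```

In the hardware module, the same comparison runs on the streaming D2 output, and each detected index triggers extraction of the surrounding fixed-length QRS segment.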
Feature Extraction: We use the method presented in [32] to extract features from the QRS complex. The two-level discrete wavelet decomposition produces signal components in different subbands. In order to reduce the dimensionality of the extracted signal components, 11 statistics over the set of the wavelet coefficients were used to represent the time-frequency distribution of the ECG signals:
1. Variance of the original QRS complex signal;
2. Variance of the wavelet coefficients in each subband (D1, D2, A2);
3. Variance of the autocorrelation functions of the wavelet coefficients in each subband (D1, D2, A2);
4. Ratio of the minimum to the maximum of the wavelet coefficients in each subband (D1, D2, A2);
5. Instantaneous RR intervals.
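To make the 11-element feature vector concrete, here is a pure-Python sketch; the Haar-like filter pair is a placeholder, as the actual wavelet coefficients used in [32] are not reproduced here:

```python
from statistics import pvariance

def _conv(x, k):
    """Full 1-D convolution (pure Python)."""
    n, m = len(x), len(k)
    return [sum(x[i - j] * k[j] for j in range(m) if 0 <= i - j < n)
            for i in range(n + m - 1)]

def dwt_level(x, h, g):
    """One wavelet stage: low-pass h / high-pass g filtering, downsample by 2."""
    return _conv(x, h)[::2], _conv(x, g)[::2]

def ecg_features(qrs, rr_interval, h=(0.5, 0.5), g=(0.5, -0.5)):
    """The 11 statistics listed above for one QRS segment (placeholder filters)."""
    a1, d1 = dwt_level(qrs, h, g)
    a2, d2 = dwt_level(a1, h, g)
    feats = [pvariance(qrs)]                                       # statistic 1
    feats += [pvariance(b) for b in (d1, d2, a2)]                  # statistic 2
    feats += [pvariance(_conv(b, b[::-1])) for b in (d1, d2, a2)]  # statistic 3
    feats += [min(b) / max(b) for b in (d1, d2, a2)]               # statistic 4
    feats.append(rr_interval)                                      # statistic 5
    return feats
```

The hardware FE pipeline computes the same quantities with dedicated variance, autocorrelation and min/max modules fed by the shared two-level WT module described below.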
Classification: We use an artificial neural network (ANN) to classify the extracted features, which is a widely used method for classification in machine learning and pattern recognition.
The architecture of our cloud based ECG classification in FPGA is shown in Fig. 6. When users connect to the FPGA and upload their ECG data through the TCP/IP protocol, the TCP/IP stack engine handles these TCP connections and offloads the ECG data to the DDR3 memory. Then the TCP/IP stack engine sends the connection information messages to the task assignment module through a FIFO. The latter periodically polls all feature extraction (FE) pipelines to find an idle one, and then reads and dispatches the ECG data associated with the head-of-the-FIFO TCP connection to the idle FE pipeline. Since the feature extraction stage is the most computation-intensive and time-consuming stage, we instantiate multiple FE pipelines to meet the system level throughput requirement.
The input of each FE pipeline is the sequential samples of ECG data related to one user. After several processing steps, including R-peak detection, two-level wavelet transformation, autocorrelation computation, variance computation, and min/max computation, the FE pipeline outputs the 11 statistical features described above as well as its connection information message to a feature FIFO. The classification module contains a pre-trained ANN with 11 input neurons, 20 hidden neurons and 6 output neurons. It automatically reads features from the feature FIFO, classifies the type of the ECG data, writes the result to the DDR3 memory, and then notifies the TCP/IP engine to send the diagnostic result back to the user.
Since both the R peak detection and the feature extraction contain a wavelet transformation procedure, we share a two-level WT module between these two processes.

Fig. 7. Architecture of feature extraction pipeline.

Fig. 8. Architecture of light-weight neuron design in FPGA.

The architecture of the feature extraction pipeline is shown in Fig. 7. When ECG data is ready in the local block RAM, the WT module reads it out and executes a pipelined wavelet transformation. The D2 output is sent to the R peak detection module, where all R peak positions as well as the 64-point QRS segments centered at the R-peaks are extracted from the record. Also, the instantaneous RR intervals are computed and sent to the feature output module. After that, a second wavelet transformation is applied to the extracted QRS segments. The controller stores all subband data and controls the execution order to compute all statistical features via the different modules, including the Min/Max module, the autocorrelation module, and the variance module. In our design, the memory space in the block RAM is divided into two identical parts to support ping-pong operations, which can hide the latency related to data acquisition.
The architecture of a single neuron in the hidden layer and output layer is illustrated in Fig. 8. Since the most time-consuming stage is feature extraction rather than classification, we adopt a more light-weight neuron design to save on-chip resources. Each neuron contains a finite state machine to control the access of data from the outputs of the former neurons. During each cycle, the neuron executes one multiply-accumulate operation on one of its input data with its pre-trained weight. The implementation of the sigmoid function uses a look-up table (LUT) stored in the on-chip block RAM [33]. The use of the LUT reduces the resource requirement and improves the speed. In our design, a global 64K-depth and 16-bit-width LUT is used as a compromise between resource utilization and the resolution of the output.
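A behavioral model of such a neuron might look like the following; the [-8, 8] input range mapped onto the LUT is our assumption, as the paper does not state the quantization range:

```python
import math

def build_sigmoid_lut(depth=1 << 16, lo=-8.0, hi=8.0, width_bits=16):
    """Precompute the sigmoid into a 64K-depth, 16-bit-wide table (the sizes
    used above). The [lo, hi] input range is an assumed quantization choice."""
    scale = (1 << width_bits) - 1
    return [round(scale / (1.0 + math.exp(-(lo + (hi - lo) * i / (depth - 1)))))
            for i in range(depth)]

def neuron(inputs, weights, lut, lo=-8.0, hi=8.0, width_bits=16):
    """One multiply-accumulate per cycle over the inputs, then a single LUT
    read in place of computing the sigmoid directly."""
    acc = sum(x * w for x, w in zip(inputs, weights))   # MAC loop
    acc = min(max(acc, lo), hi)                         # saturate to LUT range
    idx = round((acc - lo) / (hi - lo) * (len(lut) - 1))
    return lut[idx] / ((1 << width_bits) - 1)
```

The trade-off noted in the text is visible here: a deeper or wider table costs more block RAM but reduces the quantization error of the activation.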
6. Experimental results and discussion

6.1. Building the experimental prototype platform
We have been involved in building the mimicry computer [34], and used this reconfigurable platform (Fig. 9) to evaluate our SOPC based architecture for cloud applications. The current prototype platform consists of 16 computing nodes, each of which is packaged in a standard 2U rack-mountable chassis, as shown in Fig. 10. Each computing node houses one motherboard and four daughter boards. Each daughter board contains: 1) an XC6VLX550T FPGA from Xilinx; 2) three SO-DIMM channels that support up to 16GB
Fig. 9. Experimental prototype platform. (a) Reconfigurable prototype platform containing 16 2U rack-mountable computing nodes; (b) one of its quad-FPGA reconfigurable computing nodes.
Fig. 10. Board architecture for single node in experimental prototype platform.
DDR3 SDRAM and 72Mb SRAM; 3) 10G Ethernet with a XAUI/SFP+ interface; 4) a SATA3 interface connecting an Intel 320GB SSD; 5) 32MB NOR flash on a BPI interface; 6) a PCI-E X8 interface. Data transfer between adjacent FPGAs can be carried out directly through point-to-point interconnections on the motherboard. All FPGAs are interconnected by a fiber channel switch from Cisco, with their IP addresses pre-defined in the flash attached to each FPGA.
For the purpose of remote configuration, an onboard configuration module based on an Intel Atom CPU is integrated on each motherboard. Tools are provided for users to perform all kinds of operations, such as starting up or shutting down a certain FPGA, downloading the bitstream, and even debugging remotely using Chipscope, when a PC host is connected to the Atom system through the network. By using the BPI configuration interface, it takes less than two seconds to configure the FPGA.
6.2. Resources utilization
As mentioned above, we implemented both applications on our reconfigurable prototype platform. Each instance of the SOPC based architecture was implemented in a Xilinx XC6VLX550T FPGA. The embedded microprocessor is implemented by MicroBlaze, a light-weight soft core provided by Xilinx. The system clock frequency of both implementations is 156.25 MHz, which is equal to the frequency of the 10Gbps PHY interface. Table 1 shows the device utilization summary of the main modules for a single core, as generated by Xilinx ISE 14.1. All common modules in total use only 6.4% of the slice registers, 13.5% of the slice LUTs, 25.1% of the BRAMs and less than 1% of the DSP48E1s, which leaves most of the on-chip resources for user defined application logic. Note that
Table 1
Device utilization summary for single core.

XC6VLX550T                          Slice Reg   LUTs      BRAM   DSP48
Available                           687,360     343,680   2528   864
Common Modules
  TCP/IP Stack                      21,312      22,604    126    6
  SATA Controller                   3108        4816      10     0
  Microblaze                        1620        2185      498    3
  DDR3 Controller                   10,028      8200      0      0
  AXI interconnect                  4155        4989      0      0
  Ethernet                          3704        3555      0      0
User Defined Application Logic
  web server                        9727        11,563    114    0
  ECG classification                87,242      72,972    162    450
The state-of-the-art TOE in academia
  TOE [21]                          20,611      19,026    279    –
our TCP/IP stack has a similar resource usage to the state-of-the-art TOE in academia [21], but supports more concurrent TCP sessions.
6.3. Performance evaluation of the SOPC based web server
The performance of the SOPC based web server was evaluated by a Spirent Avalanche 3100 [35], a popular network performance testing instrument that is able to evaluate OSI Layer 4–7 stateful traffic performance over multi-10Gbps physical layers. Three performance metrics are measured:
• Throughput, measured in bits per second (bps), which is the rate of successful data delivery in a given time period;
• Number of HTTP transactions per second, which refers to the number of complete HTTP transactions (including connection establishment, file transmission, and connection teardown) performed by the web server per second. This metric is highly related to the response time of a single transaction when the requested files are very small;
• Max number of concurrent connections, which refers to the maximum number of active TCP links processed concurrently.
Results were compared with two other mainstream industrial systems:
• Apache HTTP web server [36], which is the most widely used open-source HTTP server in industry. We launched Apache 2.2 HTTP web server instances on IBM x3635 M3 servers with an Intel X5650 CPU (2.66 GHz), 8GB DDR3 memory and an Intel X520-SR2 10GE network adapter.
• Master-slave based web server, where we connected our daughter board as a slave device to IBM x3635 M3 servers via the PCI-E interface. We integrated the QuickTCP [22] IP core, a third-party 10GE TCP/IP stack in FPGA, to offload network data processing, leaving the master side CPU to handle application level functions such as file management and lookup.
Files of different sizes were stored in the SSD for the SOPC based web server, while the master-slave based implementation held such files in the hard drive hosted on the master side.
6.3.1. Throughput
Fig. 11 shows that the peak throughput of the SOPC based web server is 8.3Gbps, which is slightly lower than the Apache and master-slave based web servers. However, when the size of the requested file became small, the standalone SOPC based web server performed much better than the other two systems, because in this situation the processing delay of a single complete HTTP transaction, including the setting up and tearing down of the TCP connection, turned out to be
Fig. 11. Throughput at different file sizes.
Fig. 12. HTTP transaction rate at different file sizes.
the bottleneck. For the master-slave based implementation, many message interactions and remote memory access operations via PCI-E were needed between the master side and the slave side, which added more delay to a single HTTP transaction. For the Apache based implementation, OS operations such as system calls, user space to kernel space memory copies and interrupt handling bring more delay. In our standalone SOPC based web server, however, latency is reduced due to the tightly-coupled architecture. Compared with [21], our TCP/IP stack also has a similar throughput (8.3Gbps vs. 8.6Gbps).
6.3.2. Number of HTTP transactions per second
Fig. 12 illustrates our explanation from another perspective. The HTTP transaction rate of the SOPC based web server can reach up to 120,000 TPS (Transactions per Second) when transmitting the same 1KB file, which is 6.8 times that of the master-slave based implementation and 9.2 times that of the Apache web server. However, as the file size increases, the performance gap narrows, and the throughput limit becomes the bottleneck.
We can use the performance model introduced in Section 2 to analyze the results. In the master-slave based implementation, the communication latency between the master side and the slave side via the PCI-E interface is nearly 4μs. During the process of a complete HTTP request-response operation, including the three-way handshake to establish the connection and the four-way handshake to disconnect it, there are 12 communication operations over the PCI-E interface. According to Eq. (6), the time consumed by control I/O operations Tcio is approximately 4μs × 12 = 48μs. On the other hand, according to Eq. (7), the time consumed by data I/O operations Tdio is almost equal to θ_PCI-E + θ_HardDisk × cache_miss_rate. When the file is small and cached in the memory of the master side, Tdio is nearly equal to the PCI-E latency. The HTTP transaction rate of the master-slave based implementation is 17,500 TPS when the file size is 1KB, which means it needs 1,000,000μs / 17,500 = 57μs for each complete transaction. From this computation, we can infer that most of the time is consumed by control I/O operations. This observation indicates that the master-slave based architecture is quite unsuitable for such situations. In our SOPC based implementation, however, the control I/O latency is small enough to be ignored, which dramatically reduces the processing time per HTTP transaction to only 1,000,000μs / 120,000 = 8.3μs. This result indicates that our SOPC based web server can offer a better user experience in terms of lower response time under highly concurrent requests.
6.3.3. Max number of concurrent connections
We continuously generated keep-alive HTTP requests to measure the max number of concurrent connections of the three systems. TCP connections established by such requests are not torn down after the requested file is completely transmitted. The number of max concurrent connections was recorded when the web server was no longer able to set up more TCP connections. Experimental results show that the max concurrent TCP connections supported by our standalone SOPC based web server, the Apache web server and the master-slave based web server are nearly 100K, 80K, and 128, respectively. Since the QuickTCP [22] IP core only supports one TCP/IP session per instantiated pipeline, and on-chip resource constraints limit the number of pipelines that can be integrated, the max number of concurrent connections of the master-slave system is limited. Also, compared with [21], our TCP/IP stack supports many more concurrent TCP connections (100K vs. 10K), with similar performance and resource consumption.
6.3.4. Energy efficiency under peak performance
We use the term million bits transmitted per joule (Mb/J) to define the energy efficiency of different systems. The power consumption of one instance of the SOPC based web server running under maximum workload was 40W (including the power of the off-chip devices and the cooling fan), so its peak energy efficiency was calculated as 8320Mbps ÷ 40W = 208Mb/J. In contrast, the Apache web server consumed 160W under full workload, for a peak energy efficiency of 58Mb/J, while the master-slave based implementation consumed 180W, leading to an even worse peak energy efficiency of 52Mb/J. In the master-slave based web server, most of the energy was consumed by the conventional PC on the master side. Since the application mapped to such an architecture does not have a large serial computation-intensive portion to be executed on the high performance general purpose processor, web servers based on the master-slave style architecture are obviously inefficient.
6.3.5. Energy efficiency in mimicry cluster
We measured the energy efficiency of our SOPC based web server in the Mimicry Cluster with online dynamic scheduling. We emulated dynamically changing requests where the peak workload is nearly 4 times the average workload. 40 files ranging from 1MB to 64MB were used as requested files and 12 IBM x3635 M3 servers were used as client machines. To balance QoS and dynamic energy efficiency, we empirically set the scheduling upper bound Lu to 85%, the window size to 15 and the scheduling interval to 5s. The experiment was carried out for 24 hours, and at the point of the request peak, 12 FPGAs were used to provide web services. A cluster containing 12 master-slave based web servers and a cluster containing 12 Apache web servers were tested under the same input. On account of the unsustainable cost of frequently powering up and shutting down conventional PCs, we kept all 12 nodes awake during the test.
Fig. 13. Comparison of energy efficiency among our standalone SOPC implementation, FPGA-in-slave-mode implementation and Apache implementation of web server. We evaluate the energy efficiency of both a single node under peak performance and a cluster under dynamic load.
Table 2
Latency of the user defined application logic in cloud based ECG classification application (cycles).

Data acquisition   Feature extraction   Neural network
4096               9728                 61
We measured the total energy consumption P (J) and the total file size S (MB) downloaded during the test period, then calculated the energy efficiency as 8S/P (Mb/J). Fig. 13 shows the energy efficiency of the three systems under the dynamically changing workload. The energy efficiency of our standalone SOPC based web server cluster was 152Mb/J, which was 73.8% of its peak energy efficiency (energy efficiency under maximum workload). However, the energy efficiency of the master-slave based web server was 12.8Mb/J, which was 24.7% of its peak energy efficiency. Moreover, the Apache cluster achieved 26.4% of its peak energy efficiency. The results indicate that, using online dynamic scheduling, our SOPC based system is able to effectively utilize the processing power of the activated computing nodes while powering off the rest of the nodes to save energy. Thus, it gains much more energy efficiency in a dynamic way, compared to the other two statically scheduled systems.
6.4. Performance evaluation of the cloud based ECG classification in FPGA
In our study of cloud based ECG classification, 23 ECG records were selected from the MIT-BIH arrhythmia database [37], which is widely used in the research community of ECG analysis. All records were digitized at 360 samples per second with 11-bit resolution for slightly over 30 minutes. These records include six ECG beat types: the normal beat (N), the left bundle branch block beat (LBBB), the right bundle branch block beat (RBBB), the atrial premature beat (APB), the premature ventricular contraction (PVC), and the paced beat (PB). For each beat type, we chose 50% of the beats for training and the other 50% for testing. The parameters related to the neural network were trained offline using software and then fixed in the FPGA implementation. Ten FE pipelines were integrated in the feature extraction module to maximize the throughput and make full use of the reconfigurable resources. The size of the block RAM in each FE pipeline is 16KB, which can store 8K points of ECG data at 16-bit width. Half of the memory space is used for current processing, while the other half is used for buffering new ECG data.
Table 2 illustrates the latency of the user defined application logic in our design. The latency of the FE pipeline to process 4K points of ECG data containing 12 heart beats is 9782 cycles. Al-