Deep Residual CNN Based Model for Human Activity Recognition System
By
Saifuddin Mohammad Tareque ID: 173-25-630
This Report Presented in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science and Engineering.
Supervised By
Md Zahid Hasan
Assistant Professor & Coordinator of MIS, Department of CSE
Daffodil International University
DAFFODIL INTERNATIONAL UNIVERSITY DHAKA, BANGLADESH
May 2019
ACKNOWLEDGEMENT
First of all, my heartiest thanks and gratefulness to Almighty Allah for His divine blessing, which has made it possible to complete this thesis successfully.
I would like to thank my honorable teacher and project supervisor Md Zahid Hasan, Assistant Professor, Department of CSE, Daffodil International University. His endless patience, scholarly guidance, continual encouragement, constant and energetic supervision, constructive criticism, valuable advice, and careful reading and correction of many inferior drafts at every stage have made it possible to complete this project.
We would like to express our heartiest gratitude to Dr. Syed Akhter Hossain, Head, Department of CSE, for his kind help to finish our project and we are also thankful to all the other faculty and staff members of our department for their co-operation and help.
We must acknowledge with due respect the constant support and patience of our parents.
Finally, we would like to thank all of our course mates at Daffodil International University, who took part in discussions while completing the course work.
ABSTRACT
Human Action Recognition (HAR) is a significant application realm in computer vision, but high-precision recognition of human action against complex backgrounds is still an open question. Recently, deep learning approaches have been used widely to enhance recognition accuracy in different application areas. In our research, a deep Convolutional Neural Network (CNN) based on the ResNet-50 model is proposed as the classifier for HAR because it holds a clear advantage over other classifiers. Our proposed research work uses the publicly accessible UCF-101 dataset, which provides the largest diversity in the HAR field, as most of the available action recognition datasets are not realistic. Additionally, the UCF-101 dataset intends to support further research into action recognition by learning and surveying new pragmatic action categories.
TABLE OF CONTENTS
CONTENTS PAGE
Approval I
Declaration II
Acknowledgements III
Abstract IV
CHAPTERS
CHAPTER 1: INTRODUCTION 01-03
1.1 Introduction 01
1.2 Objectives 02
1.3 Motivation 03
1.4 Expected Outcome 03
1.5 Report Layout 03
CHAPTER 2: RELATED WORKS 04-10
2.1 Introduction 04
2.2 Automated Agent Scenario 05
2.3 Save Model and Reuse 05
2.4 Related Works 05
2.5 Scope of the Problems 09
2.6 Challenges 10
CHAPTER 3: REQUIREMENT SPECIFICATION 11-18
3.1 Introduction to Dataset 11
3.2 Workflow of the Proposed Method 12
3.3 Workflow Graph 14
3.4 Prediction Flow Process 16
3.5 Implementation Requirements 17
CHAPTER 4: PROPOSED METHODOLOGY 19-26
4.1 Introduction to CNN 19
4.2 ResNet Architecture 20
4.3 Classification Process 24
4.4 Developed Work 24
CHAPTER 5: EXPERIMENTAL RESULT 27-31
5.1 Result Graphs 27
5.2 Experimental Result 29
CHAPTER 6: CONCLUSION & FUTURE SCOPE 32
6.1 Conclusion 32
6.2 Future Scope 32
REFERENCES 33
LIST OF FIGURES
FIGURES PAGE NO
Figure 2.1: BoVWs Representation for Action Recognition 07
Figure 2.2: Optical Flow Sequence 08
Figure 2.3: Graphical Interpretation of M-PCCA 09
Figure 3.1: Action Classes of UCF101 12
Figure 3.2: Proposed Method Workflow 13
Figure 3.3: Flow Graph 15
Figure 3.4: Prediction Workflow 16
Figure 4.1: CNN Architecture 19
Figure 4.2: Residual Learning Building Block 21
Figure 4.3: ReLU 22
Figure 4.4: ResNet Architecture 23
Figure 4.5: ImageNet Benchmark 23
Figure 4.6: Image Conversion 24
Figure 4.7: Training ResNet-50 25
Figure 4.8: Prediction 26
Figure 4.9: Prediction Plotting 26
Figure 5.1: Validation Accuracy 27
Figure 5.2: Validation Loss 27
Figure 5.3: Overall Accuracy 28
Figure 5.4: Overall Loss 28
Figure 5.5: Cost Function Summary 29
Figure 5.6: Comparison with Previous Work 30
LIST OF TABLES
TABLE PAGE NO
Table-1 Summary of Major Action Recognition Datasets 12
Table-2 Comparison with Previous Work 31
CHAPTER 1
INTRODUCTION
1.1 Introduction
In the last decade, human action recognition (HAR) has become an increasingly attractive research topic with several applications, such as video surveillance, virtual reality, and intelligent human-computer interaction. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations.
HAR consists of several stages, which describe the features that define activities or low-level actions. A generic description of human action recognition from image sequences consists of two steps: 1) extract complex handcrafted features from raw input video frames, and 2) build a classifier based on these features. Some of the commonly used features for human action recognition are Histogram of Oriented Gradients (HOG) [1], Histogram of Optical Flow (HOF), Motion Interchange Patterns (MIP), Space-Time Interest Points (STIP), action bank features [2] and dense trajectories [3]. However, it is difficult and time-consuming to extend these features to other systems. A large part of hand-designed features are driven by the task, and different tasks may use completely different features. In reality, it is hard to know what kind of feature is important to a specific task, so feature selection is highly dependent on the specific problem. Especially for human action recognition, different kinds of sports show very big differences in appearance and motion models, and it is hard to capture the essential features of an action under drastic environmental change. Therefore, a generic feature extraction method needs to be proposed to alleviate the need for hand-engineered features and reduce the calculation scale.
CNN [4] is a deep model that obtains complicated hierarchical features via convolutional operations alternating with sub-sampling operations on the raw input images. It has been confirmed that CNNs can achieve excellent performance in visual target recognition tasks through appropriate adjustment during training, and CNNs are invariant to particular poses, illumination, and disorderly environmental change. The first attempt at HAR using CNN was by [5], who developed a novel 3D CNN model that extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generated multiple channels of information from the input frames, and the final feature representation was obtained by combining information from all channels. [7] proposed a deep convolutional network architecture for recognizing human actions in videos using action bank features of the UCF50 database. [8] proposed a novel dynamic neural network model which can recognize dynamic visual image patterns of human actions based on learning; a convolutional neural network (CNN) and multiple timescale recurrent neural networks (MTRNN) were introduced. [9] proposed a new method which combines part-based models and deep learning by training pose-normalized CNNs.
Although CNN is a good option for HAR, this method still has a weakness: the kernels/weights employed in the convolution are trained by back-propagation (BP) neural networks, which is very time consuming. In this paper, to address this problem for CNN-based HAR, a convolutional auto-encoder (CAE) pre-training strategy is proposed. This method discovers good CNN initializations that avoid the numerous distinct local minima of the highly non-convex objective functions arising in virtually all deep learning problems.
1.2 Objectives
The main objective of this thesis is to develop an automated agent for recognizing human action using UCF-101 dataset [10].
The goals of our thesis are:
• Divide every video into image frames (1 frame per 10 seconds).
• Reshape differently sized images and convert them into NPZ arrays.
• Divide the dataset into test and train folders.
• Retrain the ResNet-50 model with our training dataset.
• Save model progress as model checkpoints.
• Save the best model weights for future prediction.
• Predict human action type from the unknown test dataset.
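The first two goals can be sketched in plain Python. The helper below only computes which frame indices to sample at a fixed interval; in the actual pipeline these indices would be passed to OpenCV's `VideoCapture` to grab frames, and the resized frames then saved with `numpy.savez`. The function name is illustrative, not taken from the thesis code.

```python
def sample_frame_indices(frame_count, fps, interval_s=10):
    """Return the indices of frames sampled once every `interval_s` seconds.

    UCF101 clips have a fixed frame rate of 25 FPS, so one frame per
    10 seconds means one frame every 250 frames.
    """
    step = int(fps * interval_s)
    return list(range(0, frame_count, step))

# Example: a 30-second clip at 25 FPS yields frames 0, 250 and 500.
print(sample_frame_indices(750, 25))  # [0, 250, 500]
```

With OpenCV, each returned index could then be seeked to via `cap.set(cv2.CAP_PROP_POS_FRAMES, idx)` before reading the frame.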
1.3 Motivation
Nowadays human action recognition algorithms empower many real-world applications. Security is becoming more important in our daily life and is one of the most frequently discussed topics nowadays. Our motivation is to develop and train an agent which can recognize human action. With the input of a network of cameras, a visual surveillance system powered by action recognition and prediction algorithms may increase the chances of capturing a criminal on video, and reduce the risk caused by criminal actions. Another important motivation of our research is to improve human-robot interaction, which is popularly applied in home and industrial environments. Imagine that a person is interacting with a robot and asking it to perform certain tasks, such as "passing a cup of water" or "performing an assembling task". Such an interaction requires communication between robots and humans, and visual communication is one of the most efficient ways.
1.4 Expected Outcome
In our thesis our main focus is to develop and train an agent which can predict human action type using 13320 videos from 101 action categories. The agent is based on a convolutional neural network where we have used a pre-trained model (ResNet-50) rather than a handcrafted one. ResNet-50 gives a performance boost of about 10% in comparison with handcrafted models. After training, the agent has an accuracy of 92.25%.
1.5 Report Layout
The first chapter contains the Introduction, Objectives, Motivation, Expected Outcome and Report Layout of our project. The second chapter contains the project introduction, related works, comparative studies, scope of the problem and also the challenges of our project. The third chapter covers the requirement specification: the dataset we have used, the workflow of the proposed method, the workflow graph, the prediction flow process and the implementation requirements. The fourth chapter describes the proposed methodology in detail and the training strategies for the agent. Our fifth chapter is all about implementation and accuracy testing; it contains the implementation of ResNet-50, the prediction visualizer and the testing modules. Our last chapter contains the conclusion of the full thesis. This report covers our proposed system, its problems and solutions, and future improvements.
CHAPTER 2
BACKGROUND AND RELATED WORKS
2.1 Introduction
Human action recognition is an active topic in the field of computer vision. This is because of the rapidly growing amount of video data and the huge number of potential applications based on automated video analysis, for example visual surveillance, human-machine interfaces, sports video analysis, and video retrieval. Among these applications, one of the most fascinating is human action recognition, particularly high-level behavior recognition. An action is a sequence of human body movements and may involve several body parts simultaneously. From the perspective of computer vision, recognizing an action means matching the observation (e.g., a video) with previously defined patterns and then assigning it a label, i.e., an action type. Depending on complexity, human activities can be categorized into four levels: gestures, actions, interactions and group activities [11], and much research follows a bottom-up construction of human movement recognition. Significant components of such frameworks include feature extraction, action learning, classification, action recognition, and segmentation [12]. A straightforward procedure comprises three stages, namely detection of the human and/or its body parts, tracking, and then recognition using the tracking results. For example, to recognize "shaking hands" activities, two people's arms and hands are first detected and tracked to produce a spatial-temporal description of their movement. This description is compared with existing examples in the training data to decide the action type. This standard class of action recognition methods depends heavily on the accuracy of tracking, which is not reliable in cluttered scenes. Numerous other approaches have been proposed and can be categorized by different criteria, as in existing survey papers. Poppe [12] examined human action recognition from image representation and action classification separately. Weinland et al. [13] surveyed systems for action representation, segmentation and recognition. Turaga et al. [14] separated the recognition problem into activity and action according to complexity, and organized methodologies according to their capacity to handle varying degrees of complexity. There exist numerous other classification criteria.
2.2 Automated Agent Scenario
Human action recognition has become an important issue in this modern, technology-based era. Recognizing human action using a trained agent is very helpful because the system is automated. Moreover, humans are prone to fatigue and are limited in how much work they can do each day. To overcome this type of limitation, an automated agent can be of great help: machines are free from fatigue and have no limitation on working hours. A trained agent can predict more quickly than a human and can process large amounts of data, and new classifications can be trained in, which in turn increases the agent's accuracy.
2.3 Saved Model and Reuse
Models can be saved after training is done and reused for later model training. Reusing model weights greatly reduces the time required to train an agent. We reuse a previously built model architecture and the vast majority of its learned weights, and then use standard training procedures to learn the remaining, non-reused parameters. When a Keras model is built using the functional API, additional models can also be assembled over any subset of the paths through the network by reusing the intermediate functions, and these sub-models can then be trained on just parts of the network (given that targets are available for their outputs). Such intermediate models can also be used to propagate activations between internal layers, and the Python package conx, which is built on top of Keras, will construct these intermediate models automatically.
2.4 Related Works
Action recognition has been studied for years. Early works focus on developing good hand-crafted features for representing actions, such as 3D SIFT [15] and dense trajectories [16]. The performance of these methods is often restrained by the limited discriminative capability of hand-crafted features. With the development of deep ConvNets, many ConvNet-based methods were recently proposed for action recognition, which utilize ConvNets to automatically obtain the feature representation for actions. Ji et al. [17] utilize a 3D ConvNet to recognize actions in video. Simonyan and Zisserman [18] propose a two-stream framework which uses two ConvNets to respectively extract features from two information streams (i.e., appearance and motion) and fuse them for recognition. Based on this framework, recent research further improves the effectiveness of ConvNet features by including additional information sources [19] or applying them to related tasks such as convolutional neural network-based robot navigation using uncalibrated spherical images [20]. Most of the existing works are targeted at learning features that directly describe individual action classes, while the shared characteristics across different action class granularities are less studied. This restrains them from precisely distinguishing the subtle differences among ambiguous actions. Although some methods [21] obtain different levels of generality by integrating features from multiple ConvNet layers, they still focus on directly representing the individual action classes and do not consider the shared characteristics across different action class granularities. Besides the derivation of proper features, other research focuses on the proper combination of multiple information streams to boost action recognition performance [22], [23], [24]. For example, Feichtenhofer et al. [22] introduce residual connections between information streams to remedy the deficiency of the late fusion strategy in the two-stream framework. Wu et al. [25] also improve the fusion efficiency of the two-stream framework by performing both sequence-level fusion and video-level fusion over the information streams. However, most of these works fuse stream-wise information that happens simultaneously, which has limitations in handling the longer-term asynchronous patterns among information streams; this asynchrony is a non-trivial factor which can bring noticeable performance gains for action recognition.

Our proposed work builds on three existing studies [18], [26] and [27], all of which use the UCF-101 dataset [10] for training and testing their corresponding models. A comparative study of these three models is given below.
2.4.1 BoVW and Fusion Method
As shown in Figure 2.1, the pipeline of the Bag of Visual Words (BoVW) framework [26] consists of five steps: feature extraction, feature pre-processing, codebook generation, feature encoding, and pooling and normalization. The global representation is then fed into a classifier such as a linear SVM for action recognition. The authors give detailed descriptions of the popular technical choices in each step, which are very important for constructing a recognition system. Furthermore, they summarize several techniques used in these encoding methods and provide a unified generative perspective over the different encoding methods. The paper aims to provide a comprehensive study of all steps in BoVW and different fusion methods, and to uncover good practices for producing an action recognition system. Specifically, they explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. They conclude that every step is crucial for contributing to the final recognition rate, and that an improper choice in one of the steps may counteract the performance improvement of other steps. Furthermore, based on their comprehensive study, they propose a simple yet effective representation, called the hybrid representation, by exploring the complementarity of different BoVW frameworks and local descriptors. Using this representation, they obtain an accuracy of 87.9% on the UCF101 dataset [10].

Figure 2.1: The pipeline of obtaining BoVW representation for action recognition.
2.4.2 Optical Flow Sequence Method using AlexNet and VGG-16
They propose a three-stream CNN [18] setup for action recognition. This architecture is an extension of the popular two-stream model that takes as input individual RGB frames in one stream and a small stack of optical flow frames in the other. One shortcoming of that model is that it cannot see long-range action evolution, for which they propose to use their dynamic flow images. Their overall framework is illustrated in Figure 2.2. To be precise, for the dynamic flow stream, they generate multiple dynamic flow images for each video sequence. In order to achieve this, they first split the input flow video into several sub-sequences of fixed length, generated at a fixed temporal stride. For each sub-sequence, they construct a dynamic flow image using the optical flow images in that window. They associate the same ground truth action label with all the sub-sequences, thus effectively increasing the number of training videos. They use a separate CNN stream on the dynamic flow images. Given that action recognition datasets are usually tiny in comparison to image datasets (such as ImageNet), increasing the training set is usually necessary for effective training of the network. They use the TV-L1 optical flow algorithm to generate the flow images using its OpenCV implementation. For training, they use two successful CNN architectures, namely AlexNet and VGG-16, implemented with the Caffe toolbox [30]. As the number of training videos is substantially too limited to train a standard deep network from scratch, they decided to fine-tune the networks from models pre-trained for image recognition tasks. Using the AlexNet CNN architecture on the UCF101 dataset [10], the accuracy was 88.63%.

Figure 2.2: Architecture of the optical flow sequence method with a three-stream CNN.
2.4.3 Multi-View Super Vector Method
Partly inspired by the Gaussian mixture model (GMM) based Fisher Vector representation [28] and the Factorized Orthogonal Latent Spaces (FOLS) approach [29] for multi-view learning, this paper proposes a Mixture model of Probabilistic Canonical Correlation Analyzers (M-PCCA) and utilizes it to jointly encode multiple types of descriptors for video representation. The motivation is to factorize the joint space of a descriptor pair into a shared component and mutually independent private components, so that each component has strong inner dependency while different components are as independent as possible. They then apply kernel averaging on these components. In this way, they make the most of different local descriptors to improve recognition accuracy. They first derive an EM algorithm for learning M-PCCA. Each video is encoded based on this M-PCCA via a latent space and gradient embedding. The resulting video representation consists of two components: one is the latent factors, which encode information shared by different feature descriptors; the other is the gradient vector, which encodes information specific to each type of descriptor. Interestingly, the mathematical formulations of the two components turn out to be counterparts of the FV and VLAD representations, respectively. They revisit Canonical Correlation Analysis (CCA), propose the mixture model of canonical correlation analyzers and its corresponding learning algorithm, and present their video representation based on M-PCCA. After that they derive the MVSV [27] representation from M-PCCA, present an interpretation of the representation and compare it to previous coding methods. The performance of the method is experimentally examined on the UCF101 dataset, with an accuracy of 83.5%.

Figure 2.3: A graphical interpretation of M-PCCA

2.5 Scope of the Problems
The main scope of this thesis is as follows:
1. Develop and train an agent which can predict human action from the 101 human action classes of the UCF-101 dataset [10].
2. Save the best weights from each iteration for future use, which can in turn reduce the time required for prediction.
2.6 Challenges
Throughout the work we have faced several challenges. The most prominent challenges are stated below:
1. Overfitting
The main challenge of this thesis was to reduce overfitting throughout the training epochs. Sometimes the agent performed very poorly on unseen data. To reduce this type of problem we followed several steps, such as cross-validation, increasing the training data volume, reducing features, regularization, ensembling and early stopping.
2. Selection of Activation Function
We tried several activation functions like Binary Step, Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax etc. for the pullout layer of our CNN model. Though each activation function has its strong points, ReLU works best for the ResNet-50 architecture and gives the best weight distribution.
3. Distribution of Tensors
We used an Nvidia Quadro K200m GPU for our work, which is a very low-power GPU with only 1.67 GB of video memory. This low memory was very frustrating when assigning tensors, as it ran out of memory all the time. To overcome this problem, we used a fusion of both GPU and CPU when assigning tensors, and we needed to activate multiple CPU workers at a time.
4. Choosing the Best Weights
In each epoch the model generates several sets of trained weights. Choosing the best weights used to be a bit challenging; nowadays several frameworks can choose the best weights from the iterations in each epoch, and Keras worked best in this type of scenario to the best of our knowledge.
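The best-weight selection described above can be reduced to a one-line comparison; in Keras itself the `ModelCheckpoint` callback with `save_best_only=True` performs equivalent bookkeeping during training. The checkpoint records below are hypothetical, for illustration only.

```python
def best_checkpoint(records):
    """Pick the checkpoint file with the highest validation accuracy.

    `records` is a list of (filename, val_acc) pairs, one per saved
    iteration -- the same decision Keras's ModelCheckpoint callback
    makes when monitoring val_acc with save_best_only=True.
    """
    return max(records, key=lambda r: r[1])[0]

# Hypothetical weight files saved over three epochs:
epochs = [("weights-01.hdf5", 0.71), ("weights-02.hdf5", 0.89),
          ("weights-03.hdf5", 0.85)]
print(best_checkpoint(epochs))  # weights-02.hdf5
```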
CHAPTER 3
REQUIREMENT SPECIFICATION
3.1 Introduction to Dataset
UCF101 is currently one of the largest datasets of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered backgrounds. To the best of our knowledge, UCF101 is currently one of the most challenging datasets of actions due to its large number of classes, large number of clips and the unconstrained nature of the clips. The majority of existing action recognition datasets suffer from two disadvantages: 1) the number of their classes is typically very low compared to the richness of actions performed by humans in reality, e.g. the KTH [31], Weizmann [32], UCF Sports [33] and IXMAS [34] datasets include only 6, 9, 9 and 11 classes respectively; 2) the videos are recorded in unrealistically controlled environments. For instance, KTH, Weizmann and IXMAS are staged by actors, while HOHA [35] and UCF Sports are composed of movie clips captured by professional filming crews. Recently, web videos have been used in order to utilize unconstrained user-uploaded data to alleviate the second issue. However, the first disadvantage remained unresolved, as the largest existing dataset did not include more than 51 actions, while several works showed that the number of classes plays a crucial role in evaluating an action recognition method. Therefore, the creators of UCF101 compiled a new dataset with 101 actions and 13320 clips, nearly twice as big as the largest previously existing dataset in terms of number of actions and clips. (HMDB51 [36] and UCF50 [37] were the largest ones, with 6766 clips of 51 actions and 6681 clips of 50 actions respectively.) The dataset is composed of web videos which are recorded in unconstrained environments and typically include camera motion, various lighting conditions, partial occlusion, low-quality frames, etc. Figure 3.1 shows sample frames of 6 action classes from UCF101. The clips of one action class are divided into 25 groups which contain 4-7 clips each. The clips in one group share some common features, such as the background or the actors. The videos were downloaded from YouTube and the irrelevant ones manually removed. All clips have a fixed frame rate and resolution of 25 FPS and 320 × 240 respectively, and are saved as .avi files compressed using the DivX codec available in the K-Lite package.
Figure 3.1: Sample frames for 6 action classes of UCF101 [10].

Table 1. Summary of Major Action Recognition Datasets [10]
3.2 Workflow of the Proposed Method
We have used a Convolutional Neural Network (CNN) based on the ResNet-50 architecture. A CNN has several modules, such as the input layer, convolutional layers, pooling layers, ReLU and fully connected layers; we have used the ResNet-50 variant of this design.
Figure 3.2: Proposed Method Workflow (Start → Dataset → Frame Creator → Resize Module → Create test/train folders → Load ResNet-50 → Split Test: Validation (60:10) → Create model checkpoint → Save model → Load model with best weights → Test agent accuracy on unknown test data → Plot predicted result)
After reshaping the images, we have converted them into NumPy arrays stored as compressed NPZ files to reduce the huge time required to train on high-resolution images. This step adds minor execution time but reduces overall system execution time. We feed the NPZ arrays to the input layer, but before that we need to load an empty ResNet-50 model. From the dataset we have created test and train folders and placed images in them at random. The train folder is used to train the model, and the test folder serves as the source of unknown images. We have trained the ResNet-50 with the train dataset, which contains 60% of the total images, and each training iteration is validated against 10% of the total images, known as the validation set. After retraining the ResNet-50 with the current train dataset, we save the best weights from each iteration and save the model state. The saved state can later be used to resume training where it left off. We have chosen the best weight from almost 1000 weights to predict the unseen data from the test dataset, which contains the remaining 30% of the total images, unseen by the model. After the prediction we plot the best probabilistic prediction on the unseen image. The whole workflow can be visualized in Figure 3.2.
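The 60:10:30 split described above can be sketched with the standard library; the function name, fraction defaults and fixed seed are illustrative choices, not the thesis's actual code.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.1, seed=42):
    """Shuffle items and split them into train/validation/test subsets.

    Mirrors the 60:10:30 split used in the workflow; whatever remains
    after the train and validation slices becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 10 30
```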
3.3 Workflow Graph of the Proposed Method
The UCF101 dataset [10] contains 101 action classes as video clips, from which we create frames using OpenCV. Every video clip is subdivided at ten-second intervals, i.e., we get one image per ten seconds of video. RGB color channel allocation is also done in this layer; in our case we have taken 3 channels. The feature learning process has three parts. The convolution layer extracts the high-level features of each input image. After we have extracted high-level features from the input images, we apply ReLU (Rectified Linear Unit) immediately after each convolution layer; the purpose of this layer is to introduce non-linearity into a system that has essentially just been computing linear operations in the conv layers (element-wise multiplications and summations). After the ReLU we apply max pooling, which chooses the strongest of the features extracted by the convolution layer. Max pooling gives us the best features, which are multidimensional arrays. As our fully connected layers only learn on one-dimensional arrays, we need to flatten the multidimensional array before feeding it to the fully connected layers. The fully connected layers learn on the flattened inputs by applying back-propagation. For back-propagation and optimization we have used the ADAM optimizer rather than Stochastic Gradient Descent (SGD); ADAM is much better optimized and reduces training time. The fully connected layers output an N-dimensional vector, where N is the number of classes the program has to choose from; each number in this N-dimensional vector represents the probability of a certain class.
Figure 3.3: System Workflow Graph
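The convolution → ReLU → max-pooling sequence described above can be illustrated with NumPy. The feature-map values below are made up for demonstration; a real ResNet-50 applies learned convolution kernels before these steps.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: keep positive activations, zero out negatives."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map."""
    h, w = x.shape
    # Trim odd edges, group into 2x2 blocks, and take each block's maximum.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A hypothetical 4x4 feature map produced by a convolution layer:
fmap = np.array([[1., -2., 3., 0.],
                 [-1., 5., -3., 2.],
                 [0., 1., 2., -4.],
                 [3., -1., 0., 6.]])
pooled = max_pool_2x2(relu(fmap))
print(pooled)  # [[5. 3.]
               #  [3. 6.]]
```

Flattening `pooled` with `pooled.ravel()` then yields the one-dimensional array the fully connected layers consume.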
Through our thesis we have pursued a classification problem rather than localization or detection, so the last layer of the model is the Softmax function. Softmax assigns a decimal probability to each class in the multi-dimensional feature array, and the total probability assigned across the classes must be 1. Softmax also helps model training converge more quickly; training would take much longer without it.
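The Softmax behavior just described can be verified numerically; the class scores below are made-up inputs, not outputs of the actual model.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1.

    Subtracting the maximum score first is a standard numerical-stability
    trick; it does not change the resulting probabilities.
    """
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [2.0, 1.0, 0.1]      # hypothetical scores for three classes
probs = softmax(scores)
print(round(probs.sum(), 6))  # 1.0 -- probabilities always sum to one
print(probs.argmax())         # 0 -- the class with the largest raw score
```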
3.4 Prediction Flow of the Proposed Method
We have saved the best weight of each iteration, and after the model has finished learning from all the images in the train dataset it has generated more than 100 weight files. We use the best of these weights to predict the class an unknown image belongs to. First the model is loaded with the best weight. We then feed this model a new unknown image from the test dataset, and the trained model predicts the class of the unknown image. After the probabilistic prediction we plot the result with the help of the image plotter. The CNN model compares each prediction with the ground truth, and the image plotter prints the highest probability and the next best match on the unseen image.
Figure 3.4: Prediction Workflow
(Workflow: Initialize Model → Load Best Weight → Predict the Class → Compare the Prediction with Ground Truth → Plot the Best Probability)
3.5 Implementation Requirements
We have used Python as the programming language along with several frameworks. The frameworks used throughout the thesis are listed below.
• Keras
Keras is an open-source neural network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, or Theano. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer.
In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library. Chollet explained that Keras was conceived to be an interface rather than a standalone machine-learning framework. It offers a higher-level, more intuitive set of abstractions that make it straightforward to develop deep learning models regardless of the computational backend used. Microsoft likewise added a CNTK backend to Keras, available as of CNTK v2.0.
• TensorFlow
TensorFlow is an open-source software library for numerical computation using data flow graphs. The graph nodes represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture allows computation to be deployed to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code. TensorFlow also includes TensorBoard, a data visualization toolkit.
TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural network research. The system is general enough to be applicable in a wide variety of other domains as well.
TensorFlow provides stable Python and C APIs, as well as APIs without backwards-compatibility guarantees for C++, Go, Java, JavaScript and Swift.
• Scikit-learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on February 1st, 2010. Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012. As of 2018, scikit-learn is in active development.
• OpenCV
OpenCV (Open Source Computer Vision) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage, then Itseez (which was later acquired by Intel). The library is cross-platform and free to use under the open-source BSD license. OpenCV supports the deep learning frameworks TensorFlow, Torch/PyTorch and Caffe.
Officially launched in 1999, the OpenCV project was initially an Intel research initiative to advance CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display walls. The main contributors to the project included a number of optimization experts in Intel Russia, as well as Intel's Performance Library Team.
CHAPTER 4
PROPOSED WORK
4.1 Introduction to CNN
Deep learning explores the possibility of learning features directly from input images, avoiding hand-crafted models. The key concept of deep learning is to explore multiple levels of representation, so that higher-level features represent an abstract view of the images. Convolutional Neural Networks (CNNs) are now used everywhere. A CNN is constructed of multiple convolutional layers stacked on top of each other, followed by a supervised deep net known as the fully connected layer; sets of feature maps represent both the input and output of each convolutional layer. The input may be an image, audio, or video. In our case we use color images, so at the input layer each feature map is a two-dimensional array storing one RGB channel of the input image. The output of each layer consists of a set of arrays, where each feature map represents a particular feature extracted at a particular input layer. A deep net is trained by feeding it input and letting it compute layer by layer to generate the final output for comparison with the correct answer. The ADAM optimizer works as the weight updater in each iteration, and errors are back-propagated through the net. At each backward step, the model parameters are tuned in a direction that tries to reduce the error. This process increases model accuracy as learning progresses. Generally, training is done by feeding the model the training dataset again and again in an iterative fashion until the model converges.
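The ADAM update referred to above can be written out in a few lines of NumPy. This is a minimal sketch with the standard default hyper-parameters (β₁ = 0.9, β₂ = 0.999), shown here minimizing a toy quadratic loss rather than a real network:

```python
import numpy as np

def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimal ADAM loop: adaptive per-parameter steps from gradient moments."""
    w = float(w0)
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # biased first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g      # biased second moment (variance)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy loss L(w) = (w - 3)^2, so grad L = 2(w - 3); the minimum is at w = 3.
w_star = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
```

The per-parameter scaling by the second-moment estimate is what makes ADAM less sensitive to the learning rate than plain SGD.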
Figure 4.1: A CNN architecture
4.2 ResNet-50 Architecture
Deep convolutional neural networks [4] have led to a series of breakthroughs for image classification. Deep networks naturally integrate low/mid/high level features and classifiers in an end-to-end multilayer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hampers convergence from the beginning. This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation. When deeper networks are able to start converging, a degradation problem is exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting; adding more layers to a suitably deep model leads to higher training error, as thoroughly verified by experiments. The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that current solvers are unable to find solutions that are comparably good or better than the constructed solution. This degradation problem can be addressed by introducing a deep residual learning framework.
Figure 4.2: Residual Learning Building Block
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expecting stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) = H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different. A typical ResNet-50 has the following components:
• Input layer: Input images are fed to this layer and its output is fed to the convolution layers. Reshaping and feature scaling are done at this stage to prevent dimensionality errors in the model. The dataset contains images of different shapes, so reshaping is a must to avoid dimensionality errors. We have reshaped all the images to (64*64) pixels. As all the images are colored, we have chosen the channel size to be 3 (RGB channels).
• Convolutional layers: Convolution layers convolve the input images into a multi-dimensional feature map with a set of learnable filters. We have used three convolutional layers. The kernels or filters are of size 5 × 5, the padding is set to 2, and the stride is set so that, with SAME padding, the output keeps the same spatial size as the input. The first two convolutional layers learn 32 filters each and are initialized with Gaussian distributions with standard deviations of 0.0001 and 0.01 for layers 1 and 2, respectively. The last layer learns 64 filters and is initialized with a Gaussian distribution with a standard deviation of 0.0001.
• Pooling layers: Pooling layers are responsible for down-sampling the spatial dimension of the input. After each convolutional layer there is a pooling layer. We have used a 2*2 kernel for each pooling layer with a stride of size 2. The first pooling layer applies max pooling over the generated feature map, and the last two perform average pooling.
• ReLU layers: We have tested different activation functions such as Tanh, Sigmoid, ReLU, Leaky ReLU, and binary step. From our findings we have seen that ReLU works best on the action images. After each pooling layer there is a ReLU layer. For an input value x, ReLU computes the neuron's output f(x) as x if x > 0 and (α × x) if x <= 0. α specifies whether to keep a scaled-down negative part (multiplying it by a small slope value such as 0.01) rather than setting it to 0. The default value of α is 0; if α is not set, the layer works as a standard ReLU function f(x) = max(0, x), in other words, the activation is simply thresholded at zero.
Figure 4.3: ReLU
• Fully connected layers: Fully connected layers are in essence a densely connected deep neural network that takes a multi-dimensional feature map as input and produces a one-dimensional feature vector as output. We have used three fully connected layers, as this gives the best accuracy and optimum weight distribution. After the fully connected layers there is a probabilistic Softmax activation function. Softmax is a probabilistic function that gives the best match depending on the number of classes in the classification problem; in our case, one of the action classes.
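The piecewise ReLU/leaky-ReLU definition given in the ReLU layer description above (f(x) = x for x > 0, α·x otherwise) translates directly to NumPy; a minimal sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.0):
    """f(x) = x for x > 0, alpha * x otherwise; alpha = 0 gives standard ReLU."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

# Standard ReLU (alpha = 0): negatives are clipped to zero.
#   leaky_relu([-2.0, 2.0])             -> [0.0, 2.0]
# Leaky ReLU with the typical small slope 0.01: negatives are scaled, not dropped.
#   leaky_relu([-2.0, 2.0], alpha=0.01) -> [-0.02, 2.0]
```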
Figure 4.4: ResNet-50 CNN architecture
If we compare the error rate of ResNet with other CNNs as well as shallow methods on ImageNet classification, we can see that its error rate is much lower than that of the other classifiers. This is also one of the best aspects of the ResNet model.
Figure 4.5: ImageNet Benchmark
4.3 Classification
To classify an image, we combine the patch results for the whole image. We divide the input images into patches because our model was trained on patches of images. The newly created patches are run through the model and the results are combined for the classification. We extract grid patches, which offer a good balance between classification quality and computational cost. Running the model with the best weights on the patches outputs the most probable match among the classes. We use the Sum rule to combine the patch results, which gives the best result in comparison to other fusion rules.
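The Sum rule mentioned above simply adds the per-patch probability vectors before taking the arg-max; a sketch with made-up patch outputs (the three patches and three classes are illustrative, not real model results):

```python
import numpy as np

def fuse_patches_sum_rule(patch_probs):
    """Combine per-patch class probabilities by summing, then pick the arg-max."""
    total = np.sum(patch_probs, axis=0)      # element-wise sum over patches
    return int(np.argmax(total)), total

# Three hypothetical patches scored over three classes:
patch_probs = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.2, 0.2],
])
winner, summed = fuse_patches_sum_rule(patch_probs)
# Class 0 wins: summed = [1.3, 1.0, 0.7]
```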
4.4 Developed Work
The following figures illustrate the different aspects of our developed work.
• Converting Dataset into NPZ Arrays
We have used NumPy to convert images into NPZ arrays. Converting the images into pixel arrays reduces computation time significantly. The dataset directory is passed on the command line, and the underlying Python script converts all the images into NPZ arrays.
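The conversion step can be sketched with NumPy's `savez`/`load` round trip. The tiny random "images" below are stand-ins for real dataset frames; only the 64 × 64 × 3 shape matches this work.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for two 64 x 64 RGB frames (the input shape used in this work).
images = rng.integers(0, 256, size=(2, 64, 64, 3), dtype=np.uint8)
labels = np.array([0, 1])

path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez(path, images=images, labels=labels)     # a single .npz archive on disk

loaded = np.load(path)
# The arrays round-trip exactly, ready to be fed to the network.
assert (loaded["images"] == images).all() and (loaded["labels"] == labels).all()
```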
Figure 4.6: Converting Images into NPZ Arrays
• Train ResNet-50
After converting the dataset into NPZ arrays we feed it to the ResNet-50. We first have to define some parameters: the maximum number of training examples we want to sample from the images and the size of the patches sampled from our images. Next we need to define the directories where the training data is located. We also need to specify the number of features that we have annotated. Then we get a list of all the files in the training directories and initialize some variables. ResNet-50 trains on the training dataset and saves the best weights in the weight directory for each iteration. While the model trains on a large volume of data, we need to save the model state to avoid losing progress to any interruption.
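Saving the best weights per iteration is typically done in Keras with a `ModelCheckpoint(save_best_only=True)` callback; the framework-agnostic logic is just "keep the weights from the epoch with the lowest validation loss so far". A minimal sketch of that logic, with made-up loss values:

```python
def track_best(val_losses):
    """Return the epoch index whose weights would be kept (lowest val loss so far)."""
    best_epoch, best_loss = -1, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:                 # improvement -> save/overwrite weights
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss

# Hypothetical validation losses over six epochs:
best_epoch, best_loss = track_best([1.9, 1.2, 1.4, 0.9, 1.1, 1.0])
# Epoch 3 (loss 0.9) is the checkpoint that survives.
```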
Figure 4.7: Training ResNet-50
• Prediction and Plotting
After the training is done, we can use the saved model weights to predict the human action type from the unknown test dataset. Our model predicts the outcome and plots the best-match probability onto the unknown input image.
Figure 4.8: Predicting the Human Action Class
Figure 4.9: Plotted Model Outcome
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Result Graphs
In our proposed method we have used a CNN with ResNet-50 instead of a hand-crafted model. Hand-crafted models give rise to unwanted complexity and may have inferior outcomes compared to already existing benchmarked ones.
Figure 5.1 shows the validation accuracy of the proposed agent.
Figure 5.1: Validation accuracy of the agent
The validation loss of our proposed architecture is shown in Figure 5.2.
Figure 5.2: Validation loss of the agent
The overall accuracy of our proposed work on the UCF101 dataset using the ResNet-50 deep CNN over 60 epochs is shown in Figure 5.3.
Figure 5.3: Overall accuracy of the agent
The gradual reduction of the loss as the agent learns from epoch one through sixty is shown in Figure 5.4.
Figure 5.4: Gradual reduction of loss
At first the agent learns at a steady pace, starting with a high cost at the beginning of the learning. The cost function of the agent can be visualized in Figure 5.5.
Figure 5.5: Cost function summary
The recognition rate of the agent is computed at the image level, thus providing a means to estimate solely the image classification accuracy of the CNN models. Let I_all be the number of action class images in the test set. If the system correctly classifies I_clsfy action images, then the recognition rate at the image level is:
Image Recognition Rate = (I_clsfy / I_all) × 100
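As a worked example with hypothetical counts (not the actual test-set size of this work): if, say, 9,225 of 10,000 test images are classified correctly, the formula gives a recognition rate of 92.25%.

```python
def image_recognition_rate(i_clsfy, i_all):
    """Recognition rate at the image level, as a percentage."""
    return i_clsfy / i_all * 100

# Hypothetical counts: 9,225 correctly classified out of 10,000 test images.
rate = image_recognition_rate(9225, 10000)   # -> 92.25
```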
5.2 Comparison with Previous Work
Before our proposed work, we examined some previous works that also use the UCF101 dataset for their experiments.
First, we discussed the Bag of Visual Words and Fusion (BoVW) method for human action recognition, which achieves an accuracy of 87.9%. The main problem of this method is its reliance on a separate feature extraction step. The general idea of bag of visual words (BoVW) is to represent an image as a set of features, consisting of keypoints and descriptors. In this experiment, they use space-time interest point (STIP) and Improved Dense Trajectories (iDT) local features. For detecting features and extracting a descriptor from each image in the dataset they use the SIFT feature extraction algorithm. Such a feature extraction algorithm can lose a lot of the spatial interaction between pixels, whereas a CNN has an automatic feature extractor: CNNs include multilayer processing and subsampling layers that give better performance.
Second, we considered the Multi-View Super Vector (MVSV) method, which uses four views and methods and combines them to get an optimized result, achieving an accuracy of 83.5%.
And finally, we discussed another deep CNN approach based on the AlexNet and VGG-16 architectures, which achieve accuracies of 88.63% and 61.70%, respectively. In these two architectures, when deeper networks start converging, a degradation problem is exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly. This is because these two architectures use plain blocks to build deeper layers by stacking more layers onto the shallow ones. So in the worst-case scenario, the deeper model's early layers can be replaced with the shallow network and the remaining layers can just act as an identity function (input equal to output).
Figure 5.6: Comparison between ResNet and other CNNs
In the rewarding scenario, the additional layers in the deeper network approximate the mapping better than their shallower counterpart and reduce the error by a significant margin. In the worst-case scenario, both the shallow network and its deeper variant should give the same accuracy; in the rewarding scenario, the deeper model should give better accuracy than its shallower counterpart. But experiments with present solvers reveal that deeper models do not perform well, so simply using deeper networks degrades the performance of the model. Our approach tries to solve this problem using a deep residual learning framework. Instead of learning a direct mapping from x to y with a function H(x) (a few stacked non-linear layers), this approach defines the residual function as F(x) = H(x) − x, which can be reframed as H(x) = F(x) + x, where F(x) and x represent the stacked non-linear layers and the identity function (input = output), respectively. According to the residual hypothesis, it is easier to optimize the residual mapping function F(x) than to optimize the original, unreferenced mapping H(x).
(Diagram: a Plain Block of stacked neural network layers computes Y = f(x); a Residual Block adds a skip connection, so Y = f(x) + x.)
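The residual computation H(x) = F(x) + x can be made concrete in NumPy. This is a minimal sketch of one block: the two-layer F stands in for a block's stacked weight layers, and the random weights are illustrative, not from the trained model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two weight layers with a ReLU between them."""
    f = relu(x @ w1) @ w2        # F(x): the residual branch
    return relu(f + x)           # add the identity shortcut, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # batch of 4 feature vectors, width 8
w1, w2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)                # same shape as x: (4, 8)

# With all-zero weights F(x) = 0, so the block reduces to the identity (plus ReLU),
# which is exactly why extra residual layers cannot hurt training the way plain
# stacked layers can:
zeros = np.zeros((8, 8))
assert np.allclose(residual_block(x, zeros, zeros), relu(x))
```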
Table 2. Comparison with Previous Work

Method                 Accuracy (%)
BoVW                   87.9
MVSV                   83.5
OFS (VGG-16)           61.70
OFS (AlexNet)          88.63
Proposed ResNet-50     92.25
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 Conclusion
Video classification to recognize human actions can be performed in various ways, but the question is how each affects the overall performance. In our work, we first used OpenCV to create frame images from every ten seconds of video, obtaining many frame images for each action class. After that we trained our agent using ResNet-50. Our work has shown that existing models, particularly ResNet-50, perform better than hand-crafted models when classifying color images and objects with diverse image patterns.
In our work we have used a sliding window mechanism that allows us to deal with high-resolution textured images and also performs as expected on low-resolution images without changing the whole architecture. Experimental results have shown that our proposed method works better than existing classifiers.
6.2 Future Work
Future work can explore different CNN architectures to get better performance on this dataset, or apply this architecture to other, more recent datasets such as the Kinetics dataset with more than 400 action classes. Additionally, exploring other pre-trained models such as Inception V3 and GoogLeNet would be a great option for achieving better accuracy.
REFERENCES
[1] Huang Y, Yang H, Huang P. Action recognition using hog feature in different resolution video sequences[C]//Computer Distributed Control and Intelligent Environmental Monitoring (CDCIEM), 2012 International Conference on. IEEE, 2012: 85-88.
[2] Sadanand S, Corso J. Action bank: A high-level representation of activity in video. In IEEE, 2012: 1234-1241.
[3] Wang H, Kläser A, Schmid C. et al. Dense trajectories and motion boundary descriptors for action recognition [J]. International Journal of Computer Vision, 2013, 103(1): 60-79.
[4] LeCun Y, Bottou L, Bengio Y. et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[5] Ji S, Xu W, Yang M. et al. 3D convolutional neural networks for human action recognition [J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013, 35(1): 221-231.
[6] Karpathy A, Toderici G, Shetty S. et al. Large-scale video classification with convolutional neural networks[C]//Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014: 1725-1732.
[7] Ijjina E P, Mohan C K. Human Action Recognition Based on Recognition of Linear Patterns in Action Bank Features Using Convolutional Neural Networks[C]//Machine Learning and Applications (ICMLA), 2014 13th International Conference on. IEEE, 2014: 178-182.
[8] Jung M, Hwang J, Tani J. Multiple spatio-temporal scales neural network for contextual visual recognition of human actions[C]//Development and Learning and Epigenetic Robotics (ICDL-Epirob), 2014 Joint IEEE International Conferences on. IEEE, 2014: 235-241.
[9] Zhang N, Paluri M, Ranzato M. A. et al. Panda: Pose aligned networks for deep attribute modeling[C]//Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[10] K. Soomro, A. R. Zamir and M. Shah, UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01, November, 2012.
[11] Aggarwal J, Ryoo M. Human activity analysis: A survey. ACM Comput Surv 2011; 43:1-43.
[12] Poppe R. A survey on vision-based human action recognition. Image Vis Comput 2010; 28:976-90.
[13] Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Comput Vis Image Underst 2011; 115:224-41.
[14] Turaga P, Chellappa R, Subrahmanian VS, Udrea O. Machine recognition of human activities: A survey. IEEE Trans Circuits Syst Video Technol 2008; 18:1473-88.
[15] Scovanner, P., Ali, S. and Shah, M. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM.
[16] Wang, H., Kläser, A., Schmid, C. and Liu, C. 2013. Dense trajectories and motion boundary descriptors for action recognition. Intl. Journal Comp. Vision 103(1):60–79.
[17] Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE T-PAMI 35(1) (2013) 221-231.
[18] Jue Wang, Anoop Cherian, Fatih Porikli: Ordered Pooling of Optical Flow Sequences for Action Recognition. In WACV, 2017.
[19] Song, S., Lan, C., Xing, J., Zeng, W. and Liu, J. 2017. An end-to-end spatial-temporal attention model for human action recognition from skeleton data. In AAAI.
[20] Ran, L., Zhang, Y., Zhang, Q., Yang, T.: Convolutional neural network-based robot navigation using uncalibrated spherical images. Sensors 17(6) (2017).
[21] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015) 3431-3440.
[22] Feichtenhofer C., Pinz A., and Wildes R. 2016. Spatiotemporal residual networks for video action recognition. In NIPS.
[23] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV. (2013) 3551-3558.
[24] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In ECCV. (2016) 20-36.
[25] Wu Z., Wang X., Jiang Y., Ye H., and Xue X. 2015. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM MM.
[26] Xiaojiang Peng, Limin Wang, Xingxing Wang and Yu Qiao. 2014. Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice. arXiv preprint, 2014.
[27] Zhuowei Cai, Limin Wang, Xiaojiang Peng, Yu Qiao. Multi-View Super Vector for Action Recognition. In IEEE, 2014.
[28] F. Perronnin and C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[29] M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In AISTATS, 2010.
[30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Intl. Conf. on Multimedia. ACM, 2014.
[31] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach, 2004. International Conference on Pattern Recognition (ICPR).
[32] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes, 2005. International Conference on Computer Vision (ICCV).
[33] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition, 2008. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars, 2007. International Conference on Computer Vision (ICCV).
[35] M. Marszałek, I. Laptev, and C. Schmid. Actions in context, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition, 2011. International Conference on Computer Vision (ICCV).
[37] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos, 2012. Machine Vision and Applications Journal (MVAP).