Deep Residual CNN Based Model for Human Activity Recognition System
By
Saifuddin Mohammad Tareque ID: 173-25-630
This Report Presented in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science and Engineering.
Supervised By
Md Zahid Hasan
Assistant Professor & Coordinator of MIS, Department of CSE
Daffodil International University
DAFFODIL INTERNATIONAL UNIVERSITY DHAKA, BANGLADESH
May 2019
ACKNOWLEDGEMENT
First of all, my heartiest thanks and gratefulness to Almighty Allah for His divine blessing, which has made it possible to complete this thesis successfully.
I would like to thank my honorable teacher and project supervisor Md Zahid Hasan, Assistant Professor, Department of CSE, Daffodil International University. His endless patience, scholarly guidance, continual encouragement, constant and energetic supervision, constructive criticism, valuable advice, and careful reading and correction of many inferior drafts at every stage have made it possible to complete this project.
We would like to express our heartiest gratitude to Dr. Syed Akhter Hossain, Head, Department of CSE, for his kind help to finish our project and we are also thankful to all the other faculty and staff members of our department for their co-operation and help.
We must acknowledge with due respect the constant support and patience of our parents.
Finally, we would like to thank all of our course mates at Daffodil International University, who took part in discussions while completing the course work.
ABSTRACT
Human Action Recognition (HAR) is a significant application realm in computer vision, but high-precision recognition of human action against complex backgrounds is still an open question. Recently, deep learning approaches have been used widely to enhance recognition accuracy in different application areas. In our research, a deep Convolutional Neural Network (CNN) based on the ResNet-50 model is proposed as the classifier for HAR because it holds a clear advantage over other classifiers. Our proposed research work uses the publicly accessible UCF-101 dataset, which provides the largest diversity in the HAR field, as most of the available action recognition datasets are not realistic. Additionally, the UCF-101 dataset intends to support further research into action recognition by learning and surveying new pragmatic action categories.
TABLE OF CONTENTS
CONTENTS PAGE
Approval I
Declaration II
Acknowledgements III
Abstract IV
CHAPTERS
CHAPTER 1: INTRODUCTION 01-03
1.1 Introduction 01
1.2 Objectives 02
1.3 Motivation 03
1.4 Expected Outcome 03
1.5 Report Layout 03
CHAPTER 2: RELATED WORKS 04-10
2.1 Introduction 04
2.2 Automated Agent Scenario 05
2.3 Save Model and Reuse 05
2.4 Related Works 05
2.5 Scope of the Problems 09
2.6 Challenges 10
CHAPTER 3: REQUIREMENT SPECIFICATION 11-18
3.1 Introduction to Dataset 11
3.2 Workflow of the Proposed Method 12
3.3 Workflow Graph 14
3.4 Prediction Flow Process 16
3.5 Implementation Requirements 17
CHAPTER 4: PROPOSED METHODOLOGY 19-26
4.1 Introduction to CNN 19
4.2 ResNet Architecture 20
4.3 Classification Process 24
4.4 Developed Work 24
CHAPTER 5: EXPERIMENTAL RESULT 27-31
5.1 Result Graphs 27
5.2 Experimental Result 29
CHAPTER 6: CONCLUSION & FUTURE SCOPE 32
6.1 Conclusion 32
6.2 Future Scope 32
REFERENCES 33
LIST OF FIGURES
FIGURES PAGE NO
Figure 2.1: BoVWs Representation for Action Recognition 07
Figure 2.2: Optical Flow Sequence 08
Figure 2.3: Graphical Interpretation of M-PCCA 09
Figure 3.1: Action Classes of UCF101 12
Figure 3.2: Proposed Method Workflow 13
Figure 3.3: Flow Graph 15
Figure 3.4: Prediction Workflow 16
Figure 4.1: CNN Architecture 19
Figure 4.2: Residual Learning Building Block 21
Figure 4.3: ReLU 22
Figure 4.4: ResNet Architecture 23
Figure 4.5: ImageNet Benchmark 23
Figure 4.6: Image Conversion 24
Figure 4.7: Training ResNet-50 25
Figure 4.8: Prediction 26
Figure 4.9: Prediction Plotting 26
Figure 5.1: Validation Accuracy 27
Figure 5.2: Validation Loss 27
Figure 5.3: Overall Accuracy 28
Figure 5.4: Overall Loss 28
Figure 5.5: Cost Function Summary 29
Figure 5.6: Comparison with Previous Work 30
LIST OF TABLES
TABLE PAGE NO
Table-1 Summary of Major Action Recognition Datasets 12
Table-2 Comparison with Previous Work 31
CHAPTER 1
INTRODUCTION
1.1 Introduction
In the last decade, human action recognition (HAR) has become an increasingly attractive research topic with several applications, such as video surveillance, virtual reality, and intelligent human-computer interaction. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations.
HAR consists of several stages, which describe the features that define activities or low-level actions. A generic description of human action recognition from image sequences consists of two steps: 1) extract complex handcrafted features from raw input video frames, and 2) build a classifier based on these features. Some of the commonly used features for human action recognition are Histogram of Oriented Gradients (HOG) [1], Histogram of Optical Flow (HOF), Motion Interchange Patterns (MIP), Space-Time Interest Points (STIP), action bank features [2] and dense trajectories [3]. However, it is difficult and time-consuming to extend these features to other systems. A large part of hand-designed features are driven by the task, and different tasks may use completely different features. In reality, it is hard to know what kind of feature is important to a specific task, so feature selection is highly dependent on the specific problem. Especially for human action recognition, different kinds of sports show very big differences in appearance and motion models, and it is hard to capture the essential features of an action under drastic environmental change. Therefore, a generic feature extraction method needs to be proposed to alleviate the need for hand-engineered features and reduce the calculation scale.
CNN [4] is a deep model that obtains complicated hierarchical features via convolutional operations alternating with sub-sampling operations on the raw input images. It has been confirmed that CNNs can achieve excellent performance in visual target recognition tasks through appropriate adjustment during training, and CNNs are invariant to particular poses, illumination, and disorderly environmental change. The first attempt at HAR using CNN was by [5], who developed a novel 3D CNN model that extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generated multiple channels of information from the input frames, and the final feature representation was obtained by combining information from all channels. [7] proposed a deep convolutional network architecture for recognizing human actions in videos using action bank features of the UCF50 database. [8] proposed a novel dynamic neural network model which can recognize dynamic visual image patterns of human actions based on learning; a convolutional neural network (CNN) and multiple timescale recurrent neural networks (MTRNN) were introduced. [9] proposed a new method which combines part-based models and deep learning by training pose-normalized CNNs.
Although CNN is a good option for HAR, this method still has a weakness: the kernels/weights employed in the convolution are trained by back-propagation (BP) neural networks, which is very time consuming. In this paper, to address this problem for CNN-based HAR, a convolutional auto-encoder (CAE) pre-training strategy is proposed. This method discovers good CNN initializations that avoid the numerous distinct local minima of the highly non-convex objective functions arising in virtually all deep learning problems.
1.2 Objectives
The main objective of this thesis is to develop an automated agent for recognizing human action using UCF-101 dataset [10].
The goals of our thesis are:
• Divide every video into image frames (1 frame per 10 seconds).
• Reshape differently sized images and convert them into NPZ arrays.
• Divide the dataset into test and train folders.
• Retrain the ResNet-50 model with our training dataset.
• Save model progress as model checkpoints.
• Save the best model weights for future prediction.
• Predict human action type from the unknown test dataset.
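The first two goals can be sketched in plain Python. The helper below only computes which frame indices to sample at a fixed interval; in the actual pipeline these indices would be passed to OpenCV's `VideoCapture` to grab frames, and the resized frames then saved with `numpy.savez`. The function name is illustrative, not taken from the thesis code.

```python
def sample_frame_indices(frame_count, fps, interval_s=10):
    """Return the indices of frames sampled once every `interval_s` seconds.

    UCF101 clips have a fixed frame rate of 25 FPS, so one frame per
    10 seconds means one frame every 250 frames.
    """
    step = int(fps * interval_s)
    return list(range(0, frame_count, step))

# Example: a 30-second clip at 25 FPS yields frames 0, 250 and 500.
print(sample_frame_indices(750, 25))  # [0, 250, 500]
```

With OpenCV, each returned index could then be seeked to via `cap.set(cv2.CAP_PROP_POS_FRAMES, idx)` before reading the frame.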
1.3 Motivation
Nowadays human action recognition algorithms empower many real-world applications. Security is becoming more important in our daily life and is one of the most frequently discussed topics nowadays. Our motivation is to develop and train an agent which can recognize human action. With the input of a network of cameras, a visual surveillance system powered by action recognition and prediction algorithms may increase the chances of capturing a criminal on video, and reduce the risk caused by criminal actions. Another important motivation of our research is to improve human-robot interaction, which is popularly applied in home and industrial environments. Imagine that a person is interacting with a robot and asking it to perform certain tasks, such as "passing a cup of water" or "performing an assembling task". Such an interaction requires communication between robots and humans, and visual communication is one of the most efficient ways.
1.4 Expected Outcome
In our thesis our main focus is to develop and train an agent which can predict human action type using 13320 videos from 101 action categories. The agent is based on a convolutional neural network where we have used a pre-trained model (ResNet-50) rather than a handcrafted one. ResNet-50 gives a performance boost of about 10% in comparison with handcrafted models. After training, the agent has an accuracy of 92.25%.
1.5 Report Layout
The first chapter contains the Introduction, Objectives, Motivation, Expected Outcome and Report Layout of our project. The second chapter contains the project introduction, related works, comparative studies, scope of the problem and also the challenges of our project. The third chapter covers the requirement specification: the dataset we have used, the workflow of the proposed method, the workflow graph, the prediction flow process and the implementation requirements. The fourth chapter describes the proposed methodology in detail and the training strategies for the agent. Our fifth chapter is all about implementation and accuracy testing; it contains the implementation of ResNet-50, the prediction visualizer and the testing modules. Our last chapter contains the conclusion of the full thesis. This report covers our proposed system, its problems and solutions, and future improvements.
CHAPTER 2
BACKGROUND AND RELATED WORKS
2.1 Introduction
Human action recognition is an active topic in the field of computer vision. This is because of the rapidly growing amount of video data and the huge number of potential applications based on automated video analysis, for example visual surveillance, human-machine interfaces, sports video analysis, and video retrieval. Among these applications, one of the most fascinating is human action recognition, particularly high-level behavior recognition. An action is a sequence of human body movements and may involve several body parts simultaneously. From the perspective of computer vision, recognizing an action means matching the observation (e.g., a video) with previously defined patterns and then assigning it a label, i.e., an action type. Depending on complexity, human activities can be categorized into four levels: gestures, actions, interactions and group activities [11], and much research follows a bottom-up construction of human movement recognition. Significant components of such frameworks include feature extraction, action learning, classification, action recognition, and segmentation [12]. A straightforward procedure comprises three stages, namely detection of the human and/or its body parts, tracking, and then recognition using the tracking results. For example, to recognize "shaking hands" activities, two people's arms and hands are first detected and tracked to produce a spatial-temporal description of their movement. This description is compared with existing examples in the training data to decide the action type. This standard class of action recognition methods depends heavily on the accuracy of tracking, which is not reliable in cluttered scenes. Numerous other approaches have been proposed and can be categorized by different criteria, as in existing survey papers. Poppe [12] examined human action recognition from image representation and action classification separately. Weinland et al. [13] surveyed systems for action representation, segmentation and recognition. Turaga et al. [14] separated the recognition problem into activity and action according to complexity, and organized methodologies according to their capacity to handle varying degrees of complexity. There exist numerous other classification criteria.
2.2 Automated Agent Scenario
Human action recognition has become an important issue in this modern, technology-based era. Recognizing human action using a trained agent is very helpful because the system is automated. Moreover, humans are prone to fatigue and are limited in how much work they can do each day. To overcome this type of limitation, an automated agent can be of great help: machines are free from fatigue and have no limitation on working hours. A trained agent can predict more quickly than a human and can process large amounts of data, and new classifications can be trained in, which in turn increases the agent's accuracy.
2.3 Saved Model and Reuse
Models can be saved after training is done and reused for later model training. Reusing model weights greatly reduces the time required to train an agent. We reuse a previously built model architecture and the vast majority of its learned weights, and then use standard training procedures to learn the remaining, non-reused parameters. When a Keras model is built using the functional API, additional models can also be assembled over any subset of the paths through the network by reusing the intermediate functions, and these sub-models can then be trained on just parts of the network (given that targets are available for their outputs). Such intermediate models can also be used to propagate activations between internal layers, and the Python package conx, which is built on top of Keras, will construct these intermediate models automatically.
2.4 Related Works
Action recognition has been studied for years. Early works focus on developing good hand-crafted features for representing actions, such as 3D SIFT [15] and dense trajectories [16]. The performance of these methods is often restrained by the limited discriminative capability of hand-crafted features. With the development of deep ConvNets, many ConvNet-based methods were recently proposed for action recognition, which utilize ConvNets to automatically obtain the feature representation for actions. Ji et al. [17] utilize a 3D ConvNet to recognize actions in video. Simonyan and Zisserman [18] propose a two-stream framework which uses two ConvNets to respectively extract features from two information streams (i.e., appearance and motion) and fuse them for recognition. Based on this framework, recent research further improves the effectiveness of ConvNet features by including additional information sources [19] or applying them to related tasks such as convolutional neural network-based robot navigation using uncalibrated spherical images [20]. Most of the existing works are targeted at learning features that directly describe individual action classes, while the shared characteristics across different action class granularities are less studied. This restrains them from precisely distinguishing the subtle differences among ambiguous actions. Although some methods [21] obtain different levels of generality by integrating features from multiple ConvNet layers, they still focus on directly representing the individual action classes and do not consider the shared characteristics across different action class granularities. Besides the derivation of proper features, other research focuses on the proper combination of multiple information streams to boost action recognition performance [22], [23], [24]. For example, Feichtenhofer et al. [22] introduce residual connections between information streams to remedy the deficiency of the late fusion strategy in the two-stream framework. Wu et al. [25] also improve the fusion efficiency of the two-stream framework by performing both sequence-level fusion and video-level fusion over the information streams. However, most of these works fuse stream-wise information that happens simultaneously, which has limitations in handling the longer-term asynchronous patterns among information streams; this asynchrony is a non-trivial factor which can bring noticeable performance gains for action recognition.

Our proposed work builds on three existing studies [18], [26] and [27], all of which use the UCF-101 dataset [10] for training and testing their corresponding models. A comparative study of these three models is given below.
2.4.1 BoVW and Fusion Method
As shown in Figure 2.1, the pipeline of the Bag of Visual Words (BoVW) framework [26] consists of five steps: feature extraction, feature pre-processing, codebook generation, feature encoding, and pooling and normalization. The global representation is then fed into a classifier such as a linear SVM for action recognition. The authors give detailed descriptions of the popular technical choices in each step, which are very important for constructing a recognition system. Furthermore, they summarize several techniques used in these encoding methods and provide a unified generative perspective over the different encoding methods. The paper aims to provide a comprehensive study of all steps in BoVW and different fusion methods, and to uncover good practices for producing an action recognition system. Specifically, they explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. They conclude that every step is crucial for contributing to the final recognition rate, and that an improper choice in one of the steps may counteract the performance improvement of other steps. Furthermore, based on their comprehensive study, they propose a simple yet effective representation, called the hybrid representation, by exploring the complementarity of different BoVW frameworks and local descriptors. Using this representation, they obtain an accuracy of 87.9% on the UCF101 dataset [10].

Figure 2.1: The pipeline of obtaining BoVW representation for action recognition.
2.4.2 Optical Flow Sequence Method using AlexNet and VGG-16
They propose a three-stream CNN [18] setup for action recognition. This architecture is an extension of the popular two-stream model that takes as input individual RGB frames in one stream and a small stack of optical flow frames in the other. One shortcoming of that model is that it cannot see long-range action evolution, for which they propose to use their dynamic flow images. Their overall framework is illustrated in Figure 2.2. To be precise, for the dynamic flow stream, they generate multiple dynamic flow images for each video sequence. In order to achieve this, they first split the input flow video into several sub-sequences of fixed length, generated at a fixed temporal stride. For each sub-sequence, they construct a dynamic flow image using the optical flow images in that window. They associate the same ground truth action label with all the sub-sequences, thus effectively increasing the number of training videos. They use a separate CNN stream on the dynamic flow images. Given that action recognition datasets are usually tiny in comparison to image datasets (such as ImageNet), increasing the training set is usually necessary for effective training of the network. They use the TV-L1 optical flow algorithm to generate the flow images using its OpenCV implementation. For training, they use two successful CNN architectures, namely AlexNet and VGG-16, implemented with the Caffe toolbox [30]. As the number of training videos is substantially too limited to train a standard deep network from scratch, they decided to fine-tune the networks from models pre-trained for image recognition tasks. Using the AlexNet CNN architecture on the UCF101 dataset [10], the accuracy was 88.63%.

Figure 2.2: Architecture of the optical flow sequence method with a three-stream CNN.
2.4.3 Multi-View Super Vector Method
Partly inspired by the Gaussian mixture model (GMM) based Fisher Vector representation [28] and the Factorized Orthogonal Latent Spaces (FOLS) approach [29] for multi-view learning, this paper proposes a Mixture model of Probabilistic Canonical Correlation Analyzers (M-PCCA) and utilizes it to jointly encode multiple types of descriptors for video representation. The motivation is to factorize the joint space of a descriptor pair into a shared component and mutually independent private components, so that each component has strong inner dependency while different components are as independent as possible. They then apply kernel averaging on these components. In this way, they make the most of different local descriptors to improve recognition accuracy. They first derive an EM algorithm for learning M-PCCA. Each video is encoded based on this M-PCCA via a latent space and gradient embedding. The resulting video representation consists of two components: one is the latent factors, which encode information shared by different feature descriptors; the other is the gradient vector, which encodes information specific to each type of descriptor. Interestingly, the mathematical formulations of the two components turn out to be counterparts of the FV and VLAD representations, respectively. They revisit Canonical Correlation Analysis (CCA), propose the mixture model of canonical correlation analyzers and its corresponding learning algorithm, and present their video representation based on M-PCCA. After that they derive the MVSV [27] representation from M-PCCA, present an interpretation of the representation and compare it to previous coding methods. The performance of the method is experimentally examined on the UCF101 dataset, with an accuracy of 83.5%.

Figure 2.3: A graphical interpretation of M-PCCA

2.5 Scope of the Problems
The main scope of this thesis is as follows:
1. Develop and train an agent which can predict human action from the 101 human action classes of the UCF-101 dataset [10].
2. Save the best weights from each iteration for future use, which can in turn reduce the time required for prediction.
2.6 Challenges
Throughout the work we have faced several challenges. The most prominent challenges are stated below:
1. Overfitting
The main challenge of this thesis was to reduce overfitting throughout the training epochs. Sometimes the agent performed very poorly on unseen data. To reduce this type of problem we followed several steps, such as cross-validation, increasing the training data volume, reducing features, regularization, ensembling and early stopping.
2. Selection of Activation Function
We tried several activation functions like Binary Step, Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax etc. for the pullout layer of our CNN model. Though each activation function has its strong points, ReLU works best for the ResNet-50 architecture and gives the best weight distribution.
3. Distribution of Tensors
We used an Nvidia Quadro K200m GPU for our work, which is a very low-power GPU with only 1.67 GB of video memory. This low memory was very frustrating when assigning tensors, as it ran out of memory all the time. To overcome this problem, we used a fusion of both GPU and CPU when assigning tensors, and we needed to activate multiple CPU workers at a time.
4. Choosing the Best Weights
In each epoch the model generates several sets of trained weights. Choosing the best weights used to be a bit challenging; nowadays several frameworks can choose the best weights from the iterations in each epoch, and Keras worked best in this type of scenario to the best of our knowledge.
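The best-weight selection described above can be reduced to a one-line comparison; in Keras itself the `ModelCheckpoint` callback with `save_best_only=True` performs equivalent bookkeeping during training. The checkpoint records below are hypothetical, for illustration only.

```python
def best_checkpoint(records):
    """Pick the checkpoint file with the highest validation accuracy.

    `records` is a list of (filename, val_acc) pairs, one per saved
    iteration -- the same decision Keras's ModelCheckpoint callback
    makes when monitoring val_acc with save_best_only=True.
    """
    return max(records, key=lambda r: r[1])[0]

# Hypothetical weight files saved over three epochs:
epochs = [("weights-01.hdf5", 0.71), ("weights-02.hdf5", 0.89),
          ("weights-03.hdf5", 0.85)]
print(best_checkpoint(epochs))  # weights-02.hdf5
```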
CHAPTER 3
REQUIREMENT SPECIFICATION
3.1 Introduction to Dataset
UCF101 is currently one of the largest datasets of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered backgrounds. To the best of our knowledge, UCF101 is currently one of the most challenging datasets of actions due to its large number of classes, large number of clips and the unconstrained nature of the clips. The majority of existing action recognition datasets suffer from two disadvantages: 1) the number of their classes is typically very low compared to the richness of actions performed by humans in reality, e.g. the KTH [31], Weizmann [32], UCF Sports [33] and IXMAS [34] datasets include only 6, 9, 9 and 11 classes respectively; 2) the videos are recorded in unrealistically controlled environments. For instance, KTH, Weizmann and IXMAS are staged by actors, while HOHA [35] and UCF Sports are composed of movie clips captured by professional filming crews. Recently, web videos have been used in order to utilize unconstrained user-uploaded data to alleviate the second issue. However, the first disadvantage remained unresolved, as the largest existing dataset did not include more than 51 actions, while several works showed that the number of classes plays a crucial role in evaluating an action recognition method. Therefore, the creators of UCF101 compiled a new dataset with 101 actions and 13320 clips, nearly twice as big as the largest previously existing dataset in terms of number of actions and clips. (HMDB51 [36] and UCF50 [37] were the largest ones, with 6766 clips of 51 actions and 6681 clips of 50 actions respectively.) The dataset is composed of web videos which are recorded in unconstrained environments and typically include camera motion, various lighting conditions, partial occlusion, low-quality frames, etc. Figure 3.1 shows sample frames of 6 action classes from UCF101. The clips of one action class are divided into 25 groups which contain 4-7 clips each. The clips in one group share some common features, such as the background or the actors. The videos were downloaded from YouTube and the irrelevant ones manually removed. All clips have a fixed frame rate and resolution of 25 FPS and 320 × 240 respectively, and are saved as .avi files compressed using the DivX codec available in the K-Lite package.
Figure 3.1: Sample frames for 6 action classes of UCF101 [10].

Table 1. Summary of Major Action Recognition Datasets [10]
3.2 Workflow of the Proposed Method
We have used a Convolutional Neural Network (CNN) based on the ResNet-50 architecture. A CNN has several modules, such as the input layer, convolutional layers, pooling layers, ReLU and fully connected layers; we have used the ResNet-50 variant of this design.
Figure 3.2: Proposed Method Workflow (Start → Dataset → Frame Creator → Resize Module → Create test/train folders → Load ResNet-50 → Split Test: Validation (60:10) → Create model checkpoint → Save model → Load model with best weights → Test agent accuracy on unknown test data → Plot predicted result)
After reshaping the images, we have converted them into NumPy arrays stored as compressed NPZ files to reduce the huge time required to train on high-resolution images. This step adds minor execution time but reduces overall system execution time. We feed the NPZ arrays to the input layer, but before that we need to load an empty ResNet-50 model. From the dataset we have created test and train folders and placed images in them at random. The train folder is used to train the model, and the test folder serves as the source of unknown images. We have trained the ResNet-50 with the train dataset, which contains 60% of the total images, and each training iteration is validated against 10% of the total images, known as the validation set. After retraining the ResNet-50 with the current train dataset, we save the best weights from each iteration and save the model state. The saved state can later be used to resume training where it left off. We have chosen the best weight from almost 1000 weights to predict the unseen data from the test dataset, which contains the remaining 30% of the total images, unseen by the model. After the prediction we plot the best probabilistic prediction on the unseen image. The whole workflow can be visualized in Figure 3.2.
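The 60:10:30 split described above can be sketched with the standard library; the function name, fraction defaults and fixed seed are illustrative choices, not the thesis's actual code.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.1, seed=42):
    """Shuffle items and split them into train/validation/test subsets.

    Mirrors the 60:10:30 split used in the workflow; whatever remains
    after the train and validation slices becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 10 30
```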
3.3 Workflow Graph of the Proposed Method
The UCF101 dataset [10] contains 101 action classes as video clips, from which we create frames using OpenCV. Every video clip is subdivided at ten-second intervals, i.e., we get one image per ten seconds of video. RGB color channel allocation is also done in this layer; in our case we have taken 3 channels. The feature learning process has three parts. The convolution layer extracts the high-level features of each input image. After we have extracted high-level features from the input images, we apply ReLU (Rectified Linear Unit) immediately after each convolution layer; the purpose of this layer is to introduce non-linearity into a system that has essentially just been computing linear operations in the conv layers (element-wise multiplications and summations). After the ReLU we apply max pooling, which chooses the strongest of the features extracted by the convolution layer. Max pooling gives us the best features, which are multidimensional arrays. As our fully connected layers only learn on one-dimensional arrays, we need to flatten the multidimensional array before feeding it to the fully connected layers. The fully connected layers learn on the flattened inputs by applying back-propagation. For back-propagation and optimization we have used the ADAM optimizer rather than Stochastic Gradient Descent (SGD); ADAM is much better optimized and reduces training time. The fully connected layers output an N-dimensional vector, where N is the number of classes the program has to choose from; each number in this N-dimensional vector represents the probability of a certain class.
Figure 3.3: System Workflow Graph
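The convolution → ReLU → max-pooling sequence described above can be illustrated with NumPy. The feature-map values below are made up for demonstration; a real ResNet-50 applies learned convolution kernels before these steps.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: keep positive activations, zero out negatives."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map."""
    h, w = x.shape
    # Trim odd edges, group into 2x2 blocks, and take each block's maximum.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A hypothetical 4x4 feature map produced by a convolution layer:
fmap = np.array([[1., -2., 3., 0.],
                 [-1., 5., -3., 2.],
                 [0., 1., 2., -4.],
                 [3., -1., 0., 6.]])
pooled = max_pool_2x2(relu(fmap))
print(pooled)  # [[5. 3.]
               #  [3. 6.]]
```

Flattening `pooled` with `pooled.ravel()` then yields the one-dimensional array the fully connected layers consume.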
Through our thesis we have pursued a classification problem rather than localization or detection, so the last layer of the model is the Softmax function. Softmax assigns a decimal probability to each class in the multi-dimensional feature array, and the total probability assigned across the classes must be 1. Softmax also helps model training converge more quickly; training would take much longer without it.
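The Softmax behavior just described can be verified numerically; the class scores below are made-up inputs, not outputs of the actual model.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1.

    Subtracting the maximum score first is a standard numerical-stability
    trick; it does not change the resulting probabilities.
    """
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [2.0, 1.0, 0.1]      # hypothetical scores for three classes
probs = softmax(scores)
print(round(probs.sum(), 6))  # 1.0 -- probabilities always sum to one
print(probs.argmax())         # 0 -- the class with the largest raw score
```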
3.4 Prediction Flow of the Proposed Method
We have saved the best weight of each iteration, and after the model has finished learning from all the images in the train dataset it has generated more than 100 weight files. We use the best of these weights to predict the class an unknown image belongs to. First the model is loaded with the best weight. We then feed this model a new unknown image from the test dataset, and the trained model predicts the class of the unknown image. After the probabilistic prediction we plot the result with the help of the image plotter. The CNN model compares each prediction with the ground truth, and the image plotter prints the highest probability and the next best match on the unseen image.
Figure 3.4: Prediction Workflow
(Workflow: Initialize Model → Load Best Weight → Predict the Class → Compare the Prediction with Ground Truth → Plot the Best Probability)
3.5 Implementation Requirements
We have used Python as the programming language along with several frameworks. The frameworks used throughout the thesis are listed below.
• Keras
Keras is an open-source neural network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, or Theano. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer.
In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library. Chollet explained that Keras was conceived to be an interface rather than a standalone machine-learning framework. It offers a higher-level, more intuitive set of abstractions that make it straightforward to develop deep learning models regardless of the computational backend used. Microsoft likewise added a CNTK backend to Keras, available as of CNTK v2.0.
• TensorFlow
TensorFlow is an open-source software library for numerical computation using data flow graphs. The graph nodes represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture allows computation to be deployed to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code. TensorFlow also includes TensorBoard, a data visualization toolkit.
TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural network research. The system is general enough to be applicable in a wide variety of other domains as well.
TensorFlow provides stable Python and C APIs, as well as APIs without backwards-compatibility guarantees for C++, Go, Java, JavaScript and Swift.
• Scikit-learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on February 1st, 2010. Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012. As of 2018, scikit-learn is in active development.
• OpenCV
OpenCV (Open Source Computer Vision) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage, then Itseez (which was later acquired by Intel). The library is cross-platform and free to use under the open-source BSD license. OpenCV supports the deep learning frameworks TensorFlow, Torch/PyTorch and Caffe.
Officially launched in 1999, the OpenCV project was initially an Intel research initiative to advance CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display walls. The main contributors to the project included a number of optimization experts in Intel Russia, as well as Intel's Performance Library Team.
CHAPTER 4
PROPOSED WORK
4.1 Introduction to CNN
Deep learning explores the possibility of learning features directly from input images, avoiding hand-crafted models. The key concept of deep learning is to explore multiple levels of representation, so that higher-level features represent an abstract view of the images. Convolutional Neural Networks (CNNs) are now used everywhere. A CNN is constructed of multiple convolutional layers stacked on top of each other, followed by a supervised deep net known as the fully connected layer; sets of feature maps represent both the input and output of each convolutional layer. The input may be an image, audio, or video. In our case we use color images, so at the input layer each feature map is a two-dimensional array storing one RGB channel of the input image. The output of each layer consists of a set of arrays, where each feature map represents a particular feature extracted at a particular input layer. A deep net is trained by feeding it input and letting it compute layer by layer to generate the final output for comparison with the correct answer. The ADAM optimizer works as the weight updater in each iteration, and errors are back-propagated through the net. At each backward step, the model parameters are tuned in a direction that tries to reduce the error. This process increases model accuracy as learning progresses. Generally, training is done by feeding the model the training dataset again and again in an iterative fashion until the model converges.
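The ADAM update referred to above can be written out in a few lines of NumPy. This is a minimal sketch with the standard default hyper-parameters (β₁ = 0.9, β₂ = 0.999), shown here minimizing a toy quadratic loss rather than a real network:

```python
import numpy as np

def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimal ADAM loop: adaptive per-parameter steps from gradient moments."""
    w = float(w0)
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # biased first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g      # biased second moment (variance)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy loss L(w) = (w - 3)^2, so grad L = 2(w - 3); the minimum is at w = 3.
w_star = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
```

The per-parameter scaling by the second-moment estimate is what makes ADAM less sensitive to the learning rate than plain SGD.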
Figure 4.1: A CNN architecture
4.2 ResNet-50 Architecture
Deep convolutional neural networks [4] have led to a series of breakthroughs for image classification. Deep networks naturally integrate low/mid/high level features and classifiers in an end-to-end multilayer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hampers convergence from the beginning. This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation. When deeper networks are able to start converging, a degradation problem is exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting; adding more layers to a suitably deep model leads to higher training error, as thoroughly verified by experiments. The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that current solvers are unable to find solutions that are comparably good or better than the constructed solution. This degradation problem can be addressed by introducing a deep residual learning framework.
Figure 4.2: Residual Learning Building Block
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expecting stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) = H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different. A typical ResNet-50 has the following components:
• Input layer: Input images are fed to this layer and its output is fed to the convolution layers. Reshaping and feature scaling are done at this stage to prevent dimensionality errors in the model. The dataset contains images of different shapes, so reshaping is a must to avoid dimensionality errors. We have reshaped all the images to (64*64) pixels. As all the images are colored, we have chosen the channel size to be 3 (RGB channels).
• Convolutional layers: Convolution layers convolve the input images into a multi-dimensional feature map with a set of learnable filters. We have used three convolutional layers. The kernels or filters are of size 5 × 5, the padding is set to 2, and the stride is set so that, with SAME padding, the output keeps the same spatial size as the input. The first two convolutional layers learn 32 filters each and are initialized with Gaussian distributions with standard deviations of 0.0001 and 0.01 for layers 1 and 2, respectively. The last layer learns 64 filters and is initialized with a Gaussian distribution with a standard deviation of 0.0001.
• Pooling layers: Pooling layers are responsible for down-sampling the spatial dimension of the input. After each convolutional layer there is a pooling layer. We have used a 2*2 kernel for each pooling layer with a stride of size 2. The first pooling layer applies max pooling over the generated feature map, and the last two perform average pooling.
• ReLU layers: We have tested different activation functions such as Tanh, Sigmoid, ReLU, Leaky ReLU, and binary step. From our findings we have seen that ReLU works best on the action images. After each pooling layer there is a ReLU layer. For an input value x, ReLU computes the neuron's output f(x) as x if x > 0 and (α × x) if x <= 0. α specifies whether to keep a scaled-down negative part (multiplying it by a small slope value such as 0.01) rather than setting it to 0. The default value of α is 0; if α is not set, the layer works as a standard ReLU function f(x) = max(0, x), in other words, the activation is simply thresholded at zero.
Figure 4.3: ReLU
• Fully connected layers: Fully connected layers are in essence a densely connected deep neural network that takes a multi-dimensional feature map as input and produces a one-dimensional feature vector as output. We have used three fully connected layers, as this gives the best accuracy and optimum weight distribution. After the fully connected layers there is a probabilistic Softmax activation function. Softmax is a probabilistic function that gives the best match depending on the number of classes in the classification problem; in our case, one of the action classes.
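The piecewise ReLU/leaky-ReLU definition given in the ReLU layer description above (f(x) = x for x > 0, α·x otherwise) translates directly to NumPy; a minimal sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.0):
    """f(x) = x for x > 0, alpha * x otherwise; alpha = 0 gives standard ReLU."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

# Standard ReLU (alpha = 0): negatives are clipped to zero.
#   leaky_relu([-2.0, 2.0])             -> [0.0, 2.0]
# Leaky ReLU with the typical small slope 0.01: negatives are scaled, not dropped.
#   leaky_relu([-2.0, 2.0], alpha=0.01) -> [-0.02, 2.0]
```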
Figure 4.4: ResNet-50 CNN architecture
If we compare the error rate of ResNet with other CNNs as well as shallow methods on ImageNet classification, we can see that its error rate is much lower than that of the other classifiers. This is also one of the best aspects of the ResNet model.
Figure 4.5: ImageNet Benchmark
4.3 Classification
To classify an image, we combine the patch results for the whole image. We divide the input images into patches because our model was trained on patches of images. The newly created patches are run through the model and the results are combined for the classification. We extract grid patches, which offer a good balance between classification quality and computational cost. Running the model with the best weights on the patches outputs the most probable match among the classes. We use the Sum rule to combine the patch results, which gives the best result in comparison to other fusion rules.
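The Sum rule mentioned above simply adds the per-patch probability vectors before taking the arg-max; a sketch with made-up patch outputs (the three patches and three classes are illustrative, not real model results):

```python
import numpy as np

def fuse_patches_sum_rule(patch_probs):
    """Combine per-patch class probabilities by summing, then pick the arg-max."""
    total = np.sum(patch_probs, axis=0)      # element-wise sum over patches
    return int(np.argmax(total)), total

# Three hypothetical patches scored over three classes:
patch_probs = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.2, 0.2],
])
winner, summed = fuse_patches_sum_rule(patch_probs)
# Class 0 wins: summed = [1.3, 1.0, 0.7]
```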
4.4 Developed Work
The following figures illustrate the different aspects of our developed work.
• Converting Dataset into NPZ Arrays
We have used NumPy to convert images into NPZ arrays. Converting the images into pixel arrays reduces computation time significantly. The dataset directory is passed on the command line, and the underlying Python script converts all the images into NPZ arrays.
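The conversion step can be sketched with NumPy's `savez`/`load` round trip. The tiny random "images" below are stand-ins for real dataset frames; only the 64 × 64 × 3 shape matches this work.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for two 64 x 64 RGB frames (the input shape used in this work).
images = rng.integers(0, 256, size=(2, 64, 64, 3), dtype=np.uint8)
labels = np.array([0, 1])

path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez(path, images=images, labels=labels)     # a single .npz archive on disk

loaded = np.load(path)
# The arrays round-trip exactly, ready to be fed to the network.
assert (loaded["images"] == images).all() and (loaded["labels"] == labels).all()
```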
Figure 4.6: Converting Images into NPZ Arrays
• Train ResNet-50
After converting the dataset into NPZ arrays we feed it to the ResNet-50. We first have to define some parameters: the maximum number of training examples we want to sample from the images and the size of the patches sampled from our images. Next we need to define the directories where the training data is located. We also need to specify the number of features that we have annotated. Then we get a list of all the files in the training directories and initialize some variables. ResNet-50 trains on the training dataset and saves the best weights in the weight directory for each iteration. While the model trains on a large volume of data, we need to save the model state to avoid losing progress to any interruption.
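Saving the best weights per iteration is typically done in Keras with a `ModelCheckpoint(save_best_only=True)` callback; the framework-agnostic logic is just "keep the weights from the epoch with the lowest validation loss so far". A minimal sketch of that logic, with made-up loss values:

```python
def track_best(val_losses):
    """Return the epoch index whose weights would be kept (lowest val loss so far)."""
    best_epoch, best_loss = -1, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:                 # improvement -> save/overwrite weights
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss

# Hypothetical validation losses over six epochs:
best_epoch, best_loss = track_best([1.9, 1.2, 1.4, 0.9, 1.1, 1.0])
# Epoch 3 (loss 0.9) is the checkpoint that survives.
```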
Figure 4.7: Training ResNet-50
• Prediction and Plotting
After the training is done, we can use the saved model weights to predict the human action type from the unknown test dataset. Our model predicts the outcome and plots the best-match probability onto the unknown input image.
Figure 4.8: Predicting the Human Action Class
Figure 4.9: Plotted Model Outcome
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Result Graphs
In our proposed method we have used a CNN with ResNet-50 instead of a hand-crafted model. Hand-crafted models give rise to unwanted complexity and may have inferior outcomes compared to already existing benchmarked ones.
Figure 5.1 shows the validation accuracy of the proposed agent.
Figure 5.1: Validation accuracy of the agent
The validation loss of our proposed architecture is shown in Figure 5.2.
Figure 5.2: Validation loss of the agent
The overall accuracy of our proposed work on the UCF101 dataset using the ResNet-50 deep CNN over 60 epochs is shown in Figure 5.3.
Figure 5.3: Overall accuracy of the agent
The gradual reduction of the loss as the agent learns from epoch one through sixty is shown in Figure 5.4.
Figure 5.4: Gradual reduction of loss
At first the agent learns at a steady pace, starting with a high cost at the beginning of the learning. The cost function of the agent can be visualized in Figure 5.5.
Figure 5.5: Cost function summary
The recognition rate of the agent is computed at the image level, thus providing a means to estimate solely the image classification accuracy of the CNN models. Let I_all be the number of action class images in the test set. If the system correctly classifies I_clsfy action images, then the recognition rate at the image level is:
Image Recognition Rate = (I_clsfy / I_all) × 100
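As a worked example with hypothetical counts (not the actual test-set size of this work): if, say, 9,225 of 10,000 test images are classified correctly, the formula gives a recognition rate of 92.25%.

```python
def image_recognition_rate(i_clsfy, i_all):
    """Recognition rate at the image level, as a percentage."""
    return i_clsfy / i_all * 100

# Hypothetical counts: 9,225 correctly classified out of 10,000 test images.
rate = image_recognition_rate(9225, 10000)   # -> 92.25
```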
5.2 Comparison with Previous Work
Before our proposed work, we examined some previous works that also use the UCF101 dataset for their experiments.
First, we discussed the Bag of Visual Words and Fusion (BoVW) method for human action recognition, which achieves an accuracy of 87.9%. The main problem of this method is its reliance on a separate feature extraction step. The general idea of bag of visual words (BoVW) is to represent an image as a set of features, consisting of keypoints and descriptors. In this experiment, they use space-time interest point (STIP) and Improved Dense Trajectories (iDT) local features. For detecting features and extracting a descriptor from each image in the dataset they use the SIFT feature extraction algorithm. Such a feature extraction algorithm can lose a lot of the spatial interaction between pixels, whereas a CNN has an automatic feature extractor: CNNs include multilayer processing and subsampling layers that give better performance.
Second, we considered the Multi-View Super Vector (MVSV) method, which uses four views and methods and combines them to get an optimized result, achieving an accuracy of 83.5%.
And finally, we discussed another deep CNN approach based on the AlexNet and VGG-16 architectures, which achieve accuracies of 88.63% and 61.70%, respectively. In these two architectures, when deeper networks start converging, a degradation problem is exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly. This is because these two architectures use plain blocks to build deeper layers by stacking more layers onto the shallow ones. So in the worst-case scenario, the deeper model's early layers can be replaced with the shallow network and the remaining layers can just act as an identity function (input equal to output).
Figure 5.6: Comparison between ResNet and other CNNs
In the rewarding scenario, the additional layers in the deeper network approximate the mapping better than their shallower counterpart and reduce the error by a significant margin. In the worst-case scenario, both the shallow network and its deeper variant should give the same accuracy; in the rewarding scenario, the deeper model should give better accuracy than its shallower counterpart. But experiments with present solvers reveal that deeper models do not perform well, so simply using deeper networks degrades the performance of the model. Our approach tries to solve this problem using a deep residual learning framework. Instead of learning a direct mapping from x to y with a function H(x) (a few stacked non-linear layers), this approach defines the residual function as F(x) = H(x) − x, which can be reframed as H(x) = F(x) + x, where F(x) and x represent the stacked non-linear layers and the identity function (input = output), respectively. According to the residual hypothesis, it is easier to optimize the residual mapping function F(x) than to optimize the original, unreferenced mapping H(x).
(Diagram: a Plain Block of stacked neural network layers computes Y = f(x); a Residual Block adds a skip connection, so Y = f(x) + x.)
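The residual computation H(x) = F(x) + x can be made concrete in NumPy. This is a minimal sketch of one block: the two-layer F stands in for a block's stacked weight layers, and the random weights are illustrative, not from the trained model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two weight layers with a ReLU between them."""
    f = relu(x @ w1) @ w2        # F(x): the residual branch
    return relu(f + x)           # add the identity shortcut, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # batch of 4 feature vectors, width 8
w1, w2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)                # same shape as x: (4, 8)

# With all-zero weights F(x) = 0, so the block reduces to the identity (plus ReLU),
# which is exactly why extra residual layers cannot hurt training the way plain
# stacked layers can:
zeros = np.zeros((8, 8))
assert np.allclose(residual_block(x, zeros, zeros), relu(x))
```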
Table 2. Comparison with Previous Work

Method                 Accuracy (%)
BoVW                   87.9
MVSV                   83.5
OFS (VGG-16)           61.70
OFS (AlexNet)          88.63
Proposed ResNet-50     92.25
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 Conclusion
Video classification to recognize human actions can be performed in various ways, but the question is how each affects the overall performance. In our work, we first used OpenCV to create frame images from every ten seconds of video, obtaining many frame images for each action class. After that we trained our agent using ResNet-50. Our work has shown that existing models, particularly ResNet-50, perform better than hand-crafted models when classifying color images and objects with diverse image patterns.
In our work we have used a sliding window mechanism that allows us to deal with high-resolution textured images and also performs as expected on low-resolution images without changing the whole architecture. Experimental results have shown that our proposed method works better than existing classifiers.
6.2 Future Work
Future work can explore different CNN architectures to get better performance on this dataset, or apply this architecture to other, more recent datasets such as the Kinetics dataset with more than 400 action classes. Additionally, exploring other pre-trained models such as Inception V3 and GoogLeNet would be a great option for achieving better accuracy.
REFERENCES
[1] Huang Y, Yang H, Huang P. Action recognition using hog feature in different resolution video sequences[C]//Computer Distributed Control and Intelligent Environmental Monitoring (CDCIEM), 2012 International Conference on. IEEE, 2012: 85-88.
[2] Sadanand S, Corso J. Action bank: A high-level representation of activity in video. In IEEE, 2012: 1234-1241.
[3] Wang H, Kläser A, Schmid C. et al. Dense trajectories and motion boundary descriptors for action recognition [J]. International Journal of Computer Vision, 2013, 103(1): 60-79.
[4] LeCun Y, Bottou L, Bengio Y. et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[5] Ji S, Xu W, Yang M. et al. 3D convolutional neural networks for human action recognition [J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013, 35(1): 221-231.
[6] Karpathy A, Toderici G, Shetty S. et al. Large-scale video classification with convolutional neural networks[C]//Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014: 1725-1732.
[7] Ijjina E P, Mohan C K. Human Action Recognition Based on Recognition of Linear Patterns in Action Bank Features Using Convolutional Neural Networks[C]//Machine Learning and Applications (ICMLA), 2014 13th International Conference on. IEEE, 2014: 178-182.
[8] Jung M, Hwang J, Tani J. Multiple spatio-temporal scales neural network for contextual visual recognition of human actions[C]//Development and Learning and Epigenetic Robotics (ICDL-Epirob), 2014 Joint IEEE International Conferences on. IEEE, 2014: 235-241.
[9] Zhang N, Paluri M, Ranzato M. A. et al. Panda: Pose aligned networks for deep attribute modeling[C]//Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[10] K. Soomro, A. R. Zamir and M. Shah, UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01, November, 2012.
[11] Aggarwal J, Ryoo M. Human activity analysis: A survey. ACM Comput Surv 2011; 43:1-43.
[12] Poppe R. A survey on vision-based human action recognition. Image Vis Comput 2010; 28:976-90.
[13] Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Comput Vis Image Underst 2011; 115:224-41.
[14] Turaga P, Chellappa R, Subrahmanian VS, Udrea O. Machine recognition of human activities: A survey. IEEE Trans Circuits Syst Video Technol 2008; 18:1473-88.
[15] Scovanner, P., Ali, S. and Shah, M. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM MM.
[16] Wang, H., Kläser, A., Schmid, C. and Liu, C. 2013. Dense trajectories and motion boundary descriptors for action recognition. Intl. Journal Comp. Vision 103(1):60–79.
[17] Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE T-PAMI 35(1) (2013) 221-231.
[18] Jue Wang, Anoop Cherian, Fatih Porikli: Ordered Pooling of Optical Flow Sequences for Action Recognition. In WACV, 2017.
[19] Song, S., Lan, C., Xing, J., Zeng, W. and Liu, J. 2017. An end-to-end spatial-temporal attention model for human action recognition from skeleton data. In AAAI.
[20] Ran, L., Zhang, Y., Zhang, Q., Yang, T.: Convolutional neural network-based robot navigation using uncalibrated spherical images. Sensors 17(6) (2017).
[21] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015) 3431-3440.
[22] Feichtenhofer C., Pinz A., and Wildes R. 2016. Spatiotemporal residual networks for video action recognition. In NIPS.
[23] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV. (2013) 3551-3558.
[24] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In ECCV. (2016) 20-36.
[25] Wu Z., Wang X., Jiang Y., Ye H., and Xue X. 2015. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM MM.
[26] Xiaojiang Peng, Limin Wang, Xingxing Wang and Yu Qiao. 2014. Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice. arXiv preprint, 2014.
[27] Zhuowei Cai, Limin Wang, Xiaojiang Peng, Yu Qiao. Multi-View Super Vector for Action Recognition. In IEEE, 2014.
[28] F. Perronnin and C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[29] M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In AISTATS, 2010.
[30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Intl. Conf. on Multimedia. ACM, 2014.
[31] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach, 2004. International Conference on Pattern Recognition (ICPR).
[32] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes, 2005. International Conference on Computer Vision (ICCV).
[33] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition, 2008. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars, 2007. International Conference on Computer Vision (ICCV).
[35] M. Marszałek, I. Laptev, and C. Schmid. Actions in context, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition, 2011. International Conference on Computer Vision (ICCV).
[37] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos, 2012. Machine Vision and Applications Journal (MVAP).