Human Action Recognition in Videos Using Intermediate Matching Kernel

Human action recognition can be thought of as the process of tagging videos with appropriate action tags. Within computer vision, video understanding has become an important area of research. Because actions can last for different lengths of time, videos containing human actions have to be treated as patterns of varying length.

To handle this, an intermediate matching kernel (IMK) is constructed dynamically to measure the similarity between patterns of different lengths. The idea of the intermediate matching kernel is to use a generative model as a reference when computing the similarity between videos. A video is a sequence of frames, which can be represented as a sequence of feature vectors; the hidden Markov model (HMM) is therefore used as the generative model, since it captures the stochastic structure of such sequences.

The central idea of this thesis is to build intermediate matching kernels using the hidden Markov model as the generative model, and to use the SVM as a discriminative model that classifies actions based on the computed kernels. Human activity recognition problems can be grouped, in order of increasing complexity, into gestures, actions, interactions, and group activities. Of these categories, simple action classification, in which each video contains a single instance of an action, is chosen for this thesis.

The general approach to action classification is to extract suitable features from the frames of the videos and then apply a classification algorithm to those features.

Variations among and within the action classes

Most of the applications mentioned above require an automated system for recognizing human activities, which may be either low-level actions or compositions of several such low-level actions. Using a training set of videos, the system learns to assign an action to its corresponding action class. In doing so, the following problems may arise, and they affect both the choice of features to extract from the frames and the choice of classification algorithm.

Environment and Recording settings

Obtaining and labelling of Training Data

Varying pattern lengths

Davis et al. [1] proposed two representations, the Motion Energy Image (MEI) and the Motion History Image (MHI), for recognizing human activity. The approach assumes either a static background or that the motion of the object can be separated from the motion of the camera. Using the MEI and the MHI, temporal templates are constructed for each of the considered actions.
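The construction of such templates can be sketched as follows. This is a minimal illustration rather than the implementation of [1]: it assumes that motion is detected by thresholded frame differencing on small grayscale grids, and the names motion_history, tau, and thresh are hypothetical. The update rule used is the commonly described one, in which a pixel is set to tau where motion occurs and otherwise decays by one per frame; the MEI is taken as the set of pixels whose history is still non-zero.

```python
def motion_history(frames, tau, thresh):
    """Return (mei, mhi) for a list of equally sized 2D grayscale frames.

    Assumption: a pixel is "moving" when the absolute difference between
    consecutive frames exceeds `thresh` (simple frame differencing).
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    mhi = [[0] * cols for _ in range(rows)]
    for prev, curr in zip(frames, frames[1:]):
        for r in range(rows):
            for c in range(cols):
                if abs(curr[r][c] - prev[r][c]) > thresh:
                    mhi[r][c] = tau                      # fresh motion
                else:
                    mhi[r][c] = max(0, mhi[r][c] - 1)    # older motion fades
    # The MEI marks every pixel that moved within the last tau updates.
    mei = [[1 if v > 0 else 0 for v in row] for row in mhi]
    return mei, mhi


# A bright pixel moving left to right across a 1x4 strip.
frames = [
    [[9, 0, 0, 0]],
    [[0, 9, 0, 0]],
    [[0, 0, 9, 0]],
    [[0, 0, 0, 9]],
]
mei, mhi = motion_history(frames, tau=3, thresh=1)
```

In this toy strip the MHI values grow from left to right, so the direction of motion is encoded in the intensity ramp, while the MEI records only where motion happened.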

Laptev et al. [2] applied an SVM over a set of local features for action recognition. The approach builds on the intuition of describing a video through local measurements taken at spatio-temporal interest points. The PII is obtained by exploring the characteristics of the motion frequency, and the OFII is obtained by motion mode analysis.

In the classification step, a classifier based on the spatial pyramid match kernel (SPMK) is used to recognize activities. Ali et al. [4] proposed using kinematic features extracted from the optical flow of videos to recognize human activity. A spatio-temporal pattern can be obtained by computing each kinematic feature from the optical flow of the image sequence.

Here, every action video is represented as a bag of kinematic modes. The work in [8] focused primarily on object contours, computing interest points in 2D space. For localized multiple kernel learning, they used homogeneous polynomial, inhomogeneous polynomial, and Gaussian radial basis function kernels for each of the features.

For each template in the bank, the correlations between the template and the various volumes in the video are computed, and these form the feature descriptor for the input video. On top of this, volumetric max-pooling is applied over three levels of an octree. Each action-scale pair gives $1^3 + 2^3 + 4^3 = 73$ values, that is, a 73-dimensional vector. The total length of the action bank feature vector is $N_a \times N_s \times 73$, where $N_a$ is the number of action templates and $N_s$ is the number of scales. In this way, they capture both the local motion and a dense representation of the motion in the foreground and its surroundings.
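The count of 73 follows from pooling over an octree with three levels, which has 1 + 8 + 64 cells. The sketch below illustrates only this pooling step on a synthetic volume; the template correlation itself is not shown, and the function names, the [t][y][x] layout, and the equal-split rule are assumptions made for the illustration.

```python
def axis_bounds(n, parts):
    """Split an axis of length n into `parts` contiguous index ranges."""
    return [(i * n // parts, (i + 1) * n // parts) for i in range(parts)]


def octree_max_pool(volume, levels=3):
    """Max-pool a 3D volume (nested lists indexed [t][y][x]) over an octree.

    Level l splits every axis into 2**l parts, giving 8**l cells, so three
    levels give 1 + 8 + 64 = 73 values.
    Assumption: every axis has at least 2**(levels - 1) entries.
    """
    nt, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    features = []
    for level in range(levels):
        parts = 2 ** level
        for t0, t1 in axis_bounds(nt, parts):
            for y0, y1 in axis_bounds(ny, parts):
                for x0, x1 in axis_bounds(nx, parts):
                    features.append(max(
                        volume[t][y][x]
                        for t in range(t0, t1)
                        for y in range(y0, y1)
                        for x in range(x0, x1)
                    ))
    return features


# Synthetic 4x4x4 "correlation volume" whose value encodes its position.
volume = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)]
          for t in range(4)]
features = octree_max_pool(volume)
```

Concatenating one such vector for every template and every scale gives the full action bank descriptor.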

Kernel Methods For Varying Length Pattern Analysis

Fisher Kernel
Probabilistic Sequence Kernel
GMM Supervector Kernel
Summation Kernel
Matching Kernel
Intermediate Matching Kernel
GMM-Based IMK
Kernels for Sequences of Feature Vectors

The Fisher kernel exploits a generative model, the GMM, and maps a set of local feature vectors into a fixed-dimensional Fisher score vector. In the Fisher score space, a set of local feature vectors is represented in a fixed dimension as a supervector of the Fisher score vectors of all the components. For a set of local feature vectors $X = [x_1, x_2, \ldots, x_T]$, the gradient of the log-likelihood with respect to the mean $\mu_q$ of the $q$th component of a GMM is given by $\varphi_q^{(\mu)}(X) = \sum_{t=1}^{T} \gamma_q(x_t)\, \Sigma_q^{-1} (x_t - \mu_q)$, where $\gamma_q(x_t)$ is the responsibility of the $q$th component for $x_t$ and $\Sigma_q$ is the covariance matrix of that component. In the GMM supervector kernel [18], the set of local feature vectors is mapped to the corresponding GMM supervector in a higher, fixed-dimensional space.
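As an illustration of how sets of different sizes end up in the same fixed-dimensional space, the following sketch computes the part of the Fisher score that corresponds to the component means. It is a simplified example under stated assumptions: the GMM has diagonal covariances, its parameters are given rather than trained, and only the mean gradients are stacked. For each component, the gradient is the responsibility-weighted sum of (x - mean) / variance over all local feature vectors. The function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance evaluated at x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def fisher_score_means(X, weights, means, variances):
    """Stack the log-likelihood gradients w.r.t. every component mean."""
    Q, d = len(weights), len(means[0])
    score = [[0.0] * d for _ in range(Q)]
    for x in X:
        dens = [w * gaussian_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        for q in range(Q):
            gamma = dens[q] / total   # responsibility of component q for x
            for j in range(d):
                score[q][j] += gamma * (x[j] - means[q][j]) / variances[q][j]
    # One fixed-length vector of size Q * d, whatever the size of X.
    return [value for row in score for value in row]


# Hypothetical two-component GMM in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

short_set = [[0.1, -0.2], [2.9, 3.2]]
long_set = [[0.0, 0.3], [3.1, 2.8], [1.5, 1.5], [-0.4, 0.2], [3.3, 3.0]]
score_short = fisher_score_means(short_set, weights, means, variances)
score_long = fisher_score_means(long_set, weights, means, variances)
```

Both score vectors have Q * d = 4 entries, so any standard kernel on fixed-length vectors can be applied to them.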

In the summation kernel, a base kernel is computed between every pair of local feature vectors, one from each set. All these similarities are then summed to obtain the similarity between the two sets of local feature vectors. The summation kernel between two sets of local feature vectors $X_m = [x_{m1}, x_{m2}, \ldots, x_{mT_m}]$ and $X_n = [x_{n1}, x_{n2}, \ldots, x_{nT_n}]$ is given as $K_{SK}(X_m, X_n) = \sum_{t=1}^{T_m} \sum_{t'=1}^{T_n} k(x_{mt}, x_{nt'})$, where $k$ is the base kernel. The total number of comparisons made in the summation kernel equals the product of the cardinalities of the two sets, that is, $T_m \times T_n$. In the matching kernel, by contrast, the kernel value is the sum of the similarities of all pairs of matched local feature vectors from the two sets.
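A minimal sketch of the summation kernel is given below. The Gaussian base kernel and its width parameter delta are assumptions chosen for illustration; any valid base kernel on individual feature vectors could be passed in instead.

```python
import math


def gaussian_kernel(x, y, delta=0.5):
    """Gaussian base kernel exp(-delta * ||x - y||^2); delta is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-delta * sq_dist)


def summation_kernel(Xm, Xn, base_kernel):
    """Sum the base kernel over every pair (one vector from each set)."""
    return sum(base_kernel(x, y) for x in Xm for y in Xn)


# Two sets of different sizes: 3 and 2 local feature vectors.
Xm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Xn = [[1.0, 0.0], [2.0, 2.0]]
value = summation_kernel(Xm, Xn, gaussian_kernel)
```

The double loop makes the cost explicit: the base kernel is evaluated once for each of the 3 x 2 = 6 pairs, which becomes expensive for long videos.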

In the intermediate matching kernel (IMK), the two sets of local feature vectors $X_m$ and $X_n$ are matched through a set of $Q$ virtual feature vectors. For each of the $Q$ virtual feature vectors, the closest local feature vector is found in each of the two sets. A base kernel is then computed between the two local feature vectors, one from each set, that are closest to the given virtual feature vector. The IMK between the two sets of local feature vectors is the sum of these $Q$ base kernels.

For the $q$th virtual feature vector $v_q$, the two local feature vectors $x^*_{mq}$ and $x^*_{nq}$ closest to $v_q$ are selected from $X_m$ and $X_n$, respectively. Finding this closest pair requires $T_m + T_n$ computations of the distance $D$ between a local feature vector and the virtual feature vector. For a fixed number of virtual feature vectors, the total complexity can therefore be written as $O(T)$, where $T$ is the cardinality of the larger set of local feature vectors.
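The selection and summation steps can be sketched as follows. This is an illustrative version under stated assumptions: the virtual feature vectors are hand-picked here, closeness is measured by squared Euclidean distance, and the base kernel is a Gaussian with an assumed width parameter delta.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def gaussian_kernel(x, y, delta=0.5):
    """Gaussian base kernel; delta is an assumed width parameter."""
    return math.exp(-delta * sq_dist(x, y))


def closest(X, v):
    """Local feature vector of X nearest to the virtual feature vector v."""
    return min(X, key=lambda x: sq_dist(x, v))


def imk(Xm, Xn, virtual_vectors, base_kernel):
    """Sum of base kernels over the pairs selected by each virtual vector."""
    total = 0.0
    for v in virtual_vectors:
        xm_star = closest(Xm, v)   # Tm distance computations
        xn_star = closest(Xn, v)   # Tn distance computations
        total += base_kernel(xm_star, xn_star)
    return total


virtual_vectors = [[0.0, 0.0], [10.0, 10.0]]   # hypothetical, Q = 2
Xm = [[0.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
Xn = [[1.0, 0.0], [10.0, 11.0]]
value = imk(Xm, Xn, virtual_vectors, gaussian_kernel)
```

Only Q base kernel evaluations are needed, compared with one evaluation per pair of local feature vectors in the summation kernel.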

A Gaussian kernel is used as the base kernel for the pair of local feature vectors, one from each set, selected as closest to each virtual feature vector. If the GMM has $Q$ components, a total of $2Q$ local feature vectors are selected, namely the one closest to each of the $Q$ components from each of the two sets. Let $X_m$ and $X_n$ be two sequences of local feature vectors, let $\lambda$ be the set of parameters of an HMM, and let $x^*_{miq}$ and $x^*_{niq}$ be the local feature vectors selected from $X_m$ and $X_n$, respectively, for the $q$th GMM component in state $i$.

The sum of the IMKs over all states, each computed with the state-specific GMM, gives the CSHIMK between the two sequences of local feature vectors. The total number of computations required to select the local feature vectors in all states of the HMM can thus be written as PN.
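The state-wise summation can be sketched as below. Several simplifications are assumed and are not taken from the thesis: each frame is taken to be already assigned to an HMM state (for example by an alignment step that is not shown), the component means of each state's GMM stand in as the virtual feature vectors, selection uses squared Euclidean distance to those means, and a state that is unused by either sequence contributes nothing. All names are hypothetical.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def state_wise_imk(Xm, states_m, Xn, states_n, state_means, delta=0.5):
    """Sum over HMM states of an IMK built from that state's GMM means.

    states_m[t] is the state index assigned to frame Xm[t] (assumed given).
    state_means[i] lists the component means of the GMM of state i.
    """
    total = 0.0
    for i, means in enumerate(state_means):
        seg_m = [x for x, s in zip(Xm, states_m) if s == i]
        seg_n = [x for x, s in zip(Xn, states_n) if s == i]
        if not seg_m or not seg_n:
            continue  # assumption: an unused state adds nothing
        for mu in means:
            xm_star = min(seg_m, key=lambda x: sq_dist(x, mu))
            xn_star = min(seg_n, key=lambda x: sq_dist(x, mu))
            total += math.exp(-delta * sq_dist(xm_star, xn_star))
    return total


# Two states with two mixture components each (hypothetical means).
state_means = [[[0.0], [1.0]], [[5.0], [6.0]]]
Xm, states_m = [[0.0], [1.0], [5.0], [6.0]], [0, 0, 1, 1]
Xn, states_n = [[0.0], [5.0], [5.5], [6.0], [7.0]], [0, 1, 1, 1, 1]
value = state_wise_imk(Xm, states_m, Xn, states_n, state_means)
```

The two sequences have four and five frames, yet the kernel value is always a sum over the same number of state and component pairs.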

Figure 2.1: Intermediate Matching Kernel

Features And Approach

To evaluate the proposed approach, experiments were performed on standard datasets: KTH [2], UCF11 [28], UCF101 [29], and HMDB51 [30]. The HMMs were constructed using the training examples, and the training and testing kernels were built in a one-vs-rest fashion. Experiments were performed by varying the number of HMM states, with two mixtures per state.
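A one-vs-rest SVM on precomputed kernel matrices can be sketched with scikit-learn as follows. This is not the experimental code of the thesis: the data are toy fixed-length vectors, a linear kernel stands in for the HMM-based IMK, and the value of C is an arbitrary assumption. The point is only the shape of the matrices, namely train-versus-train for fitting and test-versus-train for prediction.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical stand-ins for videos; in the thesis setting the kernel
# values would come from the HMM-based IMK instead of a dot product.
train_feats = np.array([[5.0, 0.0, 0.0], [4.0, 0.0, 0.0],
                        [0.0, 5.0, 0.0], [0.0, 4.0, 0.0],
                        [0.0, 0.0, 5.0], [0.0, 0.0, 4.0]])
train_labels = np.array([0, 0, 1, 1, 2, 2])
test_feats = np.array([[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 6.0]])

# Gram matrices: train-vs-train for fitting, test-vs-train for prediction.
K_train = train_feats @ train_feats.T    # shape (6, 6)
K_test = test_feats @ train_feats.T      # shape (3, 6)

# One binary SVM per action class, each trained against all other classes.
clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1.0))
clf.fit(K_train, train_labels)
predicted = clf.predict(K_test)
```

With a dynamic kernel, only the two matrix computations change; the classifier code stays the same.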

From these experiments it was observed that the seven-state model with two mixtures per state gave the best performance.

UCF11

Results

UCF101

Results

HMDB51

Experiments were carried out on all the distributions, and the best results are presented in Figure 4.7.

In this thesis, a short study of dynamic kernels for matching patterns of different lengths was carried out, and the HMM-based intermediate matching kernel was used to classify human actions in videos. The idea of using an SVM-based classifier with the HMM-based IMK was evaluated on standard datasets, namely KTH, UCF11, UCF101, and HMDB51; the results are promising and come close to state-of-the-art results.

It can also be noted that weighted dynamic kernels built using the posterior probabilities may be more accurate in classification. It was also observed that the computation time for building the HMM-based IMK is much lower than that of other dynamic kernels such as the Fisher kernel and the summation kernel. High-level features such as the Motion Boundary Histogram (MBH), which have recently proved useful, may further increase classification accuracy.

Class-Specific GMM-Based Intermediate Matching Kernel for Classifying Long-Duration Different-Length Speech Patterns Using Support Vector Machines.
HMM-Based Intermediate Matching Kernel for Classifying Sequential Speech Patterns Using Support Vector Machines.