ACTION RECOGNITION USING LOCAL FEATURES
4.3 Interest Point Detector for Videos based on Differential Motion
4.3.5 Qualitative comparison of interest point detectors
We have compared the visual quality of the proposed detector with state-of-the-art methods. As shown in Fig. 4.12, the proposed interest points appear only on relative motion boundaries. Other methods such as Mo-SIFT, n-SIFT, and Cuboid give very sparse points, some of which also lie on the background. The proposed method gives interest points that are neither too sparse nor too dense.
4.3.5.1 Comparison of interest point detectors and descriptors for action recognition
We have compared the proposed interest point detector with state-of-the-art detectors, in combination with various descriptors, as shown in Table 4.6 and Table 4.7. For this comparison, we performed human action classification on two simple datasets, KTH and Weizmann.
We used HOG, HOF, their combination, and the location feature proposed in Section 4.3.3.1 to assess the relative contribution of each feature. The features were quantized using k-means with a codebook size of 2,000 and used in a bag-of-words model. A support vector machine was trained using LibSVM [11] for classification.
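As a concrete illustration of this baseline, the sketch below quantizes pooled training descriptors with k-means (codebook size 2,000), builds per-video bag-of-words histograms, and trains an SVM. The scikit-learn calls, the kernel choice, and the array layout are our own assumptions for illustration, not the original implementation.

```python
# Minimal bag-of-words baseline sketch (assumptions: scikit-learn,
# RBF kernel, descriptors stored as (N, D) NumPy arrays per video).
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

CODEBOOK_SIZE = 2000

def build_codebook(train_descriptors):
    """train_descriptors: (N, D) array of local descriptors pooled
    from all training videos."""
    kmeans = MiniBatchKMeans(n_clusters=CODEBOOK_SIZE, random_state=0)
    kmeans.fit(train_descriptors)
    return kmeans

def bow_histogram(kmeans, video_descriptors):
    """Hard-assign each descriptor to its nearest codeword and return an
    L1-normalized histogram of codeword counts for one video."""
    words = kmeans.predict(video_descriptors)
    hist = np.bincount(words, minlength=CODEBOOK_SIZE).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_classifier(X_train, y_train):
    """X_train: list of per-video descriptor arrays; y_train: labels."""
    kmeans = build_codebook(np.vstack(X_train))
    H = np.array([bow_histogram(kmeans, d) for d in X_train])
    clf = SVC(kernel="rbf", C=10.0)   # LibSVM-backed SVC (kernel is an assumption)
    clf.fit(H, y_train)
    return kmeans, clf
```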
It is evident from Table 4.6 and Table 4.7 that HOF performs slightly better than HOG, while their combination performs much better than either on the KTH dataset and slightly better on the Weizmann dataset. When the location feature is added, the results improve substantially, especially on the Weizmann dataset. Going forward, we use HOG, HOF, and location features. In later experiments, we also optimize the codebook size and use a GMM-based feature computation instead of k-means, which further improves the recognition rates.
4.3.5.2 Application of the proposed video representation to action recognition
The combination of an interest point detector and descriptor has many applications, such as human action recognition, event detection, and video retrieval. We tested the utility of the proposed method for human action recognition and compared it with state-of-the-art methods. Human action recognition can also be useful for event detection and video retrieval tasks. We considered the KTH, UCF11, and UCF101 datasets for this application; Weizmann is a very small dataset on which very high accuracies have already been achieved by others.
The framework for human action recognition includes feature extraction, codebook generation, and classification. In [76], best practices were described for action recognition using a bag-of-words model, which we have followed here. Feature extraction is done using HOG-HOF, which are widely used owing to their ease of use and good performance. The descriptors (HOG-HOF-loc) were concatenated and PCA-whitened to ensure that the features have the same variance across dimensions. For codebook generation, either k-means or a GMM is generally used. K-means performs hard assignment of a feature to a codeword, while a GMM makes a soft assignment of a feature to each mixture component based on posterior probabilities. Moreover, unlike k-means, a GMM delivers not only the means of the codewords but also the shape of their distribution. We therefore generated the codebook using a Gaussian mixture model (GMM), a generative model that describes a distribution as an additive mixture of Gaussian distributions with different means, variances, and mixing coefficients:
p(x;\theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\mu_k,\Sigma_k)    (4.11)

where $K$ is the number of mixture components, $\theta = \{\pi_1,\mu_1,\Sigma_1,\ldots,\pi_K,\mu_K,\Sigma_K\}$ is the set of model parameters, and $\mathcal{N}(x;\mu_k,\Sigma_k)$ is an $M$-dimensional Gaussian distribution. The optimal GMM parameters are learned through maximum likelihood estimation as $\arg\max_{\theta} \ln p(X;\theta)$ for a given feature set $X = \{x_1, x_2, \ldots, x_M\}$.
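The sketch below illustrates this codebook step under our own assumptions (scikit-learn, diagonal covariances, a nominal PCA dimension): the concatenated HOG-HOF-loc descriptors are PCA-whitened and a GMM as in Eq. 4.11 is fitted by maximum likelihood (EM).

```python
# Codebook-generation sketch: PCA-whitening followed by a GMM fit.
# (Assumptions: scikit-learn, diagonal covariances, pca_dim=64.)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_codebook(descriptors, n_components=256, pca_dim=64):
    """descriptors: (N, D) array of concatenated HOG-HOF-loc features
    sampled from the training videos."""
    # PCA-whitening so every retained dimension has unit variance
    pca = PCA(n_components=pca_dim, whiten=True, random_state=0)
    whitened = pca.fit_transform(descriptors)

    # Diagonal-covariance GMM; EM maximizes ln p(X; theta) as in Eq. 4.11
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(whitened)
    return pca, gmm
```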
Figure 4.14: Recognition accuracy (%) versus GMM size for KTH, UCF11, and UCF101.
After fitting the GMM, feature encoding was done using the super-vector method described in [76] to obtain the final representation of each video. There are different ways to implement a super vector; one of them is the Fisher vector, which uses the Fisher kernel to combine the benefits of generative and discriminative approaches. As suggested in [76], we also applied l2-normalization. After encoding, a linear SVM was used for action classification.
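A minimal sketch of this encoding stage is given below, following the standard Fisher-vector formulation for a diagonal-covariance GMM; the helper names and the linear-SVM parameters are illustrative assumptions rather than the exact implementation used here.

```python
# Fisher-vector ("super vector") encoding sketch: aggregate GMM posteriors
# into gradients w.r.t. component means and variances, concatenate, and
# l2-normalize, then classify with a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def fisher_vector(gmm, x):
    """x: (N, D) whitened descriptors of one video; gmm: fitted
    diagonal-covariance GaussianMixture. Returns a 2*K*D vector."""
    N, D = x.shape
    gamma = gmm.predict_proba(x)                 # (N, K) posteriors
    pi, mu = gmm.weights_, gmm.means_            # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)            # (K, D) std deviations

    diff = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(axis=0)
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0)

    g_mu /= N * np.sqrt(pi)[:, None]
    g_sigma /= N * np.sqrt(2 * pi)[:, None]

    fv = np.hstack([g_mu.ravel(), g_sigma.ravel()])
    return fv / max(np.linalg.norm(fv), 1e-12)   # l2-normalization

def train_action_classifier(gmm, X_videos, y):
    """X_videos: list of per-video whitened descriptor arrays."""
    F = np.array([fisher_vector(gmm, x) for x in X_videos])
    clf = LinearSVC(C=100.0)                     # C is an assumed value
    clf.fit(F, y)
    return clf
```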
To generate the codebook, we randomly sampled 256,000 features from the training set to learn the GMMs. The number of Gaussian components ranged from 16 to 512. The comparison across GMM sizes is shown in Fig. 4.14. For KTH and UCF11 the recognition accuracy saturated at 256 components, whereas for UCF101 it kept increasing up to 512, presumably due to the greater diversity of action classes in that dataset.
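One possible shape of this experiment is sketched below, reusing the hypothetical fit_codebook and fisher_vector helpers from the earlier sketches; the data-handling variables (all_train_descriptors, train_videos, test_videos, y_train, y_test) are placeholders, not names from the actual code.

```python
# GMM-size sweep sketch: subsample ~256k training descriptors, fit GMMs of
# increasing size, and record test accuracy of a linear SVM on the Fisher vectors.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
sample = all_train_descriptors[rng.choice(len(all_train_descriptors),
                                          size=256_000, replace=False)]

for k in (16, 32, 64, 128, 256, 512):
    pca, gmm = fit_codebook(sample, n_components=k)
    train_fv = np.array([fisher_vector(gmm, pca.transform(d)) for d in train_videos])
    test_fv = np.array([fisher_vector(gmm, pca.transform(d)) for d in test_videos])
    clf = LinearSVC(C=100.0).fit(train_fv, y_train)
    print(k, clf.score(test_fv, y_test))
```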
After selecting the optimal GMM size, we compared the performance of our method with state-of-the-art action classification methods from the literature; the comparison is shown in Table 4.8. The proposed method outperformed all the state-of-the-art methods on the KTH and UCF11 datasets. For UCF101, the results were comparable to CNN-based (deep learning) methods, with much faster training and testing.
Table 4.8: Action recognition accuracy of our and state-of-the-art methods.
Method                                      KTH      UCF11    UCF101
Proposed method                             99%      94%      84%
Yadav, Shukla, and Sethi [118]              98.2%    91.3%    –
Jhuang et al. [43]                          96.00%   –        –
Gilbert, Illingworth, and Bowden [32]       94.50%   –        –
Laptev et al. [57]                          92%      –        –
Wang et al. [106]                           95%      85%      –
Peng et al. [77]                            93%      67%      –
Ballan et al. [2]                           92.66%   –        –
Zhu et al. [126]                            –        89%      –
Wang, Qiao, and Tang [110]                  –        –        92%
Karpathy et al. [48] (CNN)                  –        –        63%
Shi et al. [91] (CNN)                       96.8%    –        92.2%
Sapienza, Cuzzolin, and Torr [86]           –        87%      –
Kovashka and Grauman [53]                   95%      –        –
Simonyan and Zisserman [93] (CNN)           –        –        88%
Shuiwang et al. [44] (CNN)                  90%      –        –
Mahdyar et al. [81] (CNN)                   –        90%      –
Wong and Cipolla [114]                      87.7%    –        –
Niebles, Wang, and Fei-Fei [73]             81.5%    –        –
Klaser, Marszalek, and Schmid [50]          91.4%    –        –
Willems, Tuytelaars, and Van Gool [112]     84.3%    –        –
Liu and Shah [65]                           93.8%    –        –
Yang, Wang, and Mori [119]                  75.71%   –        –
Sun, Chen, and Hauptmann [99]               94%      –        –
Jia Liu et al. [66]                         73.5%    –        –
Fei Hu et al. [40]                          74.4%    –        –
Samanta and Chanda [85]                     93.51%   –        –
Schuldt, Laptev, and Caputo [89]            71.83%   –        –
Wang et al. [107]                           –        84.20%   –
Mota et al. [71]                            –        75.40%   –
Ikizler-Cinbis and Sclaroff [41]            –        75.21%   –
Liu, Luo, and Shah [64]                     –        71.20%   –
For UCF101, our method takes less than 3 hours to train without using a GPU, whereas the CNNs take days to train on the same machine even when using a GPU.