Online Object Tracking using L1/L2 Sparse Coding and Multi Scale Max Pooling
1V.S.R Kumari, 2K.Srinivasarao
1Professor & HOD, Dept. of ECE, Sri Mittapalli College of Engineering, Guntur, A.P, India.
2PG Student (M. Tech), Dept. of ECE, Sri Mittapalli College of Engineering, Guntur, A.P, India.
Email: [email protected]
Abstract: Online object tracking is a challenging problem in computer vision due to the difficulty of accounting for appearance changes of the object. For efficient and robust tracking, we propose a novel online object tracking algorithm, SPCA (Sparse Prototype based Principal Component Analysis). To represent an object with sparse prototypes for PCA reconstruction, we introduce l1/l2 sparse coding and multi-scale max pooling. To reduce tracking drift, we present a method that takes occlusion and motion blur into account rather than simply including every image observation in the model update, and we compare the proposed method with several existing methods such as IVT, MIL, PN, VTD, Frag, and l1 minimization in terms of the overlap rate and center error parameters.
Keywords: PCA, Frag Tracker, l1/l2 sparse coding, object Representation, Object Tracking, Sparse Prototype, Overlap Rate and Center Error.
I. INTRODUCTION
Tracking is, simply, the problem of estimating the trajectory of an object in the image plane as it moves around a scene. Object tracking is one of the major problems in the field of computer vision and plays a critical role in applications such as motion-based recognition, automated surveillance, video indexing, human-computer interaction (HCI), traffic monitoring, and vehicle navigation [1].
A tracking system consists of three components: 1. Appearance or observation model: evaluates the observed image patch, i.e., decides whether the observed image patch belongs to the object class or not. 2. Dynamic or motion model: describes the state of the object over time. 3. Search strategy model: aims to find the most likely states in the current frame given the previous frame.
In this paper, we propose a new robust and efficient appearance model that takes occlusion and motion blur into account.
The main challenge in developing a robust tracking algorithm is to account for large appearance variations of the target object and background over time. In this paper, we tackle this problem with both prior and online
visual information. A novel tracking algorithm is used to account for the appearance variations.
The rest of the paper is organized as follows: Section II reviews related work, Section III describes the tracking algorithm, Section IV presents the experimental results, and Section V concludes the paper.
II. RELATED WORK
Dong Wang et al. proposed a novel Incremental Multiple Principal Component Analysis (IMPCA) method for online learning of dynamic tensor streams.
When a newly added tensor set arrives, the mean tensor and the covariance matrices of different modes can be updated easily, and the projection matrices can then be computed effectively from the covariance matrices [2].
Amit Adam and Ehud Rivlin proposed a method in which every patch votes on the possible positions and scales of the object in the current frame by comparing its histogram with the corresponding image patch histogram. They minimize a robust statistic in order to combine the vote maps of the multiple patches [3].
Xiaoyu Wang et al. proposed that the occlusion likelihood map be segmented by a mean-shift approach. The segmented portion of the window with a majority of negative responses is inferred as an occluded region. If partial occlusion is indicated with high likelihood in a certain scanning window, part detectors are applied on the unoccluded regions to achieve the final classification for the current scanning window [4].
Thang Ba Dinh et al. proposed a generative model that encodes all of the appearance variations using a low-dimensional subspace, which helps provide strong reacquisition ability. Meanwhile, the discriminative classifier, an online support vector machine, focuses on separating the object from the background using a Histogram of Oriented Gradients (HOG) feature set [5].
Jakob Santner et al. showed that augmenting an online learning method, such as semi-supervised or multiple-instance learning, with complementary tracking approaches can lead to more stable results. In particular, they used a simple template model as a non-adaptive and thus stable component, a novel optical-flow-based mean-shift tracker as a highly adaptive element, and an online random forest as a moderately adaptive appearance-based learner [6].
Mark Everingham et al. described the dataset and evaluation procedure, reviewed the state-of-the-art among the evaluated methods for both classification and detection, and analysed whether the methods are statistically different, what they learn from the images (e.g., the object or its context), and what the methods find easy or confusing [7].
J. Kwon et al. proposed a novel tracking algorithm that works robustly in challenging scenarios where several kinds of appearance and motion changes of an object occur at the same time. The algorithm is based on a visual tracking decomposition scheme for the efficient design of observation and motion models as well as trackers [8].
J. Mairal et al. noted that modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing.
For signals such as natural images that admit sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools [9].
J. Yang et al. observed that research on image statistics suggests image patches can be well represented as a sparse linear combination of elements from an appropriately chosen over-complete dictionary. Inspired by this observation, they seek a sparse representation for each patch of the low-resolution input and then use the coefficients of this representation to generate the high-resolution output. Theoretical results from compressed sensing suggest that, under mild conditions, the sparse representation can be correctly recovered from the down-sampled signals [10].
J. Wright et al. proposed that the choice of dictionary plays a key role in bridging this gap: unconventional dictionaries consisting of, or learned from, the training samples themselves provide the key to obtaining state-of-the-art results.
In the l1 tracker, a candidate image patch is represented as
$y = Cz + \varepsilon$  (i)
where $y$ denotes an observation vector, $C$ represents a matrix of templates, $z$ indicates the corresponding coefficients, and $\varepsilon$ is the error term, which can be viewed as the coefficients of the trivial templates.
By assuming that each candidate image patch is sparsely represented by a set of target and trivial templates, Eq. (i) can be solved via l1 minimization as
$\min_{c} \frac{1}{2}\lVert y - Dc \rVert_2^2 + \lambda \lVert c \rVert_1$  (ii)
where $D = [C, I]$ stacks the target and trivial templates, $c = [z^{\top}, \varepsilon^{\top}]^{\top}$, and $\lVert \cdot \rVert_1$ and $\lVert \cdot \rVert_2$ denote the l1 and l2 norms, respectively.
The underlying assumption of this approach is that the error $\varepsilon$ can be modeled by arbitrary but sparse noise, and therefore it can be used to handle partial occlusion.
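For illustration only, the following Python sketch shows how an l1-regularized least-squares problem of the form in Eq. (ii) can be solved with iterative shrinkage-thresholding (ISTA). This is a generic solver under the stated assumptions, not necessarily the optimization procedure used by the l1 tracker; the names (ista_l1, soft_threshold) and the toy dimensions are purely illustrative.

```python
# Illustrative ISTA solver for an l1-regularized least-squares problem:
#   min_c 0.5 * ||y - D c||_2^2 + lam * ||c||_1
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_l1(D, y, lam, n_iter=200):
    """Iterative shrinkage-thresholding for the objective above."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth term's gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - y)         # gradient of 0.5*||y - Dc||^2
        c = soft_threshold(c - grad / L, lam / L)
    return c

# Toy usage: D stacks a few target templates and the trivial (identity) templates.
d, n_templates = 32, 10
rng = np.random.default_rng(0)
D = np.hstack([rng.standard_normal((d, n_templates)), np.eye(d)])
y = rng.standard_normal(d)
c = ista_l1(D, y, lam=0.1)
```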
However, the l1 tracker has two main drawbacks.
1. The l1 tracker is computationally expensive, which reduces the speed of operation.
2. It does not exploit rich and redundant image properties which can be captured compactly with subspace representations.
To overcome these problems, we introduce a new algorithm based on SPCA.
III. TRACKING ALGORITHM
In this paper, object tracking is considered as a Bayesian inference task in a Markov model with hidden state variables. Given a set of observed images $Y_t = \{y_1, y_2, \ldots, y_t\}$ up to the $t$-th frame, the hidden state variable $x_t$ is estimated recursively as
$p(x_t \mid Y_t) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Y_{t-1})\, dx_{t-1}$  (iii)
where $p(x_t \mid x_{t-1})$ represents the dynamic (motion) model between two consecutive states, and $p(y_t \mid x_t)$ denotes the observation model that estimates the likelihood of observing $y_t$ at state $x_t$. The optimal state of the tracked target, given all the observations up to the $t$-th frame, is obtained by maximum a posteriori estimation over $N$ samples at time $t$:
$\hat{x}_t = \arg\max_{x_t^{i}}\ p(y_t \mid x_t^{i})\, p(x_t^{i} \mid x_{t-1})$ for $i = 1, 2, \ldots, N$  (iv)
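The maximization in Eq. (iv) is carried out over N candidate states drawn from the dynamic model. The Python sketch below assumes a simple Gaussian perturbation as the dynamic model and takes the likelihood as an arbitrary callable; the names and parameter values are illustrative, while the actual tracker uses affine motion parameters and the observation model described in the rest of this section.

```python
# Minimal particle-based MAP estimation, as in Eq. (iv) (illustrative sketch).
import numpy as np

def track_frame(prev_state, observe_likelihood, n_particles=600, sigma=4.0):
    """Sample candidate states around the previous state and keep the MAP one."""
    rng = np.random.default_rng()
    # Dynamic model p(x_t | x_{t-1}): Gaussian perturbation of the previous state.
    particles = prev_state + sigma * rng.standard_normal((n_particles, prev_state.size))
    # Observation model p(y_t | x_t^i): likelihood of the image patch at each state.
    weights = np.array([observe_likelihood(p) for p in particles])
    return particles[np.argmax(weights)]     # maximum a posteriori sample

# Example with a dummy likelihood that prefers states near the origin:
x_map = track_frame(np.array([120.0, 80.0]), lambda s: np.exp(-np.sum(s ** 2) / 1e4))
```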
If no occlusion occurs, an image observation yt can be assumed to be generated from a subspace of the target object spanned by U and centered at μ. However, it is necessary to account for partial occlusion in an appearance model for robust object tracking. We assume that a centered image observation yt − μ of the tracked object can be represented by a linear combination of the PCA basis vectors U and a few elements of the identity matrix I (i.e., trivial templates). If there is no occlusion, the most likely image patch can be effectively represented by the PCA basis vectors alone, and the trivial coefficients tend to be zero. If partial occlusion occurs, the most likely image patch can be represented as a linear combination of the PCA basis vectors and a small number of trivial templates [13].
For each observation corresponding to a predicted state, we solve the following problem efficiently using the proposed algorithm:
$L(z_i, e_i) = \min_{z_i, e_i}\ \frac{1}{2}\lVert \bar{y}_i - U z_i - e_i \rVert_2^2 + \lambda \lVert e_i \rVert_1$  (vi)
and obtain $z_i$ and $e_i$, where $i$ denotes the $i$-th sample of the state $x_t$ and $\bar{y}_i = y_i - \mu$ is the centered observation.
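As a rough illustration of how a problem of the form in Eq. (vi) can be solved, the sketch below alternates a closed-form update of z (assuming the PCA basis U has orthonormal columns, so that U^T U = I) with soft-thresholding of the residual to obtain the sparse error e. This is a plausible block-coordinate scheme under those assumptions, not necessarily the exact algorithm used in the proposed tracker.

```python
# Alternating minimization sketch for 0.5*||y_bar - U z - e||^2 + lam*||e||_1.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_prototype_code(U, y_bar, lam, n_iter=20):
    """Minimize over z and e, assuming U has orthonormal columns."""
    e = np.zeros_like(y_bar)
    for _ in range(n_iter):
        z = U.T @ (y_bar - e)                    # closed-form least-squares coefficients
        e = soft_threshold(y_bar - U @ z, lam)   # sparse "trivial" coefficients
    return z, e
```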
The observation likelihood can be measured by the reconstruction error of each observed image patch:
$p(y_i \mid x_i) = \exp\left(-\lVert \bar{y}_i - U z_i \rVert_2^2\right)$  (vii)
However, Eq. (vii) does not consider occlusion. Thus, we use a mask to factor out the non-occluded and occluded parts:
$p(y_i \mid x_i) = \exp\left(-\left(\lVert w_i \odot (\bar{y}_i - U z_i) \rVert_2^2 + \beta \lVert 1 - w_i \rVert_1\right)\right)$  (viii)
where $w_i$ is a binary occlusion mask, $\odot$ denotes element-wise multiplication, and $\beta$ penalizes the occluded pixels.
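The following sketch computes an occlusion-aware likelihood of the form in Eq. (viii), deriving the binary mask from the trivial coefficients e by thresholding. The threshold value, the default beta, and the function name are assumptions made for illustration.

```python
# Occlusion-aware observation likelihood (illustrative sketch of Eq. (viii)).
import numpy as np

def observation_likelihood(U, z, e, y_bar, beta=0.05, occ_thresh=1e-3):
    w = (np.abs(e) < occ_thresh).astype(float)        # 1 = unoccluded pixel, 0 = occluded
    recon_err = np.sum((w * (y_bar - U @ z)) ** 2)    # reconstruction error on unoccluded pixels
    occ_penalty = beta * np.sum(1.0 - w)              # penalty for occluded pixels
    return np.exp(-(recon_err + occ_penalty))
```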
C. Update of Observation Model
The change of the target object during visual tracking is handled by updating the observation model. The trivial coefficients are used for occlusion detection: each trivial coefficient vector corresponds to a 2D map obtained by a reverse raster scan of the image patch. This map is referred to as the occlusion map, and a non-zero element of it indicates that the corresponding pixel is occluded. The ratio η of the number of non-zero pixels to the total number of pixels in the occlusion map is calculated, and two thresholds tr1 and tr2 are used to describe the degree of occlusion. If η < tr1, the model is updated directly. If tr1 < η < tr2, the target is partially occluded; the occluded pixels are replaced by the corresponding parts of the average observation μ, and this recovered sample is used for the update. Otherwise, if η > tr2, a significant part of the target object is occluded, and the sample is discarded without an update [14].
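A minimal sketch of this update rule follows; the threshold values tr1 and tr2 and the coefficient threshold are placeholders chosen only for illustration, not the values used in the experiments.

```python
# Occlusion-aware model update decision (illustrative sketch of Section III-C).
import numpy as np

def prepare_update_sample(y, e, mu, tr1=0.1, tr2=0.6, occ_thresh=1e-3):
    occluded = np.abs(e) >= occ_thresh           # occlusion map from trivial coefficients
    eta = occluded.mean()                        # ratio of occluded pixels
    if eta < tr1:                                # little or no occlusion
        return y                                 # update with the raw observation
    if eta < tr2:                                # partial occlusion
        recovered = y.copy()
        recovered[occluded] = mu[occluded]       # replace occluded pixels by the mean observation
        return recovered
    return None                                  # heavy occlusion: skip the update
```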
IV. EXPERIMENTAL RESULTS
The proposed model is simulated using MATLAB, and its performance is evaluated by comparison with several tracking algorithms: the incremental visual tracker (IVT), the L1 tracker (L1T), the MIL tracker (MILT), the visual tracking decomposition tracker (VTD), the PN tracker, and the Frag tracker.
A. Occlusion
The IVT, L1T, MILT, and VTD methods do not perform well, whereas the Frag algorithm performs slightly better. In contrast, our proposed method performs better, as shown in the figures below.
Figure 1: Qualitative evaluation: objects undergo heavy occlusion and pose change. Similar objects also appear in the scenes
Figure 2: Center error for the video clips/image sequences tested
Table 1: Average Center Error (Pixels)
Database IVT l1 PN VTD MIL FragTrack SPCA
Person 1 11.45 8.75 19.75 13.35 34.55 7.85 6.65
Person 2 7.95 8.85 16.35 8.15 11.85 13.25 8.15
Person 3 46.675 121.475 7.175 5.475 50.075 7.275 50.775
Person 4 11.867 6.467 11.767 7.967 73.567 8.867 33.767
Person 5 8.534 9.734 24.434 17.934 65.734 184.634 8.734
Car scene 1 10.75 6.85 34.95 6.35 17.45 24.25 98.25
Car scene 2 13.11 9.21 37.31 8.71 19.81 26.61 100.61
Car scene 3 8.514 4.614 32.714 4.114 15.214 22.014 96.014
Car scene 4 16.764 12.864 40.964 12.364 23.464 30.264 104.264
Average 15.06 20.97 25.046 9.37 34.63 36.11 56.35
Table 2: Overlap Rate of tracking
Database IVT l1 PN VTD MIL FragTrack SPCA
Person 1 0.5 0.7 0.45 0.5 0.64 0.67 0.45
Person 2 0.28 0.28 0.70 0.83 0.25 0.68 0.28
Person 3 0.82 0.8 0.56 0.72 0.52 0.82 0.85
Person 4 0.5 0.7 0.45 0.5 0.64 0.67 0.45
Person 5 0.92 0.84 0.64 0.73 0.34 0.22 0.94
Car scene 1 0.85 0.88 0.65 0.77 0.59 0.90 0.92
Car scene 2 0.59 0.67 0.49 0.59 0.61 0.60 0.49
Car scene 3 0.28 0.28 0.70 0.83 0.25 0.68 0.28
Car scene 4 0.45 0.81 0.66 0.67 0.25 0.56 0.30
Average 0.54 0.48 0.54 0.56 0.39 0.33 0.51
C. Quantitative Evaluation
Performance evaluation is an important issue that requires sound criteria in order to fairly assess the strength of tracking algorithms. Quantitative evaluation of object tracking typically involves computing the difference between the predicted and ground-truth center locations, as well as their average values. Table 1 summarizes the results in terms of average tracking errors; our algorithm achieves the lowest tracking error on several of the sequences.
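For reference, the two metrics reported in Tables 1 and 2 can be computed as in the sketch below, which assumes bounding boxes in (x, y, width, height) format; the overlap rate is the standard PASCAL-style intersection-over-union.

```python
# Center location error and overlap rate between tracked and ground-truth boxes.
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance (pixels) between the centers of two (x, y, w, h) boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return float(np.linalg.norm(ca - cb))

def overlap_rate(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    xa1, ya1, xa2, ya2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb1, yb1, xb2, yb2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```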
V. CONCLUSION
This paper presents a robust tracking algorithm based on a sparse prototype PCA representation. In this work, we take partial occlusion and motion blur into account for appearance update and object tracking by exploiting the strengths of subspace models and sparse representation.
Several experiments are carried out on different image sequences, and the proposed tracking algorithm is compared with several state-of-the-art techniques; our algorithm gives better tracking results. In the future, our representation scheme can be extended to other vision problems, including object recognition, and other online orthogonal subspace methods can be developed with the proposed model.
REFERENCES
[1] T. B. Dinh, "Co-training Framework of Generative and Discriminative Trackers with Partial Occlusion Handling," in Proc. IEEE Workshop on Applications of Computer Vision (WACV), Jan. 2011, pp. 642–649.
[2] D. Wang and H. Lu, "Incremental MPCA for Color Object Tracking," IEEE, 2010.
[3] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[4] J. Kwon and K. Lee, "Visual tracking decomposition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 1269–1276.
[5] J. Santner and C. Leistner, "PROST: Parallel Robust Online Simple Tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2010, pp. 723–730.
[6] J. Mairal, F. Bach, and J. Ponce, "Task-driven dictionary learning," INRIA, Rocquencourt, France, Tech. Rep. 7400, 2010.
[7] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[8] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proc. IEEE, vol. 98, no. 6, pp. 1031–1044, Jun. 2010.
[9] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 983–990.
[10] X. Wang, "An HOG-LBP Human Detector with Partial Occlusion Handling," in Proc. IEEE 12th Int. Conf. Computer Vision (ICCV), 2009.
[11] X. Li and W. Hu, "Robust visual tracking based on incremental tensor subspace learning," in Proc. IEEE Int. Conf. Comput. Vis., 2007, pp. 1–8.
[12] A. Adam and E. Rivlin, "Robust Fragments-based Tracking using the Integral Histogram," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2006, pp. 798–805.
[13] S. Avidan, "Ensemble tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2005, vol. 2, pp. 494–501.
[14] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–575, 2003.
[15] M. Black and A. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," in Proc. Eur. Conf. Comput. Vis., 1996, pp. 329–342.