For example, visual object tracking has been successfully applied to monitoring human activities in residential areas, parking lots, and banks (e.g., the W4 system [Haritaoglu et al. 2000] and the VSAM project [Collins et al. 2000]). In the field of traffic transportation, it is also widely used for traffic flow monitoring [Coifman et al.]. Visual object tracking also has several human-computer interaction applications, such as hand gesture recognition [Pavlovic et al.].
Several surveys of visual object tracking (e.g., [Hu et al. 2004; Arulampalam et al. 2002]) have been conducted to investigate advanced tracking algorithms and their potential applications, as listed in Table I. Yilmaz et al. [2006] divide visual object tracking into three categories: point tracking, kernel tracking, and silhouette tracking (see Figure 7 of [Yilmaz et al. 2006] for details); the survey by Cannons [2008] provides a very detailed and comprehensive review of nearly every problem in visual object tracking.
To capture the correlation information of object appearance, covariance matrix representations have been proposed for visual representation [Porikli et al. 2006]. According to the underlying Riemannian metric [Li et al. 2012], covariance matrix representations can be divided into two branches: those based on affine-invariant Riemannian metrics and those based on log-Euclidean Riemannian metrics. (i) Affine-invariant Riemannian metrics are adopted in [Porikli et al. 2006]. Following the work of Porikli et al. [2006], Austvoll and Kwolek [2010] use the covariance matrix within a region to detect whether characteristic occlusion events occur.
Later work [2010] proposes a simplified covariance region descriptor (called the Sigma set) that comprises the lower-triangular matrix square root (obtained by Cholesky factorization) of the covariance matrix used in [Li et al. 2008].
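To make the construction concrete, the following is a minimal numpy sketch of a region covariance descriptor and a Sigma-set-style factorization; the per-pixel feature vector (pixel coordinates, intensity, first-order gradient magnitudes) and the small regularization term are illustrative choices, not the exact configurations of the cited papers.

```python
import numpy as np

def region_covariance(patch):
    """Covariance descriptor of a grayscale patch (2-D numpy array).

    Each pixel is mapped to the feature vector
    [x, y, I(x, y), |dI/dx|, |dI/dy|]; the descriptor is the 5x5
    covariance matrix of these vectors over the region.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(patch.astype(np.float64))
    feats = np.stack([xs, ys, patch, np.abs(gx), np.abs(gy)], axis=-1)
    feats = feats.reshape(-1, 5)          # N x 5 feature matrix
    return np.cov(feats, rowvar=False)    # 5 x 5 covariance matrix

def sigma_set(cov, eps=1e-6):
    """Sigma-set-style descriptor: the lower-triangular Cholesky factor L,
    where cov = L @ L.T (a small ridge keeps the matrix positive definite)."""
    L = np.linalg.cholesky(cov + eps * np.eye(cov.shape[0]))
    return L.T  # each row corresponds to one element of the Sigma set

# Toy usage: compare two patches by the distance between their descriptors.
a = np.random.rand(32, 32)
b = np.random.rand(32, 32)
d = np.linalg.norm(sigma_set(region_covariance(a)) - sigma_set(region_covariance(b)))
```

Because the Cholesky factor lives in a Euclidean space, descriptors can be compared with ordinary vector norms instead of Riemannian distances, which is the main computational appeal of the Sigma set.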
Local Feature-Based Visual Representation
A SIFT-based visual representation typically makes direct use of the SIFT features within an object region to describe the structural information of the object's appearance. There are usually two types of SIFT-based visual representations: (i) individual SIFT point based and (ii) SIFT graph based. For (ii), the SIFT graph-based visual representations rely on the underlying geometric contextual relationships between SIFT feature points.
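As an illustration of the individual-SIFT-point case, the sketch below (using OpenCV, with placeholder file names and an assumed initial bounding box) detects SIFT features inside the object region, matches them to the next frame with a ratio test, and shifts the box by the median keypoint displacement; it is a simplified stand-in for the cited methods, not any specific one of them.

```python
import cv2
import numpy as np

# Placeholder frames and an assumed initial bounding box (x, y, w, h).
prev_gray = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr_gray = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)
x, y, w, h = 100, 80, 60, 60

sift = cv2.SIFT_create()
mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255                      # restrict detection to the object region
kp1, des1 = sift.detectAndCompute(prev_gray, mask)
kp2, des2 = sift.detectAndCompute(curr_gray, None)

# Match individual SIFT points with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

# Localize the object by the median displacement of the matched keypoints.
if good:
    shifts = [np.subtract(kp2[m.trainIdx].pt, kp1[m.queryIdx].pt) for m in good]
    dx, dy = np.median(shifts, axis=0)
    x, y = int(x + dx), int(y + dy)
```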
An MSER-based visual representation extracts MSER (maximally stable extremal region) features for visual representation [Sivic et al.]. Typically, a corner feature-based visual representation uses corner features within an object region to describe the structural properties of the object's appearance and then matches these corner features across frames for object localization. For example, Kim [2008] uses corner features for visual representation and then performs multilevel dynamic clustering of the corner features to generate a set of corner point trajectories.
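A corner feature-based scheme can be sketched in a similar spirit. The toy example below (OpenCV, placeholder file names and box) detects corners inside the object region and matches them to the next frame with pyramidal Lucas-Kanade optical flow, which is one common matching choice rather than the specific multilevel clustering procedure of Kim [2008].

```python
import cv2
import numpy as np

prev_gray = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr_gray = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)
x, y, w, h = 100, 80, 60, 60                       # placeholder object box

# Detect corner features inside the object region.
mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255
corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5, mask=mask)

# Match the corners to the next frame with pyramidal Lucas-Kanade flow.
next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, corners, None)
tracked = next_pts[status.ravel() == 1].reshape(-1, 2)
old = corners[status.ravel() == 1].reshape(-1, 2)

# Shift the box by the median corner displacement.
if len(tracked):
    dx, dy = np.median(tracked - old, axis=0)
    x, y = int(x + dx), int(y + dy)
```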
Recently, feature set-based local visual representations have been widely used in ensemble learning-based object tracking. In a second, discriminative learning stage, multiple discriminative regions of attention are selected for visual representation.
Discussion on Global and Local Visual Representations
However, such a representation cannot encode precise information about objects, such as size, orientation, and position. The MSER-based representation tries to find several maximally stable extremal regions for matching features across frames. The SURF-based representation builds on speeded-up robust features (SURF), which offer scale invariance, rotation invariance, and computational efficiency.
Therefore, it is suitable for tracking objects (e.g., cars or trucks) with many corner points, but it is sensitive to non-rigid shape deformation and noise. The feature pool-based representation is closely tied to feature selection-based ensemble learning, which requires a pool of local features (e.g., color, texture, and shape). Because many features are involved, feature extraction and feature selection are computationally expensive.
The saliency detection-based representation aims to find a set of discriminative saliency regions for a given object. Its drawback, however, is that it relies heavily on salient region detection, which is sensitive to noise and drastic lighting changes.
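As a toy illustration of the idea (not of any particular cited method), the sketch below scores patches inside the object region by their intensity variance, a crude stand-in for a real saliency measure, and keeps the top-scoring patches as attentional regions for later matching.

```python
import numpy as np

def select_salient_patches(region, patch=8, k=5):
    """Toy saliency-style selection: split the object region into patches
    and keep the k patches with the highest intensity variance
    (variance is only a crude stand-in for a saliency score)."""
    h, w = region.shape
    scores = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            scores.append(((i, j), region[i:i + patch, j:j + patch].var()))
    scores.sort(key=lambda s: s[1], reverse=True)
    return [pos for pos, _ in scores[:k]]   # top-k patch offsets inside the region

salient = select_salient_patches(np.random.rand(64, 64))
```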
STATISTICAL MODELING FOR TRACKING-BY-DETECTION
- Mixture Generative Appearance Models
- Kernel-Based Generative Appearance Models (KGAMs)
- Subspace Learning-Based Generative Appearance Models (SLGAMs)
- Boosting-Based Discriminative Appearance Models (BDAMs)
- SVM-Based Discriminative Appearance Models (SDAMs)
- Randomized Learning-Based Discriminative Appearance Models (RLDAMs)
- Discriminant Analysis-Based Discriminative Appearance Models (DADAMs)
- Codebook Learning-Based Discriminative Appearance Models (CLDAMs)
- Hybrid Generative-Discriminative Appearance Models (HGDAMs)
Another way to improve the efficiency of the ℓ1 tracker [Mei and Ling 2009] is to reduce the number of ℓ1 minimizations performed when evaluating test samples [Mei et al.].

Self-Learning Single-Instance BDAMs. Based on online boosting [Oza and Russell 2001], researchers have developed a number of computer vision applications, such as object detection [Viola and Jones 2002] and visual object tracking [Grabner et al.]. Essentially, this BDAM is an extension of the GradientBoost algorithm [Friedman 2001] and works similarly to the AnyBoost algorithm [Mason et al.].
To characterize the cumulative loss of the weak classifiers over multiple frames rather than only the current frame, Li et al.
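To illustrate the flavor of such self-learning BDAMs, the following is a generic sketch (not the cited algorithms) of online feature selection with decision stumps, in which each stump's classification loss is accumulated over frames and the strong classifier is assembled from the stumps with the lowest cumulative loss.

```python
import numpy as np

class StumpPool:
    """Toy online-boosting-style selector: one decision stump per feature,
    each keeping a loss accumulated over frames (not just the current frame);
    the strong classifier is built from the best-performing stumps."""

    def __init__(self, n_features, n_select=10):
        self.thresh = np.zeros(n_features)     # per-feature stump thresholds
        self.sign = np.ones(n_features)        # stump polarities
        self.cum_loss = np.zeros(n_features)   # loss accumulated over frames
        self.n_select = n_select

    def update(self, X, y):
        """X: (n_samples, n_features) features; y: labels in {-1, +1}."""
        pos, neg = X[y > 0], X[y < 0]
        self.thresh = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0))
        self.sign = np.sign(pos.mean(axis=0) - neg.mean(axis=0))
        pred = self.sign * np.sign(X - self.thresh)          # per-stump predictions
        self.cum_loss += (pred != y[:, None]).mean(axis=0)   # accumulate frame error

    def predict(self, X):
        best = np.argsort(self.cum_loss)[: self.n_select]    # lowest cumulative loss
        votes = self.sign[best] * np.sign(X[:, best] - self.thresh[best])
        return np.sign(votes.sum(axis=1))

# Per frame: extract features for positive/negative samples around the target,
# call update(X, y), then score candidate windows with predict().
```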
In principle, eigenboosting aims to minimize a modified boosting error function in which the generative information (i.e., eigenimages generated from Haar-like binary basis functions using robust PCA) is integrated as a multiplicative prior.

HGDAMs via Multilayer Combination. In principle, the goal of HGDAMs via multilayer combination is to combine the information from the generative and discriminative models across multiple layers. In general, such HGDAMs can be divided into two classes: HGDAMs via sequential combination and HGDAMs via interleaving combination.
In principle, HGDAMs via sequential combination aim to merge the advantages of the generative and discriminative models in a sequential manner; namely, the decision output of one model is used as the input of the other model. By contrast, HGDAMs via interleaving combination aim to fuse the discriminative and generative information in a multilayer, interleaved manner.
The decision output of one model is used to guide the learning task of the other model and vice versa.

BENCHMARK RESOURCES FOR VISUAL OBJECT TRACKING

Table VI. Summary of the Publicly Available Tracking Resources
Item No. | Name | Dataset | Ground truth | Source code | Web link
1 | Head track [Birchfield 1998] | √ | × | √ | www.ces.clemson.edu/~stb/research/headtracker/seq/
2 | Fragment track [Adam et al.] | | | | www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm
3 | Adaptive tracker [Jepson et al. 2003] | √ | × | × | www.cs.toronto.edu/vis/projects/adaptiveAppearance.html
4 | PCA tracker [Ross et al. 2008] | √ | √ | √ | www.cs.utoronto.…/ivt/
5 | KPCA tracker [Chin and Suter 2007] | × | × | √ | cs.adelaide.edu.au/~tjchin/
6 | ℓ1 tracker [Mei and Ling 2009] | × | × | √ | www.ist.temple.edu/~hbling/codedata.htm
7 | Kernel-based tracker | × | × | √ | code.google.com/p/detect/
8 | Boosting tracker [Grabner and Bischof 2006] | √ | × | √ | www.vision.ee.ethz.ch/boostingTrackers/
9 | MIL tracker [Babenko et al. 2009] | √ | √ | | vision.ucsd.edu/~bbabenko/project_miltrack.shtml
10 | MIForests tracker [Leistner et al. 2010] | √ | √ | √ | www.ymer.org/amir/software/milforests/
11 | Boosting+ICA tracker [Yang et al. 2010b] | × | × | √ | ice.dlut.edu.cn/lu/publications.html
12 | Adaptive tracker [Zhou et al. 2004] | × | × | √ | www.umiacs.umd.edu/~shaohua/sourcecodes.html
13 | Tracking with histograms and articulating blocks [Nejhum et al. 2010] | √ | √ | √ | www.cise.ufl.edu/…
14 | VTD tracker [Kwon and Lee 2010] | √ | √ | √ | cv.snu.ac.kr/research/~vtd/
15 | Structural SVM tracker [Hare et al. 2011] | × | × | √ | www.samhare.net/research/struck
16 | PROST tracker [Santner et al. 2010] | | | | …tugraz.at/index.php?content=subsites/prost/prost.php
17 | Superpixel tracker [Wang et al. 2011] | √ | √ | √ | faculty.ucmerced.edu/mhyang/papers/iccv11a.html
18 | KLT feature tracker [Lucas and Kanade] | | | | …/klt/
19 | Deformable contour tracker [Vaswani et al. 2008] | √ | × | √ | home.engineering.iastate.edu/~namrata/research/ContourTrack.html#code
20 | Condensation tracker [Isard and Blake 1998] | √ | × | √ | www.robots.ox.ac.uk/…
21 | [Stauffer and Grimson 2000] | √ | × | √ | www.cs.berkeley.edu/~flw/tracker/
22 | Mean shift tracker | × | × | √ | www.cs.bilkent.edu.tr/~ismaila/MUSCLE/MSTracker.htm
23 | Tracking-Learning-Detection tracker | | | | info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html
24 | CAVIAR sequences | √ | √ | × | homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
25 | PETS sequences | √ | √ | × | www.hitech-projects.com/euprojects/cantata/datasets_cantata/dataset.html
26 | SURF | × | × | √ | people.ee.ethz.ch/~surf/downloadac.html
27 | XVision visual tracking | × | × | √ | peipa.essex.ac.uk/info/software.html
28 | The Machine Perception Toolbox | × | × | √ | mplab.…/grants/project1/free-software/MPTWebSite/introduction.html
29 | Compressive tracker [Zhang et al. 2012] | √ | √ | √ | www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm
30 | Structural local sparse tracker [Zhang et al. 2012] | √ | √ | √ | ice.dlut.edu.cn/lu/Project/cvpr12jiaproject/cvpr12jiaproject.htm
31 | Sparsity-based collaborative tracker [Zhong et al. 2012] | √ | √ | √ | ice.dlut.edu.cn/lu/Project/…
32 | [Zhang et al. 2012] | × | × | √ | sites.google.com/site/zhangtianzhu2012/publications
33 | APG ℓ1 tracker [Bao et al. 2012] | √ | √ | √ | www.dabi.temple.edu/~hbling/codedata.htm#L1Tracker
34 | | | | √ | www.samhare.net/research/keypoints
35 | Spatial-weighted MIL tracker [Zhang and Song 2012] | × | × | √ | code.google.com/p/online-weighted-miltracker/
Note: The symbols √ and × indicate whether the corresponding dataset, ground truth, or source code is publicly available. If objects of interest are annotated with bounding boxes, a quantitative evaluation can be performed by computing the positional errors of the four corners between the tracked bounding boxes and the ground truth. Alternatively, the overlap ratio between the tracked bounding boxes (or ellipses) and the ground truth can be computed for quantitative evaluation: r = area(A_t ∩ A_g) / area(A_t ∪ A_g), where A_t is the tracked bounding box (or ellipse) and A_g is the ground-truth bounding box (or ellipse).
However, the task of marking the ground truth with bounding boxes or ellipses is difficult and time consuming. Consequently, researchers often either record object center locations as the ground truth, for simplicity and efficiency, or manually mark several points within the object regions for accuracy (e.g., seven marker points are used in the Dudek face sequence [Ross et al. 2008]). In this way, the position residuals between the tracking results and the ground truth can be computed for quantitative evaluation.
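For concreteness, the sketch below computes the two quantitative measures discussed above, assuming axis-aligned boxes given as (x, y, w, h) tuples.

```python
import numpy as np

def overlap_ratio(box_t, box_g):
    """Overlap r = area(A_t ∩ A_g) / area(A_t ∪ A_g) for boxes (x, y, w, h);
    A_t is the tracked box and A_g the ground-truth box."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def center_error(box_t, box_g):
    """Euclidean distance between the tracked and ground-truth box centers."""
    ct = np.array([box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ct - cg))

# Example: per-frame scores for a single tracked frame.
print(overlap_ratio((100, 80, 60, 60), (105, 85, 60, 60)))   # ~0.72
print(center_error((100, 80, 60, 60), (105, 85, 60, 60)))    # ~7.07
```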
CONCLUSION AND FUTURE DIRECTIONS
However, these visual features and geometric constraints may also reduce the generalization capability of the appearance models, making them susceptible to other appearance variations. On the other hand, appearance models designed to improve tracking robustness relax some of the constraints on accurate object localization and thus allow greater ambiguity in object localization. In computer vision, designing visual features that are both simple and robust is one of the most fundamental and important problems.
Typically, appearance models are based on a single camera, which provides very limited visual information about the tracked objects. As a result, they often cannot by themselves accomplish the task of tracking the same object across different but adjacent scenes.