Patch-level feature description. The term "patch" usually refers to a small local subwindow around a point of interest in the image plane or scale pyramid; patches can be sampled sparsely by a keypoint detector, for example the Difference-of-Gaussians (DoG) detector of Lowe [2004], or densely on a regular grid, as done by Dalal and Triggs [2005]. Summary. The region-level descriptors discussed in this section often build on low-level features such as colors, gradients, or self-similarities.
Structured Models
Similar tree-structured part models are adopted by some class-specific segmentation methods, such as OBJ CUT [Kumar et al. 2005], and by some 3D human body segmentation and pose estimation methods, such as PoseCut [Bray et al.]. The model shows superior performance and has been adopted in some newly developed detectors, such as that of Pedersoli et al.
Grammar-based models. Grammar-based models [Zhu and Mumford 2006] are also called compositional models by Girshick et al. Based on this observation, Park et al. [2010] propose a resolution-dependent combination of window-based and part-based models. Specifically, the model degenerates to the window-based one of Dalal and Triggs [2005] when detecting small objects, such as pedestrians shorter than 50 pixels, and becomes the part-based model of Felzenszwalb et al. when detecting larger objects, such as people taller than 100 pixels.
Unstructured Models
In addition, graphics-based or synthetic 3D models have also been used to support multiview detection, as done by Hoiem et al. We compare globally and locally pooled window-based models with part-based ones in terms of their ability to handle intra-class variations when detecting structured object classes. Although part-based models are not invariant to such changes either, they can tolerate them to a greater extent than locally pooled window-based models, owing to their inherent deformability.
In addition, the relative locations of the parts detected by a part-based model describe the configuration of the object and can support pose estimation, as confirmed by the recently proposed state-of-the-art pose estimator of Yang and Ramanan [2011]. Moreover, since a fraction of the parts in a part-based model is often sufficient to support a valid object hypothesis, part-based models can usually tolerate partial occlusion better than, at least, those based on locally pooled windows. To this end, some bottom-up oversegmentation processes are employed, for example, superpixel extraction [Ren and Malik 2003] or region decomposition [Gould et al.].
Contextual Information Description
However, the basic assumption underlying these methods is that all pixels of a given segment belong to the same object [He et al.]. They provide their interpretations of context based on the early psychophysical findings of Biederman et al. To study the specific benefits of context for object class detection, let us first briefly consider the famous hypothesis of Biederman et al.
To provide a more comprehensive overview, we further include several sources of context summarized in the study of Divvala et al. [2009]; one such source refers to the biases in the way we take photos (framing [Simon and Seitz 2008], focus, subject matter) and in how we select datasets [Ponce et al.]. More often, local environmental context is captured using pairwise potentials in MRF/CRF-based methods, e.g., Shotton et al.
LOCALIZATION STRATEGIES
Localization through Subwindow Search
Two basic variables must then be searched over to find the 2D coordinates of the center or a corner of the window W. Another limitation is due to the mismatch between the predefined shape of the window and the shape of the detected object (discussed earlier in Section 3.2.1). A notable remedy for the efficiency problem is the efficient subwindow search of Lampert et al. [2008], which shows that, by partitioning the search space into disjoint subsets and deriving an easily computable upper bound on the matching quality, a best-first search driven by a priority queue can significantly reduce the search time.
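As a concrete (and deliberately simplified) illustration of this branch-and-bound idea, the sketch below finds the axis-aligned rectangle maximizing a sum of per-pixel scores: sets of candidate rectangles are repeatedly split, prioritized by an easily computable upper bound (positive scores over the largest rectangle in the set plus negative scores over the smallest). The function name and the naive summations are our own illustrative choices, not the implementation of the cited work.

```python
import heapq

def best_subwindow(scores):
    """Best-first branch-and-bound search for the rectangle
    (top, bottom, left, right) maximizing the sum of per-pixel scores."""
    h, w = len(scores), len(scores[0])

    def rect_sum(t, b, l, r, clip_neg=False, clip_pos=False):
        total = 0.0
        for y in range(t, b + 1):
            for x in range(l, r + 1):
                s = scores[y][x]
                if clip_neg:
                    s = max(s, 0.0)  # keep only positive evidence
                if clip_pos:
                    s = min(s, 0.0)  # keep only negative evidence
                total += s
        return total

    def upper_bound(state):
        t_lo, t_hi, b_lo, b_hi, l_lo, l_hi, r_lo, r_hi = state
        # positive scores over the largest rectangle in the set ...
        bound = rect_sum(t_lo, b_hi, l_lo, r_hi, clip_neg=True)
        # ... plus negative scores over the smallest, if it is nonempty
        if t_hi <= b_lo and l_hi <= r_lo:
            bound += rect_sum(t_hi, b_lo, l_hi, r_lo, clip_pos=True)
        return bound

    # a set of rectangles = one interval per coordinate (lo, hi pairs)
    start = (0, h - 1, 0, h - 1, 0, w - 1, 0, w - 1)
    heap = [(-upper_bound(start), start)]
    while heap:
        neg_bound, state = heapq.heappop(heap)
        intervals = [state[0:2], state[2:4], state[4:6], state[6:8]]
        widths = [hi - lo for lo, hi in intervals]
        if max(widths) == 0:  # a single rectangle remains: it is optimal
            t, _, b, _, l, _, r, _ = state
            return (t, b, l, r), -neg_bound
        # split the widest interval in half and push both halves
        i = widths.index(max(widths))
        lo, hi = intervals[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = list(state)
            child[2 * i], child[2 * i + 1] = part
            child = tuple(child)
            if child[0] <= child[3] and child[4] <= child[7]:  # nonempty set
                heapq.heappush(heap, (-upper_bound(child), child))
```

In practice the rectangle sums would be computed in constant time from integral images; the naive loops here only keep the sketch short.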
The second way to improve efficiency is to reduce the search range using contextual information [Torralba et al.]. Both methods can generate a small number (typically 10-100) of highly probable object locations, which can then serve as object hypotheses and thus significantly reduce the search space of the subwindow search, as confirmed by Carreira et al. Regarding the second limitation of the subwindow search strategy, polygonal windows [Yeh et al.]
The Voting Strategy
Note that subwindow search is a fairly general localization strategy that can be used not only with window-based models but also with part-based models, for example, that of Felzenszwalb et al.
Localization through Segmentation
SUPERVISED CLASSIFICATION METHODS
- Parametric vs. Nonparametric Methods
- Generative vs. Discriminative Methods
- Latent and Hidden Models
- Some Aspects regarding Classifier Training
Since an appearance descriptor belonging to a specific category can be generated from p(x|c), models based on this criterion are generative models, which are discussed in the next subsection. In contrast, non-probabilistic classifiers, especially kernel-based classifiers such as the SVM, can be trained with a very small training set. The two families can even be unified, because the outputs of non-probabilistic models can always be converted into probabilistic ones.
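As an illustration of such a conversion, the sketch below fits a sigmoid to raw classifier scores in the spirit of Platt scaling; the function name, the fixed learning rate, and the plain gradient descent are our simplifications for readability, not Platt's original Newton-based procedure.

```python
import math

def fit_platt(scores, labels, lr=0.01, steps=2000):
    """Fit p(y=1|s) = 1 / (1 + exp(A*s + B)) by gradient descent on the
    negative log-likelihood, mapping raw scores to probabilities."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # gradient of the NLL for target y in {0, 1}
            gA += (p - y) * (-s)
            gB += (p - y) * (-1.0)
        A -= lr * gA
        B -= lr * gB
    return A, B
```

Given held-out scores and binary labels, the fitted (A, B) turn any new score s into a calibrated probability 1 / (1 + exp(A*s + B)).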
To address this problem, incremental learning techniques can be used [Li and Fei-Fei 2010], along with a set of specially designed classifiers such as SVMlight [Joachims 1998] and large-scale multiple kernel learning [Sonnenburg et al.]. To address this problem, it is common to construct a training set consisting only of positive and "hard negative" examples, which can be obtained by bootstrapping [Andriluka et al.]. More discussion of hard-example mining can be found in Felzenszwalb et al.
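The bootstrapping procedure just described can be sketched as follows: start from a small random negative sample, then alternately retrain and add the negatives the current model wrongly scores as positive. The toy mean-difference "classifier" here is purely a runnable stand-in, not any of the cited methods.

```python
import random

def train(positives, negatives):
    """Toy stand-in for classifier training: a linear scorer whose
    weight vector is the mean positive minus the mean negative feature,
    with the decision boundary at the midpoint of the two means."""
    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]
    mp, mn = mean(positives), mean(negatives)
    w = [a - b for a, b in zip(mp, mn)]
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mp, mn))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def bootstrap_hard_negatives(positives, negative_pool, rounds=3, seed=0):
    """Bootstrapped hard-negative mining: repeatedly retrain and add the
    false positives of the current model to the negative training set."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, min(5, len(negative_pool)))
    for _ in range(rounds):
        score = train(positives, negatives)
        # mine hard negatives: pool items the model wrongly scores > 0
        hard = [x for x in negative_pool if score(x) > 0 and x not in negatives]
        if not hard:
            break
        negatives.extend(hard)
    return train(positives, negatives), negatives
```

In a real detector, `negative_pool` would be the (huge) set of windows scanned from negative images, and `train` would be an SVM or boosted classifier.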
EVALUATION AND DATASETS
Performance Evaluation Metrics
Evaluation of image categorization results. The standard output of a categorization method on a test image is a predicted category label together with a confidence value f (e.g., the classification score), which is compared with a threshold to decide whether the prediction is accepted. The recognition rate for category c is thus equal to the recall for that category and therefore varies with the confidence threshold f. Usually, the highest average recognition rate corresponds to the optimal value of the confidence threshold.
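In code, the per-category recall at a given confidence threshold can be computed as follows; the per-image record format (true label, predicted label, confidence) is a hypothetical simplification for illustration.

```python
def recall_at_threshold(records, category, f):
    """Recall for one category at confidence threshold f.
    records: iterable of (true_label, predicted_label, confidence)."""
    relevant = [r for r in records if r[0] == category]
    if not relevant:
        return 0.0
    hits = sum(1 for true, pred, conf in relevant
               if pred == category and conf >= f)  # accepted and correct
    return hits / len(relevant)
```

Sweeping f over the observed confidence values traces out the recall curve described above.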
The confusion matrix M is a C×C matrix in which element (i, j) stores the fraction of test images from the i-th class that are categorized as the j-th class. Evaluation of object detection results. The standard outputs of a detection algorithm on a test image I are a bounding box B for each detected object and its predicted category label, together with a confidence value f. The final category-specific segmentation accuracy is obtained by averaging over all test images of the same category.
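These two evaluations can be sketched as follows: a row-normalized confusion matrix, plus the PASCAL-style intersection-over-union overlap commonly used (with a 0.5 threshold) to decide whether a predicted bounding box matches a ground-truth box. The box format (x0, y0, x1, y1) is our assumption for the sketch.

```python
def confusion_matrix(true_labels, pred_labels, classes):
    """Row-normalized C×C confusion matrix: entry (i, j) is the
    fraction of test images of class i assigned to class j."""
    idx = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    totals = [0] * len(classes)
    for t, p in zip(true_labels, pred_labels):
        counts[idx[t]][idx[p]] += 1
        totals[idx[t]] += 1
    return [[n / totals[i] if totals[i] else 0.0 for n in row]
            for i, row in enumerate(counts)]

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1);
    PASCAL-style evaluation typically accepts a detection when this
    overlap with a ground-truth box exceeds 0.5."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0
```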
Datasets and the State-of-the-Art Results
When the confusion matrix has many significant off-diagonal entries, the model adopted by the algorithm is not discriminative enough for these categories and therefore needs to be improved or replaced with a better model. The segmentation accuracy for a single class on a test set can be measured by the proportion of pixels of that class that are correctly categorized (i.e., labeled) by the segmenter under evaluation. Due to page limitations, in the following we only present a very brief overview of the latest results by summarizing the best categorization, detection, and segmentation results of the PASCAL VOC 2010 competition in Table II.
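The per-class segmentation accuracy just described (the fraction of a class's ground-truth pixels that receive the correct label) can be computed as below; label maps are assumed to be simple 2D arrays of class identifiers.

```python
def class_pixel_accuracy(gt, pred, cls):
    """Fraction of ground-truth pixels of class `cls` that the
    segmenter labeled correctly (per-class pixel recall)."""
    total = correct = 0
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g == cls:
                total += 1
                if p == cls:
                    correct += 1
    return correct / total if total else 0.0
```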
Furthermore, observing and analyzing these results (including all results published at the aforementioned URL) reveals some interesting insights, which we believe are instructive for the development of object class detection methodology. (i) Studying the technical details of the winners, e.g., Song et al. [2010] for segmentation, we find that the combination of features, the integration of context, the use of kernel-based classifiers, and the fusion of multiple models all help to achieve better overall accuracy, as the winning algorithms adopted some of these strategies. This point is also implied by the fact that no single method won the competitions in all classes. (ii) The level of accuracy achieved by the latest detection and segmentation methods is still far from meeting the demands of practical, general-purpose applications.
OFFSHOOTS: NEWLY EMERGING APPROACHES AND APPLICATIONS
New Solutions
Retrieving large sets of labeled training images from the Internet. Using text tags and keywords such as category names, images of objects can be collected from social image collections or websites. To collect training images for object categorization, Wnuk and Soatto [2008] use a nonparametric outlier measure in the space of aggregated image representations and iteratively eliminate visual features to remove outliers from the user-queried images. Quite differently, Li and Fei-Fei [2010] propose the Online Picture collecTion via Incremental MOdel Learning (OPTIMOL) system, which mimics the continuous knowledge-updating process of human observers and performs two tasks simultaneously: automatically collecting object datasets from the Web and incrementally learning category models.
2010], whereby (i) images from the Internet are parsed in an interactive way to build an AND-OR graph (AoG; see also Section 3.2.2) for visual knowledge representation. Their proposed text feature, a normalized tag-name histogram, of an unannotated image is obtained from the tags of its nearest neighbors in this auxiliary collection. Torralba et al. [2008] collected about 80 million color images from the Web, rescaled them to 32×32 pixels, and loosely labeled each one with an English noun from the WordNet lexicon (http://wordnet.princeton.edu/).
New Applications
Expanding visual cues with text. The accompanying text of social images provides additional cues for categorization, detection, and segmentation. Categorization, detection, and segmentation through direct matching. We have seen in the previous discussions that social images facilitate the construction of large labeled image sets. One representative is the Photo Clip Art system proposed by Lalonde et al.
Attribute-based image retrieval. The sheer volume and the inaccurate tags of user-supplied social images often make the images retrieved from them using keywords noisy, bulky, of low quality, and disorganized. In other words, extracting really useful information from a large number of social images is very difficult. FaceTracer, a search engine for large sets of face images collected from the Internet, presented by Kumar et al.
CONCLUSIONS AND OPEN ISSUES
2007], which recognizes and segments objects of interest, such as pedestrians, in Internet photo collections and then allows users to paste them into their own photos. Another representative is a powerful image synthesis system, the Sketch2Photo platform developed by Chen et al. To address this problem, researchers are exploring a new direction of image retrieval, namely attribute-based image retrieval, which requires automatic and accurate categorization of objects in order to provide attribute labels for object images.
2008], is just one such engine supporting attribute-based search; it can retrieve images using natural-language queries such as "smiling men with blond hair and mustaches." Another attribute-based image search engine is the SkyFinder system of Tao et al. [2009], which uses a set of automatically extracted semantic sky attributes such as layout, richness, and horizon. For example, it may include only one part for amorphous classes such as sky and clouds, use a star-structured part topology for rigid classes such as bottles, and a tree structure for complex, articulated classes such as the human body. iii).
ELECTRONIC APPENDIX