A better understanding of this process could improve a computer's ability to parse scene images and convey information about them to humans. Our contribution comprises a large public dataset of high-resolution annotated scene images and suitable metrics for performance evaluation.
Introduction
Human Annotation
For now, we rely on this intuitive notion of importance and explore ways to identify which objects people notice most in a photo.
Previous Work
Elazary and Itti consider the order in which objects are named in LabelMe as a measure of object interest [28, 115]. This is problematic because, as we will see in Section 3.6, viewers produce unordered lists of objects.
Data Collection
Because the results of past users are visible to future users, each object token can be drawn only once, creating a single list. Object names are also ambiguous across scenes: "building" could mean a house in a suburban image or a skyscraper in a city image.
Image Collection
Data Overview
We then take the median of the rank differences so as not to count objects twice. We identify the obvious object as the object mentioned early and often: the earliest, in average mention order, among the more frequent half of the objects.
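A minimal sketch of this rule (function and variable names are ours, and ties are broken arbitrarily):

```python
from collections import defaultdict
import numpy as np

def obvious_object(lists):
    # lists: one object-name sequence per viewer, e.g. [["dog", "ball"], ...]
    counts = defaultdict(int)   # how often each object is mentioned
    ranks = defaultdict(list)   # at which positions it is mentioned
    for lst in lists:
        for rank, obj in enumerate(lst):
            counts[obj] += 1
            ranks[obj].append(rank)
    # Keep the more frequently mentioned half of the objects ...
    objs = sorted(counts, key=counts.get, reverse=True)
    frequent = objs[: max(1, len(objs) // 2)]
    # ... and among those, pick the earliest in average mention order.
    return min(frequent, key=lambda o: np.mean(ranks[o]))

print(obvious_object([["dog", "ball", "tree"],
                      ["dog", "tree"],
                      ["ball", "dog", "grass"]]))  # -> dog
```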
Urn Model
An object's importance in a particular image is the probability that it is mentioned first by a viewer. To model this, we develop a variant of the urn model, which we call the forgetful urn.
The Forgetful Urn
(3.1)

However, we draw marbles without replacement, so this equation is constrained by $w_{m_i} = w_{m_j} \implies i = j$. Assuming our data is valid, we are only dealing with the second case.
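For concreteness, here is a minimal sketch of the base urn from which the forgetful variant departs: marbles are drawn without replacement, each draw weighted by the remaining marbles' weights, so the first draw lands on object $i$ with probability $w_i / \sum_j w_j$, the importance defined above. The forgetful modification itself is not reproduced here.

```python
import numpy as np

def draw_list(weights, rng=np.random.default_rng(0)):
    """Draw every marble from a weighted urn without replacement."""
    weights = np.asarray(weights, dtype=float)
    remaining = list(range(len(weights)))
    order = []
    while remaining:
        w = weights[remaining]
        # Each remaining marble is drawn in proportion to its weight.
        i = rng.choice(len(remaining), p=w / w.sum())
        order.append(remaining.pop(i))
    return order

print(draw_list([5.0, 2.0, 1.0]))  # e.g. [0, 1, 2]
```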
The Relaxed Urn
List Statistics and Urn Models
As with the other histogram comparisons, all of the urn models fall between the human and the random data.
Markov Chain Method
This Markov chain is aperiodic by construction and empirically irreducible on our data, so a stationary distribution is assured. Our intuition for why the steady-state distribution should approximate importance is that the Markov chain essentially runs the urn backwards.
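A minimal sketch of extracting that stationary distribution by power iteration (the transition matrix here is a toy example, not our data):

```python
import numpy as np

def stationary_distribution(P, tol=1e-12):
    """Steady state of a row-stochastic transition matrix P by power
    iteration; valid when the chain is aperiodic and irreducible."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    while True:
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt

P = np.array([[0.1, 0.9],
              [0.5, 0.5]])
print(stationary_distribution(P))  # approx [0.357, 0.643]
```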
Left-out Object Sequence
To select the closest human list, we consider the first k objects in our held-out list and select the closest of the 24 lists. For a fair comparison, we force the first k objects in all guess lists to match the hidden list, and fill the remaining 10 − k entries with objects in the order of the guess list.
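A sketch of this construction, assuming a simple overlap count as the closeness measure (the exact measure used may differ):

```python
def guess_list(hidden, human_lists, k, length=10):
    prefix = hidden[:k]
    # Closest human list: the one sharing the most objects with the
    # first k entries of the held-out list (assumed closeness measure).
    best = max(human_lists, key=lambda l: len(set(l[:k]) & set(prefix)))
    # Force the first k objects to match, then fill the remaining
    # length - k slots in the chosen list's order.
    rest = [o for o in best if o not in prefix]
    return (prefix + rest)[:length]
```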
Object Outlines
We measure the quality of a match as the intersection over union of the human and the nearest computer segment, and then sum the importance of the matched objects. Our generalization of the three-annotation criterion is to compare the largest of the three pairwise consistency values against 0.5.
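A sketch of this matching step (the threshold and names are illustrative):

```python
import numpy as np

def matched_importance(human_masks, importances, computer_masks, thr=0.5):
    """Match each human-outlined object to its best-overlapping computer
    segment by intersection over union, then sum the importance of the
    matched objects."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    total = 0.0
    for mask, imp in zip(human_masks, importances):
        if max(iou(mask, c) for c in computer_masks) >= thr:
            total += imp
    return total
```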
Features
We note that the object map has a vertical axis of symmetry, so we treat distances to the left and right of the centerline equivalently. For each of these measures, we take the sum, maximum, and mean of the saliency values covered by the object mask.
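In code, these per-object statistics reduce to a few lines (a sketch; names are illustrative):

```python
import numpy as np

def mask_saliency_stats(saliency, mask):
    """Sum, maximum, and mean of a saliency map restricted to the
    pixels of a boolean object mask."""
    vals = saliency[mask]
    return vals.sum(), vals.max(), vals.mean()
```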
Regression
We then look at the percentage of the face mask covered by the object, the percentage of the object covered by the face mask, and the intersection over union of the object and face masks. However, the scatter plot is difficult to interpret because most objects have very low importance.
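The three face-overlap measures can be computed directly from boolean masks (a minimal sketch; the function name is ours):

```python
import numpy as np

def face_object_overlap(obj, face):
    """Fraction of the face mask covered by the object, fraction of the
    object covered by the face mask, and their intersection over union."""
    inter = np.logical_and(obj, face).sum()
    union = np.logical_or(obj, face).sum()
    return (inter / face.sum() if face.sum() else 0.0,
            inter / obj.sum() if obj.sum() else 0.0,
            inter / union if union else 0.0)
```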
The Power of Features
[Figure: ranking of feature predictiveness, over features including distance to center (min), area order (ascending/descending), distance to the thirds (min/max), orientation/intensity/color center-surround (CM) statistics, face-overlap measures (percentage of object covered by face, object-face intersection over union), Gaussian-modulated, blurred, and learned saliency statistics, overlapping-object counts, percent overlap, and log(area).]
Generative and Discriminative Tasks
The measured importance values in Figure 4.11 compare the lists obtained from viewers performing the regular, generative, or discriminative tasks. [Figure: feature ranking as above, over distance, saliency, face-overlap, object-overlap, and area features.]
Methods
Stimuli
Experimental Conditions
Target objects were selected to make the search difficult: they were either present but not readily visible, or absent from the image yet plausible for the scene and frequently named in other images (as determined independently in a web-based “what” condition without eye tracking). In the “where” condition, the image disappeared when observers pressed a key indicating the presence or absence of the target object.
Observers
After the image disappeared, the observer rated its interestingness from 1 to 5, and then typed up to five keywords (Figure 5.2A).
Recording Eye Position
[Figure 5.2: trial timeline. In the “what” condition (top), observers see a picture for 3 seconds, are then asked for a rating (until a key is pressed), and then for key words (until the key is released).]
Object Annotation
(1/16 of the image resolution, i.e., 0.5°/bin) and smaller than the typical object size, which we roughly estimate as the square root of the number of pixels covered by an object, yielding 223 pixels or 6.3° on average. Thus, the typical variation of eye position during a fixation is small compared to the eye tracker's absolute localization accuracy, the resolution of the saliency maps, and the typical size of objects.
Object Maps, Fixation Maps, and Saliency Maps
Analysis Methods
We follow two strategies to assess the effect of these spatial biases: First, we directly measure the spatial biases of the investigated feature (Figure 6.1 A and B). Second, we define a random reassignment baseline to measure how much of the prediction by a given map can be explained by the image-independent spatial biases: we randomly assign the object, salience, and fixation map of one image to another image.
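A minimal sketch of this reassignment baseline, with `score` standing in for whichever prediction measure (e.g., AUC) is used for the true pairing:

```python
import numpy as np

def reassignment_baseline(maps, fixations, score, rng=np.random.default_rng()):
    """Score each image's fixations against a map drawn from a
    *different* image, so that only image-independent spatial biases
    can contribute to the prediction."""
    n = len(maps)
    perm = rng.permutation(n)
    # Re-draw until no map stays with its own image (a derangement).
    while np.any(perm == np.arange(n)):
        perm = rng.permutation(n)
    return [score(maps[perm[i]], fixations[i]) for i in range(n)]
```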
Central Bias
C: Area under the ROC curve for the saliency map's prediction of fixated locations, pooled across all observers for each image; histogram across images. Each data point corresponds to one image; for points above the diagonal, the saliency map's prediction is better.
Task and Fixation
Instead, the distribution is quite uniform if we ignore the image boundaries, where it is necessarily zero (Figure 6.1B).
Fixation Consistency
C, D: Data for the “where” observers of panel B, separated into target-present (dark gray) and target-absent (light gray) trials. Prediction within the “where” observers does not consistently depend on the presence of the target (Figure 6.3C), which rules out the possibility that fixations on or near the target dominate between-observer consistency in the “where” task.
Object Recall Consistency
In summary, there is enough consistency between observers to predict another individual's fixations, despite some task dependence. E: Histogram of the number of observers recalling the most-recalled object in an image.
Early Saliency Predicts Fixations Poorly
One observer's fixations are predicted from a map generated from the fixations of all others with an average AUC of 88.9% (Figure 6.3). This number is much higher than the 57.8% obtained for saliency, suggesting that fixation prediction from saliency maps, while better than chance, is far from optimal.
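A sketch of how such an AUC can be computed, treating map values at fixated pixels as positives and all map pixel values as negatives (the exact negative set used is an assumption):

```python
import numpy as np

def fixation_auc(pred_map, fix_x, fix_y):
    """ROC area for predicting fixated locations from a map."""
    pos = pred_map[fix_y, fix_x]   # map values at fixated pixels
    neg = pred_map.ravel()         # all map values as the negative set
    # AUC equals the probability that a random positive outranks a
    # random negative (ties count half).
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq
```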
Task Independence of Saliency Map Predictions
If objects tend to be more salient than the background, saliency maps would predict fixations indirectly, rather than saliency directly causing fixations.
Predicting Fixations with Object Maps
Consequently, for most images, knowing the objects is more predictive of fixations than knowing early saliency alone, even when the frequency of object recall is unknown.
If Objects are Known, Early Saliency Contributes Little to Fixation Prediction
B: Examples of images where fixations are predicted (1) well by both, (2) well by the object map but poorly by the saliency map, (3) well by the saliency map but poorly by the object map, and (4) poorly by both. This shows that early saliency does not add significantly to fixation prediction once recalled objects are known, whereas object maps are informative even when raw saliency is known.
Predicting Fixations with Object Saliency
This is a strong indication that knowledge of saliency provides little additional information once the objects are known.
Predicting Recall with Fixations
We also look at the relationship between the fraction of fixations falling on an object and the log odds of the object being named versus not named. A similar relationship holds between an observer's own fixations and the other observers' fixations, both on average (Figure 8.4) and for individuals (Figure 8.3B).
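A sketch of the binned log-odds computation (the +0.5 smoothing against empty cells is our assumption):

```python
import numpy as np

def log_odds_by_fixation(frac_fix, recalled, bins=10):
    """Bin objects by the fraction of fixations they received and
    compute the log odds of recall within each bin; `recalled` is a
    boolean array over objects."""
    edges = np.linspace(0, frac_fix.max(), bins + 1)
    idx = np.clip(np.digitize(frac_fix, edges) - 1, 0, bins - 1)
    out = []
    for b in range(bins):
        r = recalled[idx == b]
        out.append(np.log((r.sum() + 0.5) / ((~r).sum() + 0.5)))
    return edges, np.array(out)
```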
Object Saliency Predicts Recall Frequency
Although this indicates that observers prefer to recall large, central objects, the correlation between total object saliency and recall frequency (Figure 8.2C) exceeds that of all other measures tested. These object measures are correlated with the object saliency measures and are therefore partially redundant for predicting recall.
Combinations of Properties Predict Recall
B: AUC for predicting recall from total object saliency alone (left bars), fixations alone (middle bars), or a combination of both (right bars). D: AUC for prediction by maximum object saliency (left), maximum object saliency combined with fixations (middle), and maximum object saliency combined with object area (right).
Saliency Predicts a Scene’s Most Characteristic Object
Therefore, total object saliency predicts the most frequently named object significantly better than uniform random selection. How does the object saliency measure compare with other measures in predicting the most frequently named object?
Saliency Peak Falls on Frequently Named Objects
On this view, early saliency does not direct attention directly (black path in Figure 9.3) but acts through its correlation with object properties (red path); saliency maps thus model object saliency rather than location saliency.
Methods
The outline of this chapter is as follows: Section 10.1 discusses previous work toward the goal of detecting objects independent of category membership. Ma and Zhang use fuzzy growing to extract connected components of a contrast-based saliency map as detections [81].
Datasets
Objects of categories other than the 20 selected are not annotated; we find that annotated objects make up only one tenth of the visible objects (Table 10.2). These objects are not annotated in the dataset, so Endres and Hoiem added annotations that conform to the original boundaries.
Benchmark
New Dataset
So if 75% of viewers produce at least three consistent boxes for an image, then acceptance requires two consistent boxes. Ground-truth boxes were generated as follows: all consistent boxes for an image were clustered by single linkage with a cutoff of $o_a = 0.5$.
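A sketch of this clustering step using SciPy, with distance 1 − overlap so that the $o_a = 0.5$ cutoff becomes a linkage distance of 0.5 (the IoU-style overlap measure is assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def box_iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cluster_boxes(boxes, cutoff=0.5):
    """Single-linkage clustering of boxes with distance 1 - IoU and a
    cutoff corresponding to o_a = 0.5; returns one label per box."""
    n = len(boxes)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = 1.0 - box_iou(boxes[i], boxes[j])
    return fcluster(linkage(squareform(d), method="single"),
                    t=1.0 - cutoff, criterion="distance")
```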
Metrics
A different approach, used by Alexe et al., allows multiple matches and defines a signal-to-noise ratio and a detection rate. Correctly finding just one object out of dozens can still yield a detection rate of 1 and a signal-to-noise ratio ≥ 0.5.
Benchmarking Previous Methods
The precision and recall results suggest that, while Endres and Hoiem outperform the other methods, there is considerable room for improvement. In Figure 11.2B, good performance is indicated by fewer detections recalling more objects.
Predicting Box Importance