A better understanding of this process could improve a computer's ability to parse scene images and convey information about them to humans. Our contribution comprises a large public dataset of high-resolution annotated scene images and suitable metrics for performance evaluation.
Introduction
Human Annotation
For now, we rely on this intuitive notion of importance and explore ways to identify which objects people notice most in a photo.
Previous Work
Elazary and Itti consider the order in which objects are named in LabelMe as a measure of object interest [28, 115]. This is problematic because, as we will see in Section 3.6, viewers produce unordered lists of objects.
Data Collection
Because the results of past users are visible to future users, each object token can be drawn only once, creating a single list. Object names are also ambiguous across scenes: "building" could mean a house in a suburban image or a skyscraper in a city image.
Image Collection
Data Overview
We then take the median of the rank differences so as not to count objects twice. We identify the obvious object as the object mentioned early and often: the earliest, in average mention order, among the more frequent half of the objects.
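A minimal sketch of this rule (function and variable names are ours, and ties are broken arbitrarily):

```python
from collections import defaultdict
import numpy as np

def obvious_object(lists):
    # lists: one object-name sequence per viewer, e.g. [["dog", "ball"], ...]
    counts = defaultdict(int)   # how often each object is mentioned
    ranks = defaultdict(list)   # at which positions it is mentioned
    for lst in lists:
        for rank, obj in enumerate(lst):
            counts[obj] += 1
            ranks[obj].append(rank)
    # Keep the more frequently mentioned half of the objects ...
    objs = sorted(counts, key=counts.get, reverse=True)
    frequent = objs[: max(1, len(objs) // 2)]
    # ... and among those, pick the earliest in average mention order.
    return min(frequent, key=lambda o: np.mean(ranks[o]))

print(obvious_object([["dog", "ball", "tree"],
                      ["dog", "tree"],
                      ["ball", "dog", "grass"]]))  # -> dog
```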
Urn Model
An object's importance in a particular image is the probability that it is mentioned first by a viewer. To model this, we develop a variant of the urn model, which we call the forgetful urn.
The Forgetful Urn
(3.1)

However, we draw marbles without replacement, so this equation is constrained by $w_{m_i} = w_{m_j} \implies i = j$. Assuming our data is valid, we are only dealing with the second case.
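For concreteness, here is a minimal sketch of the base urn from which the forgetful variant departs: marbles are drawn without replacement, each draw weighted by the remaining marbles' weights, so the first draw lands on object $i$ with probability $w_i / \sum_j w_j$, the importance defined above. The forgetful modification itself is not reproduced here.

```python
import numpy as np

def draw_list(weights, rng=np.random.default_rng(0)):
    """Draw every marble from a weighted urn without replacement."""
    weights = np.asarray(weights, dtype=float)
    remaining = list(range(len(weights)))
    order = []
    while remaining:
        w = weights[remaining]
        # Each remaining marble is drawn in proportion to its weight.
        i = rng.choice(len(remaining), p=w / w.sum())
        order.append(remaining.pop(i))
    return order

print(draw_list([5.0, 2.0, 1.0]))  # e.g. [0, 1, 2]
```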
The Relaxed Urn
List Statistics and Urn Models
As with the other histogram comparisons, all of the urn models fall between the human and the random data.
Markov Chain Method
This Markov chain is aperiodic by construction and empirically irreducible on our data, so a stationary distribution is assured. Our intuition for why the steady-state distribution should approximate importance is that the Markov chain essentially runs the urn backwards.
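A minimal sketch of extracting that stationary distribution by power iteration (the transition matrix here is a toy example, not our data):

```python
import numpy as np

def stationary_distribution(P, tol=1e-12):
    """Steady state of a row-stochastic transition matrix P by power
    iteration; valid when the chain is aperiodic and irreducible."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    while True:
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt

P = np.array([[0.1, 0.9],
              [0.5, 0.5]])
print(stationary_distribution(P))  # approx [0.357, 0.643]
```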
Left-out Object Sequence
To select the closest human list, we consider the first k objects in our held-out list and select the closest of the 24 lists. For a fair comparison, we force the first k objects in all guess lists to match the hidden list, and fill the remaining 10 − k entries with objects in the order of the guess list.
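A sketch of this construction, assuming a simple overlap count as the closeness measure (the exact measure used may differ):

```python
def guess_list(hidden, human_lists, k, length=10):
    prefix = hidden[:k]
    # Closest human list: the one sharing the most objects with the
    # first k entries of the held-out list (assumed closeness measure).
    best = max(human_lists, key=lambda l: len(set(l[:k]) & set(prefix)))
    # Force the first k objects to match, then fill the remaining
    # length - k slots in the chosen list's order.
    rest = [o for o in best if o not in prefix]
    return (prefix + rest)[:length]
```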
Object Outlines
We measure the quality of a match as the intersection over union of the human and the nearest computer segment, and then sum the importance of the matched objects. Our generalization of the three-annotation criterion is to compare the largest of the three pairwise consistency values against 0.5.
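A sketch of this matching step (the threshold and names are illustrative):

```python
import numpy as np

def matched_importance(human_masks, importances, computer_masks, thr=0.5):
    """Match each human-outlined object to its best-overlapping computer
    segment by intersection over union, then sum the importance of the
    matched objects."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    total = 0.0
    for mask, imp in zip(human_masks, importances):
        if max(iou(mask, c) for c in computer_masks) >= thr:
            total += imp
    return total
```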
Features
We note that the object map has a vertical axis of symmetry, so we treat distances to the left and right of the centerline equivalently. For each of these measures, we take the sum, maximum, and mean of the saliency values covered by the object mask.
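In code, these per-object statistics reduce to a few lines (a sketch; names are illustrative):

```python
import numpy as np

def mask_saliency_stats(saliency, mask):
    """Sum, maximum, and mean of a saliency map restricted to the
    pixels of a boolean object mask."""
    vals = saliency[mask]
    return vals.sum(), vals.max(), vals.mean()
```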
Regression
We then look at the percentage of the face mask covered by the object, the percentage of the object covered by the face mask, and the intersection over union of the object and face masks. However, the scatter plot is difficult to interpret because most objects have very low importance.
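The three face-overlap measures can be computed directly from boolean masks (a minimal sketch; the function name is ours):

```python
import numpy as np

def face_object_overlap(obj, face):
    """Fraction of the face mask covered by the object, fraction of the
    object covered by the face mask, and their intersection over union."""
    inter = np.logical_and(obj, face).sum()
    union = np.logical_or(obj, face).sum()
    return (inter / face.sum() if face.sum() else 0.0,
            inter / obj.sum() if obj.sum() else 0.0,
            inter / union if union else 0.0)
```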
The Power of Features
[Figure: ranking of feature predictiveness, over features including distance to center (min), area order (ascending/descending), distance to the thirds (min/max), orientation/intensity/color center-surround (CM) statistics, face-overlap measures (percentage of object covered by face, object-face intersection over union), Gaussian-modulated, blurred, and learned saliency statistics, overlapping-object counts, percent overlap, and log(area).]
Generative and Discriminative Tasks
The measured importance values in Figure 4.11 compare the lists obtained from viewers performing the regular, generative, or discriminative tasks. [Figure: feature ranking as above, over distance, saliency, face-overlap, object-overlap, and area features.]
Methods
Stimuli
Experimental Conditions
Target objects were selected to make the search difficult: they were either present but not readily visible, or absent from the image yet plausible for the scene and frequently named in other images (as determined independently in a web-based “what” condition without eye tracking). In the “where” condition, the image disappeared when observers pressed a key indicating the presence or absence of the target object.
Observers
After the image disappeared, the observer rated its interestingness from 1 to 5, and then typed up to five keywords (Figure 5.2A).
Recording Eye Position
[Figure 5.2: trial timeline. In the “what” condition (top), observers see a picture for 3 seconds, are then asked for a rating (until a key is pressed), and then for key words (until the key is released).]
Object Annotation
(1/16 of the image resolution, i.e., 0.5°/bin) and smaller than the typical object size, which we roughly estimate as the square root of the number of pixels covered by an object, yielding 223 pixels or 6.3° on average. Thus, the typical variation of eye position during a fixation is small compared to the eye tracker's absolute localization accuracy, the resolution of the saliency maps, and the typical size of objects.
Object Maps, Fixation Maps, and Saliency Maps
Analysis Methods
We follow two strategies to assess the effect of these spatial biases: First, we directly measure the spatial biases of the investigated feature (Figure 6.1 A and B). Second, we define a random reassignment baseline to measure how much of the prediction by a given map can be explained by the image-independent spatial biases: we randomly assign the object, salience, and fixation map of one image to another image.
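A minimal sketch of this reassignment baseline, with `score` standing in for whichever prediction measure (e.g., AUC) is used for the true pairing:

```python
import numpy as np

def reassignment_baseline(maps, fixations, score, rng=np.random.default_rng()):
    """Score each image's fixations against a map drawn from a
    *different* image, so that only image-independent spatial biases
    can contribute to the prediction."""
    n = len(maps)
    perm = rng.permutation(n)
    # Re-draw until no map stays with its own image (a derangement).
    while np.any(perm == np.arange(n)):
        perm = rng.permutation(n)
    return [score(maps[perm[i]], fixations[i]) for i in range(n)]
```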
Central Bias
C: Area under the ROC curve for the saliency map's prediction of fixated locations, pooled across all observers for each image; histogram across images. Each data point corresponds to one image; for points above the diagonal, the saliency map's prediction is better.
Task and Fixation
Instead, the distribution is quite uniform if we ignore the image boundaries, where it is necessarily zero (Figure 6.1B).
Fixation Consistency
C, D: Data for the “where” observers of panel B, separated into target-present (dark gray) and target-absent (light gray) trials. Prediction within the “where” observers does not consistently depend on the presence of the target (Figure 6.3C), which rules out the possibility that fixations on or near the target dominate between-observer consistency in the “where” task.
Object Recall Consistency
In summary, there is enough consistency between observers to predict another individual's fixations, despite some task dependence. E: Histogram of the number of observers recalling the most-recalled object in an image.
Early Saliency Predicts Fixations Poorly
One observer's fixations are predicted from a map generated from the fixations of all others with an average AUC of 88.9% (Figure 6.3). This number is much higher than the 57.8% obtained for saliency, suggesting that fixation prediction from saliency maps, while better than chance, is far from optimal.
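A sketch of how such an AUC can be computed, treating map values at fixated pixels as positives and all map pixel values as negatives (the exact negative set used is an assumption):

```python
import numpy as np

def fixation_auc(pred_map, fix_x, fix_y):
    """ROC area for predicting fixated locations from a map."""
    pos = pred_map[fix_y, fix_x]   # map values at fixated pixels
    neg = pred_map.ravel()         # all map values as the negative set
    # AUC equals the probability that a random positive outranks a
    # random negative (ties count half).
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq
```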
Task Independence of Saliency Map Predictions
If objects tend to be more salient than the background, saliency maps would predict fixations indirectly, rather than saliency directly causing fixations.
Predicting Fixations with Object Maps
Consequently, for most images, knowing the objects is more predictive of fixations than knowing early saliency alone, even when the frequency of object recall is unknown.
If Objects are Known, Early Saliency Contributes Little to Fixation Prediction
B: Examples of images where fixations are predicted (1) well by both, (2) well by the object map but poorly by the saliency map, (3) well by the saliency map but poorly by the object map, and (4) poorly by both. This shows that early saliency does not add significantly to fixation prediction once recalled objects are known, whereas object maps are informative even when raw saliency is known.
Predicting Fixations with Object Saliency
This is a strong indication that knowledge of saliency provides little additional information once the objects are known.
Predicting Recall with Fixations
We also look at the relationship between the fraction of fixations falling on an object and the log odds of the object being named versus not named. A similar relationship holds between an observer's own fixations and the other observers' fixations, both on average (Figure 8.4) and for individuals (Figure 8.3B).
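A sketch of the binned log-odds computation (the +0.5 smoothing against empty cells is our assumption):

```python
import numpy as np

def log_odds_by_fixation(frac_fix, recalled, bins=10):
    """Bin objects by the fraction of fixations they received and
    compute the log odds of recall within each bin; `recalled` is a
    boolean array over objects."""
    edges = np.linspace(0, frac_fix.max(), bins + 1)
    idx = np.clip(np.digitize(frac_fix, edges) - 1, 0, bins - 1)
    out = []
    for b in range(bins):
        r = recalled[idx == b]
        out.append(np.log((r.sum() + 0.5) / ((~r).sum() + 0.5)))
    return edges, np.array(out)
```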
Object Saliency Predicts Recall Frequency
Although this indicates that observers prefer to recall large, central objects, the correlation between total object saliency and recall frequency (Figure 8.2C) exceeds that of all other measures tested. These object measures are correlated with the object saliency measures and are therefore partially redundant for predicting recall.
Combinations of Properties Predict Recall
B: AUC for predicting recall from total object saliency alone (left bars), fixations alone (middle bars), or a combination of both (right bars). D: AUC for prediction by maximum object saliency (left), maximum object saliency combined with fixations (middle), and maximum object saliency combined with object area (right).
Saliency Predicts a Scene’s Most Characteristic Object
Therefore, total object saliency predicts the most frequently named object significantly better than uniform random selection. How does the object saliency measure compare with other measures in predicting the most frequently named object?
Saliency Peak Falls on Frequently Named Objects
On this view, early saliency does not direct attention directly (black path in Figure 9.3) but acts through its correlation with object properties (red path); saliency maps thus model object saliency rather than location saliency.
Methods
The outline of this chapter is as follows: Section 10.1 discusses previous work toward the goal of detecting objects independent of category membership. Ma and Zhang use fuzzy growing to extract connected components of a contrast-based saliency map as detections [81].
Datasets
Objects of categories other than the 20 selected are not annotated; we find that annotated objects make up only one tenth of the visible objects (Table 10.2). These objects are not annotated in the dataset, so Endres and Hoiem added annotations that conform to the original boundaries.
Benchmark
New Dataset
So if 75% of viewers produce at least three consistent boxes for an image, then acceptance requires two consistent boxes. Ground-truth boxes were generated as follows: all consistent boxes for an image were clustered by single linkage with a cutoff of $o_a = 0.5$.
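A sketch of this clustering step using SciPy, with distance 1 − overlap so that the $o_a = 0.5$ cutoff becomes a linkage distance of 0.5 (the IoU-style overlap measure is assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def box_iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cluster_boxes(boxes, cutoff=0.5):
    """Single-linkage clustering of boxes with distance 1 - IoU and a
    cutoff corresponding to o_a = 0.5; returns one label per box."""
    n = len(boxes)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = 1.0 - box_iou(boxes[i], boxes[j])
    return fcluster(linkage(squareform(d), method="single"),
                    t=1.0 - cutoff, criterion="distance")
```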
Metrics
A different approach, used by Alexe et al., allows multiple matches and defines a signal-to-noise ratio and a detection rate. Correctly finding just one object out of dozens can still yield a detection rate of 1 and a signal-to-noise ratio ≥ 0.5.
Benchmarking Previous Methods
The precision and recall results suggest that, while Endres and Hoiem outperform the other methods, there is considerable room for improvement. In Figure 11.2B, good performance is indicated by fewer detections recalling more objects.
Predicting Box Importance