We postulate that a number of different factors may cause errors (false alarms, missed targets) in detection and localization. First, if the dataset contains objects that look similar to the target ("red herrings" or distractor objects), it is likely that some of them will be mistakenly detected as instances of the target. Another case is when the objects themselves are difficult to detect due to weak "signal strength."
This is often the case, for example, when target objects are very small in pixel size, or when their contrast with respect to the background is faint. Other times it is difficult to decide whether an object in the image is a single or a double instance of the target object due to "crowding." An example of this, if the target category is people, would be one person almost completely occluding another: some annotators would see one person, while others would see two people standing very close together.

Figure 6.2: Annotator precision-recall for each dataset (People, Yellow Cabs, Pools, Diagonal Bars, Cell Nuclei P1, Cell Nuclei P2); each panel plots precision against recall. Each dot denotes one annotator. The curve is obtained by thresholding the number of clicks that each image location received. The bottom two plots show the precision vs. recall of the same annotations using the ground truth provided by two different experts; each expert's precision-recall point vis-à-vis the other expert's ground truth is shown as a green dot.
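As an illustration of how each curve is traced out, the sketch below assumes the raw clicks have already been merged into candidate locations, each with a click count (the consolidation step is described in Section 6.4), and that a candidate lying within a hypothetical match_radius of an unclaimed ground-truth object counts as a true positive; sweeping a threshold over the click counts then yields one precision-recall point per threshold. The matching rule and radius are assumptions for illustration, not necessarily the criterion used in our experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pr_from_click_counts(candidates, counts, ground_truth, match_radius=10.0):
    """One precision-recall point per click-count threshold.

    candidates   : (M, 2) merged click locations (x, y)
    counts       : (M,)  number of annotators who clicked each location
    ground_truth : (G, 2) true object locations
    """
    precisions, recalls = [], []
    for t in sorted(set(counts.tolist())):
        kept = candidates[counts >= t]          # keep locations with >= t clicks
        if len(kept) == 0:
            break
        dist = cdist(kept, ground_truth)        # pairwise candidate-to-truth distances
        claimed, tp = set(), 0
        for i in np.argsort(dist.min(axis=1)):  # greedy matching, closest pairs first
            j = int(dist[i].argmin())
            if dist[i, j] <= match_radius and j not in claimed:
                claimed.add(j)
                tp += 1
        precisions.append(tp / len(kept))
        recalls.append(tp / len(ground_truth))
    return np.array(precisions), np.array(recalls)
```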
To highlight different issues in providing object detection annotations, we gathered five datasets. Each dataset exposes some of the issues discussed above to varying extents. The following sections describe each dataset in detail. With the exception of the last dataset, ground truth was established either by human operators who had access to higher resolution versions of the same images, or by human experts.
6.3.1 Swimming Pools
We collected 100 satellite images of a wealthy suburban neighborhood using Google Maps. The annotators were shown images of size 500×500 pixels, with a resolution of approximately 1 pixel per meter, and asked to look for swimming pools. The task is quite challenging, as typical swimming pools are between 8 and 20 pixels wide and often lie in the shade of neighboring trees, so their signal strength is low. An initial version of the ground truth was provided by a hired annotator using annotation software developed in our group.
6.3.2 New York Yellow Cabs
We collected 240 satellite images of lower Manhattan from Google Maps. The images were sized at 500×500 pixels at a resolution of 2 pixels per meter. We asked the annotators to look for yellow cabs, which are clearly visible in many of the images.
However, in some cases the cabs were driving in the shade of buildings, which obscured their distinctive yellow color (weak signal strength). Similarly, they are sometimes hard to distinguish from white or yellowish non-cab cars (the red herrings). The hired annotator provided ground truth for this dataset on images at twice the resolution (4 pixels per meter).
6.3.3 People
The 521 images in this dataset originated from two sources. The vast majority were sampled from a collection of holiday images with mostly outdoor scenes. The remaining images were street scenes taken with a mobile phone at red-light intersections.
Images were in either portrait or landscape mode, and the largest dimension was 800 pixels to ensure that they could be viewed in a standard web browser. Thirty of the images contained no people at all. Since we consider only two-dimensional spatial detections, we asked annotators to click on the centroid of the head of any person found in the scene. We also explicitly asked them to ignore red herrings such as statues, people-like toys, and photos or posters of people in the images. The task is often quite easy, but can be challenging in crowded scenes or in images where the people are very small in pixel size. However, once a person has been found, it is usually quite clear that it is a person, since people are not easily confused with other objects.
The ground truth was obtained by merging all detection annotations obtained from Amazon Mechanical Turk (MTurk) using a spatial clustering algorithm (see Section 6.4). One of the authors verified each consolidated annotation by overlaying it on images that had 9–16× the resolution of the images sent to the MTurk workers. Missed persons were annotated using the same technique.
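The consolidation algorithm itself is described in Section 6.4; as a rough sketch of the idea, the pooled clicks can be grouped by agglomerative clustering with a fixed distance cutoff, taking each cluster centroid as one consolidated detection and the cluster size as its click count. The cut_distance value below is a hypothetical placeholder rather than the setting used in our experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def consolidate_clicks(clicks, cut_distance=15.0):
    """Merge clicks from many annotators into candidate detections.

    clicks       : (N, 2) array of (x, y) clicks pooled over all annotators
    cut_distance : clicks linked at less than this distance (pixels) share a cluster
    Returns cluster centroids and the number of clicks in each cluster.
    """
    clicks = np.asarray(clicks, dtype=float)
    if len(clicks) < 2:                          # linkage needs at least two points
        return clicks.reshape(-1, 2), np.ones(len(clicks), dtype=int)
    # Single-linkage agglomerative clustering, cut at a fixed distance.
    labels = fcluster(linkage(clicks, method="single"),
                      t=cut_distance, criterion="distance")
    centroids = np.array([clicks[labels == lab].mean(axis=0)
                          for lab in np.unique(labels)])
    counts = np.array([np.sum(labels == lab) for lab in np.unique(labels)])
    return centroids, counts
```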
6.3.4 Cell Nuclei
Patches of 320×320 pixels were cropped from 1500×1500 pixel images of tissue samples [FHW+09] and up-sampled (using cubic interpolation) to 640×640 pixels, yielding 136 images in total. Annotators were asked to click on cell nuclei, following example images and written instructions. Cell nuclei were described as: "Shapes with a clearly defined boundary. They are often (but not always) round in shape. Sometimes only the blue boundary (membrane) of a nucleus can be seen, while the nucleus itself appears white. Nuclei can also appear as brown shapes (stained with a marker)." The task is challenging for several reasons. First, for most annotators this is their first exposure to images of tissue samples and to cell nucleus annotation. Second, it is sometimes hard to know whether an image shows two neighboring nuclei or a single elongated nucleus. Third, some nuclei are strongly ambiguous and can be mistaken for non-nucleus tissue.
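For reference, the patch preparation amounts to cropping followed by cubic up-sampling. The sketch below tiles non-overlapping 320×320 crops and resizes them with OpenCV; the file format of the original tissue images and the exact way crops were sampled are not specified in the text, so both are assumptions here.

```python
import cv2

def make_patches(image_path, patch=320, out_size=640):
    """Crop patches from a tissue image and up-sample them with cubic interpolation."""
    img = cv2.imread(image_path)                 # e.g. a 1500x1500 pixel tissue image
    patches = []
    for y in range(0, img.shape[0] - patch + 1, patch):
        for x in range(0, img.shape[1] - patch + 1, patch):
            crop = img[y:y + patch, x:x + patch]
            patches.append(cv2.resize(crop, (out_size, out_size),
                                      interpolation=cv2.INTER_CUBIC))
    return patches
```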
Two expert pathologists independently provided ground truth detection annotations for the dataset. The pathologists had access to the full-size tissue samples and used a UI in which they could zoom and pan the image. Their task was to draw a circular annotation around each nucleus, thus giving the scale in addition to the location of each nucleus. As ground truth, we used only the center of each circular annotation and discarded the scale information. We kept the annotations from the two pathologists separate and used one set as the ground truth in the experiments, except where stated otherwise.
6.3.5 Diagonal Bars
We generated a more controlled dataset with images containing a 10×10 grid of bars at different orientations. Most bars were oriented at 45°, but in each image N bars were oriented at θ degrees from the diagonal. The annotators were asked to find the bars that were not at the 45° diagonal. We varied N = {5, 10, 20} and θ = {10°, 15°, 30°} and generated 10 random images for each combination, yielding 90 images in total. All images were 800×800 pixels. Since the images were generated synthetically, we had access to the ground truth.
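A minimal sketch of how such an image can be rendered is given below; the bar length, line width, and rendering details are placeholders, since the text does not specify them.

```python
import numpy as np
from PIL import Image, ImageDraw

def diagonal_bars_image(n_off, theta, size=800, grid=10, bar_len=40, seed=0):
    """Render a grid x grid array of bars; n_off of them deviate by theta degrees."""
    rng = np.random.default_rng(seed)
    # Pick which grid cells hold the off-diagonal (target) bars.
    off_idx = set(rng.choice(grid * grid, size=n_off, replace=False).tolist())
    img = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(img)
    cell = size / grid
    for k in range(grid * grid):
        r, c = divmod(k, grid)
        cx, cy = (c + 0.5) * cell, (r + 0.5) * cell           # cell center
        angle = 45.0 + (theta if k in off_idx else 0.0)       # target bars deviate by theta
        dx = 0.5 * bar_len * np.cos(np.radians(angle))
        dy = 0.5 * bar_len * np.sin(np.radians(angle))
        draw.line([(cx - dx, cy - dy), (cx + dx, cy + dy)], fill=0, width=3)
    return img, off_idx                                        # image plus ground-truth cells
```

Calling this with each combination of N and θ above and ten random seeds per combination reproduces the 90-image layout described in the text.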