In this section, we conduct experiments on more challenging data: gray-scale image
sequences. To apply the detection and labeling algorithms, candidate features are
obtained from the Lucas-Tomasi-Kanade [1, 2] feature selector/tracker on pairs of
frames. Figure 3.8 illustrates the approach.
Figure 3.8: Illustration of the approach on gray-scale images. For a given image (a),
features are first selected and tracked to the next frame. Dots in (a) are the features,
and (b) shows the features with velocities. From all the candidate feature points
(with positions and velocities), we want to first decide whether there is a person in
the scene and then find the best labeling, that is, the most human-like configuration
(dark dots in (a) and (b)), according to a learned probabilistic model.
Figure 3.9 shows the hand-constructed probabilistic decomposition for the
experiments. Twenty parts are chosen to represent the human body. The dark dots in
Figure 3.8 show the features representing the parts. Three parts are missing for the
frame in Figure 3.8: two at the left knee and one at the right heel.
[Figure 3.9 graphic: triangulated graph over the twenty parts H, N, LS, RS, LE, RE,
LW, RW, LH, RH, LKO, LKI, RKO, RKI, LA, RA, LT, RT, LHE, RHE, with triangles
numbered 1 to 18.]
Figure 3.9: Decomposition of the human body for gray-scale image experiments. `L'
and `R' in label names indicate left and right. H: head, N: neck, S: shoulder, E: elbow,
W: wrist, H (after L or R): hip, KI: inside knee, KO: outside knee, A: ankle, HE: heel,
and T: toe. The numbers inside triangles give one elimination order.
3.6.1 Data
The image sequences were captured by a CCD camera at 30 Hz. There are three
types of motion: (1) A subject walks from left to right, facing 60 degrees away from
the front view (middle row of Figure 3.10). We have 20 sequences with around 120
frames each. (2) A chair moves from left to right (bottom row of Figure 3.10). 8
sequences, with 120 frames each. (3) While a subject walks as in type (1), a chair
also moves as in type (2) (top row of Figure 3.10). 16 sequences, with 120 frames
each.
Training set: manually tracked data. The model parameters (mean and
covariance of Gaussian) are learned from a training set with the hand-constructed
ground truth labeling. The training sequences include eight type (1) walking sequences.
For the first frame of each sequence, we manually select all the features
corresponding to the body parts in the model of Figure 3.9. The features are then
tracked automatically to the next frame using the Lucas-Tomasi-Kanade tracking
algorithm. The tracking results are monitored, and features with obvious tracking
Figure 3.10: Sample frames from body and chair moving sequences (type (3), top
row), body moving sequences (type (1), middle row), and chair moving sequences
(type (2), bottom row). The dots (either in black or in white) are the features
selected by the Lucas-Tomasi-Kanade [1, 2] algorithm on pairs of frames. The white dots
are the most human-like configuration found by our algorithm.
velocities of features. The labeling (body part assignment of the features) is given
manually. This process is repeated for all the frames.
Testing Set. For the test sequences, features are obtained automatically from
the standard Lucas-Tomasi-Kanade feature selection/tracking algorithm on pairs of
frames. We do not track features over more than two frames, but reselect all the
features at the next frame after tracking, which simulates the arguably most difficult
situation for labeling and detection (as discussed in section 3.3). The dots in Figures
3.8 and 3.10 are features from this procedure. The average number of features detected
in each frame is 64, 46, and 58 for type (1), (2), and (3) sequences, respectively. More
body parts are missing (occluded) in the automatically detected features than in
the manually tracked training data.
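
The two-frame procedure described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: `select` and `track` are hypothetical callables standing in for the Lucas-Tomasi-Kanade selector and tracker, and the velocity is assumed here to be the displacement in pixels per frame.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Candidate = Tuple[float, float, float, float]  # (x, y, vx, vy)


def candidates_per_frame_pair(
    frames: Sequence[object],
    select: Callable[[object], List[Point]],
    track: Callable[[object, object, List[Point]], List[Optional[Point]]],
) -> List[List[Candidate]]:
    """Build candidate features for every consecutive pair of frames.

    `select` and `track` are hypothetical stand-ins for a feature
    selector and a two-frame tracker. Features are reselected at every
    frame, so no feature identity is carried across more than two frames.
    """
    result: List[List[Candidate]] = []
    for prev, nxt in zip(frames, frames[1:]):
        points = select(prev)               # fresh selection in frame t
        tracked = track(prev, nxt, points)  # positions in frame t+1, None if lost
        pair: List[Candidate] = []
        for (x0, y0), new in zip(points, tracked):
            if new is None:
                continue                    # drop features the tracker lost
            # Assumption: velocity is the displacement per frame, in pixels.
            pair.append((x0, y0, new[0] - x0, new[1] - y0))
        result.append(pair)
    return result
```

Each inner list plays the role of the candidate feature set (positions and velocities) that the labeling and detection algorithms receive for one pair of frames.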
3.6.2 Labeling on manually tracked data
To evaluate the hand-crafted decomposable triangulated probabilistic model (Figure
3.9), labeling experiments were performed on the manually tracked data (with ground
truth labeling). For a test sequence, frames from all the other seven sequences were
used to learn the model parameters (mean and covariance of Gaussian). Figure 3.11
(a) shows the statistics of the number of body parts present. Figure 3.11 (b) shows
the correct labeling rate vs. the number of body parts present, with an overall correct
labeling rate of 85.89%. From Figure 3.11 (b), we see that the correct labeling rate goes
up as the number of detected body parts increases, which is consistent with the fact
that with more body parts present, the probability decomposition is a more accurate
approximation.
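
The evaluation protocol above (hold out one sequence, learn from the other seven, then group the correct labeling rate by the number of parts present) can be sketched as below. The data layout is an assumption made for illustration: a labeling is a dictionary from part name to candidate feature index, with `None` for a missing part, and a label counts as correct when a present part is assigned its ground-truth feature.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")
# Assumed layout: part name -> candidate feature index, None if the part is missing.
Labeling = Dict[str, Optional[int]]


def leave_one_sequence_out(sequences: Sequence[S]) -> Iterator[Tuple[List[S], S]]:
    """Yield (training sequences, held-out test sequence) for every sequence."""
    for i, held_out in enumerate(sequences):
        training = [s for j, s in enumerate(sequences) if j != i]
        yield training, held_out


def count_hits(predicted: Labeling, truth: Labeling) -> Tuple[int, int]:
    """Return (correctly labeled parts, parts present) for one frame."""
    present = [part for part, idx in truth.items() if idx is not None]
    hits = sum(1 for part in present if predicted.get(part) == truth[part])
    return hits, len(present)


def rate_by_parts_present(frames: Iterable[Tuple[Labeling, Labeling]]) -> Dict[int, float]:
    """Pool frames by number of parts present and return the correct rate per group."""
    hits_by_n: Dict[int, int] = defaultdict(int)
    total_by_n: Dict[int, int] = defaultdict(int)
    for predicted, truth in frames:
        hits, n_present = count_hits(predicted, truth)
        if n_present == 0:
            continue  # nothing to score in this frame
        hits_by_n[n_present] += hits
        total_by_n[n_present] += n_present
    return {n: hits_by_n[n] / total_by_n[n] for n in sorted(total_by_n)}
```

With eight training sequences, `leave_one_sequence_out` produces eight splits of seven training sequences and one test sequence each.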
3.6.3 Detection and localization
The two detection strategies described in section 3.2 were run on the testing set.
Figure 3.12 (a) shows the receiver operating characteristic (ROC) curves when the
type (3) sequences were used as positive examples and type (2) sequences were used
Figure 3.11: (a) percentage of frames corresponding to the number of body parts
present in the hand-constructed data set; (b) correct labeling rate vs. the number
of body parts present. The chance level of a body part being assigned a correct
candidate feature is around 0.06. The correct rates here are much higher than that.
(2) sequences. The solid lines are the results of the sum-over-all-labelings detection
strategy, and the dashed lines are those of the winner-take-all strategy. This figure
shows that the sum-over-all-labelings strategy performs better than the winner-take-all
strategy for the gray-scale images, which is opposite to the results in section 3.5.
We postulate that this is because, for gray-scale images, there are many close candidate
features for one body part (Figure 3.10) and therefore there are many labelings
close to the `correct' labeling, which makes the sum-over-all-labelings strategy a closer
approximation.
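
The difference between the two detection statistics can be illustrated with a small sketch. It assumes that the log-likelihoods of the candidate labelings are available as an explicit list, which is only practical for toy examples; it does not reproduce the efficient computation over the triangulated decomposition, and the function names are illustrative.

```python
import math
from typing import Sequence


def winner_take_all(log_liks: Sequence[float]) -> float:
    """Detection statistic from the single best labeling (log domain)."""
    return max(log_liks)


def sum_over_all_labelings(log_liks: Sequence[float]) -> float:
    """Detection statistic from all labelings: log of the summed likelihoods.

    Uses the log-sum-exp trick so that very small likelihoods do not underflow.
    """
    m = max(log_liks)
    return m + math.log(sum(math.exp(v - m) for v in log_liks))


# Toy values (made up for illustration): one dominant labeling vs. many close ones.
peaked = [-10.0, -30.0, -32.0]
crowded = [-10.0, -10.2, -10.5, -11.0]

for name, log_liks in (("peaked", peaked), ("crowded", crowded)):
    gap = sum_over_all_labelings(log_liks) - winner_take_all(log_liks)
    print(f"{name}: sum minus max = {gap:.4f}")
```

When one labeling dominates, the two statistics nearly coincide; when many labelings are close to the best one, as postulated for the gray-scale data, the best labeling alone accounts for a smaller share of the total likelihood and the two statistics differ more.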
Figure 3.10 gives the localization results. For each image, the white dots give the
best labeling. For most frames, the person is localized correctly. However, for some
frames, the features making up the best configuration can be far away from each
other, e.g., the third image in the top row (Figure 3.10). A detailed study finds that
the program took the two dots on the wall as `left elbow and left wrist', and the
four dots on the chair as `left outside knee, left ankle, left toe and left heel'. This
is because for the triangulated decomposition in Figure 3.9, if `left shoulder and left
hip' are missing, then both `left elbow and left wrist' and `left outside knee, left ankle,
left toe and left heel' are disconnected from the other body parts. Therefore, the optimal
Figure 3.12: ROC curves (detection rate vs. false alarm rate). (a) Results of images
with body and chair vs. images with chair only. (b) Results of images with body only
vs. images with chair only. Solid line: the sum-over-all-labelings detection strategy;
dashed line: the winner-take-all detection strategy.
other. It is clear that in this case the conditional independence required by equation
(3.7) is no longer a good approximation. We will discuss this problem further
in sections 5.4.2 and 7.5.
3.6.4 Using information from multiple frames
Here we tested how the detection rate improves when information is integrated over time,
using the approach described in section 3.3. Type (3) and type (1) sequences were
used. Figure 3.13 (a) shows ROC curves obtained using 1 to 4 pairs of frames, respectively.
Figure 3.13 (b) plots the detection rates (with $P_{\mathrm{detect}} = 1 - P_{\mathrm{false\ alarm}}$) vs. the
number of frames integrated. With more frames used, the detection rate gets higher.
The detection rate is more than 98% when more than 7 frames (around 200 ms) were
used.
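
The operating point used for these detection rates, where the detection rate equals one minus the false alarm rate, can be read off an empirical ROC curve as in the sketch below. It assumes each test example (a frame or a group of integrated frames) has a scalar detection score that is compared against a threshold. With finitely many scores the condition is rarely met exactly, so the ROC point closest to it is returned. The function names are illustrative.

```python
from typing import List, Sequence, Tuple


def roc_points(pos_scores: Sequence[float],
               neg_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false alarm rate, detection rate) pairs, one per threshold.

    A sample is declared a detection when its score is >= the threshold.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for th in thresholds:
        p_detect = sum(s >= th for s in pos_scores) / len(pos_scores)
        p_false_alarm = sum(s >= th for s in neg_scores) / len(neg_scores)
        points.append((p_false_alarm, p_detect))
    return points


def equal_error_detection_rate(pos_scores: Sequence[float],
                               neg_scores: Sequence[float]) -> float:
    """Detection rate at the ROC point closest to P_detect = 1 - P_false_alarm."""
    p_false_alarm, p_detect = min(
        roc_points(pos_scores, neg_scores),
        key=lambda point: abs(point[1] - (1.0 - point[0])),
    )
    return p_detect
```

Applying `equal_error_detection_rate` to scores computed from 1, 2, 3, ... integrated frames gives one detection rate per setting, which is the kind of curve Figure 3.13 (b) reports.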