
Experiments on gray-scale image sequences

In this section, we conduct experiments on more challenging data: gray-scale image sequences. To apply the detection and labeling algorithms, candidate features are obtained from the Lucas-Tomasi-Kanade [1, 2] feature selector/tracker on pairs of frames. Figure 3.8 illustrates the approach.

Figure 3.8: Illustration of the approach on gray-scale images. For a given image (a), features are first selected and tracked to the next frame. Dots in (a) are the features, and (b) shows the features with velocities. From all the candidate feature points (with positions and velocities), we want to first decide whether there is a person in the scene and then find the best labeling, i.e., the most human-like configuration (dark dots in (a) and (b)), according to a learned probabilistic model.

Figure 3.9 shows the hand-constructed probabilistic decomposition for the experiments. Twenty parts are chosen to represent the human body. The dark dots in Figure 3.8 show the features representing the parts. Three parts are missing for the frame in Figure 3.8: two at the left knee and one at the right heel.

[Figure 3.9 graph: nodes H, N, LS, RS, LE, RE, LW, RW, LH, RH, LKO, LKI, RKO, RKI, LA, RA, LT, RT, LHE, RHE; triangles numbered 1 to 18.]

Figure 3.9: Decomposition of the human body for gray-scale image experiments. `L' and `R' in label names indicate left and right. H: head, N: neck, S: shoulder, E: elbow, W: wrist, H: hip (in LH and RH), KI: inside knee, KO: outside knee, A: ankle, HE: heel, and T: toe. The numbers inside triangles give one elimination order.

3.6.1 Data

The image sequences were captured by a CCD camera at 30 Hz. There are three types of motion: (1) A subject walks from left to right, facing 60 degrees away from the front view (middle row of Figure 3.10). There are 20 such sequences, with around 120 frames each. (2) A chair moves from left to right (bottom row of Figure 3.10). There are 8 such sequences, with 120 frames each. (3) While a subject walks as in type (1), a chair also moves as in type (2) (top row of Figure 3.10). There are 16 such sequences, with 120 frames each.

Figure 3.10: Sample frames from body and chair moving sequences (type (3), top row), body moving sequences (type (1), middle row), and chair moving sequences (type (2), bottom row). The dots (either in black or in white) are the features selected by the Lucas-Tomasi-Kanade [1, 2] algorithm on pairs of frames. The white dots are the most human-like configuration found by our algorithm.

Training set: manually tracked data. The model parameters (mean and covariance of Gaussian) are learned from a training set with hand-constructed ground truth labeling. The training sequences include eight type (1) walking sequences. For the first frame of each sequence, we manually select all the features corresponding to the body parts in the model of Figure 3.9. The features are then tracked automatically to the next frame using the Lucas-Tomasi-Kanade tracking algorithm. The tracking results are monitored, and features with obvious tracking [...] velocities of features. The labeling (body part assignment of the features) is given manually. This process is repeated for all the frames.
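As an illustration of the parameter learning step, the following is a minimal sketch of estimating a Gaussian mean and covariance from labeled training vectors. The function name `mean_and_covariance`, the choice of the unbiased (n - 1) estimator, and the toy (x, y, vx, vy) vectors are assumptions made for this example; the text does not specify which estimator or vector layout was used.

```python
from typing import List, Sequence, Tuple


def mean_and_covariance(
    samples: Sequence[Sequence[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Sample mean and unbiased sample covariance of equal-length vectors."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a covariance")
    d = len(samples[0])
    # Component-wise average over all training vectors.
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    # Unbiased estimator (divide by n - 1); this choice is an assumption.
    cov = [
        [
            sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov


# Hypothetical toy data: (x, y, vx, vy) of one body part in three frames.
toy = [(10.0, 20.0, 1.0, 0.0), (12.0, 21.0, 1.5, 0.5), (14.0, 22.0, 2.0, 1.0)]
mean, cov = mean_and_covariance(toy)
```

In practice the vectors would stack the positions and velocities of the parts that appear together in one factor of the decomposition, so the covariance also captures how the parts move relative to each other.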

Testing set. For the test sequences, features are obtained automatically from the standard Lucas-Tomasi-Kanade feature selection/tracking algorithm on pairs of frames. We do not track features over more than two frames, but reselect all the features at the next frame after tracking, which simulates arguably the most difficult situation for labeling and detection (as discussed in section 3.3). The dots in Figures 3.8 and 3.10 are features from this procedure. The average number of features detected in each frame is 64, 46, and 58 for type (1), (2), and (3) sequences, respectively. More body parts are missing (due to occlusion) in the automatically detected features than in the manually tracked training data.
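The two-frame procedure can be sketched as follows: each feature selected in one frame and tracked to the next yields one candidate with a position and a velocity, and nothing is carried over to later frames. The names `CandidateFeature` and `features_from_pair`, and the choice of pixels per second as the velocity unit, are assumptions for illustration; the feature selection and tracking themselves are taken as given.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

FRAME_RATE_HZ = 30.0  # capture rate stated in the text


@dataclass(frozen=True)
class CandidateFeature:
    x: float   # position in the first frame of the pair (pixels)
    y: float
    vx: float  # velocity in pixels per second (unit is an assumption)
    vy: float


def features_from_pair(
    pts_t: Sequence[Tuple[float, float]],
    pts_next: Sequence[Tuple[float, float]],
    frame_rate_hz: float = FRAME_RATE_HZ,
) -> List[CandidateFeature]:
    """Turn matched point pairs from two consecutive frames into candidates."""
    if len(pts_t) != len(pts_next):
        raise ValueError("each selected feature needs exactly one tracked match")
    return [
        CandidateFeature(
            x0, y0, (x1 - x0) * frame_rate_hz, (y1 - y0) * frame_rate_hz
        )
        for (x0, y0), (x1, y1) in zip(pts_t, pts_next)
    ]


# One feature that moves 1 pixel right and 2 pixels up between frames.
candidates = features_from_pair([(10.0, 20.0)], [(11.0, 18.0)])
```

Because features are reselected at every frame, the candidate lists of consecutive frame pairs are independent of each other and carry no identity information.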

3.6.2 Labeling on manually tracked data

To evaluate the hand-crafted decomposable triangulated probabilistic model (Figure 3.9), labeling experiments were performed on the manually tracked data (with ground truth labeling). For each test sequence, frames from the other seven sequences were used to learn the model parameters (mean and covariance of Gaussian). Figure 3.11 (a) shows the statistics of the number of body parts present. Figure 3.11 (b) shows the correct labeling rate vs. the number of body parts present; the overall correct labeling rate is 85.89%. From Figure 3.11 (b), we see that the correct labeling rate goes up as the number of detected body parts increases, which is consistent with the fact that with more body parts present, the probability decomposition is a more accurate approximation.
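The evaluation protocol above can be sketched as a leave-one-sequence-out loop plus a per-frame tally of correctly labeled parts. This is a minimal sketch under stated assumptions: a labeling is represented as a dict from part name to feature index, a part counts as correct when the predicted index equals the ground-truth index, and the helper names are hypothetical. The exact scoring rule used for the reported rate is not given in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Labeling = Dict[str, int]  # part name -> index of the assigned feature


def leave_one_out_splits(
    sequence_ids: Sequence[str],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (test sequence, training sequences) for every sequence."""
    for held_out in sequence_ids:
        yield held_out, [s for s in sequence_ids if s != held_out]


def labeling_rate_by_parts(
    frames: Iterable[Tuple[Labeling, Labeling]],
) -> Tuple[Dict[int, float], float]:
    """Correct labeling rate per number of parts present, and overall."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for truth, predicted in frames:
        n_present = len(truth)  # parts present in this frame
        for part, feature_index in truth.items():
            total[n_present] += 1
            correct[n_present] += int(predicted.get(part) == feature_index)
    per_count = {n: correct[n] / total[n] for n in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_count, overall


# Toy frames: (ground truth, predicted labeling).
toy_frames = [
    ({"H": 0, "N": 1}, {"H": 0, "N": 2}),
    ({"H": 0, "N": 1, "LS": 2}, {"H": 0, "N": 1, "LS": 2}),
]
per_count, overall = labeling_rate_by_parts(toy_frames)
splits = list(leave_one_out_splits(["s1", "s2", "s3"]))
```

Grouping by the number of parts present gives the quantity plotted against the horizontal axis of Figure 3.11 (b).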

3.6.3 Detection and localization

Figure 3.11: (a) Percentage of frames corresponding to the number of body parts present in the hand-constructed data set; (b) correct labeling rate vs. the number of body parts present. The chance level of a body part being assigned a correct candidate feature is around 0.06. The correct rates here are much higher than that.

The two detection strategies described in section 3.2 were run on the testing set. Figure 3.12 (a) shows the receiver operating characteristic (ROC) curves when the type (3) sequences were used as positive examples and type (2) sequences were used [...] (2) sequences. The solid lines are the results of the sum-over-all-labelings detection strategy, and the dashed lines are those of the winner-take-all strategy. This figure shows that the sum-over-all-labelings strategy performs better than the winner-take-all strategy for the gray-scale images, which is the opposite of the results in section 3.5. We postulate that this is because, for gray-scale images, there are many close candidate features for one body part (Figure 3.10) and therefore many labelings close to the `correct' labeling, which makes the sum-over-all-labelings strategy a closer approximation.
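The difference between the two strategies can be illustrated on a toy list of labeling log-likelihoods. This is only a sketch: it assumes the log-likelihood of every labeling is available as a plain list, whereas the actual computation over the triangulated model (section 3.2) is not reproduced here, and the function names and numbers are hypothetical.

```python
import math
from typing import Sequence


def winner_take_all_score(log_likelihoods: Sequence[float]) -> float:
    """Score a frame by its single best labeling."""
    return max(log_likelihoods)


def sum_over_all_labelings_score(log_likelihoods: Sequence[float]) -> float:
    """Score a frame by the log of the summed likelihoods (log-sum-exp)."""
    m = max(log_likelihoods)
    # Subtracting the maximum keeps the exponentials numerically stable.
    return m + math.log(sum(math.exp(v - m) for v in log_likelihoods))


# One dominant labeling: the two scores nearly coincide.
isolated = [-10.0, -30.0, -30.0]
# Many labelings close to the best one: the summed score is clearly larger.
crowded = [-10.0, -10.1, -10.2, -10.3]

gap_isolated = sum_over_all_labelings_score(isolated) - winner_take_all_score(isolated)
gap_crowded = sum_over_all_labelings_score(crowded) - winner_take_all_score(crowded)
```

When many labelings lie close to the best one, as in the crowded case, the best labeling alone accounts for only part of the total likelihood, which fits the explanation given above for the gray-scale results.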

Figure 3.12: ROC curves (detection rate vs. false alarm rate). (a) Results of images with body and chair vs. images with chair only. (b) Results of images with body only vs. images with chair only. Solid line: the sum-over-all-labelings detection strategy; dashed line: the winner-take-all detection strategy.

Figure 3.10 gives the localization results. For each image, the white dots give the best labeling. For most frames, the person is localized correctly. However, for some frames, the features making up the best configuration can be far away from each other, e.g., the third image in the top row (Figure 3.10). A detailed study finds that the program took the two dots on the wall as `left elbow and left wrist', and the four dots on the chair as `left outside knee, left ankle, left toe and left heel'. This is because, for the triangulated decomposition in Figure 3.9, if `left shoulder and left hip' are missing, then both `left elbow and left wrist' and `left outside knee, left ankle, left toe and left heel' are disconnected from the other body parts. Therefore, the optimal [...] other. It is clear that in this case the conditional independence required by equation (3.7) is no longer a good approximation. We will address this problem further in sections 5.4.2 and 7.5.

3.6.4 Using information from multiple frames

Here we tested how the detection rates improved by integrating information over time, using the approach described in section 3.3. Type (3) and type (1) sequences were used. Figure 3.13 (a) shows the ROC curves obtained using 1 to 4 pairs of frames, respectively. Figure 3.13 (b) plots the detection rates (with $P_{\mathrm{detect}} = 1 - P_{\mathrm{false\ alarm}}$) vs. the number of frames integrated. With more frames used, the detection rate gets higher. The detection rate is more than 98% when more than 7 frames (around 200 ms) were used.
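The operating point at which the detection rate equals one minus the false alarm rate can be read off an empirical ROC curve. The following is a minimal sketch, assuming each sequence has already been reduced to one scalar detection score; the function names and the toy scores are hypothetical, and the way scores are combined across frames (section 3.3) is not reproduced.

```python
from typing import List, Sequence, Tuple


def roc_points(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Empirical ROC as (false alarm rate, detection rate) per threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = []
    for t in thresholds:
        detection = sum(s >= t for s in pos_scores) / len(pos_scores)
        false_alarm = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((false_alarm, detection))
    return points


def equal_error_detection_rate(points: Sequence[Tuple[float, float]]) -> float:
    """Detection rate at the ROC point closest to detection = 1 - false alarm."""
    _, detection = min(points, key=lambda p: abs(p[1] - (1.0 - p[0])))
    return detection


# Toy scores: higher means "more likely to contain a person".
positives = [0.9, 0.8, 0.7, 0.3]
negatives = [0.6, 0.4, 0.2, 0.1]
rate = equal_error_detection_rate(roc_points(positives, negatives))
```

With a finite number of sequences the ROC curve is a step function, so the sketch picks the nearest empirical point rather than interpolating between thresholds.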
