Detection and Labeling
Thesis by
Yang Song
In Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
California Institute of Technology
Pasadena, California
2003
(Defended November 13, 2002)
© 2003
Yang Song
Acknowledgements
First I would like to thank my advisor, Pietro Perona, for admitting me into Caltech and for showing me what scientific research is all about. He played a very important role in leading me towards scientific maturity. I am grateful for his support through the years on both scientific and personal matters.
I am grateful to my candidacy and defense committees, for serving on my committee and for sharing their comments: Yaser Abu-Mostafa, Jehoshua Bruck, Richard Murray, Stefano Soatto, Jim Arvo, Mike Burl, and Michelle Effros.
I am grateful to Luis Goncalves, my closest collaborator over several years. I benefited very much from many stimulating discussions with him and from his consistent encouragement. He was also very helpful in collecting the data set in Chapter 6.
I am grateful to Xiaolin Feng and Enrico Di Bernardo for collaboration on the experiments in Chapter 3 and for the motion capture data, to Charless Fowlkes for bringing the structure learning problem to our attention and for discussions on mixtures of trees, and to Max Welling for some inspiring discussions.
I would like to thank my fellow graduate students, Anelia Angelova, Christophe Basset, Arrigo Benedetti, Jean-Yves Bouguet, Domitilla Del Vecchio, Claudio Fanti, Rob Fergus, Pierre Moreels, Fei-Fei Li, Mario Munich, Marzia Polito, and Silvio Savarese, for making the Vision Lab at Caltech a resourceful and pleasant place to work. I am grateful to the systems managers, Dimitris Sakellariou, Naveed Near-Ansari, Bob Freeman, Joseph Chiu, and Michael Potter, for keeping the computers working well. I am also grateful to Catherine Stebbins, Malene Hagen, Lavonne Martin, and Melissa Slemin for their help on administrative matters.
I would like to thank my friends outside the vision lab, Huayan Wang, Hong Xiao, Chengxiang (Rena) Yu, Qian Zhao, Yue Qi, Lifang Li, Hanying Feng, Tianxin Chen, Zhiwen Liu, Lu Sun, Xiaoyun Zhu, and Xubo Song, for their help on various aspects of my life.
Last, but certainly not least, I would like to express my deepest gratitude to my family. I am grateful to my parents for their unconditional love and confidence in me, for their support during the hardest times, and for their patience during this long adventure. I am grateful to my husband, Xiao-chang, for his understanding and support, for his sacrifices in taking on extra family work, and for providing me with much everyday wisdom. Finally, all of the work becomes meaningful because of my lovely daughter, Myra Miaobo, who has been very supportive by not crying much and giving me peace of mind. She motivates me to achieve more in life.
List of Publications
Work related to this thesis has been or will be presented in the following papers:
Unsupervised Learning of Human Motion,
Y. Song, L. Goncalves and P. Perona, submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence.
Monocular Perception of Biological Motion in Johansson Displays,
Y. Song, L. Goncalves, E. Di Bernardo and P. Perona, Computer Vision and Image Understanding, vol. 81, no. 3, pages 303-327, 2001.
Learning Probabilistic Structure for Human Motion Detection,
Y. Song, L. Goncalves and P. Perona, Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. II, pages 771-777, December 2001.
Unsupervised Learning of Human Motion Models,
Y. Song, L. Goncalves and P. Perona, Advances in Neural Information Processing Systems 14, December 2001.
Towards Detection of Human Motion,
Y. Song, X. Feng and P. Perona, Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. I, pages 810-817, June 2000.
Monocular perception of biological motion - clutter and partial occlusion,
Y. Song, L. Goncalves, and P. Perona, Proc. of 6th European Conference on Computer Vision, vol. II, pages 719-733, June/July 2000.
Monocular perception of biological motion - detection and labeling,
Y. Song, L. Goncalves, E. Di Bernardo and P. Perona, Proc. of 7th International Conference on Computer Vision, pages 805-812, September 1999.
A computational model for motion detection and direction discrimination in humans,
Y. Song and P. Perona, IEEE Computer Society Workshop on Human Motion, pages 11-16, December 2000.
Abstract
Human motion analysis is a very important task for computer vision with many potential applications. There are several problems in human motion analysis: detection, tracking, and activity interpretation. Detection is the most fundamental problem of the three, but remains untackled due to its inherent difficulty. This thesis develops a solution to the problem. It is based on a learned probabilistic model of the joint positions and velocities of the body parts, where detection and labeling are performed by hypothesis testing on the maximum a posteriori estimate of the pose and motion of the body. To achieve efficiency in learning and testing, a graphical model is used to approximate the conditional independence of human motion. This model is also shown to provide a natural way to deal with clutter and occlusion.

One key factor in the proposed method is the probabilistic model of human motion. In this thesis, an unsupervised learning algorithm that can obtain the probabilistic model automatically from unlabeled training data is presented. The training data include useful foreground features as well as features that arise from irrelevant background clutter. The correspondence between parts and detected features is also unknown in the training data. To learn the best model structure as well as the model parameters, a variant of the EM algorithm is developed where the labeling of the data (part assignments) is treated as a hidden variable. We explore two classes of graphical models: trees and decomposable triangulated graphs, and find that the latter are superior for our application. To better model human motion, we also consider the case where the model consists of mixtures of decomposable triangulated graphs.

The efficiency and effectiveness of the algorithm have been demonstrated by applying it to generate models of human motion automatically from unlabeled image sequences, and testing the learned models on a variety of sequences. We find detection rates of over 95% on pairs of frames. This is very promising for building a real-life system.
Contents

Acknowledgements
List of Publications
Abstract

1 Introduction
1.1 Motivation for human motion analysis
1.2 Problems in human motion analysis
1.3 Human perception: Johansson experiments
1.4 Approach
1.5 Outline of the thesis

2 The Johansson problem
2.1 Notation and approach
2.2 Decomposable triangulated graphs
2.3 Algorithms
2.4 Experiments
2.4.1 Detection of individual triangles
2.4.2 Performance of different body graphs
2.4.3 Viewpoint invariance
2.4.4 Performance with different motions
2.5 Summary

3 Generalized Johansson problem: clutter and occlusion
3.1 Labeling problem under clutter and occlusion
3.1.2 Approximation of foreground probability density function
3.1.3 Comparison of two labelings and cost functions for dynamic programming
3.2 Detection
3.2.1 Winner-take-all
3.2.2 Summation over all the hypothesis labelings
3.2.3 Discussion
3.3 Integrating temporal information
3.4 Counting
3.5 Experiments on motion capture data
3.5.1 Detection and labeling
3.5.2 Using temporal information
3.5.3 Counting experiments
3.5.4 Experiments on dancing sequence
3.6 Experiments on gray-scale image sequences
3.6.1 Data
3.6.2 Labeling on manually tracked data
3.6.3 Detection and localization
3.6.4 Using information from multiple frames
3.7 Summary

4 Search of optimal decomposable triangulated graph
4.1 Optimization criterion
4.2 Greedy search
4.3 Construction from a maximum spanning tree
4.3.1 Transforming trees into decomposable triangulated graphs
4.3.2 Maximum spanning tree
4.3.3 Greedy transformation
4.4 Computation of differential entropy - translation invariance
4.6 Summary

5 Unsupervised learning of the graph structure
5.1 Brief review of the EM algorithm
5.2 Learning with all foreground parts observed
5.3 Dealing with missing parts (occlusion)
5.4 Experiments
5.4.1 Results on motion capture data
5.4.2 Results on real-image sequences
5.5 Summary

6 Mixtures of decomposable triangulated models
6.1 Definition
6.2 EM learning rules
6.3 Detection and labeling using mixture models
6.4 Experiments
6.4.1 Evaluation of the EM algorithm
6.4.2 Models obtained
6.4.3 Detection and labeling
6.5 Conclusions

7 Decomposable triangulated graphs and junction trees
7.1 Introduction
7.2 Junction trees
7.3 Max-propagation on junction trees
7.4 Comparison between dynamic programming and max-propagation on junction trees
7.5 Justification for the use of decomposable triangulated graphs
7.5.1 Trees vs. decomposable triangulated graphs
7.6 Summary

8 Conclusions and future work
8.1 Summary of main contributions
8.2 Future work

Bibliography
List of Figures

1.1 Human motion analysis.

1.2 Sample frames of Johansson's display. In Johansson's original experiments, a black background was used instead of a white background.

1.3 Diagram of the system on gray-scale images.

2.1 The labeling problem (without clutter and missing points): given the position and velocity of body parts in the image plane (a), we use a probabilistic model to assign the correct labels to the body parts (b). 'L' and 'R' in label names indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, K:knee, A:ankle and F:foot.

2.2 Example of successive elimination of a decomposable triangulated graph, with elimination order (A, B, C, (D E F)).

2.3 Two decompositions of the human body into triangles. 'L' and 'R' in label names indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, K:knee, A:ankle and F:foot. The numbers inside triangles give the index of triangles used in the experiments. In (a) they are also one order in which the vertices are deleted. In (b) the numbers in brackets show one elimination order.

2.4 Examples of non-decomposable triangulated graphs.

2.5 An example of the dynamic programming algorithm applied to a simple graph. The goal is to assign the markers to the variables A, B, C, D, E in the graph such that P(A, B, C, D, E) is maximized.

2.6 Sample frames for the (a) walking sequence W3; (b) happy walking sequence HW; (c) dancing sequence DA. The numbers on the horizontal axes are the frame numbers.
2.7 Local model error rates (percentage of frames for which the correct choice of markers did not maximize each individual triangle probability). Triangle indices are those of the two graph models of Figure 2.3. '+': results for the decomposition of Figure 2.3 (a); 'o': results for the decomposition of Figure 2.3 (b). (a) joint probability model; (b) conditional probability model.

2.8 Probability ratio (correct markers vs. the solution with the highest probability when an error happens). The horizontal axis is the index of frames where the error happens. (a) joint probability ratio for triangle 10 or 25 (RH, LK, RK); (b) conditional probability ratio for triangle 17 (H, N, LS).

2.9 Labeling performance as a function of viewing angle. (a) Solid line: percentage of correctly labeled frames as a function of viewing angle, when the training was done at 90 degrees (frontal view). Dashed line: training was done by combining data from views at 30, 90, and 150 degrees. (b) Labeling performance when the training was done at 0 degrees (right-side view of walker). The dip in performance near 0 degrees is due to the fact that from a side view orthographic projection without body self-occlusions it is almost impossible to distinguish left and right.

2.10 Error rates for individual body parts. 'L' and 'R' in label names indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, K:knee, A:ankle and F:foot. See section 2.4.4.
3.1 Perception of biological motion in real scenes: one has to contend with a large amount of clutter (more than one person in the scene, other objects in the scene are also moving), and a large amount of self-occlusion (typically only half of the body is seen). Observe that segmentation (arm vs. body, left and right leg) is at best problematic.

3.2 Detection and labeling under the conditions of clutter and occlusion: Given the position and velocity of dots in an image plane (a), we want to decide whether a person is present in the scene and find the most probable human configuration. Filled dots in (b) are body parts and circles are background points. Arrows in (a) and (b) show the velocities. (c) is the full configuration of the body. Filled (blackened) dots represent those present in (b), and the '*'s are actually missing (not available to the program). The body part label names are the same as in Figure 2.1.
3.3 Detection and labeling results on motion capture data (under the conditions of clutter and occlusion). (a) ROC curves from the winner-take-all detection strategy. Solid lines: 3 to 8 body parts with 30 background points vs. 30 background points only. The bigger the number of signal points is, the better the ROC is; dashed line: overall ROC considering all the frames used in the six solid ROCs. The stars ('*') on the solid curves are the points corresponding to the threshold where P_D = 1 - P_FA on the dashed overall ROC curve. (b) ROC curves from the sum-over-all-labelings strategy. The experiment settings are the same as (a), except a different detection algorithm is used. (c) detection rate vs. number of body parts displayed. Solid line: from the winner-take-all strategy with regard to the fixed threshold where P_D = 1 - P_FA on the overall ROC curve in (a), with false alarm rate P_FA = 12.97%; dashed line: from the sum-over-all-labelings strategy with regard to the fixed threshold where P_D = 1 - P_FA on the overall ROC curve in (b), with P_FA = 14.96%. (d) correct label rate (label-by-label rate) vs. number of body parts when a person is correctly detected (using the winner-take-all strategy with regard to the same threshold as in (c)).
3.4 Results of integrating multiple frames. (a) ROCs of integrating one to eight frames using only 5 body parts with 30 clutter points present. The more frames integrated, the better the ROC curve is. When more than five frames are used, the ROCs are almost perfect and overlap with the axes. (b) detection rate (when P_detect = 1 - P_false-alarm) vs. number of frames used.
3.5 One sample image of counting experiments. '*' denotes body parts from a person and 'o's are background points. There are three persons (six body parts for each person) with sixty superimposed background points. Arrows are the velocities.

3.6 Results of counting people. Solid line (with *): one person; dashed line (with o): two persons; dash-dot line (with triangles): three persons. Counting is done with regard to the threshold chosen from Figure 3.3 (a). For that threshold the correct rate for recognizing that there is no person in the scene is 95%.

3.7 Results of dancing sequences. (a) Solid lines: ROC curves for 4 to 10 body parts with 30 added background points vs. 30 background points only. The bigger the number of signal points is, the better the ROC is. Dashed line: overall ROC considering all the frames used in the seven solid ROCs. The threshold corresponding to P_D = 1 - P_FA on this curve was used for (b). The stars ('*') on the solid curves are the points corresponding to that threshold. (b) detection rate vs. the number of body parts displayed with regard to a fixed threshold at which P_D = 1 - P_FA on the overall ROC curve in (a). The false alarm rate is 14.67%.
3.8 Illustration of the approach on gray-scale images. For a given image (a), features are first selected and tracked to the next frame. Dots in (a) are the features, and (b) shows the features with velocities. From all the candidate feature points (with positions and velocities), we want to first decide whether there is a person in the scene and then find the best labeling - the most human-like configuration (dark dots in (a) and (b)) according to a learned probabilistic model.

3.9 Decompositions of the human body for gray-scale image experiments. 'L' and 'R' in label names indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, KI:inside knee, KO:outside knee, A:ankle, HE:heel, and T:toe. The numbers inside triangles give one elimination order.

3.10 Sample frames from body and chair moving sequences (type (3), top row), body moving sequences (type (1), middle row), and chair moving sequences (type (2), bottom row). The dots (either in black or in white) are the features selected by the Lucas-Tomasi-Kanade [1,2] algorithm on pairs of frames. The white dots are the most human-like configuration found by our algorithm.

3.11 (a) percentage of frames corresponding to the number of body parts present in the hand-constructed data set; (b) correct labeling rate vs. the number of body parts present. The chance level of a body part being assigned a correct candidate feature is around 0.06. The correct rates here are much higher than that.

3.12 ROC curves. (a) Results of images with body and chair vs. images with chair only. (b) Results of images with body only vs. images with chair only. Solid line: the sum-over-all-labelings detection strategy; dashed line: the winner-take-all detection strategy.

3.13 Results of integrating multiple frames. (a) Four curves are ROCs of integrating 1 to 4 pairs of frames, respectively. The more frames integrated, the better the ROC curve is. (b) detection rate (when P_detect = 1 - P_false-alarm) vs. number of frames used.
4.1 An example of transforming a tree into a decomposable triangulated graph. Figure (a) shows the tree; figure (b) gives a decomposable triangulated graph obtained by adding edges to the tree in (a).

4.2 Decomposable triangulated models for motion capture data. (a) hand-constructed model; (b) model obtained from greedy search (section 4.2); (c) decomposable triangulated model grown from a maximum spanning tree (section 4.3). The solid lines are edges from the maximum spanning tree and the dashed lines are added edges. (d) a randomly generated decomposable triangulated model.

4.3 Likelihood evaluation of graph growing algorithms.

4.4 Evaluation of the algorithms on synthetic data with decomposable triangulated independence. (a) Expected likelihoods of the true models (dashed curve) and of models from greedy search (solid curve). The solid line with error bars shows the expected likelihoods of random triangulated models. (b) Expected likelihood difference from the respective true model, i.e., the results of subtracting the likelihood of the true model. Solid: models from the greedy search (section 4.2); dotted: triangulated models from MST (section 4.3); dash-dot: MST. The solid line with error bars shows the results of random triangulated models.
5.1 Log-likelihood vs. iterations of EM for different random initializations. Iteration 0 means random initializations, iteration 1 is after one iteration, and so on. The results are from motion capture data, assuming that all the foreground parts are observed in the learning algorithm.

5.2 Two decomposable triangulated models for Johansson displays. These models were learned automatically from unlabeled training data. 'L': left; 'R': right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, K:knee, A:ankle.

5.3 Evolution of a model with iterations (from motion capture data).

5.4 Detection and labeling results. (a) and (b) are ROC curves corresponding to the models of Figure 5.2 (a) and (b), respectively. Solid lines: 3 to 8 body parts with 30 background points vs. 30 background points only. The more body parts present, the better the ROC. Dashed line: overall ROC considering all the frames used. The threshold corresponding to P_D = 1 - P_FA on this curve was used for later experiments. The stars ('*') on the solid curves correspond to that threshold. (c) detection rate vs. number of body parts displayed with regard to the fixed threshold. (d) correct label rate (label-by-label rate) vs. number of body parts when a person is correctly detected. In (c) and (d), solid lines (with *) are from the model of Figure 5.2 (a); dashed lines (with o) are from the model of Figure 5.2 (b); and dash-dot lines with triangles are from the hand-crafted model in Figure 2.3 (a) (also see Figure 3.3).

5.5 (a) The mean positions and mean velocities (shown as arrows) of the composed parts selected by the algorithm. (b) The learned decomposable triangulated probabilistic structure. The numbers in brackets show the correspondence of (a) and (b) and one elimination order.

5.6 Sample frames from body and chair moving sequences (top two rows) and body moving sequences (bottom two rows). The dots (either in black or in white) are the features selected by the Lucas-Tomasi-Kanade algorithm on two frames. The white dots are the most human-like configuration found by the automatically learned model (Figure 5.5).

5.7 ROC curves. (a) Results of images with body and chair vs. images with chair only. (b) Results of images with body only vs. images with chair only. Solid line: using the automatically learned model as in Figure 5.5; dashed line: using the model in Figure 3.9 (dashed lines of Figure 3.12).
6.1 Sample images. The text string in parenthesis indicates the image type.

6.2 Evaluation of the EM-like algorithm: log-likelihood vs. iterations of EM for different random initializations. The indices along the x-axis show the number of iterations passed. (a) 12-part 3-cluster single-subject models; (b) 12-part 3-cluster multiple-people models.

6.3 Examples of 12-part 3-cluster models. (a)-(b) are a single-subject model (corresponding to the thick curve in Figure 6.2 (a)), and (c)-(d) are a multiple-people model (corresponding to the thick curve in Figure 6.2 (b)). (a) (or (c)) gives the mean positions and mean velocities (shown as arrows) of the parts for each component model. The number π_i, i = 1, 2, 3, on top of each plot is the prior probability for each component model. (b) (or (d)) is the learned decomposable triangulated probabilistic structure for the models in (a) (or (c)). The letter labels show the body part correspondence.

6.4 ROC curves using the single-subject model as in Figure 6.3 (a). (a) positive walking sequences vs. person biking R-L sequences (b+); (b) positive walking sequences vs. car moving R-L sequences (c+). Solid curves use positive walking sequences of subject LG as positive examples, and dashed curves use sequences of other subjects. (c) is obtained by taking the R-L walking sequences of subject LG as positive examples and the R-L walking sequences of other subjects as negative examples.

6.5 Detection rates vs. types of negative examples. (a) is from the single-subject model (Figure 6.3 (a)), and (b) is from the multiple-people model (Figure 6.3 (b)). Stars (*) with error bars use R-L walking sequences of subject LG as positive examples, and circles (o) with error bars use R-L walking sequences of other subjects. The stars (or circles) show the average detection rates, and error bars give the maximum and minimum detection rates. The performance is measured on pairs of frames. It improves further when multiple pairs in a sequence are considered.

6.6 Detection and labeling results on some images. See text for detailed explanation of symbols.

7.1 Examples of clique trees. (a) and (b) are for the graph in Figure 2.2; (c), (d) and (e) are for the graphs of Figure 2.4 (a,b,c), respectively; (f) and (g) are for the graph in Figure 2.5. (a,c,e,f) are junction trees, and (b,d,g) are not.

7.2 Examples of clique trees with separators. Clique trees are from Figure 7.1.

7.3 A junction tree with separators for the body decomposition graph in Figure 2.3 (a).

7.4 Two cliques V and W with separator S.

7.5 (a) percentage of connected graphs vs. number of vertices present (out of 14). The solid line with stars is for the tree, and the line with triangles for the decomposable triangulated graph. (b) the ratio of connected percentage: decomposable triangulated graphs vs. trees.
List of Tables

2.1 Error rates using the models in Figure 2.3

2.2 Error rates for different sequences. ALL: average over all three sequences; W3: walking sequence; HW: walking in happy mood; DA: dancing sequence

6.1 Types of images used in the experiments. 'L-R' denotes 'from left to right,' and 'R-L' means 'from right to left.' The digits in the parenthesis are the number of sequences by the number of frames in each sequence. For example, (3-4 x 80) means that there are 3 or 4 sequences, with around 80 frames for each sequence. The +/- in the code-names denotes whether movement is R-L or L-R.
Chapter 1 Introduction
This thesis presents a new approach to human motion detection and labeling. In this chapter, we first give the motivation for this work, i.e., why the problem of human motion analysis is important and why this thesis focuses on detecting and labeling human motion. We then briefly describe our approach and give the outline of the thesis.
1.1 Motivation for human motion analysis
Human motion analysis is an important but hard problem in computer vision. Humans are the most important component of our environment. Motion provides a large amount of information about humans and is very useful for human social interactions. The goal of human motion analysis is to extract information about human motion from video sequences. As shown in Figure 1.1, for a given video sequence, we want to develop a computer system/algorithm which can give us a description of the scene. The description should first address whether there are humans in the scene, and if so, how many there are, where they are located, and what they are doing.
Figure 1.1: Human motion analysis. From input image sequences, computer vision algorithms produce the desired output, a description of the scene: Human presence? How many? Where are they? What are they doing?
The potential applications include, but are not limited to:

- Security of airports or big museums: it is very useful if a computer can detect automatically whether someone is doing something suspicious, e.g., trying to grab a piece of artwork.

- The automobile industry: pedestrian detection is very important for transportation safety and for automated navigation.

- Human computer interfaces: we use keyboard, mouse and/or joystick as our input devices. If the computer could recognize what we mean when we point at it and/or give our instructions by our body movement, it would make the computer more user-friendly.
However, human motion analysis is difficult. First of all, the human body is richly articulated - even a simple stick model describing the pose of arms, legs, torso and head requires more than 20 degrees of freedom. The body moves in 3-D, which makes the estimation of these degrees of freedom a challenge in a monocular setting [3,4]. Image processing is also a challenge: humans typically wear clothing which may be loose and textured. This makes it difficult to identify limb boundaries, and even more so to segment the main parts of the body.
1.2 Problems in human motion analysis
A system for interpreting human activity must, first of all, be able to detect human presence. A second important task is to localize the visible parts of the body and assign appropriate labels to the corresponding regions of the image - for brevity we call this the labeling task. Detection and labeling are coupled problems: once we know the body part assignments, we know the presence of a person, and vice versa. Given a labeling, different parts of the body may be tracked in time [5,6,7,3,8,9,10,11]. Their trajectories and/or spatiotemporal energy pattern will allow a classification of the activity.
Among these problems, activity interpretation needs to take the results of detection and tracking as input, whereas tracking algorithms need initialization, which can be provided either by detection or, in the absence of it, by ad hoc heuristics. Hence detection is the most fundamental problem of the three. In the field of computer vision, tracking has recently been an area of much attention, where considerable progress has been made. Detection, on the contrary, remains an open problem and will be the focus of this thesis.
1.3 Human perception: Johansson experiments
Our work on human motion detection and labeling is inspired by human perception. A striking demonstration of the capabilities of the human visual system is provided by the experiments of Johansson [14]. Johansson filmed people acting in total darkness with small light bulbs fixed to the main joints of their body. A single frame (Figure 1.2) of a Johansson movie is nothing but a cloud of identical bright dots on a dark field; however, as soon as the movie is animated, one can readily detect, count, and segment a number of people in a scene, and even assess their activity, age, and sex [15,16,17]. Although such perception is completely effortless, our visual system is ostensibly solving a hard combinatorial problem (the labeling problem - which dot should be assigned to which body part of which person?).
Figure 1.2: Sample frames of Johansson's display. In Johansson's original experiments, a black background was used instead of a white background.
Johansson's experiments prove that motion is an important cue for visual perception. They also illustrate that our visual system has developed a very strong ability in perceiving human motion - we can recognize human motion easily from dots representing the motion of the main joints. This psychophysical evidence inspires us to build a computer algorithm to achieve what human eyes can do.
1.4 Approach
We believe that the human visual system gains the ability of recognizing body motion through learning (daily observation). Hence, rather than modeling the details of the mechanics of the human body, we choose to approach human motion perception as the problem of recognizing a peculiar spatio-temporal pattern which may be learned perceptually. We approach the problem using learning and statistical inference.
We model how a person moves in a probabilistic way. Though different persons move in different styles and the same person moves differently at different times, a certain type of motion must share some common features. Moreover, the proportions of the body are in a similar range despite the differences in human body size. Hence a probabilistic model which captures both the common features and the variance of human motion is very appropriate.
The approach on gray-scale images is shown in Figure 1.3. To detect and label a moving human body, a feature detector/tracker (such as a corner detector) is first used to obtain candidate features from a pair of frames. The combination of features is then selected based on maximum likelihood by using the joint probability density function formed by the position and motion of the body. Detection is performed by thresholding the likelihood (see the lower part of Figure 1.3).
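The detect-by-thresholding flow just described can be sketched in a few lines. This is our own minimal illustration, not the thesis implementation: the Gaussian parameters `mean` and `cov`, the threshold value, and the exhaustive search over assignments are all placeholder assumptions.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# Assume a learned unimodal Gaussian over the stacked (position, velocity)
# vector of all parts; `mean` and `cov` stand in for learned values.
n_parts = 3
dim = 4 * n_parts                        # (x, y, vx, vy) per part
mean = np.zeros(dim)
cov = np.eye(dim)
cov_inv = np.linalg.inv(cov)
log_norm = -0.5 * (dim * np.log(2 * np.pi) + np.log(np.linalg.det(cov)))

def log_likelihood(assignment):
    """Gaussian log-density of one candidate assignment (n_parts x 4)."""
    x = assignment.reshape(-1) - mean
    return log_norm - 0.5 * x @ cov_inv @ x

def detect(candidates, threshold):
    """Score every assignment of candidate features to parts by brute
    force, then declare detection if the best score clears the threshold.
    """
    best = -np.inf
    for perm in permutations(range(len(candidates)), n_parts):
        best = max(best, log_likelihood(candidates[list(perm)]))
    return best > threshold, best

features = rng.normal(size=(6, 4))       # candidate features from a frame pair
present, score = detect(features, threshold=-25.0)
```

The brute-force search over part assignments is exponential in the number of parts; it is exactly what the graphical model of the following chapters replaces with dynamic programming.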
We use point features (from a motion capture system or a corner detector) because they are easier to obtain compared to other types of features, such as body segments, which may be more susceptible to occlusion. Point features are also a natural choice since psychophysics experiments (Johansson's experiments [14]) indicate that the human visual system can perceive vivid human motion from moving dots representing the motion of the human body joints. (We once showed a movie of the top view of one person walking, and it became much harder to recognize that it was a person walking. One reasonable explanation is that it is because we usually ...) However, this does not preclude the use of this algorithm with other types of features.

Figure 1.3: Diagram of the system on gray-scale images. Training: image sequences pass through a feature detector/tracker, and a learning algorithm produces a probabilistic model of human motion from the training data. Testing: features detected and tracked over two frames are matched against the model to answer: Presence of human? Localization of parts? Type of motion?
One key factor in the method is the probabilistic model of human motion. In order to avoid an exponential combinatorial search, a graphical model is used to depict the conditional independence of body parts. Graphical models are a marriage between probability theory and graph theory [18]. We apply them, for the first time, to the problem of human motion detection and labeling. We explore two classes of graphical models: trees and decomposable triangulated graphs, and find that the latter are superior for our application.
At the training stage of our approach, probabilistic independence structures as well
as model parameters are learned from a training set. There are two types of training
data: labeled and unlabeled. In the case of labeled training data, the parts of the model
and the correspondence between the parts and observed features in the training set
are known, e.g., data from a motion capture system. For labeled training data,
we can hand-craft the probabilistic independence structure and estimate the model
parameters (e.g., mean and covariance for a unimodal Gaussian). We use this learning
method in Chapters 2 and 3. In Chapter 4, we tackle a more challenging learning
problem, where algorithms are developed to search for the optimal independence
structure.
In the case of unlabeled training data, probabilistic models are learned from train-
ing features including both useful foreground parts and background clutter, and the
correspondence between the parts and detected features is unknown. The problem
arises when we run a feature detector (such as the Lucas-Tomasi-Kanade detector
[1]) on real image sequences: features are detected both on target objects and on
background clutter, with no identity attached to each feature. From these features,
we wish to know which feature combinations arise in correspondence to a given visual
phenomenon (e.g., a person walking from left to right). In Chapters 5 and 6, we develop
unsupervised algorithms that are able to learn models of human motion completely
automatically from real image sequences, i.e., from unlabeled training features with
clutter and occlusion.
1.5 Outline of the thesis
This thesis is organized as follows.
Chapter 2 considers the problem of labeling a set of observed points when there
is no clutter and no body parts are missing, which we call the `Johansson problem.'
Chapter 3 explains how to extend the algorithm to perform detection and la-
beling in a cluttered and occluded scene, which we call the `generalized Johansson
problem.'
Chapter 4 describes how to learn the conditional independence structure of the
probabilistic model from labeled data.
Chapter 5 addresses the learning problem when the training features are unla-
beled.
Chapter 6 introduces the concept of mixtures of decomposable triangulated mod-
els and extends the unsupervised learning algorithm to the mixture model. This chap-
ter also presents a more comprehensive experimental section than previous chapters.
Chapter 7 puts decomposable triangulated models in the general framework of
graphical models, compares them with trees, and justifies the use of decomposable
triangulated models.
Chapter 8 summarizes the thesis work and indicates possible future research
directions.
Chapter 2 The Johansson problem
In Johansson's human perception experiments, the input to the human visual system
consists of moving dots, and we can get a vivid perception of human motion and assign
body parts (such as hand, elbow, shoulder, knee and foot) to the dots immediately [14].
During this process, our visual system has solved a hard combinatorial problem, the
labeling problem: which dot should be assigned to which body part of which person?
This chapter develops an algorithm providing a solution to the labeling problem when
there is no clutter and no body parts are missing. Since the display is very similar to
that of Johansson's experiments, we call it the `Johansson problem.'
2.1 Notation and approach
As shown in Figure 2.1, given the position and velocity (arrows in the figure) of some
dots in the image plane (Figure 2.1(a)), we want to assign the correct labels to the
dots. Velocity is used to characterize the motion. In our Johansson scenario each
part appears as a single dot in the image plane; therefore, its identity is not revealed
by cues other than its relative position and velocity.
We deploy a probabilistic approach. The body pose and motion are characterized
by the joint probability density of the position and velocity of its parts. Let
S_body = {LW, LE, LS, H, ..., RF} be the set of M body parts; for example, LW is the left
wrist, RF is the right foot, etc. Correspondingly, let X_LW be the vector representing
the position and velocity of the left wrist, X_RF be the vector of the right foot, etc. We
model the pose and motion of the body probabilistically by means of a probability
density function P_Sbody(X_LW, X_LE, X_LS, X_H, ..., X_RF).
(Footnote: In this thesis, the words `dots,' `points,' `markers,' `features' or `point
features' have the same meaning: things observed from the images. We will use them
interchangeably. The words `parts' ...)

[Figure 2.1: The labeling problem (without clutter and missing points): given the
position and velocity of body parts in the image plane (a), we use a probabilistic
model to assign the correct labels to the body parts (b). `L' and `R' in label names
indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist, H:hip, K:knee,
A:ankle and F:foot.]

Suppose that there are N point features in a display. Let X = [X_1, ..., X_N] be
the vector of measurements (each X_i, i = 1, ..., N, is a vector describing the position
and velocity of point i). Here we assume that there are no missing body parts and
no clutter; in this case N = M. Let L = [L_1, ..., L_N] be a vector of labels, where
L_i ∈ S_body is the label of X_i. The labeling problem is to find L*, over all possible
label vectors L, such that the posterior probability of the labeling given the observed
data is maximized, that is,

    L* = argmax_{L ∈ 𝓛} P(L|X)                                    (2.1)

where P(L|X) is the conditional probability of a labeling L given the data X and 𝓛
is the set of all possible labelings. Using Bayes' law:

    P(L|X) = P(X|L) P(L) / P(X)                                   (2.2)

It is reasonable to assume that the priors P(L) are equal for different labelings;
then

    L* = argmax_{L ∈ 𝓛} P(X|L)                                    (2.3)
Given a labeling L, each point feature i has a corresponding label L_i. Therefore
each measurement X_i may also be written as X_{L_i}, i.e., the measurement corresponding
to a specific body part associated with label L_i. For example, if L_i = LW, i.e., the
label corresponding to the left wrist is assigned to the ith point, then X_i = X_LW is
the position and velocity of the left wrist. Then,

    P(X|L) = P_Sbody(X_LW, X_LE, X_LS, X_H, ..., X_RF)            (2.4)

where P_Sbody is the joint probability density function of the position and velocity of
all the M body parts.
Three problems face us at this point: (a) What is the structure of the probabil-
ity/likelihood function to be maximized? (b) How do we estimate its parameters?
(c) How do we reduce the computational cost of the combinatorial search problem of
finding the optimal labeling? Problems (a) and (c) need to be addressed together:
the structure of the probability density function must be such that it allows efficient
optimization.
A brute force solution to the optimization problem is to search exhaustively among
all M! (assuming no clutter, no missing body parts) possible L's and find the best one.
The search cost is factorial with respect to M. Assume M = 16; then the number of
possible labelings is larger than 2×10^13, which is computationally prohibitive.
It is useful to notice that the body is a kinematic chain: for example, the wrist is
connected to the body indirectly via the elbow and the shoulder. One could assume
that the position and velocity of the wrist are independent of the rest of the body
once the position and velocity of the elbow and shoulder are known. This intuition
may be generalized to the whole body: once the position and velocity of a set S of
body parts are known, the behavior of body parts that are separated by S is
independent. Of course, this intuition is only an approximation which needs to be
validated experimentally.
Our intuition on how to decompose the problem may be expressed in the lan-
guage of probability: consider the joint probability density function of 5 random vari-
ables, P(A,B,C,D,E). By Bayes' rule, it may be expressed as P(A,B,C,D,E) =
P(A,B,C)P(D|A,B,C)P(E|A,B,C,D). If these random variables are conditionally
independent as described in the graph of Figure 2.5, then

    P(A,B,C,D,E) = P(A,B,C) P(D|B,C) P(E|C,D)                     (2.5)

Thus, if the body parts can satisfy the appropriate conditional independence con-
ditions, we can express the joint probability density of the pose and velocity of all parts
as the product of conditional probability densities of n-tuples. This approximation
makes the optimization step computationally efficient, as will be discussed below.
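As a sanity check of the factorization idea, the following sketch builds a toy joint density over five binary variables from hypothetical factor tables of the form in equation (2.5) and verifies that the product is a valid (normalized) joint; the tables are invented for illustration only.

```python
# Build a joint over (A,B,C,D,E) as P(A,B,C) * P(D|B,C) * P(E|C,D) and check
# that it sums to one, i.e. the factors define a proper distribution.
import itertools

vals = [0, 1]
raw = {(a, b, c): 1 + a + 2*b + 3*c for a in vals for b in vals for c in vals}
z = sum(raw.values())
p_abc = {k: v / z for k, v in raw.items()}        # hypothetical P(A,B,C)

def p_d_given_bc(d, b, c):                        # hypothetical P(D|B,C)
    p1 = (1 + b + c) / 4.0                        # P(D=1 | b, c)
    return p1 if d == 1 else 1 - p1

def p_e_given_cd(e, c, d):                        # hypothetical P(E|C,D)
    p1 = (1 + 2*c + d) / 5.0                      # P(E=1 | c, d)
    return p1 if e == 1 else 1 - p1

joint = {(a, b, c, d, e):
         p_abc[(a, b, c)] * p_d_given_bc(d, b, c) * p_e_given_cd(e, c, d)
         for a, b, c, d, e in itertools.product(vals, repeat=5)}

print(abs(sum(joint.values()) - 1.0) < 1e-12)     # True: a valid joint density
```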
What is the best decomposition for the human body? What is a reasonable size n
of the groups (or cliques) of body parts? We hope to make n as small as possible to
minimize the cost of the optimization. But as n gets smaller, conditional independence
may not be a reasonable approximation any longer. There is a tradeoff between
computational cost and algorithm performance. We use decomposable triangulated
models with n = 3, as will be discussed below.
2.2 Decomposable triangulated graphs
We use decomposable triangulated graphs† to depict the probabilistic conditional in-
dependence structure of body parts. A decomposable triangulated graph [19] is a
collection of cliques of size three, where there is an elimination order of vertices such
that (1) when a vertex is deleted, it is only contained in one triangle (we call it a
free vertex); (2) after eliminating one free vertex and the two edges associated with
it, the remaining subgraph is again a collection of cliques of size three, until only one
triangle is left.

(† For general graphical models, the term decomposable and the term triangulated
have their own meanings (they are actually equivalent properties [18]). In this thesis,
we use the term decomposable ...)
[Figure 2.2: Example of successive elimination of a decomposable triangulated graph,
with elimination order (A, B, C, (D,E,F)).]
Figure 2.2 shows an example of a decomposable triangulated graph. The cliques
of the graph are {A,B,E}, {B,E,F}, {C,E,F}, and {D,E,F}. One elimination
order of the vertices is A, B, C, with {D,E,F} left as the last clique. Figure 2.2
gives the steps of elimination of vertices following this order. Note that for a fixed
graph structure, the elimination order is not unique. For example, for the graph in
Figure 2.2, another elimination order of vertices is C, D, F with {A,B,E} left as the
last clique.
Figure 2.3 shows two decomposable graphs of the whole body, along with one
order of successive elimination of the cliques.
To better understand the concept of the decomposable triangulated graph, some
graphs which are not decomposable triangulated graphs are given in Figure 2.4. They
are not decomposable triangulated graphs for the following reasons. Figure 2.4(a):
after one free vertex and its associated edges are deleted, the remaining graph is not a
collection of cliques of size three; Figure 2.4(b): there is no free vertex in the graph;
Figure 2.4(c): it is a clique of size four, not a collection of cliques of size three.
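The elimination test above can be sketched in code. The function below is a simplified illustration (not the thesis implementation): it repeatedly deletes free vertices, i.e., degree-2 vertices whose two neighbors are connected, and accepts the graph only if a single triangle remains.

```python
# Check whether a graph is a decomposable triangulated graph by greedy
# free-vertex elimination, following the definition in the text.
def is_decomposable_triangulated(edges, vertices):
    edges = {frozenset(e) for e in edges}
    verts = set(vertices)
    while len(verts) > 3:
        free = None
        for v in verts:
            nbrs = {u for u in verts if frozenset((u, v)) in edges}
            # v is free if it has exactly two neighbors and they are adjacent,
            # so v belongs to exactly one triangle
            if len(nbrs) == 2 and frozenset(nbrs) in edges:
                free = v
                break
        if free is None:
            return False
        for u in list(verts):
            edges.discard(frozenset((u, free)))
        verts.remove(free)
    # exactly one triangle must remain
    return all(frozenset((u, v)) in edges for u in verts for v in verts if u != v)

# The graph of Figure 2.2: cliques {A,B,E}, {B,E,F}, {C,E,F}, {D,E,F}
g = [("A", "B"), ("A", "E"), ("B", "E"), ("B", "F"), ("E", "F"),
     ("C", "E"), ("C", "F"), ("D", "E"), ("D", "F")]
print(is_decomposable_triangulated(g, "ABCDEF"))   # True

# A 4-cycle, as in Figure 2.4(b), has no free vertex:
print(is_decomposable_triangulated(
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")], "ABCD"))   # False
```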
[Figure 2.3: Two decompositions of the human body into triangles. `L' and `R' in
label names indicate left and right. H:head, N:neck, S:shoulder, E:elbow, W:wrist,
H:hip, K:knee, A:ankle and F:foot. The numbers inside triangles give the index of
triangles used in the experiments. In (a) they are also one order in which the vertices
are deleted. In (b) the numbers in brackets show one elimination order.]

[Figure 2.4: Examples of non-decomposable triangulated graphs.]

When decomposable graphs are used to describe conditional independence of ran-
dom variables, the probability density function can be written according to the elim-
ination order of the vertices. For example, following the elimination order given in
Figure 2.2, the joint probability P(A,B,C,D,E,F) can be approximated by

    P(A,B,C,D,E,F) = P(A|B,E) P(B|E,F) P(C|E,F) P(D,E,F)          (2.6)

If we use the other elimination order mentioned above, C, D, F with {A,B,E} left as
the last clique, then the joint probability P(A,B,C,D,E,F) can be written as

    P(A,B,C,D,E,F) = P(C|E,F) P(D|E,F) P(F|B,E) P(A,B,E)          (2.7)

Using Bayes' rule, it is easy to verify that equations (2.6) and (2.7) are equivalent.
For one graph, although we can write different decompositions according to different
elimination orders, they describe the same conditional independence.
In general, let S_body = {S_1, S_2, ..., S_M} be the set of M parts; for example, S_1
denotes the left wrist, S_M is the right foot, etc. X_{S_i}, 1 ≤ i ≤ M, is the measure-
ment for S_i. If the joint probability density function P_Sbody can be decomposed as a
decomposable triangulated graph, it can be written as

    P_Sbody(X_{S_1}, X_{S_2}, ..., X_{S_M})
      = [ Π_{t=1}^{T-1} P_{A_t|B_t,C_t}(X_{A_t} | X_{B_t}, X_{C_t}) ] · P_{A_T,B_T,C_T}(X_{A_T}, X_{B_T}, X_{C_T})      (2.8)

where A_t, B_t, C_t ∈ S_body, 1 ≤ t ≤ T = M − 2, {A_1, A_2, ..., A_T, B_T, C_T} = S_body, and
(A_1,B_1,C_1), (A_2,B_2,C_2), ..., (A_T,B_T,C_T) are the cliques. (A_1, A_2, ..., A_T) gives one
elimination order for the decomposable graph.
The choice of decomposable triangulated graphs is motivated by both computa-
tional and performance reasons. Trees are good examples of modeling conditional
(in)dependence [20, 21]. But decomposable triangulated graphs are more powerful
models than trees, since each node can be thought of as having two parents. Similar
to trees, decomposable triangulated graphs allow efficient algorithms such as dynamic
programming [19]. We will give a more rigorous analysis of why we choose decomposable tri-
angulated graphs in section 7.5. The details of the dynamic programming algorithm
will be discussed in the next section.
2.3 Algorithms
What is needed is an algorithm that will search through all the legal labelings and
find the one that maximizes the global joint probability density function. Notice that
this optimum cannot be obtained by optimizing each triplet (clique of size three)
independently. If the joint probability can be decomposed by a decomposable trian-
gulated graph, dynamic programming can be used to solve this problem efficiently.
The key condition for using dynamic programming is that the problem exhibits op-
timal substructure. For example, suppose we want to find the labeling which maximizes
P(A,B,C,D,E). If equation (2.5) holds, then whatever the choices of A, B, C, D
are, the best E must be the one which maximizes P(E|C,D). Therefore, to get the
best E, we only need to consider the function P(E|C,D) instead of P(A,B,C,D,E).
More formally,
    max_{A,B,C,D,E} P(A,B,C,D,E)
      = max_{A,B,C} ( P(A,B,C) · max_D ( P(D|B,C) · max_E P(E|C,D) ) )
      = max_{A,B,C} ( P(A,B,C) · max_D f(B,C,D) )
      = max_{A,B,C} g(A,B,C)                                      (2.9)

where f(B,C,D) = P(D|B,C) · max_E P(E|C,D) and g(A,B,C) = P(A,B,C) ·
max_D f(B,C,D). Assume each variable can take N possible values. If the maximiza-
tion is performed over P(A,B,C,D,E) directly, then the size of the search space is
N^M (M is the number of variables, M = 5 for this example). By equation (2.9),
the maximization can be achieved by maximizing over P(E|C,D), f(B,C,D) and
g(A,B,C) successively, and the size of the search space is (M−2)N^3.
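The identity in equation (2.9) can be checked numerically. In the sketch below the factor tables are toy nonnegative numbers (not normalized probabilities; the max identity is unaffected by normalization), and the nested right-to-left maximization reproduces the brute-force optimum over all N^5 assignments.

```python
# Nested maximization over a factorized objective equals brute-force search.
import itertools
import random

N = 3
vals = range(N)
random.seed(0)
p_abc = {k: random.random() for k in itertools.product(vals, repeat=3)}
p_d_bc = {k: random.random() for k in itertools.product(vals, repeat=3)}  # key (d,b,c)
p_e_cd = {k: random.random() for k in itertools.product(vals, repeat=3)}  # key (e,c,d)

def joint(a, b, c, d, e):
    return p_abc[(a, b, c)] * p_d_bc[(d, b, c)] * p_e_cd[(e, c, d)]

# brute force over N**5 assignments
brute = max(joint(*x) for x in itertools.product(vals, repeat=5))

# nested maximization, right to left: first f(B,C,D), then g(A,B,C)
f = {(b, c, d): p_d_bc[(d, b, c)] * max(p_e_cd[(e, c, d)] for e in vals)
     for b, c, d in itertools.product(vals, repeat=3)}
g = {(a, b, c): p_abc[(a, b, c)] * max(f[(b, c, d)] for d in vals)
     for a, b, c in itertools.product(vals, repeat=3)}
nested = max(g.values())

print(abs(brute - nested) < 1e-12)   # True: same optimum, far fewer comparisons
```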
If P_Sbody can be decomposed as in equation (2.8), then

    max P_Sbody(X_{S_1}, X_{S_2}, ..., X_{S_M})
      = max_{X_{A_T}, X_{B_T}, X_{C_T}} P_T(X_{A_T}, X_{B_T}, X_{C_T}) · max_{X_{A_{T−1}}} P_{T−1}(X_{A_{T−1}} | X_{B_{T−1}}, X_{C_{T−1}}) · ...
          · max_{X_{A_2}} P_2(X_{A_2} | X_{B_2}, X_{C_2}) · max_{X_{A_1}} P_1(X_{A_1} | X_{B_1}, X_{C_1})      (2.10)

where the `max' operations are computed from right to left.
If we take the probability density function as the cost function, a dynamic pro-
gramming method similar to that described in [19] can be used. For each triplet
(A_t, B_t, C_t), we characterize it with a ten-dimensional feature vector

    x = (v_Ax, v_Bx, v_Cx, v_Ay, v_By, v_Cy, p_Ax, p_Cx, p_Ay, p_Cy)^T      (2.11)
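The feature vector of equation (2.11) can be assembled as follows (a minimal sketch with hypothetical (position, velocity) inputs); note that it is unchanged when all positions are translated by the same offset.

```python
# Build the 10-D triangle feature of equation (2.11): velocities of the three
# parts, then positions of A_t and C_t relative to B_t (translation invariance).
def triangle_feature(pA, vA, pB, vB, pC, vC):
    """Each p*, v* is an (x, y) tuple; returns the 10-D vector x."""
    return (
        vA[0], vB[0], vC[0],            # x-direction (horizontal) velocities
        vA[1], vB[1], vC[1],            # y-direction (vertical) velocities
        pA[0] - pB[0], pC[0] - pB[0],   # positions relative to B_t (x)
        pA[1] - pB[1], pC[1] - pB[1],   # positions relative to B_t (y)
    )

x = triangle_feature((3, 4), (0.5, 0.0), (1, 1), (0.2, -0.1), (2, 0), (0.0, 0.3))
print(len(x))   # 10
```

Shifting all three positions by the same amount leaves x unchanged, which is exactly the translation invariance the relative coordinates are meant to provide.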
The first three dimensions of x are the x-direction (horizontal) velocities of body parts
(A_t, B_t, C_t), the next three are the velocities in the y-direction (vertical), and the last
four dimensions are the positions of body parts A_t and C_t relative to B_t. Relative
positions are used here so that we can obtain translation invariance. As a first-
order approximation, it is convenient to assume that x is jointly Gaussian-distributed,
and therefore its parameters may be estimated from training data using standard
techniques. After the joint probability density function is computed, the conditional
one can be obtained accordingly:
    P_{A_t|B_t,C_t}(X_{A_t} | X_{B_t}, X_{C_t}) = P_{A_t,B_t,C_t}(X_{A_t}, X_{B_t}, X_{C_t}) / P_{B_t,C_t}(X_{B_t}, X_{C_t})      (2.12)

where P_{B_t,C_t}(X_{B_t}, X_{C_t}) can be obtained by estimating the joint probability density
function of the vector (v_Bx, v_Cx, v_By, v_Cy, p_Cx, p_Cy)^T.
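Equation (2.12) is the usual ratio of a joint density to a marginal. The sketch below checks it in the simplest Gaussian case, one scalar per variable, against the closed-form conditional of a bivariate normal; all numbers are toy values chosen for illustration.

```python
# For a bivariate Gaussian, joint / marginal equals the closed-form
# conditional N(mu_a + (cov/vb)(b - mu_b), va - cov^2/vb).
import math

def gauss1(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gauss2(a, b, mua, mub, va, vb, cov):
    """2-D Gaussian density with covariance matrix [[va, cov], [cov, vb]]."""
    det = va * vb - cov * cov
    qa, qb = a - mua, b - mub
    quad = (vb * qa * qa - 2 * cov * qa * qb + va * qb * qb) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

mua, mub, va, vb, cov = 1.0, -0.5, 2.0, 1.5, 0.8
a, b = 0.3, 0.1

ratio = gauss2(a, b, mua, mub, va, vb, cov) / gauss1(b, mub, vb)   # eq. (2.12)
mu_cond = mua + cov / vb * (b - mub)
var_cond = va - cov * cov / vb
closed = gauss1(a, mu_cond, var_cond)
print(abs(ratio - closed) < 1e-10)   # True
```

In the thesis the same ratio is taken between two fitted Gaussians, a 10-D joint and a 6-D marginal, rather than the 2-D/1-D pair used here.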
Let

    φ_t(X_{A_t}, X_{B_t}, X_{C_t}) = log P_{A_t|B_t,C_t}(X_{A_t} | X_{B_t}, X_{C_t}),   for 1 ≤ t ≤ T − 1      (2.13)
    φ_t(X_{A_t}, X_{B_t}, X_{C_t}) = log P_{A_T,B_T,C_T}(X_{A_T}, X_{B_T}, X_{C_T}),    for t = T              (2.14)

be the cost functions associated with each triangle; then the dynamic programming
algorithm can be described as follows:
Stage 1: for every pair (X_{B_1}, X_{C_1}),
  - Compute φ_1(X_{A_1}, X_{B_1}, X_{C_1}) for all possible X_{A_1}.
  - Define T_1(X_{A_1}, X_{B_1}, X_{C_1}), the total value so far.
    Let T_1(X_{A_1}, X_{B_1}, X_{C_1}) = φ_1(X_{A_1}, X_{B_1}, X_{C_1}).
  - Store
      X*_{A_1}[X_{B_1}, X_{C_1}] = argmax_{X_{A_1}} T_1(X_{A_1}, X_{B_1}, X_{C_1})
      and T_1(X*_{A_1}[X_{B_1}, X_{C_1}], X_{B_1}, X_{C_1}).

Stage t, 2 ≤ t ≤ T: for every pair (X_{B_t}, X_{C_t}),
  - Compute φ_t(X_{A_t}, X_{B_t}, X_{C_t}) for all possible X_{A_t}.
  - Compute the total value so far (till stage t):
    -- Define T_t(X_{A_t}, X_{B_t}, X_{C_t}), the total value so far.
       Initialize T_t(X_{A_t}, X_{B_t}, X_{C_t}) = φ_t(X_{A_t}, X_{B_t}, X_{C_t}).
    -- If edge (A_t, B_t) is contained in a previous stage τ, and τ is the latest
       such stage, add the cost T_τ(X*_{A_τ}[X_{A_t}, X_{B_t}], X_{A_t}, X_{B_t})
       (or T_τ(X*_{A_τ}[X_{B_t}, X_{A_t}], X_{B_t}, X_{A_t}) if the edge was reversed)
       to T_t(X_{A_t}, X_{B_t}, X_{C_t}).
    -- Likewise, add the costs of the latest previous stages containing respectively
       edge (A_t, C_t) and edge (B_t, C_t) to T_t(X_{A_t}, X_{B_t}, X_{C_t}).
  - Store
      X*_{A_t}[X_{B_t}, X_{C_t}] = argmax_{X_{A_t}} T_t(X_{A_t}, X_{B_t}, X_{C_t})
      and T_t(X*_{A_t}[X_{B_t}, X_{C_t}], X_{B_t}, X_{C_t}).
When the stage T calculation is complete, T_T(X*_{A_T}[X_{B_T}, X_{C_T}], X_{B_T}, X_{C_T}) includes the
value of each φ_t, 1 ≤ t ≤ T, exactly once. Since the φ_t's are the logs of condi-
tional (and joint) probabilities, if equation (2.8) holds,

    T_T(X*_{A_T}[X_{B_T}, X_{C_T}], X_{B_T}, X_{C_T}) = log P_Sbody(X_LW, X_LE, X_LS, X_H, ..., X_RF)

Thus picking the pair (X_{B_T}, X_{C_T}) that maximizes T_T automatically maximizes the
joint probability density function.
The best labeling can now be found by tracing back through each stage: the best
(X_{B_T}, X_{C_T}) determines X_{A_T}; then the latest previous stages containing respectively
edges (X_{A_T}, X_{B_T}), (X_{A_T}, X_{C_T}), and/or (X_{B_T}, X_{C_T}) determine more labels, and so forth.
A simple example of this algorithm is shown in Figure 2.5.
The above algorithm is computationally efficient. Assume M is the number of
body part labels and N (N = M for this section) is the number of candidate markers;
then the total number of stages is T = M − 2, and in each stage the computation cost
is O(N^3). Thus, the complexity of the whole algorithm is on the order of M·N^3.
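The stages above can be sketched as follows. This is an illustrative implementation only: the function name, the message bookkeeping, and the random toy cost are ours, and the traceback step as well as the constraint that each marker be used only once are omitted for brevity. On the graph of Figure 2.2, the DP value matches brute-force search.

```python
# Dynamic programming over a decomposable triangulated model: each stage
# eliminates A_t and leaves a table indexed by (X_Bt, X_Ct), playing the role
# of T_t(X*_At[X_Bt, X_Ct], X_Bt, X_Ct) in the text.
import itertools
import random

def dp_best_value(cliques, n_markers, phi):
    """Maximize sum_t phi(t, x_At, x_Bt, x_Ct); cliques = [(A_t, B_t, C_t), ...]
    must follow a valid elimination order."""
    markers = range(n_markers)
    pending = []   # unconsumed messages: ({B_t, C_t}, (B_t, C_t), table)
    for t, (A, B, C) in enumerate(cliques):
        # collect messages from earlier stages whose surviving edge is in this clique
        mine = [m for m in pending if m[0] <= {A, B, C}]
        pending = [m for m in pending if not (m[0] <= {A, B, C})]
        table = {}
        for xb, xc in itertools.product(markers, repeat=2):
            def total(xa):
                val = {A: xa, B: xb, C: xc}
                return phi(t, xa, xb, xc) + sum(tab[(val[u], val[v])]
                                                for _, (u, v), tab in mine)
            table[(xb, xc)] = max(total(xa) for xa in markers)
        pending.append(({B, C}, (B, C), table))
    return max(pending[0][2].values())   # final max over (X_BT, X_CT)

# Toy model on the graph of Figure 2.2; elimination order A, B, C, last (D,E,F).
cliques = [("A", "B", "E"), ("B", "E", "F"), ("C", "E", "F"), ("D", "E", "F")]
parts = ["A", "B", "C", "D", "E", "F"]
random.seed(1)
cost = {(t, xa, xb, xc): random.random()
        for t in range(4) for xa in range(3) for xb in range(3) for xc in range(3)}
phi = lambda t, xa, xb, xc: cost[(t, xa, xb, xc)]

dp = dp_best_value(cliques, 3, phi)
brute = max(sum(phi(t, x[parts.index(a)], x[parts.index(b)], x[parts.index(c)])
                for t, (a, b, c) in enumerate(cliques))
            for x in itertools.product(range(3), repeat=6))
print(abs(dp - brute) < 1e-9)   # True: the DP finds the global optimum
```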
2.4 Experiments
We did experiments on motion capture data*, which allow us to explore the labeling
performance of the algorithm on frames with all the body parts observed and no
clutter points. The data were obtained by filming a subject moving freely in 3-D; 16
light bulbs were strapped to the main joints of the subject's body. In order to obtain
ground-truth, the data were first acquired, reconstructed and labeled in 3-D using a
4-camera motion capture system operating at a rate of 60 samples/sec. Since our goal
is to detect and label the body directly in the camera image plane, a generic camera
view was simulated by orthographic projection of the 3-D marker coordinates. In the
following sections we will control the camera view with the azimuth viewing angle:
a value of 0 degrees will correspond to a right-side view, a value of 90 to a frontal
view of the subject. Six sequences were acquired, each around 2 minutes long. In the
next sections they will be referred to as follows. Sequences W1 (7000 frames) and W2
(7000 frames): relaxed walking forward and backwards along almost straight paths
(with 20-degree deviations in heading); W3 and W4 (6000 frames each): relaxed
walking, with the subject turning around now and then (Figure 2.6(a) shows sample
frames from W3); Sequence HW (5210 frames): walking in a happy mood, moving the
head, arms, and hips more actively (Figure 2.6(b)); Sequence DA (3497 frames): dancing
and jumping (Figure 2.6(c)), with the subject moving his legs and arms freely and much
faster than in the previous four sequences. Given that the data were acquired from the
same subject and that orthographic projection was used to simulate a camera view,
our data were already normalized in scale. The velocity of each candidate marker was
obtained by subtracting its positions in two consecutive frames. Thus, to get velocity
information, we assumed that features could be tracked for two frames, but we did not
use any feature correspondence over more than two frames, which is arguably the
most difficult condition under which to perform labeling and detection, as will be
discussed in section 3.3.

(*These data were captured by Drs. Luis Goncalves and Enrico Di Bernardo using a
motion capture system.)

[Figure 2.5: An example of the dynamic programming algorithm applied to a simple
graph. The goal is to assign the markers to the variables A, B, C, D, E in the graph
such that ...]
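The velocity computation described above amounts to a frame difference; a sketch with hypothetical marker positions:

```python
# Approximate each marker's velocity as the difference of its positions in two
# consecutive frames (markers assumed tracked and listed in the same order).
def velocities(frame_prev, frame_curr):
    """Each frame: list of (x, y) positions, same marker order."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(frame_prev, frame_curr)]

print(velocities([(0.0, 1.0), (2.0, 2.0)], [(0.5, 1.0), (2.0, 1.5)]))
# [(0.5, 0.0), (0.0, -0.5)]
```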
Among the sequences, the walking sequences W1 and W2 are the relatively simple
ones, so W1 and W2 were first used to test the validity of the Gaussian probabilistic
model and the performance of two possible body decompositions (Figure 2.3). Since
the heading direction of W1 and W2 was roughly along a line, these sequences were
also used to study the performance as a function of viewing angle. Then experiments
were conducted using W3, HW and DA to see how the model worked for more active
and non-periodic motions.
2.4.1 Detection of individual triangles
In this section, the performance of the Gaussian probabilistic model for individual
triangles is examined. In the training phase, the joint Gaussian parameters (mean
and covariance) for each triangle in Figure 2.3 were estimated from walking sequence
W1 (viewed at 45 degrees). Then, for each frame in W2 (also viewed at 45 degrees),
each triangle probability was evaluated for all possible combinations of markers
(16×15×14 different combinations). Ideally, the correct combination of markers
should produce the highest probability for each respective triangle; otherwise, an
error occurred. Figure 2.7(a) shows how well each triangle's joint probability model
detects the correct set of markers. Figure 2.7(b) shows a similar result for the
conditional probability densities of triangles, where for each triangle conditional
probability density P_{A_t|B_t,C_t}(X_{A_t} | X_{B_t}, X_{C_t}), we computed
P_{A_t|B_t,C_t}(X_{A_t} | X_{B_t}, X_{C_t}) for all the possible choices of A_t (14 choices), given the
correct choice of markers for B_t and C_t. Figure 2.7 shows that the Gaussian model is
very good for most triangles (in the joint case, if a triangle is chosen randomly, then
the chance of getting the correct one is 3×10^-4, and the probability models do much
better than that).

[Figure 2.6: Sample frames for the (a) walking sequence W3; (b) happy walking
sequence HW; (c) dancing sequence DA. The numbers on the horizontal axes are the
frame numbers.]
[Figure 2.7: Local model error rates (percentage of frames for which the correct choice
of markers did not maximize each individual triangle probability). Triangle indices
are those of the two graph models of Figure 2.3. `+': results for the decomposition of
Figure 2.3(a); `o': results for the decomposition of Figure 2.3(b). (a) joint probability
model; (b) conditional probability model.]
It is not surprising that the performance of some triplets is much worse than
others. The worst triangles in Figure 2.7(a) are those with the left and right knees,
which makes sense because the two knees are so close in some frames that it is even
hard for human eyes to distinguish between them. Therefore, it is also hard for the
probabilistic model.

[Figure 2.8: Probability ratio (correct markers vs. the solution with the highest prob-
ability when an error happens). The horizontal axis is the index of frames where the
error happens. (a) joint probability ratio for triangle 10 or 25 (RH, LK, RK); (b)
conditional probability ratio for triangle 17 (H, N, LS).]
Further investigation of the behavior of the triangle probabilities revealed that, for
frames in which the correct choice of markers did not maximize a triangle probability,
that probability was nevertheless quite close to the maximal value. Figure 2.8 shows
the ratio of the probabilities of the correct choice over the maximizing choice for the
two worst-behaving triangles, over the set of frames where the errors occurred. Figure
2.8(a) shows the ratio of the joint probability distribution for triangle 10 (consisting
of right hip, left knee, and right knee, as in Figure 2.3(a)). Figure 2.8(b) shows
the ratio of the conditional probability distribution for triangle 17 (head, neck, and
left shoulder). Although these two triangles had the highest error rates, the correct
marker combination was always very close to being the highest ranking, always less
than a factor of 1.006 away. This is a good indication that the individual triangle
probability models encode the distribution quite well.
2.4.2 Performance of dierent body graphs
We did experiments using the two decompositions in Figure 2.3. The training se-
quence W1 and the test sequence W2 were under the same viewing angle: 45 degrees.
Two error measures were used: the frame-by-frame error is the percentage of frames
in which errors occurred, and the label-by-label error is the percentage of markers
wrongly labeled out of all the markers in all the testing frames. The label-by-label
error is smaller than the frame-by-frame error because an error in a frame does not
mean all the markers are wrongly labeled.

    decomposition model       (a)        (b)
    frame-by-frame error      0.27%      13.13%
    label-by-label error      0.06%      1.61%

    Table 2.1: Error rates using the models in Figure 2.3.
The performance of the algorithm using the decomposition of Figure 2.3(a) is
almost perfect and much better than that of (b), which is consistent with our expec-
tation (by Figure 2.7, the local performance of the decomposition of Figure 2.3(a) is
better than that of Figure 2.3(b)). We used the better model in the rest of the experiments.
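The two error measures of Table 2.1 can be sketched as follows (the function and data are hypothetical); note that, with the same number of markers in every frame, the label-by-label error can never exceed the frame-by-frame error.

```python
# Frame-by-frame error: a frame counts as wrong if any marker is mislabeled.
# Label-by-label error: fraction of individual markers that are mislabeled.
def error_rates(true_labels, pred_labels):
    """Both arguments: list of frames, each a list of labels in the same marker order."""
    frames = len(true_labels)
    markers = sum(len(f) for f in true_labels)
    frame_err = sum(1 for t, p in zip(true_labels, pred_labels) if t != p)
    label_err = sum(1 for t, p in zip(true_labels, pred_labels)
                    for a, b in zip(t, p) if a != b)
    return frame_err / frames, label_err / markers

truth = [["LW", "LE", "LS"]] * 4
pred = [["LW", "LE", "LS"], ["LE", "LW", "LS"],
        ["LW", "LE", "LS"], ["LW", "LE", "LS"]]
print(error_rates(truth, pred))   # (0.25, 0.16666666666666666)
```

Here one of four frames contains a swap, so the frame-by-frame error is 25% while only 2 of 12 markers are wrong.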
2.4.3 Viewpoint invariance
In the previous sections the viewing angle for training and for testing was the same.
Here we explore the behavior of the method when the testing viewing angle is different
from that used during training. Figure 2.9 shows the results of three such experiments
where walking sequence W1 was used as the training set and W2 as the test set.
The solid line in Figure 2.9(a) shows the percentage of frames labeled correctly
when the training was done at a viewing angle of 90 degrees (subject facing the
camera) and the testing viewing angle was varied from 0 degrees (right-side view) to
180 degrees (left-side view) in increments of 10 degrees. When the viewing angle was
between 60 and 120 degrees, almost all frames were labeled correctly, thus showing that
the probabilistic model learned at 90 degrees is insensitive to changes in viewpoint
of up to 30 degrees.
The solid line in Figure 2.9(b) shows the results of a similar experiment where the
training viewpoint was at 0 degrees (right-side view) and the testing angle was varied