ISVC 2014 Paper ID 548


    Body Joint Tracking in Low Resolution Video

    using Region-based Filtering

Binu M Nair1, Kimberly D Kendricks2, Vijayan K Asari1, and Ronald F Tuttle3

1 Department of ECE, University of Dayton, Dayton, OH, USA {nairb1,vasari1}@udayton.edu

    2 CHPSA Lab, Central State University, Wilberforce, OH, [email protected]

3 Air Force Institute of Technology, 2950 Hobson Way, OH, [email protected]

Abstract. We propose a region-based body joint tracking scheme to track and estimate continuous joint locations in low resolution imagery, where the estimated trajectories can be analyzed for specific gait signatures. The true transition between the joint states is of a continuous nature and specifically follows a sinusoidal trajectory. Recent state-of-the-art techniques enable us to estimate pose at each frame, from which joint locations can be deduced. But these pose estimates at low resolution are often noisy and discrete, and hence not suitable for further gait analysis. Our proposed 2-level region-based tracking scheme gets a good approximation to the true trajectory and obtains finer estimates. Initial joint locations are deduced from a human pose estimation algorithm, and subsequent finer locations are estimated and tracked by a Kalman filter. We test the algorithm on sequences containing individuals walking outdoors and evaluate their gait using the estimated joint trajectories.

Keywords: Kalman filter, region-based tracking, local binary patterns, histogram of oriented gradients, low resolution

    1 Introduction

Most of the research work done in the field of tracking from surveillance videos has been restricted to detecting and tracking large objects in the scene, such as people in shopping malls, players on a soccer/basketball court, or cars. But when it comes to tracking body joints in a scene, it borders on the line of human pose estimation in images and videos. In the present research community, the human body pose estimation problem is being tackled in two different scenarios: one which uses depth information and the other which uses only the images. The former uses the depth information from the Kinect (Shotton et al. [14]) and is mainly suited for indoor applications such as gaming consoles, human interactive systems, etc. The latter is used in surveillance applications which use video feed from multiple CCTV cameras


(a) Manual annotation provided by point lights software [10]

(b) Human pose estimation using articulated models [15]

Fig. 1: Illustration of specific joints/body parts on the human body to be tracked.

monitoring a parking lot or a shopping mall. Some early research for tracking motion and pose in surveillance videos has been developed where interest points detected on the human body can be tracked. The trajectories are then modeled to differentiate between human actions [8]. Recently, in an approach proposed by Huang et al. [7], human body pose is estimated and tracked across the scene using information acquired by a multi-camera system. The human pose estimates obtained from such algorithms give continuous, smooth, sinusoidal-like trajectories and are therefore deemed useful for gait analysis. However, one limitation is the requirement of high resolution imagery for accurate estimation of joint trajectories. Therefore, the use of such algorithms on low-resolution videos does not guarantee joint location estimates suitable for gait analysis, and a pre-processing mechanism should be applied to these noisy discrete estimates. An illustration of the pose estimates obtained by a proprietary point light software and by articulated part-based models is shown in Figures 1a and 1b.

    2 Related Work

One of the earlier and popular works which does not use depth information and uses only a single video camera to track human motion is that of Markus Kohler [9]. Here, a Kalman filter is designed to track non-linear human motion in such a way that the non-linearity is treated as motion with constant velocity and changing acceleration modeled as white noise. In our proposed algorithm, we use a modification of this Kalman filter and its design of the process noise covariance to track the body joints across the video sequence. Kaaniche et al. [8] used the extended Kalman filter to track specific points or corners detected at every frame of the video sequence for the purpose of gesture recognition.

In recent years, the problem of human body pose estimation has not just been limited to tracking points or corners or to using depth information. One of the state-of-the-art methods for human pose estimation on static images is the flexible mixture of parts model proposed by Yang and Ramanan [15]. Instead of explicitly using a variety of oriented body part templates (parameterized by pixel location and orientation) in a search-based template matching scheme, a family of affine-warped templates is modeled, each template containing a mixture of


non-oriented pictorial structures. Ramakrishna et al. [12] proposed an occlusion-aware algorithm which tracks human body pose in a sequence where the human body is modeled as a combination of single parts, such as the head and neck, and symmetric part pairs, such as the shoulders, knees and feet. The important aspect of this algorithm is that it can differentiate between similar looking parts such as the left or right leg/arm, thereby giving a suitable estimate of the human pose. Although these methods show increased accuracy on datasets such as the Buffy dataset [5] and the Image Parse dataset [13], their performance on very low-resolution imagery has not yet been evaluated. Further processing of the human pose estimates can provide coarse locations of a joint, which can form the basis of many tracking schemes. One such work was done by Burgos-Artizzu et al. [3], where they propose a generalization of non-maximum suppression post-processing schemes to merge multiple pose estimates either in a single frame or in multiple consecutive frames of a video sequence. We focus on an alternative problem where we require smooth trajectories of individual joints in a low resolution video scene for realistic and online analysis of gait signatures. The work proposed in this paper is an alternative and more accurate method to our preliminary model [10] for body joint tracking, in which a combination of optical flow and LBP-HOG descriptors with a Kalman filter had been evaluated.

    3 Theory

In this section, we explain the various modules used in the proposed framework, namely the region-based feature matching and the tracking scheme using the Kalman filter.

    3.1 Region Descriptor Matching

The region descriptors used here, the Histogram of Oriented Gradients (HOG) [4] and the Local Binary Patterns (LBP) [11], describe the edge information and the textural content in a local region respectively. Both can be very effective descriptors for region-based image matching in low resolution. The HOG descriptor [4] is a weighted histogram of the pixels over the edge orientation, where the weights are the corresponding edge magnitudes. The gradient magnitude and direction are given by $\sqrt{G_x^2 + G_y^2}$ and $\tan^{-1}(G_y/G_x)$, where $G_x, G_y$ are the gradients in the x and y directions.

The local binary pattern is an image coding scheme which brings out the

textural features in a region. For representing a joint region and for associating a joint in successive frames, the texture of the region plays a vital part in addition to the edge information. The LBP considers a local neighborhood of $8 \times 8$ or $16 \times 16$ in a joint region and generates a coded value which represents the underlying texture in its local region. The LBP operator is defined as

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(z) = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases} \qquad (1)$$
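As a concrete illustration, the LBP operator above can be sketched in NumPy as follows. The function names and the fixed clockwise neighbour ordering are our own choices; an optimized library implementation would normally be used in practice.

```python
import numpy as np

def lbp_code(patch):
    """LBP_{8,1} code for the centre pixel of a 3x3 patch.

    Each neighbour at least as bright as the centre sets one bit,
    i.e. s(g_p - g_c) = 1 when g_p >= g_c.
    """
    gc = patch[1, 1]
    # (row, col) offsets of the 8 neighbours at radius 1, clockwise
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for p, (r, c) in enumerate(offsets):
        if patch[r, c] >= gc:
            code += 2 ** p
    return code

def lbp_histogram(img):
    """Normalized histogram of LBP codes over all interior pixels."""
    h, w = img.shape
    hist = np.zeros(256)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            hist[lbp_code(img[i - 1:i + 2, j - 1:j + 2])] += 1
    return hist / max(hist.sum(), 1)  # normalise so histograms are comparable
```

The normalized histogram is what represents a joint region when two regions are matched in successive frames.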


where $(P, R)$ are the number of points around the local neighborhood and its radius. The textural representation of the joint region will then be the histogram of these LBP-coded values. For our purpose, we use $P = 8$ with $R = 1$, which reduces to a local region of size $3 \times 3$. The matching between two joint regions represented either by HOG or LBP is done using the Chi-squared metric [11] in Equation 2, where $f_1, f_2$ are feature vectors corresponding to a certain joint in successive frames.

$$\chi^2(f_1, f_2) = \sum_b \frac{(f_1(b) - f_2(b))^2}{f_1(b) + f_2(b)} \qquad (2)$$
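The Chi-squared matching of Equation 2 can be sketched directly; the small `eps` guard is our addition to avoid division by zero on empty histogram bins.

```python
import numpy as np

def chi_squared(f1, f2, eps=1e-12):
    """Chi-squared distance of Equation 2 between two descriptor histograms.

    eps (our addition) guards against division by zero on empty bins.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(np.sum((f1 - f2) ** 2 / (f1 + f2 + eps)))
```

A joint is then associated across frames by evaluating this distance between the stored joint descriptor and each candidate region descriptor inside the search region, and keeping the argmin.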

    3.2 Kalman Filter

The recursive version of the Kalman filter can also be used for tracking purposes, and in the literature it has been widely applied for tracking points in video sequences. In this proposed algorithm, we use the Kalman filter to track a specific body joint across the scene. This is done by setting the state of the process (which in this case is the human body movement) as the $(x, y)$ coordinates of the joint along with its velocity $(v_x, v_y)$, to get a state vector $\mathbf{x}_k \in \mathbb{R}^4$. The measurement vector $\mathbf{z}_k = [x_o, y_o] \in \mathbb{R}^2$ will be provided either by the coarse joint location estimates or by the region-based estimate. By approximating the motion of a joint in a small time interval by a linear function, we can design the transition matrix $A$ so that the next state is a linear function of the previous states. As done by Kohler [9], to account for the non-constant velocity often associated with accelerating image structures, we use the process noise covariance matrix $Q$ defined in Equation 3, where $a$ is the acceleration and $\Delta t$ is the time step determined by the frame rate of the camera.

$$Q = \frac{a^2 \Delta t}{6} \begin{bmatrix} 2(\Delta t)^2 & 0 & 3\Delta t & 0 \\ 0 & 2(\Delta t)^2 & 0 & 3\Delta t \\ 3\Delta t & 0 & 6 & 0 \\ 0 & 3\Delta t & 0 & 6 \end{bmatrix} \qquad (3)$$
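A minimal sketch of this constant-velocity Kalman model follows. The matrix entries mirror Equation 3; the measurement noise R is an assumed placeholder, since the paper does not state its value.

```python
import numpy as np

def make_kalman(dt, a):
    """Constant-velocity Kalman model for one joint, state x = [x, y, vx, vy].

    A is the linear transition matrix; Q follows Equation 3 with the
    acceleration a modelled as white noise (Kohler's design).  The
    measurement noise R is an assumed placeholder, not a value from the paper.
    """
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only position is measured
    Q = (a ** 2 * dt / 6.0) * np.array(
        [[2 * dt ** 2, 0, 3 * dt, 0],
         [0, 2 * dt ** 2, 0, 3 * dt],
         [3 * dt, 0, 6, 0],
         [0, 3 * dt, 0, 6]])
    R = np.eye(2)                                # assumed measurement noise
    return A, H, Q, R

def predict(x, P, A, Q):
    """Time update: prior state and prior error covariance."""
    return A @ x, A @ P @ A.T + Q

def correct(x, P, z, H, R):
    """Measurement update with measurement z = [zx, zy]."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

One predict/correct pair is run per frame per joint, with the measurement supplied either by the region-based estimate or by the coarse pose estimate, as described in the next section.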

    4 Proposed Framework

A block schematic of the proposed tracking scheme is shown in Figure 2. It consists of two main stages: a) 2-level region-based matching using LBP/HOG, and b) tracking of the region-based estimates using a Kalman filter. The following are the steps in the proposed algorithm:

1: Extract the first frame (time instant t = 1) of the sub-trajectory. Compute dense optical flow within the foreground region to get the global velocity estimate (median flow).

2: Initialize the Kalman filter with the coarse joint location and the global velocity. The state of the tracker for each body joint is then $x_t = [x, y, v_x, v_y]$, where $(v_x, v_y)$ is the joint velocity, set to the global flow velocity estimate. This is considered as the corrected state at time t = 1.


    Fig. 2: Block schematic of the proposed tracking scheme.

3: Update t ← t + 1 and predict the state (get the prior state) $x_t^-$ of the Kalman filter. Using the predicted state $x_t^-$, the posterior state $x_{t-1}$ and the a-priori error covariance $P_t^-$, estimate the elliptical region $S_{reg1}(t)$ where the joint location is likely to fall.

4: Extract the next frame. Find the region-based matching estimate of each joint between instants t and t − 1, formulated as $\arg\min_{p \in S_{reg1}(t)} \chi^2(f_j, f_p)$, where $f_j$ is the joint descriptor updated in the previous time instant and $f_p$ is the region descriptor computed at the pixel p within the elliptical search region $S_{reg1}(t)$. Also compute the dense optical flow and the global velocity of the foreground region.

5: Using this estimate and the coarse joint location estimate, predict the new elliptical search region $S_{reg2}(t)$. A constraint $S_{reg2}(t) \subseteq S_{reg1}(t)$ is enforced to prevent the growth of $S_{reg2}(t)$. If the constraint is satisfied, go to Step 6; else go to Step 8.

6: Compute the region-based estimate given by $\arg\min_{p \in S_{reg2}(t)} \chi^2(f_j, f_p)$. Use this finer estimate of the joint location as the measurement vector $z = [z_x, z_y]$ to correct the Kalman tracker associated with that particular joint.

7: Update t ← t + 1. Set the joint velocity to the global velocity and predict the state (get the prior state) $x_t^-$ and the elliptical search region $S_{reg1}(t)$. Go to Step 4.

8: Using the coarse joint location estimates as the measurement vector, perform the correction phase of the filter.

9: Update t ← t + 1. Set the joint velocity to the global velocity and predict the state (get the prior state) $x_t^-$ and the elliptical search region $S_{reg1}(t)$. Go to Step 4.

10: Continue until all the frames of the sequence have been processed.
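The two-level estimate selection of Steps 5-8 can be illustrated on synthetic data. Everything below is an illustrative stand-in: circular search regions replace the paper's elliptical ones, the noise levels are invented, and a simple blend-style corrector replaces the full Kalman update.

```python
import numpy as np

def choose_measurement(region_est, coarse_est, predicted, radius1):
    """Steps 5-8 in miniature: accept the finer region-based estimate only
    if it stays inside the first-level search region around the prediction;
    otherwise fall back to the coarse pose-detector estimate."""
    if np.linalg.norm(region_est - predicted) <= radius1:
        return region_est          # Step 6: correct with the finer estimate
    return coarse_est              # Step 8: correct with the coarse estimate

# Synthetic walk: linear in x, sinusoidal in y (as gait trajectories are).
rng = np.random.default_rng(0)
truth = np.stack([np.arange(30.0),
                  5.0 * np.sin(np.arange(30.0) / 5.0)], axis=1)

est = truth[0].copy()
vel = np.array([1.0, 0.0])         # initial global velocity (Step 2)
track = [est.copy()]
for t in range(1, 30):
    pred = est + vel                               # Step 3: predict
    coarse = truth[t] + rng.normal(0.0, 2.0, 2)    # noisy pose-detector output
    region = truth[t] + rng.normal(0.0, 0.3, 2)    # finer region-match output
    z = choose_measurement(region, coarse, pred, radius1=4.0)
    vel = 0.7 * vel + 0.3 * (z - est)              # crude velocity update
    est = 0.5 * pred + 0.5 * z                     # stand-in for Kalman correct
    track.append(est.copy())
track = np.array(track)
```

Because the finer region-based measurement is accepted whenever it is consistent with the prediction, the resulting track stays close to the smooth underlying trajectory even though the coarse detections are noisy.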

    5 Results and Experiments

The proposed tracking scheme has been tested on a private dataset provided by the Air Force Institute of Technology, Dayton, OH. It consists of 12 subjects walking along an outdoor track across the face of a building; each walk is performed twice,


(a) Covariance-based trajectory measures. (b) MOTA/MOTP scores

Fig. 3: Experimental results obtained with the proposed region-based Kalman tracking scheme using LBP descriptors. The numbering of points in the MOTP/MOTA scores refers to the names of the sequences mentioned in the left figure.

once wearing a loaded vest and once without, for a total of 24 video sequences. The area of focus is when the subject walks clockwise around the track and climbs a ramp. We set equal neighborhood sizes of $17 \times 17$ for each joint region and set a constant acceleration $a = 0.1$ pixels/frame$^2$ for the corresponding Kalman filter. Figure 4 shows sample illustrations of the proposed scheme in certain frames of the sequence. Sample illustrations of the joint trajectories are also shown in Figure 5, where a comparison is made with four different schemes. The joint trajectories estimated by the different schemes are smoothed using a regression-based neural network. We see that the smooth trajectories obtained by the proposed scheme using LBP or HOG have the closest approximation to the sinusoidal trajectory, with subtle variations.

5.1 Covariance-Based Trajectory Measure

This is a statistical measure of how close the tracked joint locations are to the coarse estimates of the joint location for each sequence associated with a particular subject. The metric [6] is given by

$$d^2(K, K_m) = \sum_{i=1}^{n} \left( \log \lambda_i(K, K_m) \right)^2$$

where $K \in \mathbb{R}^{3 \times 3}$ is the covariance of the tracked points, $K_m \in \mathbb{R}^{3 \times 3}$ is the covariance matrix of the coarse joint locations, $\lambda_i$ is the $i$-th generalized eigenvalue satisfying $|\lambda K - K_m| = 0$, and $n$ is the number of eigenvalues. The lower the value, the closer the tracked points are to the coarse joint locations. This measure does not give the precision of the tracking scheme, but it indicates whether the tracked joint trajectory lies within the spatio-temporal neighborhood of the coarse joint trajectory. We see that most of the


joint trajectories obtained from the proposed scheme have very low values. This shows that the proposed scheme obtains tracked estimates which are close to the pose estimates obtained from a pose detector.
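This covariance metric can be sketched directly from its definition, assuming symmetric positive-definite inputs (the function name is ours).

```python
import numpy as np

def covariance_distance(K, Km):
    """Covariance metric of Forstner and Moonen [6]: the sum of squared
    log generalized eigenvalues of |lambda*K - Km| = 0.  Assumes both
    inputs are symmetric positive-definite covariance matrices."""
    # |lambda*K - Km| = 0  <=>  lambda is an eigenvalue of inv(K) @ Km
    lam = np.linalg.eigvals(np.linalg.solve(K, Km))
    return float(np.sum(np.log(lam.real) ** 2))
```

The metric is zero when the two covariances coincide and grows as the tracked-point spread departs from the spread of the coarse estimates.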

    5.2 Multiple Object Tracking Precision/Accuracy (MOTP/MOTA)

The MOTP/MOTA [2] metrics are widely used efficiency measures for multiple-object tracking mechanisms, giving the precision and accuracy of the tracker by considering all the detected and tracked objects. We use an implementation of CLEAR-MOT [1] to obtain statistical data such as the false positive rate, the false negative rate, and the MOTA and MOTP scores. Multiple Object Tracking Precision (MOTP) refers to the closeness of a tracked point location to its true location (given as ground truth). Here, we measure this closeness by the overlap between the neighborhood region occupied by the tracked point location and that of the ground truth. The higher the value of this overlap, the more precise the estimated location of the point. Multiple Object Tracking Accuracy (MOTA) gives the accumulated accuracy in terms of the fraction of tracked joints matched correctly without any misses or mismatches. We computed the MOTP, MOTA, false positive rate and false negative rate for each sequence by setting the threshold $T = 0.5$, with the same acceleration parameter $a = 0.1$ and a neighborhood size of $17 \times 17$ for each body joint. We use the coarse joint location estimates as the ground truth data, since no appropriate ground truth has been provided with this dataset. In Figure 3b, we see that all of the sequences have moderately high precision of around 75% and high accuracy of around 90%. This shows that the proposed tracking scheme is largely noise-free, and that the reduction in precision is due to the slight variation of the estimated joint locations with respect to the coarse joint location estimates.
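The overlap-based MOTP described above can be sketched as the mean intersection-over-union of matched boxes. This is one common reading of the metric; CLEAR-MOT implementations differ in details such as the matching step, which is omitted here.

```python
import numpy as np

def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def motp(matched_pairs):
    """MOTP as the mean overlap over all matched (tracked, ground-truth)
    box pairs; upstream, a pair counts as matched when its overlap
    exceeds the threshold T."""
    return float(np.mean([overlap(a, b) for a, b in matched_pairs]))
```

With the neighborhood regions of the tracked joints as the boxes, a MOTP of 75% corresponds to a mean overlap of 0.75 between tracked and reference joint regions.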

    6 Conclusion

We have proposed a body joint tracking algorithm using a region-based matching scheme incorporated along with a Kalman filter, for use in conjunction with state-of-the-art human pose estimation algorithms under low-resolution scenarios for outdoor sequences. The algorithm combines effective region-based point tracking techniques using HOG or LBP with the predictive capability of the Kalman filter. After applying a post-processing GRNN-based smoothing scheme, we see through qualitative evaluation that the proposed scheme provides a better approximation of the true sinusoidal trajectory than schemes using only the pose estimates. In terms of quantitative analysis, the precision and accuracy of the joint tracks obtained from the proposed scheme are higher. Future work will involve analyzing the trajectories obtained for the joints to determine any characteristics embedded in them suitable for gait signature analysis, for people re-identification, or for human action and activity analysis.


(a) Elliptical search region in frame 1 for frame 2.

(b) Fine estimates of joint locations in frame 2 obtained from the tracking scheme (LBP).

(c) Elliptical search region computed in frame 3 for frame 4. Here, the shoulder and the ankle joint trackers are corrected with the coarse location, while the other joint trackers are corrected with the region-based estimate.

(d) Finer estimates of joint locations in frame 4 obtained from the tracking scheme (LBP).

(e) Elliptical search region computed at frame 7 for frame 8.

(f) Tracked joint locations at frame 9 based on the elliptical search regions.

Fig. 4: Illustration of elliptical search regions before tracking and joint location estimates after tracking. The coarse pose estimates are represented by purple in each frame. The search regions and the finer joint estimates are given as shoulder (blue), elbow (green), wrist (red), waist (cyan), knee (yellow) and ankle (pink).


Acknowledgments. This work was done in collaboration with Central State University and is supported by National Science Foundation grant No. 1240734. We would like to thank the National Signature Program and the Air Force Institute of Technology for the dataset used in this research.

    References

1. Bagdanov, A., Del Bimbo, A., Dini, F., Lisanti, G., Masi, I.: Posterity logging of face imagery for video surveillance. IEEE MultiMedia 19(4), 48-59 (Oct 2012)

2. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. J. Image Video Process. 2008, 1:1-1:10 (Jan 2008)

3. Burgos-Artizzu, X., Hall, D., Perona, P., Dollar, P.: Merging pose estimates across space and time. In: Proceedings of the British Machine Vision Conference. BMVA Press (2013)

4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Computer Society Conference on. vol. 1, pp. 886-893 (2005)

5. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on. pp. 1-8 (June 2008)

6. Forstner, W., Moonen, B.: A metric for covariance matrices (1999)

7. Huang, C.H., Boyer, E., Ilic, S.: Robust human body shape and pose tracking. In: 3D Vision (3DV), 2013 International Conference on. pp. 287-294 (2013)

8. Kaaniche, M., Bremond, F.: Tracking HOG descriptors for gesture recognition. In: Advanced Video and Signal Based Surveillance (AVSS), 2009 Sixth IEEE International Conference on. pp. 140-145 (2009)

9. Kohler, M.: Using the Kalman Filter to Track Human Interactive Motion: Modelling and Initialization of the Kalman Filter for Translational Motion. Forschungsberichte des Fachbereichs Informatik der Universitat Dortmund (1997)

10. Nair, B.M., Kendricks, K.D., Asari, V.K., Tuttle, R.F.: Optical flow based Kalman filter for body joint prediction and tracking using HOG-LBP matching. vol. 9026, p. 90260H (2014)

11. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(7), 971-987 (2002)

12. Ramakrishna, V., Kanade, T., Sheikh, Y.: Tracking human pose by tracking symmetric parts. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. pp. 3728-3735 (2013)

13. Ramanan, D.: Learning to parse images of articulated bodies. In: Scholkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1129-1136. MIT Press (2007)

14. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1297-1304 (2011)

15. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1385-1392 (June 2011)