
EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION

Ing. Lorenzo Seidenari

e-mail: [email protected]

What is an Event?

Dictionary.com definition:

“something that occurs in a certain place during a particular interval of time.”

Examples from various domains:

• Sports: shot on goal
• Surveillance: entering a car
• Movies: drinking

Importance of Human Actions

• Most videos recorded and downloadable from the web contain people; their semantics are therefore defined by people's behavior.

• Third generation video-surveillance systems benefit from automatic interpretation of human actions and behaviors.

Definition 1: physical body motion.

Definition 2: interaction with the environment (objects or people) for a specific purpose.

Human action recognition challenges

• Actor appearance variation: gender, clothing, body posture and size.

• Scale, illumination and background changes, as in object categorization.

• Semantically different but perceptually similar actions (e.g. running and jogging).

• Different ways of executing the same action, which result in changes of limb trajectories and speed.


Are actions space-time objects?

We already know how to detect instances of object categories in static images.

How do we take advantage of time to describe dynamic concepts (i.e. human actions)?

[Figure: pipeline diagram with stages labeled interest points extraction, bag-of-features, visual dictionary, bag-of-words and SVM classifier; output classes: running, walking, jogging, handwaving, handclapping, boxing]

Framework Overview:

• Same three steps as in object categorization (feature extraction, dictionary formation, classification).

• The feature detector and descriptor differ here!
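The dictionary and classification steps above meet in the bag-of-words histogram. What follows is a minimal sketch, not the authors' implementation: it assumes Euclidean distance and hard nearest-word assignment, and the names `nearest_word` and `bag_of_words`, as well as the toy 2-D data, are made up for illustration.

```python
import math

def nearest_word(descriptor, dictionary):
    """Return the index of the codeword closest to the descriptor (Euclidean)."""
    return min(range(len(dictionary)),
               key=lambda k: math.dist(descriptor, dictionary[k]))

def bag_of_words(descriptors, dictionary):
    """Build an L1-normalized histogram of codeword occurrences for one clip."""
    hist = [0.0] * len(dictionary)
    for d in descriptors:
        hist[nearest_word(d, dictionary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data: 2-D "descriptors" and a 3-word dictionary (both hypothetical).
dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
clip_descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = bag_of_words(clip_descriptors, dictionary)
print(histogram)
```

In the framework outlined on the slide, one such histogram per clip would then be passed to the SVM classifier.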

Descriptor combination strategy

[Figure: two pipelines from ST patch to action representation. Labels: ST Patch; Descriptor (3DGrad_HoF) and Descriptors (3DGrad, HoF); Visual Dictionary and Visual Dictionaries; BoW and 3DGrad + HoF BoW; Action Representation]
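Reading the diagram labels, the two strategies appear to be: (1) concatenate the 3DGrad and HoF descriptors and quantize them with a single dictionary, or (2) quantize each descriptor with its own dictionary and concatenate the resulting bags of words. The sketch below illustrates that reading only; the helper `quantize`, the dictionaries and the data are hypothetical.

```python
import math

def quantize(descriptors, dictionary):
    """Hard-assign descriptors to their nearest codeword; return a count histogram."""
    hist = [0] * len(dictionary)
    for d in descriptors:
        k = min(range(len(dictionary)), key=lambda i: math.dist(d, dictionary[i]))
        hist[k] += 1
    return hist

# Made-up descriptors for two ST patches: a gradient part and a flow part each.
grad = [(0.0, 1.0), (1.0, 0.0)]
flow = [(0.2,), (0.9,)]

# Strategy 1: concatenate descriptors, then quantize with one joint dictionary.
joint_dictionary = [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
joint = [g + f for g, f in zip(grad, flow)]
bow_joint = quantize(joint, joint_dictionary)

# Strategy 2: quantize each descriptor with its own dictionary, concatenate BoWs.
grad_dictionary = [(0.0, 1.0), (1.0, 0.0)]
flow_dictionary = [(0.0,), (1.0,)]
bow_separate = quantize(grad, grad_dictionary) + quantize(flow, flow_dictionary)
```

The second strategy yields a longer representation (one histogram per descriptor type), while the first keeps a single histogram over joint codewords.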

Effective codebooks:

• Spatio-temporal descriptors span an extremely high-dimensional feature space.

• Our dense multi-scale sampling produces a non-uniform feature space.

K-means cluster centers are attracted towards densely populated regions.

• Less dense zones are not represented correctly.

Radius-based clustering [Jurie ICCV05] exploits mode finding to place cluster centers.

• More accurate coding of the feature space.
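To make the idea of fixed-radius center placement concrete, here is a deliberately simplified greedy variant. It is not the mean-shift-based estimator of the cited paper: it takes the remaining point with the most neighbors inside a fixed radius as a center, discards everything that center covers, and repeats. The function name, the radius and the toy data are assumptions.

```python
import math

def radius_clustering(points, radius, max_centers):
    """Greedy fixed-radius clustering: repeatedly pick the densest remaining point."""
    remaining = list(points)
    centers = []
    while remaining and len(centers) < max_centers:
        # Density of a candidate = number of remaining points within `radius`.
        best = max(remaining,
                   key=lambda c: sum(math.dist(c, p) <= radius for p in remaining))
        centers.append(best)
        # Remove everything the new center already covers.
        remaining = [p for p in remaining if math.dist(best, p) > radius]
    return centers

# Toy 1-D feature space: a dense blob near 0 and a sparse one near 10.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (10.0,), (10.5,)]
centers = radius_clustering(points, radius=1.0, max_centers=5)
print(centers)
```

Because covered points are removed after each step, the sparse region near 10 still receives its own center instead of being ignored in favor of the dense blob.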

Note: to reduce the uncertainty we perform soft assignment.
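One common way to implement soft assignment is to spread each descriptor over all codewords with Gaussian weights. The slide does not say which weighting is used, so the kernel, the `sigma` parameter and the function name below are assumptions.

```python
import math

def soft_assign(descriptor, dictionary, sigma=1.0):
    """Return Gaussian-kernel weights of a descriptor over all codewords (sum to 1)."""
    weights = [math.exp(-math.dist(descriptor, w) ** 2 / (2 * sigma ** 2))
               for w in dictionary]
    total = sum(weights)
    return [w / total for w in weights]

# A descriptor halfway between two codewords contributes equally to both.
dictionary = [(0.0, 0.0), (2.0, 0.0)]
weights = soft_assign((1.0, 0.0), dictionary)
```

A descriptor lying between two codewords then contributes to both histogram bins instead of being forced into one. For descriptors very far from every codeword the weights can underflow to zero, so a real implementation would need to guard the normalization.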

Results: codebook performance (KTH)

[Plot: performance against codebook size, with regions annotated "non-informative high-frequency terms" and "informative mid-frequency terms"]

Words are sorted by frequency and added incrementally to the dictionary.

Results: codebook performance (Weizmann)

[Plot: performance against codebook size, with regions annotated "non-informative high-frequency terms" and "informative mid-frequency terms"]

Words are sorted by frequency and added incrementally to the dictionary.
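The incremental construction can be sketched as follows. The sketch assumes that words are ranked from most to least frequent and added in fixed-size steps; the function name and the toy assignments are hypothetical.

```python
from collections import Counter

def incremental_codebooks(assignments, step):
    """Yield growing lists of word ids, most frequent words first."""
    ranked = [word for word, _ in Counter(assignments).most_common()]
    for size in range(step, len(ranked) + step, step):
        yield ranked[:size]

# Word ids assigned to the descriptors of a training set (made up).
assignments = [3, 1, 3, 2, 3, 1, 0]
codebooks = list(incremental_codebooks(assignments, step=2))
print(codebooks)
```

Each yielded codebook would be evaluated separately to obtain one point of a curve over codebook size.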

Results: dataset

The approach is tested on two standard datasets.

• KTH
  • 25 actors
  • 6 actions
  • 4 viewing conditions
  • 2391 clips

• Weizmann
  • 9 actors
  • 10 actions
  • 1 viewing condition
  • 93 clips

The Weizmann dataset is considered less challenging because of the reduced variability of shooting conditions and the smaller number of actors.

Results: comparison with the state of the art

Method (accuracy, %)          KTH      Weizmann
Our method                    92.57    95.41
Laptev et al. - HoG ['08]     81.6     -
Laptev et al. - HoF ['08]     89.7     -
Dollár et al. ['05]           81.2     -
Wong and Cipolla ['07]        86.6     -
Scovanner et al. ['07]        -        82.6
Niebles et al. ['08]          83.3     90
Liu et al. ['08]              -        90.4
Kläser et al. ['08]           91.4     84.3
Willems et al. ['08]          84.2     -

We compare our results using the same evaluation methodology, to measure the improvement with respect to the current state of the art.

Real video footage

[Figure: video frames labeled walking and running]

We test our detector on a sequence taken in a garage.

A sliding temporal window is used to perform the segmentation.
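A sliding temporal window can be sketched as below. The window length, the stride and the toy classifier are assumptions chosen for illustration; in the actual system the classifier would be the trained action detector applied to the features inside each window.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges with end exclusive."""
    return [(s, s + window)
            for s in range(0, max(num_frames - window, 0) + 1, stride)]

def segment(frame_features, window, stride, classify):
    """Label each temporal window with the output of `classify`."""
    return [(start, end, classify(frame_features[start:end]))
            for start, end in sliding_windows(len(frame_features), window, stride)]

# Toy stand-in for the real detector: threshold on the mean per-frame speed.
def toy_classifier(chunk):
    return "running" if sum(chunk) / len(chunk) > 1.0 else "walking"

speeds = [0.5, 0.6, 0.5, 0.7, 1.8, 2.0, 1.9, 2.1]
labels = segment(speeds, window=4, stride=2, classify=toy_classifier)
print(labels)
```

Overlapping windows (stride smaller than the window) give a finer temporal segmentation at the cost of classifying more windows.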

Recognizing generic video events

• Online video search and video indexing.

• Events are characterized by an evolution of scenes, objects and actions over time.

• 56 events are defined in LSCOM.

• Event examples in the news domain: Airplane Flying, Car Exiting.

Event Recognition: Object Tracking

• A possible approach, which exploits object recognition, is to detect the object of interest, track it over time, and model its spatio-temporal dynamics.

• Some events are well defined by the presence and motion of an object.

[Figure: pipeline with stages labeled Object Detection & Localization, Tracking and Inference; output “Airplane Landing”?]

• Hard to detect events without explicit object motion, such as Riot

Event recognition: exploit dynamic concept evolution

[Figure: per-frame feature extraction, followed by concept detectors, followed by the EMD distance between shots]

• Global low-level features are extracted, such as edge histograms, Gabor texture descriptors and grid color moments.

• 108 concept detectors are trained on these features.

• Each frame is represented by 108 concept scores.

• Shot similarity is evaluated by computing the Earth Mover’s Distance (EMD).

• Plug the EMD into an RBF kernel and use it in an SVM to predict the category.
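Plugging a distance into an RBF-style kernel usually means computing K = exp(-D / A) for some scale A. The sketch below assumes precomputed EMD values and, as a default, sets A to the mean off-diagonal distance; both the function name and that default are assumptions rather than details given on the slide.

```python
import math

def emd_kernel(distances, scale=None):
    """Turn a symmetric matrix of EMD values into K = exp(-D / scale)."""
    n = len(distances)
    if scale is None:
        # Assumed default: mean of the off-diagonal distances.
        off_diag = [distances[i][j] for i in range(n) for j in range(n) if i != j]
        scale = sum(off_diag) / len(off_diag)
    return [[math.exp(-distances[i][j] / scale) for j in range(n)] for i in range(n)]

# Made-up EMD values between three shots.
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
K = emd_kernel(D)
```

The resulting matrix can be handed to any SVM implementation that accepts a precomputed kernel. At least two shots are needed for the default scale to be defined.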

Concept Detectors

Content Representation: Mid-level Semantic Concept Scores

• Train detectors on low-level features.

• Mid-level semantic concept features are more robust.

• Columbia developed and released 374 semantic concept detectors. The detectors are available online: http://www.ee.columbia.edu/ln/dvmm/columbia374/

[Figure: image database with + and - labels]

Earth Mover’s Distance (EMD): Approach

Supplier P has a given amount of goods.

Receiver Q has a given limited capacity.

d_ij: ground distance between element i of P and element j of Q.

Weights: [formula not recovered from the slide]

Solved by linear programming.

• Temporal shift: a frame at the beginning of P can be mapped to a frame at the end of Q.

• Scale variations: a frame from P can be mapped to multiple frames in Q.
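For reference, the standard transportation formulation of the EMD can be written as follows. The symbols are assumptions introduced here: m and n are the numbers of frames in P and Q, f_ij is the flow from frame i of P to frame j of Q, and w_{p_i}, w_{q_j} are the frame weights (uniform weights 1/m and 1/n are one common choice).

```latex
\begin{aligned}
D(P,Q) &= \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \hat{f}_{ij}\, d_{ij}}
               {\sum_{i=1}^{m}\sum_{j=1}^{n} \hat{f}_{ij}},
\qquad
\hat{f} = \arg\min_{f} \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to}\quad
& f_{ij} \ge 0, \qquad
  \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \qquad
  \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \\
& \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
  = \min\!\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j}\Big)
\end{aligned}
```

Since the objective and all constraints are linear in f_ij, the optimal flow is found by linear programming. Because any frame of P may send flow to any frame of Q, and to several of them, the distance tolerates temporal shifts and scale variations.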


Experiments: Keyframe based feature performance

[Bar chart: average precision (0.0 to 1.0) per event: Car Crash, Protest, Greeting, Car Exiting, Combat, Marching, Riot, Running, Shooting, Walking, and the average. Series: concept scores, Gabor texture, edge direction histogram, color moment]

Dataset: TRECVID 2005

Evaluation metric: Average Precision
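As a reminder of how the metric is computed, here is a minimal sketch of non-interpolated average precision over a ranked list. It assumes that every relevant shot appears somewhere in the ranking; the exact TRECVID evaluation protocol may differ in details such as the ranking depth.

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP for a ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Relevant shots at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0])
```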

Experiments: EMD concept performance

References

On Space-Time Interest Points, Laptev, I. IJCV 2005

Behavior Recognition via Sparse Spatio-Temporal Features, Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. ICCV VS-PETS 2005

Effective Codebooks for Human Action Recognition, Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L. and Serra, G. ICCV VOEC 2009

Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment, Xu, D. and Chang, S.-F. TPAMI 2008