Detecting Subtle Human-Object Interaction Using Kinect


Introduction

We present a method for identifying the human-object interactions involved in complex, fine-grained activities. Our approach exploits recent improvements in range sensors and body trackers to detect and classify important events in a depth video. By combining global motion information with local video analysis, our method recognizes the instants in a video at which a person picks up or puts down an object. We introduce three novel datasets for evaluation and perform extensive experiments with promising results.



Dataset 1

The first dataset shows 11 actors interacting with 5 objects (a cup, a phone, a hole puncher, a pair of headphones and a remote). Each video shows a single actor standing in front of a table. The objects are randomly placed on the table. There are 6 videos for each actor. In the first three videos, the actor interacts with each object in turn: he chooses an object, picks it up, manipulates it in some way and finally puts it down on the table. This is repeated for every object on the table. The last three videos are similar, but the actor does not actually manipulate the object; instead, he just touches the object and then moves his hand away.


Dataset 2

The second dataset shows 10 actors interacting with 6 objects (a remote, a book, a cup, a phone, a picture frame and a box). Each video shows a single actor sitting in front of a coffee table. The objects are randomly placed on the table. There are 6 videos for each actor. In the first three videos, the actor interacts with each object in turn: he chooses an object, picks it up, manipulates it in some way and finally puts it down on the table. This is repeated for every object on the table. Videos 4 and 5 are similar, but the actor does not actually manipulate the object; instead, he just touches the object and then moves his hand away. In video 6, the actor randomly decides which objects to touch and which objects to pick up.


Dataset 3

The third dataset shows 10 actors interacting with 6 objects (a book, a remote, a phone, a stapler, a wallet and a cup). Each video shows a single actor standing in front of a desk. The objects are randomly placed on the desk. There are 6 videos for each actor. In the first three videos, the actor interacts with each object in turn: he chooses an object, picks it up, manipulates it in some way and finally puts it down on the desk. This is repeated for every object on the desk. Videos 4 and 5 are similar, but the actor does not actually manipulate the object; instead, he just touches the object and then moves his hand away. In video 6, the actor randomly decides which objects to touch and which objects to pick up.


Download

Three channels are recorded: depth maps (.bin), skeleton joint positions (.txt), and RGB video (.avi). Datasets 2 and 3 are downloadable in a single zip file. Files in this zip file are named using the format:

a[VIDEO NUMBER]_s[ACTOR NUMBER]_e[DATASET]_[CHANNEL].[EXTENSION],

where "01" and "02" are used to refer to dataset 3 and 2 respectively. For example "a03_s02_e01_depth.bin" is the depth channel recording for the third video of actor 2 in dataset 3.