A Database for Fine Grained Activity Detection of Cooking Activities. Rohrbach, Amin, Andriluka, Schiele. CVPR 2012.

  1. Includes a novel database of 65 cooking activities, recorded in a realistic setting (looks much better than CMU's lab). In fact, from the images it looks like a normal kitchen with lots of cabinets – not sure how it's stocked
  2. Benchmarks 2 approaches on this data:
    1. Using “articulated pose tracks”
    2. Using “holistic video features”
  3. “While the holistic approach outperforms the pose-based approach, our evaluation suggests that fine-grained activities are more difficult to detect and the body model can help in those cases.”
  4. There are limitations with the commonly-used action recognition data sets.
    1. Most are coarse, full-body activities (waving, jumping, etc)
    2. Activities are all very different, so it is relatively simple to differentiate them. In more realistic data it is often harder to tell activities apart
    3. Generally episodic: given a short video, identify the one thing going on. A more realistic setting, on the other hand, involves being fed a video stream and dynamically deciding what is going on
  5. This dataset addresses these 3 limitations
  6. “From the experimental results we can conclude that fine grained activity recognition is clearly beyond the current state-of-the-art and that further research is required to address this more realistic and challenging setting.”
  7. Mentions a list of over 30 action datasets
    1. Only 2 of them are “natural” datasets, though: one is from Rochester and seems fairly limited, the other is from TUM <and isn’t useful for us because of lack of instrumentation>
  8. Commonly used features for action recognition are Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF)
  9. Recording has 12 participants doing 65 cooking activities (like cut, pour). They did this through completion of 14 recipes, so the motions are naturalistic and not isolated
  10. Instructions also seem to be fairly loose, which could allow subjects freedom in what/how they were cooking (e.g., some made soup from scratch, others from a packet)
    1. They were allowed to look around the kitchen as much as they liked to get situated at first
  11. Data is 4D <3D video with time?>
  12. Data is hand-annotated (5,609 annotations of 65 activities)
    1. Looks like they also have skeleton data as a form of annotation?
  13. <Skipping the rest, doesn’t seem like exactly what we need>
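
As a reminder of what the HOG features mentioned in item 8 boil down to, here is a minimal sketch of the core step: a magnitude-weighted histogram of gradient orientations over one image cell. This is not the paper's pipeline. The function name `orientation_histogram`, the 9-bin unsigned layout, and the simple sum-to-one normalization are my own assumptions for illustration; real HOG also adds block normalization and bin interpolation, and HOF would do the same binning on optical-flow vectors rather than image gradients.

```python
import math


def orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `cell` is a 2D list of grayscale intensities. Hypothetical helper for
    illustration only; it covers just the binning step of HOG.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins  # unsigned orientations span 0..180 degrees

    # Central differences, so border pixels are skipped.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region, no orientation to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The trailing modulo guards against angle rounding up to 180.
            hist[int(angle // bin_width) % n_bins] += magnitude

    # Simplification (assumption): normalize so the bins sum to 1.
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# A cell with a single vertical edge: all gradient energy points along x.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
print(orientation_histogram(vertical_edge))
```

With the vertical edge, all of the weight should land in the first bin (orientations near 0 degrees); transposing the cell into a horizontal edge moves it to the middle bin (around 90 degrees).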
