Human Activity Analysis: A Review. Aggarwal, Ryoo. ACM Computing Surveys 2011

<Was the basis of the 2011 CVPR Tutorial on human activity recognition>

[Figure: activity taxonomy]


  1. “Depending on their complexity, we conceptually categorize human activities into four different levels: gestures, actions, interactions, and group activities.”
  2. Single-layer approaches operate directly on a sequence of images:
    1. “Space-time approaches view an input video as a 3-D (XYT) volume…” They may instead represent the activity as a trajectory (of a stick figure, as opposed to the volume of the person in the foreground)
    2. “…sequential approaches interpret it as a sequence of observations.”
  3. Hierarchical methods are based on the composition of subevents
    1. An example of a statistical hierarchical method is the hierarchical HMM
    2. Syntactic approaches work on the basis of formal grammars (e.g., stochastic context-free grammars)
    3. Description-based methods attempt a logical explanation of an activity in terms of its subevents
  4. Object recognition can also be important for activity recognition if the subject is interacting with an object
  5. Single-layer approaches come down to representation engineering and classification
    1. Often work on the basis of sliding windows
    2. Usually only applicable to short simple actions (like jumping)
    3. Can do template matching
    4. Nearest neighbor is common
  6. <skipping details about 3D volume methods>
  7. Single-layer trajectory methods may try to find an affine projection in order to develop a view-invariant representation. They may also analyze the curvature of the trajectories of tracked joints
  8. “Action Recognition Using Space-Time Local Features… if a system is able to extract appropriate features describing characteristics of each action’s 3-D volumes, the action can be recognized by solving an object-matching problem.”  Works in terms of local features/interest points.  Generally this is applied to stills, but there are also methods that work in X,Y,T.  Also approaches that don’t extract features from every still frame, but instead “… extract features only when there exists a salient appearance or shape change in 3-D space-time volume. Most of these features have been verified to be invariant to scale, rotation, and translations, similar to object recognition descriptors.”
  9. Application of latent semantic analysis (from text mining) to this problem
  10. “In most approaches using sparse local features, spatial and temporal relationships among detected interest points are ignored. The approaches that we discussed above show that simple actions can be recognized successfully, even without any spatial and temporal information among features. This is similar to the success of object recognition techniques that ignore the local features’ spatial relationships, typically called bag-of-words. The bag-of-words approaches were particularly successful for simple periodic actions.”
  11. Exemplar-based approaches are one kind of sequential method. An example is dynamic time warping against stored templates (sometimes with multiple templates of the same activity)
  12. <Paper is a bit too much of a laundry list – skimming the rest>
  13. HMMs, DBNs, semi-Markov models
  14. “In general, sequential approaches consider sequential relationships among features in contrast to most of the space-time approaches, thereby enabling detection of more complex activities (i.e., nonperiodic activities such as sign languages).”
  15. Exemplar systems can actually be more flexible than model-based sequential approaches because multiple different exemplars can be maintained per activity
  16. On to hierarchical methods
  17. Often single-layered approaches are used to extract the lowest-level, atomic actions for hierarchical methods
  18. Hierarchical methods are better suited to recognizing more complex activities, such as interactions
  19. Although they can model more complex behavior, they can generally get by on smaller training sets than single-layered methods
  20. HMMs are also mentioned under hierarchical methods; in this case the model has layers
  21. DBNs also mentioned here, and context-free grammars
  22. Description-based approaches “… explicitly maintains human activities’ spatio-temporal structures. It represents a high-level human activity in terms of simpler activities that compose the activity (i.e., subevents), describing their temporal, spatial, and logical relationships. That is, description-based approaches model a human activity as an occurrence of its subevent (which might be composed of their own subevents) that satisfies certain relations.”  Can also be represented as a context-free grammar
  23. <skipping group activities>
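
To make the exemplar-based idea from the notes concrete (dynamic time warping against stored templates, possibly several per activity), here is a minimal sketch. It is not the survey's implementation: the 1-D feature sequences, the labels, and the names `dtw_distance`, `classify_by_exemplar`, and `EXEMPLARS` are all made up for illustration, and a real system would compare per-frame feature vectors rather than single numbers.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences of numbers."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify_by_exemplar(query, exemplars):
    """Return the label of the stored exemplar with the lowest DTW cost."""
    return min(exemplars, key=lambda e: dtw_distance(query, e[1]))[0]

# Hypothetical templates: note two different exemplars for the same activity.
EXEMPLARS = [
    ("wave", [0, 1, 0, 1, 0, 1, 0]),
    ("wave", [0, 2, 0, 2, 0]),
    ("raise", [0, 1, 2, 3, 3, 3]),
]

# A "raise" performed at a different speed still aligns with its template.
query = [0, 0, 1, 2, 2, 3, 3]
label = classify_by_exemplar(query, EXEMPLARS)
```

Because each activity can keep several templates, adding a new execution style only means appending another exemplar, which is the flexibility the notes attribute to exemplar systems.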
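
The bag-of-words representation quoted in the notes, combined with the nearest-neighbor classification mentioned for single-layer approaches, can be sketched roughly as follows. This is only an illustration under assumptions: the 2-D descriptors, the tiny codebook, the training histograms, and the function names are invented here, and in practice the codebook would come from clustering real space-time feature descriptors.

```python
def nearest_codeword(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(vec, codebook[k])),
    )

def bow_histogram(descriptors, codebook):
    """Normalized codeword counts; where and when each feature occurred is discarded."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    total = sum(hist) or 1.0  # avoid dividing by zero for an empty video
    return [h / total for h in hist]

def classify_bow(descriptors, codebook, labelled_histograms):
    """Nearest-neighbor classification over bag-of-words histograms."""
    h = bow_histogram(descriptors, codebook)
    def sq_dist(item):
        return sum((x - y) ** 2 for x, y in zip(h, item[1]))
    return min(labelled_histograms, key=sq_dist)[0]

# Hypothetical codebook and training histograms, for illustration only.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
TRAINING = [("jump", [0.5, 0.5, 0.0]), ("wave", [0.0, 0.25, 0.75])]

video_descriptors = [(0.1, 0.0), (0.9, 1.1), (5.0, 4.8), (5.2, 5.0)]
hist = bow_histogram(video_descriptors, CODEBOOK)  # [0.25, 0.25, 0.5]
label = classify_bow(video_descriptors, CODEBOOK, TRAINING)
```

Reordering `video_descriptors` leaves the histogram unchanged, which is exactly the loss of spatial and temporal relationships that the quoted passage describes.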
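
For the model-based sequential approaches mentioned in the notes (HMMs and relatives), one common recipe is to train one HMM per activity and pick the model that assigns the observation sequence the highest likelihood. The sketch below shows only the scoring step, using the forward algorithm on a discrete HMM. The two models, their parameters, and the observation symbols are invented for illustration and are not taken from the survey.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)

def classify_hmm(obs, models):
    """Return the name of the model with the highest likelihood for obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Hypothetical two-state left-to-right models over symbols {0, 1}.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "sit": (START, TRANS, [[0.9, 0.1], [0.1, 0.9]]),    # mostly 0s, then mostly 1s
    "stand": (START, TRANS, [[0.1, 0.9], [0.9, 0.1]]),  # mostly 1s, then mostly 0s
}

label = classify_hmm([0, 0, 1, 1], MODELS)
```

Unlike the bag-of-words histogram, the likelihood here depends on the order of the observations, which is why the sequences `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are assigned to different models.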