MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain, Tompson, LeCun, Bregler. arXiv 2014

  1. A system for estimating human pose from video using conv nets, taking both color and motion features as input
  2. They propose a new body pose dataset, and their results test better than state of the art
  3.  Traditionally, pose estimation has relied on hand-coded features like HoG (histogram of oriented gradients), and not on motion-based features.  On the other hand, psychophysical experiments show that for people, motion is a powerful cue that by itself can be used to extract a great deal of information, including pose
  4. Previous studies involving the use of motion data had negative results, leading to no real improvement in actual performance, and in some cases, intractable inference problems.
    1. Here it is shown that deep learning can take advantage of motion information.  In fact, with their approach, motion data alone outperforms a number of algorithms, showing that there is indeed valuable information in motion data
  5. Contributions:
    1. An algorithm that incorporates motion features and outperforms state of the art for ‘in-the-wild’ data
    2. Algorithm is efficient and is almost real time
  6. Hogg (different from HoG) in 83 was one of the first systems for motion tracking.  Early systems like it often worked from an explicit geometric model, required initialization, and then incrementally updated the pose information
  7. Later on, systems without explicit geometrical models were introduced, generally relying on “bags of features” (SIFT, STIP, HoG, HoF)
  8. Most state of the art is based on a combination of HoG and “Deformable Part Models” (DPM)
  9. Previous applications of deep learning to pose recognition led to better than state of the art performance
  10. Input to their convnet is an RGB image along with a set of motion features
  11. Two broad categories of motion data:
    1. Simple derivatives of RGB video frames
    2. Optical flow features
  12. The simple derivatives are not great and are high-dimensional.  It would be hard to get a network to compute optical flow itself, so they compute optical flow separately as a preprocessing step
    1. They mention later that this is a nontrivial amount of information, so it could be a big help to an algorithm, although other algorithms haven’t been able to take advantage of it; even this information alone leads to good performance in their system
  13. Convnet is based on “sliding patches”
  14. <skipping details of arch and optimization, can come back to it if necessary>
  15. Designed to identify only one skeleton on screen.  The center of the torso is marked, which allows constraints on the rest of the skeleton to be used
  16. Training on 4k training images (with 1k held out for test) takes 12 hours; a forward pass takes 50ms
  17. They show examples where use of motion data leads to correct classification, but ignoring it leads to errors
    1. Especially in the case when there is a cluttered background
  18. <Seems like they just do head and arms? Torso already given…>
  19. System is pretty robust to range of parameters for optical flow, and removal of camera motion compensation doesn’t change performance much either
  20. Their results really beat up on other state of the art methods on their data set
    1. Even motion features alone beat them, but if you want exact results the RGB information is necessary as well
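
To make items 10 to 12 a bit more concrete, here is a minimal sketch of the simplest kind of motion feature mentioned above: a temporal derivative between two RGB frames, stacked with the RGB image as extra input channels. This is not the paper's code. The function name `frame_difference_features`, the scaling to [0, 1], and the use of NumPy are my own assumptions for illustration, and the paper's preferred motion feature (optical flow computed as a preprocessing step) is not shown here.

```python
import numpy as np

def frame_difference_features(frame_t, frame_t_plus_1):
    """Stack an RGB frame with its difference to the next frame.

    Hypothetical helper for illustration only (not from the paper).
    Both inputs are uint8 arrays of shape (height, width, 3).
    Returns a float32 array of shape (height, width, 6):
    channels 0-2 are the RGB frame, channels 3-5 the temporal difference.
    """
    # Convert to float first so the subtraction cannot wrap around in uint8.
    a = frame_t.astype(np.float32) / 255.0
    b = frame_t_plus_1.astype(np.float32) / 255.0
    diff = b - a  # simple temporal derivative, per pixel and per channel
    return np.concatenate([a, diff], axis=-1)

# Usage: two tiny synthetic frames, black followed by white.
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 255, dtype=np.uint8)
features = frame_difference_features(dark, bright)
```

The output has six channels per pixel instead of three, which is the general shape of input item 10 describes: an RGB image plus motion channels. With optical flow the extra channels would instead hold a per-pixel flow estimate.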
