What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision. Malmaud, Huang, Rathod, Johnston, Rabinovich, Murphy. (NAACL?) 2015

Going off the arXiv version

  1. “We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting.”
  2. Most large knowledge bases are based on declarative facts like “Barack Obama was born in Hawaii”, but lack procedural information
  3. This is a complex problem and can involve many different types of media and information, but here they focus on aligning video to text
    1. They work on video, with the text being the recipe that the user uploaded with the video
  4. Instructional videos are a good place to start on these types of problems because they often are heavily annotated with speech in the video
  5. Process is as follows:
    1. Align instructional steps (recipe) with speech via HMM
    2. Refine this alignment using computer vision
  6. They “create a large corpus of 180k aligned recipe-video pairs, and an even larger corpus of 1.4M short video clips, each labeled with a cooking action and a noun phrase. We evaluate the quality of our corpus using human raters. Third, we show how we can use our methods to support applications such as within-video search and recipe auto-illustration.”
  7. Worked from a corpus of 180k videos (started from 7.5 million and worked their way down)
  8. Separate text into 3 classes (very accurately, with just naive Bayes and bag of words): recipe, ingredient, non-recipe
  9. Use an in-house NLP processor similar to the Stanford parser
  10. The speech transcript of each video is provided automatically by YouTube; they then apply the NLP processor to that
    1. This data isn’t high quality, and the system works better when a real transcript is provided, but using automatic transcripts gets them more data
  11. HMM has #states = #steps in recipe (you can only move forward in this HMM)
    1. Figuring out which parts of the dialogue are “non-recipe” is important for this and helps prevent premature transitions
  12. A simpler method is “keyword spotting”: look for verbs, take a window around when each one occurred, and look for simple noun/verb combinations within it
  13. They use both of the previous techniques together: HMM+keyword spotting
  14. Sometimes people verbally describe what they are going to do before they do it. They then use image recognition around where an object was described (+/- seconds) to figure out where to align the annotation in the actual video
    1. Trained their own vision food-detector
  15. They downsample images to about the same size as OverFeat uses (~220×220)
  16. When asking actual people on MTurk, the hybrid HMM+keyword spotting method was rated as best
  17. <Here they are working from English text to a syntax tree and then doing alignment. I wonder if you could do something similar with motion primitives, which have also been used to learn generative grammars, to do an alignment?>
  18. Other related work
  19. Methods for making simple subject-verb-object-place sentences from video
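
To make the forward-only HMM from item 11 concrete for myself, here is a minimal sketch of Viterbi alignment where each state is a recipe step and each observation is a transcript segment. This is not the paper's model: the word-overlap emission score, the smoothing constant, the fixed p_stay transition probability, and all function names are my own assumptions for illustration, and there is no “non-recipe” background handling here.

```python
import math


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w]


def emission_score(step_words, segment_words, smoothing=0.1):
    """Assumed emission score: log of the smoothed fraction of segment
    words that also appear in the recipe step (not the paper's model)."""
    overlap = sum(1 for w in segment_words if w in step_words)
    return math.log((overlap + smoothing) / (len(segment_words) + smoothing))


def align_steps(steps, segments, p_stay=0.5):
    """Viterbi over a left-to-right HMM: one state per recipe step, and the
    only allowed moves are staying in a step or advancing to the next one.
    Returns the step index assigned to each transcript segment."""
    step_sets = [set(tokenize(s)) for s in steps]
    seg_words = [tokenize(s) for s in segments]
    n_steps, n_segs = len(steps), len(segments)
    neg_inf = float("-inf")
    log_stay, log_next = math.log(p_stay), math.log(1.0 - p_stay)

    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    # Assumption: the video starts at the first recipe step.
    score[0][0] = emission_score(step_sets[0], seg_words[0])

    for t in range(1, n_segs):
        for k in range(n_steps):
            stay = score[t - 1][k] + log_stay
            advance = score[t - 1][k - 1] + log_next if k > 0 else neg_inf
            if advance > stay:
                best, back[t][k] = advance, k - 1
            else:
                best, back[t][k] = stay, k
            if best > neg_inf:
                score[t][k] = best + emission_score(step_sets[k], seg_words[t])

    # Backtrace from the best final state (the video may stop before the last step).
    k = max(range(n_steps), key=lambda j: score[-1][j])
    path = [k]
    for t in range(n_segs - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


# Made-up toy recipe and transcript, purely for illustration.
steps = [
    "Chop the onions",
    "Fry the onions in butter",
    "Add the tomatoes and simmer",
]
segments = [
    "first chop your onions finely",
    "chop all the onions",
    "now melt butter and fry them",
    "add tomatoes",
    "let it simmer",
]
print(align_steps(steps, segments))
```

On this toy input the first two segments land on step 0, the third on step 1, and the last two on step 2. Because the path can only stay or advance, a stray mention of “onions” late in the video can't pull the alignment back to an earlier step, which is the main thing the HMM buys over plain keyword spotting.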
