- “We explore first-person sensing through a wearable camera and Inertial Measurement Units (IMUs) for temporally segmenting human motion into actions and performing activity classification in the context of cooking and recipe preparation in a natural environment.”
- Try (rough setup sketch after this list):
- Gaussian Mixture Models
- HMMs
- k-NN
- Also try unsupervised methods so that annotation isn’t necessary
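Not from the paper, just a rough sketch of how those three candidates might be set up, assuming scikit-learn and hmmlearn and a placeholder per-frame feature matrix `X`:

```python
# Hypothetical sketch (not the paper's code): candidate models on a
# placeholder per-frame feature matrix X of shape (n_frames, n_dims).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # placeholder frame features
y = rng.integers(0, 5, size=500)        # placeholder action labels

# GMM: unsupervised, no labels needed
gmm = GaussianMixture(n_components=5).fit(X)
gmm_labels = gmm.predict(X)

# HMM: unsupervised, adds temporal smoothing via state transitions
hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=20).fit(X)
hmm_labels = hmm.predict(X)

# k-NN: supervised, requires the frame annotations
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
knn_labels = knn.predict(X)
```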
- Large number of references to prior work
- The IMUs used here are set up as:
- 5 on the wrist
- <5?> on ankles
- 1 on waist
- And then there is the head-mounted camera
- There is ambiguity in how to label what is going on, especially because people performed the same task in different ways.
- There is also substantial partial observability (even in terms of what is in the camera's view)
- Action classes are distinct per recipe (at least partially); 29 classes for brownies
- For unsupervised segmentation, they apply PCA to the frame features, then cluster the projection (rough sketch below)
- Performance is about 70% correct with this method
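A hedged sketch of the PCA-then-cluster idea, assuming scikit-learn; the number of components, the use of k-means, and the cluster count are my guesses, not the paper's settings. Segment boundaries are taken wherever the cluster label changes between frames:

```python
# Hypothetical sketch: PCA then k-means over frames; segment boundaries
# are placed wherever the cluster assignment changes between frames.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 60))                  # placeholder frame features

Z = PCA(n_components=10).fit_transform(X)        # low-dimensional projection
labels = KMeans(n_clusters=29, n_init=10).fit_predict(Z)  # one cluster per action (guess)

boundaries = np.flatnonzero(np.diff(labels)) + 1  # frames where the label changes
segments = np.split(np.arange(len(labels)), boundaries)
print(f"{len(segments)} segments found")
```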
- These features can also be used to try and classify what recipe is being made
- Recipe classification is perfect on the small dataset when using data from IMUs
- They tried unsupervised clustering of the IMU data with an HMM, but it produced garbage
- They then merge the video and IMU data <but it's not totally clear how they accomplish this in terms of representation>, again running the result through PCA and an HMM (hedged sketch below)
- Recipe classification with this method was ~93%, so using the IMUs alone is better
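Since the fusion representation isn't spelled out, the sketch below simply concatenates per-frame video and IMU features, reduces with PCA, and classifies a recipe by which per-recipe HMM gives the highest log-likelihood; the concatenation, dimensions, and per-recipe HMMs are assumptions, not the paper's stated pipeline (the same scoring works with IMU-only features):

```python
# Hypothetical sketch: concatenate video + IMU features per frame, PCA,
# then score a sequence against one HMM trained per recipe.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def make_sequence(n_frames=1000):
    video = rng.normal(size=(n_frames, 100))   # placeholder video frame features
    imu = rng.normal(size=(n_frames, 30))      # placeholder IMU features
    return np.hstack([video, imu])             # naive fusion by concatenation

train = {"brownies": make_sequence(), "salad": make_sequence()}
test_seq = make_sequence()

pca = PCA(n_components=15).fit(np.vstack(list(train.values())))

models = {}
for recipe, seq in train.items():
    m = GaussianHMM(n_components=10, covariance_type="diag", n_iter=20)
    m.fit(pca.transform(seq))
    models[recipe] = m

scores = {r: m.score(pca.transform(test_seq)) for r, m in models.items()}
print(max(scores, key=scores.get))             # predicted recipe
```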
- Then they move on to supervised methods, using the annotations
- ~80% of frames are annotated; “stirring the mix” accounts for about 25% of the labeled frames
- They trained only on frames where an annotation was available (masking sketch below)
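A small sketch of that masking step, assuming per-frame string labels with an "unlabeled" sentinel (both are my placeholders); it also prints the label distribution, which is where a number like "~25% stirring the mix" would come from:

```python
# Hypothetical sketch: keep only frames that actually have an annotation,
# then look at the label distribution (e.g. how dominant "stir" is).
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                       # placeholder frame features
labels = rng.choice(["stir", "pour", "crack_egg", "unlabeled"],
                    size=1000, p=[0.25, 0.15, 0.1, 0.5])

mask = labels != "unlabeled"                          # annotated frames only
X_train, y_train = X[mask], labels[mask]

print(f"annotated: {mask.mean():.0%} of frames")
print(Counter(y_train))                               # per-class frame counts
```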
- Then they do classification with a supervised HMM (poor, ~10%) and k-NN (much better at ~60%; minimal sketch below)
- They argue the performance of k-NN comes from the high-dimensional data <not sure why that is a good argument though>
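A minimal sketch of the supervised k-NN frame classifier on the annotated frames, assuming scikit-learn; the neighbor count, train/test split, and 29-class label space are placeholders rather than the paper's evaluation protocol:

```python
# Hypothetical sketch: supervised per-frame classification with k-NN,
# evaluated on a held-out split of the annotated frames.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 50))              # placeholder annotated-frame features
y = rng.integers(0, 29, size=800)           # placeholder action labels (29 classes)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print(f"frame accuracy: {knn.score(X_te, y_te):.2f}")
```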