1. Deep learning for sequential (video) data
  2. Because video has temporal smoothness, an object will appear on many frames in a series, potentially from different perspectives (helps to learn a representation invariant to this)
  3. ConvNets with “… a training objective with a temporal coherence regularizer added to a typical training error minimizing objective, resulting in a modified backpropagation rule.”
  4. Use SGD
  5. Enforce that representation in hidden layers should be similar in time as measured by L1, conversely, two nonconsecutive frames have representations in hidden layer that are pushed apart (max-margin).
  6. Two optimizations (object recognition, video coherence) are done simultaneously (with a parameter that weighs relative contribution).
    1. But they actually just optimize one and then the other interleaved (which saves them the weighing parameter)
  7. SFA mentioned in the related work
  8. There is a graph version of SVM that tries to minimize difference of points connected in a graph
    1. “Graph methods on the other hand suffer from two further problems: (1) building the graph is computationally burdensome for large-scale tasks, (2) they make an assumption that the decision rule lies in a region of low density with respect to the distance metric chosen for k-nearest neighbors.”
  9. This approach doesn’t rely on low-density assumption
  10. Performance is about the same as state of the art on object recognition “However in bot cases we have managed to match the state-of-the-art whilst avoiding a strongly engineered solution for this task by utilizing learning from unlabeled video.”
  11. Doing the unsupervised step increases perf by about 8%
  12. Although some unsupervised data sets are more helpful than others <naturally> “… our results indicate that using unlabeled auxiliary video is still always beneficial compared to not using video, even when the objects in the auxiliary video are not similar to those of the primary task.”
  13. Even making making a data sequence when there wasn’t originally one (in their example, for face detection, simply concatenate images of the same face into a video where frames are of an arbitrary order), still works better.
  14. Because the framework is set up to work on identifying objects in video, the method should be invariant to the types of perspective changes that occur in video.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: