Category Archives: Time Series

Unsupervised Learning of Video Representations using LSTMs. Srivastava, Mansimov, Salakhutdinov. Arxiv

  1. LSTMs to learn representations of video sequences
  2. Maps an input video sequence to a fixed-length representation
  3. This representation is then used to do other tasks
  4. Experiment with inputs of pixel patches as well as “high-level” “percepts”
    1. <Definition of the latter isn’t clear from the abstract, but presumably it will be explained>
  5. Unsupervised setting
  6. “The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state.”
  7. Can use NN to do autoencoder, or prediction, or both
  8. Previous work on generative models of video found a squared-error loss function problematic and resorted to a dictionary-based method instead.
    1. It’s very hard to design an appropriate and effective loss function for video
    2. Here they simply go with squared error
  9. “The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don’t vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover long-range features.”
  10. For unsupervised learning, they use two LSTM networks – one for encoding, another for decoding
  11. “The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.”
  12. “The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.”
  13. This is a similar setup as is used to develop representations of word meanings
  14. Also a previous paper that does video prediction (in the to-read list)
  15. Prediction can either be conditioned on the previously predicted output or not:
    1. “There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.”
  16. In the “composite model” the same LSTM representation is passed into two different LSTM decoders: one predicts future frames, while the other reconstructs the input sequence (see the model sketch after this list)
  17. <Experiments>
  18. Use the UCF-101 and HMDB-51 datasets, each of which has several thousand videos a few seconds long
  19. They also subsample from a previously existing sports video dataset
    1. Sample down to 300 hours in clips that are 10 seconds each
  20. Unsupervised training over this YouTube set was good enough; using the other two sets in addition didn’t really change performance
  21. Whole videos were 240×320 at 30fps; they look at just the center 224×224
  22. “All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peep-hole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge.”
  23. First set of experiments is on moving MNIST digits. Sequences were 20 frames long, in a 64×64 patch (see the data-generation sketch after this list)
    1. Positions are randomly initialized as well as velocities; they bounce off walls
  24. LSTM has 2048 units
  25. Look at 10 frames at a time, try to reconstruct those 10, and predict the next 10
  26. “We used logistic output units with a cross entropy loss function.”
  27. “It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls”
  28. Two layer LSTM works better than 1, and using previously predicted outputs to predict future outputs also helped
  29. Next they worked on 32×32 image patches from one of the real-life video datasets
    1. Linear output units, squared error loss
    2. Input 16 frames, reconstruct those 16, and predict the next 13
  30. Show outputs for 2048 and 4096 units, claim latter is better <to my eye they look essentially identical>
    1. Next they test for generalization at different time scales.  2048 LSTM units, 64×64 input
    2. Then look at predictions for 100 frames into the future. They don’t show the image outputs, but show that the activation doesn’t just average out to some mean amount; it maintains periodic activations
    3. It starts to blur the outputs, but maintains motion
  31. They also used the network that was trained on 2 moving digits on sequences with 3 and 1 moving digits; for 1 digit it superimposed another blob on the one digit, and for 3 it blurred the digits out but maintained motion <to me, on the other hand, it looks like it slightly blurs the 1 digit, and for the 3-digit case turns them into 2 blurry digits>
  32. Discuss visualizing features <not paying attention to that at the moment>
  33. Check how unsupervised learning helps supervised learning
  34. 2048 unit autoencoder trained on 300 hours of video.  Encode 16 frames, predict 10
  35. “At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames.”
    1. <averaging over predictions seems like a funny thing to do>
  36. “All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM.” Dropout was important
  37. For small datasets pretraining on unsupervised data helps
  38. Similar experiments on “Temporal Stream Convolutional Nets” <In my reading queue> – also helped there
  39. “We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better”
  40. “The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is at least partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover.”
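
The composite model described in notes 10-16 is concrete enough to sketch. Below is a minimal PyTorch sketch of my reading of it: one encoder LSTM whose final state is copied into two unconditioned decoders, one reconstructing the (reversed) input and one predicting future frames, with logistic outputs and cross-entropy loss as in the moving-MNIST experiments. Layer count, sizes, and the dummy data are simplifications, not the authors’ code.

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """Encoder LSTM feeding two decoders: input reconstruction + future prediction."""
    def __init__(self, frame_dim, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        # Unconditioned decoders: zero inputs at every step, so all information
        # must come from the copied encoder state (note 15).
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.pred_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, n_future):
        # frames: (batch, time, frame_dim)
        batch, T, D = frames.shape
        _, state = self.encoder(frames)           # state after the last input = video representation
        zeros_in = frames.new_zeros(batch, 1, D)  # unconditioned decoder input

        def unroll(decoder, steps, state):
            outputs = []
            for _ in range(steps):
                out, state = decoder(zeros_in, state)
                outputs.append(self.readout(out))
            return torch.cat(outputs, dim=1)

        recon = unroll(self.recon_decoder, T, state)         # target: input in reverse order
        future = unroll(self.pred_decoder, n_future, state)  # target: the next frames
        return recon, future

# Dummy usage with 10 flattened 64x64 frames in and 10 out (note 25).
model = CompositeLSTM(frame_dim=64 * 64, hidden_dim=256)  # small hidden size just for the sketch
seq = torch.rand(8, 10, 64 * 64)
future_target = torch.rand(8, 10, 64 * 64)                # placeholder future frames
recon, future = model(seq, n_future=10)
loss = (nn.functional.binary_cross_entropy(recon.sigmoid(), seq.flip(1))
        + nn.functional.binary_cross_entropy(future.sigmoid(), future_target))
loss.backward()
```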
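
A rough sketch of the bouncing-digit sequence generation from note 23: random initial positions and velocities, bouncing off the walls of a 64×64 frame, 20 frames per sequence. The digit images here are random blobs standing in for actual MNIST digits, and the velocity range is a guess.

```python
import numpy as np

def make_sequence(n_frames=20, size=64, digit_size=28, n_digits=2, rng=None):
    """Generate one bouncing-digits sequence of shape (n_frames, size, size)."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((n_frames, size, size), dtype=np.float32)
    lim = size - digit_size
    for _ in range(n_digits):
        digit = rng.random((digit_size, digit_size))  # stand-in for a 28x28 MNIST digit
        pos = rng.uniform(0, lim, size=2)             # random initial position
        vel = rng.uniform(-3, 3, size=2)              # random initial velocity (assumed range)
        for t in range(n_frames):
            pos += vel
            for d in range(2):                        # bounce off the walls
                if pos[d] < 0 or pos[d] > lim:
                    vel[d] = -vel[d]
                    pos[d] = np.clip(pos[d], 0, lim)
            r, c = int(pos[0]), int(pos[1])
            patch = canvas[t, r:r + digit_size, c:c + digit_size]
            canvas[t, r:r + digit_size, c:c + digit_size] = np.maximum(patch, digit)
    return canvas  # values in [0, 1]; digits superimpose where they overlap

seq = make_sequence()  # (20, 64, 64)
```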

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschkowski, Brock. RSS 2014

related to https://aresearch.wordpress.com/2015/04/18/learning-task-specific-state-representations-by-maximizing-slowness-and-predictability-jonschkowski-brock-international-workshop-on-evolutionary-and-reinforcement-learning-for-autonomous-robot-s/

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality, but in terms of the state transition rather than the reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall, proportionality doesn’t hold because no amount of pushing will allow the robot to move)
  12. They set learning up as an optimization problem with a loss term for each of the priors above, combined linearly (see the loss sketch after this list)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions, which we don’t have access to>
  14. Formally they would need to compare all pairs of samples, leading to an O(n²) loss computation, but they restrict comparisons to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiment are 300 or 100 pixels, i.e., 300-D or 100-D vectors (not 300×300 images)
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then do an RL comparison based on a number of different methods for learning the representation (5 features extracted by each of):
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. <screenshot of results>
  25. SFA features are really terrible; their approach performs about as well as operating from the ground truth
  26. Further investigation of the same results demonstrates that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimensionality reduction methods is that the others didn’t learn to disregard the distractors
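
To make notes 10-14 concrete, here is a hedged NumPy sketch of the prior-based loss terms for a linear mapping s = W·o. The pairing scheme (only consecutive transitions with the same action), the unit weights on each term, and the Gaussian similarity kernel are my paraphrase of the paper’s definitions rather than a verbatim reimplementation, and the simplicity prior is represented only by keeping the state low-dimensional.

```python
import numpy as np

def prior_losses(W, obs, actions, rewards):
    """obs: (T, obs_dim); actions, rewards: length-T arrays with discrete actions."""
    states = obs @ W.T                   # linear observation-to-state mapping
    ds = np.diff(states, axis=0)         # state change for each transition

    # Temporal coherence: task-relevant properties should change gradually.
    temporal = np.mean(np.sum(ds ** 2, axis=1))

    # The paper compares pairs of samples within a window rather than all
    # O(n^2) pairs (note 14); here we only pair consecutive transitions
    # that used the same action, to keep the sketch short.
    prop = caus = rep = 0.0
    n_pairs = 0
    for t1 in range(len(ds) - 1):
        t2 = t1 + 1
        if actions[t1] != actions[t2]:
            continue
        n_pairs += 1
        # Proportionality: same action => similar magnitude of state change.
        prop += (np.linalg.norm(ds[t2]) - np.linalg.norm(ds[t1])) ** 2
        sim = np.exp(-np.sum((states[t2] - states[t1]) ** 2))
        # Causality: same action but different reward => states should be far apart.
        if rewards[t1 + 1] != rewards[t2 + 1]:
            caus += sim
        # Repeatability: same action in similar states => similar state change.
        rep += sim * np.sum((ds[t2] - ds[t1]) ** 2)
    n_pairs = max(n_pairs, 1)
    # Linear combination of the loss terms (unit weights are placeholders).
    return temporal + (prop + caus + rep) / n_pairs

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 300))            # 5 state features from 300-D observations
loss = prior_losses(W, rng.random((200, 300)),
                    rng.integers(0, 4, 200), rng.integers(0, 2, 200))
```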

MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jain, Tompson, LeCun, Bregler. Arxiv 2014

  1. A system for pulling out pose estimation from videos using conv nets – including color and motion features
  2. They propose a new body pose dataset, and their results test as better than the state of the art
  3. Traditionally, pose estimation has relied on hand-coded features like HoG (histogram of oriented gradients), and not motion-based features. On the other hand, psychophysical experiments show that for people, motion is a powerful cue that by itself can be used to extract a great deal of information, including pose
  4. Previous studies involving the use of motion data had negative results, leading to no real improvement in actual performance, and in some cases, intractable inference problems.
    1. Here it is shown that deep learning can take advantage of motion information.  In fact, with their approach, motion data alone outperforms a number of algorithms, showing that there is indeed valuable information in motion data
  5. Contributions:
    1. An algorithm that incorporates motion features and outperforms state of the art for ‘in-the-wild’ data
    2. Algorithm is efficient and is almost real time
  6. Hogg (different from HoG) in 1983 built one of the first systems for motion tracking; such systems often worked from an explicit geometric model, required initialization, and then incrementally updated the pose information
  7. Later on, systems without explicit geometrical models were introduced, generally relying on “bags of features” (SIFT, STIP, HoG, HoF)
  8. Most state of the art is based on a combination of HoG and “Deformable Part Models” (DPM)
  9. Previous applications of deep learning to pose recognition led to better than state-of-the-art performance
  10. Input to their convnet is an RGB image along with a set of motion features
  11. Two broad categories of motion data:
    1. Simple derivatives of RGB video frames
    2. Optical flow features
  12. The simple frame derivatives are not great and are high-dimensional. It would be hard to get a network to learn optical flow itself, so they compute optical flow separately as a preprocessing step (see the sketch after this list)
    1. They mention later that this is a nontrivial amount of information, so it could be a big help to an algorithm, although other algorithms haven’t been able to take advantage of it, and even this information alone in their system leads to good performance
  13. Convnet is based on “sliding patches”
  14. <skipping details of arch and optimization, can come back to it if necessary>
  15. Designed only to identify one skeleton on screen; the center of the torso is marked, which allows constraints on the rest of the skeleton to be used
  16. Training on 4k training images and 1k test images takes 12 hours; a forward pass takes 50ms
  17. They show examples where use of motion data leads to correct classification, but ignoring it leads to errors
    1. Especially in the case when there is a cluttered background
  18. <Seems like they just do head and arms? Torso already given…>
  19. System is pretty robust to range of parameters for optical flow, and removal of camera motion compensation doesn’t change performance much either
  20. Their results really beat up on the other state-of-the-art methods on their dataset
    1. Even motion features alone beat them, but if you want exact results the RGB information is necessary as well
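
As a concrete picture of the input construction in notes 10-12, here is a small sketch that computes dense optical flow between consecutive frames as a separate preprocessing step and stacks it with the RGB frame as extra input channels for a convnet. OpenCV’s Farneback flow is used as a stand-in for whatever flow algorithm the paper actually uses, and the 5-channel layout is just one plausible arrangement.

```python
import cv2
import numpy as np

def rgb_plus_flow(prev_frame, cur_frame):
    """Both frames: HxWx3 uint8 BGR images. Returns an HxWx5 float32 network input."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow computed outside the network (preprocessing step):
    # an (H, W, 2) array of per-pixel x/y displacements.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    rgb = cur_frame.astype(np.float32) / 255.0
    return np.concatenate([rgb, flow.astype(np.float32)], axis=2)

# Dummy usage with blank frames:
prev = np.zeros((224, 224, 3), dtype=np.uint8)
cur = np.zeros((224, 224, 3), dtype=np.uint8)
net_input = rgb_plus_flow(prev, cur)   # shape (224, 224, 5), fed to the convnet
```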

Dictionary-Based Compression for Long Time-Series Similarity. Lang, Morse, Patel. Knowledge and Data Engineering 2010

  1. Skimmed this, but just making note to self
  2. Does a form of Lempel-Ziv compression on continuous-valued time series
    1. In particular, it’s for finding similarity between time series, like dynamic time warping does
  3. Works on continuous data by computing a distance <perhaps Euclidean?> for each point in the time series – if it’s epsilon-close to something already in the dictionary, it is treated as that entry (see the sketch after this list)
    1. This allows for a discretization, but one that’s not based on coarse quantization
  4. They claim this method works as well as other methods for finding similarity between time series but this approach is much cheaper.
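
A tiny sketch of the epsilon-matching idea from note 3: walk the series and map each value to an existing dictionary entry if it is within epsilon of one, otherwise add a new entry. This covers only the discretization step, not the LZ-style parsing or the similarity measure the paper builds on top of it.

```python
def epsilon_discretize(series, eps):
    """Map each value to the index of an epsilon-close dictionary entry."""
    dictionary = []   # representative values seen so far
    symbols = []      # dictionary index assigned to each point
    for x in series:
        match = next((i for i, v in enumerate(dictionary) if abs(x - v) <= eps), None)
        if match is None:             # nothing epsilon-close: add a new entry
            dictionary.append(x)
            match = len(dictionary) - 1
        symbols.append(match)
    return symbols, dictionary

symbols, dictionary = epsilon_discretize([1.0, 1.05, 3.2, 0.97, 3.35], eps=0.1)
# symbols -> [0, 0, 1, 0, 2]: nearby values collapse onto the same entry
# without imposing a coarse, fixed quantization grid (note 3.1)
```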