Category Archives: MoCap

Unsupervised Learning of Video Representations using LSTMs. Srivastava, Mansimov, Salakhutdinov. Arxiv

  1. LSTMs to learn representations of video sequences
  2. Maps an input video sequence to a fixed-length representation
  3. This representation is then used to do other tasks
  4. Experiment with inputs of pixel patches as well as “high-level” “percepts”
    1. <Definition of latter isn’t clear from abstract but sure it will be explained>
  5. Unsupervised setting
  6. ” The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state.”
  7. Can use NN to do autoencoder, or prediction, or both
  8. Previous work on generative models of video showed that a squared-error loss function is problematic and instead resorts to a dictionary-based method.
    1. Its very hard to make an appropriate and effective loss function for video
    2. Here they simply go with squared error
  9. “The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don’t vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover longrange features.”
  10. For unsupervised learning, they use two LSTM networks – one for encoding, another for decoding
  11. “The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.”
  12. “The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.”
  13. This is a similar setup as is used to develop representations of word meanings
  14. Also a previous paper that does video prediction (in the to-read list)
  15. Prediction can either be conditioned on the previously predicted output or not:
    1. “There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.”
  16. In the “composite model” the same LSTM representation is passed into two different LSTM decoders, one that does prediction, while the other does reconstruction of the input sequence
  17. <Experiments>
  18. Use UCF-101 and HMDB-51 datasets which each have several thousand videos of several seconds each
  19. They also subsample from a previously existing sports video dataset
    1. Sample down to 300 hours in clips that are 10 seconds each
  20. Unsupervised training over this youtube set was good enough; using the other two sets in addition didn’t really change performance
  21. Whole video was 240×320, they look at just the center 224×224, 30fps
  22. “All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peep-hole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. “
  23. First set of experiments is on moving MNIST digits.  Sequences were 20 frames long and were in  a 64×64 patch
    1. Positions are randomly initialized as well as velocities; they bounce off walls
  24. LSTM has 2048 units
  25. Look at 10 frames at a time, try and reconstruct that 10, predict next 10
  26. “We used logistic output units with a cross entropy loss function.”
  27. “It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls”
  28. Two layer LSTM works better than 1, and using previously predicted outputs to predict future outputs also helped
  29. Next they worked on 32×32 image patches from one of the real-life video datasets
    1. Linear output units, squared error loss
    2. Input 16 frames, reconstruct last 16 and predict next 13
  30. Show outputs for 2048 and 4096 units, claim latter is better <to my eye they look essentially identical>
    1. Next they test for generalization at different time scales.  2048 LSTM units, 64×64 input
    2. Then look at predictions for 100 frames in the future.  Don’t show the image ouputs, but show that the activation doesn’t just average out to some mean amount; it has periodic activations that it maintains
    3. It starts to blur outputs, but maintains motion:
  31. They also used the network that was trained on 2 moving digits on 3 and 1 moving digits; for 1 digit it superimposed another blob on the one digit, and for the 3 it blurred the digits out but maintained motion <to me, on the other hand it looks like it slightly blurs the 1 digit and for the 3 digit turns into 2 blurry digits>
  32. Discuss visualizing features <not paying attention to that at the moment>
  33. Check how unsupervised learning helps supervised learning
  34. 2048 unit autoencoder trained on 300 hours of video.  Encode 16 frames, predict 10
  35. “At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames.”
    1. <averaging over predictions seems like a funny thing to do>
  36. ” All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM” Dropout was important
  37. For small datasets pretraining on unsupervised data helps
  38. Similar experiments on “Temporal Stream Convolutional Nets” <In my reading queue> – also helped there
  39. “We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better”
  40. “The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is atleast partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. “

Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtlander. Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is interesting in particular because as being rewarded for speed, the best policies are the closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons.
    Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.”
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all stat information (position but not velocity) they need to use some method to allow the other parts of the state information in.  One option is to make the state the previous two frames, the other is to do a Kohonen Map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. Use a paricular form of value FA I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsioln-greedy policy with epsilon= 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.”

Predictive Projections. Sprague. IJCAI 09.

  1. algorithm for linear “projections in which accurate predictions of future states can be made using simple nearest neighbor style learning.”
  2. Experiments on simulated pendulum and real roomba robot
  3. Considers video data
  4. Based on earlier work (by Tesauro among others) in “which nearest neighbor learning is recast in a probabilistic framework that allows for gradient based optimization of the distance metric.  These existing algorithms discover projections of the training data under which nearby points are likely to have the same class label or similar regression targets.”
    1. Neighborhood Components Analysis (NCA) – for nearest neighbor classification
  5. “It would be difficult to directly find an A that minimizes the k-nearest-neighbor classification error rate, because the number of errors will be a highly discontinuous function of A; very small changes in A may change the set of nearest neighbors for some points. The innovation behind the NCA algorithm is to recast nearest neighbor learning in a probabilistic framework. In this framework, expected error is a continuous, differentiable function of A and thus may be minimized using gradient based techniques.”
  6. NCA was originally designed for classification but was later extended to regression
  7. A minor tweak is needed to get NCA for regression to work well on predicting vectors when the problem is stochastic
  8. Here they use conjugate gradient to do the optimization
  9. The approach can scale well; a paper from ’07 was able to run on 60k training points
  10. “Following this gradient acts to reduce the objective in two different ways. It adjusts A to be more predictive by increasing the probability that a neighbor will be chosen if it successfully predicts the next state. It also adjusts A to be more predictable, by moving target states together whenever there is a high probability they will be chosen to predict each other.”
  11. Important to whiten data first
  12. To deal with actions, partition data by action but constrain to use same projection <how is this different than just throwing out the action data>
  13. Only parameters to the algorithm are size of projection matrix A and its initial values
  14. “The predictive projections algorithm as described above may not perform well in cases where the effects of different actions are restricted to specific state dimensions. Since there is no explicit penalty for failing to predict some dimensions, the algorithm may minimize the objective function by finding an A which is not full rank, thus accurately predicting some dimensions while discarding others.”
    1. Although there is a potential fixes for this
  15. For the applications this method is coupled with LSPI (along with RBFs)
  16. Algorithm deals well with Lagoudakis+Parrs’ LSPI
  17. For the roomba experiment they put a camera on top – all the robot needs to do is to continue to move forwards without hitting the wall
    1. <Seems like not a terribly hard task just turn when the wall takes up most of the view?>
  18. temp
  19. Image is 20×20: 1200-d once color is factored in
    1. Then this is reduced via PCA to 25d
    2. Then their approach takes that to 2d (A is initialized to the first 2 principal components)
  20. Their approach basically recovers translation and rotation, while PCA recovers amount of wall present and lighting (their approach can use this to make a good policy, but that can’t be done linearly from what PCA produces)
  21. “To our knowledge, the proto-value function framework has not been applied to the type of noisy, high dimensional control problems addressed in this paper. It seems likely that the neighborhood calculations required for constructing the diffusion model could be dominated by noise dimensions, particularly in very noisy tasks such as the modified pendulum domain described above. In that case, the PVF approach and
    predictive projections would be complementary: The PP algorithm could find a low dimensional state projection that contains relevant state information, and the PVF algorithm could then be used to discover a set of appropriate basis functions in that space”
  22. “Another closely related project is the basis iteration algorithm described in [Sprague, 2007]. This algorithm also uses gradient based metric learning to discover an appropriate projection, but it focuses directly on finding a metric that allows for accurate estimation of the optimal value function. It accomplishes this by iterating value function estimation with updates to the projection matrix. This algorithm has the advantage of incorporating reward information, but it depends on starting with an initial projection that enables a reasonableestimate of the optimal value function.”

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschowski, Brock. RSS 20Re10Q

related to

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality but in terms of transition not reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall proportionality doesn’t hold because no amount of pushing will allow the robot to move
  12. They set learning up as an optimization problem that includes loss terms for each of the priors above (just a linear combination)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions which we dont have access to>
  14. Formally they would need to compare all samples, leading to n^2 loss, but they restrict this to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiment are 300 pixels or 100 (not 300×300)
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then to an RL comparison based on a number of different methods for learning the representation (5 features extracted by)
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. screenshot
  25. SFA features are really terrible, their approach is pretty similar to operating from the ground truth
  26. With further investigation on the same results they demonstrate that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimension reduction methods are that they didn’t learn to disregard the distractors

Autonomous Learning of State Representations for Control: An Emerging Field Aims to Autonomously Learn State Representations for Reinforcement Learning Agents from Their Real-World Sensor Observations. Bohmer, Springenberg, Beodecker, Riedmiller, Obermayer. Künstliche Intelligenz 2015.

  1. Trying to do robotics control directly from high-dimensional sensor readings.  The trick is that with robots experience is costly so you need to use data very efficiently <the deepmind guys were able to run each game for the equivalent of over a month because its just in simulation>
  2. Review 2 methods for extracting features: deep auto-encoders and slow feature analysis
  3. Mention deep learning’s deep Q network
  4. Until DQN, using FAs has been largely unsuccessful, and has generally worked only on toy domains <although Tesauro’s TD Gammon is a significant exception>
  5. “In the last years several researchers have also considered end-to-end learning of behavioral policies p, represented by general function approximators. First attempts towards this include an actor-critic formulation termed NFQ-CA [14], in which both the policy p and the Q-function are represented using neural networks and learning proceeds by back-propagating the error from the Q-network directly into the policy network—which was, however, only applied to low dimensional control problems.”
  6. Also cites deterministic policy gradient, which worked on a 30-DOF arm
  7.  Cites deep fitted Q (same lab, papers from a couple of years prior that also was used on slot cars and an inverted pendulum from raw camera images
  8. The first part of deep fitted Q is unsupervised, setup as an autoencoder
  9. In order for the RL to work, it is important that the problem has low “intrinsic dimensionality” (the manifold it lies on)
    1. For example, the image a camera gets from a robotic inverted pendulum can be arbitrarily large, but the underlying dimension of the problem is just 2
  10. They do not assume a pomdp; assume that the high-dimensional input completely describes the problem <although this is almost certainly not actually the case with at least the slot car, but that is not a ding on their work it just makes it more impressive>
  11. The representation chosen for the state depends on an optimization problem
  12. Train an autoencoder to minimize MSE
  13. DFQ is a clustering-based approach<?>
    1. There are some problems with DFQ that they will discuss later, but:
    2. Representation learned by DFQ isn’t reward-based, so representation doesn’t care about that
    3. “learning auto-encoders for inputs with high variability (i.e. many objects of relevance) can be hard” Although says regularization can help this, and cites a number of other papers that deal with this
  14. “Despite these drawbacks DFQ also comes with advantages over end-to-end learning: since the auto-encoder merely learns to reconstruct sampled observations and it can be fed with samples generated by any sampling policy for any task, is thus less susceptible to non-stationarity of the training data.”
  15. Mentions things like proto-value functions, laplacian eigenmaps etc for discrete RL, relationship between SFA and LEM and PVF
  16. “From a theoretical point of view, SFA and PVF both approximate subspace-invariant features. This classification has its origin in the the analysis of approximation errors in linear RL [39]. Here subspace-invariant features induce no errors when the future reward is propagated back in time. It can be shown that under these conditions the least-squares temporal difference algorithm (LSTD, [9]) is equivalent to supervised least-squares regression of the true value function [5]. However, this is only possible for the class of RL-tasks with a self-adjoint transition model. As this class is very rare, both SFA and PVF substitute a self-adjoint approximation of the transition model to compute almost subspace-invariant representations.4 An analysis of the optimal solution5 shows that SFA approximates eigenfunctions of the symmetrized transition operator [50]. Moreover, with a Gaussian prior for the reward, one can show that SFA representations minimize a bound on the expected LSTD error of all tasks in the same environment [5]. However, as the solution depends on the sampling distribution, straight forward application for transfer learning is less obvious than in the case of PVF. Future works may rectify this with some sensible importance sampling, though.”
  17. When the data completely covers the space, PVF and SFA are pretty equivalent, but when data is generated from a random walk, PVF fails and SFA still manages to do almost as well as when there is perfect coverage
  18. In less artificial settings/domains there is less of a performance gap between PFV and SFA but it still exists, SFA seems to do a better job generating features for LSTD and LSPI
  19. In these problems, the representation learned must:
    1. Maintain Markov property (no partial observability)
    2. Be able to represent the value function well enough to bootstrap from it and improve the policy
    3. Generalize well (especially to cases that may not be well covered by the training distribution)
    4. Low dimensional
  20. “In stochastic environments one can only compare probability distributions over future states based on a random policy, called diffusion distances. It can be shown that SFA approximates eigenfunctions of the symmetrized transition operator, which encode diffusion distances [5]. SFA features are therefore a good representation for nonlinear RL algorithms as well.”
    1. <Depends how you define “stochastic environments” I assume they mean environments where there is no choice in action selection at all because otherwise what is written is incorrect>
  21. “In summary, SFA representations X seem in principle the better choice for both linear and non-linear RL: nonlinear SFA extracts eigenfunctions of the transition model Pp, which are the same for every isomorphic observation space Z, encode a diffusion metric that generalizes to states with similar futures and approximates a Fourier basis of the (unknown) underlying state space.”
  22. Empirical work shows SFA works best when exposed to data that gets uniform coverage
  23. “a Fourier basis as approximated by SFA grows exponential in the underlying state dimensionality. Linear algorithms, which depend on this basis to approximate the value function, are therefore restricted to low dimensional problems with few or no variables unrelated to the task. Non-linear RL algorithms, on the other hand, could work in principle well with only the first few SFA features of each state-dimension/variable. e. The order in which these variables are encoded as SFA features, however, depends on the slowness of that variable. This can in practice lead to absurd effects. Take our example of a wheeled robot, living in a naturally lit room. The underlying state space that the robot can control is three-dimensional, but the image will also depend on illumination, that is, the position of the sun.”
    1. Describes this as being vulnerable to slow distractors
  24. “Also, auto-encoders minimize the squared error over all input dimensions of Z equally. This can produce incomplete representations if a robot, for example, combines observations from a camera with measurements from multiple joints. Due to the large number of pixels, small improvements in the reconstruction of the image can outweigh large improvements in the reconstruction of the joint positions.”
  25. Compares results from a deep auto encoder and one with a NN that has objectives of slowness and predictability.  The latter produces a much better output <but this is really a garbage comparison because the prior is based on actual physical experiments and the latter is based on an extremely oversimplified clean simulation.>
  26. “Take the example of uncontrollable distractors like blinking lights or activity outside a window. Each distractor is an independent variable of the isomorphic state S, and to learn an isomorphic representation X requires thus samples from all possible combinations of these variables.  The required training samples grow exponentially in the number of distractors.”
    1. <The XSENS suit has tons of blinking lights on it.  I was thinking about this as a potential distractor for the methods we are working on…>
  27. Moves on to a discussion of “factored representations and symbolic RL” at the end of the paper
  28. Basically discuss object oriented RL (discuss it in terms of relational RL)

A Survey on Vision-based Human Action Recognition. Poppe. Image and Vision Computing 2010.

  1. Consider task of annotating action in video
  2. Task is a combination of feature extraction <motion trackers at least solve this>, and then classification
  3. One taxonomy for the task breaks actions down between:
    1. Action primitives: low level (such as limb) movements ex/move left leg
    2. Actions: built of primitives, may be composed of contemporaneous primitives (such as arm and leg movement), may be cyclic. ex/run
    3. Activities: a task that can be given a label. ex/jumping hurdles
    4. <These examples are from the text, and perhaps not the best example, because certainly sometimes running is the entire activity?>
  4. Here do not consider environmental context, multi-person interaction, or objects
    1. Consider only full body movement and not gesture recognition
  5. In the field of gait recognition, the goal is to differentiate people based on the way they walk; here the goal is the opposite – that is to look at multiple people doing different tasks and figure out in which cases they are the same task
    1. Although there is more recently work that tries to do both – identify activity and individuals doing the activity
  6. Pose reconstruction is akin to a regression problem, whereas activity recognition is a classification problem
  7. <Didn’t get to finish-posting for spring cleaning>

Mapping the stereotyped behaviour of freely-moving fruit flies. Berman, Choi, Bialek, Shaevitz. Journal of the Royal Society Interface 2014

  1. “A frequent assumption in behavioural science is that most of an animal’s activities can be described in terms of a small set of stereotyped motifs.”
  2. Create a system to analyze behavior and find that about half the time, behavior is based on 100 different stereotyped behavioral states
  3. Stereotypy – “that an organism’s be- haviours can be decomposed into discrete, reproducible elements”
  4. Although animals can move in a really enormous space of movements, they are thought to keep motion in a small set of motion, made up of stereotyped actions (may be specific to time range, individual throughout life, or a species)
  5. “A discrete behavioural repertoire can potentially arise via a number of mechanisms, including mechanical limits of gait control, habit formation, and selective pressure to generate robust or optimal actions.”
  6. For the most part, stereotypy hasn’t been studied experimentally, mostly because of a “lack of a comprehensive and compelling mathematical frame- work for behavioural analysis”
    1. They introduce a system for doing so
  7. Most previous methods for characterizing behavior fall into one of two categories:
    1. Very coarse metrics (like mean velocity, count # times a barrier is crossed).  This makes analysis and data collection easy, but only captures a tiny amount of relevant information
    2. The other approach is to log behavior in terms of a number of different possible categories.  This can be done by hand or by machine.  The problem with this is it introduces bias (you find only what classes you defined in the first place, and assumes small number of discrete high-level behaviors)
  8. A better system would be one that works from the bottom up, starting directly with the data as opposed to operator-defined classes
  9. “The basis of our approach is to view behaviour as a trajectory through a high-dimensional space of postural dynamics. In this space, discrete behaviours correspond to epochs in which the trajectory exhibits pauses, corresponding to a temporally-extended bout of a particular set of motions. Epochs that pause near particular, repeatable positions represent stereotyped behaviours. Moreover, moments in time in which the trajectory is not stationary, but instead moves rapidly, correspond to non-stereotyped actions.”
  10. Specifically, based on the data found from fruit flies
    1. “These stereotyped behaviours manifest themselves as distinguishable peaks in the behavioural space and correspond to recognizably distinct behaviours such as walking, running, head grooming, wing grooming, etc.”
  11. The fly is isolated into a 200×200 window in each frame (full image has 40k pixels), and then rotated/translated/resized to get standard representation
  12. Nearly all variance (93%) in images can be represented by PCA down to 50D
  13. Use a spectrogram representation of postural dynamics based on Morlet continuous wavelet transform <?>
    1. “Although similar to a Fourier spectrogram, wavelets possess a multi-resolution time- frequency trade-off, allowing for a more complete description of postural dynamics occurring at several time scales”
  14. Embedding is made up of 25 frequency channels for the 50 eigenmodes, so each point in time is represented by 1,250D
    1. Because behavior is highly correlated, they speculate a much lower dimensional manifold lies inside this space that describes behavior
  15. The goal is to map the 1250D data to something much smaller where trajectories in that space pause when stereotyped behavior occurs. “This means that our embedding should minimise any local distortions.”<I don’t know why>
  16. So they approach they chose “reduces dimensionality by altering the distances between more distant points on the manifold.”
    1. PCA, MDS, Isomap to exactly the opposite of this, prioritizing large-scale accuracy over local
  17. “t-Distributed Stochas- tic Neighbor Embedding (t-SNE)” however, does satisfy their requirement
    1. “For t-SNE, the conserved invariants are related to the Markov transition probabilities if a random walk is performed on the data set.”  Transition probabilities between two time points are based on a Gaussian kernel over distance
    2. Minimizes “local distortions” <?>
  18. With t-SNE, transition probabilities are similar to larger-space transition probabilities, but are proportional to Cauchy/Student-t kernel of points’ Euclidian distances
  19. Problem with t-SNE is quadratic memory use – they use importance sampling to subsample to 35k data points and then run on that
  20. A distance function is still needed.  They use KL-divergence
  21. They are able to embed data nicely in 2D – going to 3D leads to a 2% reduction of embedding cost function
  22. Get a probability density over behavior by convolving each point in embedded map with Gaussian
    1. Space has peaks, and trajectories pause at peak locations when conducting stereotyped behavior
  23. This 2D space is then decomposed by a “watershed transform”, where points are grouped together if hill-climbing from them leads to the same (local) maximum
  24. Peaks correspond to commonly defined behaviors, but here, everything is bottom-up from the data.  Also, nearby regions encode similar but distinct behavior
  25. As expected, many behaviors (like running) are confined to an orbit in the low-dimensional space
  26. Were able to pull apart distinguishing characteristics between male and female behavior

Automated monitoring and analysis of social behavior in Drosophila. Dankert, Wang, Hoopfer, Anderson, Perona. Nature Methods 2009.

  1. Machine vision approach to studying aggresion and courtship in flies.
    1. Computes location, orientation, and wing posture
  2. “Ethograms may be constructed automatically from these measurements, saving considerable time and effort.”
  3. “Both aggression and courtship consist of rich ensembles of stereotyped behaviors, which often unfold in a characteristic sequence.”
  4. Approach automates a task that was previously done by hand, so saves labor and is more objective, and may pick up on smaller or shorter-duration details
  5. Being able to study behavior in minute detail will allow better analysis of impact of genetics and environment on behavior
  6. Each video frame corresponds to 25 measurements/features like pose, velocity, wing pose, (size as well because it determines male/female, other special markings like a dot are used for male-male interaction) etc…
  7. Able to identify a number of different high-level actions/behaviors like fighting, courting (accuracy was 90+%)
  8. They then looked if they could pick up on behavioral changes by suppressing some neurons that had previously been determined to reduce aggressive behavior, and they were able to confirm the same result, looked at mutants as well
  9. System allows for visualization of behavior in a number of ways, such as by making histograms or graphs depicting transitions between behaviors with weights

MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jonathan, Tompson, LeCun, Bregler. Arxiv 2014

  1. A system for pulling out pose estimation from videos using conv nets – including color and motion features
  2. They propose a new body pose dataset, and their results tests as better than state of the art
  3.  Traditionally, posture estimation has relied on hand coded features like HoG (histogram of gradients), and not motion-based features.  On the other hand, psychophysical experiments show that to people, motion is a powerful cue that by itself can be used to extract a great deal of information including pose
  4. Previous studies involving the use of motion data had negative results, leading to no real improvement in actual performance, and in some cases, intractable inference problems.
    1. Here it is shown that deep learning can take advantage of motion information.  In fact, with their approach, motion data alone outperforms a number of algorithms, showing that there is indeed valuable information in motion data
  5. Contributions:
    1. An algorithm that incorporates motion features and outperforms state of the art for ‘in-the-wild’ data
    2. Algorithm is efficient and is almost real time
  6. Hogg (different from HoG) in 83 was one of the first systems for motion tracking, they often worked from an explicit geometric model and required initialization and then incrementally updated the pose information
  7. Later on, systems without explicit geometrical models were introduced, generally relying on “bags of features” (SIFT, STIP, HoG, HoF)
  8. Most state of the art is based on a combination of HoG and “Deformable Part Models” (DPM)
  9. Previous applications of deep learning to pose recognition lead to better than state of the art performance
  10. Input to their convnet is a rgb image along with a set of motion features
  11. Two broad categories of motion data:
    1. Simple derivatives of RGB video frames
    2. Optical flow features
  12. The simple derivatives are not great, and is high-dimensional data.  It would be hard to get a network to do optical flow, so they compute optical flow separately as a preprocessing step
    1. They mention later that this is a nontrivial amount of information, so it could be a big help to an algorithm, although other algorithms havent been able to take advantage of it, and even just this information alone in their system leads to good performance
  13. Convnet is based on “sliding patches”
  14. <skipping details of arch and optimization, can come back to it if necessary>
  15. Designed only to identify one skeleton on screen, center of torso is marked, which allows for constraints on the rest of skeleton to be used
  16. Training on 4k training, 1k test images takes 12 hours, a forward pass through takes 50ms
  17. They show examples where use of motion data leads to correct classification, but ignoring it leads to errors
    1. Especially in the case when there is a cluttered background
  18. <Seems like they just do head and arms? Torso already given…>
  19. System is pretty robust to range of parameters for optical flow, and removal of camera motion compensation doesn’t change performance much either
  20. Their results really really beat up on other state of the art in their data set
    1. Even motion features alone beat them, but if you want exact results the RGB information is necessary as well

Temporal Segmentation and Activity Classification from First-person Sensing, Spriggs, De La Torre, Hebert. CVPR 2009.

  1. “We explore first-person sensing through a wearable camera and Intertial Measurement Units (IMUs) for temporally segmenting human motion into actions and performing activity classification in the context of cooking and recipe preparation in a natural environment.”
  2. Try:
    1. Gaussian Mixture Models
    2. HMMs
    3. K-NN
    4. Also try unsupervised methods so that annotation isn’t necessary
  3. Large number of references to prior work
  4. The IMUs used here are set up as:
    1. 5 on the wrist
    2. <5?> on ankles
    3. 1 on waist
  5. And then there is head-cam
  6. There is ambiguity in how to label what is going, especially because people did the same task differently.
    1. Also huge partial observability (even in terms of what is in view of the camera)
  7.  Action clauses are distinct per recipe (at least partially), 29 clauses for brownies
  8. For unsupervised segmentation, they try PCA, then cluster that
    1. Performance is about 70% correct with this method
    2. These features can also be used to try and classify what recipe is being made
  9. Recipe classification is perfect on the small dataset when using data from IMUs
  10. Tried unsupervised clustering of IMU data with HMM but it came up with garbage
  11. They then merge video and IMU data <but its not totally clear how they accomplish this in terms of representation> again, run through PCA and HMM
  12. Recipe classification with this method was ~93%, so just using IMUs is better
  13. Then they move onto supervised, with annotation
  14. ~80% of frames annotated, “stirring the mix” takes up about 25% of labeled frames
    1. They trained only on frames where annotation was available
  15. Then do classification with supervised HMM (poor classification~ 10%) and k-NN (much better at ~60%)
    1. They argue perf of k-NN is from high D data <not sure why that is a good argument though>
Tagged , , , ,