Category Archives: Generative Models

Unsupervised Learning of Video Representations using LSTMs. Srivastava, Mansimov, Salakhutdinov. Arxiv

  1. LSTMs to learn representations of video sequences
  2. Maps an input video sequence to a fixed-length representation
  3. This representation is then used to do other tasks
  4. Experiment with inputs of pixel patches as well as “high-level” “percepts”
    1. <Definition of latter isn’t clear from abstract but sure it will be explained>
  5. Unsupervised setting
  6. ” The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state.”
  7. Can use NN to do autoencoder, or prediction, or both
  8. Previous work on generative models of video showed that a squared-error loss function is problematic and instead resorts to a dictionary-based method.
    1. Its very hard to make an appropriate and effective loss function for video
    2. Here they simply go with squared error
  9. “The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don’t vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover longrange features.”
  10. For unsupervised learning, they use two LSTM networks – one for encoding, another for decoding
  11. “The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.”
  12. “The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.”
  13. This is a similar setup as is used to develop representations of word meanings
  14. Also a previous paper that does video prediction (in the to-read list)
  15. Prediction can either be conditioned on the previously predicted output or not:
    1. “There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.”
  16. In the “composite model” the same LSTM representation is passed into two different LSTM decoders, one that does prediction, while the other does reconstruction of the input sequence
  17. <Experiments>
  18. Use UCF-101 and HMDB-51 datasets which each have several thousand videos of several seconds each
  19. They also subsample from a previously existing sports video dataset
    1. Sample down to 300 hours in clips that are 10 seconds each
  20. Unsupervised training over this youtube set was good enough; using the other two sets in addition didn’t really change performance
  21. Whole video was 240×320, they look at just the center 224×224, 30fps
  22. “All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peep-hole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. “
  23. First set of experiments is on moving MNIST digits.  Sequences were 20 frames long and were in  a 64×64 patch
    1. Positions are randomly initialized as well as velocities; they bounce off walls
  24. LSTM has 2048 units
  25. Look at 10 frames at a time, try and reconstruct that 10, predict next 10
  26. “We used logistic output units with a cross entropy loss function.”
  27. “It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls”
  28. Two layer LSTM works better than 1, and using previously predicted outputs to predict future outputs also helped
  29. Next they worked on 32×32 image patches from one of the real-life video datasets
    1. Linear output units, squared error loss
    2. Input 16 frames, reconstruct last 16 and predict next 13
  30. Show outputs for 2048 and 4096 units, claim latter is better <to my eye they look essentially identical>
    1. Next they test for generalization at different time scales.  2048 LSTM units, 64×64 input
    2. Then look at predictions for 100 frames in the future.  Don’t show the image ouputs, but show that the activation doesn’t just average out to some mean amount; it has periodic activations that it maintains
    3. It starts to blur outputs, but maintains motion:
  31. They also used the network that was trained on 2 moving digits on 3 and 1 moving digits; for 1 digit it superimposed another blob on the one digit, and for the 3 it blurred the digits out but maintained motion <to me, on the other hand it looks like it slightly blurs the 1 digit and for the 3 digit turns into 2 blurry digits>
  32. Discuss visualizing features <not paying attention to that at the moment>
  33. Check how unsupervised learning helps supervised learning
  34. 2048 unit autoencoder trained on 300 hours of video.  Encode 16 frames, predict 10
  35. “At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames.”
    1. <averaging over predictions seems like a funny thing to do>
  36. ” All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM” Dropout was important
  37. For small datasets pretraining on unsupervised data helps
  38. Similar experiments on “Temporal Stream Convolutional Nets” <In my reading queue> – also helped there
  39. “We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better”
  40. “The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is atleast partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. “

Learning probability distributions over partially-ordered human everyday activities. Tenorth, De la Torre, Beetz. ICRA 2013

  1. Attempt “… to learn the partially ordered structure inherent in human everyday activities from observations by exploiting variability in the data.”
  2. Learns full joint probability over actions that make up a task, their partial ordering, and their parameters
  3. Can be used for classification, but also figuring out what actions are relevant to a task, what objects are used, if it was done correctly, or what is typical for an individual
  4. Use synthetic data as well as TUM Kitchen and CMU MMAC
  5. Use Bayesian Logic Nets (another paper has author overlap and uses the same approach)
  6. Common alternate approaches are HMMs, conditional random fields (CRFs), or suffix trees
    1. But these are most effective when the ordering of the subtasks are pretty fixed
    2. Also Markov assumption of HMM doesn’t really hold in the way data is often represented and may require all history information
  7. Also some other approaches for representing partial ordering
    1. <Whatever this means> “All these approaches focus only on the ordering among atomic action entities, while our system learns a distribution over the order as well as the action parameters.”
  8. Literature on partially ordered plans require lots of prior information, and have been applied to synthetic problems
  9. Working off the TUM dataset, they needed to resort to bagging and data noisification techniques in order to get enough samples
  10. Needs annotation / data labeling
  11. Learns, for example, that after making brownies (the CMU MMAC dataset) some people like to put the frying pan in the sink, and others on the stove
  12. Performance of this approach is much better than that of CRFs, and is more robust to variability in how people undertake tasks

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers


  1. Main goal of the work is to “interchange” the discretization components of:
    1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
      1. Underlying component is Incremental Regression Tree Induction (IRTI)
    2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
      1. Underlying component is HOO
  2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
  3. Also include the idea of transposition tables
  4. Work examines behavior, computation time, computational complexity, and memory of both algs
  5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations


Ch 1: Introduction

  1. In terms of doing the actual action planning, talks of two possible options:
    1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
    2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
  2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originially TLS was coupled with IRTI and HOLOP with STL)
    1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
    2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
    3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
    4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
  3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
  4. The Basic Research Questions for the Thesis are:
    1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
    2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
    3. How can the information retreived from simulations be reused for TLS
    4. Can a transposition tree increase the performance of simulation based systems in continuous environments
    5. Which combination of techniques is best
    6. What is the memory cost of these algorithms
    7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

  1. Discusses use of Zobrish hashing for high-performance transition tables.  I’ve never heard of it.

Ch 3: Automatic Decomposition

  1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
  2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
  3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
  4. Discussion of scaling of the exploration factor (commonly called C)
    1. The propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
    2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
    3. But they also mention that if vmin, vmax are known for the problem it can be used, so thats all fine.  If you don’t know that you need to resort to something like above
  5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option

Ch 4: Multi-step Optimization in Continuous Environments

  1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
  2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
  3. If a sample causes a split in a leaf in an internal leaf, the trees below that point must be discarded
    1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
  4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
  5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>

Ch 5: Continuous Transposition Tree

  1. Two methods for constructing transposition tables are introduced
  2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
  3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
  4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
  5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
    1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
  6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
  7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to those that has the highest value according to a particular computation, I think it is basically like UCB, which allows for exploration.
  8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
  9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>

Ch 6: Experiments and Results

  1. UCT here isn’t the normal version of UCT, somewhat between HOSTL and IRTI
  2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
  3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but choses best found), and a version of UCT that uses unifom discretization
  4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but anyway as more time is given IRTI will beat UCT anyway)
  5. Good parameterizations are found for each problem separately.
  6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, vanilla-MC actually outperfoms UCT
  7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
  8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
  9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
    1. Here, RMTL (original tree-learning search) performs best by a wide margin
    2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
    3. This is under time constraints
  10. <p. 35They test MTT by itself which I’m not exactly groking because I thought it was a technique for allowing tranposition tables to be used and not a policy, but I’m missing something>
  11. Have very nice experiments of sample distributions on 1-step optimization problems
  12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess cause they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because it requires a test confirming statistical significance to split>
  13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Experiments and Results

  1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
  2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
  3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

Non-Linear Monte-Carlo Search in Civilization II. S.R.K. Branavan David Silver Regina Barzilay. IJCAI 2011

  1. Based on the title this must be the most awesomest paper ever.  Its the culmination of my entire existence on earth.
  2. Apply non-linear regression within MCTS to estimate Q-function from random rollouts.
  3. Because it generalizes, it doesn’t need many rollouts to make an estimate
  4. Work also pulls out information from the manual (which seems to be the focus of  a separate paper)
  5. The non-linear FA is much more effective than a linear one
  6. Result beats built-in AI 78% of the time.
    1. That is awesome, but this is AI that dates back to 1996, and was designed to work on home-PCs of the era (33 mhz), so we might expect it isn’t amazing.
    2. They actually used FreeCiv, so this isn’t totally right (still designed for 100 mhz machines), and the AI was constrained not to cheat as it does in the real Civ (which is very fair).  I imagine the AI in freeCiv is much better than Civ2, though.
    3. They do, however, treat unfinished games as a loss, so draws are awarded to the AI
  7. The branching factor in the game is extremely huge, so I’m amazed anything works at all.
  8. This is similar to the Go work that used FAs to learn an evaluation function, but there linear FAs were used
  9. The VFA (value function approximator) is built locally in time (from a particular state)
    1. Global VFA was effective in Backgammon, but not Go or Civ, which are larger
  10. Elements of manual relevant to current game state are modeled as hidden variables in the non-linear VFA
  11. Use 4-layer NN for VFA, trained using rollouts.  Layers are as follows:
    1. Game state, action, and game manual
    2. Linguistic model of manual
    3. Fixed features that perform predefined transformations on first 2 layers
    4. Value function
  12. Rollouts train all of these, including the linguistic model
  13. They argue nonlinear VFAs generalize better from a small number of rollouts – thats nonintuitive to me since the hypothesis space of linear VFAs is smaller and should be learnt more quickly (even if its accuracy is worse at the end)
  14. ANNs are helpful because they can process the input in stages
  15. There is already some literature on RL in civ, but this is the first that plays the entire game, as opposed to some subset of it, or only selecting from options
  16. Use game rules and opponent actions as part of the black-box transition function
  17. Manual text that is extracted and fed into the ANN depends on the situation
  18. Its funny they dont at least carry over the ANN from the last step as an initial point to start learning, I guess it biases samples and random sampling is better.
  19. Stanford parser used on the manual
  20. Rollouts are 20 steps long, and uses game score as the evaluation function.  500 rollouts are performed
  21. Almost half a million features fed into the ANN
  22. Exploration is epslion-greedy
  23. Played through a 100 step game in 1.5 hours, not so bad
  24. Compare to 3 other methods:
    1. MCTS: each item in the game computes its search tree indep, this is the only alg that explicitly builds a search tree
    2. Linear MC – similar to the linear VFA method used for Go
    3. Non-linear MC is basically the same method as in the paper, except no manual
    4. Random-Text same as proposed method, but manual has scrambled word order (but same word content)
  25. The other comparison methods are much less effective against the AI (next best is non-linear MC with about 30%).
    1. Regular MCTS didn’t win a single game
  26. Its worth mentioning that they use the Civ2 manual to play freeciv.  While they are very similar, they are not exactly the same game.
  27. There is another paper “Learning to Win by Reading Manuals in a Monte-Carlo Framework” that deals more with just dealing with the manual.

Monte-Carlo Planning in Large POMDPs. David Silver, Joel Veness. Nips 2010.

  1. Deals with how to get MCTS (specifically UCT) working in POMDPs
  2. Maintains belief state with a particle filter
  3. The tree maintained by UCT is modified slightly
    1. Counts are on hist0ry (action and observations over time).
    2. History maps to the belief state
  4. The method used by UCT here is the unofficial version that builds an additional node onto the tree after each iteration
  5. Discuss convergence of UCT and this version for POMDPs, but don’t mention that sometimes actual learning is intractable
  6. The empirical results are impressive.  In smaller domains more exact methods slightly outperform PO-UCT, but PO-UCT is extremely cheap to run in comparison (multiple orders of magnitude).
  7. Some of the details (especially related to the particle filter) are omitted, but the source code was released
  8. Seems like there is very little literature on POMDPs and MCTS.  The stuff here is nice, although more details on the particle filter would be helpful