Category Archives: Deep Learning

DRAW: A Recurrent Neural Network For Image Generation. Gregor, Danihelka, Graves, Rezende, Wiestra. JMLR 2015

  1. “…introduces the Deep Recurrent Attentive Writer (DRAW)…  DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images.”
  2. Can generate house numbers from the google street number dataset that are indistinguishable from real images
  3. Instead of generating images all at once, this approach tries the equivalent of sketching an image first and then refining it
  4. “The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function is a variational upper bound on the log-likelihood of the data. It therefore belongs to the family of variational auto-encoders…”
  5. They train the attentional system with RL, and use backprop “In this sense it resembles the selective read and write operations developed for the Neural Turing Machine (Graves et al., 2014).”
  6. “tion over images. However there are three key differences. Firstly, both the encoder and decoder are recurrent networks in DRAW, so that a sequence of code samples is exchanged between them; moreover the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behaviour so far. Secondly, the decoder’s outputs are successively added to the distribution that will ultimately generate the data, as opposed to emitting this distribution in a single step. And thirdly, a dynamically updated attention mechanism is used to restrict both the input region observed by the encoder, and the output region modified by the decoder. In simple terms, the network decides at each time-step “where to read” and “where to write” as well as “what to write”.”
  7. The output of the encoder network is a hidden vector
  8. They use LSTM for their recurrent network
  9. The output of the encoder is used to parameterize a distribution over a latent vector (which is a diagonal Gausian).  They use a diagonal Gaussian instead of the more common Bernoulli distribution because it has a gradient that is easier to work with
  10. A sample from the latent distribution is then passed as input to the decoder
    1. The output of the decoder is added cumulatively to a canvas matrix which creates an image.  The number of steps used to write to the canvas is a parameter to the algorithm
  11. “The total loss is therefore equivalent to the expected compression of the data by the decoder and prior. “
  12. “…, we consider an explicitly twodimensional form of attention, where an array of 2D Gaussian filters is applied to the image, yielding an image ‘patch’ of smoothly varying location and zoom.”
  13. <skipping a bunch>
  14. Generated images of mnist nad street house numbers look good, but generated natural images look very blurry and not much like anything identifiable, although there is clear structure in what is generated.

Unsupervised Learning of Video Representations using LSTMs. Srivastava, Mansimov, Salakhutdinov. Arxiv

  1. LSTMs to learn representations of video sequences
  2. Maps an input video sequence to a fixed-length representation
  3. This representation is then used to do other tasks
  4. Experiment with inputs of pixel patches as well as “high-level” “percepts”
    1. <Definition of latter isn’t clear from abstract but sure it will be explained>
  5. Unsupervised setting
  6. ” The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state.”
  7. Can use NN to do autoencoder, or prediction, or both
  8. Previous work on generative models of video showed that a squared-error loss function is problematic and instead resorts to a dictionary-based method.
    1. Its very hard to make an appropriate and effective loss function for video
    2. Here they simply go with squared error
  9. “The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don’t vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover longrange features.”
  10. For unsupervised learning, they use two LSTM networks – one for encoding, another for decoding
  11. “The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.”
  12. “The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.”
  13. This is a similar setup as is used to develop representations of word meanings
  14. Also a previous paper that does video prediction (in the to-read list)
  15. Prediction can either be conditioned on the previously predicted output or not:
    1. “There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.”
  16. In the “composite model” the same LSTM representation is passed into two different LSTM decoders, one that does prediction, while the other does reconstruction of the input sequence
  17. <Experiments>
  18. Use UCF-101 and HMDB-51 datasets which each have several thousand videos of several seconds each
  19. They also subsample from a previously existing sports video dataset
    1. Sample down to 300 hours in clips that are 10 seconds each
  20. Unsupervised training over this youtube set was good enough; using the other two sets in addition didn’t really change performance
  21. Whole video was 240×320, they look at just the center 224×224, 30fps
  22. “All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peep-hole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. “
  23. First set of experiments is on moving MNIST digits.  Sequences were 20 frames long and were in  a 64×64 patch
    1. Positions are randomly initialized as well as velocities; they bounce off walls
  24. LSTM has 2048 units
  25. Look at 10 frames at a time, try and reconstruct that 10, predict next 10
  26. “We used logistic output units with a cross entropy loss function.”
  27. “It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls”
  28. Two layer LSTM works better than 1, and using previously predicted outputs to predict future outputs also helped
  29. Next they worked on 32×32 image patches from one of the real-life video datasets
    1. Linear output units, squared error loss
    2. Input 16 frames, reconstruct last 16 and predict next 13
  30. Show outputs for 2048 and 4096 units, claim latter is better <to my eye they look essentially identical>
    1. Next they test for generalization at different time scales.  2048 LSTM units, 64×64 input
    2. Then look at predictions for 100 frames in the future.  Don’t show the image ouputs, but show that the activation doesn’t just average out to some mean amount; it has periodic activations that it maintains
    3. It starts to blur outputs, but maintains motion:
  31. They also used the network that was trained on 2 moving digits on 3 and 1 moving digits; for 1 digit it superimposed another blob on the one digit, and for the 3 it blurred the digits out but maintained motion <to me, on the other hand it looks like it slightly blurs the 1 digit and for the 3 digit turns into 2 blurry digits>
  32. Discuss visualizing features <not paying attention to that at the moment>
  33. Check how unsupervised learning helps supervised learning
  34. 2048 unit autoencoder trained on 300 hours of video.  Encode 16 frames, predict 10
  35. “At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames.”
    1. <averaging over predictions seems like a funny thing to do>
  36. ” All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM” Dropout was important
  37. For small datasets pretraining on unsupervised data helps
  38. Similar experiments on “Temporal Stream Convolutional Nets” <In my reading queue> – also helped there
  39. “We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better”
  40. “The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is atleast partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. “

Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtlander. Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is interesting in particular because as being rewarded for speed, the best policies are the closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons.
    Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.”
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all stat information (position but not velocity) they need to use some method to allow the other parts of the state information in.  One option is to make the state the previous two frames, the other is to do a Kohonen Map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. Use a paricular form of value FA I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsioln-greedy policy with epsilon= 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.”

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschowski, Brock. RSS 20Re10Q

related to

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality but in terms of transition not reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall proportionality doesn’t hold because no amount of pushing will allow the robot to move
  12. They set learning up as an optimization problem that includes loss terms for each of the priors above (just a linear combination)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions which we dont have access to>
  14. Formally they would need to compare all samples, leading to n^2 loss, but they restrict this to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiment are 300 pixels or 100 (not 300×300)
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then to an RL comparison based on a number of different methods for learning the representation (5 features extracted by)
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. screenshot
  25. SFA features are really terrible, their approach is pretty similar to operating from the ground truth
  26. With further investigation on the same results they demonstrate that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimension reduction methods are that they didn’t learn to disregard the distractors

Autonomous Learning of State Representations for Control: An Emerging Field Aims to Autonomously Learn State Representations for Reinforcement Learning Agents from Their Real-World Sensor Observations. Bohmer, Springenberg, Beodecker, Riedmiller, Obermayer. Künstliche Intelligenz 2015.

  1. Trying to do robotics control directly from high-dimensional sensor readings.  The trick is that with robots experience is costly so you need to use data very efficiently <the deepmind guys were able to run each game for the equivalent of over a month because its just in simulation>
  2. Review 2 methods for extracting features: deep auto-encoders and slow feature analysis
  3. Mention deep learning’s deep Q network
  4. Until DQN, using FAs has been largely unsuccessful, and has generally worked only on toy domains <although Tesauro’s TD Gammon is a significant exception>
  5. “In the last years several researchers have also considered end-to-end learning of behavioral policies p, represented by general function approximators. First attempts towards this include an actor-critic formulation termed NFQ-CA [14], in which both the policy p and the Q-function are represented using neural networks and learning proceeds by back-propagating the error from the Q-network directly into the policy network—which was, however, only applied to low dimensional control problems.”
  6. Also cites deterministic policy gradient, which worked on a 30-DOF arm
  7.  Cites deep fitted Q (same lab, papers from a couple of years prior that also was used on slot cars and an inverted pendulum from raw camera images
  8. The first part of deep fitted Q is unsupervised, setup as an autoencoder
  9. In order for the RL to work, it is important that the problem has low “intrinsic dimensionality” (the manifold it lies on)
    1. For example, the image a camera gets from a robotic inverted pendulum can be arbitrarily large, but the underlying dimension of the problem is just 2
  10. They do not assume a pomdp; assume that the high-dimensional input completely describes the problem <although this is almost certainly not actually the case with at least the slot car, but that is not a ding on their work it just makes it more impressive>
  11. The representation chosen for the state depends on an optimization problem
  12. Train an autoencoder to minimize MSE
  13. DFQ is a clustering-based approach<?>
    1. There are some problems with DFQ that they will discuss later, but:
    2. Representation learned by DFQ isn’t reward-based, so representation doesn’t care about that
    3. “learning auto-encoders for inputs with high variability (i.e. many objects of relevance) can be hard” Although says regularization can help this, and cites a number of other papers that deal with this
  14. “Despite these drawbacks DFQ also comes with advantages over end-to-end learning: since the auto-encoder merely learns to reconstruct sampled observations and it can be fed with samples generated by any sampling policy for any task, is thus less susceptible to non-stationarity of the training data.”
  15. Mentions things like proto-value functions, laplacian eigenmaps etc for discrete RL, relationship between SFA and LEM and PVF
  16. “From a theoretical point of view, SFA and PVF both approximate subspace-invariant features. This classification has its origin in the the analysis of approximation errors in linear RL [39]. Here subspace-invariant features induce no errors when the future reward is propagated back in time. It can be shown that under these conditions the least-squares temporal difference algorithm (LSTD, [9]) is equivalent to supervised least-squares regression of the true value function [5]. However, this is only possible for the class of RL-tasks with a self-adjoint transition model. As this class is very rare, both SFA and PVF substitute a self-adjoint approximation of the transition model to compute almost subspace-invariant representations.4 An analysis of the optimal solution5 shows that SFA approximates eigenfunctions of the symmetrized transition operator [50]. Moreover, with a Gaussian prior for the reward, one can show that SFA representations minimize a bound on the expected LSTD error of all tasks in the same environment [5]. However, as the solution depends on the sampling distribution, straight forward application for transfer learning is less obvious than in the case of PVF. Future works may rectify this with some sensible importance sampling, though.”
  17. When the data completely covers the space, PVF and SFA are pretty equivalent, but when data is generated from a random walk, PVF fails and SFA still manages to do almost as well as when there is perfect coverage
  18. In less artificial settings/domains there is less of a performance gap between PFV and SFA but it still exists, SFA seems to do a better job generating features for LSTD and LSPI
  19. In these problems, the representation learned must:
    1. Maintain Markov property (no partial observability)
    2. Be able to represent the value function well enough to bootstrap from it and improve the policy
    3. Generalize well (especially to cases that may not be well covered by the training distribution)
    4. Low dimensional
  20. “In stochastic environments one can only compare probability distributions over future states based on a random policy, called diffusion distances. It can be shown that SFA approximates eigenfunctions of the symmetrized transition operator, which encode diffusion distances [5]. SFA features are therefore a good representation for nonlinear RL algorithms as well.”
    1. <Depends how you define “stochastic environments” I assume they mean environments where there is no choice in action selection at all because otherwise what is written is incorrect>
  21. “In summary, SFA representations X seem in principle the better choice for both linear and non-linear RL: nonlinear SFA extracts eigenfunctions of the transition model Pp, which are the same for every isomorphic observation space Z, encode a diffusion metric that generalizes to states with similar futures and approximates a Fourier basis of the (unknown) underlying state space.”
  22. Empirical work shows SFA works best when exposed to data that gets uniform coverage
  23. “a Fourier basis as approximated by SFA grows exponential in the underlying state dimensionality. Linear algorithms, which depend on this basis to approximate the value function, are therefore restricted to low dimensional problems with few or no variables unrelated to the task. Non-linear RL algorithms, on the other hand, could work in principle well with only the first few SFA features of each state-dimension/variable. e. The order in which these variables are encoded as SFA features, however, depends on the slowness of that variable. This can in practice lead to absurd effects. Take our example of a wheeled robot, living in a naturally lit room. The underlying state space that the robot can control is three-dimensional, but the image will also depend on illumination, that is, the position of the sun.”
    1. Describes this as being vulnerable to slow distractors
  24. “Also, auto-encoders minimize the squared error over all input dimensions of Z equally. This can produce incomplete representations if a robot, for example, combines observations from a camera with measurements from multiple joints. Due to the large number of pixels, small improvements in the reconstruction of the image can outweigh large improvements in the reconstruction of the joint positions.”
  25. Compares results from a deep auto encoder and one with a NN that has objectives of slowness and predictability.  The latter produces a much better output <but this is really a garbage comparison because the prior is based on actual physical experiments and the latter is based on an extremely oversimplified clean simulation.>
  26. “Take the example of uncontrollable distractors like blinking lights or activity outside a window. Each distractor is an independent variable of the isomorphic state S, and to learn an isomorphic representation X requires thus samples from all possible combinations of these variables.  The required training samples grow exponentially in the number of distractors.”
    1. <The XSENS suit has tons of blinking lights on it.  I was thinking about this as a potential distractor for the methods we are working on…>
  27. Moves on to a discussion of “factored representations and symbolic RL” at the end of the paper
  28. Basically discuss object oriented RL (discuss it in terms of relational RL)

Human-level control through deep reinforcement learning. A billion authors. Nature 2014

Yet another version of this paper

  1. animals are able to learn to act by combining RL with hierarchical perception
  2. RL has generally only been effective in settings that are either low-D or require handcrafted representations
  3. Train a deep Q-network
  4. Reached a level of a professional human game tester in 49 games, with no change to hyperparameters
  5. They mention that value function divergence with an NN is problematic, but mitigate the issue by using experience replay to spread samples so it doesnt overfit recent data, as well as by doing only occasional value updates
  6. Used SGD to train
  7. The actual score achieved rises much more slowly (pretty linear over the 200 training epochs) than the believed action value (which rises sharply and then plateaus by about 40 epochs).
  8. Plot embeddings of last layer with t-sne – similar states (from visual level) are nearby, as are states of similar believed value
  9. In some cases it is able to lear non-myopic behavior (such as building a route through the blocks to the top in breakout), but in other games like pac-man or montezumas revenge it isn’t able to do much of anything (monetzumas revenge seems to be legitimately a very difficult game to learn, but pac-man doesn’t seem bad at all)
  10. “Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia”
  11. Network is actually not terribly deep <4 layers?>
  12. Network only sees rewards as -1,0,or 1 which helps keep the derivative of the error in check, but means the agent can’t differentiate between different levels of goodness or badness
  13. epsilon greedy exploration with eps starting at 1 and reaching a minimum of 0.1
  14. Train for 50 million frames/equivalent to 38 days of game playing.  Experience replay of 1 million frames
  15. New actions are selected only every 4 frames
  16. Evaluation was done with eps = 0.05 “This procedure is adopted to minimize the possibility of overfitting during evaluation. “
  17. Use experience replay because “Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically”
  18. Mention uniformly sampling from history for experience replay is probably not efficient and something that prioritizes samples (akin to prioritized sweeping) is probably a better idea
  19. They also do something with cloning networks, looks like they actively update one but its based on the error that is generated from a network that is only updated periodically <?>

Efficient Backprop. LeCun, Bottou, Orr, Muller. Neural Networks: Tricks of the Trade. 1998Q

  1. Backprop can be slow for multilayered nets that are non-convex, high-D, very bumpy, or have many plateaus.  It may not even converge
  2. Instead of doing all dataset in batch, noise introduced by following the gradient of individual samples can be helpful
    1. Tends to be faster (especially cases where there is redundancy in the data)
    2. Converges to better results (noise in samples jumps out of local minima)
    3. Can be used for tracking <nonstationary data>
    4. Better suited to large datasets
  3. Advantages for batch:
    1. Mathematical properties of convergence better understood (stochasticity from individual samples has less of an impact when all the data is considered together, so convergence is easier to analyze)
    2. Some methods (such as conj gradient) only work in batch
  4. Annealing schedule for learning rate helps convergence, but is an additional parameter that has to be controlled
  5. Can try and get the best of both worlds by doing mini batches
    1. Also, trying to eliminate the impact of noise on the final setting of the weights probably isn’t that important; overtraining becomes more of a concern before then
  6. Methods of bagging or shuffling the data help when not running in batch mode.  It is important to keep the samples spread around the range of inputs to prevent overfitting a small region of the input space if that is where samples become concentrated from in some part of the corpus
    1. Proposes a informal boosting method, although this can lead to overfitting the hard-to-fit samples
  7. Inputs should be normalized (average value of each input should be zero), should also have same covariance (whitening)
    1. The exception is if some samples are known to be less important covariance of those samples can be reduced to downweight
  8. Sigmoids are good non linear functions because they serve to keep weights normalized as they propagate through the network because they are more likely to produce values that are on average 0
  9. Tanh better than logistic (as logistic isn’t zero-centered), although now simple relus are preferred
    1. Give recommended constants to be used with the tanh s.t. weights are also likely to have variance of 1 as it goes through the network
  10. One problem with symmetric sigmoids is that error surface can be very flat near the origin.  Therefore it isn’t good to initialize with very small weights.  This is also true far from the origin – adding a small linear term can help get rid of the plateaus
  11. For classification, it isn’t good to use binary target values (+/- 1) as it leads to instabilities.
    1. Because the sigmoids only have these values asymptotically, using these goal outputs will make the weights grow larger and larger.  This then makes the gradients larger
    2. Also, when the outputs end up producing values close to +/-1, there is no measure of confidence
    3. Target values should be at the max of the second derivative of the sigmoid
  12. Intermediate weights should be used for initialization, such that the sigmoid is activated in it its linear region
    1. If weights are too small the gradients will also be too small
  13. Getting everything right requires that the training set be normalized, the sigmoid be chosen properly, and that weights be set correctly
    1. An equation is given for the standard deviation the weights should be set to
  14. Another issue is choosing the learning rates
    1. Many adaptive methods for setting the learning rates only work in batch mode, because in the online case things jump around constantly
    2. This is discussed more later in the paper but one idea is to use independent update values for each weight so that they converge more or less at the same rate (one way to do this is by looking at second derivatives)
    3. Learning rates at lower layers should be larger than those at higher layers because the second derivative of the cost function wrt weights in lower layers is usually smaller than those in the higher layers
    4. For conv. nets, learning rate should be proportional to sqrt(# connections sharing weight)
  15. Mentions one possible rule for adaptive learning rates, but the idea is that it is large away from the optimum and becomes small near it
  16. Some theory about learning rates – it can be computed exactly if the shape can be approximated by a quadratic.  If it isn’t exactly quadratic you can use the rules as an approximation but then need to iterate on it
  17. The hessian is a measure of curvature of the error surface.
  18.  There is an equation involving the learning rate and the hessian, which, if it always shrinks a vector (all eigenvalues < 1) the update equation will converge
    1. Goal then is to have different learning rates across different eigendirections, based on its eigenvalue
    2. If weights are coupled, H must first be rotated s.t. it is diagonal (making the coord axes line up w/ eigendirections)
  19. Now going back to justifications for some tricks
  20. Subtract means from input vars because a nonzero mean makes a very large eigenvalue, which makes convergence very slow.  Likewise, data that isn’t normalized will slow learning as well
  21. Decorrelating the input variables make the method of different learning rates per weight optimal
  22. Whitening the signal can help make the energy surface at the optima spherical which helps convergence
  23. Talk about newton updates <but I think this isnt used in practice? so im not taking notes>
  24. Newton and gradient descent are different, but if you whiten they are the same <?>
  25. Conj Gradient:
    1. O(N)
    2. Doesn’t use explicit Hessian
    3. Tries “…to find descent directions that try to minimally spoil the result achieved in the previous iterations
    4. Uses line search <?>.  Ah, given a descent direction (the gradient), just minimize along this
    5. Only batch
    6. conj directions are orthogonal in space of identity hessian matrix
    7. Good line search method is critical
    8. Can be good for momentum
  26. Quasi-Newton (BFGS)
    1. Iteratively computes estimate of inverse Hessian
    2. O(N^2) – memory as well, so only applicable to small networks <this is what Hessian free fixes right?>
    3. Reqs line search
    4. Batch
  27. <I think much of the rest of the paper is less relevant now than when it was written back then so skimming>
  28. Tricks for computing the Hessian
  29. Large eigenvalues in the Hessian cause problems during training because:
    1. Non-zero mean inputs
    2. Wide variation of 2nd derivatives
    3. Correlation between state vars
  30. Also between layers, the Hessian at first layer is pretty flat but becomes pretty steep by the last layer
  31. “From our experience we know that a carefully tuned stochastic gradient descent is hard to beat on large classification problems.”
  32. There are methods for estimating the principal eigenvalues/vectors of the Hessian w/o actually computing the Hessian

MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation. Jonathan, Tompson, LeCun, Bregler. Arxiv 2014

  1. A system for pulling out pose estimation from videos using conv nets – including color and motion features
  2. They propose a new body pose dataset, and their results tests as better than state of the art
  3.  Traditionally, posture estimation has relied on hand coded features like HoG (histogram of gradients), and not motion-based features.  On the other hand, psychophysical experiments show that to people, motion is a powerful cue that by itself can be used to extract a great deal of information including pose
  4. Previous studies involving the use of motion data had negative results, leading to no real improvement in actual performance, and in some cases, intractable inference problems.
    1. Here it is shown that deep learning can take advantage of motion information.  In fact, with their approach, motion data alone outperforms a number of algorithms, showing that there is indeed valuable information in motion data
  5. Contributions:
    1. An algorithm that incorporates motion features and outperforms state of the art for ‘in-the-wild’ data
    2. Algorithm is efficient and is almost real time
  6. Hogg (different from HoG) in 83 was one of the first systems for motion tracking, they often worked from an explicit geometric model and required initialization and then incrementally updated the pose information
  7. Later on, systems without explicit geometrical models were introduced, generally relying on “bags of features” (SIFT, STIP, HoG, HoF)
  8. Most state of the art is based on a combination of HoG and “Deformable Part Models” (DPM)
  9. Previous applications of deep learning to pose recognition lead to better than state of the art performance
  10. Input to their convnet is a rgb image along with a set of motion features
  11. Two broad categories of motion data:
    1. Simple derivatives of RGB video frames
    2. Optical flow features
  12. The simple derivatives are not great, and is high-dimensional data.  It would be hard to get a network to do optical flow, so they compute optical flow separately as a preprocessing step
    1. They mention later that this is a nontrivial amount of information, so it could be a big help to an algorithm, although other algorithms havent been able to take advantage of it, and even just this information alone in their system leads to good performance
  13. Convnet is based on “sliding patches”
  14. <skipping details of arch and optimization, can come back to it if necessary>
  15. Designed only to identify one skeleton on screen, center of torso is marked, which allows for constraints on the rest of skeleton to be used
  16. Training on 4k training, 1k test images takes 12 hours, a forward pass through takes 50ms
  17. They show examples where use of motion data leads to correct classification, but ignoring it leads to errors
    1. Especially in the case when there is a cluttered background
  18. <Seems like they just do head and arms? Torso already given…>
  19. System is pretty robust to range of parameters for optical flow, and removal of camera motion compensation doesn’t change performance much either
  20. Their results really really beat up on other state of the art in their data set
    1. Even motion features alone beat them, but if you want exact results the RGB information is necessary as well

Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy, Fei-Fei. Tech Report 2014

  1. Model generates textual description of natural images
  2. Trained from a corpus of images with included textual descriptions
  3. “Our approach is based on a novel combination of Convolutional Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.  We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”
  4. Previous work in the area revolves around fairly constrained types of descriptions
    1. Discusses related work at length
  5. Use verbal descriptions of images as “weak labels in which contiguous segments of words correspond to some particular, but unknown location in the image.  Our approach is to infer these alignments and use them to learn a generative model of descriptions.”
  6.  “…our work takes advantage of pretrained word vectors […] to obtain low-dimensional representations of words.  Finally, Recurrent Neural Networks have been previously used in language modeling […], but we additionally condition these models on images.”
  7. Training happens on images with coupled text description.
    1. They have a model that aligns sentence segments to image segments through a “multimodal embedding”
    2. Then these correspondences are fed into their multimodal RNN which learns to generate descriptions
  8. Use bidirectional RNN to compute sentence word representation, which means dependency trees aren’t needed “and allowing unbounded interactions of words and their context in the sentence.”
  9. Use a pretrained Region Convolutional Neural Network to pull out both what and where of the image
  10. They embed words in the same dimensional representation that image regions have, they do this by taking each word surrounded by a window and transforming that into a vector of equal size
  11. Screen Shot 2014-11-24 at 11.34.13 AM
  12. Hidden representation they use is on the order of hundreds of dimensions
  13. Use Markov Random Fields to enforce that adjacent words are found to correspond to similar areas in the image
  14. RNN is trained to combine the following to predict the next word:
    1. Word (initialized to “the”)
    2. Previous hidden state (initialized to 0)
    3. Image information
  15. Optimized with SGD, 100-size minibatch, momentum, dropout
  16. RNN is difficult to optimize because of rate of occurrence of rare vs common words
  17. Beat the competitor approaches on all datasets tested (not by enormous margins, but it wins)
  18. They can accurately deal with making sentences that involve even rare items such as “accordion”, which most other models would miss
  19. <The results they show are pretty amazing, although the forms of the sentences are pretty uniform and simplistic subject-verb-noun>
  20. “going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.”
Tagged , ,

Large-scale Video Classification with Convolutional Neural Networks. Karpathy, Toderici, Shetty, Leung, Sukthankar, Fei-Fei

  1. Convolutional NN on classification of 1 million youtube videos
  2. Propose a “multilevel, foveated architecture” to speed training
  3. Created a new dataset for this paper: 1-million sports videos belonging to 487 classes of sports (about 1k-3k vids/class)
  4. They are interested in video (so need to factor in temporal information, unlike more common image classification)
  5. Because the networks for learning video have to be so huge, propose a 2-part system to make learning tractable, which involves a context stream on low-res and fovea stream on high-res.  This way is about 3x as fast as naive approach and as accurate
  6. They also use the features learned from this dataset to then improve performance on a smaller corpus (going from 41.3% to 65.4%)
  7. A common approach to video classification involves processing (into something analogous to a bag of words model) and then throwing into a SVM
  8. Previous activity recognition benchmarks are small in terms of sample size, which don’t work well with NNs, so they propose a huge data set
  9. Unlike images that can somewhat easily be rescaled to the same format, there is more variability in video, including the addition of temporal length
  10. Here they treat videos as a bag of short fixed-length clips
  11. They considered at least 4 different classes of topologies for the task, all fundamentally different (structurally, not in terms of how/where pooling is done for example)
    1. Basically, this comes down to where you deal with the temporal aspect, is it at the input layer, or is the input processed as a still and then merged with temporal data further up?
  12. Different architectures
    1. Single-Frame: just process stills alone
    2. Early-Fusion: have the first convolutional layer extend back some number of frames in the past
    3. Late-Fusion: Starts with 2 single-frame networks, (for past 2 frames) and merges their results in “the first fully connected layer”
    4. Slow-Fusion: A more graduated version of merging data (lower levels have less temporal data, higher levels have more)
  13. In order to speed training they tried to reduce the number of weights, but it made classification worse.  Making images lower res sped things up too but also made perf worse.  Then they did the 2-part system (context, fovea)
    1. Context is 178 x 178 but whole pic
    2. Fovea is 89 x 89 but centered with original resolution
  14. Optimize NN with “Downpour Stochastic Gradient Descent”
  15. Use data augmenting to prevent overfitting
  16. Training took a month, although they say their perf isn’t optimized and could run better on GPUs
  17. They also tried to run the features generated by their NN through linear classifiers, but letting the NN do the classification worked better
  18. Lots of incorrectly labeled videos, still performance is said to be good
  19. Slow fusion works the best (although difference between others isn’t enormous)
  20. Camera motion can mess up the classifiers
  21. Errors tend to be in related areas (ex hiking vs backpacking)
  22. Then learning transfer to smaller UCF-101 database
    1. Worked best when retraining the top 3 layers of the net
Tagged ,