Probabilistic machine learning and artificial intelligence. Zoubin Ghahramani. Nature 2015

  1. “discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.”
  2. Given some observed data, there can be many (often infinite) models consistent with the data.  Uncertainty comes up in terms of making the model, and then having the model produce predictions.  “Probability theory provides a framework for modelling uncertainty”
  3. “…the scope of machine-learning tasks is even broader than these pattern classification or mapping tasks, and can include optimization and decision making, compressing data and automatically extracting interpretable models from data.”
  4. “Since any sensible model will be uncertain when predicting unobserved data, uncertainty plays a fundamental part in modelling.”
  5. “There are many forms of uncertainty in modelling. At the lowest level, model uncertainty is introduced from measurement noise, for example, pixel noise or blur in images. At higher levels, a model may have many parameters, such as the coefficients of a linear regression, and there is uncertainty about which values of these parameters will be good at predicting new data. Finally, at the highest levels, there is often uncertainty about even the general structure of the model: is linear regression or a neural network appropriate, if the latter, how many layers should it have, and so on. The probabilistic approach to modelling uses probability theory to express all forms of uncertainty9. Probability theory is the mathematical language for representing and manipulating uncertainty10, in much the same way as calculus is the language for representing and manipulating rates of change.”
  6. Learning occurs by taking a prior, adding data, and producing a posterior
  7. Probability theory is composed of the product rule and sum rule
  8. “The dominant paradigm in machine learning over the past two decades for representing such compositional probabilistic models has been graphical models11, with variants including directed graphs (also known as Bayesian networks and belief networks), undirected graphs (also known as Markov networks and random fields), and mixed graphs with both directed and undirected edges (Fig. 1). As discussed later, probabilistic programming offers an elegant way of generalizing graphical models, allowing a much richer representation of models. The compositionality of probabilistic models means that the behaviour of these building blocks in the context of the larger model is often much easier to understand than, say, what will happen if one couples a non-linear dynamical system (for example, a recurrent neural network) to another.”
  9. Computationally, a problem in many models is the integration required to sum out variables not of interest.  In many cases, there is no poly-time algorithm.
    1. Approximation is possible, however, by MCMC and sequential Monte Carlo
  10. “[I]t is worth noting that computational techniques are one area in which Bayesian machine learning differs from much of the rest of machine learning: for Bayesian researchers the main computational problem is integration, whereas for much of the rest of the community the focus is on optimization of model parameters. However, this dichotomy is not as stark as it appears: many gradient-based optimization methods can be turned into integration methods through the use of Langevin and Hamiltonian Monte Carlo methods27, 28, while integration problems can be turned into optimization problems through the use of variational approximations24.”
  11. To build flexible models, you can allow for many parameters (such as what is used in large-scale neural networks), or you can use models that are non-parametric
    1. The former has fixed complexity; the latter has complexity that grows with data size (“either by considering a nested sequence of parametric models with increasing numbers of parameters or by starting out with a model with infinitely many parameters.”)
  12. “Many non-parametric models can be derived starting from a parametric model and considering what happens as the model grows to the limit of infinitely many parameters.”
  13. Models with infinitely many parameters would usually overfit, but Bayesian methods don’t do this because they average over parameters instead of fitting them
  14. Quick discussion of some Bayesian non-parametrics
    1. Gaussian Processes (cites “GaussianFace” a state of the art application to face recognition that beats humans and deep learning)
    2. Dirichlet Processes (can be used for time-series)
    3. “The IBP [Indian Buffet Process] can be thought of as a way of endowing Bayesian non-parametric models with ‘distributed representations’, as popularized in the neural network literature.”
  15. “An interesting link between Bayesian non-parametrics and neural networks is that, under fairly general conditions, a neural network with infinitely many hidden units is equivalent to a Gaussian process.”
  16. “Note that the above non-parametric components should be thought of again as building blocks, which can be composed into more complex models as described earlier. The next section describes an even more powerful way of composing models — through probabilistic programming.”
  17. Talks about probabilistic programming (like CHURCH)
    1. Very flexible <but computationally very expensive, often built on mcmc>
  18. Bayesian Optimization (like GP-UCB)
  19. Compression
  20. Compression and probabilistic modelling are really the same thing (Shannon)
    1. Better model allows more compression
  21. Best compression algorithms are equivalent to Bayesian nonparametric methods
  22. Bayesian methods to make an “automatic statistician” (scientific model discovery)
  23. Central challenge in the field is addressing the computational complexity, although “Modern inference methods have made it possible to scale to millions of data points, making probabilistic methods computationally competitive with conventional methods”
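
The prior-plus-data-to-posterior update in point 6 is easy to make concrete with a conjugate toy example (my own sketch, not from the paper): a Beta prior over a coin's bias, updated with observed counts.

```python
# Hypothetical coin-flip example: Beta(a, b) prior over the coin's bias
# theta, updated with observed heads/tails counts.  With a conjugate
# prior, the "prior + data -> posterior" step is just two additions.
def update_beta(a, b, heads, tails):
    """Posterior Beta parameters after observing the data."""
    return a + heads, b + tails

a, b = update_beta(1, 1, heads=7, tails=3)   # uniform Beta(1,1) prior
posterior_mean = a / (a + b)                  # (1+7) / (1+1+7+3) = 2/3
print(a, b, posterior_mean)
```

The same product-rule/sum-rule machinery from point 7 is what makes this update Bayes' rule in disguise: posterior ∝ likelihood × prior.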
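
Point 10's integration problem is usually attacked by sampling. A minimal Metropolis sampler (my sketch; the paper only names MCMC in general) approximates a posterior expectation by averaging samples:

```python
import math, random

random.seed(0)

# Toy unnormalised log-posterior: a Gaussian centred at theta = 2.
def log_p(theta):
    return -(theta - 2.0) ** 2 / 2.0

def metropolis(n_samples, step=1.0):
    theta, samples = 0.0, []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        # Accept with probability min(1, p(proposal)/p(theta)).
        if math.log(random.random()) < log_p(proposal) - log_p(theta):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis(20000)
est_mean = sum(samples[5000:]) / len(samples[5000:])  # burn-in discarded
print(est_mean)  # close to the true posterior mean of 2.0
```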
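
The Gaussian processes of points 14–15 reduce, for regression, to a few lines of linear algebra. A toy NumPy sketch (RBF kernel and data points are my own, not from the paper):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x_train = np.array([-2.0, 0.0, 1.0, 3.0])
y_train = np.sin(x_train)
noise = 1e-6                                  # jitter for stability

K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
x_test = np.array([0.5])
k_star = rbf(x_test, x_train)

# GP posterior mean: k(x*, X) K^{-1} y
mean = k_star @ np.linalg.solve(K, y_train)
print(mean)  # a value near sin(0.5) ~ 0.48
```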
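
The Shannon link in points 20–21 is just "probability = code length": an optimal code spends -log2(p) bits on an outcome of probability p, so a model that assigns the data higher probability compresses it better. A toy illustration (my numbers):

```python
import math

def code_length_bits(probs):
    """Total optimal code length, in bits, for a sequence of event probabilities."""
    return sum(-math.log2(p) for p in probs)

# Toy "text" where 'a' is common; the better model assigns it more mass.
data = "aaab"
good_model = {"a": 0.75, "b": 0.25}
bad_model = {"a": 0.5, "b": 0.5}

good_bits = code_length_bits(good_model[c] for c in data)
bad_bits = code_length_bits(bad_model[c] for c in data)
print(good_bits, bad_bits)  # ~3.25 bits vs exactly 4 bits
```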

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Chung, Gulcehre, Cho, Bengio. Arxiv 2014

  1. Compares different types of recurrent units in RNNs
  2. Compares LSTMs and newer Gated Recurrent Unit (GRU)
  3. Test on music and speech signal modelling
  4. These units are found to be better than “more traditional recurrent units such as tanh units”
  5. GRU is found to be comparable to LSTM, although GRU is a bit better
  6. Of all the impressive recent work on RNNs (including everything that works off of variable-size inputs), none of it uses vanilla RNNs
  7. Vanilla RNNs are hard to use because of both exploding and vanishing gradients
    1. Discussed many of the points related to this here
  8. GRUs are somewhat similar to LSTM although the model is a bit simpler
    1. Both can capture long-term dependencies
  9. GRU doesn’t have a separate memory cell like LSTM does
    1. Doesn’t have a mechanism to protect memory like LSTM
  11. Calculating the activation with GRU is simpler as well
  12. Both LSTM and GRUs compute deltas as opposed to completely recomputing values at each step
  13. “This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is. Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing (if the gating unit is nearly saturated at 1) as a result of passing through multiple, bounded nonlinearities,”
  14. “Another difference is in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).”
  15. When comparing gated to vanilla RNN “Convergence is often faster, and the final solutions tend to be better. “
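
The GRU mechanics in points 8–14 can be written out as a single step in NumPy (standard GRU equations, not code from the paper); note the additive interpolation that produces the shortcut paths quoted in point 13:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate activation
    # Additive interpolation: the old state is not overwritten, just mixed in.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = [rng.normal(0, 0.1, (n_hid, d)) for d in (n_in, n_hid) * 3]
h = gru_step(rng.normal(size=n_in), np.zeros(n_hid), *params)
print(h.shape)  # (3,)
```

Unlike the LSTM there is no separate memory cell: the update gate z ties together how much of the candidate is added and how much of the old state is kept.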

An Empirical Exploration of Recurrent Network Architectures. Jozefowicz, Zaremba, Sutskever. ICML 2015

  1. Vanilla RNNs are usually difficult to train.  LSTMs are a form of RNN that is easier to train
  2. LSTMs though, have arch that “appears to be ad-hoc so it is not clear if it is optimal, and the significance of its individual components is unclear.”
  3. Tested thousands of different models with different architectures based on LSTM, and also compared new Gated Recurrent Units
  4. “We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU.”
  5. RNNs suffer from exploding/vanishing gradients (the latter was addressed successfully in LSTMs)
    1. There are many other ways to work on the vanishing gradient, such as regularization, second-order optimization, “giving up on learning the recurrent weights altogether”, as well as careful weight initialization
  6. Exploding gradients were easier to address with “a hard constraint over the norm of the gradient”
    1. Later referred to as “gradient clipping”
  7. “We discovered that the input gate is important, that the output gate is unimportant, and that the forget gate is extremely significant on all problems except language modelling. This is consistent with Mikolov et al. (2014), who showed that a standard RNN with a hard-coded integrator unit (similar to an LSTM without a forget gate) can match the LSTM on language modeling.”
  8. exploding/vanishing gradients “are caused by the RNN’s iterative nature, whose gradient is essentially equal to the recurrent weight matrix raised to a high power. These iterated matrix powers cause the gradient to grow or to shrink at a rate that is exponential in the number of timesteps.”
  9. Vanishing gradient issue in RNNs make it easy to learn short-term interactions but not long-term
  10. Through reparameterizing, LSTM cannot have a gradient that vanishes
  11. Basically, instead of recomputing the state from the state at the previous step, it only computes a delta which is added to the previous cell state
    1. The network has additional machinery to do so
    2. Many LSTM variants
  12. Random initialization of the forget gate bias will leave the gate at some fractional value, which reintroduces a vanishing gradient.
    1. This is commonly ignored, but initializing the bias to a “large value” such as 1 or 2 will prevent the gradient from vanishing over time
  13. Use genetic algorithms to optimize architecture and hyperparams
  14. Evaluated 10,000 architectures; 1,000 made it past the first task (which would allow them to compete genetically).  Total of 230,000 hyperparameter configs tested
  15. Three problems tested:
    1. Arithmetic: read in a string containing numbers with an add or subtract symbol, then the network has to output the result.  There are distractor symbols in the string that need to be ignored
    2. Completion of a random XML dataset
    3. Penn Tree-Bank (word level modelling)
    4. Then there was an extra task to test generalization <validation?>
  16. “Unrolled” RNNs for 35 timesteps, minibatch of size 20
  17. Had a schedule for adjusting the learning rate once learning stopped on the initial value
    1. <nightmare>
  18. “Though there were architectures that outperformed the LSTM on some problems, we were unable to find an architecture that consistently beat the LSTM and the GRU in all experimental conditions.”
  19. “Importantly, adding a bias of size 1 significantly improved the performance of the LSTM on tasks where it fell behind the GRU and MUT1. Thus we recommend adding a bias of 1 to the forget gate of every LSTM in every application”
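
Point 8's claim about iterated matrix powers is easy to demonstrate (my toy NumPy sketch): the back-propagated gradient norm scales like the recurrent matrix's spectral radius raised to the number of timesteps.

```python
import numpy as np

def gradient_norm_after(W, timesteps):
    """Norm of a gradient vector after repeated multiplication by W^T."""
    g = np.ones(W.shape[0])
    for _ in range(timesteps):
        g = W.T @ g
    return np.linalg.norm(g)

small = 0.5 * np.eye(2)   # spectral radius 0.5 -> vanishing
large = 1.5 * np.eye(2)   # spectral radius 1.5 -> exploding
print(gradient_norm_after(small, 20))  # ~ 0.5**20, essentially zero
print(gradient_norm_after(large, 20))  # ~ 1.5**20, very large
```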
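
The hard norm constraint from point 6 ("gradient clipping") in NumPy, as a sketch of the usual global-norm formulation (the exact threshold and formulation here are my assumption, not the paper's precise recipe):

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]            # global norm 5
clipped = clip_by_norm(grads, max_norm=1.0)
print(clipped[0])                          # rescaled to norm 1
```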
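
Why the forget-gate bias of points 12 and 19 matters at initialization (my arithmetic, assuming a sigmoid gate with near-zero pre-activations): the gate's value compounds over timesteps.

```python
import math

def forget_gate_at_init(bias):
    """Sigmoid gate value when weights contribute ~0 at initialization."""
    return 1.0 / (1.0 + math.exp(-bias))

# Fraction of the cell state surviving 20 steps for different biases:
# bias 0 keeps only 0.5**20 of it, bias 1 or 2 keeps vastly more.
for bias in (0.0, 1.0, 2.0):
    kept_after_20_steps = forget_gate_at_init(bias) ** 20
    print(bias, forget_gate_at_init(bias), kept_after_20_steps)
```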

Two-Stream Convolutional Networks for Action Recognition in Videos. Simonyan, Zisserman. Arxiv 2014

  1. Doing action recognition in video
  2. “Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both.”
  3. “Our proposed architecture is related to the two-streams hypothesis [9], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognises motion); “
  4. “Video can naturally be decomposed into spatial and temporal components. The spatial part, in the form of individual frame appearance, carries information about scenes and objects depicted in the video. The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects… Each stream is implemented using a deep ConvNet, softmax scores of which are combined by late fusion. We consider two fusion methods: averaging and training a multi-class linear SVM”
    1. The two streams are only combined at the very end (7 layers+softmax each)
  5. “…, the input to our model is formed by stacking optical flow displacement fields between several consecutive frames. Such input explicitly describes the motion between video frames, which makes the recognition easier, as the network does not need to estimate motion implicitly.”
  6. They simply stack the displacement vector fields + throw into a convnet
  7. Another method is to use optical flow instead of displacement vector fields <don’t know what the difference is>
  8. In order to zero-center data, “The importance of camera motion compensation has been previously highlighted in [10, 26], where a global motion component was estimated and subtracted from the dense flow. In our case, we consider a simpler approach: from each displacement field d we subtract its mean vector.”
  9. Looks like they make predictions based on a frame in the middle, with some frames before and some after
  10. The spatial convnet is trained only on stills
  11. “A more principled way of combining several datasets is based on multi-task learning [5]. Its aim is to learn a (video) representation, which is applicable not only to the task in question (such as HMDB-51 classification), but also to other tasks (e.g. UCF-101 classification). Additional tasks act as a regulariser, and allow for the exploitation of additional training data. In our case, a ConvNet architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes HMDB-51 classification scores, the other one – the UCF-101 scores. Each of the layers is equipped with its own loss function, which operates only on the videos, coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks’ losses, and the network weight derivatives can be found by back-propagation”
  12. ” The only difference between spatial and temporal ConvNet configurations is that we removed the second normalisation layer from the latter to reduce memory consumption.”
  13. “The network weights are learnt using the mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of 256 samples is constructed by sampling 256 training videos (uniformly across the classes), from each of which a single frame is randomly selected”
  14. “Our implementation is derived from the publicly available Caffe toolbox [13], but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. We exploit the data parallelism, and split each SGD batch across several GPUs. Training a single temporal ConvNet takes 1 day on a system with 4 NVIDIA Titan cards, which constitutes a 3.2 times speed-up over single-GPU training.”
    1. <nice>
  15. “Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox. “
  16. Pre-training makes a big difference
  17. “The difference between different stacking techniques is marginal; it turns out that optical flow stacking performs better than trajectory stacking, and using the bi-directional optical flow is only slightly better than a uni-directional forward flow. Finally, we note that temporal ConvNets significantly outperform the spatial ConvNets (Table 1a), which confirms the importance of motion information for action recognition.”
  18. Performance is comparable to hand-designed state of the art
  19. “We proposed a deep video classification model with competitive performance, which incorporates separate spatial and temporal recognition streams based on ConvNets. Currently it appears that training a temporal ConvNet on optical flow (as here) is significantly better than training on raw stacked frames [14]. The latter is probably too challenging, and might require architectural changes (…). Despite using optical flow as input, our temporal model does not require significant hand-crafting, since the flow is computed using a method based on the generic assumptions of constancy and smoothness.”
  20. “There still remain some essential ingredients of the state-of-the-art shallow representation [26], which are missed in our current architecture. The most prominent one is local feature pooling over spatio-temporal tubes, centered at the trajectories. Even though the input (2) captures the optical flow along the trajectories, the spatial pooling in our network does not take the trajectories into account. Another potential area of improvement is explicit handling of camera motion, which in our case is compensated by mean displacement subtraction.”
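
The temporal-stream input of points 5–8 can be sketched in NumPy (shapes are mine for illustration): stack L displacement fields into a 2L-channel volume and subtract each field's mean vector as the simple camera-motion compensation from point 8.

```python
import numpy as np

def make_temporal_input(flows):
    """flows: list of (H, W, 2) displacement fields -> (H, W, 2L) volume."""
    centered = [f - f.mean(axis=(0, 1), keepdims=True) for f in flows]
    return np.concatenate(centered, axis=2)

rng = np.random.default_rng(0)
L, H, W = 10, 224, 224
# Fake flow fields with a constant offset standing in for camera motion.
flows = [rng.normal(size=(H, W, 2)) + 3.0 for _ in range(L)]
x = make_temporal_input(flows)
print(x.shape)   # (224, 224, 20) -- 2L channels fed to the temporal ConvNet
```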

Video (Language) Modeling: A Baseline for Generative Models of Natural Videos. Ranzato, Szlam, Bruna, Mathieu, Collobert, Chopra. Arxiv 2014

  1. “We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary.”
  2. “The biggest hurdle to overcome when learning without supervision is the design of an objective function that encourages the system to discover meaningful regularities. One popular objective is squared Euclidean distance between the input and its reconstruction from some extracted features. Unfortunately, the squared Euclidean distance in pixel space is not a good metric, since it is not stable to small image deformations, and responds to uncertainty with linear blurring. Another popular objective is log-likelihood, reducing unsupervised learning to a density estimation problem. However, estimating densities in very high dimensional spaces can be difficult, particularly the distribution of natural images which is highly concentrated and multimodal.”
  3. There has been previous work on generative models for images, but they have been small in scale
  4. “…spatial-temporal correlations can provide powerful information about how objects deform, about occlusion, object boundaries, depth, and so on”
  5. “The only assumption that we make is local spatial and temporal stationarity of the input (in other words, we replicate the model and share parameters both across space and time),”
  6. Works based on 1-hot representation of video and recurrent nn <?>
  7. They use a dictionary/classification-based approach because they found that doing regression just led to blurring of the frame when prediction was done
    1. Each image patch is unique, so some bucketing scheme must be used – they use k-means
  8. “This sparsity enforces strong constraints on what is a feasible reconstruction, as the k-means atoms “parameterize” the space of outputs. The prediction problem is then simpler because the video model does not have to parameterize the output space; it only has to decide where in the output space the next prediction should go.”
  9. “There is clearly a trade-off between quantization error and temporal prediction error. The larger the quantization error (the fewer the number of centroids), the easier it will be to predict the codes for the next frame, and vice versa. In this work, we quantize small gray-scale 8×8 patches using 10,000 centroids constructed via k-means, and represent an image as a 2d array indexing the centroids.”
  10. Use recurrent convnet
  11. “In the recurrent convolutional neural network (rCNN) we therefore feed the system with not only a single patch, but also with the nearby patches. The model will not only leverage temporal dependencies but also spatial correlations to more accurately predict the central patch at the next time step”
  12. “To avoid border effects in the recurrent code (which could propagate in time with deleterious effects), the transformation between the recurrent code at one time step and the next one is performed by using 1×1 convolutional filters (effectively, by using a fully connected layer which is shared across all spatial locations).”
  13. ” First, we do not pre-process the data in any way except for gray-scale conversion and division by the standard deviation.”
  14. “This dataset [UCF-101] is by no means ideal for learning motion patterns either, since many videos exhibit jpeg artifacts and duplicate frames due to compression, which further complicate learning.”
  15. “Generally speaking, the model is good at predicting motion of fairly fast moving objects of large size, but it has trouble completing videos with small or slowly moving objects.”
  17. Optical flow based methods produce results that are less blurred but more distorted
  18. The method can also be used for filling in frames
  19. Discusses future work:
    1. Multi-scale prediction
    2. Multi-step prediction
    3. Regression
    4. Hard coding features for motion
  20. “This model shows that it is possible to learn the local spatio-temporal geometry of videos purely from data, without relying on explicit modeling of transformations. The temporal recurrence and spatial convolutions are key to regularize the estimation by indirectly assuming stationarity and locality. However, much is left to be understood. First, we have shown generation results that are valid only for short temporal intervals, after which long range interactions are lost.”
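
The patch-quantization scheme of points 7–9, sketched with toy centroids (the paper uses 10,000 k-means centroids on 8×8 gray-scale patches; the sizes here are mine):

```python
import numpy as np

def quantize_patches(patches, centroids):
    """patches: (N, 64), centroids: (K, 64) -> (N,) nearest-centroid indices."""
    d = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
centroids = rng.normal(size=(16, 64))                    # K = 16 toy atoms
# Fake patches: noisy copies of centroids 3, 7 and 7.
patches = centroids[[3, 7, 7]] + 0.01 * rng.normal(size=(3, 64))
codes = quantize_patches(patches, centroids)
print(codes)  # each patch snaps to its nearest atom: indices 3, 7, 7
```

A frame then becomes a 2-D array of such indices, so "predicting the next frame" becomes a classification problem over the dictionary, exactly the trade-off discussed in point 9.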

Unsupervised Learning of Video Representations using LSTMs. Srivastava, Mansimov, Salakhutdinov. Arxiv 2015

  1. LSTMs to learn representations of video sequences
  2. Maps an input video sequence to a fixed-length representation
  3. This representation is then used to do other tasks
  4. Experiment with inputs of pixel patches as well as “high-level” “percepts”
    1. <Definition of latter isn’t clear from abstract but sure it will be explained>
  5. Unsupervised setting
  6. ” The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state.”
  7. Can use NN to do autoencoder, or prediction, or both
  8. Previous work on generative models of video showed that a squared-error loss function is problematic and instead resorts to a dictionary-based method.
    1. It’s very hard to make an appropriate and effective loss function for video
    2. Here they simply go with squared error
  9. “The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don’t vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover longrange features.”
  10. For unsupervised learning, they use two LSTM networks – one for encoding, another for decoding
  11. “The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.”
  12. “The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video.”
  13. This is a similar setup as is used to develop representations of word meanings
  14. Also a previous paper that does video prediction (in the to-read list)
  15. Prediction can either be conditioned on the previously predicted output or not:
    1. “There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.”
  16. In the “composite model” the same LSTM representation is passed into two different LSTM decoders, one that does prediction, while the other does reconstruction of the input sequence
  17. <Experiments>
  18. Use UCF-101 and HMDB-51 datasets which each have several thousand videos of several seconds each
  19. They also subsample from a previously existing sports video dataset
    1. Sample down to 300 hours in clips that are 10 seconds each
  20. Unsupervised training over this YouTube set was good enough; using the other two sets in addition didn’t really change performance
  21. Whole video was 240×320, they look at just the center 224×224, 30fps
  22. “All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peep-hole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. “
  23. First set of experiments is on moving MNIST digits.  Sequences were 20 frames long and were in a 64×64 patch
    1. Positions are randomly initialized as well as velocities; they bounce off walls
  24. LSTM has 2048 units
  25. Look at 10 frames at a time, try and reconstruct that 10, predict next 10
  26. “We used logistic output units with a cross entropy loss function.”
  27. “It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls”
  28. Two layer LSTM works better than 1, and using previously predicted outputs to predict future outputs also helped
  29. Next they worked on 32×32 image patches from one of the real-life video datasets
    1. Linear output units, squared error loss
    2. Input 16 frames, reconstruct last 16 and predict next 13
  30. Show outputs for 2048 and 4096 units, claim latter is better <to my eye they look essentially identical>
    1. Next they test for generalization at different time scales.  2048 LSTM units, 64×64 input
    2. Then look at predictions for 100 frames in the future.  Don’t show the image outputs, but show that the activation doesn’t just average out to some mean amount; it has periodic activations that it maintains
    3. It starts to blur outputs, but maintains motion
  31. They also used the network that was trained on 2 moving digits on 3 and 1 moving digits; for 1 digit it superimposed another blob on the one digit, and for the 3 it blurred the digits out but maintained motion <to me, on the other hand it looks like it slightly blurs the 1 digit and for the 3 digit turns into 2 blurry digits>
  32. Discuss visualizing features <not paying attention to that at the moment>
  33. Check how unsupervised learning helps supervised learning
  34. 2048 unit autoencoder trained on 300 hours of video.  Encode 16 frames, predict 10
  35. “At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames.”
    1. <averaging over predictions seems like a funny thing to do>
  36. ” All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM” Dropout was important
  37. For small datasets pretraining on unsupervised data helps
  38. Similar experiments on “Temporal Stream Convolutional Nets” <In my reading queue> – also helped there
  39. “We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better”
  40. “The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is atleast partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. “
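
The reversed-target trick from point 11 amounts to a one-line data-preparation step (my sketch):

```python
def make_reconstruction_pair(frames):
    """Encoder input is the sequence in order; the decoder's
    reconstruction target is the same sequence reversed, so the first
    frame it must emit is the one the encoder saw most recently."""
    encoder_input = list(frames)
    decoder_target = list(reversed(frames))
    return encoder_input, decoder_target

enc, dec = make_reconstruction_pair(["f1", "f2", "f3", "f4"])
print(enc)  # ['f1', 'f2', 'f3', 'f4']
print(dec)  # ['f4', 'f3', 'f2', 'f1']
```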
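
The initialization recipe quoted in point 22, under one plausible reading of "uniform distribution whose scale was set to 1/sqrt(fan-in)" (the exact bounds are my assumption):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Uniform init in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    scale = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-scale, scale, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = init_weights(fan_in=2048, fan_out=2048, rng=rng)
b = np.zeros(2048)   # gate biases (and peepholes) initialized to zero
print(W.shape, float(np.abs(W).max()))
```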
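
The test-time averaging from point 35, sketched in NumPy (the per-block classifier is faked with random per-frame scores; the 16-frame block and stride of 8 are from the note):

```python
import numpy as np

def video_prediction(frame_scores, block=16, stride=8):
    """Average class scores over all 16-frame blocks taken at stride 8.
    frame_scores: (T, num_classes); a real model would score each block
    with the trained classifier, here we just average frame scores."""
    T = frame_scores.shape[0]
    starts = range(0, T - block + 1, stride)
    block_scores = [frame_scores[s:s + block].mean(axis=0) for s in starts]
    return np.mean(block_scores, axis=0)

scores = np.random.default_rng(0).random((64, 101))   # 64 frames, 101 classes
video_scores = video_prediction(scores)
print(video_scores.shape)  # (101,) -- one score vector for the whole video
```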

Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtlander. Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is particularly interesting because, since the agent is rewarded for speed, the best policies are the closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons. Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.”
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving at a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all state information (position but not velocity), they need some way to bring the remaining state in.  One option is to make the state the previous two frames; the other is to use a Kohonen map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. They use a particular form of value function approximation (FA) I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsilon-greedy policy with epsilon= 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.”
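The collection schedule in item 17 can be sketched as follows; `choose_action` and `exploration_schedule` are hypothetical helper names, with the episode counts taken from the quote:

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy over a small discrete action set (4 actions in the paper)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def exploration_schedule(episode):
    """Sketch of the paper's schedule: 3 baseline episodes driving the
    constant safe action, then epsilon = 0.1 for 50 episodes, then pure
    exploitation (epsilon = 0) until 130 episodes total."""
    if episode < 3:
        return None      # constant safe action, no Q-based choice
    if episode < 3 + 50:
        return 0.1
    return 0.0
```

After each episode the paper re-runs its cluster-based Fitted-Q to convergence on all tuples collected so far, which is what keeps total interaction time under 30 minutes.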

RealKrimp — Finding Hyperintervals that Compress with MDL for Real-Valued Data. Witteveen, Duivesteijn, Knobbe, Grunwald. Advances in Intelligent Data Analysis 2014

Ah, it’s from a symposium

An implementation exists here

  1. The idea behind the minimum description length (MDL) principle is that it is possible to do induction by compression.
  2. Here they take a popular MDL algorithm, KRIMP, and extend it to real-valued data
  3. “Krimp seeks frequent itemsets: attributes that co-occur unusually often in the dataset. Krimp employs a mining scheme to heuristically find itemsets that compress the data well, gauged by a decoding function based on the Minimum Description Length Principle.”
  4. RealKRIMP “…finds interesting hyperintervals in real-valued datasets.”
  5. “The Minimum Description Length (MDL) principle [2,3] can be seen as the more practical cousin of Kolmogorov complexity [4]. The main insight is that patterns in a dataset can be used to compress that dataset, and that this idea can be used to infer which patterns are particularly relevant in a dataset by gauging how well they compress: the authors of [1] summarize it by the slogan Induction by Compression. Many data mining problems can be practically solved by compression.”
  6. “An important piece of mathematical background for the application of MDL in data mining, which is relevant for both Krimp and RealKrimp, is the Kraft Inequality, relating code lengths and probabilities.”  They extend the Kraft Inequality to continuous spaces
  7. <Ok skipping most – interesting but tight on time.>
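The Kraft Inequality mentioned in item 6 is the link between code lengths and probabilities: a binary prefix code with lengths l_i exists iff Σ 2^(−l_i) ≤ 1, so choosing lengths l_i ≈ −log2 p_i is always legal. A minimal check (my own illustration, not from the paper):

```python
def kraft_sum(code_lengths, alphabet_size=2):
    """Left-hand side of the Kraft Inequality: sum_i D^(-l_i).
    A prefix code with these lengths exists iff this sum is <= 1."""
    return sum(alphabet_size ** -l for l in code_lengths)

# Lengths of the valid prefix code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))   # 1.0 -> a complete prefix code
# Lengths that no prefix code can achieve:
print(kraft_sum([1, 1, 2]))      # 1.25 > 1 -> impossible
```

RealKrimp's contribution is extending this discrete relationship to continuous spaces so compression can score hyperintervals in real-valued data.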
