Active Model Selection. Madani, Lizotte, Greiner. UAI 2004

  1. Considers the case where there is a fixed budget
  2. Shown to be NP-Hard
  3. Consider some heuristics
  4. “We observe empirically that the simple biased-robin algorithm significantly outperforms the other algorithms in the case of identical costs and priors.”
  5. Formalizes the problem in terms of coins.  You are given a set of coins with different biases and a budget on the number of flips.  The goal is to pick the coin with the highest bias for heads.  They actually consider the case where there is a prior over each coin’s bias, so this is the Bayesian setting
  6. “We address the computational complexity of the problem, showing that it is in PSPACE, but also NP-hard under different coin costs.”
  7. Metric is based on regret
  8. “A strategy may be viewed as a finite, rooted, directed tree where each leaf node is a special “stop” node, and each internal node corresponds to flipping a particular coin, whose two children are also strategy trees, one for each outcome of the flip”
    1. So naturally the total number of ways this can work out is exponential
  9. “We have observed that optimal strategies for identical priors typically enjoy a similar pattern (with some exceptions): their top branch (i.e., as long as the outcomes are all heads) consists of flipping the same coin, and the bottom branch (i.e., as long as the outcomes are all tails) consists of flipping the coins in a Round-Robin fashion”
  10. Update estimates on coins according to beta distribution
  11. “The proof reduces the Knapsack Problem to a special coins problem where the coins have different costs, and discrete priors with non-zero probability at head probabilities 0 and 1 only. It shows that maximizing the profit in the Knapsack instance is equivalent to maximizing the probability of finding a perfect coin, which is shown equivalent to minimizing the regret. The reduction reveals the packing aspect of the budgeted problem. It remains open whether the problem is NP-hard when the coins have unit costs and/or uni-modal distributions”
  12. “It follows that in selecting the coin to flip, two significant properties of a coin are the magnitude of its current mean, and the spread of its density (think “variance”), that is how changeable its density is if it is queried: if a coin’s mean is too low, it can be ignored by the above result, and if its density is too peaked (imagine no uncertainty), then flipping it may yield little or no information …However, the following simple, two coin example shows that the optimal action can be to flip the coin with the lower mean and lower spread!”
  13. Even if the Beta parameters of two coins are fixed, the Beta parameters of a third coin may require you to choose the first or the second coin, depending on their values
  14. Furthermore, “The next example shows that the optimal strategy can be contingent — i.e., the optimal flip at a given stage depends on the outcomes of the previous flips.”
  15. Although the optimal algorithm is contingent, an algorithm that is not contingent may only give up a little bit on optimality
  16. Discusses a number of heuristics including biased robin and interval estimation
  17. Gittins indices are simple and optimal, but only in the infinite horizon discounted case
    1. Discusses a hack to get it to work in the budgeted case (manipulating the discount based on the remaining budget)
  18. Goes on to empirical evaluation of heuristics
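
A minimal sketch of the coin setup with Beta posteriors and the biased-robin heuristic. The rule used here (keep flipping the same coin on heads, move to the next coin on tails) is my understanding of biased-robin and is not spelled out in the notes above. The `flip` callback, the uniform Beta(1, 1) prior, and picking the coin with the highest posterior mean at the end are all assumptions for illustration.

```python
def biased_robin(flip, n_coins, budget, prior=(1, 1)):
    """Spend `budget` flips, then return (best_coin, posteriors).

    flip(i) -> True for heads, False for tails (hypothetical callback).
    Each coin keeps a Beta(alpha, beta) posterior, starting from `prior`.
    """
    posteriors = [list(prior) for _ in range(n_coins)]
    coin = 0
    for _ in range(budget):
        if flip(coin):
            posteriors[coin][0] += 1   # heads: alpha += 1, stay on this coin
        else:
            posteriors[coin][1] += 1   # tails: beta += 1, move to the next coin
            coin = (coin + 1) % n_coins
    # Assumption: report the coin with the highest posterior mean.
    means = [a / (a + b) for a, b in posteriors]
    best = max(range(n_coins), key=means.__getitem__)
    return best, posteriors
```

With a deterministic `flip` such as `lambda i: i == 2` (only coin 2 ever lands heads), the heuristic cycles past the first two coins once and then spends the rest of the budget on coin 2.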

Gaussian Process Dynamical Models. Wang, Fleet, Hertzmann. NIPS 2006

  1. “A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space.”
  2. “We demonstrate the approach on human motion capture data in which each pose is 62-dimensional.”
  3. “we show that integrating over parameters in nonlinear dynamical systems can also be performed in closed-form. The resulting Gaussian Process Dynamical Model (GPDM) is fully defined by a set of lowdimensional representations of the training data, with both dynamics and observation mappings learned from GP regression.”
  4. As Bayesian nonparametric models, GPs are easier to use and overfit less
  5. “Despite the large state space, the space of activity-specific human poses and motions has a much smaller intrinsic dimensionality; in our experiments with walking and golf swings, 3 dimensions often suffice.”
  6. “The Gaussian Process Dynamical Model (GPDM) comprises a mapping from a latent space to the data space, and a dynamical model in the latent space…The GPDM is obtained by marginalizing out the parameters of the two mappings, and optimizing the latent coordinates of training data.”
  7. “It should be noted that, due to the nonlinear dynamical mapping in (3), the joint distribution of the latent coordinates is not Gaussian. Moreover, while the density over the initial state may be Gaussian, it will not remain Gaussian once propagated through the dynamics.”
  8. Looks like all predictions are 1-step, can specifically set it up to use more history to make it higher-order
  9. “In effect, the GPDM models a high probability “tube” around the data.”
  10. “Here we consider a simple online method for generating a new motion, called mean-prediction, which avoids the relatively expensive Monte Carlo sampling used above.”
  11. <Wordpress ate the rest of this post.  A very relevant paper I should follow up on.>
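
A rough sketch, in LaTeX, of the model structure the quotes above describe: first-order dynamics in a latent space plus a map to the observation space. The symbols are my reading of the paper's notation and should be checked against it: x_t is the latent coordinate, y_t the observed pose, A and B the parameters that get marginalized out, and phi_i, psi_j are basis functions.

```latex
\begin{aligned}
x_t &= f(x_{t-1}; A) + n_{x,t}, & f(x; A) &= \sum_i a_i \, \phi_i(x) \\
y_t &= g(x_t; B) + n_{y,t},     & g(x; B) &= \sum_j b_j \, \psi_j(x)
\end{aligned}
```

Putting Gaussian priors on the weights a_i and b_j and integrating them out is what turns both mappings into GP regressions, leaving the latent coordinates (and kernel hyperparameters) as the things to optimize.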

Deep Learning. LeCun, Bengio, Hinton. Nature 2015

  1. “Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.”
  2. Previous machine learning methods traditionally relied on significant hand-engineering to process data into something the real learning algorithm could use
  3. “Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations.”
  4. Has allowed for breakthroughs in many different areas
  5. “We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data.”
    1. <really? they have a very different definition of very little than i do>
  6. “In practice, most practitioners use a procedure called stochastic gradient descent (SGD).”
  7. Visual classifiers have to learn to be invariant to many things like background, shading, contrast, orientation, zoom, but also have to be very sensitive to other things (for example, learning to distinguish a German shepherd from a wolf)
  8. “As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. ”
    1. Which is just an application of the chain rule for derivatives
  9. ReLUs are best for deep networks, and can help remove the need for pretraining
  10. Theoretical results as to why NNs rarely get stuck in local minima (especially large networks)
  11. Deep NN work started in 2006, when pretraining was done by having each layer model the activity of the layer below
  12. 1st major application of deep nets was speech recognition in 2009; by 2012 it was doing speech recognition on Android
  13. For small datasets, unsupervised pretraining is helpful
  14. Convnets for vision
  15. “There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.”
  16. “Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.”
  17. “The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.”
  18. Machine translation and rnns
  19. Regular RNNs don’t work so well, LSTM fixes major problems
  20. “Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to [88], and memory networks, in which a regular network is augmented by a kind of associative memory [89]. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions. Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list [88]. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference [90]. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?” [89].”
  21. Although the focus now is mainly on supervised learning, they expect that unsupervised learning will become most important in the long term
  22. “Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems [99] at classification tasks and produce impressive results in learning to play many different video games [100].”
  23. “Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.”
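
To make the "backprop is just the chain rule" point concrete, here is a minimal sketch on a made-up two-weight scalar network with a ReLU, followed by one SGD update. The network shape, the squared-error loss, and the learning rate are my own choices for illustration and are not taken from the paper.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def forward_backward(x, y, w1, w2):
    """Tiny scalar network: h = relu(w1 * x), out = w2 * h, loss = 0.5 * (out - y)^2.

    Returns (loss, dloss/dw1, dloss/dw2), with gradients from the chain rule.
    """
    a = w1 * x
    h = relu(a)
    out = w2 * h
    loss = 0.5 * (out - y) ** 2

    d_out = out - y                          # dloss/dout
    d_w2 = d_out * h                         # dout/dw2 = h
    d_h = d_out * w2                         # dout/dh = w2
    d_a = d_h * (1.0 if a > 0.0 else 0.0)    # derivative of relu at a
    d_w1 = d_a * x                           # da/dw1 = x
    return loss, d_w1, d_w2


def sgd_step(x, y, w1, w2, lr=0.1):
    """One stochastic gradient descent update on a single example."""
    _, d_w1, d_w2 = forward_backward(x, y, w1, w2)
    return w1 - lr * d_w1, w2 - lr * d_w2
```

A quick sanity check is to compare `d_w1` against a central finite difference of the loss, which should agree to several decimal places wherever the ReLU is not sitting exactly at zero.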

Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Singh, Memoli, Carlsson. Eurographics Symposium on Point-Based Graphics 2007

  1. “We present a computational method for extracting simple descriptions of high dimensional data sets in the form of simplicial complexes.”
  2. Discusses a tool Mapper for doing TDA
  3. “The purpose of this paper is to introduce a new method for the qualitative analysis, simplification and visualization of high dimensional data sets”
  4. One application is to help visualization in cases where data is extremely high-dimensional, and standard forms of visualization and dimension reduction don’t produce a reasonable result
  5. “Our construction provides a coordinatization not by using real valued coordinate functions, but by providing a more discrete and combinatorial object, a simplicial complex, to which the data set maps and which can represent the data set in a useful way”
  6. “In the simplest case one can imagine reducing high dimensional data sets to a graph which has nodes corresponding to clusters in the data”
  7. “Our method is based on topological ideas, by which we roughly mean that it preserves a notion of nearness, but can distort large scale distances. This is often a desirable property, because while distance functions often encode a notion of similarity or nearness, the large scale distances often carry little meaning.”
    1. <Reminds me a little of the intuition behind t-sne>
  8. Basic idea is called a partial clustering where subsets of the data are clustered.  If subsets overlap, clusters may also overlap which can be used to build a “simplicial complex” <basically a way of gluing together points so each face has a certain number of vertices, and then connecting the faces>
    1. If simplices and topologies are robust to changes in data subset divisions, then the results are real
  9. “We do not attempt to obtain a fully accurate representation of a data set, but rather a low-dimensional image which is easy to understand, and which can point to areas of interest. Note that it is implicit in the method that one fixes a parameter space, and its dimension will be an upper bound on the dimension of the simplicial complex one studies.”
  10. Unlike other methods of dimension reduction (like isomap, mds) this method is less sensitive to metric
  11. <skipping most of the math because it uses terms I’m not familiar with>
  12. “The main idea in passing from the topological version to the statistical version is that clustering should be regarded as the statistical version of the geometric notion of partitioning a space into its connected components.”
  13. Assumes there is a mapping from data points to Reals (the filter), and that interpoint distances can be measured
  14. “Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any conditions on the clustering algorithm. Thus any domain-specific clustering algorithm can be used.”
  15. Cluster based on vector of interpoint distances (this is the representation used, not Euclidean distances)
  16. Also want clustering algorithm that doesn’t require # of clusters to be specified in advance
    1. <I would have thought they would use Chinese-restaurant-processes, but they use something called single-linkage clustering>
  17. Seems like Mapper can handle higher-dimensional filters
  18. “The outcome of Mapper is highly dependent on the function(s) chosen to partition (filter) the data set. In this section we identify a few functions which carry interesting geometric information about data sets in general.”
  19. projection pursuit methods
  20. Can use eigenfunctions of the Laplacian as filter functions
  21. Also discusses using meshes and computing distances with Dijkstra’s algorithm over the mesh
  22. “Different poses of the same shape have qualitatively similar Mapper results however different shapes produce significantly different results. This suggests that certain intrinsic information about the shapes, which is invariant to pose, is being retained by our procedure.”
    1. I think this falls out primarily from the distance metric used?
  23. Clusters different 3d models nicely, even though it samples down to just 100 points for each model, and many are similar (horse, camel, cat)
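
A minimal sketch of the partial-clustering idea for a one-dimensional filter: cover the filter range with overlapping intervals, cluster the points that fall in each interval, and connect clusters that share points. The fixed distance cutoff `eps` for single-linkage is my simplification (it replaces whatever rule the paper uses to pick the number of clusters), and the function names and parameters are hypothetical.

```python
import math


def single_linkage(indices, points, eps):
    """Connected components of the graph linking points at distance <= eps.

    Equivalent to cutting a single-linkage dendrogram at height eps.
    """
    clusters, unseen = [], set(indices)
    while unseen:
        stack = [unseen.pop()]
        comp = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unseen if math.dist(points[i], points[j]) <= eps}
            unseen -= near
            comp |= near
            stack.extend(near)
        clusters.append(frozenset(comp))
    return clusters


def mapper(points, filt, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters, edges join clusters sharing a point."""
    lo, hi = min(filt), max(filt)
    width = (hi - lo) / n_intervals
    pad = overlap * width / 2          # widen each interval on both sides
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [i for i, f in enumerate(filt) if a <= f <= b]
        nodes.extend(single_linkage(members, points, eps))
    edges = {(s, t) for s in range(len(nodes)) for t in range(s + 1, len(nodes))
             if nodes[s] & nodes[t]}
    return nodes, edges
```

For five evenly spaced points on a line with the coordinate itself as the filter, two overlapping intervals give two clusters joined by one edge, which is the expected "graph with nodes corresponding to clusters" in the simplest case.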

Very Deep Convolutional Networks for Large-scale Image Recognition. Simonyan, Zisserman. ICLR 2015

  1. Discusses approach that got 1st, 2nd place in ImageNet challenge 2014
  2. Basic idea is to use very small convolutions (3×3) and a deep network (16-19 layers)
  3. Made the implementation public
  4. Works well on other data sets as well
  5. Last year people moved to smaller receptive windows, smaller strides, and more thorough use of the training data (at multiple scales)
  6. 224×224: only preprocessing is doing mean-subtraction of RGB values for each pixel
  7. “Local response normalization” didn’t help performance and consumed more memory
  8. Earlier state of the art used 11×11 convolutions w/stride 4 (or 7×7 stride 2)
    1. Here they only did 3×3 with stride 1
    2. They also have 3 non-linear rectification layers instead of 1, so the decisions made by those layers can be more flexible
  9. Their smaller convolutions have a much smaller number of parameters, which can be seen as a form of regularization
  10. Optimized multinomial logistic regression using minibatch (size 256) gradient descent from backprop + momentum.
  11. “The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10−4 ) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).”
    1. <How does this weight decay work exactly?  Need to check it out>
  12. Ends up training faster than Krizhevsky et al., 2012’s network because of some pretraining, and also because the network is narrower, but deeper (more regularized)
    1. Pretrain 1st 4 convolutional layers, and last 3 fully connected layers
    2. They found out later that pretraining wasn’t really needed if they used a particular random initialization procedure
  13. Implementation based on Caffe, including very efficient parallelization
  14. With 4 Titan GPUs, took 2-3 weeks to train
  15. Adding further layers didn’t improve performance, although they say it might have if the data set was even larger
  16. “scale jittering helps” <i guess this has to do with how images are cropped and scaled to fit in 224×224, and randomizing this process a bit helps>
  17. “Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.”
  18. Method was simpler than a number of other near state-of-the-art
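
A small arithmetic sketch of why stacking 3×3 convolutions is attractive: a stack of stride-1 3×3 layers covers the same receptive field as one larger filter, with fewer weights and more non-linearities in between. The helper names are hypothetical, biases are ignored, and the channel count of 64 in the example is just an illustrative value.

```python
def stacked_receptive_field(kernel, n_layers, stride=1):
    """Side length (in input pixels) seen by one unit after n stacked conv layers
    that all share the same kernel size and stride."""
    field, jump = 1, 1
    for _ in range(n_layers):
        field += (kernel - 1) * jump
        jump *= stride
    return field


def conv_weights(kernel, channels, n_layers=1):
    """Weight count (biases ignored) for n conv layers with `channels` inputs and outputs."""
    return n_layers * kernel * kernel * channels * channels


# Three 3x3 layers see a 7x7 window but use 27*C^2 weights instead of 49*C^2.
print(stacked_receptive_field(3, 3), conv_weights(3, 64, 3), conv_weights(7, 64))
```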

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Chung, Gulcehre, Cho, Bengio. Arxiv 2014

  1. Compares different types of recurrent units in RNNs
  2. Compares LSTMs and newer Gated Recurrent Unit (GRU)
  3. Test on music and speech signal modelling
  4. These units are found to be better than “more traditional recurrent units such as tanh units”
  5. GRU is found to be comparable to LSTM, although GRU is a bit better
  6. Of all the impressive recent work on RNNs (including everything that works off of variable-size inputs), nothing is from vanilla RNNs
  7. Vanilla RNNs are hard to use because of both exploding and vanishing gradients
    1. Discussed many of the points related to this here
  8. GRUs are somewhat similar to LSTM although the model is a bit simpler
    1. Both can capture long-term dependencies
  9. GRU doesn’t have a separate memory cell like LSTM does
    1. Doesn’t have a mechanism to protect memory like LSTM
  11. Calculating the activation with GRU is simpler as well
  12. Both LSTM and GRUs compute deltas as opposed to completely recomputing values at each step
  13. “This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is. Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing (if the gating unit is nearly saturated at 1) as a result of passing through multiple, bounded nonlinearities,”
  14. “Another difference is in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).”
  15. When comparing gated to vanilla RNN “Convergence is often faster, and the final solutions tend to be better.”
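
A minimal scalar sketch of the two units, to show the additive update both share and where the gates sit. This is a simplified textbook form (one unit, no biases, no peephole connections), so it is not necessarily the exact parameterization used in the paper, and the weight names in `p` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One scalar GRU step. `p` maps hypothetical weight names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)                # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))   # candidate activation
    return (1.0 - z) * h_prev + z * h_cand                     # blend old state and candidate


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step (no peepholes). Returns (h, c)."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev)                # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev)                # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev)                # output gate
    c_cand = math.tanh(p["wc"] * x + p["uc"] * h_prev)         # new memory content
    c = f * c_prev + i * c_cand                                # additive memory update
    return o * math.tanh(c), c
```

In the GRU the reset gate `r` acts on the previous activation before the candidate is computed and a single gate `z` ties together how much is kept and how much is added, while the LSTM controls those two amounts independently with `f` and `i`. When `z` is close to 0 the GRU state passes through almost unchanged, which is the "shortcut path" the quote above refers to.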

An Empirical Exploration of Recurrent Network Architectures. Jozefowicz, Zaremba, Sutskever. ICML 2015

  1. Vanilla RNNs are usually difficult to train.  LSTMs are a form of RNN that are easier to train
  2. LSTMs, though, have an architecture that “appears to be ad-hoc so it is not clear if it is optimal, and the significance of its individual components is unclear.”
  3. Tested thousands of different models with different architectures based on LSTM, and also compared new Gated Recurrent Units
  4. “We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU.”
  5. RNNs suffer from exploding/vanishing gradients (the latter was addressed successfully in LSTMs)
    1. There are many other ways to work on the vanishing gradient, such as regularization, second-order optimization, “giving up on learning the recurrent weights altogether”, as well as careful weight initialization
  6. Exploding gradients were easier to address with “a hard constraint over the norm of the gradient”
    1. Later referred to as “gradient clipping”
  7. “We discovered that the input gate is important, that the output gate is unimportant, and that the forget gate is extremely significant on all problems except language modelling. This is consistent with Mikolov et al. (2014), who showed that a standard RNN with a hard-coded integrator unit (similar to an LSTM without a forget gate) can match the LSTM on language modeling.”
  8. exploding/vanishing gradients “are caused by the RNN’s iterative nature, whose gradient is essentially equal to the recurrent weight matrix raised to a high power. These iterated matrix powers cause the gradient to grow or to shrink at a rate that is exponential in the number of timesteps.”
  9. Vanishing gradient issue in RNNs make it easy to learn short-term interactions but not long-term
  10. Through reparameterizing, LSTM cannot have a gradient that vanishes
  11. Basically, instead of recomputing the state from the state at the previous step, it only computes a state delta which is added to the previous state
    1. The network has additional machinery to do so
    2. Many LSTM variants
  12. Random initialization of the forget gate will leave it with some fractional value, which introduces a vanishing gradient.
    1. It is commonly ignored, but initializing it to a “large value” such as 1 or 2 will prevent vanishing gradient over time
  13. Use genetic algorithms to optimize architecture and hyperparams
  14. Evaluated 10,000 architectures; 1,000 made it past the first task (which would allow them to compete genetically).  A total of 230,000 hyperparameter configs were tested
  15. Three problems tested:
    1. Arithmetic: read in a string which has numbers with an add or subtract symbol inside, then the network has to feed out the output.  There are distractor symbols in the string that need to be ignored
    2. Completion of a random XML dataset
    3. Penn Tree-Bank (word level modelling)
    4. Then there was an extra task to test generalization <validation?>
  16. “Unrolled” RNNs for 35 timesteps, minibatch of size 20
  17. Had a schedule for adjusting the learning rate once learning stopped on the initial value
    1. <nightmare>
  18. “Though there were architectures that outperformed the LSTM on some problems, we were unable to find an architecture that consistently beat the LSTM and the GRU in all experimental conditions.”
  19. “Importantly, adding a bias of size 1 significantly improved the performance of the LSTM on tasks where it fell behind the GRU and MUT1. Thus we recommend adding a bias of 1 to the forget gate of every LSTM in every application”
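
Two of the points above are easy to sketch: the hard constraint on the gradient norm (gradient clipping), and why the forget-gate bias matters. The second function is a deliberately crude model: it assumes the forget gate sits at sigmoid(bias) for every step and ignores the rest of the gate's input, which is my simplification and not the paper's analysis. Function names are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def retained_fraction(forget_bias, steps):
    """Fraction of a memory cell's content left after `steps` steps,
    assuming the forget gate stays at sigmoid(forget_bias) throughout."""
    gate = 1.0 / (1.0 + math.exp(-forget_bias))
    return gate ** steps


# Compare a zero bias with a bias of 1 over the 35 unrolled timesteps.
print(retained_fraction(0.0, 35), retained_fraction(1.0, 35))
```

Under that crude model, a bias of 1 keeps several orders of magnitude more of the cell's content over 35 steps than a bias of 0 does, which gives some intuition for why such a small change can help with long-term dependencies.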

Two-Stream Convolutional Networks for Action Recognition in Videos. Simonyan, Zisserman. Arxiv 2014

  1. Doing action recognition in video
  2. “Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both.”
  3. “Our proposed architecture is related to the two-streams hypothesis [9], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognises motion);”
  4. “Video can naturally be decomposed into spatial and temporal components. The spatial part, in the form of individual frame appearance, carries information about scenes and objects depicted in the video. The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects… Each stream is implemented using a deep ConvNet, softmax scores of which are combined by late fusion. We consider two fusion methods: averaging and training a multi-class linear SVM”
    1. The two streams are only combined at the very end (7 layers+softmax each)
  5. “…, the input to our model is formed by stacking optical flow displacement fields between several consecutive frames. Such input explicitly describes the motion between video frames, which makes the recognition easier, as the network does not need to estimate motion implicitly.”
  6. They simply stack the displacement vector fields + throw into a convnet
  7. Another method is to use optical flow instead of displacement vector fields <don’t know what the difference is>
  8. In order to zero-center data, “The importance of camera motion compensation has been previously highlighted in [10, 26], where a global motion component was estimated and subtracted from the dense flow. In our case, we consider a simpler approach: from each displacement field d we subtract its mean vector.”
  9. Looks like they predict for a frame in the middle of a window, using some frames before it and some after
  10. The spatial convnet is trained only on stills
  11. “A more principled way of combining several datasets is based on multi-task learning [5]. Its aim is to learn a (video) representation, which is applicable not only to the task in question (such as HMDB-51 classification), but also to other tasks (e.g. UCF-101 classification). Additional tasks act as a regulariser, and allow for the exploitation of additional training data. In our case, a ConvNet architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes HMDB-51 classification scores, the other one – the UCF-101 scores. Each of the layers is equipped with its own loss function, which operates only on the videos, coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks’ losses, and the network weight derivatives can be found by back-propagation”
  12. “The only difference between spatial and temporal ConvNet configurations is that we removed the second normalisation layer from the latter to reduce memory consumption.”
  13. “The network weights are learnt using the mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of 256 samples is constructed by sampling 256 training videos (uniformly across the classes), from each of which a single frame is randomly selected”
  14. “Our implementation is derived from the publicly available Caffe toolbox [13], but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. We exploit the data parallelism, and split each SGD batch across several GPUs. Training a single temporal ConvNet takes 1 day on a system with 4 NVIDIA Titan cards, which constitutes a 3.2 times speed-up over single-GPU training.”
    1. <nice>
  15. “Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox.”
  16. Pretraining makes a big difference
  17. “The difference between different stacking techniques is marginal; it turns out that optical flow stacking performs better than trajectory stacking, and using the bi-directional optical flow is only slightly better than a uni-directional forward flow. Finally, we note that temporal ConvNets significantly outperform the spatial ConvNets (Table 1a), which confirms the importance of motion information for action recognition.”
  18. Performance is comparable to hand-designed state of the art
  19. “We proposed a deep video classification model with competitive performance, which incorporates separate spatial and temporal recognition streams based on ConvNets. Currently it appears that training a temporal ConvNet on optical flow (as here) is significantly better than training on raw stacked frames [14]. The latter is probably too challenging, and might require architectural changes (…). Despite using optical flow as input, our temporal model does not require significant hand-crafting, since the flow is computed using a method based on the generic assumptions of constancy and smoothness.”
  20. “There still remain some essential ingredients of the state-of-the-art shallow representation [26], which are missed in our current architecture. The most prominent one is local feature pooling over spatio-temporal tubes, centered at the trajectories. Even though the input (2) captures the optical flow along the trajectories, the spatial pooling in our network does not take the trajectories into account. Another potential area of improvement is explicit handling of camera motion, which in our case is compensated by mean displacement subtraction.”
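
A minimal sketch of two simple pieces described above: subtracting the mean vector from a displacement field (the cheap camera-motion compensation) and late fusion of the two streams by averaging their softmax scores. The data layout (a flat list of (dx, dy) pairs, plain lists of class scores) and the function names are assumptions for illustration. The SVM fusion variant is not shown.

```python
def subtract_mean_displacement(field):
    """field: list of (dx, dy) displacement vectors for one pair of frames."""
    n = len(field)
    mx = sum(dx for dx, _ in field) / n
    my = sum(dy for _, dy in field) / n
    return [(dx - mx, dy - my) for dx, dy in field]


def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the two streams' softmax scores class by class."""
    return [(s + t) / 2 for s, t in zip(spatial_scores, temporal_scores)]


def predict(spatial_scores, temporal_scores):
    """Index of the class with the highest fused score."""
    fused = fuse_by_averaging(spatial_scores, temporal_scores)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, if the spatial stream favors class 0 with scores `[0.6, 0.3, 0.1]` and the temporal stream favors class 2 with `[0.1, 0.2, 0.7]`, the averaged scores pick class 2, since the temporal stream is the more confident of the two.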

Video (Language) Modeling: A Baseline for Generative Models of Natural Videos. Ranzato, Szlam, Bruna, Mathieu, Collobert, Chopra. Arxiv 2014

  1. “We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary.”
  2. “The biggest hurdle to overcome when learning without supervision is the design of an objective function that encourages the system to discover meaningful regularities. One popular objective is squared Euclidean distance between the input and its reconstruction from some extracted features. Unfortunately, the squared Euclidean distance in pixel space is not a good metric, since it is not stable to small image deformations, and responds to uncertainty with linear blurring. Another popular objective is log-likelihood, reducing unsupervised learning to a density estimation problem. However, estimating densities in very high dimensional spaces can be difficult, particularly the distribution of natural images which is highly concentrated and multimodal.”
  3. There has been previous work on generative models for images, but they have been small in scale
  4. “…spatial-temporal correlations can provide powerful information about how objects deform, about occlusion, object boundaries, depth, and so on”
  5. “The only assumption that we make is local spatial and temporal stationarity of the input (in other words, we replicate the model and share parameters both across space and time),”
  6. Works based on 1-hot representation of video and recurrent nn <?>
  7. They use a dictionary/classification based approach because they found that doing regression just led to blurring of the frame when prediction was done
    1. Each image patch is unique, so some bucketing scheme must be used – they use k-means
  8. “This sparsity enforces strong constraints on what is a feasible reconstruction, as the k-means atoms “parameterize” the space of outputs. The prediction problem is then simpler because the video model does not have to parameterize the output space; it only has to decide where in the output space the next prediction should go.”
  9. “There is clearly a trade-off between quantization error and temporal prediction error. The larger the quantization error (the fewer the number of centroids), the easier it will be to predict the codes for the next frame, and vice versa. In this work, we quantize small gray-scale 8×8 patches using 10,000 centroids constructed via k-means, and represent an image as a 2d array indexing the centroids.”
  10. Use recurrent convnet
  11. “In the recurrent convolutional neural network (rCNN) we therefore feed the system with not only a single patch, but also with the nearby patches. The model will not only leverage temporal dependencies but also spatial correlations to more accurately predict the central patch at the next time step”
  12. “To avoid border effects in the recurrent code (which could propagate in time with deleterious effects), the transformation between the recurrent code at one time step and the next one is performed by using 1×1 convolutional filters (effectively, by using a fully connected layer which is shared across all spatial locations).”
  13. “First, we do not pre-process the data in any way except for gray-scale conversion and division by the standard deviation.”
  14. “This dataset [UCF-101] is by no means ideal for learning motion patterns either, since many videos exhibit jpeg artifacts and duplicate frames due to compression, which further complicate learning.”
  15. “Generally speaking, the model is good at predicting motion of fairly fast moving objects of large size, but it has trouble completing videos with small or slowly moving objects.”
  17. Optical flow based methods produce results that are less blurred but more distorted
  18. The method can also be used for filling in frames
  19. Discusses future work:
    1. Multi-scale prediction
    2. Multi-step prediction
    3. Regression
    4. Hard coding features for motion
  20. “This model shows that it is possible to learn the local spatio-temporal geometry of videos purely from data, without relying on explicit modeling of transformations. The temporal recurrence and spatial convolutions are key to regularize the estimation by indirectly assuming stationarity and locality. However, much is left to be understood. First, we have shown generation results that are valid only for short temporal intervals, after which long range interactions are lost.”
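
A minimal sketch of the quantization step: each patch is replaced by the index of its nearest k-means centroid, so a frame becomes a 2D array of dictionary indices that a language-model-style predictor can work on. The sketch assumes the centroids are already given (k-means itself is not shown) and tiles the frame with non-overlapping patches, which is my assumption. Function names are hypothetical.

```python
def nearest_centroid(patch, centroids):
    """Index of the centroid closest to `patch` in squared Euclidean distance."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(patch, c))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def quantize_frame(frame, size, centroids):
    """Map a gray-scale frame (list of rows) to a 2D array of centroid indices,
    one per non-overlapping size x size patch."""
    rows, cols = len(frame), len(frame[0])
    codes = []
    for r in range(0, rows - size + 1, size):
        row_codes = []
        for c in range(0, cols - size + 1, size):
            # Flatten the patch row by row so it lines up with the centroids.
            patch = [frame[r + i][c + j] for i in range(size) for j in range(size)]
            row_codes.append(nearest_centroid(patch, centroids))
        codes.append(row_codes)
    return codes
```

With 8×8 patches and 10,000 centroids as in the quote above, the prediction target for each location becomes a single class index out of 10,000, which is what lets them treat the problem as classification and avoid the blurring that regression produced.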