- “We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary.”
- “The biggest hurdle to overcome when learning without supervision is the design of an objective function that encourages the system to discover meaningful regularities. One popular objective is squared Euclidean distance between the input and its reconstruction from some extracted features. Unfortunately, the squared Euclidean distance in pixel space is not a good metric, since it is not stable to small image deformations, and responds to uncertainty with linear blurring. Another popular objective is log-likelihood, reducing unsupervised learning to a density estimation problem. However, estimating densities in very high dimensional spaces can be difficult, particularly the distribution of natural images which is highly concentrated and multimodal .”
- There has been previous work on generative models for images, but they have been small in scale
- “…spatial-temporal correlations can provide powerful information about how objects deform, about occlusion, object boundaries, depth, and so on”
- “The only assumption that we make is local spatial and temporal stationarity of the input (in other words, we replicate the model and share parameters both across space and time),”
- Works based on 1-hot representation of video and recurrent nn <?>
- They use a dictionary/classification based approach because they found that doing regression just lead to blurring of the frame when prediction was done
- Each image patch is unique, so some bucketing scheme must be used – they use k-means

- “This sparsity enforces strong constraints on what is a feasible reconstruction, as the k-means atoms “parameterize” the space of outputs. The prediction problem is then simpler because the video model does not have to parameterize the output space; it only has to decide where in the output space the next prediction should go.”
- “There is clearly a trade-off between quantization error and temporal prediction error. The larger the quantization error (the fewer the number of centroids), the easier it will be to predict the codes for the next frame, and vice versa. In this work, we quantize small gray-scale 8×8 patches using 10,000 centroids constructed via k-means, and represent an image as a 2d array indexing the centroids.”
- Use recurrent convnet
- “In the recurrent convolutional neural network (rCNN) we therefore feed the system with not only a single patch, but also with the nearby patches. The model will not only leverage temporal dependencies but also spatial correlations to more accurately predict the central patch at the next time step”
- “To avoid border effects in the recurrent code (which could propagate in time with deleterious effects), the transformation between the recurrent code at one time step and the next one is performed by using 1×1 convolutional filters (effectively, by using a fully connected layer which is shared across all spatial locations).”
- ” First, we do not pre-process the data in any way except for gray-scale conversion and division by the standard deviation.”
- “This dataset [UCF-101] is by no means ideal for learning motion patterns either, since many videos exhibit jpeg artifacts and duplicate frames due to compression, which further complicate learning.”
- “Generally speaking, the model is good at predicting motion of fairly fast moving objects of large size, but it has trouble completing videos with small or slowly moving objects.”
- x
- Optical flow based methods produce results that are less blurred but more distorted
- The method can also be used for filling in frames
- Discusses future work:
- Multi-scale prediction
- Multi-step prediction
- Regression
- Hard coding features for motion

- “This model shows that it is possible to learn the local spatio-temporal geometry of videos purely from data, without relying on explicit modeling of transformations. The temporal recurrence and spatial convolutions are key to regularize the estimation by indirectly assuming stationarity and locality. However, much is left to be understood. First, we have shown generation results that are valid only for short temporal intervals, after which long range interactions are lost.”