Category Archives: Supervised Learning

One-shot learning by inverting a compositional causal process. Lake, Salakhutdinov, Tenenbaum. NIPS 2013

  1. Deals with one-shot learning
  2. “…a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image.”
  3. In 1-shot learning did about as well as people, and better than deep learning methods
  4. People can learn a new concept from a tiny amount of data – can learn a class from one image (which is a very high dimensional piece of data)
  5. Even MNIST, which is an old and small dataset, still has 6k samples/class, but people often only need 1 example
  6. “Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. ”
    1. Looks like people both have a rich hypothesis space (because they can do all the above with a small amount of data), but also don’t overfit, which is the theoretical downside to having a large hypothesis class.  How do they do it?
  7. Here, focus is on handwritten characters
    1. Idea is to use something more natural and complex than simple synthetic stimuli, but something less complex than natural images
  8. Use a new omniglot dataset, which has 16k characters with 20 examples each
    1. Also has time data so the strokes are recorded as well
  9. “… this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example.”
  10. Also use the method to generate new examples of a class, and then do a Turing test with it by asking other humans which was human generated and which was machine generated
  11. The HBPL “…is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes.”
  12. The model attempts to find a “structural description” that explains the image by breaking the character down into parts
  13. A Character is made of:
    1. A set of strokes
      1. Each stroke is made of simple sub-strokes modeled by a “uniform cubic b-spline” and is built of primitive motor elements that are defined by a 1st order Markov Process
    2. Set of spatial relationships between strokes, can be:
      1. Independent: A stroke that has a location independent of other strokes
      2. Start/end: A stroke that starts at beginning/end of another stroke
      3. Along: A stroke that starts somewhere along a previous stroke
  14. ” Each trajectory … is a deterministic function of a starting location … token-level control points … and token-level scale …. The control points and scale are noisy versions of their type-level counterparts…”
  15. Used 30 most common alphabets for training, and another 20 for evaluation.  The training set was used to learn hyperparameters, a set of 1000 primitive motor elements, and stroke placement.  They attempted to do cross-validation within the training set
  16. The full set of possible ways a stroke could be created is enormous, so they have a botto-up way of finding a set of the K most likely parses.  They approximate the posterior based on this finite, size-K sample based on their relative likelihoods
    1. They actually then use metropolis-hasting to get a number of samples of each parse with a little variance each to get a better estimate of the likelihoods
  17. “Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables…”
  18. Results
  19. For the 1-shot tasks, a letter from an alphabet was presented with 20 other letters from the same alphabet.  Each person did this 10 times, but each time was with a totally new alphabet, so no characters was ever seen twice
  20. Get K=5 parses of each character presented (along with MCMC), and then run K gradient searches to reoptimize the token-level variables to fit the query image.
  21. They can also, however, attempt to reoptimize the query image to fit the 20 options presented
  22. Compare against:
    1. Affine model
    2. Deep Boltzmann Machines
    3. Hierarchical Deep Model
    4. Simple Strokes (a simplified HBPL)
    5. NN
  23. Humans and HBPL ~4.5% error rate, affine model next at 18.2%
  24. Then they did one-shot Turing test where people and algorithms had to copy a single query character
    1. <For what its worth, I think Affine looks better than both results from people and HBPL>
  25. In the “Turing test” there was feedback after each 10 trials, for a total of 50 trials
    1. <Note that this test doesn’t ask which character looks best, it is which is most confusable with human writing (which is pretty sloppy from the images they show).  I’m curious if the affine model could be made more human just by adding noise to its output>
  26. <Playing devil’s advocate, the images of characters were collected on mTurk, and look like they were probably drawn with a mouse — that is to say I feel they don’t look completely like natural handwriting.  I wonder how much of this program is picking up on those artifacts?  At least in terms of reproduction, the affine method looks best>


Science 2015

  1. “Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (…). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (…). As programs, rich concepts can be built “compositionally” from simpler primitives
  2. “In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.”
  3. <Looks like exactly the same paper, just more brief.  The accuracies of both BPL and other methods seems improved here, though.  Convnets get 13.5% error; BPL gets 3.3%; people get 4.5%.  “A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors”>
  4. “BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here.”
    1. <Or equivalently you can just say BPL does better because it has a small and highly engineered hypothesis class>
  5. Also run BPL with various “lesions” and gets error rates in the teens.  Also did more poorly in the “Turing test” part
  6. Instead of training on 30 background alphabets, they also did with just 5, and there the error rates are about 4%; on the same set convnets did about 20% error
  7. Supplementary Material

  8. <I assumed that they would ask individuals who actually learned how to write the languages to do the recordings.  Instead, they just took pictures of characters and had people write them.  This seems like a problem to me because of inconsistencies in the way people would actually do the strokes of a letter in an alphabet they do not know.>
  9. <Indeed, they were also drawn by mouse in a box on a screen, which is a very unnatural way to do things>
  10. <From what I can tell the characters are recorded in pretty low resolution as well which looks like it can cause artifacts, looks like 105×105>
  11. <This basically has the details that were included in the main part of the NIPS paper>
  12. Some extra tricks like convolving with Gaussian filter, randomly flipping bits
  13. Primitives are scale-selective
  14. “For each image, the center of mass and range of the inked pixels was computed. Second, images were grouped by character, and a transformation (scaling and translation) was computed for each image so that its mean and range matched the group average.”
  15. ” In principle, generic MCMC algorithms such as the one explored in (66) can be used, but we have found this approach to be slow, prone to local minima, and poor at switching between different parses. Instead, inspired by the speed of human perception and approaches for faster inference in probabilistic programs (67), we explored bottom-up methods to compute a fast structural analysis and propose values of the latent variables in BPL. This produces a large set of possible motor programs – each approximately fit to the image of interest. The most promising motor programs are chosen and refined with continuous optimization and MCMC.”
  16. “A candidate parse is generated by taking a random walk on the character skeleton with a “pen,” visiting nodes until each edge has been traversed at least once. Since the parse space grows exponentially in the number of edges, biased random walks are necessary to explore the most interesting parts of the space for large characters. The random walker stochastically prefers actions A that minimize the local angle of the stroke trajectory around the decision point…”
  17. For the ANN they used cafe, and took a network that works well on MNIST
    1. <But it seems like this system doesn’t have any of the special engineering that went into this that deals specifically with strokes as opposed to whole images>
    2. “The raw data was resized to 28 x 28 pixels and each image was centered based on its center of mass as in MNIST. We tried seven different architectures varying in depth and layer size, and we reported the model that performed best on the one-shot learning task.”
    3. <This may make the task easier, but MNIST deals with a small number of characters, many of which are much less complex than some of the characters used here.   It might be the case that some of the more complex characters can’t be accurately reduced to such a small size, so this may be hobbling performance>
    4. Also the network is not very deep – only 2 conv layers and a max-pooling
    5. “One-shot classification was performed by computing image similarity through the feature representation in the 3000 unit hidden layer and using cosine similarity.”
    6. They used a smaller net for the 1-shot classification with less data, <so that was nice of them>
  18. The full “Siamese network” did work on the 105×105 image, had 4 conv layers and 1 standard hidden layer.  Parameters were optimized with Bayesian method
  19. “The Hierarchical Deep model is more “compositional” than the deep convnet, since learning-to-learn endows it with a library of high-level object parts (29). However, the model lacks a abstract causal knowledge of strokes, and its internal representation is quite different than an explicit motor program. “
  20. For data collection “The raw mouse trajectories contain jitter and discretization artifacts, and thus spline smoothing was applied.”
  21. <Ok, skipping the rest>

Temporal Segmentation and Activity Classification from First-person Sensing, Spriggs, De La Torre, Hebert. CVPR 2009.

  1. “We explore first-person sensing through a wearable camera and Intertial Measurement Units (IMUs) for temporally segmenting human motion into actions and performing activity classification in the context of cooking and recipe preparation in a natural environment.”
  2. Try:
    1. Gaussian Mixture Models
    2. HMMs
    3. K-NN
    4. Also try unsupervised methods so that annotation isn’t necessary
  3. Large number of references to prior work
  4. The IMUs used here are set up as:
    1. 5 on the wrist
    2. <5?> on ankles
    3. 1 on waist
  5. And then there is head-cam
  6. There is ambiguity in how to label what is going, especially because people did the same task differently.
    1. Also huge partial observability (even in terms of what is in view of the camera)
  7.  Action clauses are distinct per recipe (at least partially), 29 clauses for brownies
  8. For unsupervised segmentation, they try PCA, then cluster that
    1. Performance is about 70% correct with this method
    2. These features can also be used to try and classify what recipe is being made
  9. Recipe classification is perfect on the small dataset when using data from IMUs
  10. Tried unsupervised clustering of IMU data with HMM but it came up with garbage
  11. They then merge video and IMU data <but its not totally clear how they accomplish this in terms of representation> again, run through PCA and HMM
  12. Recipe classification with this method was ~93%, so just using IMUs is better
  13. Then they move onto supervised, with annotation
  14. ~80% of frames annotated, “stirring the mix” takes up about 25% of labeled frames
    1. They trained only on frames where annotation was available
  15. Then do classification with supervised HMM (poor classification~ 10%) and k-NN (much better at ~60%)
    1. They argue perf of k-NN is from high D data <not sure why that is a good argument though>
Tagged , , , ,

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Guo, Singh, Lee, Lewis, Wang. NIPS 2014

  1. Model-based Atari planners are better than the famous DQN, but very slow
  2. Goal here is to build a real-time MCTS agent better than DQN
  3. Use UCT to train NN, 300 max depth, 10k trajectories (minor modifications for a couple of games)
  4. A few particular approaches:
    1. Do 800 episodes in the game through UCTs policy, and learn the values computed by UCT to train the NN
    2. Use UCT as above, but to use NN to train a classifier to pick an action
      1. One issue is that the policy generated by these approaches will in general be different from that found by UCT during search, so the values will not be for the policy used
    3. So they propose an “interleaved” approach where part of the behavior is on-policy UCT, and part UCT is used to compute values, but the actions taken are according to NN
  5. Learn in 84×84 greyscale
  6. Use same NN arch as DQN
  7. UCT with classification is as good or better as DQN,  UCT with regression is worse
  8. To be consistent with DQN paper, they leave in 5% exploration for recording quality, although it doesn’t make any sense to do that when evaluating performance
  9. UCT without any realtime constraints really beats up on everything (roughly double score of best)
  10. Also, the emulator runs at 60 fps, not 30 that might be expected <the 60 is consistent with atari hardware, I checked>
  11. Interleaved method performance jumped %10 when doubling trajectories
  12. <I thought they were going to try a DYNA 2 approach, wonder how it would work>
  13. They observed some horizon effects in UCT where some good behavior requires really long timescales that couldn’t be reasoned about

HC-Search: Learning Heuristics and Cost Functions for Structured Prediction. Rao Doppa, Fern, Tadepalli. AAAI 2013

Won best paper award

  1. “Structured prediction is the problem of learning a function from structured inputs to structured outputs.”
  2. This paper deals with structured prediction applied to search (although it is not the first such paper)
  3. “Given a structured input, the framework uses a search procedure guided by a learned heuristic H to uncover high quality candidate outputs and then uses a separate learned cost function C to select a final prediction among those outputs.”
  4. The regret can be found as inability of H to find good results, and then C not selecting the best output <not sure what these two things mean exactly yet>
  5. Based on these two factors of regret, regret is minimized first by “training H to quickly uncover high quality outputs via imitation learning, and then training C to correctly rank the outputs generated via H according to their true losses.”
  6. <Claim> Excellent empirical results <this is probably true otherwise it would not win best paper award>
  7. Structured prediction is just supervised learning.  <Although it doesn’t seem like it is always part of structured learning> They leverage a cost function <which I suppose must be equivalent to an error metric>
  8. In general minimizing the cost function is intractable, and in practice “… heuristic inference methods such as loopy belief propagation and variational inference have shown some success in practice.  However, the learning algorithms generally assume exact inference and their behavior in the context of heuristic inference is not well understood.”
  9. <It seems like the problem is discrete, because they say that if the optimization problem is a tree then it is tractable, and also…>
  10. “Another approach to dealing with the Argmin problem is to eliminate it altogether and instead attempt to learn a classifier that can greedily produce a structured output by making a series of discrete decisions (…)  Unfortunately, in many problems, some decisions are difficult to make by a greedy classifier, but are crucial for good performance.”
  11. Mention some method which seems to be state of the art at the moment, but mention that it has a number of serious limitations that will be addressed here
  12. <I honestly am not sure what type of paper I’m reading here so far.  Is it search, optimization, or supervised learning?>
  13. The decomposition of the problem into H,C “… the performance of the approaches with a single function can be arbitrarily bad when compared to that of HC-Search in the worst case.”
  14. In practice, HC-Search also works better<, so it seems like a case where assumptions in the theory hold up>
  15. “Finally, we note that our methodology can be viewed as similar in spirit to Re-Ranking (Collins 2000), which uses a generative model to propose a k-best list of outputs, which are then ranked by a separate ranking function. In contrast, we search in efficient search spaces guided by a learned heuristic that has minimal restrictions on its representation.”
  16. <Seems like> the setting is like standard supervised learning, except you are also provided with a cost function that must be optimized <I suppose most supervised learning algorithms assert what cost they are optimizing, such as MSE, which might not be the case in the problem defined here>
  17. There is a sort of forward model given <seems like it can do multistep look ahead, starting with x and proposing different possible solutions y>
  18. <So it seems like there is a sort of combinatoric search problem>
  19. Both learned heuristic H and learned cost function C are mappings from × Y to reals.
    1. In addition there is some search strategy <seems like here they just assume greedy>
  20. Given some x and some time bound, is run using H and the best item found in terms of C during the whole search is returned
  21. They “… consider the feasibility of an exact solution to this learning problem in the simplest setting of greedy search using linear heuristic and cost functions represented by their weight vectors wH and wC respectively.
  22. They “… consider the HC-Search Consistency Problem, where the input is a training set of structured examples, and we must decide whether or not there exists wH and wC such that HC-Search using greedy search will achieve zero loss on the training set.”
    1. This problem is NP-Hard.  In fact, both problems of finding a zero-loss heuristic and cost are np-hard
  23. Even though the problem is NP-hard, the decomposition of and C still helps
  24. Because the problem is intractable, they resort to greedy stagewise search for and C
  25. Most generally, learning a heuristic can be viewed as a Reinforcement Learning (RL) problem where the heuristic is viewed as a policy for guiding “search actions” and rewards are received for uncovering high quality outputs. In fact, this approach has been explored for structured prediction in the case of greedy search (Wick et al. 2009) and was shown to be effective given a carefully designed reward function and action space. While this is a viable approach, general purpose RL can be quite sensitive to the algorithm parameters and specific definition of the reward function and actions, which can make designing an effective learner quite challenging. Indeed, recent work (Jiang et al. 2012), has shown that generic RL algorithms can struggle for some structured prediction problems, even with significant effort put forth by the designer.”
    1. They go with a non-RL approach, imitation learning
  26. Their formulation of imitation learning is distinct from supervised learning in that they only care that it helps the heuristic make correct choices.  Basically the only concern is that it gets the ordering of the values of the options correct, as opposed to getting the values themselves right
    1. This type of strategy is called rank-based
    2. “Most common search procedures such as greedy search, beam search, and best-first search fall under this category.”
  27. The rank-based learner they used is called “the online passive-aggressive algorithm”
  28. The cost-function learner <seems> to be trained to accurately predict the data from the distribution that A and H yield
  29. Their experimental results cover large standard test sets, such as handwriting recognition, scene labeling and some others
  30. “The results in the scene labeling domain are the most significant improving the error rate from 27.05 [the previous state of the art] to 19.71 .”
  31. “One of the advantages of our approach compared to many frameworks for structured prediction is the ability to use more expressive feature spaces without paying a huge computational price.”
  32. Outperforms other methods basically across the board <are there other state of the art methods here that aren’t listed?>
  33. Plain C-Search (as opposed to HC-Search) doesn’t work as well, and even with 3rd order features works poorer than HC-Search with 2nd order features.  The difference in performance can be attributed to the fact that HC-Search optimizies for the distribution found by H, which plain C-Search does not
  34. When giving the true loss function instead of the learned loss function H, improvements in performance weren’t significant

How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis. Escalante-B, Wiskott. CogPrints 2013.

  1. Their extension for supervised SFA is called graph-based SFA
  2. “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
  3. Trained with a graph where edge weights represent similarities
  4. The modification to SFA made here is that it accepts weights
  5. There are different ways of building these graphs, a very simple method generates results equivalent to the Fisher linear discriminant
  6. Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
    1. Here, a supervised dimension reduction step is proposed
  7. “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports nonde-weights, and uses function spaces.”
  8. GSFA can be used for both regression or classification, many approaches only work for one of the two
  9. “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
  10. In the graph, there are edge weights along with node weights, which specify a-priori sample properties
  11. “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with the number of parameters than if a single SFA node with a huge input is used, leading to better generalization.”
    1. Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
  12. In graph edges are undirected, weighed, although it seems that the approach trivially generalizes to the directed case
  13. Basically they rewrite the original constraints of SFA with added weights
  14. Non-existing edges are given 0-weight
  15. Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
  16. How to construct the graph for either classification or regression is then discussed
  17. For graphs, they simply generate a separate graph for each class, with each item in each graph fully connected, and each sub-graph completely unconnected to items in a separate class, so basically there are independent fully connected components for each class
    1. There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
  18. What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
  19. “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…).  This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for sample of the same identity, with moderate noise.”
  20. <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
    1. Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
  21. Now on to regression
  22. For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn.  In general, SFA will not extract the label values exactly.  However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values.  After SFA reduces the dimensionality of the data, a complimentary explicit regression step on a few features solves the original regression problem.”
  23. They discuss 4 ways of doing the regression for SFA, the first one actually doesn’t even leverage
  24. In the version that doesn’t leverage graphs, simply sort data and then pass into SFA.  “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
    1. On the other hand, its the easiest to implement (partially because vanilla SFA can be used) so “… we recommend its use for first experiments.”  If that doesn’t work well, use the GSFA approaches
  25. In the “sliding window training graph” items are sorted as above, but each vertex is connected to the d closest left and right items
  26. They recommend not using just 0 and 1 weights as it leads to “pathological solutions” – this may be what we’re picking up in ToH, and talk about why that happens. <This is worth investigating further.>
  27. In the “serial training graph,” data points are binned together – then points are all connected to points in adjacent bins, but they don’t connect all the points in a same bin  <why?>
    1. As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
    2. Naturally, there is tuning required to see that the binning was done correctly
  28. The “mixed training graph” adds connections within a bin
  29. Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
  30. “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels. ” <
    1. First option is linear or nonlinear regression
    2. To bin and then classify <so you end up with discrete approx of regression?>
    3. Do a weighted version of #2 so you get continuous estimations
    4. <#s 2 and 3 immediately above look terribly hacky, if I am groking them correctly>
  31. Experimental results
  32. For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
    1. Interestingly in the benchmark task used, convnets are best, and outperform humans
  33. In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size.  This is all done separately <why?  Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
  34. Take face pictures from a number of data sets (a total of 64,471) and were “… automatically pre-processed through a pose-normalization and pose-reintroduction step. Basically they are all centered and then from there shifted and zoomed according to a distribution.  This way, they know what the x,y,z values they are estimating are
  35. Because of the size of the corpus and images themselves, its difficult to apply algs like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great>)
    1. They also do a hierarchical version of PCA, which sort of does the opposite thing of SFA.  The 120 HPCA features used explain 88% of the variance
  36. Used a few different post-dimension reduction classifiers, including SVM
  37. The slow features of the data gets organized in a more orderly fashion as go up in hierarchy
  38. “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA.  In turn, reordering was at least 10% better than HPCA for this dataset.”
  39. Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
  40. Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regards to mixed as it regular sliding window just seems like a strange construction).  The serial was actually better than the mixed, although the difference wasn’t significant
  41. They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
  42. “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
  43. Knock unsupervised dimension reduction by doing dimension reduction that doesn’t necessarily help in the task you are actually interested in <But this is only “implictly” supervised, by the same logic fully supervised dimension reduction would be better yet.>
  44. Being able to simply specify a graph means there is no need to exhaustively harvest data from a graph you may already have, as is the case in standard SFA
  45. GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used

Scalable kNN Search on Vertically Stored Time Series. Panagiotis Karras. Talk

  1. Want a method that scales knn on time series efficiently both in terms of number of time series as well as length of the time series
  2. Methods are based on some form of dimension reduction (PCA is common) 
    1. Because dimensions are thrown out, introduces some innacuracy.  Want to establish that the distance you have here is a lower bound on the actual distance
  3. I guess this projects the time series into feature vectors of same size
    1. Actually, all time series must be the same length
  4. VA-file is an algorithm that does not do indexing, which was unique at the time it was published, won a 10-year best paper award
  5. But in general there is some compact representation that indexes the large data, and you have to go back and forth between the compact and full data
  6. The approach here introduces a step when going from the index to the full data
  7. The idea is to have a sequences of indices – from very compact (and erroneous) to progressively larger and more complete
  8. This can be improved further by having all the indices together to comprise the full data set
  9. This method, however, requires both an upper and lower bound on distances (both bounds are needed to do more pruning)
  10. The derivation of the bounds are based on a Haar Wavelet Transform
    1. Constructs a tree, where each node has some piece of information that is used to assemble the final information at the leaf (for example, numbers in nodes that you add together going from root to leaf
  11. The method used here has both an upper and lower bound, as well as much tighter bounds than were previously used
  12. <Not going into details of how the bounds are actually computed>
  13. As you would expect, the empirical results are good
  14. Multiple type of transforms can be used (cosine, fourier)

PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Diesenroth, Rasmussen. ICML 2011

  1. Paper uses model-building along with policy search to get better sample complexity
  2. Discusses the problem of model-bias.  I think this is a problem only in cases where you aren’t doing smart exploration, though.  The papers he references for that are from the mid-late 90s
  3. Anyway, they propose using a probabilistic model of the dynamics, they use non-parametric parametric gaussian processes for this
  4. The uncertainty is also incorporated into planning and policy evaluation
  5. PILCO stands for probabilistic inference for learning control
  6. There are some other references for how to deal with model uncertainty that are mentioned
  7. They build a separate model for each dimension (of the next state being predicted)
  8. Does policy gradients based on the stochastic model
  9. The gradient is estimated analytically, which is much better than estimating it from sampling
  10. Its very surprising this actually works on real robots
    1. (its a little strange that they say the data they report is “typical”, not best or worst case – they should just present all the data and let the reader decide)
  11. Although the models are separate for each dimension, they covary.  PILCO also takes the correlation of all control and state dimensions into account during planning and control
  12. They dont mention how they came up with the basis functions for the controller?
  13. Definitely not groking the math now, but its a neat paper.
  14. Learns swing up in about 90 seconds of interaction
  15. Mentions problems talked about by Erez in that convergence is only locally optimal, and that in domains where there are plateaus in terms of parameters/policy, the algorithm can have trouble
    1. The algorithm also benefits from shaping
  16. When they used the same approach but threw out the uncertainty of the GP it failed
  17. Experiments here did not use expert demonstration; the first few trajectories are randomized

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers


  1. Main goal of the work is to “interchange” the discretization components of:
    1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
      1. Underlying component is Incremental Regression Tree Induction (IRTI)
    2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
      1. Underlying component is HOO
  2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
  3. Also include the idea of transposition tables
  4. Work examines behavior, computation time, computational complexity, and memory of both algs
  5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations


Ch 1: Introduction

  1. In terms of doing the actual action planning, talks of two possible options:
    1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
    2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
  2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originially TLS was coupled with IRTI and HOLOP with STL)
    1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
    2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
    3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
    4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
  3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
  4. The Basic Research Questions for the Thesis are:
    1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
    2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
    3. How can the information retreived from simulations be reused for TLS
    4. Can a transposition tree increase the performance of simulation based systems in continuous environments
    5. Which combination of techniques is best
    6. What is the memory cost of these algorithms
    7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

  1. Discusses use of Zobrish hashing for high-performance transition tables.  I’ve never heard of it.

Ch 3: Automatic Decomposition

  1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
  2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
  3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
  4. Discussion of scaling of the exploration factor (commonly called C)
    1. The propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
    2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
    3. But they also mention that if vmin, vmax are known for the problem it can be used, so thats all fine.  If you don’t know that you need to resort to something like above
  5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option

Ch 4: Multi-step Optimization in Continuous Environments

  1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
  2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
  3. If a sample causes a split in a leaf in an internal leaf, the trees below that point must be discarded
    1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
  4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
  5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>

Ch 5: Continuous Transposition Tree

  1. Two methods for constructing transposition tables are introduced
  2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
  3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
  4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
  5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
    1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
  6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
  7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to those that has the highest value according to a particular computation, I think it is basically like UCB, which allows for exploration.
  8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
  9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>

Ch 6: Experiments and Results

  1. UCT here isn’t the normal version of UCT, somewhat between HOSTL and IRTI
  2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
  3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but choses best found), and a version of UCT that uses unifom discretization
  4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but anyway as more time is given IRTI will beat UCT anyway)
  5. Good parameterizations are found for each problem separately.
  6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, vanilla-MC actually outperfoms UCT
  7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
  8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
  9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
    1. Here, RMTL (original tree-learning search) performs best by a wide margin
    2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
    3. This is under time constraints
  10. <p. 35They test MTT by itself which I’m not exactly groking because I thought it was a technique for allowing tranposition tables to be used and not a policy, but I’m missing something>
  11. Have very nice experiments of sample distributions on 1-step optimization problems
  12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess cause they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because it requires a test confirming statistical significance to split>
  13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Experiments and Results

  1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
  2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
  3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

What can 20,000 models teach us? Lessons learned from a large-scale empirical comparison of learning algorithms Alexandru Niculescu-Mizil. Talk

  1. This talk covered the speaker’s experience in implementing a large number of supervised learning methods and his observations of performance.
  2. There were some parts that some members of the audience disagreed with, such as the way probabilities were inferred from the classifiers (not all of them are designed to allow this, so making this happen is a bit off-label), and the way regression algorithms were made to perform classification is a problem not restricted to this talk, but overall there was definitely a large amount of useful information.  Its also worth pointing out the speaker was very open about where we should draw conclusions from results and where not to, which was refreshing.
    1. Also worth pointing out that he found new parameterizations for each validation fold, which I thought is normally not done, I think this helps brittle algorithms too much
    2. He had many different metrics and combined them in a way I thought could be improved upon (although it was reasonable).  Things were normalized between performance probability levels and the best algorithm; I would have used z-scores or something similar.
  3. The speaker pointed out the problems he worked on were of medium size (for classification, binary, 10-200 features, 5000 (I think I dropped a zero samples)), so this is not big data, but definitely medium-sized
  4. 10 learning algorithms with numerous parameterizations tested, which is where the 20,000 models comes from
  5. The performance of the algorithms as initially tested was surprising.   The order from best to worst was:
    1. Bagged Trees
    2. Random Forests
    3. ANNs (I was very surprised to see them this high)
    4. Boosted Trees (I expected these above bagged trees and random forests)
    5. KNN (I thought this would be lower)
    6. SVMs (This low?  I feel like something may have been funny with the implementation or parameter search)
    7. Boosted Stumps
    8. Decision Trees (This was a shocker)
    9. Logistic Regression
    10. Naive Bayes (Naturally)
  6. Boosted trees in particular was #1 in the most categories, but was particularly bad in a few
  7. Also, almost every algorithm was best in at least one dataset and/or metric overall, so the speaker was clear in pointing out that this list should only inform you of the order to try methods on your particular problem until you find one that works
  8. Discussed the use of Platt scaling and Isotonic regression to correct some problems in boosted trees, after which the algorithms performed better than anything else.  I had not heard of these approaches before.
  9. There was a lot of focus on calibration in the talk
  10. The punchline?  Use ensemble methods.  The ensembles by far had consistently the best performance, unlike the best non-ensemble methods, that were outperformed depending on the task.
  11. Overall, he said the best performance metric is cross entropy.  I am familiar with this phrase in terms of optimization only.