Category Archives: Classification

One-shot learning by inverting a compositional causal process. Lake, Salakhutdinov, Tenenbaum. NIPS 2013

  1. Deals with one-shot learning
  2. “…a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image.”
  3. In 1-shot learning did about as well as people, and better than deep learning methods
  4. People can learn a new concept from a tiny amount of data – can learn a class from one image (which is a very high dimensional piece of data)
  5. Even MNIST, which is an old and small dataset, still has 6k samples/class, but people often only need 1 example
  6. “Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. ”
    1. Looks like people both have a rich hypothesis space (because they can do all the above with a small amount of data), but also don’t overfit, which is the theoretical downside to having a large hypothesis class.  How do they do it?
  7. Here, focus is on handwritten characters
    1. Idea is to use something more natural and complex than simple synthetic stimuli, but something less complex than natural images
  8. Use a new omniglot dataset, which has 16k characters with 20 examples each
    1. Also has time data so the strokes are recorded as well
  9. “… this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example.”
  10. Also use the method to generate new examples of a class, and then do a Turing test with it by asking other humans which was human generated and which was machine generated
  11. The HBPL “…is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes.”
  12. The model attempts to find a “structural description” that explains the image by breaking the character down into parts
  13. A Character is made of:
    1. A set of strokes
      1. Each stroke is made of simple sub-strokes modeled by a “uniform cubic b-spline” and is built of primitive motor elements that are defined by a 1st order Markov Process
    2. Set of spatial relationships between strokes, can be:
      1. Independent: A stroke that has a location independent of other strokes
      2. Start/end: A stroke that starts at beginning/end of another stroke
      3. Along: A stroke that starts somewhere along a previous stroke
  14. ” Each trajectory … is a deterministic function of a starting location … token-level control points … and token-level scale …. The control points and scale are noisy versions of their type-level counterparts…”
  15. Used 30 most common alphabets for training, and another 20 for evaluation.  The training set was used to learn hyperparameters, a set of 1000 primitive motor elements, and stroke placement.  They attempted to do cross-validation within the training set
  16. The full set of possible ways a stroke could be created is enormous, so they have a botto-up way of finding a set of the K most likely parses.  They approximate the posterior based on this finite, size-K sample based on their relative likelihoods
    1. They actually then use metropolis-hasting to get a number of samples of each parse with a little variance each to get a better estimate of the likelihoods
  17. “Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables…”
  18. Results
  19. For the 1-shot tasks, a letter from an alphabet was presented with 20 other letters from the same alphabet.  Each person did this 10 times, but each time was with a totally new alphabet, so no characters was ever seen twice
  20. Get K=5 parses of each character presented (along with MCMC), and then run K gradient searches to reoptimize the token-level variables to fit the query image.
  21. They can also, however, attempt to reoptimize the query image to fit the 20 options presented
  22. Compare against:
    1. Affine model
    2. Deep Boltzmann Machines
    3. Hierarchical Deep Model
    4. Simple Strokes (a simplified HBPL)
    5. NN
  23. Humans and HBPL ~4.5% error rate, affine model next at 18.2%
  24. Then they did one-shot Turing test where people and algorithms had to copy a single query character
    1. <For what its worth, I think Affine looks better than both results from people and HBPL>
  25. In the “Turing test” there was feedback after each 10 trials, for a total of 50 trials
    1. <Note that this test doesn’t ask which character looks best, it is which is most confusable with human writing (which is pretty sloppy from the images they show).  I’m curious if the affine model could be made more human just by adding noise to its output>
  26. <Playing devil’s advocate, the images of characters were collected on mTurk, and look like they were probably drawn with a mouse — that is to say I feel they don’t look completely like natural handwriting.  I wonder how much of this program is picking up on those artifacts?  At least in terms of reproduction, the affine method looks best>

 

Science 2015

  1. “Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (…). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (…). As programs, rich concepts can be built “compositionally” from simpler primitives
  2. “In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.”
  3. <Looks like exactly the same paper, just more brief.  The accuracies of both BPL and other methods seems improved here, though.  Convnets get 13.5% error; BPL gets 3.3%; people get 4.5%.  “A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors”>
  4. “BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here.”
    1. <Or equivalently you can just say BPL does better because it has a small and highly engineered hypothesis class>
  5. Also run BPL with various “lesions” and gets error rates in the teens.  Also did more poorly in the “Turing test” part
  6. Instead of training on 30 background alphabets, they also did with just 5, and there the error rates are about 4%; on the same set convnets did about 20% error
  7. Supplementary Material

  8. <I assumed that they would ask individuals who actually learned how to write the languages to do the recordings.  Instead, they just took pictures of characters and had people write them.  This seems like a problem to me because of inconsistencies in the way people would actually do the strokes of a letter in an alphabet they do not know.>
  9. <Indeed, they were also drawn by mouse in a box on a screen, which is a very unnatural way to do things>
  10. <From what I can tell the characters are recorded in pretty low resolution as well which looks like it can cause artifacts, looks like 105×105>
  11. <This basically has the details that were included in the main part of the NIPS paper>
  12. Some extra tricks like convolving with Gaussian filter, randomly flipping bits
  13. Primitives are scale-selective
  14. “For each image, the center of mass and range of the inked pixels was computed. Second, images were grouped by character, and a transformation (scaling and translation) was computed for each image so that its mean and range matched the group average.”
  15. ” In principle, generic MCMC algorithms such as the one explored in (66) can be used, but we have found this approach to be slow, prone to local minima, and poor at switching between different parses. Instead, inspired by the speed of human perception and approaches for faster inference in probabilistic programs (67), we explored bottom-up methods to compute a fast structural analysis and propose values of the latent variables in BPL. This produces a large set of possible motor programs – each approximately fit to the image of interest. The most promising motor programs are chosen and refined with continuous optimization and MCMC.”
  16. “A candidate parse is generated by taking a random walk on the character skeleton with a “pen,” visiting nodes until each edge has been traversed at least once. Since the parse space grows exponentially in the number of edges, biased random walks are necessary to explore the most interesting parts of the space for large characters. The random walker stochastically prefers actions A that minimize the local angle of the stroke trajectory around the decision point…”
  17. For the ANN they used cafe, and took a network that works well on MNIST
    1. <But it seems like this system doesn’t have any of the special engineering that went into this that deals specifically with strokes as opposed to whole images>
    2. “The raw data was resized to 28 x 28 pixels and each image was centered based on its center of mass as in MNIST. We tried seven different architectures varying in depth and layer size, and we reported the model that performed best on the one-shot learning task.”
    3. <This may make the task easier, but MNIST deals with a small number of characters, many of which are much less complex than some of the characters used here.   It might be the case that some of the more complex characters can’t be accurately reduced to such a small size, so this may be hobbling performance>
    4. Also the network is not very deep – only 2 conv layers and a max-pooling
    5. “One-shot classification was performed by computing image similarity through the feature representation in the 3000 unit hidden layer and using cosine similarity.”
    6. They used a smaller net for the 1-shot classification with less data, <so that was nice of them>
  18. The full “Siamese network” did work on the 105×105 image, had 4 conv layers and 1 standard hidden layer.  Parameters were optimized with Bayesian method
  19. “The Hierarchical Deep model is more “compositional” than the deep convnet, since learning-to-learn endows it with a library of high-level object parts (29). However, the model lacks a abstract causal knowledge of strokes, and its internal representation is quite different than an explicit motor program. “
  20. For data collection “The raw mouse trajectories contain jitter and discretization artifacts, and thus spline smoothing was applied.”
  21. <Ok, skipping the rest>
Advertisements

Large-scale Video Classification with Convolutional Neural Networks. Karpathy, Toderici, Shetty, Leung, Sukthankar, Fei-Fei

  1. Convolutional NN on classification of 1 million youtube videos
  2. Propose a “multilevel, foveated architecture” to speed training
  3. Created a new dataset for this paper: 1-million sports videos belonging to 487 classes of sports (about 1k-3k vids/class)
  4. They are interested in video (so need to factor in temporal information, unlike more common image classification)
  5. Because the networks for learning video have to be so huge, propose a 2-part system to make learning tractable, which involves a context stream on low-res and fovea stream on high-res.  This way is about 3x as fast as naive approach and as accurate
  6. They also use the features learned from this dataset to then improve performance on a smaller corpus (going from 41.3% to 65.4%)
  7. A common approach to video classification involves processing (into something analogous to a bag of words model) and then throwing into a SVM
  8. Previous activity recognition benchmarks are small in terms of sample size, which don’t work well with NNs, so they propose a huge data set
  9. Unlike images that can somewhat easily be rescaled to the same format, there is more variability in video, including the addition of temporal length
  10. Here they treat videos as a bag of short fixed-length clips
  11. They considered at least 4 different classes of topologies for the task, all fundamentally different (structurally, not in terms of how/where pooling is done for example)
    1. Basically, this comes down to where you deal with the temporal aspect, is it at the input layer, or is the input processed as a still and then merged with temporal data further up?
  12. Different architectures
    1. Single-Frame: just process stills alone
    2. Early-Fusion: have the first convolutional layer extend back some number of frames in the past
    3. Late-Fusion: Starts with 2 single-frame networks, (for past 2 frames) and merges their results in “the first fully connected layer”
    4. Slow-Fusion: A more graduated version of merging data (lower levels have less temporal data, higher levels have more)
  13. In order to speed training they tried to reduce the number of weights, but it made classification worse.  Making images lower res sped things up too but also made perf worse.  Then they did the 2-part system (context, fovea)
    1. Context is 178 x 178 but whole pic
    2. Fovea is 89 x 89 but centered with original resolution
  14. Optimize NN with “Downpour Stochastic Gradient Descent”
  15. Use data augmenting to prevent overfitting
  16. Training took a month, although they say their perf isn’t optimized and could run better on GPUs
  17. They also tried to run the features generated by their NN through linear classifiers, but letting the NN do the classification worked better
  18. Lots of incorrectly labeled videos, still performance is said to be good
  19. Slow fusion works the best (although difference between others isn’t enormous)
  20. Camera motion can mess up the classifiers
  21. Errors tend to be in related areas (ex hiking vs backpacking)
  22. Then learning transfer to smaller UCF-101 database
    1. Worked best when retraining the top 3 layers of the net
Tagged ,

Quantifying the Internal Structure of Categories Using a Neural Typicality Measure. Davis, Poldrack. Cerebral Cortex 2014

  1. Deals with the internal structure/representation of category information
  2. <Seems like assumption is there is something of an exemplar representation>
  3. “Internal structure refers to how the natural variability between-category members is coded so that we are able to determine which members are more typical or better examples of their category. Psychological categorization models offer tools for predicting internal structure and suggest that perceptions of typicality arise from similarities between the representations of category members in a psychological space.”
  4. Based on these models, develop a “neural typicality measure” that checks if a category member has a pattern of activation similar to other members of its group, as well as what is central to a neural space.
  5. Use an artificial categorization task, find a connection between stimulus and response
    1. “find that neural typicality in occipital and temporal regions is significantly correlated with subjects’ perceptions of typicality.”
  6. “The prefrontal cortex (PFC) is thought to represent behaviorally relevant aspects of categories such as
    rules associated with category membership (…). Motor and premotor regions may represent habitual responses associated with specific categories (…). The medial temporal lobe (MTL) and subregions of the striatum are thought to bind together aspects of category representations from these other systems.”
  7. Different areas and different neurons and patterns of activation in an area can “reliably discriminate
    between many real world object categories”
  8. Consider examples of category data as having some sort of “internal structure” or feature representation specific to that class.
    1. These features can say things like how typical a concrete example is, and is related to how quickly and accurately classification occurs
  9. “Depending on the specific model, a category representation may be a set of points associated with a given category (exemplar models; …), a summary statistic ( prototype models; …), or a set of statistics (clustering models; …) computed over points associated with a category.”
  10. Items closer to other examples in the class, or to the prototype are considered to be most typical or likely
  11. But they don’t propose that an accurate model is exactly the same thing a computer does, as there are examples of where nonintuitive things happen.
    1. Ex/ culture can influence how things are categorized, as can a current task or other context
  12. “Here, our goal is to develop a method for measuring the internal structure of neural category representations and test how it relates to physical and psychological measures of internal structure.”
  13. The neural typicality measure is related to nonparametric kernel density estimators, but “A key difference between our measure and related psychological and statistical models is that instead of using psychological or
    physical exemplar representations, our measure of neural typicality is computed over neural activation patterns…”
  14. Use a well studied research paradigm of categorizing simple bird illustrations into 4 categories based on neck angle and leg length.  Previous results show people reconstruct classes based on average item for each category
  15. “Our primary hypothesis is that psychological and neural measures of internal structure will be linked, without regard to where in the brain this might occur.”
    1. Also expect that some categorization will happen in visual cortex, and higher level temporal and medial-temporal regions, which “…. are theorized to bind together features from early visual regions into flexible conjunctive category representations (…).”
    2. There are other parts relevant to categorization, but not particularly this form of visual categorization, and other parts may be sensitive to things like entropy
  16. “To foreshadow the results, we find that neural typicality significantly correlates with subjects’ perceptions of typicality in early visual regions as well as regions of the temporal and medial temporal cortex. These results suggest that neural and psychological representational spaces are linked and validate the neural typicality measure as a useful tool for uncovering the aspects of category representations coded by specific brain regions.”
  17. “For analysis of behavioral responses, response time, and typicality ratings, a distance-to-the-bound variable was constructed that gave each stimulus’ overall distance from the boundaries that separate the categories in the stimulus space. Distance-to-the-bound is a useful measure of idealization: items that are distant from the bound are more idealized than items close to the bound (…).”
  18. “For the psychological typicality measure, a value for each of the Test Phase stimuli was generated by interpolating, on an individual subjects basis, a predicted typicality rating from the subjects’ observed typicality ratings…”
  19. Also did a physical typicality measure, which is pretty simple to understand (just neck angle, leg length measurements)
  20. Then a neural typicality <too much details to list here>
    1. “Our neural typicality measure is based on similarities between multivariate patterns of activation elicited for
      stimuli in the task. Stimuli that elicit activation patterns that are like other members of their category are more neurally typical than those that elicit dissimilar patterns of activation.”
  21. Subjects’ behavioral responses were predicted by SVM
  22. Typicality ratings were highly correlated with distance-to-the-bound
    1. Reveals that most typical items, and not the average item are the one that is used for category representation.  There are a few other results that show this is the case through other methodology
  23. Neural typicality is linked to psychological typicality
  24. Found activity in visual cortex and MTL that have been found to be linked to categorization
  25. “These results suggest that, in the present task, the internal structure of neural category representations in temporal and occipital regions are linked to subjects’ psychological category representations such that objects that are idealized or physical caricatures of their category elicit patterns of activation that are most (mathematically) similar to other members of their category.”
  26. “… in the present task, physical similarity is not a significant contributor to the internal structure of neural category representations, at least not at a level that is amenable to detection using fMRI.”
  27. Also did MDS for classification on the neural data, <results don’t seem amazing, but only ok>
  28. SVM for classification “The SVMs are given no information about the underlying stimulus space, and unlike
    the MDS analysis, do not make any assumptions about how the dimensions that separate the categories will be organized. Thus, the SVMs can be sensitive to regions that code rulebased or behavioral differences between categories, regions that encode information about their perceptual differences, or regions that code some combination of behavioral and perceptual information.”
  29. “Although there is strong overlap in the visual and MTL regions that discriminate between categories and represent
    internal structure, the motor/premotor, insula, and frontal regions were only identified in the between-category analysis. These results are consistent with the hypothesis that PFC and motor/premotor regions are more sensitive to behavioral aspects of categories (…). However, because behavioral responses are strongly associated with the perceptual characteristics of each category, the SVM results are also consistent with the hypothesis that these regions contain some perceptual information about the categories.”
  30. “The present research adds to the growing consensus that categorization depends on interactions between a number of
    different brain regions… An important point that this observation highlights is that there may not be any brain region that can be thought of representing all aspects of categories, and thus it might be most accurate to think of brain regions in terms of the aspects of category representations that they code.”
  31. “…in the present context, the deactivation of regions of the striatum with increasing typicality likely indicates an uncertainty signal, as opposed to category representation…”
  32. “Because our neural typicality measure is not based on mean activation-level differences between stimuli, it may be
    more directly interpretable and less susceptible to adjacency effects in studies of longer term internal category structure.”

    1. <Hm, should read their methodology more carefully on another read-through>
  33. They don’t have results that indicate suppression of adjacent stimulus
  34. Says their methodology should be tested in real-world, and more artificial settings
  35. Evidence of “dimensional selective attention” where not all features are attended to for classificaiton
    1. “Attentional mechanisms in the PFC that instantiate rule-based strategies (…) may contribute to selective attention effects by influencing neural representations in a top-down manner.”
    2. Although: “In the present context, dimensional selective attention is insufficient for explaining the idealization effect because dimensional selective attention affects an entire dimension uniformally… additional mechanisms are required.”
  36. “Attention has been found to create a spotlight around salient regions of visual space such that the processing of stimuli
    close to this location in space is enhanced (not just differences along a specific dimension of visual space; …). It is conceptually straightforward to predict that the same or similar spotlight mechanisms may affect the topography of stored neural stimulus representations, such that regions of a category space that contain highly idealized category members are enhanced and contribute more to categorization and typicality judgments than exemplars in ambiguous regions of category space.”
  37. Another model is one that specifically tries to “… to reduce prediction error and confusion between categories (…). In these models, category members are simultaneously pulled toward representations/members of their own categories and repelled by members of opposing categories.”
    1. But this doesn’t seem to be a possible explanation here because “… the neural effects as actual neuronal changes in regions of early visual cortex happen on a much longer scale than our task.”
  38. This study only tried to find correlation between “psychological” and “neurological” responses, but more in-depth exploration of their relationship is a good idea and left for future work
  39. “Our task involves learning to distinguish multiple categories, akin to A/B tasks, and so our finding that early visual cortex is involved with representing category structure may be at odds with theories emphasizing the role of task demands (as opposed to featural qualities) in determining which perceptual regions will be recruited to represent categories.”
    1. Although these distinctions may be an artifact of the type of analysis used

Comparing size and accuracy of decision trees with and without synchronic arcs

Right, well when predicting x’, the feature vector is (x, y, z, y’, z’), but not x’, so all the variables aside from the one we’re trying to predict with that decision tree are available as inputs.

oh, I see… so, that’s a bit stronger than needed, but if that
doesn’t produce much smaller trees, then I guess it’s harder than I
thought

In response to this, I reran the taxi/decision tree experiment with and without synchronic arcs, and compared the size of the decision trees and their accuracy in both cases.  In all cases, the accuracy of the decision trees with synchronic arcs was better (or at least as good) as the decision tree without.  As far as size goes, in some cases the trees with synchronic arcs are larger and in some cases they are not.

In order to avoid divide by zero problems, both classifiers are said to have at least one error; for some attributes the decision trees can perfectly predict the value with no error at all.

Comparing: @attribute taxi_x-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 1.0
	 (Num leaves w/o current / Num leaves w/ current): 1.0
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_y-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 1.0
	 (Num leaves w/o current / Num leaves w/ current): 1.0
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_delta_x-0 {-1,0,1}
	 (Tree size w/o current / Tree size w/ current): 2.625
	 (Num leaves w/o current / Num leaves w/ current): 3.3636363636363638
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute taxi_delta_y-0 {-1,0,1}
	 (Tree size w/o current / Tree size w/ current): 2.625
	 (Num leaves w/o current / Num leaves w/ current): 3.3636363636363638
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute passenger_x-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 0.46153846153846156
	 (Num leaves w/o current / Num leaves w/ current): 0.5
	 (Error rate w/o current / Error rate w/ current): 1.7999999999999998

Comparing: @attribute passenger_y-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 0.3
	 (Num leaves w/o current / Num leaves w/ current): 0.3333333333333333
	 (Error rate w/o current / Error rate w/ current): 19.0

Comparing: @attribute destination_x-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 1.0
	 (Num leaves w/o current / Num leaves w/ current): 1.0
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute destination_y-0 {0,1,2,3,4}
	 (Tree size w/o current / Tree size w/ current): 1.0
	 (Num leaves w/o current / Num leaves w/ current): 1.0
	 (Error rate w/o current / Error rate w/ current): 1.0

Comparing: @attribute action-0 {U,D,W,E,PickUp,DropOff,Refuel}
	 (Tree size w/o current / Tree size w/ current): 0.9667341934108113
	 (Num leaves w/o current / Num leaves w/ current): 0.9798368971938611
	 (Error rate w/o current / Error rate w/ current): 1.0632673515409672

Comparing: @attribute fuel-0 {-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13}
	 (Tree size w/o current / Tree size w/ current): 0.995260663507109
	 (Num leaves w/o current / Num leaves w/ current): 0.8274111675126904
	 (Error rate w/o current / Error rate w/ current): 25.999999999999996

Comparing: @attribute fuelDelta-0 {-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13}
	 (Tree size w/o current / Tree size w/ current): 3.3157894736842106
	 (Num leaves w/o current / Num leaves w/ current): 2.6
	 (Error rate w/o current / Error rate w/ current): 25.999999999999996

Comparing: @attribute reward-0 {20,-1,-10,-20}
	 (Tree size w/o current / Tree size w/ current): 2.022727272727273
	 (Num leaves w/o current / Num leaves w/ current): 1.605263157894737
	 (Error rate w/o current / Error rate w/ current): 21.999999999999996

Comparing: @attribute hasPassenger-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 0.9135802469135802
	 (Num leaves w/o current / Num leaves w/ current): 0.7971014492753623
	 (Error rate w/o current / Error rate w/ current): 20.0

Comparing: @attribute terminal-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 2.4
	 (Num leaves w/o current / Num leaves w/ current): 2.25
	 (Error rate w/o current / Error rate w/ current): 8.999999999999998

Comparing: @attribute wallN-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 1.2027027027027026
	 (Num leaves w/o current / Num leaves w/ current): 1.25
	 (Error rate w/o current / Error rate w/ current): 1.5913978494623655

Comparing: @attribute wallS-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 1.0704225352112675
	 (Num leaves w/o current / Num leaves w/ current): 1.117283950617284
	 (Error rate w/o current / Error rate w/ current): 2.5555555555555554

Comparing: @attribute wallW-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 2.15311004784689
	 (Num leaves w/o current / Num leaves w/ current): 2.33125
	 (Error rate w/o current / Error rate w/ current): 5.446808510638299

Comparing: @attribute wallE-0 {true,false}
	 (Tree size w/o current / Tree size w/ current): 3.0174418604651163
	 (Num leaves w/o current / Num leaves w/ current): 3.111940298507463
	 (Error rate w/o current / Error rate w/ current): 6.925925925925926

DecisionTreeBot versus Trackfire

Here is a video of the learned policy versus Trackfire.

In general, the policy seems to be:

  • If you are in the middle, use SpinBot.  This actually works well because Trackfire shoots alot, and most miss.  Since shooting drains energy, this is effective because the other tank just “punches itself out,” although I should point out the learner isn’t even modelling that
  • When near the wall, there are some regions where WallBot is used, and some where TrackFire is used, there is a corner (top right) where it goes from WallBot to SpinBot.  Of course, TrackFire vs TrackFire is 50/50, and WallBot has an advantage vs TrackFire.

But anyway, here’s the actual found strategy, its a bit more noisy than whats learned against WallBot, but there are clear regions defined:

Policy learned by decision tree against TrackFire

Policy learned by decision tree against TrackFire

And here’s a clip of it in action.  Sorry for the poor quality, the original video looks perfect, but Youtube manages to mangle it to the point its barely watchable.  Anyway, the agent is yellow and TrackFire is red.

Theres one funny part at about a minute where the agent is looping in SpinBot directly around the agent, and neither shoot each other until a timeout.

Ideas for future work

I’m writing on the assumption that we are basically done with Robocode – I don’t think we’ll be able to get results much better than I found in my last experiment.  In my opinion they should be just as happy with a good robocode solution as a good hillcar solution; both are essentially independent of the framework, so I think its more of an engineering project than a science project at this point.

That being said, here are the next 2 things I thought of potentially working on:

1. Yet another robot navigation scheme

We talked about this briefly before I left.  I’ve been reading up on MS’s robot studio (and played with it a bit).  I am thinking about developing a system which performs classification based on sensor readings (and therefore can do things like path planning based on sensor readings).  For example, learning that going forwards when the laser rangefinder pointing ahead of says an object is nearby is probably a bad idea.  I could also see doing something similar to Bethany’s work where the terrain is different in different locations, which should yield different planning.

2. Transfer learning in MDPs

In more information theory related work, I’ve been reading Thomas Cover (and Joy Thomas)’s information theory book, where the entropy of Markov Chains are discussed, this also opens up other information theoretic metrics that can be applied such as the Kullback-Leiber (K/L) divergence on a Markov Chain.

I was considering something along the line of this:

Given an MDP with reward, learn a policy to maximize reward.  Then, change that MDP in some way (I suppose the structure should remain the same, that is all possible actions from any given state are the same), and then measure the K/L divergence of the policy on the new MDP versus the old MDP.  I was considering looking at the K/L divergence versus the amount of learning that is required to come to an optimal policy on the new MDP.

So?

To me, the first experiment is perhaps slightly less interesting from the second, just for the reason that I feel there are plenty of solutions to this problem (but perhaps its the approach, not the problem that is interesting?).  That being said, if Bethany worked on robot navigation stuff for years I could see it still has potential.

The second idea seems much less practically applicable, and I’m not sure whether it is at all interesting from a theoretic angle.  I’m not even sure its a well-posed problem.

I’ll open it up to you to give me your feedback, interesting, or not?  Let me know where you need clarification.

Of course, I could always to something completely different than either of these if you have something in mind.  Let me know.  Before the Robocode stuff picked up you mentioned evolving reward functions?

Also:

I’m reading Kearns & Vazirani’s Intro. to Computational Learning theory, and was thinking about the delta parameter that is used as part of the definition of whether an algorithm is PAC learnable.  My understanding is the delta parameter represents the probability of having a very bad sample to train from, as the book puts it:

The confidence parameter delta is necessary since the learning algorithm may occasionally be extremely unlucky and draw a terribly “unrepresentative” sample of the target concept – for instance, a sample consisting only of repeated draws of the same instance despite the fact that the distribution is spread evenly over all instances.

Well from my understating, in the information theory literature, they use the asymptotic equipartition property (AEP) to basically say when looking at a large number of samples, only a certain number of them are at all probable (related to the law of large numbers).  I was wondering if it were possible to apply AEP to the defintion of PAC and remove the delta parameter.  Of course, AEP only applies at the limit, so I suppose that it only helps proving PAC-learnability, and not efficient PAC-learnability.  Not sure if the idea is either correct or useful, but something I thought of.

Robocode – I think this may be about as good as it gets

So I included the ideas which I mentioned at the end of my previous post, which were:

  1. Making a different decision tree for each advisor
  2. Adding 2 new (“derived”) variables to the state representation, being distance to nearest wall, and distance to nearest corner.

I’ll cut right to the results and then give an explanation:

Where the y-axis represents the goodness ratio for each metric for the agent relative to WallBot, and the x-axis represents training epochs (1st being completely random).  For example, the green triangle at (2, 0.6) represents (rounds agent won)/(total rounds).  After the last training epoch, the policy had converged completely, as determined by the decision tree.

Some interesting things to note:

  1. I needed to sample 3,000 matches under the random policy before Weka would generate a decision tree larger than one leaf node (all “bad”) for each advisor.  For all subsequent epochs, the duration was 1,000 matches.  All data was merged, so each epoch had available all previous data when generating decision tree, not just previous epoch.
  2. On epoch #2 (as labeled in chart), based on the decision tree, it believed that spinbot was the best choice in almost all situations, which resulted in the very poor performance during that epoch
  3. On epoch #3 (as labeled in chart), based on decision tree, it ceased to used SpinBot all together, based on the poor performance of SpinBot in the previous training epoch.  SpinBot was never again used at all.
  4. Policy seemed to be mostly converged by the 4th epoch, but continued to change slightly up until the 6th epoch.
  5. By the 6th epoch, score, damage caused, and win rates are all better than WallBot.
  6. I attempted to add more state variables (such as last recorded opponent distance, which is important for TrackFire), and it actually reduced the win rate from over 50% to about 25%.  I suppose it may be a case of overfitting?

The behavior of the converged policy is consistent and always uses WallBot at the beginnging of the the round, until it gets near or at the wall, at which point it switches to TrackFire for the duration of the round.   SpinBot is not used at all.  This is pretty close to what I would hand-code for a policy.  Interestingly, the policy found by the decision trees is extremely short, and I will post it below:

SpinBot:

down (0.79524)

TrackFire:

disToWall <= 17.98183: down 0.7172
disToWall > 17.98183
|   disToWall <= 35.515157: up 0.7242
|   disToWall > 35.515157: down 0.8641

Walls:

disToCorner <= 263.925892: down (0.7227)
disToCorner > 263.925892
|   disToCorner <= 264.569543: up (0.777)
|   disToCorner > 264.569543: down (0.789)

Interestingly, TrackFire only cares about the distance to the nearest wall, while WallBot only cares about the distance to the nearest corner.

The actions which resulted from this policy have the following interesting visualization, which is the damage ratio change experienced based on x, y coordinate, when using trackfire.  The “sweet spot” lights up in blue.  Note that x, y coordinates were not used as part of this policy:

A similar plot for data logged using SpinBot, or WallBot do not show patterns that are as consistent as this, and this visualization tends to look much more “random”.

At any rate, we finally have an algorithm and a policy that has a slight edge over WallBot, which I am fairly pleased with considering how simple the policy found is, and how imperfect our representation is of what is occuring in the actual game (energy lost from each shot isn’t modeled, nor is energy gained from hit shots).  Additionally, this was done using much less data than Dan&Co. was using to craft his solution.

Tomorrow, I am going to try and use the approaches of the last experiment (using x, y location as state, as opposed to distances to wall, corner) and this experiment (different decision trees for each policy) and see what the result is.  I’m expecting performance will be as good, or worse than this, but I’ll have word soon.

Visualizations from Robocode experiments

Below are two images which are visualizations of data which was logged during trials against Wallbot in Robocode.  The axes on both images are the X, Y axis of the actual battlefield. 

The first image shows where the agent prefered to use each advisor, where notably red corresponds to trackfire.  As you can see, the prefered location to use trackfire is very close to, but not exactly along the wall, which is exactly where the sweet spot against wallbot is.  Interestingly, it didn’t seem to recognize that there is a sweet spot along the upper wall as well.

 The second image indicates where the recorded (but as we learned, not necessarily actual) damage ratio changed.  Blue indicates up, and red indicates down.  Observe the very high correlation with red points above (where trackfire was used) and the blue points below (where the damage ratio was expected to go favorably for the agent.  Just as the red band along the top wall is absent above, similarly there was no found high reward band below.  

Interestingly, over the iterations, the sweet spot started at the lower left corner and “grew out” over the iterations to cover the entire lower and left walls, and then up the right wall towards the end.  I think with even more training, it would have found the upper sweet spot as well.

 These results came from a policy that only observed the advisor and current x, y coordinates as the state.  Performance hovered at about a 25% win rate, and score at about 35%.  It seems quite clear that the agent is finding the sweet spot in most places (outside at the top, as mentioned).  So why isn’t it doing better?  I have a few ideas (still not giving up)!

  1. It doesn’t “know” that it can use wallbot to get out of the center.  Since one decision tree is being used for all three advisors, whenever it asks which advisor to use in the center, the decision tree doesn’t differentiate between the three, and just says “bad,” and so all three are given the same score (which I know doesn’t make sense; wallbot should be used in the center so it can enter that dangerous region).  I plan on switching to a separate decision tree for each advisor, so I can know how bad any point is for all three advisors independently, so there is a low probability of a tie (there is a particular method im using to tie-break but in practice its pretty close to uniform, which is not good here).  In the center trackbot and spinbot should end up with much lower scores than wallbot since they basically hang out in that bad area.
  2. Instead of using x, y coordinates as state, I plan on just using distance to the nearest wall, and perhaps also distance to the nearest corner (even though my guess is that isn’t totally necessary).  That way it can generalize very easily and there shouldn’t be missing bands of goodness as occurred in this experiment.

Once I get those experiments up, I’ll have the results posted.  I think there still may be some decent results to be found with tweaks to the representation and data analysis.