Untangling Invariant Object Recognition. DiCarlo, Cox. TRENDS in Cognitive Sciences 2007.

Read this because the paper is often pointed to as evidence of slow features doing object recognition, but really there is very little mention of SFA.

  1. Considers object recognition, and “…show that the primate ventral visual processing stream achieves a particularly effective solution in which single-neuron invariance is not the goal.”
  2. “Because brains compute with neurons, the subject must have neurons somewhere in its nervous system — ‘read-out’ neurons — that can successfully report if object A was present [...].”
  3. “The central issues are: what is the format of the representation used to support the decision (the substrate on which the decision functions operate); and what kinds of decision functions (i.e. read-out tools) are applied to that representation?”
  4. “… we treat object recognition fundamentally as a problem of data representation and re-representation, and we use simple decision functions (linear classifiers) to examine those representations.”
  5. Object recognition is fundamentally difficult, because “… each image projected into the eye is one point in an ~1 million dimensional retinal ganglion cell representation (…).”
  6. Manifolds: “… all the possible retinal images that face could ever produce… form a continuous, low-dimensional, curved surface inside the retinal image space called an object ‘manifold’…”
  7. Consider two potential image manifolds that “… do not cross or superimpose — they are like two sheets of paper crumpled together… We argue that this describes the computational crux of ‘everyday’ recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation — it is tangled (…).”
  8. “One way of viewing the overarching goal of the brain’s object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded … into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded…”
  9. “… single neurons at the highest level of the monkey ventral visual stream — the IT cortex — display spiking responses that are probably useful for object recognition.  Specifically, many individual IT neurons respond selectively to particular classes of objects, such as faces or other complex shapes, yet show some tolerance to changes in object position, size, pose, and illumination, and low-level shape cues.”
  10. Examined <what I suppose are actual> neural responses from 200 neurons in IT (the highest part of the visual system). Simple linear classifiers on the population activity were robust to variation in object position and size.  More sophisticated classifiers didn’t improve performance much, and were ineffective when applied directly to V1 activity.
    1. V1 doesn’t allow for simple separation and identification of objects but IT does.
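A toy sketch of what “tangling” means computationally (my own construction, not the paper’s data): two classes lying on concentric rings defeat a linear read-out, but a simple re-representation (squared radius) makes the same linear read-out work.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(np.arange(n) % 2 == 0, 1.0, 2.0)   # class given by ring
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
y = (radius > 1.5).astype(int)

def linear_readout_accuracy(features, y):
    """Least-squares linear classifier: a 'simple decision function'."""
    A = np.c_[features, np.ones(len(features))]
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return float(np.mean((A @ w > 0) == (y == 1)))

acc_tangled = linear_readout_accuracy(X, y)                         # near chance
acc_untangled = linear_readout_accuracy((X ** 2).sum(1, keepdims=True), y)
```

The information was always there; only the format changed, which is the paper’s point about V1-like versus IT-like representations.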
  11. The classical approach to understanding visual processing is to consider the action of individual neurons; here they instead consider the activity of entire populations in the representation of visual images
  12. According to this perspective, the best way to build object representations is to find a way to separate manifolds
  13. Because populations are considered, it is not the case that a single neuron must represent things (although if a single neuron can split the manifolds that is great).  Because “… real IT neurons are not, for example, position and size invariant, in that they have limited spatial receptive fields <as opposed to wide spatial receptive fields, as would be required if single neurons were actually responsible for representing invariances>[...].  It is now easy to see that this ‘limitation’ <of relying on populations of neurons as opposed to single neurons> is an advantage.”
  14. Manifold flattening may be hard wired, but it may also be the case that the means to do this is learned from the statistics of natural images. <The latter must play at least part of the role>
  15. Progressive flattening by increasing depth in visual system
    1. “… we believe that the most fruitful computational algorithms will be those that a visual system (natural or artificial) could apply locally and iteratively at each cortical processing stage (…) in a largely unsupervised manner(…) and that achieve some local object manifold flattening.”
  16. In the actual visual system, each layer projects information into a space of increasing dimension “… e.g. ~100 times more V1 neurons than retinal ganglion neurons …”.  As data gets projected into higher dimensions (if the projection is good), simpler methods of analysis and classification become more effective
    1. Additionally, at each stage the distribution of activity is “… allocated in a way that matches the distribution of visual information encountered in the real world …”, which means the projection into higher dimensions actually uses the additional space reasonably
  17. The addition that temporal information helps untangle manifolds: “In the language of object tangling, this is equivalent to saying that temporal image evolution spells out the degrees of freedom of object manifolds.  The ventral stream might use this temporal evolution to achieve progressive flattening of object manifolds across neuronal processing stages.  Indeed, recent studies in our laboratory … have begun to connect this computational idea with biological vision, showing that invariant object recognition can be predictably manipulated by the temporal statistics of the environment.”
    1. This is where SFA is cited

How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Collins, Frank. European Journal of Neuroscience 2012.

  1. Uses an experiment specifically designed to tease apart contributions of working memory (WM) and the part of the brain more traditionally associated with RL (“…corticostriatal circuitry and the dopaminergic system.”)
    1. “By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. “
  2. Propose a new model for the interaction of RL and WM
  3. “Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size.  The WM component also allowed for a more reasonable estimation of a single RL process.”
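A minimal sketch of what a capacity-limited WM mixed with incremental RL might look like (in the spirit of the paper’s model; the class name, parameter names, and default values here are my own, not the authors’):

```python
import math
from collections import OrderedDict

class RLWM:
    """Mixture of incremental RL and a capacity-limited WM store (sketch)."""

    def __init__(self, n_actions=3, alpha=0.1, beta=5.0, capacity=3, w_wm=0.8):
        self.n_actions = n_actions
        self.alpha = alpha            # RL learning rate
        self.beta = beta              # softmax inverse temperature
        self.capacity = capacity      # max stimulus-response pairs held in WM
        self.w_wm = w_wm              # weight of the WM policy in the mixture
        self.q = {}                   # RL: stimulus -> action values
        self.wm = OrderedDict()       # WM: stimulus -> last rewarded action

    def action_probs(self, stim):
        q = self.q.setdefault(stim, [0.0] * self.n_actions)
        expq = [math.exp(self.beta * v) for v in q]
        rl = [e / sum(expq) for e in expq]                 # softmax RL policy
        if stim in self.wm:                                # WM: recall if still held
            wm = [float(a == self.wm[stim]) for a in range(self.n_actions)]
        else:
            wm = [1.0 / self.n_actions] * self.n_actions
        return [self.w_wm * w + (1 - self.w_wm) * r for w, r in zip(wm, rl)]

    def update(self, stim, action, reward):
        q = self.q.setdefault(stim, [0.0] * self.n_actions)
        q[action] += self.alpha * (reward - q[action])     # prediction-error update
        if reward:                                         # WM stores recent successes,
            self.wm[stim] = action                         # evicting the oldest entry
            self.wm.move_to_end(stim)                      # once capacity is exceeded
            while len(self.wm) > self.capacity:
                self.wm.popitem(last=False)
```

With set size above `capacity`, WM entries get evicted between repetitions and behavior falls back on the slow RL component, which qualitatively reproduces the load effect the experiment was designed to expose.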
  4. Also some genetics work <although I will probably go light on it in this reading>
  5. “Activity and plasticity in striatal neurons, a major target of dopaminergic efferents, are dynamically sensitive to these dopaminergic prediction error signals, which enable the striatum to represent RL values (O’Doherty et al., 2004; Frank, 2005; Daw & Doya, 2006)”
  6. “Thus, although this remains an active research area and aficionados continue to debate some of the details, there is widespread general agreement that the basal ganglia (BG) and dopamine are critically involved in the implementation of RL.”
  7. Humans, at least, probably rely on more than simple prediction errors; we have the ability to do forward search, for example
  8. Genes controlling dopaminergic function may cause changes in behavior either after initial learning has occurred or during learning.  “Similarly, functional imaging studies have shown that dopaminergic drugs modulate striatal reward prediction error signals during learning, but that these striatal signals do not influence learning rates during acquisition itself; nevertheless, they are strongly predictive of subsequent choice indices measuring the extent to which learning was sensitive to probabilistic reward contingencies (Jocham et al., 2011).”
  9. Their experiments involve binary payoffs, document interaction of “…higher order WM and lower level RL components…” and show that accounting for WM can explain “…crucial aspects of behavioral variance…”
  10. “We further show that a genetic marker of prefrontal cortex function is associated with the WM capacity estimate of the model, whereas a genetic marker specific to BG function relates to the RL learning rate.”
  11. In the experiment there is binary payoff, with a presented stimulus and 3 possible responses (a 3-arm contextual bandit)
  12. They did genetic evaluation of the subjects
  13. Subjects did well in the task, with the last couple of trials/block being at >94% accuracy, and learning generally stabilized in 10 or fewer samples
  14. Different learning episodes varied in the size of the set that needed to be learned.  They considered the impact of working memory in terms of load and delay: for load, there may be a limit to the number of stimulus-response mappings that can be remembered; for delay, they consider the case where information may be cycling through WM, so they look at the temporal separation between repetitions of a stimulus
    1. Considered how behavior diverged from optimal <regret!> in terms of delay.  This means they consider cases where the correct response was already given to a given stimulus
  15. Turns out that people were more likely to respond in error if the same stimulus was presented twice in a row than if there was a short delay between them <seems like the opposite of a switch cost?>
    1. “This indicated that, when the same stimulus was presented twice in a row, subjects were more likely to make an error in the second trial after having just responded correctly to that stimulus for lower set sizes. This finding may reflect a lower degree of task engagement for easier blocks, leading to a slightly higher likelihood of attentional lapses.”
  16. Logistic regression done on the following variables: set size, delay since correct response to particular stimulus, total number of correct responses to stimulus
    1. Main effect of set size and correct repetitions
    2. Effect of delay was actually not a main effect, but it did interact with # correct repetitions
    3. “These results support the notion that, with higher set sizes as WM capacity was exceeded, subjects relied on more incremental RL, and less on delay-sensitive memory.”
  17. The logistic regression captured performance pretty accurately (mostly seems to have smoothed out the actual results in a good way), so it gives a reasonable means to determine contributions of the various components to the final result
  18. Penalized increasingly complex models with Akaike’s information criterion
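For reference, the criterion itself is tiny (standard formula, not from the paper’s text):

```python
def aic(log_likelihood, k):
    """Akaike's information criterion: AIC = 2k - 2 ln L (lower is better).
    Extra parameters must buy enough likelihood to pay the 2-per-parameter
    penalty, which is how an implausible many-parameter RL model can lose
    to a simpler hybrid RL+WM model."""
    return 2 * k - 2 * log_likelihood
```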
  19. The RL+WM model had the best fit: “Thus, accounting for both capacity-limited WM and RL provides a better fit of the data than either process on its own (i.e. pure WM or pure RL models). Importantly, there was no trade-off in estimated parameters between the two main parameters of interest: capacity and RL learning rate, as would be revealed by a negative correlation between them”
  20. The capacity of WM estimated by the models was 3.7 +/- 0.14, which <kind of> agrees with standard estimates
  21. <Basically skipping genetics part>
  22. “The vast majority of neuroscientific studies of RL have focused on mechanisms underlying the class of ‘model-free’ RL algorithms; they capture the incremental learning of values associated with states and actions, without considering the extent to which the subject … can explicitly plan which actions to make based on their knowledge about the structure of the environment.”
  23. “It is clear that one such computational cost [of doing forward-search] is the capacity limitation of WM, which would be required to maintain a set of if⁄then relationships in mind in order to plan effectively … This secondary process is often attributed to the prefrontal cortex and its involvement in WM.”
  24. Here they show that WM has implications for even the simplest sorts of behavioral tasks, which models almost never account for
  25. When standard model-free RL algorithms are used to model behavior, their parameters naturally need to be fit to the actual data.  Results here, however, show that the fitted parameters are really (partially) compensating for WM capacity, as opposed to capturing what they actually mean in the algorithm.  They are “… no longer estimates of the intended RL processes and are therefore misleading. Even when separate RL parameters were estimated for each set size in our experiment (an implausible, non-parsimonious model with 10 parameters), it did not provide as good a fit to the data as did our simpler hybrid model estimating WM contributions together with a simple process.”
  26. “The experimental protocol allowed us to determine that variance in behavior was explained separately by two characteristics of WM: its capacity and its stability. Indeed, behaviorally, learning was slower in problems with greater load, but there were minimal differences in asymptotic performance. Furthermore, although performance was initially highly subject to degradation due to the delay since the presented stimulus was last observed, this delay effect disappeared over learning.”
    1. Eventually the RL system supersedes the WM system
  27. Results here show that some components of commonly conducted RL experiments may be producing unanticipated influences on the results; there are also implications for whether policy is studied during learning or only after learning has occurred

How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis. Escalante-B, Wiskott. CogPrints 2013.

  1. Their extension for supervised SFA is called graph-based SFA
  2. “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
  3. Trained with a graph where edge weights represent similarities
  4. The modification to SFA made here is that it accepts weights
  5. There are different ways of building these graphs, a very simple method generates results equivalent to the Fisher linear discriminant
  6. Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
    1. Here, a supervised dimension reduction step is proposed
  7. “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports node-weights, and uses function spaces.”
  8. GSFA can be used for both regression and classification; many approaches only work for one of the two
  9. “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
  10. In the graph, there are edge weights along with node weights, which specify a-priori sample properties
  11. “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with a huge input is used, leading to better generalization.”
    1. Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
  12. In the graph, edges are undirected and weighted, although it seems that the approach trivially generalizes to the directed case
  13. Basically they rewrite the original constraints of SFA with added weights
  14. Non-existing edges are given 0-weight
  15. Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
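A minimal linear sketch of how I read the weighted rewrite (my own simplification; the paper’s exact normalization constants and the nonlinear expansion are omitted). Slowness of a direction is the edge-weighted average of squared output differences, minimized under a node-weighted unit-variance constraint, which is a generalized eigenproblem:

```python
import numpy as np

def linear_gsfa(X, edges, v, n_features=2):
    """X: (n, d) data; edges: {(i, j): weight} undirected edge weights;
    v: length-n node weights.  Returns the slowest weighted directions
    and their generalized eigenvalues (smaller = slower)."""
    Xc = X - np.average(X, axis=0, weights=v)   # node-weighted centering
    d = X.shape[1]
    A = np.zeros((d, d))                        # edge-difference covariance
    for (i, j), w in edges.items():
        diff = Xc[i] - Xc[j]
        A += w * np.outer(diff, diff)
    B = (Xc * v[:, None]).T @ Xc / v.sum()      # node-weighted covariance
    # Solve A w = lam B w by whitening with B^(-1/2) (B symmetric pos. def.)
    evalB, evecB = np.linalg.eigh(B)
    Bmh = evecB @ np.diag(evalB ** -0.5) @ evecB.T
    lam, U = np.linalg.eigh(Bmh @ A @ Bmh)
    W = Bmh @ U[:, :n_features]                 # slowest directions first
    return W, lam[:n_features]
```

With uniform node weights and edges given by consecutive time steps, this should reduce to ordinary linear SFA, which matches the note above about the graph encoding the transition statistics.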
  16. How to construct the graph for either classification or regression is then discussed
  17. For classification, they simply generate a separate graph for each class, with each item in each graph fully connected, and each sub-graph completely unconnected to items in a separate class, so basically there are independent fully connected components for each class
    1. There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
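The classification graph just described is easy to write down (a sketch; the paper additionally exploits within-class symmetry to avoid materializing all these edges):

```python
def classification_graph(labels):
    """Samples sharing a label are fully connected with unit weight;
    different classes are completely disconnected."""
    edges = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                edges[(i, j)] = 1.0
    return edges
```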
  18. What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
  19. “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…).  This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for samples of the same identity, with moderate noise.”
  20. <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
    1. Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
  21. Now on to regression
  22. For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn.  In general, SFA will not extract the label values exactly.  However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values.  After SFA reduces the dimensionality of the data, a complementary explicit regression step on a few features solves the original regression problem.”
  23. They discuss 4 ways of doing the regression with SFA; the first one actually doesn’t even leverage graphs
  24. In the version that doesn’t leverage graphs, simply sort data and then pass into SFA.  “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
    1. On the other hand, it’s the easiest to implement (partially because vanilla SFA can be used), so “… we recommend its use for first experiments.”  If that doesn’t work well, use the GSFA approaches
  25. In the “sliding window training graph” items are sorted as above, but each vertex is connected to the d closest left and right items
  26. They recommend not using just 0 and 1 weights as it leads to “pathological solutions” – this may be what we’re picking up in ToH, and talk about why that happens. <This is worth investigating further.>
  27. In the “serial training graph,” data points are binned together; points are then all connected to points in adjacent bins, but points within the same bin are not connected to each other <why?>
    1. As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
    2. Naturally, there is tuning required to see that the binning was done correctly
  28. The “mixed training graph” adds connections within a bin
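The serial and mixed constructions can be sketched in a few lines (my reading of them; for simplicity this assumes the sample count divides evenly into bins and uses unit weights):

```python
def serial_graph(labels, n_bins, mixed=False):
    """Sort samples by label, bin them, and connect every sample to all
    samples in the adjacent bins; the 'mixed' variant also connects
    samples within each bin."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(labels) // n_bins
    bins = [order[k * size:(k + 1) * size] for k in range(n_bins)]
    edges = {}
    for b in range(n_bins - 1):
        for i in bins[b]:                  # serial: edges between adjacent bins
            for j in bins[b + 1]:
                edges[(min(i, j), max(i, j))] = 1.0
    if mixed:
        for grp in bins:                   # mixed: also within-bin edges
            for a in range(len(grp)):
                for b2 in range(a + 1, len(grp)):
                    i, j = grp[a], grp[b2]
                    edges[(min(i, j), max(i, j))] = 1.0
    return edges
```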
  29. Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
  30. “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels.”
    1. First option is linear or nonlinear regression
    2. To bin and then classify <so you end up with discrete approx of regression?>
    3. Do a weighted version of #2 so you get continuous estimations
    4. <#s 2 and 3 immediately above look terribly hacky, if I am groking them correctly>
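Option 3, as I read it, amounts to a one-liner (function and argument names are mine): a classifier over label bins yields probabilities, and the continuous estimate is the probability-weighted average of each bin’s mean label.

```python
def soft_label_estimate(bin_probs, bin_means):
    """Weighted bin-classification regression: P(bin | slow features)
    times each bin's mean label, summed."""
    return sum(p * m for p, m in zip(bin_probs, bin_means))
```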
  31. Experimental results
  32. For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
    1. Interestingly in the benchmark task used, convnets are best, and outperform humans
  33. In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size.  This is all done separately <why?  Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
  34. They take face pictures from a number of data sets (64,471 in total), which were “… automatically pre-processed through a pose-normalization and pose-reintroduction step.”  Basically they are all centered and then shifted and zoomed according to a known distribution.  This way, they know what the x, y, z values they are estimating are
  35. Because of the size of the corpus and the images themselves, it’s difficult to apply algorithms like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great)
    1. They also do a hierarchical version of PCA, which in a sense does the opposite of SFA.  The 120 HPCA features used explain 88% of the variance
  36. Used a few different post-dimension reduction classifiers, including SVM
  37. The slow features of the data get organized in a more orderly fashion as one moves up the hierarchy
  38. “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA.  In turn, reordering was at least 10% better than HPCA for this dataset.”
  39. Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
  40. Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regards to mixed, as the regular sliding window just seems like a strange construction>.  The serial was actually better than the mixed, although the difference wasn’t significant
  41. They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
  42. “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
  43. They knock unsupervised dimension reduction for reducing dimensions in a way that doesn’t necessarily help the task you are actually interested in <But this is only “implicitly” supervised; by the same logic, fully supervised dimension reduction would be better yet.>
  44. Being able to simply specify a graph means there is no need to exhaustively harvest sequence data from a graph you may already have, as is the case in standard SFA
  45. GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used

Slowness: An Objective for Spike-Timing-Dependent Plasticity? Sprekeler, Michaelis, Wiskott. PLoS Computational Biology 2007

I picked up this paper because I saw a reference to it saying it shows SFA is equivalent to PCA of a low-pass filtered signal

  1. Explores how SFA can be implemented “…within the limits of biologically realistic spike-based learning rules.”
  2. Show a few ways SFA could be implemented with different neural models
  3. Fastness is measured by the average variance of the time derivative of the output features
  4. The part of the algorithm that is least plausible from the perspective of a biological system is the eigen decomposition.  The main aim of this paper is to show how this could be implemented “… in a spiking model neuron”
  5. “In the following, we will first consider a continuous model neuron and demonstrate that a modified Hebbian learning rule enables the neuron to learn the slowest (in the sense of SFA) linear combination of its inputs.  Apart from providing the basis for the analysis of the spiking model, this section reveals a mathematical link between SFA and the trace learning rule, another implementation of the slowness principle.  We then examine if these findings also hold for a spiking model neuron, and find that for a linear Poisson neuron, spike-timing-dependent plasticity (STDP) can be interpreted as an implementation of the slowness principle.”
  6. “Even though there are neurons with transient responses to changes in the input, we believe it would be more plausible if we could derive an SFA-learning rule that does not depend on the time derivative, because it might be difficult to extract, especially for spiking neurons.”
    1. The time derivative can be replaced by a low-pass filter (a fair amount of math to show that)
    2. <But earlier in the paper they wrote> “It is important to note that the function g1(x) is required to be an instantaneous function of the input signal.  Otherwise, slow output signals could be generated by low-pass filtering the input signal.  As the goal of the slowness principle is to detect slowly varying features of the input signals, a mere low-pass filter would certainly generate slow output signals, but it would not serve the purpose.” <So then what is the difference between this and the low pass filter they just mentioned?>
    3. <After all the math> “Thus, SFA can be achieved either by minimizing the variance of the time derivative of the output signal or by maximizing the variance of the appropriately filtered output signal.” <Oh, I see.  You can’t just filter the output, you have to set up the system so it maximizes the variance of the filtered output?>
  7. Basically, from a whitened input you can either: use the time derivative and then choose the direction of minimal variance, or use a low-pass filter and then choose the direction of maximal variance
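A numerical check of the two routes (my own toy, not the paper’s derivation; I use a crude moving-average low-pass filter rather than the paper’s parabolic-spectrum one): mix a slow and a fast unit-variance source, then recover the slow direction both ways.

```python
import numpy as np

t = np.linspace(0, 20 * np.pi, 4000)
X = np.c_[np.sin(t), np.sin(25 * t)]        # slow source, fast source
X = (X - X.mean(0)) / X.std(0)              # whitened-ish: uncorrelated, unit variance

def min_derivative_direction(X):
    """Direction minimizing the variance of the time derivative."""
    dX = np.diff(X, axis=0)
    return np.linalg.eigh(dX.T @ dX)[1][:, 0]    # smallest-eigenvalue direction

def max_lowpass_direction(X, width=50):
    """Direction maximizing the variance of the low-pass filtered output."""
    kernel = np.ones(width) / width              # moving-average filter
    F = np.column_stack([np.convolve(X[:, k], kernel, "valid")
                         for k in range(X.shape[1])])
    return np.linalg.eigh(F.T @ F)[1][:, -1]     # largest-eigenvalue direction

w_deriv = min_derivative_direction(X)
w_lp = max_lowpass_direction(X)
# Up to sign, both should pick out the slow source, i.e. a direction near [1, 0]
```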
  8. “… standard Hebbian learning under the constraint of a unit weight vector applied to a linear unit maximizes the variance of the output signal. We have seen in the previous section that SFA can be reformulated as a maximization problem for the variance of the low-pass filtered output signal.  To achieve this, we simply apply Hebbian learning to the filtered input and output signals, instead of the original signals.” <The goal with analysis for the Hebbian rule is to find something more biologically plausible>
  9. “Thus, the filtered Hebbian plasticity rule … optimizes slowness … under the constraint of unit variance… ”  The requirement that the data already has unit variance  “… underlines the necessity for a clear distinction between processing <cleaning up the data?> and learning.”  <They talk more about processing vs learning but it’s not clear to me what they mean, which is unfortunate because they say the distinction is even more important when moving to the Poisson model neuron>
  10. SFA is a quadratic approximation of the trace rule (which comes from different power spectra for the low-pass filters the two use) <I don’t know what anything on this line means.>.  In the case that the power spectrum is parabolic (most input power is concentrated at lower frequencies) <Is that what that would mean?>, then the results of both will be similar.
  11. In SFA the outputs of each step are real-valued, but in a neuron there is simply spiking information.  Here a neuron is modeled by “inhomogeneous Poisson processes” – they simply consider information contained in the spike rate, and ignore information held in the exact spike timing (they describe this model mathematically)
  12. “…in an ensemble-averaged sense it is possible to generate the same weight distribution as in the continuous model by means of an STDP rule with a specific learning window.”
  13. <I guess the model is optimal for learning slowness (mostly skipped that section) because it then follows up saying that they have yet to discuss why it is optimal>
  14. <Skipping neural modelling almost entirely>
  15. “In the first part of the paper, we show that for linear continuous model neurons, the slowest direction in the input signal can be learned by means of Hebbian learning on low-pass filtered versions of the input and output signal.  The power spectrum of the low-pass filter required for implementing SFA can be derived from the learning objective and has the shape of an upside-down parabola.”
  16. <Immediately following> “The idea of using low-pass filtered signals for invariance learning is a feature that our model has in common with several others [...]. By means of the continuous model neuron, we have discussed the relation of our model to these ‘trace rules’ and have shown that they bear strong similarities.”
  17. In the implementation of SFA on a Poisson neuron, “The learning window that realizes SFA can be calculated analytically.”
    1. “Interestingly, physiologically plausible parameters lead to a learning window whose shape and width is in agreement with experimental findings.  Based on this result, we propose a new functional interpretation of the STDP learning window as an implementation of the slowness principle that compensates for neuronal low-pass filters such as EPSP.”
  18. “Of course, the model presented here is not a complete implementation of SFA, the extraction of the most slowly varying direction from a set of whitened input signals.  To implement the full algorithm, additional steps are necessary: a nonlinear expansion of the input space, the whitening of the expanded input signals, and a means of normalizing the weights… On the network level, however, whitening could be achieved by adaptive recurrent inhibition between the neurons [...].  This mechanism may also be suitable for extracting several slow uncorrelated signals as required in the original formulation of SFA [...] instead of just one.”
  19. For weight normalization “A possible biological mechanism is synaptic scaling [...], which is believed to multiplicatively rescale all synaptic weights according to postsynaptic activity… Thus it appears that most of the mechanisms necessary for an implementation of the full SFA algorithm are available, but that it is not yet clear how to combine them in a biologically plausible way.”
  20. <Immediately following> “Another critical point in the analytical derivation for the spiking model is the replacement of the temporal by the ensemble average, as this allows recovery of the rates that underlie the Poisson process.”  I.e., the data should be ergodic
  21. Not yet clear if these results can be reproduced with more realistic model neurons.
  22. “In summary, the analytical considerations presented here show that (i) slowness can be equivalently achieved by minimizing the variance of the time derivative signal or by maximizing the variance of the low-pass filtered signal, the latter of which can be achieved by standard Hebbian learning on the low-pass filtered input and the output signals; (ii) the difference between SFA and the trace learning rule lies in the exact shape of the effective low-pass filter–for most practical purposes the results are probably equivalent; (iii) for a spiking Poisson model neuron with an STDP learning rule, it is not the learning window that governs the weight dynamics but the convolution of the learning window with EPSP; (iv) the STDP learning window that implements the slowness objective is in good agreement with learning windows found experimentally.  With these results, we have reduced the gap between slowness as an abstract learning principle and biologically plausible STDP learning rules, and we offer a completely new interpretation of the standard STDP learning window.”

Should really look for another description of the algorithm if it exists.  For some reason I’m finding the paper very unclear on my first read-through.  It’s one of the few papers I’ve read that would be more understandable with more math and less English.
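Point (i) of the summary can be checked numerically.  The sketch below is my own toy construction (not from the paper): a two-dimensional signal mixes a slow sinusoid with fast noise, and the slowest SFA direction (smallest eigenvector of the derivative covariance in whitened coordinates) is compared against the principal component of the low-pass filtered whitened signal – the quantity that Oja-style Hebbian learning on low-pass filtered signals would converge to:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
t = np.arange(n)

# Latent sources (my own toy construction): one slow sinusoid, one fast noise.
slow = np.sin(2 * np.pi * t / 1000.0)
fast = rng.standard_normal(n)
# Observe an arbitrary invertible linear mixture of the two.
A = np.array([[1.0, 0.7], [0.3, 1.0]])
x = np.stack([slow, fast], axis=1) @ A.T

# Whiten the observations (both routes below operate on whitened input).
xc = x - x.mean(axis=0)
d, E = np.linalg.eigh(xc.T @ xc / n)
z = xc @ (E / np.sqrt(d))

# Route 1 (SFA): direction minimizing the variance of the time derivative.
dz = np.diff(z, axis=0)
_, U = np.linalg.eigh(dz.T @ dz / len(dz))
w_sfa = U[:, 0]                     # eigh sorts ascending: smallest first

# Route 2: principal component of the low-pass filtered signal.  An
# exponential moving average stands in for the low-pass filter; Oja's
# Hebbian rule on the filtered signal converges to this same eigenvector.
alpha = 0.1
zf = np.empty_like(z)
zf[0] = z[0]
for i in range(1, n):
    zf[i] = (1 - alpha) * zf[i - 1] + alpha * z[i]
_, V = np.linalg.eigh(zf.T @ zf / n)
w_lp = V[:, -1]                     # largest eigenvalue: last column

cos = abs(w_sfa @ w_lp)
print(f"|cosine| between the two directions: {cos:.3f}")  # close to 1
```

With a clear separation of timescales the two directions coincide almost exactly, which is the equivalence claimed in point (i).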

Discovering Hierarchy in Reinforcement Learning with HEXQ. Hengst. ICML 2002.

  1. HEXQ attempts to decompose and solve factored MDPs in a model-free manner
  2. Doesn’t deal with solving the decomposed MDP, just doing the decomposition itself
  3. Assumes:
    1. Some elements of the feature vector change less frequently than others
    2. The variables that change more often keep transition properties independently of more slowly changing variables
    3. “Interface between regions can be controlled.  For example, if a robot navigates around four equally sized rooms with interconnecting doorways (…) the state space can be represented by the two variables, room-identifier and position-in-room.  Most representations naturally label repeated sub-structures in this way.” <Seems to mean partitioning the feature vector along any particular dimension solely causes reasonable partitions of the state space>
  4. If these assumptions don’t hold, HEXQ will worst-case simply solve the flat problem
  5. Creates a hierarchy, with the maximum number of levels being the state dimension.  The bottom (level 1) is the variables that change most frequently
    1. “The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first.”
  6. Only the first level (fastest) interacts with the environment via primitive actions
  7. Start by observing the state variable that changes most frequently. “We now partition the states represented by the values of this variable into Markov regions.”  <I take this simply to mean regions of the flat state space corresponding to each possible assignment of that feature to a valid value, although not sure>
  8. “The boundaries between regions are identified by ‘unpredictable’ (see subsection 4.2) transitions which we call region exits.  We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.”
  9. Regions are then combined with the next most frequently changing feature to make more abstract states at the next level in the hierarchy
    1. The exit policies then become abstract actions at that next level
    2. This results in a semi-MDP with one less feature in the state dimension and only abstract actions (exits, not primitive)
  10. The top-level has a sub-MDP that is solved by recursively calling lower-level MDP policies (exits) as its actions
  11. The feature ordering heuristic simply involves running a random trajectory and finding the frequency at which each feature changes
  12. Tries to model transitions as a directed graph <DBN?>, and random trajectories are taken.  “Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.”
  13. Transitions are unpredictable if:
    1. T or R is not stationary
    2. Some other variable changes value
    3. The terminal goal state is reached
  14. Entry states are reached after taking an exit (which is a state action pair <s^e, a>, where e is the level of the state)
    1. In Taxi, all states are entries because the agent is reset randomly after the goal state is reached <But doesn’t this criterion conflict with their assumption that the domains are basically shortest-path tasks?>
  15. An abstract state can only be left by an exit
  16. Approach:
    1. Decompose the transition graph into strongly connected components (SCCs)
    2. The SCCs form a DAG
    3. SCCs can then be merged together, potentially in a hierarchy – goal is to do as much of this as possible in order to minimize the # of abstract states
    4. “Following the coalition of SCCs into regions we may still have regions connected by edges from the underlying DAG.  We break these by forming additional exits and entries associated with their respective regions and repeat the entire procedure until no additional regions are formed.”  <Not exactly sure what this means yet>
  17. “A region, therefore, is a combination of SCCs such that any exit state in a region can be reached from any entry with probability 1.  Regions are generally aliased in the environment.”
  18. Regions have sub-goals of reaching exits
  19. “We proceed to construct multiple sub-MDPs one for each unique hierarchical exit state (s1, s2, …, se) in each region. Sub-MDP policies in HEXQ are learnt on-line, but a form of hierarchical dynamic programming could be used directly as the sub-task models have already been uncovered.”
  20. A region may have multiple exits, and each exit may lead to a separate sub-MDP <But those sub-MDPs are on the same level according to the wording which I don’t understand:> “In the taxi example, the above procedure finds one hierarchical level 1 region as reflected in figure 2.  This region has 8 exits… As there are 4 hierarchical exit states we create 4 sub-MDPs at level 1 and solve them.”
  21. Taking an abstract action at level e is equivalent to executing the policy for the appropriate region at level e-1
  22. In taxi, level 1 has just one state.  Level 2 has 5 states <corresponding to the 4 passenger locations plus the 1 state at level 1?>.  The abstract actions involve being at one of the locations where a passenger is and either doing a pickup or a putdown  <It seems like relevant aspects of state are ignored though, so it just has to do with pickup or putdown in a particular location on the map, regardless of whether a passenger is in the car or not, or what the destination is?  Oh, it seems like information at the next level up encodes whether there is a passenger in the car or not>.
  23. <The trick to go from deterministic shortest path problems to stochastic shortest path problems feels a bit hacky.  Basically, they keep long and short term statistics recorded, and see if they match or not>
  24. In taxi, HEXQ ends up with a representation not much larger than MAXQ’s, while learning the relevant hierarchical information itself
  25. Exits must work with certainty; how to generalize the algorithm to the stochastic case is left for future work
  26. “The two heuristics employed by HEXQ are (1) variable ordering and (2) finding non-stationary transitions.”
  27. A problem is that “An independent slowly changing random variable would be sorted to the top level in the hierarchy by this heuristic and the MDP would fail to decompose as it is necessary to explore the variable’s potential impact.”
  28. HEXQ, like MAXQ is recursively optimal
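HEXQ’s variable-ordering heuristic (point 11) is simple enough to sketch directly.  Below is a minimal illustration on a hypothetical two-variable state (position-in-room changing often, room-identifier changing rarely); the environment and all names are mine, not from the paper:

```python
import random

def variable_change_frequencies(trajectory):
    """Count how often each state variable changes value along a trajectory.

    HEXQ sorts variables by these counts: the most frequently changing
    variable forms level 1 (the bottom) of the hierarchy.
    """
    counts = [0] * len(trajectory[0])
    for prev, curr in zip(trajectory, trajectory[1:]):
        for i, (p, c) in enumerate(zip(prev, curr)):
            if p != c:
                counts[i] += 1
    return counts

# Hypothetical random walk over (position-in-room, room-identifier):
# position changes every step, the room only when crossing a doorway.
random.seed(0)
pos, room = 0, 0
trajectory = [(pos, room)]
for _ in range(1000):
    pos = (pos + random.choice([-1, 1])) % 25
    if pos == 0 and random.random() < 0.5:   # doorway: occasionally switch rooms
        room = (room + 1) % 4
    trajectory.append((pos, room))

freq = variable_change_frequencies(trajectory)
order = sorted(range(len(freq)), key=lambda i: -freq[i])
print(freq, order)  # position (index 0) changes far more often than room
```

Sorting by these counts yields exactly the level assignment described in points 5–6: position-in-room at level 1, room-identifier above it.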

A Theoretical Basis for Emergent Pattern Discrimination in Neural Systems Through Slow Feature Extraction. Klampfl, Maass. Neural Computation 2010.

  1. Shows equivalence of slow feature analysis to Fisher linear discriminant
  2. “We demonstrate the power of this principle by showing that it enables readout neurons from simulated cortical microcircuits to learn without any supervision to discriminate between spoken digits and to detect repeated firing patterns that are embedded into a stream of noise spike trains with the same firing statistics.  Both these computer simulations and our theoretical analysis show that slow feature extraction enables neurons to extract and collect information that is spread out over a trajectory of firing states that lasts several hundred ms.”
  3. Also can train neurons to keep track of time themselves
  4. Much of the learning the brain does is unsupervised.  SFA works in that scenario
    1. Intuition is that things that occur nearby in time are probably driven by the same underlying cause
  5. Perceptual studies show that slowness can form “…position- and view-invariant representations of visual objects in higher cortical areas.”
    1. Images were altered during blindness caused by a saccade, and the perception was that the two merged
    2. The authors of a related work <Li and DiCarlo – DiCarlo seems to be the common element of that work> say “unsupervised temporal slowness learning may reflect the mechanism by which the visual stream builds and maintains object representations.”
    3. Studies of IT in monkeys show similar evidence
  6. This work builds on the theoretical basis of how slowness can be responsible for such effects
  7. Work by Sprekeler, Michaelis, Wiskott (07) also shows SFA to be equivalent to PCA on a low-pass filtered signal
    1. “In addition, they have shown that an experimentally supported synaptic plasticity rule, spike-timing-dependent plasticity (STDP), could in principle enable spiking neurons to learn this processing step without supervision, provided that the presynaptic inputs are processed in a suitable manner.”
  8. “… we show that SFA approximates the discrimination capability of the FLD in the sense that both methods yield the same projection direction, which can be interpreted as a separating hyperplane in the input space.”
  9. Intuitions drawn from SFA may help explain some phenomena that are sometimes not so well covered by “… the classical view of coding and computation in the brain, which was based on the assumption that external stimuli and internal memory items are encoded by firing states of neurons, which assign a certain firing rate to a number of neurons maintained for some time interval.”
    1. In particular, stimuli and memory items in the brain are likewise encoded by trajectories of firing over several hundred ms
    2. Firing in terms of neural trajectories has been found for static stimuli (such as odors and tastes) as well as for naturally time-varying stimuli (such as auditory or visual ones)
    3. Firing of sequences of neurons is also seen when considering traces of episodic memory in hippocampus and cortex
  10. Being that there is neural activity that works in a temporally dispersed manner, perhaps SFA has something to say about how the brain actually uses that neural activity
  11. <There are also simulated neural results, but details are too complex to outline here so will address those points when I get to that section fully>
  12. A nice result of the neural experiments, and general properties of SFA is that when used over temporal data it behaves in an anytime sense (always producing whatever slow signals are most appropriate), so items that depend on its outputs may begin to formulate responses quickly
    1. This is especially important when set up in a hierarchical manner
  13. For the FLD part, consider linear basis functions only
  14. For classification, SFA has an additional parameter p, which dictates the probability of showing items of different classes back-to-back (needs probability of seeing in-class temporal pairs to be better than chance, but seems to be surprisingly robust to values anywhere in that range)
  15. In the case where eigenvalues are nonzero <which should be the case?> the SFA objective and the FLD are equivalent
  16. SFA/FLD can also be extended to the multiclass setting
  17. “The last line of equation 2.19 is just the formulation of FLD as a generalized eigenvalue problem… More precisely, the eigenvectors of the SFA problem are also eigenvectors of the FLD problem; the C-1 [where C is the number of classes] slowest features extracted by SFA applied to the time series x_t span the subspace that optimizes separability in terms of the FLD… The slowest feature… is the weight vector that achieves maximal separation…”
  18. When used as a classifier, the optimal responses should look like a step function
  19. They then move onto the case where the trajectory provided to SFA is not made of isolated points, but rather sub-trajectories assembled into a larger trajectory
  20. The idea is to model what happens when sequences of neural firing occurs, such as is the case when episodic experience is reviewed mentally
  21. <Still a little unsure of the exact formulation of the input in this section>
  22. In an example where class means are similar, FLD produces the “right” linear separator, while SFA chooses one that is basically degenerate (each half-space contains roughly 50-50 points from both classes, even though the data is linearly separable)
  23. In this case, SFA and FLD produce different results, both analytically as well as empirically
    1. This is because “… the temporal correlations induced by the use of trajectories have an effect on the covariance of temporal differences in equation 2.24 compared to equation 2.12”
  24. <Again for trajectory of trajectories case> “… even for a small value of p, the objective of SFA cannot be solely reduced to the FLD objective, but rather there is a trade-off between the tendency to separate trajectories of different classes… and the tendency to produce smooth responses during individual trajectories…”
  25. In the vanilla SFA classification construction, SFA should really see transition between all pairs of points in the same class equally often; when trajectories are used this starts to fall apart because most of the time-pairs are within a single instance as opposed to between two items in the same class; as the sub-trajectories become long they dominate the features constructed so the ability to do classification is lost
  26. Linear separators aren’t strong, but as dimension increases linear separability becomes increasingly likely <although I would also argue that as dimension increases the risk of overfitting comes along with it>
    1. Not only does separability increase, margin does as well
    2. “In other words, a linear readout neuron with d presynaptic units can separate almost any pair of trajectories, each defined by connecting fewer than d randomly drawn points.”
  27. Now moves onto “… SFA as a possible mechanism for training readouts of a biological microcircuit.”
  28. Based on previous discussion, when data points themselves are trajectories, each slow feature itself won’t classify data (as it does in the vanilla case): “… we predict that the class information will be distributed over multiple slow features.”
  29. A “high-order” slow feature is one that is fast
  30. Now getting on to their simulated neural results.  Not sure how SFA ties in exactly yet
  31. They run SFA on the output of the neural circuit
  32. “We then trained linear SFA readouts on the 560-dimensional circuit trajectories, defined as the low-pass filtered spike trains of the spike response of all 560 neurons of the circuit…”
    1. The signal passed in consists of a sequence of static input, with Poisson noise in between
    2. At first glance, the slow features don’t seem to respond much to the distinction between signal/noise
    3. But on average, the first 2 slow features don’t respond at all to the noise, while responding with a noticeable pattern to the signal.
  33. “One interesting property of this setup is that if we apply SFA directly on the stimulus trajectories, we basically achieve the same result [as they do when running on the neural respones].  In fact, the application to the circuit is the harder task because of the variability of the response to repeated presentations  of the same pattern and because of temporal integration: the circuit integrates input over time, making the response during a pattern dependent on the noise input immediately before the start of the pattern.”  Noise in the circuit takes a little while to die down after the static signal is introduced, which makes it harder to pick up the static signal
    1. <I don’t really get this, because earlier in the paper they said that SFA is bad for classifying items that manifest as a time-series.  I thought the point of using a neural circuit was to try and circumvent that issue, but here they are saying it works better without the neural circuit?>
  34. Now moving on to recognition of spoken digits
  35. “We preprocessed the raw audio files with a model of the cochlea (Lyon, 1982) and converted the resulting analog cochleagrams into spike trains that serve as input to our microcircuit model (see section B.3.2 for details).”
  36. Classification between digit utterances of words “one” and “two”
  37. <Not grokking their training methodology for this section – not so clearly written, and I probably need to eat lunch.  They mention training two different times but I’m not clear on the purpose of this distinction>
  38. “We found that the two slowest features, y_1 and y_2, responded with shapes similar to half sine waves during the presence of a trajectory… which is in fact the slowest possible response under the unit variance constraint.  Higher-order features partly consisted of full sine wave responses, which are the slowest possible responses under the additional constraint to be decorrelated to previous slow features.”
  39. <Immediately following in next P> “In this example, the slowest feature y_1 already extracts the class of the input patterns almost perfectly: it responds with positive values for trajectories in response to utterances of digit 2 and with negative values for trajectories of digit 1 and generalizes this behavior to unseen test examples.”
  40. The first slow feature is closest to the FLD
  41. The first slow feature encodes what (digit), while the second slow feature encodes where (corresponding to a position in the trajectory identified by SF1).  This has been found with other application of SFA (Wiskott Sejnowski 02).  Other faster features encode a mixture of the two
  42. A linear classifier on the SFAs (itself with a linear kernel) is very effective (98%)
  43. Training FLDs and SVMs on the same data as SFA results in poorer performance <Hm.  SVMs are pretty powerful – wonder why this result is what they have>
  44. They then go onto the same classification problem but with more data: “Due to the increased number of different samples for each class (for each speaker, there are now 10 different digits), this task is more difficult than the speaker-independent digit recognition.”
    1. “No single slow feature extracts What-information alone; the closest feature to the FLD is feature y_3.  To some extent also, y_4 extracts discriminative information about the stimulus.”
    2. “In such a situation where the distance between the class means is very small, the tendency to extract the trajectory class itself as a slow feature becomes negligible. In that case, the theory predicts that SFA tries to distinguish each individual trajectory due to the decorrelation …”
  45. In this situation, SFA can be used as a preprocessor, but is not so useful as a classifier itself
  46. In some sense SFA is poorly suited to direct application on neural bursting information because such activity is inherently non-slow <Here maybe they low-pass filter it first?>
  47. When trying to do classification of time-series (sequences), “… the optimization problem of SFA can be viewed as a composition of two effects: the tendency to extract the trajectory class as a slow feature and the tendency to produce a smooth response during individual trajectories.”
  48. “In the context of biologically realistic neural circuits, this ability of an unsupervised learning mechanism is of particular interest because it could enable readout neurons, which typically receive inputs from a large number of presynaptic neurons of the circuit, to extract from the trajectory of network states information about the stimulus that has caused this particular sequence of states – without any ‘teacher’ or reward.”
  49. Earlier work by Berkes showed similar results between SFA and FLD for handwritten digit recognition, but the two were not formally linked in that work
  50. <An excellent paper with tons of good references – worth rereading.>
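The SFA–FLD equivalence (points 8 and 14–17) can be illustrated with a toy experiment of my own construction (not from the paper): build a time series whose consecutive points usually come from the same class, run linear SFA, and compare its slowest direction to the Fisher discriminant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian classes (toy data of my own).
means = np.array([[0.0, 0.0], [4.0, 0.0]])
n = 20000

# Markov chain over class labels: stay in the same class with probability 0.9,
# so temporally adjacent points usually share a class (the paper's parameter p).
labels = np.empty(n, dtype=int)
labels[0] = 0
for t in range(1, n):
    labels[t] = labels[t - 1] if rng.random() < 0.9 else 1 - labels[t - 1]
x = means[labels] + rng.standard_normal((n, 2))

def linear_sfa_direction(x):
    """Weight vector of the slowest linear SFA feature: the smallest
    generalized eigenvector of (derivative covariance, data covariance)."""
    xc = x - x.mean(axis=0)
    d, E = np.linalg.eigh(xc.T @ xc / len(xc))
    W = E / np.sqrt(d)              # whitening matrix
    dz = np.diff(xc @ W, axis=0)
    _, U = np.linalg.eigh(dz.T @ dz / len(dz))
    return W @ U[:, 0]              # back to input coordinates

def fld_direction(x, labels):
    """Two-class Fisher linear discriminant direction S_w^{-1} (m1 - m0)."""
    x0, x1 = x[labels == 0], x[labels == 1]
    Sw = np.cov(x0.T) + np.cov(x1.T)
    return np.linalg.solve(Sw, x1.mean(axis=0) - x0.mean(axis=0))

w_sfa = linear_sfa_direction(x)
w_fld = fld_direction(x, labels)
cos = abs(w_sfa @ w_fld) / (np.linalg.norm(w_sfa) * np.linalg.norm(w_fld))
print(f"|cosine| between SFA and FLD directions: {cos:.3f}")  # close to 1
```

Per point 22, this agreement should break down when the class means nearly coincide or when within-class sub-trajectories dominate the temporal statistics.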

A Causal Approach to Hierarchical Decomposition in Reinforcement Learning. Anders Jonsson. Dissertation 2006.

  1. Considers planning in domains with factored state
  2. Goal is to do hierarchical learning with state abstraction for each subtask
  3. Because function approximation is problematic, instead focuses on hierarchical decomposition (temporal abstraction) and state abstraction
  4. Three models of “activities” (temporally extended actions) in RL
    1. Hierarchical Abstract Machines / HAMs (Parr, Russell)
    2. Options (Sutton et al)
    3. MAXQ (Dietterich)
  5. Starts with description of the H-Tree algorithm, which is based on McCallum’s U-tree algorithm (which isn’t designed for options, and was instead designed for POMDPs)
    1. Results show that option-specific state abstraction makes learning faster
  6. Next is VISA, which uses causal relationships (DBNs) to find hierarchical decomposition
    1. VISA uses the DBN to create a causal graph that describes how to get features to change
    2. “The VISA algorithm is useful in tasks for which the values of key state variables change relatively infrequently”
  7. HEXQ determines causal relationships by using the heuristic of how often state variables change
  8. It identifies “exits”, which are <s,a> pairs that cause the values of state variables to change.
  9. VISA tries to identify exits as well, but because it is working with more information (stronger assumptions) it is able to make more accurate and effective decomposition than HEXQ
  10. “The VISA algorithm exploits sparse conditional dependence between state variables such that the causal graph contains two or more strongly connected components.  In addition, the algorithm is more efficient when changes in the values of state variables occur relatively infrequently.  How likely is a realistic task to display this type of structure?  It is impossible for us to determine the percentage of tasks in which the aforementioned structure is present.  However, our intuition tells us that it is not uncommon to encounter tasks that display this structure.”
  11. In the DBN model, there is one DBN for each action
    1. <I would put the action in the same DBN as an additional input; this is more general (works for continuous actions for one) and is more compact in cases where action selection is highly (or completely) constrained>
  12. It’s possible to learn DBNs from trajectory data, from an episodic generative model
    1. Their algorithm uses the Bayesian Information Criterion 
  13. Each of the four algorithms can be viewed as part of a larger system, which would:
    1. Learn a DBN
    2. Do hierarchical decomposition from the DBN
    3. Do (and then refine) state abstraction for each option
    4. Construct a compact representation of each option
    5. Learn the policy of each option, and the task option <the high-level policy?>
  14. The first step has the dominating computational cost
  15. Algorithms aren’t presented in this order, but in order they were developed
  16. The U-Tree algorithm is for POMDPs; it keeps track of a (finite) number of past observations, and attempts to use that history to predict the next observation
  17. It then treats each leaf in the tree as a state in the MDP, and makes predictions about the probability of transitioning from leaf to leaf, and does Q estimates on leaf-action pairs
  18. Like normal decision trees, starts out as a stump and expands the tree when the data warrants it (with a Kolmogorov-Smirnov test)
  19. Now onto the H(ierarchical)-tree
  20. It’s for semi-POMDPs
  21. Requires only minor changes, such as keeping track of option durations in the history
  22. Uses SMDP Q-learning over the leaves
  23. A criticism of U-tree is that you don’t know how to choose the length of the window for the history
    1. In H-Tree, this is less of a problem because options give much richer information, so in many problems only a couple of steps of history is needed
    2. <This doesn’t really solve the problem, it is perhaps a bit of a mitigation.  It seems elegant to me to do the same K-S test that is done on whether to grow the tree to also determine whether the history should be grown or not.>
  24. Does intra-option learning in the sense that an option that is common to two superoptions will be represented once
  25. Says it’s for POMDPs, but I’ve never seen it used for POMDPs <ah, it mentions this in the empirical section with Taxi – because it’s fully observable it just uses the current state with no other history>
  26. Epsilon-softmax exploration <you can’t do directed exploration with options, cause it just ends up slowing you down, although that proof didn’t exist at the time of this paper>
  27. There is faster learning and convergence to a better result when intra-option learning is used as opposed to leaving intra-option learning out <probably because the change in temperature for exploration goes too fast for learning without options>
  28. On the other hand, a plot of options vs. flat shows flat learning takes a little longer to get close to convergence (about twice as long), but it also converges to a better result.
    1. This is blamed on the incorrect abstraction done by the tree
    2. Also claim that epsilon-greedy exploration is worse for options <don’t buy that – it’s true that the wrong option will push you far away, but the right option gets you very close to the goal, and that is what really matters.  With flat, you have to get extremely lucky with a series of right choices until you get to the goal>
  29. Admittedly, the algorithm has too many parameters
  30. Now onto VISA, which develops options and state abstraction from a DBN
  31. Builds on ideas from HEX-Q.  Whereas HEXQ uses samples to try and figure out dependencies on how to get a particular feature to change with an “exit” (based on orderings of how frequently features change), VISA uses a DBN to do so
  32. In the coffee task, for example, HEX-Q produces a linear hierarchy that is not representative of the actual dynamics  <I’m not sure if it always does this purely linear relationship or if it can represent general DBNs given enough data>
  33. Visa’s structure cares only about changing the value of features
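Both HEXQ’s region construction (point 16 of the HEXQ notes above) and VISA’s causal-graph analysis rest on decomposing a directed graph into strongly connected components whose condensation is a DAG.  A minimal sketch using Kosaraju’s algorithm, on a small made-up graph (the graph and node names are hypothetical, not from either paper):

```python
def strongly_connected_components(graph):
    """Kosaraju's two-pass DFS.  `graph` maps node -> list of successors."""
    visited, order = set(), []

    def dfs(u, g, out):
        visited.add(u)
        for v in g.get(u, ()):
            if v not in visited:
                dfs(v, g, out)
        out.append(u)  # append on finish, so `out` is in finish order

    for u in graph:
        if u not in visited:
            dfs(u, graph, order)

    # Reverse all edges, then peel off components in reverse finish order.
    rev = {}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, []).append(u)

    visited.clear()
    components = []
    for u in reversed(order):
        if u not in visited:
            comp = []
            dfs(u, rev, comp)
            components.append(sorted(comp))
    return components

# Hypothetical causal graph: a and b influence each other (one component),
# as do c and d; b->c is the only edge between the two groups, so the
# condensation is a two-node DAG, which is what VISA exploits.
g = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
print(strongly_connected_components(g))  # [['a', 'b'], ['c', 'd']]
```

In VISA’s terms, each component becomes a candidate region, and the edges of the condensation DAG become the exits/entries that connect regions.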

Dissociating Hippocampal and Basal Ganglia Contributions to Category Learning Using Stimulus Novelty and Subjective Judgments. Seger, Dennison, Lopez-Paniagua, Peterson, Roark. Neuroimage 2011.

  1. “We identified factors leading to hippocampal and basal ganglia recruitment during categorization learning.”
  2. In the experiment there were alternating trial-and-error category learning tasks, interspersed with a subjective judgement task
    1. In the subjective task, subjects categorized the stimulus, but instead of receiving feedback they recorded the basis of their response with one of 4 options:
      1. Remember: <response is based on?> “Conscious episodic memory of previous trials.”
      2. Know-Automatic: “Automatic, rapid response accompanied by conscious awareness of category membership.”
      3. Know-Intuition: “A ‘gut feeling’ without fully conscious knowledge of category membership.”
      4. Guess: 
  3. “Categorization overall recruited both the basal ganglia and posterior hippocampus.”
    1. Use of basal ganglia showed up when making both types of “know-” based decisions
    2. Posterior hippocampus showed up with remember judgements
    3. <This distinction is a little unclear, as later they say> “First, we used subjects’ subjective judgments to dissociate trials performed on the basis of memory (and found to recruit the hippocampus), from trials performed in a subjectively automatic or intuitive way (found to recruit the basal ganglia).”
    4. <But if you know something how is it not based on memory – the language seems sloppy to me>
  4. “[Analysis shows] the putamen exerting directed influence on the posterior hippocampus, which in turn exerted directed influence on the posterior caudate nucleus.”
  5. “Our results indicate that subjective measures may be effective in dissociating basal ganglia from hippocampal dependent learning, and that the basal ganglia are involved in both conscious and unconscious learning.  They also indicate a dissociation within the hippocampus, in which the anterior regions are sensitive to novelty, and the posterior regions are involved in memory based categorization learning.”
  6. Commonly accepted that the basal ganglia are important for categorization.  “… they are particularly important for feedback-based categorization, in which subjects learn via trial and error.”
  7. The evidence of the role of hippocampus in categorization is less strong, although results show that anterior and posterior portions may be playing different roles

<Seems like review was incomplete>

Generating Feature Spaces for Linear Algorithms with Regularized Sparse Kernel Slow Feature Analysis. Bohmer, Grunewalder, Nickisch, Obermayer. Machine Learning 2012.

  1. Deals with a way of automatically constructing nonlinear basis functions via SFA
  2. “Real-world time series can be complex, and current SFA algorithms are either not powerful enough or tend to over-fit.  We make use of the kernel trick in combination with sparsification to develop a kernelized SFA algorithm which provides a powerful function class for large data sets.”
  3. Also uses regularization to prevent overfitting on small data sets
  4. Hypothesize that “…our algorithm generates a feature space that resembles a Fourier basis in the unknown space of latent variables underlying a given real-world time series.”
  5. Assume that solutions are defined on a low dimensional space of latent variables Θ which is embedded in the high dimensional space X that is actually observed
  6. Look for a feature space Φ s.t.
    1. For all i ∈ {1,…,p}, φi is non-linear in X in order to encode Θ as opposed to X
    2. For all i ∈ {1,…,p}, φi is “… a well behaving functional basis in Θ, e.g. a Fourier basis”
    3. Size of p is as low as possible while still representing Θ
  7. “Although there have been numerous studies highlighting its resemblance to biological sensor processing (…), the method has not yet found its way in the engineering community that focuses on the same problems.  One of the reasons is undoubtedly the lack of an easily operated non-linear extension that is powerful enough to generate a Fourier basis.”
  8. Based on Kernel SFA
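The “Fourier basis in the space of latent variables” hypothesis (point 4) echoes the classic SFA toy problem from Wiskott and Sejnowski: a slow latent variable observed only through a nonlinear, fast-varying mixture is recovered as the slowest feature.  The sketch below uses an explicit quadratic expansion rather than the paper’s kernelized, sparsified algorithm, so it is only an illustration of the underlying idea:

```python
import numpy as np

# Slow latent sin(t), observed only through a nonlinear fast-varying mixture
# (the classic SFA toy problem; the paper's kernel machinery replaces the
# explicit quadratic expansion used here).
t = np.linspace(0, 8 * np.pi, 8000)
x1 = np.sin(t) + np.cos(11 * t) ** 2
x2 = np.cos(11 * t)

# Explicit degree-2 polynomial feature space.
X = np.stack([x1, x2, x1**2, x1 * x2, x2**2], axis=1)
Xc = X - X.mean(axis=0)

# Linear SFA in the expanded space: whiten, then take the direction with
# the smallest variance of the time derivative.
d, E = np.linalg.eigh(Xc.T @ Xc / len(Xc))
Z = Xc @ (E / np.sqrt(d))
dZ = np.diff(Z, axis=0)
_, U = np.linalg.eigh(dZ.T @ dZ / len(dZ))
y = Z @ U[:, 0]  # slowest feature

# The slowest feature should be a scaled copy of the latent sin(t):
# note that x1 - x2**2 == sin(t) lies exactly in the expanded space.
corr = abs(np.corrcoef(y, np.sin(t))[0, 1])
print(f"|correlation| with the latent sin(t): {corr:.3f}")  # near 1
```

The recovered feature is the first harmonic of the latent variable; higher-order slow features pick out further harmonics, which is the sense in which the feature space “resembles a Fourier basis” in Θ.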

<Wordpress ate the rest of this post… Grr.>

Sources of Power. Klein. Book 1999.

This book is essentially the contrasting position to that of Kahneman, and is discussed in a chapter of his book.  More specifically, Kahneman focuses on how people generally make poor or irrational decisions, while Klein focuses on how humans who are experts in a particular domain achieve that expertise.

Overall, I would say the book is quite interesting, but if I had to choose between the results of Kahneman and Klein (if they disagreed on a particular point), I would have to go with Kahneman simply because his methods seem more objective; the claims here are (somewhat necessarily by nature) anecdotal and subjective in terms of their evaluation.  It is also a bit more hand-wavy.  Still, I think there is a lot of goodness here.

Chapter 1: Chronicling the Strengths Used in Making Difficult Decisions

  1. While a good deal of research shows how people are really terrible at many things (even when they are supposed to be experts, as described by many cases with Kahneman) this book focuses on real-world (as opposed to laboratory) domains where people achieve amazing performance
  2. “Shopping in a supermarket does not seem like an impressive skill until you contrast an experienced American shopper to a recent immigrant from Russia.”
    1. <See, shopping is interesting!>
  3. Gives an example of how firefighters/EMTs operate.  In one case they sped over to a house where a man had fallen and put his arm through a window, severing an artery.  The person who evaluated the scene said that the man had lost 2 units of blood, and that if he lost 4 he would be dead.  Based on where a bandage had been applied he could discern which artery was severed.  “Next, he might examine whether there are other injuries, maybe neck injuries, which might prevent him from moving the victim.  But he doesn’t bother with any more examination.  He can see the man is minutes from death, so there is no time to worry about anything else.”  He makes other important snap decisions, such as having the men apply a device that keeps blood pressure up while already moving in the ambulance, to save time.  From getting the initial alarm to getting the man to the hospital, only 10 minutes elapsed.
    1. In the example, the lieutenant “… handled many decision points yet spent little time on any one of them.  He drew on his experience to know just what to do.”  But how does that happen?
  4. Conventional sources of power “… include deductive logical thinking, analysis of probabilities, and statistical methods.”  But those that are useful in natural settings include “… intuition, mental simulation, metaphor, and storytelling.”
    1. Intuition “… enables us to size up a situation quickly.”
    2. Simulation lets us do roll-outs in our head
    3. Metaphor allows us to generalize context from one situation to another in some way
    4. Storytelling allows us to transfer our experience to others.
  5. This book is concerned with naturalistic decision making, which involves “time pressure, high stakes, experienced decision makers, inadequate information …, ill-defined goals, poorly defined procedures, cue learning, context (e.g. higher-level goals, stress), dynamic conditions, and team coordination (…).”
    1. With regard to time pressure, they estimate that 80% of decisions made by fireground commanders are made in less than a minute; many take only a handful of seconds.  They also study engineers who have tight deadlines; even though those deadlines are far in the future, the engineers too face time pressure, but in a different sense than firefighters, compared to whom “… they are almost on vacation.”
    2. In terms of high stakes, mistakes when firefighting can cost people their lives.  In finance, fortunes can be lost on a single decision.
    3. Here they consider experts – the firefighters they study have an average of 23 years of experience. “In contrast, in most laboratory studies, experience is considered a complicating factor.”
    4. In terms of unclear goals, in the firefighting example it’s not always clear whether you should try to save the building or simply spend your efforts ensuring the fire doesn’t spread.  Should you risk firefighters going into a building if you don’t know whether anyone is inside?

Chapter 2: Learning From Firefighters

  1. The initial hypothesis was that firefighters quickly pruned decision making to select between two options.  It seems that in most cases they didn’t even select between two; one option simply emerged.  One firefighter said “I don’t make decisions … I don’t remember when I’ve ever made a decision.”
    1. While creating options and selecting between them is something people often do (for example, when selecting a job), there is simply no time to consider options in time-critical scenarios
    2. <In Kahneman, I recall that the process is simply generating a single solution, simulating it mentally and then revising it/discarding it if (severe) problems arose when doing the rollout>
  2. The firefighters’ experience seemed to blend together – they didn’t try to find a “nearest neighbor” mapping from the current crisis to a specific event in their experience, although in some cases they would recall a particular occurrence from a previous fire while fighting a new one

Chapter 3: The Recognition-Primed Decision Model

  1. As mentioned, firefighters just seem to know what to do – even when it turns out the plan they were executing has failed, they will immediately summon another solution to deal with the scenario
    1. This conflicted with their hypothesis that people always choose between (at least) two decisions
  2. “The commanders’ secret was that their experience let them see a situation, even a nonroutine one, as an example of a prototype, so they knew the typical course of action right away.”  This is what they call the recognition-primed decision (RPD) model
  3. Sometimes though, there is nothing at all to map to because the situation is really unique and they simply have to invent a solution
  4. As mentioned before, when someone needs to take a while to consider options, it is generally by thinking something up, deciding whether it’s good or not (through rollouts), and, if not, thinking of something else, as opposed to a direct comparison between options
    1. This is called the singular evaluation approach
    2. This is basically the same idea as Herbert Simon’s satisficing - especially in time-critical situations you don’t have time to find the best option, so just find one that will do
    3. Experts will generally come up with a workable solution on the first shot, so they rarely have to consider more than one option
  5. They thought initially that it would be a characteristic of amateurs that they would jump to some option and that experts would carefully deliberate, but the actual data indicated the opposite
  6. “… there are times for deliberating about options.  Usually these are times when experience is inadequate and logical thinking is a substitute for recognizing a situation as typical… Deliberating about options makes a lot of sense for novices, who have to think their way through a decision.”
  7. Out of all the cases studied, 127 of 156 decisions were recognitional decisions / singular evaluations
    1. Most of the comparative evaluations came from situations where the decision makers were novices or were outclassed by the situation somehow
  8. When creating a plan it is necessary to identify “… what types of goals make sense (so that priorities are set), which cues are important (so there is not an overload of information), what to expect next (so they can prepare themselves and notice surprises), and the typical ways of responding in a given situation.  By recognizing a situation as typical they also recognize a course of action likely to succeed.  The recognition of goals, cues, expectancies, and actions is part of what it means to recognize a situation.  That is, the decision makers do not start with the goals or expectancies and figure out the nature of the situation.”
  9. The 156 decisions they recorded corresponded to the difficult decisions while undertaking the tasks; the simple/routine stuff was left out
  10. The RPD model stands in contrast to the traditionally accepted models of two-option comparison, and to work by Janis and Mann saying that people generally try to avoid decision making because it is difficult, but when they do decide, they use a very heavyweight process to come up with an answer; the approach here is quite the opposite.
    1. “Janis and Mann probably did not intend this advice for time-pressured situations, but the RPD model predominates even when time is sufficient for comparative evaluations.  Yet in one form or another, Janis and Mann’s prescriptive advice is held up as an ideal of rationality and finds its way into most courses on cognitive development.”
    2. That approach is more useful when dealing with novices, or when working in teams, where it provides a way to pool knowledge, reach agreement, and produce a process everyone can accept
  11. The implication of the chapter is that you don’t make someone an expert by having them exhaustively enumerate options and consider their values.  In fact, when you do that you “… run the risk of slowing the development of skills.”  The best approach is experience, plus training in which experts walk through what they would do, at a pace similar to what they would face in the real task.
    1. “The design of the scenarios is critical, since the goal is to show many common cases to facilitate a recognition of typicality along with different types of rare cases so trainees will be prepared for these as well.”
    2. “The emphasis is on being poised to act rather than being paralyzed until all the evaluations have been completed.”
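The singular-evaluation loop described in this chapter (generate one option, simulate it, take it if it's workable, otherwise move on) can be sketched in a few lines.  This is my own illustrative pseudocode-as-Python, not anything from the book; all the function names are made up:

```python
def singular_evaluation(situation, generate_option, simulate, acceptable):
    """Satisficing loop: generate one option at a time, mentally
    simulate it, and commit to the first workable one.  Note there is
    no pairwise comparison of alternatives anywhere."""
    while True:
        option = generate_option(situation)    # experience suggests one candidate
        outcome = simulate(situation, option)  # mental rollout
        if acceptable(outcome):
            return option                      # first workable option wins
        # otherwise discard (or revise) it and generate another

# Toy usage: candidate options arrive one at a time; "acceptable" means >= 5.
opts = iter([2, 7, 9])
choice = singular_evaluation(
    None,
    generate_option=lambda s: next(opts),
    simulate=lambda s, o: o,
    acceptable=lambda out: out >= 5,
)  # choice == 7; the third option is never even examined
```

The key structural point is what the loop lacks: there is no list of alternatives and no scoring function comparing them, which is exactly the contrast with rational-choice models.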

Chapter 4: The Power of Intuition

  1. Intuition is difficult to study because it’s difficult to describe, and people often don’t even know when they are using it
  2. A study showed that the brain basically makes a decision before the individual is even aware that it happened
  3. In one case a firefighter lieutenant saved his crew by deciding to pull his men out of a building with a fire that reacted strangely to initial attempts to put it out, but otherwise did not seem particularly dangerous (the building collapsed shortly after the men left).  The firefighter claimed that he used ESP, but on drilling down it turned out that he was spooked by the unusual circumstances and pulled the men out to figure out what was happening.
    1. After they (Klein and the firefighter) worked through the experience “I think he was proud to realize how his experience had come into play.  Even so, he was a little shaken since he had come to depend on his sixth sense to get him through difficult situations and it was unnerving for him to realize that he might never have had ESP.”
    2. “… he did not seem to be aware of how he was using his experience because he was not doing it consciously or deliberately.  He did not realize there were other ways [aside from ESP] he could have sized the situation up.”
  4. Here intuition can be considered recognizing something without knowing how or if recognition is happening
  5. The perspective here is that intuition comes from experience
  6. In the above example, the firefighter’s experience didn’t give him facts from memory, but it did change the way he saw and reacted to the situation.  Additionally:
    1. He was drawing on things that didn’t exist in his memory (this fire didn’t react in a way he had experienced before, so there was nothing concrete to think of)
    2. He wasn’t making decisions based on particular events but his aggregate experience and expectations of what should happen
  7. In its simplest instantiation, RPD is a model of intuition
  8. Work by others “… shows that people do worse at some decision tasks when they are asked to perform analyses of the reasons for their preferences or to evaluate all the attributes of the choices.”
  9. Of course, intuition isn’t infallible
  10. In the case where intuition is wrong, comparing the way events actually unfold to the way they were expected to allows us to figure out when we make mistakes
  11. There is another example of how a radar operator in the Gulf War was able to discern that a radar blip was an incoming missile and not an incoming friendly aircraft – the distinction was extremely subtle and took many people a long time to figure out, but this operator had the correct intuition, even though he was unsure why
  12. Another example is how nurses decide whether newborns have an infection – in some cases, by the time an antibiotic starts to take effect the infection is already too far along.  Nurses were often able to correctly identify when babies were infected, but were unable to articulate what rules they used.  Careful investigation of their methods, along with case studies of their decisions, eventually made the cues explicit.  Half of the rules they eventually discerned were new to the medical literature, and a number of the cues are the opposite of what they would be for adults.  Ultimately, their decisions were usually based on the combination of a number of subtle cues
  13. The best way to get people to improve their performance is simply to have them experience increasingly complex scenarios.  Compiling case studies is also very good (especially if someone can’t get access to sufficient experience, or the domain is very dangerous).  Simulations that are crafted with care can be even better: “A good simulation can sometimes provide more training than direct experience.  A good simulation lets you stop the action, back up and see what went on, and cram many trials together so a person can develop a sense of typicality.”

Chapter 5: The Power of Mental Simulation

  1. “During a visit to the National Fire Academy we met with one of the senior developers of training programs.  In the middle of the meeting, the man stood up, walked over to the door, and closed it.  Then in a hushed voice he said, ‘To be a good fireground commander, you need to have a rich fantasy life.’”
    1. What he meant was that it’s important to be able to imagine what scenarios could have led to the current situation, and to imagine how the situation could evolve going forward.
  2. By fantasy, he meant ability to imagine and simulate – to be able to do rollouts
  3. Mentions a paper by Kahneman and Tversky on the simulation heuristic: a person builds a simulation to explain how something may happen; if the simulation requires too many unlikely events, the outcome is judged implausible
  4. Mental simulations tend to be extremely simple, relying only on a “… few factors–rarely more than three.  It would be like designing a machine that had only three moving parts.  Perhaps the limits of our working memory had to be taken into account.”
    1. Usually limited to about 6 transitions, perhaps also a function of the limit of working memory
  5. 3 parts and 6 transitions are basically the limits we have to consciously work in
  6. Expertise in the area allows for more powerful modeling; several transitions can be compressed into one
    1. Basically more experience allows for more accurate, higher level abstractions that would leave novices bogged down
  7. Being able to work like this is critical for those that deal with code <Indeed, I think programming requires more use of a large working memory than any other task I’ve done, including symbolic math>
  8. “Considering all these factors, the job of building a mental simulation no longer seems easy.  The person assembling the mental simulation needs to have a lot of familiarity with the task and needs to be able to think at the right level of abstraction.  If the simulation is too detailed, it can chew up memory space.  If it is too abstract, it does not provide much help.”
  9. They then went on to study what leads mental simulations to fail (they previously gave an example where multiple agents weren’t able to communicate, and disaster ultimately occurred because it seemed too implausible for each party to get into the correct frame of mind)
  10. He gives an example of an economist who accurately predicted what would happen to Poland under essentially pure capitalism after the fall of communism.  The predictions came down to monitoring 3 variables: unemployment, inflation, and foreign exchange
    1. <It’s worth mentioning that Kahneman reports that studying political scientists and pundits in aggregate reveals they are quite poor at making predictions as a whole.  Maybe this is a small sample size issue, or maybe the real experts are the only ones that are actually good at making predictions and got washed out in the large sample Kahneman took.  It may be the latter, because even this individual’s students who spent time in Eastern Europe, and another political science professor who spent time in Poland (but wasn’t an expert in the area), weren’t able to make such simulations at all>
  11. “…without a sufficient amount of expertise and background knowledge, it may be difficult or impossible to build a mental simulation.”
  12. Mental simulation serves to explain the past and predict the future.  In either case we work from the current state and then work backwards or forwards
  13. Once a plan is constructed, it is evaluated according to:
    1. Coherence: does it make sense?
    2. Applicability: will it accomplish what I want?
    3. Completeness: does it do too much or too little?
  14. When doing a rollout you may realize there are transitions which rely on data you don’t know much about (that is, not that the transition is unlikely, but that confidence in the model is low)
  15. Experts can use intuition to predict if a strategy is effective, even without studying each particular step
  16. One problem with mental simulation is that sometimes we get stuck with a simulation and may be unwilling to discard it, even in the face of significant counter-evidence
    1. Perrow calls these de minimus explanations; they simply try to minimize inconsistency. “The operator forms an explanation and then proceeds to explain away the disconfirming evidence.”
  17. Naturally when pressed for time we may not find errors
  18. Another problem is that in some cases people will look for something wrong with a plan, find one thing, and then consider the evaluation done, while there may be other problems with the plan
  19. Usually we can catch when a plan starts to seem infeasible, but on occasion patching it repeatedly by small amounts can lead to something that won’t work yet still seems OK
    1. Catching yourself in the error is called snap-back
    2. This is called the garden path fallacy
  20. He also discusses the ideas of doing a pre-mortem (imagine a plan went wrong, now explain why).  Kahneman also discusses this idea, but I don’t recall it was in terms of his discussion of Klein
    1. This works well because by design people should not just say a plan is fine as they are instructed to assume that it was a failure and then work backwards from there (as opposed to their natural tendency to say it worked well)
    2. “It takes less than ten minutes to get the people to imagine the failure and its most likely causes.  The discussion that follows can go on for an hour.”
    3. These are useful not only because they can more successfully reveal flaws; they also can enable more effective contingency planning
  21. In talking about how planning at Shell went wrong: “This example shows how mental simulations can gain force when made explicit; the executives responded more favorably to decision scenarios than to forecasts based on statistics and error probabilities.”
    1. Kahneman would describe this as a System 1 / WYSIATI (what you see is all there is) error

Chapter 6: The Vincennes Shootdown

  1. The name is derived from the commanding officer on the USS Vincennes
  2. In one case he correctly decided not to shoot down enemy fighter planes that he correctly deduced were simply trying to provoke him, but didn’t really intend to attack.
  3. In the other case he incorrectly shot down an Iranian passenger jet, incorrectly taking it as an enemy plane
  4. The rest of the chapter will deal with this case study
  5. In this case, the missile was fired 9 seconds after the plane was recognized as potentially dangerous
  6. That day, the ship was attacked by other smaller ships, and its helicopter was fired on.  There were plenty of other reasons why the CO would consider the plane hostile
  7. Unfortunately, 2 key pieces of information were in error: that the plane had begun descending and that it was using an enemy beacon.
  8. Another ship in the area (which seems to have had less sophisticated tools for tracking aircraft as well) correctly figured that the plane was a passenger plane
  9. “Once the Vincennes’ crew became convinced that the track belonged to an F-14, that assumption colored the way they treated it and thought about it.”
    1. Context has a strong influence on decision making, especially when the data is highly ambiguous
  10. The interpretation started out the wrong way because an operator picked up an enemy military beacon, but that reading was due to operator error, so one mistake led to more mistakes
  11. US Navy analysis found “After this report of Mode II [enemy beacon], [a crew member] appear[ed] to have distorted data flow in an unconscious attempt to make available evidence fit a preconceived scenario (‘Scenario fulfillment’).”
  12. Other data shows that regardless of the flawed data they had, even the good data would lead one to believe that the plane was hostile; most of the analyses suffer from hindsight bias
  13. “If… the Vincennes had not fired and had been attacked by an F-14, the decision researchers would have still claimed that it was a clear case of bias, except this time the bias would have been to ignore the base rates [one base rate would have it at 98.7% chance of being hostile], to ignore the expectancies.  No one can win.”
  14. Basically there were many issues that caused errors that were outside the control of the sailors.
  15. The point they make for this chapter is that behavior followed “… the same pattern: the use of mental simulation to evaluate and rule out possible explanations.”
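The base-rate argument in point 13 can be made concrete with a toy Bayes update.  The 98.7% prior is the base rate quoted in the chapter; the likelihoods below are invented purely for illustration:

```python
def posterior_hostile(prior, p_evidence_given_hostile, p_evidence_given_friendly):
    """Bayes update for P(hostile | evidence)."""
    num = prior * p_evidence_given_hostile
    den = num + (1 - prior) * p_evidence_given_friendly
    return num / den

# Uninformative evidence (equally likely either way) leaves the prior intact;
# even evidence 10x more likely under "friendly" still leaves ~88% hostile.
unchanged = posterior_hostile(0.987, 0.5, 0.5)   # == 0.987
still_high = posterior_hostile(0.987, 0.1, 1.0)  # ~0.884
```

This is the sense in which "no one can win": with a prior that strong, even substantially disconfirming evidence leaves the hostile hypothesis dominant, so either decision can be painted as biased in hindsight.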

Chapter 7: Mental Simulation and Decision Making

  1. “Mental simulation shows up in at least three places in the RPD model: diagnosing to form situation awareness, generating expectancies to help verify situation awareness, and evaluating a course of action.”
  2. Situation awareness is making sense of a situation from clues, which is especially useful when we find ourselves in an atypical situation; an example of this is the way a mechanic or doctor makes a diagnosis
  3. “Situation awareness can be formed rapidly, through intuitive matching of features, or deliberately, through mental simulation.”  It may come from mapping the present onto past experience, or be used to choose between a number of candidate options
  4. Expectancies are developed by doing rollouts.  With more experience these become more exact.  “The greater the violations and the more effort it takes to explain away conflicting evidence, the less confident the decision maker feels about the mental simulation and diagnosis.”
  5. We generally don’t compare options because we are satisficing, not optimizing: “We think of grand masters as rational and analytical.  When de Groot asked them to think aloud while finding the best move in a chess problem, they relied on mental simulation to evaluate promising courses of action.  In de Groot’s published records, only five cases out of forty games, … show the grand masters comparing the strengths and weaknesses of one option to another.  The rest of the time they were rejecting moves or figuring out their consequences.”
  6. This is to say it’s not that we never do comparisons – when choosing a car to buy, for example, we often do such comparisons
    1. There are a number of strategies people use to do such comparisons (which generally simplify the problem), such as just picking the option that is the best in a single dimension, or iteratively thresholding items from most important to least important dimension
    2. Also more common among novices
    3. Which one is used depends on a number of criteria, such as the amount of time allowed, importance of the decision, whether the decision needs to be justified, etc…
    4. In other cases, however, this option selection may be made according to a criterion that is not easily describable.  That is again the case with chess players; they do an iterative deepening based on a criterion that is tough to describe
  7. When there is time pressure, decisions are more likely to be made according to RPD method as opposed to comparative method
    1. In these situations, “Virtually no time was spent in any comparisons of options.  In fact, the bulk of time was spent in situation assessment rather than alternative generation…”
  8. Significance of the RPD model:
    1. Seems to describe what people actually do most often
    2. Explains how people can use experience to make decisions
    3. Demonstrates that people can make effective decisions w/o using a rational choice strategy
  9. Unlike more formal systems of decision making, RPD is naturalistic.  When other schemes are taught, they seem to just reduce performance
  10. RPD says to get better at something experience is key, but there are factors that emerge among experts
    1. Practice is good – it should be undertaken such that the experience obtained in each episode of practice has a goal and evaluation criteria
    2. Have lots of experience
    3. Get accurate feedback that is diagnostic (without feedback, accumulated experience may not be useful)
    4. Get more use of past experiences by reviewing them from time to time to try and learn new lessons (for example, chess players don’t have much time to deliberate over earlier decisions during a match; they do post-mortems)
  11. “The strategies provide a concept that is consistent with principles of adult learning in which the learner is assumed to be motivated, and the emphasis is on granting autonomy and ownership to the learner rather than having the trainers maintain tight control.”
  12. Inasmuch as a task can be broken down into individual subcomponents, it’s useful to identify those components and practice each independently
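The two simplifying comparison strategies mentioned in point 6 (pick the option best on a single dimension; iteratively threshold dimensions from most to least important) are standard decision heuristics and can be sketched as follows.  The function names and the car example are mine, not the book's:

```python
def lexicographic_choice(options, dims):
    """Pick the option that is best on the most important dimension,
    breaking ties with successively less important dimensions.
    `options` is a list of dicts; `dims` lists dimension names in
    decreasing order of importance."""
    return max(options, key=lambda o: tuple(o[d] for d in dims))

def elimination_by_aspects(options, thresholds):
    """Drop options failing a cutoff on each dimension, working from
    the most to the least important dimension; always keep at least
    one candidate alive."""
    survivors = list(options)
    for dim, cutoff in thresholds:
        passed = [o for o in survivors if o[dim] >= cutoff]
        if passed:
            survivors = passed
    return survivors

cars = [{"price": 3, "safety": 5}, {"price": 4, "safety": 2}]
best = lexicographic_choice(cars, ["safety", "price"])        # the safer car
left = elimination_by_aspects(cars, [("safety", 3)])          # [the safer car]
```

Both strategies simplify the problem by never computing a full weighted score across all dimensions, which matches the chapter's point that even deliberate comparison is usually done with shortcuts.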

Chapter 8: The Power to Spot Leverage Points

  1. A leverage point is a place where a small amount of effort can lead to a large change
  2. A case study is given where a doctor relied on past experience that was in many ways different from a present challenge, but used that experience to develop a solution to a complex and time-critical problem
    1. The trick was the two problems had the same leverage point; the trick that worked in one worked in the other
  3. “Skillful problem solving is impressive because after the fact, the solution seems obvious, yet we know that, without any guidance, most people would miss the answer.  They would not even know that an answer was possible.”
  4. Their concept of leverage points first arose in chess, where often a particular situation would be sought that would cause them to have an advantage, such as being in a position to attack the opponent’s queen
  5. It can also be other sorts of strategic decisions, such as IBM investing in the System/360 (which they say cost more than the Manhattan Project)
  6. “Leverage points are just possibilities–pressure points that might lead to something useful, or might go nowhere.  Expertise may be valuable in noticing these leverage points.”
  7. “Leverage points provide fragmentary action sequences, kernel ideas, and procedures for formulating a position.  Experts seem to have a larger stock of procedures that they can think of to use as starting points in building a new plan or strategy… Novices, in contrast, are often at a loss about where to begin.”
  8. A leverage point that can work against you is called a choke point
  9. “Once we come up with leverage points, we need to fill in the remaining details.”

Chapter 9: Nonlinear Aspects of Problem Solving

  1. “The concept of leverage points opens the way to think about problem solving as a constructive process.  It is constructive in the sense that solutions can be built up from the leverage points and that the very nature of the goal can be clarified while the problem solver is trying to develop a solution.”
  2. This approach can be traced back to Karl Duncker of the Gestalt school.  “Rather than treating thought as calculating ways of manipulating symbols, the Gestaltists viewed thought as learning to see better, using skills such as pattern recognition.”
  3. “To solve ill-defined problems, we need to add to our understanding of the goal [or decide what it would even be] at the same time we generate and evaluate courses of action to accomplish it.  When we use mental simulation to evaluate the course of action and find it inadequate, we learn more about the goal we are pursuing.”  Failures also lead to new understanding
  4. They mean nonlinear in that there is not necessarily one smooth transition from start to finish while attacking a problem.  It may require multiple iterations of deepening understanding, planning, simulating, and revising, and the order in which these steps occur can vary.
  5. Part of problem solving tells us how long we expect a solution to take
  6. “We have to balance between looking for ways to reach goals and looking for opportunities that will reshape the goals.”
  7. Standard research on decision making focuses on well-defined problems, as this makes many types of analyses possible.  Many problems we face in the real world, however, are poorly defined.  This makes the first step, defining the goal, difficult.  If you wait for a well-defined problem, you can’t really do much of anything
  8. Goes on to how AI approaches to problem solving aren’t what we do <not really taking notes on this>.  They also work in problem spaces we don’t always work in (well defined, clear goal)
    1. The main point is that AI approaches are similar to the analytical approaches that RPD stands in opposition to, and if we are usually doing RPD (at least at a conscious level), then the way classical AI solves problems is probably not what we do
  9. Example of the Apollo 13 mission

<Ok, out of time on this book – think I got the gist.>

