DRAW: A Recurrent Neural Network For Image Generation. Gregor, Danihelka, Graves, Rezende, Wierstra. JMLR 2015

  1. “…introduces the Deep Recurrent Attentive Writer (DRAW)…  DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images.”
  2. Can generate house numbers from the Google Street View House Numbers dataset that cannot be distinguished from real images with the naked eye
  3. Instead of generating images all at once, this approach tries the equivalent of sketching an image first and then refining it
  4. “The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function is a variational upper bound on the log-likelihood of the data. It therefore belongs to the family of variational auto-encoders…”
  5. They do not train the attentional system with RL; the attention mechanism is fully differentiable, so they use backprop. “In this sense it resembles the selective read and write operations developed for the Neural Turing Machine (Graves et al., 2014).”
  6. “tion over images. However there are three key differences. Firstly, both the encoder and decoder are recurrent networks in DRAW, so that a sequence of code samples is exchanged between them; moreover the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behaviour so far. Secondly, the decoder’s outputs are successively added to the distribution that will ultimately generate the data, as opposed to emitting this distribution in a single step. And thirdly, a dynamically updated attention mechanism is used to restrict both the input region observed by the encoder, and the output region modified by the decoder. In simple terms, the network decides at each time-step “where to read” and “where to write” as well as “what to write”.”
  7. The output of the encoder network is a hidden vector
  8. They use LSTM for their recurrent network
  9. The output of the encoder is used to parameterize a distribution over a latent vector (which is a diagonal Gaussian).  They use a diagonal Gaussian instead of the more common Bernoulli distribution because it has a gradient that is easier to work with
  10. A sample from the latent distribution is then passed as input to the decoder
    1. The output of the decoder is added cumulatively to a canvas matrix which creates an image.  The number of steps used to write to the canvas is a parameter to the algorithm
  11. “The total loss is therefore equivalent to the expected compression of the data by the decoder and prior. “
  12. “…, we consider an explicitly two-dimensional form of attention, where an array of 2D Gaussian filters is applied to the image, yielding an image ‘patch’ of smoothly varying location and zoom.”
  13. <skipping a bunch>
  14. Generated images of MNIST and street house numbers look good, but generated natural images look very blurry and not much like anything identifiable, although there is clear structure in what is generated.
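
A minimal sketch of the encode / sample / decode / write loop from items 7 through 10 above, assuming NumPy. To stay short it swaps the paper's LSTMs for plain tanh recurrent layers, leaves out attention entirely, and uses random untrained weights. The function name and the sizes `h_dim`, `z_dim` and `T` are my own placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def draw_step_loop(x, T=8, h_dim=32, z_dim=10):
    """Run T canvas-writing steps on a flattened image x with values in [0, 1].

    Returns the final image (sigmoid of the canvas) and the summed KL term.
    """
    d = x.size
    # Random weights stand in for learned parameters (assumption for the sketch).
    W_enc = rng.normal(0.0, 0.1, (h_dim, 2 * d + 2 * h_dim))
    W_mu = rng.normal(0.0, 0.1, (z_dim, h_dim))
    W_log_sigma = rng.normal(0.0, 0.1, (z_dim, h_dim))
    W_dec = rng.normal(0.0, 0.1, (h_dim, z_dim + h_dim))
    W_write = rng.normal(0.0, 0.1, (d, h_dim))

    canvas = np.zeros(d)
    h_enc = np.zeros(h_dim)
    h_dec = np.zeros(h_dim)
    kl_total = 0.0
    for _ in range(T):
        # Error image: what the canvas still gets wrong.
        x_err = x - sigmoid(canvas)
        # The encoder sees the image, the error image and the decoder's previous state.
        h_enc = np.tanh(W_enc @ np.concatenate([x, x_err, h_enc, h_dec]))
        # The encoder output parameterizes a diagonal Gaussian over the latent vector.
        mu = W_mu @ h_enc
        log_sigma = W_log_sigma @ h_enc
        # Reparameterized sample, so gradients could flow through mu and sigma.
        z = mu + np.exp(log_sigma) * rng.standard_normal(z_dim)
        # KL(N(mu, sigma^2) || N(0, I)) for this step.
        kl_total += 0.5 * np.sum(mu**2 + np.exp(2.0 * log_sigma) - 2.0 * log_sigma - 1.0)
        # The decoder consumes the sample and adds its output to the canvas.
        h_dec = np.tanh(W_dec @ np.concatenate([z, h_dec]))
        canvas = canvas + W_write @ h_dec
    return sigmoid(canvas), kl_total
```

The number of steps `T` plays the role of the "number of steps used to write to the canvas" parameter; a trained model would minimize the reconstruction loss of the final image plus `kl_total`.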

One-shot learning by inverting a compositional causal process. Lake, Salakhutdinov, Tenenbaum. NIPS 2013

  1. Deals with one-shot learning
  2. “…a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image.”
  3. In 1-shot learning did about as well as people, and better than deep learning methods
  4. People can learn a new concept from a tiny amount of data – can learn a class from one image (which is a very high dimensional piece of data)
  5. Even MNIST, which is an old and small dataset, still has 6k samples/class, but people often only need 1 example
  6. “Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. ”
    1. Looks like people both have a rich hypothesis space (because they can do all the above with a small amount of data), but also don’t overfit, which is the theoretical downside to having a large hypothesis class.  How do they do it?
  7. Here, focus is on handwritten characters
    1. Idea is to use something more natural and complex than simple synthetic stimuli, but something less complex than natural images
  8. Use a new Omniglot dataset, which has about 1,600 characters (from 50 alphabets) with 20 examples each
    1. Also has time data so the strokes are recorded as well
  9. “… this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example.”
  10. Also use the method to generate new examples of a class, and then do a Turing test with it by asking other humans which was human generated and which was machine generated
  11. The HBPL “…is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes.”
  12. The model attempts to find a “structural description” that explains the image by breaking the character down into parts
  13. A Character is made of:
    1. A set of strokes
      1. Each stroke is made of simple sub-strokes modeled by a “uniform cubic b-spline” and is built of primitive motor elements that are defined by a 1st order Markov Process
    2. Set of spatial relationships between strokes, can be:
      1. Independent: A stroke that has a location independent of other strokes
      2. Start/end: A stroke that starts at beginning/end of another stroke
      3. Along: A stroke that starts somewhere along a previous stroke
  14. ” Each trajectory … is a deterministic function of a starting location … token-level control points … and token-level scale …. The control points and scale are noisy versions of their type-level counterparts…”
  15. Used 30 most common alphabets for training, and another 20 for evaluation.  The training set was used to learn hyperparameters, a set of 1000 primitive motor elements, and stroke placement.  They attempted to do cross-validation within the training set
  16. The full set of possible ways a stroke could be created is enormous, so they have a bottom-up way of finding a set of the K most likely parses.  They approximate the posterior with this finite, size-K sample, weighting each parse by its relative likelihood
    1. They then use Metropolis-Hastings to get a number of samples around each parse, each with a little variance, to get a better estimate of the likelihoods
  17. “Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables…”
  18. Results
  19. For the 1-shot tasks, a letter from an alphabet was presented with 20 other letters from the same alphabet.  Each person did this 10 times, but each time was with a totally new alphabet, so no character was ever seen twice
  20. Get K=5 parses of each character presented (along with MCMC), and then run K gradient searches to reoptimize the token-level variables to fit the query image.
  21. They can also, however, attempt to reoptimize the query image to fit the 20 options presented
  22. Compare against:
    1. Affine model
    2. Deep Boltzmann Machines
    3. Hierarchical Deep Model
    4. Simple Strokes (a simplified HBPL)
    5. NN
  23. Humans and HBPL ~4.5% error rate, affine model next at 18.2%
  24. Then they did one-shot Turing test where people and algorithms had to copy a single query character
    1. <For what it’s worth, I think Affine looks better than both results from people and HBPL>
  25. In the “Turing test” there was feedback after each 10 trials, for a total of 50 trials
    1. <Note that this test doesn’t ask which character looks best, it is which is most confusable with human writing (which is pretty sloppy from the images they show).  I’m curious if the affine model could be made more human just by adding noise to its output>
  26. <Playing devil’s advocate, the images of characters were collected on mTurk, and look like they were probably drawn with a mouse — that is to say I feel they don’t look completely like natural handwriting.  I wonder how much of this program is picking up on those artifacts?  At least in terms of reproduction, the affine method looks best>
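
A rough sketch of the size-K posterior approximation from item 16 and the classification rule it feeds into, assuming NumPy. The parse scores here are made-up numbers and both function names are hypothetical; the paper's actual scoring involves re-fitting token-level variables, which is not modeled here.

```python
import numpy as np


def parse_weights(log_scores):
    """Turn K unnormalized log-probabilities of parses into weights that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtract the max before exponentiating so very negative scores do not underflow.
    shifted = log_scores - log_scores.max()
    w = np.exp(shifted)
    return w / w.sum()


def classify_one_shot(log_pred_scores):
    """Pick the class whose K parses best explain the query image.

    log_pred_scores[c][k] is assumed to be the log score of the query under
    parse k of class c's single training image, with the parse's log weight
    already added in.
    """
    totals = [np.logaddexp.reduce(np.asarray(s, dtype=float)) for s in log_pred_scores]
    return int(np.argmax(totals))


# Two parses where the second is three times as likely as the first.
weights = parse_weights([-1000.0, -1000.0 + np.log(3.0)])
# Two candidate classes with K=2 parses each; class 1 explains the query better.
best = classify_one_shot([[-5.0, -6.0], [-2.0, -9.0]])
```

Working in log space matters here because the likelihood of a whole character image is tiny, so normalizing raw probabilities would underflow to zero.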

 

Science 2015

  1. “Concepts are represented as simple probabilistic programs—that is, probabilistic generative models expressed as structured procedures in an abstract description language (…). Our framework brings together three key ideas—compositionality, causality, and learning to learn—that have been separately influential in cognitive science and machine learning over the past several decades (…). As programs, rich concepts can be built “compositionally” from simpler primitives.”
  2. “In short, BPL can construct new programs by reusing the pieces of existing ones, capturing the causal and compositional properties of real-world generative processes operating on multiple scales.”
  3. <Looks like exactly the same paper, just more brief.  The accuracies of both BPL and other methods seem improved here, though.  Convnets get 13.5% error; BPL gets 3.3%; people get 4.5%.  “A deep Siamese convolutional network optimized for this one-shot learning task achieved 8.0% errors”>
  4. “BPL’s advantage points to the benefits of modeling the underlying causal process in learning concepts, a strategy different from the particular deep learning approaches examined here.”
    1. <Or equivalently you can just say BPL does better because it has a small and highly engineered hypothesis class>
  5. They also run BPL with various “lesions”, which gives error rates in the teens.  The lesioned models also did more poorly in the “Turing test” part
  6. Instead of training on 30 background alphabets, they also did with just 5, and there the error rates are about 4%; on the same set convnets did about 20% error
  7. Supplementary Material

  8. <I assumed that they would ask individuals who actually learned how to write the languages to do the recordings.  Instead, they just took pictures of characters and had people write them.  This seems like a problem to me because of inconsistencies in the way people would actually do the strokes of a letter in an alphabet they do not know.>
  9. <Indeed, they were also drawn by mouse in a box on a screen, which is a very unnatural way to do things>
  10. <From what I can tell the characters are recorded in pretty low resolution as well which looks like it can cause artifacts, looks like 105×105>
  11. <This basically has the details that were included in the main part of the NIPS paper>
  12. Some extra tricks like convolving with Gaussian filter, randomly flipping bits
  13. Primitives are scale-selective
  14. “For each image, the center of mass and range of the inked pixels was computed. Second, images were grouped by character, and a transformation (scaling and translation) was computed for each image so that its mean and range matched the group average.”
  15. ” In principle, generic MCMC algorithms such as the one explored in (66) can be used, but we have found this approach to be slow, prone to local minima, and poor at switching between different parses. Instead, inspired by the speed of human perception and approaches for faster inference in probabilistic programs (67), we explored bottom-up methods to compute a fast structural analysis and propose values of the latent variables in BPL. This produces a large set of possible motor programs – each approximately fit to the image of interest. The most promising motor programs are chosen and refined with continuous optimization and MCMC.”
  16. “A candidate parse is generated by taking a random walk on the character skeleton with a “pen,” visiting nodes until each edge has been traversed at least once. Since the parse space grows exponentially in the number of edges, biased random walks are necessary to explore the most interesting parts of the space for large characters. The random walker stochastically prefers actions A that minimize the local angle of the stroke trajectory around the decision point…”
  17. For the ANN they used Caffe, and took a network that works well on MNIST
    1. <But it seems like this system doesn’t have any of the special engineering that went into this that deals specifically with strokes as opposed to whole images>
    2. “The raw data was resized to 28 x 28 pixels and each image was centered based on its center of mass as in MNIST. We tried seven different architectures varying in depth and layer size, and we reported the model that performed best on the one-shot learning task.”
    3. <This may make the task easier, but MNIST deals with a small number of characters, many of which are much less complex than some of the characters used here.   It might be the case that some of the more complex characters can’t be accurately reduced to such a small size, so this may be hobbling performance>
    4. Also the network is not very deep – only 2 conv layers and a max-pooling
    5. “One-shot classification was performed by computing image similarity through the feature representation in the 3000 unit hidden layer and using cosine similarity.”
    6. They used a smaller net for the 1-shot classification with less data, <so that was nice of them>
  18. The full “Siamese network” did work on the 105×105 image, had 4 conv layers and 1 standard hidden layer.  Parameters were optimized with Bayesian method
  19. “The Hierarchical Deep model is more “compositional” than the deep convnet, since learning-to-learn endows it with a library of high-level object parts (29). However, the model lacks a abstract causal knowledge of strokes, and its internal representation is quite different than an explicit motor program. “
  20. For data collection “The raw mouse trajectories contain jitter and discretization artifacts, and thus spline smoothing was applied.”
  21. <Ok, skipping the rest>
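
The convnet baseline's one-shot rule in item 17.5 (cosine similarity in a hidden-layer feature space) can be sketched as below, assuming NumPy. The feature vectors are tiny stand-ins for the 3000-unit hidden layer, and `cosine_one_shot` is a name I made up.

```python
import numpy as np


def cosine_one_shot(query_feat, support_feats):
    """Return the index of the support example most similar to the query.

    query_feat: 1-D feature vector for the query image.
    support_feats: 2-D array with one feature vector per candidate class.
    """
    q = query_feat / np.linalg.norm(query_feat)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    sims = s @ q  # cosine similarity of the query to each candidate
    return int(np.argmax(sims)), sims


support = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0]])
query = np.array([2.0, 1.9, 0.0])
match, sims = cosine_one_shot(query, support)
```

Because only the angle between feature vectors matters, this rule ignores overall activation magnitude; everything hinges on how good the learned features are.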

Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015

  1. Working on high-D continuous RL
  2. Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”
  3. Use MPC/control related methods that were done back in ’04 but revisited here and can be used for real-time control now
  4. Test in “extended” cart-pole <all this means here is the start state is randomized> and  quadcopter
  5. Don’t try to do MCTS, because it is expensive.  Instead use gradient optimization
  6. Instead of normal O(n^3) costs for GPs, this has O(m^2n), where m < n
  7. “However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such model-predictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-time-varying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
  8. Looks like the type of control they use has to linearize the model locally
  9. “For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
  10. Use an off-the-shelf convex solver for doing the MPC optimization
  11. Use warm starts for replanning
  12. The optimization converges in a handful of steps
  13. <Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>
  14. Although the cart-pole is a simple task, they learn it in less than 5 episodes
    1. <But why no error bars, especially when this experiment probably takes a few seconds to run?  This is crazy in a paper from 2015.  It is probably fine, but it makes me wonder if it sometimes fails to get a good policy>
  15. Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem
    1. 8D state, 2D action
  16. For the quadcopter, it looks like training data from a real quadcopter <?> is fed in, and then it is run in simulation
  17. “By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”
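
To make the receding-horizon idea in items 7 through 11 concrete, here is a small sketch assuming NumPy. It is not the paper's method: it uses a known linear model rather than a learned sparse GP, has no state-action constraints, and solves the quadratic problem with a backward Riccati recursion instead of an off-the-shelf constrained solver. The double-integrator matrices and all function names are my own choices.

```python
import numpy as np


def finite_horizon_lqr(A, B, Q, R, horizon):
    """Backward Riccati recursion; returns gains K_0..K_{H-1} with u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        # Only adjacent time steps are coupled, so each stage is a small solve.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # computed from the last stage backwards


def run_mpc(A, B, Q, R, x0, horizon=20, steps=50):
    """Replan at every step and apply only the first action of each plan."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        # With a learned nonlinear model, A and B would be re-linearized here.
        K0 = finite_horizon_lqr(A, B, Q, R, horizon)[0]
        u = -K0 @ x
        x = A @ x + B @ u
        traj.append(x.copy())
    return np.array(traj)


dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # position, velocity
B = np.array([[0.5 * dt**2], [dt]])     # acceleration input
Q = np.eye(2)
R = 0.1 * np.eye(1)
traj = run_mpc(A, B, Q, R, x0=[1.0, 0.0])
```

Since this toy model never changes, replanning each step is redundant; it is kept only to mirror the replanning structure the notes describe, where warm starts make the repeated solves cheap.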

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Ioffe, Szegedy. Arxiv 2015

  1. A problem with training ANNs is that as training occurs, the distribution of inputs for higher layers changes (called internal covariate shift).  Here they do normalization <a simplified, per-dimension form of whitening at each layer> of inputs for each mini batch.
  2. Trains faster, trains to better results, is itself a form of regularization so removes need for dropout in some cases
  3. Saturation occurs frequently as a result of covariate shift.  If we can avoid that then it may make it easier to train with larger learning rates
  4. “Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.”
  5. Normalization parameters must be computed inside the gradient descent step (so in batch mode, and not online).  This can be shown both theoretically and in practice
  6. Normalization is done on each input dimension independently (full whitening of the layer inputs is computationally costly and not everywhere differentiable, and the transform has to be differentiable to train through it)
    1. Later on: “Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. “
  7. In order to make sure the normalization doesn’t ruin expressability of the layer, a constraint is that “the transformation inserted in the network can represent the identity transform”
  8. “In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.”
  9. Naturally it also helps deal with scaling issues in the inputs
  10. “Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.”
  11. Because it acts as regularization, it can remove the need for dropout and reduce the need for other forms of regularization (such as L2 weight regularization); it also makes saturating nonlinearities usable, so ReLUs are less essential
  12. Gets state-of-the-art results on ImageNet, and reaches human-level performance
  13. “Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network.”
  14. State that this may help with training problems that are part of RNNs
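
A minimal sketch of the training-time BN transform described in items 6, 7 and 13, assuming NumPy: each feature is normalized over the mini batch, then scaled and shifted by two learned parameters per activation. The function name and the `eps` default are my own; inference-time running averages and the backward pass are left out.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of a mini batch, then scale and shift.

    x: array of shape (batch_size, num_features).
    gamma, beta: per-feature scale and shift, shape (num_features,).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# Three features on very different scales.
x = rng.normal(size=(64, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([0.0, 5.0, -3.0])

# With gamma = 1 and beta = 0 every feature comes out with mean 0 and variance near 1.
y = batch_norm_forward(x, np.ones(3), np.zeros(3))

# Choosing gamma and beta to undo the normalization recovers the identity transform.
eps = 1e-5
y_identity = batch_norm_forward(x, np.sqrt(x.var(axis=0) + eps), x.mean(axis=0), eps)
```

The last two lines illustrate the constraint quoted in item 7: the layer can still represent the identity, so normalization does not cost expressive power.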

Continuous Control with Deep Reinforcement Learning. Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver. Arxiv 2015

  1. Extension of deep QL to continuous actions
  2. Actor-critic, model-free
  3. Deterministic policy gradient
  4. Show the algorithm running on 20 tasks in a physics simulator
  5. A followup to deterministic policy gradient paper
  6. Uses most of the tricks from the Atari paper plus something relatively new called batch normalization
  7. Ran algorithms directly on joint angle data as well as simulated camera images
  8. Alg is called deep deterministic policy gradient
  9. They are able to use same parameters for the direct state information as well as visual data
  10. Method is simple and pretty straightforward actor-critic
  11. They compare results to a planner that has access to a generative model
  12. DDPG can sometimes outperform the planner that accesses the generative model, even in some cases when working only from the visual data
  13. DPG requires:
    1. A parameterized actor function which is a mapping from states to actions
    2. Critic, which has Q-function
  14. NFQCA is basically the same as DPG but uses an NN as a function approximator.  The issue is that it uses batch learning, which doesn’t scale well.
    1. The minibatch version of this algorithm is equivalent to the original formulation of DPG
  15. Do “soft” updates of the network which makes weights change more slowly but helps prevent divergence
    1. “This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist. “
    2. Did this both for the policy and Q
  16. Method of batch normalization is an approach that helps deal with issue that different parts of the state vector may have different scales and meanings
    1. <From what I can tell here, this looks like it basically does minibatch whitening of the data, is it really such a new idea?  Need to check the paper where it is introduced.>
  17. They just add noise to the actor’s output in order to do exploration (temporally correlated noise from an Ornstein-Uhlenbeck process)
  18. Most of the problems look like they came from MuJoCo, some in 2D and some in 3D, but they also did racing in TORCS
  19. Similar to the atari papers they use the last 3 frames of data to represent state
  20. Visual data is downsampled to 64×64, 8-bit
  21. “Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor. This may be due to the action repeats making the problem simpler. It may also be that the convolutional layers provide an easily separable representation of state space, which is straightforward for the higher layers to learn on quickly.”
  22. The planner they compare against is iLQG which <I think> is a locally-optimal controller
    1. It needs not only the model but also its derivatives
  23. “The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear function approximators. It demonstrated data efficiency advantages for off-policy DPG over both on- and off-policy stochastic actor critic. It also solved one more challenging task in which a multijointed octopus arm had to strike a target with any part of the limb. However, that paper did not demonstrate scaling the approach to large, high-dimensional observation spaces as we have here.”
  24. “It has often been assumed that standard policy search methods such as those explored in the present work are simply too fragile to scale to difficult problems [17]. Standard policy search is thought to be difficult because it deals simultaneously with complex environmental dynamics and a complex policy. Indeed, most past work with actor-critic and policy optimization approaches have had difficulty scaling up to more challenging problems [18]. Typically, this is due to instability in learning wherein progress on a problem is either destroyed by subsequent learning updates, or else learning is too slow to be practical.”
  25. Similar to guided policy search <?>
  26. Looks like Q estimates are close to the returns that the policies generate
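
The “soft” target update from item 15 is simple enough to sketch directly, assuming NumPy and a dict-of-arrays stand-in for network weights. The function name and the dict layout are my own; a real implementation would apply the same rule to the actor and critic parameters of whatever framework is in use.

```python
import numpy as np


def soft_update(target, source, tau=0.001):
    """Move each target parameter a small fraction tau toward the learned one.

    target, source: dicts mapping parameter names to arrays of equal shape.
    """
    for name in target:
        target[name] = tau * source[name] + (1.0 - tau) * target[name]
    return target


learned = {"w": np.ones(2)}
target = {"w": np.zeros(2)}
soft_update(target, learned, tau=0.1)   # w is now 0.1
soft_update(target, learned, tau=0.1)   # w is now 0.19
```

With a small `tau` the target networks lag far behind the learned ones, which is what makes the regression targets for Q change slowly and pushes the problem toward ordinary supervised learning.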

Distinguishing Conjoint and Independent Neural Tuning for Stimulus Features With fMRI Adaptation. Drucker, Kerr, Aguirre. Innovative Methodology 2009.

  1. “We describe an application of functional magnetic resonance imaging (fMRI) adaptation to distinguish between independent and conjoint neural representations of dimensions by examining the neural signal evoked by changes in one versus two stimulus dimensions and considering the metric of two-dimension additivity.”
  2. “Do different neurons represent individual stimulus dimensions or could one neuron be tuned to represent multiple dimensions?”
  3. “This study describes an application of functional magnetic resonance imaging (fMRI) to distinguish conjoint from independent representation of two stimulus dimensions within a spatially restricted population of neurons.”
  4. “If the recovery for a combined change is simply the additive combination of the recovery for each dimension in isolation, we take this as evidence for independent neural populations. When the neural recovery for a combined change is subadditive, this may reflect populations consisting of neurons that conjointly represent the two stimulus dimensions.”
  5. “These two possibilities could be distinguished directly by measuring the tuning of individual neurons. However, the signal obtained with BOLD fMRI averages the population neural response from a voxel, making this measurement unavailable.”
  6. “To distinguish conjoint and independent tuning in this case, we must measure the properties of the neural population using adaptation methods.”
  7. “In summary, we may distinguish between conjoint and independent tuning of neurons in a population by comparing the recovery from adaptation for combined transitions to that seen for isolated transitions along each stimulus dimension.”
  8. “In theory, one could conduct the test described earlier by measuring the BOLD fMRI response to three stimulus pairs: a pair that differs only in color, a pair that differs only in shape, and a pair that differs in both color and shape.”
    1. To make this robust though, more thorough sampling is needed
  9. Looking at the difference between Manhattan and Euclidean distance helps figure out how neurons respond to stimuli with multiple dimensions (additive/independent or not)
  10. Before interpreting how neurons respond, another issue has to be addressed: how linear changes in the stimuli map to changes in neural response, which is assumed to be nonlinear <I figure they measure this as they vary only 1 dimension?>
  11. They have a model to estimate the nonlinearity <trying to deal with varying forms of nonlinearities seems complex the way they do it; maybe there is a better way>
  12. “A different violation of the model assumptions occurs when the underlying neural representation is independent for the stimulus dimensions, but its neural instantiation is not aligned with the assumed dimensional axes of the study. For example, consider an experiment designed to examine the neural representation of rectangles. The stimulus space used in the experiment consists of rectangles that vary in height and width, and the experimenter models these two parameters. It may be the case, however, that a population of neurons actually has independent tuning for the sum and difference of height and width (roughly corresponding to area and aspect ratio)—a 45° rotation of the axes as modeled by the experimenter.”
  13. “In summary, when significant loading on the Euclidean contraction covariate is obtained in an experiment, an additional test is necessary to reject the possibility of independent, but misaligned, neural populations. Post hoc testing of the performance of the model under assumed rotations of the stimulus axes can distinguish between the independent, but rotated, and the conjointly tuned cases.”
  14. “Earlier, we considered how these concepts are related to receptive fields that are either linear or radially symmetric within a stimulus space. Intermediate receptive fields are possible, however, with oval shapes of varying elongation. In such cases the population would not be wholly independent, but instead represent one dimension to a greater extent than the other. These intermediate cases are considered readily within the framework of the Minkowski exponent that defines the representational space.”
  15. fMRI introduces other problems related to things it is bad at measuring
  16. “The test for a conjointly tuned neural population amounts to the measurement of variance attributable to the Euclidean contraction covariate. We consider here optimizations of the approach to maximize power for this test.”
  17. Instead of sampling stimuli from a grid of parameter space, they sample from nested octagons
    1. Helps keep points at a diagonal more evenly spaced than those that are up/down which is an issue in rectangular sampling. “the dioctagonal space increases the range of the Euclidean contraction covariate, thus improving power.”
  18. “Our selection of stimuli was motivated by the psychological study of integral and separable perceptual spaces. Some visual properties of objects are apprehended separately (e.g., color and shape), whereas other dimensions are perceived as a composite (e.g., saturation and brightness); these have been termed separable and integral dimensions (Shepard 1964). We hypothesized that integral perceptual dimensions are represented by populations of neurons that represent the dimensions conjointly, whereas separable dimensions are represented by independent neural populations; similar ideas have been proposed recently”
  19. First set of experiments are on “popcorn” and “moons”
  21. There is evidence that the two dimensions used for both stimuli sets are perceptually independent
  22. <Maybe they don’t actually do an experiment on the stars that vary orange to red with differing number of points, and only use them as an example?  That would be a bummer>
  23. “During separate fMRI scanning sessions … Subjects were required to monitor and report the position of a bisecting line, which was randomly tilted and shifted within preset limits … to maintain attention.”
  24. “For each subject, we identified within ventral occipitotemporal cortex voxels that showed recovery from adaptation to both stimulus axes for both stimulus spaces …Most voxels were concentrated around the right posterior fusiform sulcus, corresponding to ventral LOC…”
  25. In popcorn they find the neural representation is not independent based on dimension, but for the crescents they may indeed be independent
  26. “Although a particular study may find independent tuning for a pair of stimulus dimensions, it does not automatically follow that neurons are therefore tuned “for” those axes. It remains possible that the dimensions selected for study are manifestations of some further, as yet unstudied, organizational scheme.”
  27. In discussion, mentions other features that are thought to be represented independently on a neural level
  28. “Our method amounts to using a linear model to test the metric of a space—an approach that has been considered problematic…we have argued by simulation for the validity of our model for two dimensions with 16 regularly spaced samples.”
    1. “Herein we have considered several types of nonlinearities and distortions that can exist in neural representation or recovery from adaptation. Although we find that the method is generally robust to these deviations, there naturally exists the possibility of further violations of the assumptions of the model that we have not evaluated.”
  29. “We envision the use of the metric estimation test to study the representation of stimulus properties across sensory cortical areas. By revealing the presence of independently tuned neural populations, the fundamental axes of perceptual representation might be identified. Interestingly, a given stimulus space may be represented conjointly in one region of cortex, but independently in another.”
    1. This is true of the visual system

Properties of Shape Tuning of Macaque Inferior Temporal Neurons Examined Using Rapid Serial Visual Presentation. De Baene, Premereur, Vogels. J Neurophysiology 2007.

  1. Examined macaque inferior temporal cortical neuron responses to parametrically defined shapes
  2. “we found that the large majority of neurons preferred extremes of the shape configuration, extending the results of a previous study using simpler shapes and a standard testing paradigm. A population analysis of the neuronal responses demonstrated that, in general, IT neurons can represent the similarities among the shapes at an ordinal level, extending a previous study that used a smaller number of shapes and a categorization task. However, the same analysis showed that IT neurons do not faithfully represent the physical similarities among the shapes.”
  3. Also, IT neurons adapt to stimulus distribution statistics
  4. “Single IT neurons can be strongly selective for object attributes such as shape, texture, and color, while remaining tolerant to some transformations such as object position and scale”
  5. Rapidly display images in succession without interstimulus break
  6. Other results also show that neurons seem to be tuned to activate when shapes that come from the extremes of the parameter space are presented
  7. “Because a high number of stimuli are presented repeatedly in RSVP, this paradigm might be more sensitive to adaptive effects than classical testing paradigms in which one stimulus is presented per trial after acquisition of fixation and the intertrial interval is relatively long”
  8. <Skipping experimental details and moving on to results>
  9. <Again,> Neuron responses were tuned to extremes of the parameter space and not normally or uniformly distributed
    1. They used a number of different shape classes, and all showed this effect
  10. There was “a good overall fit between physical and neural similarities.”
  11. However, they had the issue that some dimensions were more salient than others
  12. [Image: Screen Shot 2015-08-04 at 1.11.39 PM]
  13. Did a hierarchical clustering of shapes according to neural responses and different shape classes are always together (aside from one shape class that is split in half and has another shape class “inside” it)
  14. “One issue to consider regarding the interpretation of the observed stronger responses for extreme stimuli is that the employed stimuli are likely to be suboptimal for the tested IT neurons. The critical question here is why the extreme stimuli are less suboptimal than the other stimuli given the likely high-dimensional space in which IT neurons are tuned. A satisfactory answer to this important question will require a full description of the nature of the tuning functions of IT neurons as well as knowledge about the relative position and range of the stimulus set with respect to these tuning functions. The possibility cannot be excluded that IT neurons learn the stimulus statistics of the parametric shape spaces and thus that the observed tunings depend on the stimulation history and the specific stimulus spaces. Experiment 2 demonstrated that the responses of IT neurons can indeed be modified by changes in input statistics. These effects were small in comparison to the degree of monotonic tuning, but stimulus statistics might exert a more profound effect with more extensive daily repetition of the same stimulus spaces as is the common practice in single-cell recording experiments. The MDS results clearly show that IT neurons are more sensitive for some stimulus variations (e.g., indentation; stimulus sets 3 and 4) than for others. This is in agreement with previous studies using calibrated sets of shapes…”
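
<To make the "ordinal but not faithful" point concrete for myself: a minimal sketch, assuming we already have one dissimilarity value per shape pair from the physical parameters and one from the neural responses. The numbers are made up, and rank correlation is my own choice of illustration, not the population analysis the paper actually ran.>

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical dissimilarities for six shape pairs (not data from the paper).
# The neural values preserve the order of the physical ones but not the spacing.
physical_dissim = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
neural_dissim = [1.0, 1.2, 1.5, 3.0, 8.0, 20.0]

rho = spearman(physical_dissim, neural_dissim)  # agreement in rank order
r = pearson(physical_dissim, neural_dissim)     # agreement in actual spacing
print(f"rank correlation: {rho:.3f}, linear correlation: {r:.3f}")
```

<With these toy numbers the rank order matches exactly while the linear correlation comes out lower, which is the sense in which a representation can be right at an ordinal level without preserving the physical distances.>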

Representation of object similarity in human vision: psychophysics and a computational model. Cutzu, Edelman. Vision Research 1997.

  1. Visual system is robust to illumination and perspective changes. We usually hold that we should be sensitive to changes in shape, but how do you study that in a well-principled way?
  2. References to earlier work that studied 2d shape change, here considering 3d
  3. There are 3 main ideas about how pose-independent shape classification could be done, and there are ways to test which one seems to be what we do
  4. <Mostly interested in the way they generate their shape data and properties of it, so skipping most of the other stuff>
    1.  ex/ “theories such as Shepard’s law of generalization, Nosofsky’s GCM and Ashby’s GRT”
  5. Shapes made up bodies – in all they were 70-dimensional
  6. [Image: Screen Shot 2015-07-15 at 4.58.54 PM]
  7. “We remark that the nonlinearities in the image creation process led to a complicated relationship between the shape-space representation of an object and its appearance on the screen”

  8. “Many early studies relied on the estimation of subjective similarities between stimuli, through a process in which the observer had to provide a numerical rating of similarity when presented with a pair of stimuli. One drawback of this method is that many subjects do not feel comfortable when forced to rate similarity on a numerical scale. Another problem is the possibility of subjects modifying their internal similarity scale as the experiment progresses. We avoided these problems by employing two different methods for measuring subjective similarity: compare pairs of pairs (CPP) and delayed match to sample (DMTS).”

  9. <skipping different experimental designs, moving on to discussion>
  10. Running MDS on subject data puts points pretty much where they should be
  11. “The CPP experiments described above support the hypothesis of veridical representation of similarity, by demonstrating that it is possible to recover the true low-dimensional shape-space configuration of complex stimuli from proximity tables obtained from subjects who made forced-choice similarity judgments.”
  12. “It is important to realize that the major computational accomplishment in the experiments we have described so far is that of the human visual system and not of the MDS procedure used to analyze the data.”
  13. “The detailed recovery from subject data of complex similarity patterns imposed on the stimuli supports the notion of veridical representation of similarity, discussed in the introduction. Although our findings are not inconsistent with a two-stage scheme in which geometric reconstruction of individual stimuli precedes the computation of their mutual similarities, the computational model that accompanies these findings offers a more parsimonious account of the psychophysical data. Specifically, representing objects by their similarities to a number of reference shapes (as in the RBF model described in Section 6.2) allowed us to replicate the recovery of parameter-space patterns observed in human subjects, while removing the need for a prior reconstruction of the geometry of the objects.”
  14. “Assuming that perceptual similarities decrease monotonically with psychological space distances, multidimensional scaling algorithms derive the psychological space configuration of the stimulus points from the table of the observed similarities.”
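
<Since MDS keeps coming up, here is a minimal sketch of the idea of recovering a point configuration from a table of pairwise dissimilarities. This is classical (Torgerson) MDS, which assumes the dissimilarities are Euclidean distances. It is a simpler stand-in for the monotonic (nonmetric) scaling the quote describes and not necessarily the procedure used in the paper. The four-point "stimulus space" is made up.>

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed points in n_dims dimensions from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to get a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # The top eigenvectors, scaled by sqrt(eigenvalue), give the coordinates.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def pairwise(points):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Hypothetical 2-D parameter space: the corners of a unit square.
true_points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = pairwise(true_points)

coords = classical_mds(dist, n_dims=2)
recovered = pairwise(coords)
print(np.allclose(recovered, dist, atol=1e-6))
```

<The recovered coordinates are only defined up to rotation, reflection, and translation, so the check compares the pairwise distances rather than the coordinates themselves.>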

Perceptual-Cognitive Explorations of a Toroidal Set of Free-Form Stimuli. Shepard, Cermak. Cognitive Psychology 1973.

<I’m just going to post images because it explains the important stuff>

[Images: Screen Shot 2015-07-15 at 3.45.37 PM; Screen Shot 2015-07-15 at 3.45.27 PM; Screen Shot 2015-07-15 at 3.45.46 PM; Screen Shot 2015-07-15 at 3.46.11 PM]

  1. But also people tended to view shapes based on what object they were most similar to (classifying them based on whether they looked like a gingerbread man, for example)
    1. “a striking aspect of the subsets is their very marked variation in size and shape in the underlying two-dimensional toroidal surface.”
    2. So these clusters don’t match the earlier contour maps either in size or shape (they are not necessarily symmetric or convex, although they seem to be L/R symmetric mostly but not up/down)
    3. Sometimes a category formed two disconnected clusters
  2. “The general conclusions, here, seem to be the following: On the one hand, the underlying parameter space provides a very convenient framework for representing the groups into which Ss tend to sort the forms. Moreover this space is directly relevant in the sense that most of the forms sorted into any one group typically cluster together into one or two internally connected subsets in the space. But, on the other hand, the fact that the spatial representations of the spontaneously produced subsets vary greatly in size and shape and sometimes even consist of two or more widely separated clumps seems to establish that Experiment II taps a variety of cognitive functioning that was not operative in Experiment I. Just what forms will be seen as representing the same object apparently cannot be adequately explained solely in terms of the metric of perceptual proximity among the free forms themselves…”

  3. Each cluster can be further broken down into subsequent subclusters
  4. Parameter space is toroidal, so top links to bottom and side to side
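
<A quick sketch of what "toroidal" means for distances in the parameter space: each coordinate wraps around, so two forms near opposite edges are actually close. The unit period and the Euclidean combination of the two axes are my assumptions for illustration, not the paper's parameterization.>

```python
import math

def torus_distance(p, q, period=1.0):
    """Distance between two points when every coordinate wraps at `period`."""
    total = 0.0
    for a, b in zip(p, q):
        delta = abs(a - b) % period
        # Going the other way around the circle may be shorter.
        delta = min(delta, period - delta)
        total += delta ** 2
    return math.sqrt(total)

# Hypothetical forms near the left and right edges of the parameter square.
left_edge = (0.05, 0.5)
right_edge = (0.95, 0.5)

wrapped = torus_distance(left_edge, right_edge)  # shortest path crosses the edge
flat = math.hypot(left_edge[0] - right_edge[0],
                  left_edge[1] - right_edge[1])  # distance ignoring wraparound
print(wrapped, flat)
```

<On a flat square these two points would be far apart. With wraparound they are neighbors, which matters when reading the cluster maps near the borders.>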