Representation of object similarity in human vision: psychophysics and a computational model. Cutzu, Edelman. Vision Research 1997.

  1. The visual system is robust to illumination and perspective changes.  We generally hold that it should be sensitive to changes in shape, but how do you study that in a well-principled way?
  2. References earlier work that studied 2D shape change; here they consider 3D
  3. Three main ideas about how to achieve pose-independent shape classification, and ways to test which one matches what we actually do
  4. <Mostly interested in the way they generate their shape data and properties of it, so skipping most of the other stuff>
    1.  ex/ “theories such as Shepard’s law of generalization, Nosofsky’s GCM and Ashby’s GRT”
  5. The shapes were made-up bodies; in all, the shape space was 70-dimensional
  6. <screenshot omitted>
  7. “We remark that the nonlinearities in the image creation process led to a complicated relationship between the shape-space representation of an object and its appearance on the screen”

  8. “Many early studies relied on the estimation of subjective similarities between stimuli, through a process in which the observer had to provide a numerical rating of similarity when presented with a pair of stimuli. One drawback of this method is that many subjects do not feel comfortable when forced to rate similarity on a numerical scale. Another problem is the possibility of subjects modifying their internal similarity scale as the experiment progresses. We avoided these problems by employing two different methods for measuring subjective similarity: compare pairs of pairs (CPP) and delayed match to sample (DMTS).”

  9. <skipping different experimental designs, moving on to discussion>
  10. Running MDS on subject data puts points pretty much where they should be
  11. “The CPP experiments described above support the hypothesis of veridical representation of similarity, by demonstrating that it is possible to recover the true low-dimensional shape-space configuration of complex stimuli from proximity tables obtained from subjects who made forced-choice similarity judgments.”
  12. “It is important to realize that the major computational accomplishment in the experiments we have described so far is that of the human visual system and not of the MDS procedure used to analyze the data.”
  13. “The detailed recovery from subject data of complex similarity patterns imposed on the stimuli supports the notion of veridical representation of similarity, discussed in the introduction. Although our findings are not inconsistent with a two-stage scheme in which geometric reconstruction of individual stimuli precedes the computation of their mutual similarities, the computational model that accompanies these findings offers a more parsimonious account of the psychophysical data. Specifically, representing objects by their similarities to a number of reference shapes (as in the RBF model described in Section 6.2) allowed us to replicate the recovery of parameter-space patterns observed in human subjects, while removing the need for a prior reconstruction of the geometry of the objects.”
  14. “Assuming that perceptual similarities decrease monotonically with psychological space distances, multidimensional scaling algorithms derive the psychological space configuration of the stimulus points from the table of the observed similarities.”
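
The MDS step in item 14 is easy to sketch: given only a table of pairwise dissimilarities, metric MDS recovers the point configuration up to rotation and reflection. A toy sketch with scikit-learn; the 2-D layout here is invented for illustration, not the paper's stimuli.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
true_points = rng.uniform(size=(20, 2))            # hidden "shape space" layout
# The only observable: the table of pairwise proximities
dists = np.linalg.norm(true_points[:, None] - true_points[None, :], axis=-1)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
recovered = mds.fit_transform(dists)               # configuration up to rotation/flip

# The recovered inter-point distances should closely match the originals
rec_dists = np.linalg.norm(recovered[:, None] - recovered[None, :], axis=-1)
```

As in the paper's analysis, what comes back is the configuration, not the axes: any rotation or reflection of `recovered` fits the proximity table equally well.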

Perceptual-Cognitive Explorations of a Toroidal Set of Free-Form Stimuli. Shepard, Cermak. Cognitive Psychology 1973.

<I’m just going to post images because they explain the important stuff>

<screenshots omitted>

  1. But also people tended to view shapes based on what object they were most similar to (classifying them based on whether they looked like a gingerbread man, for example)
    1. “a striking aspect of the subsets is their very marked variation in size and shape in the underlying two-dimensional toroidal surface.”
    2. So these clusters don’t match the earlier contour maps in either size or shape (they are not necessarily symmetric or convex, though they seem mostly left/right symmetric but not up/down)
    3. Sometimes a category formed two disconnected clusters
  2. The general conclusions, here, seem to be the following: On the one hand, the underlying parameter space provides a very convenient framework for representing the groups into which Ss tend to sort the forms. Moreover this space is directly relevant in the sense that most of the forms sorted into any one group typically cluster together into one or two internally connected subsets in the space. But, on the other hand, the fact that the spatial representations of the spontaneously produced subsets vary greatly in size and shape and sometimes even consist of two or more widely separated clumps seems to establish that Experiment II taps a variety of cognitive functioning that was not operative in Experiment I. Just what forms will be seen as representing the same object apparently cannot be adequately explained solely in terms of the metric of perceptual proximity among the free forms themselves…”

  3. Each cluster can be further broken down into subclusters
  4. Parameter space is toroidal, so top links to bottom and side to side
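
The wraparound in item 4 just means distances are computed modulo each axis. A minimal sketch, assuming coordinates normalized to [0, 1) in each dimension:

```python
import numpy as np

def toroidal_distance(a, b):
    """Euclidean distance on the unit torus: wraparound in each axis."""
    d = np.abs(np.asarray(a, float) - np.asarray(b, float))
    d = np.minimum(d, 1.0 - d)        # take the shorter way around each axis
    return float(np.sqrt((d ** 2).sum()))

# A point near the top edge is close to one near the bottom edge:
d = toroidal_distance([0.05, 0.5], [0.95, 0.5])   # about 0.1, not 0.9
```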

On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. Hoffman, Shahriari, de Freitas. AISTATS 2014.

  1. Considers noisy optimization with finite samples <not yet clear if the budget is imposed by the actor or the environment>
  2. “Bayesian approach places emphasis on detailed modelling, including the modelling of correlations among the arms. As a result, it can perform well in situations where the number of arms is much larger than the number of allowed function evaluation, whereas the frequentist counterpart is inapplicable.”
  3. “This paper draws connections between Bayesian optimization approaches and best arm identification in the bandit setting. It focuses on problems where the number of permitted function evaluations is bounded.”
  4. Applications include parameter selection for machine learning tasks
  5. “The paper also shows that one can easily obtain the same theoretical guarantees for the Bayesian approach that were previously derived in the frequentist setting [Gabillon et al., 2012].”
  6. A number of different criteria can be used in Bayesian land to select where to sample: “probability of improvement (PI), expected improvement (EI), Bayesian upper confidence bounds (UCB), and mixtures of these”
  7. Mentions work of Bubeck, Munos, et al.
  8. Tons of relevant references
  9. Also discussion in terms of simple regret
  10. But it looks like they are also talking in PAC terms
  11. Setting they consider includes GPs
  12. “As with standard Bayesian optimization with GPs, the statistics of … enable us to construct many different acquisition functions that trade-off exploration and exploitation. Thompson sampling in this setting also becomes straightforward, as we simply have to pick the maximum of the random sample from …, at one of the arms, as the next point to query.”
  13. Seems like they are really considering the finite arm case where arms have some covariance
  14. Used Bayesian math to get upper and lower bounds across all arms, which are then used to bound the simple regret
  15. “Intuitively this strategy will select either the arm minimizing our bound on the simple regret (i.e. J(t)) or the best “runner up” arm. Between these two, the arm with the highest uncertainty will be selected, i.e. the one expected to give us the most information.”
  16. The exploration parameter beta is chosen based on how often each arm is pulled, so as to end up with something epsilon-optimal
    1. Regret bound is in terms of near-optimality
  17. “Here we should note that while we are using Bayesian methodology to drive the exploration of the bandit, we are analyzing this using frequentist regret bounds. This is a common practice when analyzing the regret of Bayesian bandit methods”
  18. Can do a derivation with Hoeffding or Bernstein bounds as well (leads to analysis of case of independent arms, bounded rewards)
  19. UGap vs BayesGap – bounds are pretty much the same
  20. Have a nice empirical section where they use data from 357 traffic sensors and try to find the location with the highest speed
    1. “By looking at the results, we quickly learn that techniques that model correlation perform better than the techniques designed for best arm identification, even when they are being evaluated in a best arm identification task.”
  21. Then they use it for optimizing parameters in scikit-learn
    1. “EI, PI, and GPUCB get stuck in local minima”
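
The arm-selection rule quoted in item 15 can be sketched directly. This toy uses independent Gaussian posteriors standing in for the paper's correlated GP model (that simplification, and the fixed beta, are mine):

```python
import numpy as np

def bayesgap_choose(mu, sigma, beta=2.0):
    """Pick the next arm: the gap-bound minimizer J or the best runner-up j,
    whichever has the larger posterior uncertainty."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    U, L = mu + beta * sigma, mu - beta * sigma
    # B[k]: highest upper bound among the other arms, minus L[k]
    B = np.array([np.delete(U, k).max() - L[k] for k in range(len(mu))])
    J = int(np.argmin(B))                        # candidate best arm
    others = np.delete(np.arange(len(mu)), J)
    j = int(others[np.argmax(U[others])])        # best runner-up
    return J if sigma[J] >= sigma[j] else j

# Three arms: arm 0 looks best but is well explored; arm 1 is uncertain.
arm = bayesgap_choose([1.0, 0.9, 0.1], [0.05, 0.5, 0.05])
```

With these numbers the candidate best arm is arm 0, but arm 1 (the runner-up) has more posterior uncertainty, so it gets queried: that is the "most information" tie-break from the quote.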


Active Model Selection. Madani, Lizotte, Greiner. UAI 2004

  1. Considers the case where there is a fixed budget
  2. Shown to be NP-Hard
  3. Consider some heuristics
  4. “We observe empirically that the simple biased-robin algorithm significantly outperforms the other algorithms in the case of identical costs and priors.”
  5. Formalize the problem in terms of coins.  You are given a set of coins with different biases and a budget of flips to spend.  The goal is to pick the coin with the highest bias toward heads.  They actually consider the case where there are priors over each coin’s distribution, i.e., the Bayesian case
  6. “We address the computational complexity of the problem, showing that it is in PSPACE, but also NP-hard under different coin costs.”
  7. Metric is based on regret
  8. “A strategy may be viewed as a finite, rooted, directed tree where each leaf node is a special “stop” node, and each internal node corresponds to flipping a particular coin, whose two children are also strategy trees, one for each outcome of the flip”
    1. So naturally the total number of ways this can work out is exponential
  9. “We have observed that optimal strategies for identical priors typically enjoy a similar pattern (with some exceptions): their top branch (i.e., as long as the outcomes are all heads) consists of flipping the same coin, and the bottom branch (i.e., as long as the outcomes are all tails) consists of flipping the coins in a Round-Robin fashion”
  10. Update estimates on coins according to beta distribution
  11. “The proof reduces the Knapsack Problem to a special coins problem where the coins have different costs, and discrete priors with non-zero probability at head probabilities 0 and 1 only. It shows that maximizing the profit in the Knapsack instance is equivalent to maximizing the probability of finding a perfect coin, which is shown equivalent to minimizing the regret. The reduction reveals the packing aspect of the budgeted problem. It remains open whether the problem is NP-hard when the coins have unit costs and/or uni-modal distributions”
  12. “It follows that in selecting the coin to flip, two significant properties of a coin are the magnitude of its current mean, and the spread of its density (think “variance”), that is how changeable its density is if it is queried: if a coin’s mean is too low, it can be ignored by the above result, and if its density is too peaked (imagine no uncertainty), then flipping it may yield little or no information …However, the following simple, two coin example shows that the optimal action can be to flip the coin with the lower mean and lower spread!”
  13. Even if the Beta parameters of two coins are fixed, the Beta parameters of a third coin may require you to choose the first or the second coin, depending on their values
  14. Furthermore, “The next example shows that the optimal strategy can be contingent — i.e., the optimal flip at a given stage depends on the outcomes of the previous flips.”
  15. Although the optimal algorithm is contingent, an algorithm that is not contingent may only give up a little bit on optimality
  16. Discusses a number of heuristics including biased robin and interval estimation
  17. Gittins indices are simple and optimal, but only in the infinite horizon discounted case
    1. Discusses a hack to get it to work in the budgeted case (manipulating the discount based on the remaining budget)
  18. Goes on to empirical evaluation of heuristics
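
The biased-robin heuristic from items 4 and 16 is simple enough to sketch. The details beyond the paper's description (uniform Beta(1,1) priors, picking the final coin by posterior mean) are my reading:

```python
import random

def biased_robin(flip, n_coins, budget, seed=0):
    """Flip the current coin while it yields heads; a tail moves us to
    the next coin. Final pick: highest Beta(1+h, 1+t) posterior mean."""
    random.seed(seed)
    heads = [0] * n_coins
    tails = [0] * n_coins
    k = 0
    for _ in range(budget):
        if flip(k):
            heads[k] += 1
        else:
            tails[k] += 1
            k = (k + 1) % n_coins   # a tail sends us to the next coin
    means = [(1 + h) / (2 + h + t) for h, t in zip(heads, tails)]
    return max(range(n_coins), key=means.__getitem__)

# Hypothetical biases; coin 2 is clearly the best.
biases = [0.2, 0.3, 0.9]
best = biased_robin(lambda k: random.random() < biases[k], 3, budget=200)
```

Note how the rule concentrates flips on good coins automatically: a high-bias coin rarely produces the tail that would move the round-robin along.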

Gaussian Process Dynamical Models. Wang, Fleet, Hertzmann. NIPS 2006

  1. “A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space.”
  2. “We demonstrate the approach on human motion capture data in which each pose is 62-dimensional.”
  3. “we show that integrating over parameters in nonlinear dynamical systems can also be performed in closed-form. The resulting Gaussian Process Dynamical Model (GPDM) is fully defined by a set of low-dimensional representations of the training data, with both dynamics and observation mappings learned from GP regression.”
  4. As Bayesian nonparametric models, GPs are easier to use and overfit less
  5. “Despite the large state space, the space of activity-specific human poses and motions has a much smaller intrinsic dimensionality; in our experiments with walking and golf swings, 3 dimensions often suffice.”
  6. “The Gaussian Process Dynamical Model (GPDM) comprises a mapping from a latent space to the data space, and a dynamical model in the latent space…The GPDM is obtained by marginalizing out the parameters of the two mappings, and optimizing the latent coordinates of training data.”
  7. “It should be noted that, due to the nonlinear dynamical mapping in (3), the joint distribution of the latent coordinates is not Gaussian. Moreover, while the density over the initial state may be Gaussian, it will not remain Gaussian once propagated through the dynamics.”
  8. Looks like all predictions are 1-step; it can specifically be set up to use more history to make it higher-order
  9. “In effect, the GPDM models a high probability “tube” around the data.”
  10. “Here we consider a simple online method for generating a new motion, called mean-prediction, which avoids the relatively expensive Monte Carlo sampling used above.”
  11. <Wordpress ate the rest of this post.  A very relevant paper I should follow up on.>
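
The “mean-prediction” idea in item 10 can be sketched with plain GP regression: fit a GP to (x_t, x_{t+1}) pairs, then generate a trajectory by iterating the posterior mean. A 1-D toy with an RBF kernel and made-up linear dynamics, not the paper's actual latent-space model:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

# Training pairs (x_t, x_{t+1}) from simple known dynamics x' = 0.9 x
X = np.linspace(-2.0, 2.0, 25)
Y = 0.9 * X

K = rbf(X, X) + 1e-6 * np.eye(len(X))     # jitter for numerical stability
alpha = np.linalg.solve(K, Y)

def gp_mean_step(xq):
    """Posterior mean of the next state given the current one."""
    return float(rbf(np.array([xq]), X) @ alpha)

# Mean-prediction: roll forward by iterating the posterior mean
traj = [1.5]
for _ in range(10):
    traj.append(gp_mean_step(traj[-1]))
```

Inside the region covered by training data, the rollout tracks the true decay toward zero; this is the cheap alternative to the Monte Carlo sampling the paper mentions.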

Surpassing Human-Level Face Verification Performance on LFW with GaussianFace. Lu, Tang. AAAI 2015

  1. First algorithm to beat human performance on the Labeled Faces in the Wild dataset
  2. This has traditionally been a difficult problem for a few reasons:
    1. Often algorithms try to use different face datasets to help training, but the faces in different datasets come from different distributions
    2. On the other hand, relying on only one dataset can lead to overfitting
    3. So it is necessary to be able to learn from multiple datasets with different distributions and generalize appropriately
  3. Most algorithms for face recognition fall in 2 categories:
    1. Extracting low-level features (through manually designed approaches, such as SIFT)
    2. Classification models (such as NNs)
  4. “Since most existing methods require some assumptions to be made about the structures of the data, they cannot work well when the assumptions are not valid. Moreover, due to the existence of the assumptions, it is hard to capture the intrinsic structures of data using these methods.”
  5. GaussianFace is based on Discriminative Gaussian Process Latent Variable Model
  6. The algorithm is extended to work from multiple data sources
    1. From the perspective of information theory, this constraint aims to maximize the mutual information between the distributions of target-domain data and multiple source-domains data.
  7. Because GPs are in the class of Bayesian nonparametrics, they require less tuning
  8. There are optimizations made to allow GPs to scale up for large data sets
  9. Model functions both as:
    1. Binary classifier
    2. Feature extractor
    3. “In the former mode, given a pair of face images, we can directly compute the posterior likelihood for each class to make a prediction. In the latter mode, our model can automatically extract high-dimensional features for each pair of face images, and then feed them to a classifier to make the final decision.”
  10. Earlier work on this dataset used the Fisher vector, which is derived from a Gaussian Mixture Model
  11. <I wonder if its possible to use multi-task learning to work on both the video and kinematic data?  Multi-task learning with GPs existed before this paper>
  12. Other work used conv nets to take faces from different perspectives and lighting and produce a canonical representation; another approach explicitly models the face in 3D and also uses NNs, but these require engineering to get right
  13. “hyper-parameters [of GPs] can be learned from data automatically without using model selection methods such as cross validation, thereby avoiding the high computational cost.”
  14. GPs are also robust to overfitting
  15. “The principle of GP clustering is based on the key observation that the variances of predictive values are smaller in dense areas and larger in sparse areas. The variances can be employed as a good estimate of the support of a probability density function, where each separate support domain can be considered as a cluster…Another good property of Equation (7) is that it does not depend on the labels, which means it can be applied to the unlabeled data.”
    1. <I would say this is more of a heuristic than an observation, but I could see how it is a useful assumption to work from>
    2. Basically it just works from the density of the samples in the domain
    3. <Oh I guess I knew this already>
  16. “The Gaussian Process Latent Variable Model (GPLVM) can be interpreted as a Gaussian process mapping from a low dimensional latent space to a high dimensional data set, where the locale of the points in latent space is determined by maximizing the Gaussian process likelihood with respect to Z [the datapoints in their latent space].”
  17. “The DGPLVM is an extension of GPLVM, where the discriminative prior is placed over the latent positions, rather than a simple spherical Gaussian prior. The DGPLVM uses the discriminative prior to encourage latent positions of the same class to be close and those of different classes to be far”
  18. “In this paper, however, we focus on the covariance function rather than the latent positions.”
  19. “The covariance matrix obtained by DGPLVM is discriminative and more flexible than the one used in conventional GPs for classification (GPC), since they are learned based on a discriminative criterion, and more degrees of freedom are estimated than conventional kernel hyper-parameters”
  20. “From an asymmetric multi-task learning perspective, the tasks should be allowed to share common hyper-parameters of the covariance function. Moreover, from an information theory perspective, the information cost between target task and multiple source tasks should be minimized. A natural way to quantify the information cost is to use the mutual entropy, because it is the measure of the mutual dependence of two distributions”
  21. There is a weighing parameter that controls how much the other data sets contribute
  22. Optimize with scaled conjugate gradient
  23. Use anchor graphs to work around dealing with a large matrix they need to invert
  24. “For classification, our model can be regarded as an approach to learn a covariance function for GPC”
  25. <Not following the explanation for how it is used as a feature generator, I think it has to do with how close a point is to cluster centers>
  26. Other traditional methods work well here (like SVM, boosting, linear regression), but not as well as GP <Is this vanilla versions or on the GP features?>
  27. Works better as a feature extractor than other methods like k-means, tree, GMM
  28. “DeepFace” was the next-best method
  29. It is only half-fair to say this beats human performance: humans do better in the non-cropped scenario, and this comparison is in the cropped scenario.
    1. <My guess is that in the non-cropped scenario, machine performance conversely degrades even though human performance increases>
  30. Performance could be further increased but memory is an issue, so better forms of sparsification for the large covariance matrix would be a win

Deep Learning. LeCun, Bengio, Hinton. Nature 2015

  1. “Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. “
  2. Previous machine learning methods traditionally relied on significant hand-engineering to process data into something the real learning algorithm could use
  3. “Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations.”
  4. Has allowed for breakthroughs in many different areas
  5. We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data.
    1. <really? they have a very different definition of very little than i do>
  6. “In practice, most practitioners use a procedure called stochastic gradient descent (SGD).”
  7. Visual classifiers have to learn to be invariant to many things like background, shading, contrast, orientation, zoom, but also have to be very sensitive to other things (for example, learning to distinguish a German shepherd from a wolf)
  8. “As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. ”
    1. Which is just an application of the chain rule for derivatives
  9. ReLUs are best for deep networks, and can help remove the need for pre-training
  10. Theoretical results as to why NNs rarely get stuck in local minima (especially large networks)
  11. Deep NN work started in 2006, when pretraining was done by having each layer model the activity of the layer below
  12. 1st major application of deep nets was speech recognition in ’09; by ’12 it was doing speech recognition on Android
  13. For small datasets, unsupervised pretraining is helpful
  14. Convnets for vision
  15. “There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.”
  16. “Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.”
  17. “The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.”
  18. Machine translation and rnns
  19. Regular RNNs don’t work so well, LSTM fixes major problems
  20. “Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to [88], and memory networks, in which a regular network is augmented by a kind of associative memory [89]. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions. Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list [88]. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference [90]. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?” [89].”
  21. Although the focus now is mainly on supervised learning, they expect that unsupervised learning will become more important in the long term
  22. “Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems99 at classification tasks and produce impressive results in learning to play many different video games100.”
  23. “Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.”
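
The point in item 8 (backprop is just the chain rule applied module by module) in a minimal sketch: one ReLU hidden layer, with the analytic gradient checked against a finite difference. The shapes and the toy loss are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))   # hidden-layer weights
W2 = rng.normal(size=4)        # linear readout weights

def loss(W1):
    h = np.maximum(W1 @ x, 0.0)        # ReLU hidden layer
    return 0.5 * (W2 @ h) ** 2         # squared output as a toy loss

# Backward pass: the chain rule, one module at a time
h = np.maximum(W1 @ x, 0.0)
dL_dout = W2 @ h                       # d(0.5 y^2)/dy = y
dL_dh = dL_dout * W2                   # back through the readout
dL_dpre = dL_dh * (W1 @ x > 0)         # back through the ReLU gate
grad_W1 = np.outer(dL_dpre, x)         # back through the matrix product

# Finite-difference check on one weight
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
numeric = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)
```

SGD then just takes `W1 -= lr * grad_W1` on minibatches; everything deep-learning-specific is in how many such modules are composed.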

Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Singh, Memoli, Carlsson. Eurographics Symposium on Point-Based Graphics 2007

  1. “We present a computational method for extracting simple descriptions of high dimensional data sets in the form of simplicial complexes.”
  2. Discusses a tool Mapper for doing TDA
  3. “The purpose of this paper is to introduce a new method for the qualitative analysis, simplification and visualization of high dimensional data sets”
  4. One application is to help visualization in cases where data is extremely high-dimensional, and standard forms of visualization and dimension reduction don’t produce a reasonable result
  5. “Our construction provides a coordinatization not by using real valued coordinate functions, but by providing a more discrete and combinatorial object, a simplicial complex, to which the data set maps and which can represent the data set in a useful way”
  6. “In the simplest case one can imagine reducing high dimensional data sets to a graph which has nodes corresponding to clusters in the data”
  7. “Our method is based on topological ideas, by which we roughly mean that it preserves a notion of nearness, but can distort large scale distances. This is often a desirable property, because while distance functions often encode a notion of similarity or nearness, the large scale distances often carry little meaning.”
    1. <Reminds me a little of the intuition behind t-SNE>
  8. Basic idea is called a partial clustering where subsets of the data are clustered.  If subsets overlap, clusters may also overlap which can be used to build a “simplicial complex” <basically a way of gluing together points so each face has a certain number of vertices, and then connecting the faces>
    1. If simplices and topologies are robust to changes in data subset divisions, then the results are real
  9. “We do not attempt to obtain a fully accurate representation of a data set, but rather a low-dimensional image which is easy to understand, and which can point to areas of interest. Note that it is implicit in the method that one fixes a parameter space, and its dimension will be an upper bound on the dimension of the simplicial complex one studies.”
  10. Unlike other methods of dimension reduction (like isomap, mds) this method is less sensitive to metric
  11. <skipping most of the math because it uses terms I’m not familiar with>
  12. “The main idea in passing from the topological version to the statistical version is that clustering should be regarded as the statistical version of the geometric notion of partitioning a space into its connected components.”
  13. Assumes there is a mapping from data points to Reals (the filter), and that interpoint distances can be measured
  14. “Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any conditions on the clustering algorithm. Thus any domain-specific clustering algorithm can be used.”
  15. Cluster based on the vector of interpoint distances (this is the representation used, not Euclidean distances)
  16. Also want clustering algorithm that doesn’t require # of clusters to be specified in advance
    1. <I would have thought they would use Chinese restaurant processes, but they use something called single-linkage clustering>
  17. Seems like Mapper can handle higher-dimensional filters
  18. “The outcome of Mapper is highly dependent on the function(s) chosen to partition (filter) the data set. In this section we identify a few functions which carry interesting geometric information about data sets in general.”
  19. projection pursuit methods
  20. Can use eigenfunctions of the Laplacian as filter functions
  21. Also discusses using meshes and computing distances based on Dijkstra’s algorithm over the mesh
  22. “Different poses of the same shape have qualitatively similar Mapper results however different shapes produce significantly different results. This suggests that certain intrinsic information about the shapes, which is invariant to pose, is being retained by our procedure.”
    1. I think this falls out primarily from the distance metric used?
  23. Clusters different 3d models nicely, even though it samples down to just 100 points for each model, and many are similar (horse, camel, cat)
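
The Mapper pipeline in item 8 (cover the filter range with overlapping intervals, cluster within each slice, connect clusters that share points) can be sketched in a few lines. The gap-based 1-D clustering below is a toy stand-in for the single-linkage clustering the paper uses; all parameter values are arbitrary:

```python
import numpy as np

def split_by_gap(idx, coord, gap=0.5):
    """Toy stand-in for single-linkage in 1-D: split sorted values at large gaps."""
    order = idx[np.argsort(coord[idx])]
    clusters, cur = [], [order[0]]
    for a, b in zip(order[:-1], order[1:]):
        if coord[b] - coord[a] > gap:
            clusters.append(cur)
            cur = []
        cur.append(b)
    clusters.append(cur)
    return [set(c) for c in clusters]

def mapper_graph(xy, n_intervals=8, overlap=0.25):
    f = xy[:, 0]                                   # filter: the x-coordinate
    lo, span = f.min(), (f.max() - f.min()) / n_intervals
    nodes = []
    for i in range(n_intervals):
        a, b = lo + (i - overlap) * span, lo + (i + 1 + overlap) * span
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx):
            nodes += split_by_gap(idx, xy[:, 1])   # cluster each slice by y
    # Connect clusters from different slices that share points
    edges = {(i, j) for i in range(len(nodes))
             for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]}
    return nodes, edges

# Points sampled on a circle should come back as a graph with one loop
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
nodes, edges = mapper_graph(np.c_[np.cos(t), np.sin(t)])
```

For the circle, middle slices split into top/bottom arcs and the end slices don't, so the output graph is a single cycle: the topology survives even though all large-scale distances are discarded.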

Very Deep Convolutional Networks for Large-scale Image Recognition. Simonyan, Zisserman. ICLR 2015

  1. Discusses approach that got 1st, 2nd place in imagenet challenge 2014
  2. Basic idea is to use very small convolutions (3×3) and a deep network (16-19 layers)
  3. Made the implementation public
  4. Works well on other data sets as well
  5. Last year people moved toward smaller receptive windows, smaller stride, and using training data more thoroughly (at multiple scales)
  6. Input is 224×224; the only preprocessing is mean-subtraction of the RGB values for each pixel
  7. “local response normalization” didn’t help performance and consumed more memory
  8. Earlier state of the art used 11×11 convolutions w/stride 4 (or 7×7 stride 2)
    1. Here they only did 3×3 with stride 1
    2. They also have 3 non-linear rectification layers instead of 1, so the decisions made by those layers can be more flexible
  9. Their smaller convolutions have far fewer parameters, which can be seen as a form of regularization
  10. Optimized multinomial logistic regression using minibatch (size 256) gradient descent from backprop + momentum.
  11. “The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10−4 ) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).”
    1. <How does this weight decay work exactly?  Need to check it out>
  12. Ends up training faster than the network of Krizhevsky et al. (2012) because of some pretraining, and also because the network is narrower but deeper (more regularized)
    1. Pretrain 1st 4 convolutional layers, and last 3 fully connected layers
    2. They found out later that pretraining wasn’t really needed if they used a particular random initialization procedure
  13. Implementation based on Caffe, including very efficient parallelization
  14. With 4 Titan GPUs, took 2-3 weeks to train
  15. Adding further layers didn’t improve performance, although they say it might have if the data set was even larger
  16. “scale jittering helps” <I guess this has to do with how images are cropped and scaled to fit in 224×224, and randomizing this process a bit helps>
  17. “Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.”
  18. Method was simpler than a number of other near state-of-the-art
