Category Archives: Regression

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Guo, Singh, Lee, Lewis, Wang. NIPS 2014

  1. Model-based Atari planners are better than the famous DQN, but very slow
  2. Goal here is to build a real-time MCTS agent better than DQN
  3. Use UCT to train NN, 300 max depth, 10k trajectories (minor modifications for a couple of games)
  4. A few particular approaches:
    1. Do 800 episodes in the game through UCTs policy, and learn the values computed by UCT to train the NN
    2. Use UCT as above, but to use NN to train a classifier to pick an action
      1. One issue is that the policy generated by these approaches will in general be different from that found by UCT during search, so the values will not be for the policy used
    3. So they propose an “interleaved” approach where part of the behavior is on-policy UCT, and part UCT is used to compute values, but the actions taken are according to NN
  5. Learn in 84×84 greyscale
  6. Use same NN arch as DQN
  7. UCT with classification is as good or better as DQN,  UCT with regression is worse
  8. To be consistent with DQN paper, they leave in 5% exploration for recording quality, although it doesn’t make any sense to do that when evaluating performance
  9. UCT without any realtime constraints really beats up on everything (roughly double score of best)
  10. Also, the emulator runs at 60 fps, not 30 that might be expected <the 60 is consistent with atari hardware, I checked>
  11. Interleaved method performance jumped %10 when doubling trajectories
  12. <I thought they were going to try a DYNA 2 approach, wonder how it would work>
  13. They observed some horizon effects in UCT where some good behavior requires really long timescales that couldn’t be reasoned about

A Review of Unsupervised Feature Learning and Deep Learning for Time-Series Modeling. Langkvist, Karlsson, Loutfi. Pattern Recognition Letters 2014.

  1. Consider continuous, high dimensional, noisy time series data
  2. Also assume it may not be the case that there is enough information in the data to do very accurate prediction (high Bayes error)
  3. Assume nonstationary: x(t) predicts y(t), but at a different t, the same value for x may produce a different prediction
    1. To resolve this, some form of context is required.  With the correct context the process is stationary
    2. Amount of time for context may be unkown, may be very large
  4. May be nonstationary so that summary statistics change over time.  In some cases, change in frequency is so relevant that its better to work in the frequency-domain than the time-domain
    1. <Been meaning to learn about that>
  5. In vision, often things like translation and scale invariance are desired.  In time series analysis, we desire invariance to translations in time
  6. So yeah its a gnarly problem.  Picking the right representation is key
    1. Here they consider finding a representation
  7. Discusses hidden (and hidden and gated) Boltzmann machines, although I won’t take notes because its probably isn’t what they really will use anyway
  8. Auto-encoders
  9. Basic linear auto-encoder is same as PCA
  10. Terms in cost function include for sparsity and to keep weights close to 0 <listed as 2 different things, but how are they distinct?>
  11. Recurrent neural network
  12. Regularization terms “… prevents a the trivial learning of a 1-to-1 mapping of the input to the hidden units.”
  13. RBMs don’t need regularization because stochastic binary hidden unit acts as a regularizer, although it is possible to add regularization on top anyway
  14. Recurrent neural network
  15. Trained by backprop-through-time
  16. “RNNs can be seen as very deep networks with shared parameters at each layer when unfolded in time.”
  17. Deep learning
  18. Convolution, pooling
  19. Other methods for dealing with time-data aside from simple recurrent networks is penalizing changes in the hidden layer from one time step to the next
  20. Also mention slow feature analysis
  21. “Temporal coherence is related to invariant feature representations since both methods want to achieve small changes in the feature representation for small changes in the input data.”
  22. Hidden Markov Models
    1. But require discrete states
    2. Limited representational capacity
    3. Not set up well to track history
  23. “The use of Long-short term memory (…) or hessian-free optimizer (…) can produce recurrent networks that has a memory of over 100 time steps.”
  24. Some models are generative, others are discriminative.  Auto-encoder isn’t generative, but “a probabilistic interpretation can be made using auto-encoder scoring (…)”
  25. <According to the table in the paper, recurrent neural networks are most appropriate for what we are considering>
  26. Discusses video data
    1. Lots of relevant work to investigate here
  27. “The use of deep learning, feature learning, and convolution with pooling has propelled the advances in video processing.”  Deep learning is natural because it is state of the art on still images, but extensions are needed to deal with the temporal aspect
  28. “The early attempts at extending deep learning algorithms to video data was done by modelling the transition between two frames.  The use of temporal pooling extends the time-dependencies a model can learn beyond a single frame transition.  However, the time-dependency that has been modeled is still just a few frames.  A possible future direction for video processing is to look at models that can learn longer time-dependencies.”
  29. Other examples <with a fair amount of space given that I’m skipping> is stock prices and music recognition
  30. Motion Capture Data
  31. Previous applications were Temporal Restricted Boltzmann Machines (TRBM), and conditional RBM (Hinton in both papers), then recurrent TRBM
  32. Mirowski and LeCun used dynamic factored graphs to fill in missing mocap data
  33. “A motivation for using deep learning algorithms for motion capture data is that it has been suggested that human motion is composed of elementary building blocks (motion templates) and any complex motion is constructed from a library of these previously learned motion primitives (Flash and Hochner, 2005).  Deep networks can, in an unsupervised manner, learn these motion templates from raw data and use them to form complex human motions.”
  34. Section on machine olfaction, physiological eeg, meg, ecg
  35. “In order to capture long-term dependencies, the input size has to be increased, which can be impractical for multivariate signals or if the data has very long-term dependencies.  The solution is to use a model that incorporates temporal coherence, performs temporal pooling, or models sequences of hidden unit activations.”

Submodular Meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation, and Dictionary Selection. Das, Kempe. ICML 2011

  1. How do you select a subset of k variables from a large set in order to get the best linear prediction of another variable?
    1. To be clear, its about choosing dimensions of data, as opposed to selecting subset of whole data points
  2. Related to feature selection and sparse approximation
  3. Analyzes performance of commonly used, empirically effective greedy heuristics
    1. Analysis based on submodularity and spectral analysis
  4. Introduces the “… submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated.”
  5. Get best approximation guarantees in terms of both submodularity ratio as well as “… smallest k-sparse eigenvalue of the covariance matrix.”
  6. Also get better bounds on dictionary selection problem <not sure what that is>
  7. Test on real-world as well as synthetic data
    1. Results show submodularity ratio is a better predictor of performance of greedy algorithms than other spectral parameters
  8. Commonly, after subset selection is performed, the goal is to minimize MSE or maximize squared multiple correlation R2 <whats that? sounds neat>.  Here they focus on the latter
  9. Selection criteria is based on covariance matrices
  10. Minimizing R2 is “… equivalent to the problem of sparse approximation over dictionary vectors…”
  11. Formulation is similar to that of sparse recovery, except there the assumption is the data is actually k-sparse, but in many cases that assumption doesn’t hold and the data is actually dense
  12. The problem as specified is NP-Hard, so heuristics are the only way to go
  13. Common heuristics are greedy methods, or those based on convex relaxation.  Here they focus on the former, as the methods are generally simpler and require less tuning
  14. Of the greedy methods, most common are Forward Regression and Orthogonal Matching Pursuit
  15. Previous theoretical work on these greedy methods have been lacking in applicability
  16. “We prove that whenever the submodularity ration is bounded away from 0, the R2 objective is ‘reasonably close’ to submodular, and Forward Regression gives a constant-factor approximation.”
  17. Although they mention issues with spectral methods in this context (“… greedy algorithms perform very well, even for near-singular input matrices.”) the covariance ratio is related to spectral analysis:
    1. The submodularity ratio “… is always lower-bounded by the smallest k-sparse eigenvalue of the covariance matrix [but can be much larger].”
  18. They also get bounds for the two greedy methods when the lowest k-sparse eigenvalue is non-zero
  19. In comparison between performance as related to submodularity ratio vs spectral analysis “… while the input covariance matrices are close to singular, the submodularity ratio actually turns out to be significantly larger.  Thus our theoretical results can begin to explain why, in many instances, greedy algorithms perform well in spite of the fact that the data may have high correlations.”
  20. R2 describes the fraction of the variance of the desired output values Y that are explained by variables in the X inputs in the corpus
  21. Forward regression is the natural greedy method, where the variable that is added at each step is the one that most maximizes R2 immediately
  22. Give a bound for how much  the R2 values of a set and a sum of its elements can differ, used for the proof of performance of forward regression
  23. <Skipping actual proofs, onto empirical results>
  24. “Because several of the spectral parameters (as well as the optimum solution) are NP-hard to compute, we restrict our experiments to data sets with n<30 features, from which k<8 are to be selected.  We stress that the greedy algorithms themselves are very efficient.”
  25. Data sets are World Bank development indiccators, house prices in Boston w/relevant features, and then a synthetic set
  26. Compared to optimal (exhaustive) results, the greedy algorithms perform extremely well
  27. The greedy methods work better than Lasso
  28. The bound based on the submodularity ratio is much better than that of the spectral parameters
    1. While there is a fair amount of looseness between the theoretical bounds and actual performance (theoretical is pessimistic), the difference between actual and theoretical is not absurdly different
    2. They also include an extension to the theory that drastically tightens the bounds <why not put this in the theoretical section in the first place, as opposed to with the empirical results?>
  29. The real-world data sets are nearly singular

How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis. Escalante-B, Wiskott. CogPrints 2013.

  1. Their extension for supervised SFA is called graph-based SFA
  2. “The algorithm extracts a label-predictive low-dimensional set of features that can be post processed by typical supervised algorithms to generate the final label or class estimation.”
  3. Trained with a graph where edge weights represent similarities
  4. The modification to SFA made here is that it accepts weights
  5. There are different ways of building these graphs, a very simple method generates results equivalent to the Fisher linear discriminant
  6. Claim is that supervised learning on high-dimensional data is tough, so often a dimension reduction step is taken (perhaps unsupervised).
    1. Here, a supervised dimension reduction step is proposed
  7. “GSFA and LE [Laplacian Eigenmaps] have the same objective function, but in general GSFA uses different edge-weight (adjacency) matrices, has different normalization constraints, supports nonde-weights, and uses function spaces.”
  8. GSFA can be used for both regression or classification, many approaches only work for one of the two
  9. “The central idea behind GSFA is to encode the label information implicitly in the structure of the input data, as some type of similarity matrix called edge-weight matrix, to indirectly solve the supervised learning problem, rather than performing an explicit fit to the labels.”
  10. In the graph, there are edge weights along with node weights, which specify a-priori sample properties
  11. “… hierarchical processing can also be seen as a regularization method because the number of parameters to be learned is typically smaller than if a single SFA node with the number of parameters than if a single SFA node with a huge input is used, leading to better generalization.”
    1. Another advantage is that if non-linear bases are used, the nonlinearity can allow for increasingly more complex functions per layer
  12. In graph edges are undirected, weighed, although it seems that the approach trivially generalizes to the directed case
  13. Basically they rewrite the original constraints of SFA with added weights
  14. Non-existing edges are given 0-weight
  15. Seems like they just end up using the graph to exactly calculate what the dynamics would be based on initialization probabilities (vertex weights) and transition probabilities (edge weights)
  16. How to construct the graph for either classification or regression is then discussed
  17. For graphs, they simply generate a separate graph for each class, with each item in each graph fully connected, and each sub-graph completely unconnected to items in a separate class, so basically there are independent fully connected components for each class
    1. There are some tricks that can be used due to the symmetry in each class cluster to make processing cheaper
  18. What is found by the classifier in this construction is equivalent to that of the Fisher linear discriminant
  19. “Consistent with FDA, the theory of SFA using unrestricted function space (optimal free responses) predicts that, for this type of problem, the first S – 1 slow features extracted are orthogonal step functions, and are piece-wise constant for samples from the same identity (…).  This closely approximates what has been observed empirically, which can be informally described as features that are approximately constant for sample of the same identity, with moderate noise.”
  20. <Immediately next paragraph> “When the features extracted are close to the theoretical predictions (e.g., their Δ-values are small), their structure is simple enough that one can use even a modest supervised step after SFA, such as a nearest centroid or a Gaussian classifier (in which a Gaussian distribution is fitted to each class) on S-1 slow features or less.”
    1. Using SVMs over Gaussians doesn’t make performance that much better, while being computationally more expensive
  21. Now on to regression
  22. For regression “The fundamental idea is to treat labels as the value of a hidden slow parameter that we want to learn.  In general, SFA will not extract the label values exactly.  However, optimization for slowness implies that samples with similar label values are typically mapped to similar output values.  After SFA reduces the dimensionality of the data, a complimentary explicit regression step on a few features solves the original regression problem.”
  23. They discuss 4 ways of doing the regression for SFA, the first one actually doesn’t even leverage
  24. In the version that doesn’t leverage graphs, simply sort data and then pass into SFA.  “Due to limitations of the feature space considered, insufficient data, noise, etc., one typically obtains noisy and distorted versions of the predicted signals.”
    1. On the other hand, its the easiest to implement (partially because vanilla SFA can be used) so “… we recommend its use for first experiments.”  If that doesn’t work well, use the GSFA approaches
  25. In the “sliding window training graph” items are sorted as above, but each vertex is connected to the d closest left and right items
  26. They recommend not using just 0 and 1 weights as it leads to “pathological solutions” – this may be what we’re picking up in ToH, and talk about why that happens. <This is worth investigating further.>
  27. In the “serial training graph,” data points are binned together – then points are all connected to points in adjacent bins, but they don’t connect all the points in a same bin  <why?>
    1. As is the case in other particular structures, can set up GSFA to be more efficient for this particular case
    2. Naturally, there is tuning required to see that the binning was done correctly
  28. The “mixed training graph” adds connections within a bin
  29. Then there is a supervised step on top of this stuff <am I missing something – I thought there were 4 in total?>
  30. “There are at least three approaches to implement the supervised step on top of SFA to learn a mapping from slow features to the labels. ” <
    1. First option is linear or nonlinear regression
    2. To bin and then classify <so you end up with discrete approx of regression?>
    3. Do a weighted version of #2 so you get continuous estimations
    4. <#s 2 and 3 immediately above look terribly hacky, if I am groking them correctly>
  31. Experimental results
  32. For classification they only check to see that indeed SFA does the same thing as Fisher linear discriminant (because that has already been studied exhaustively), which it does
    1. Interestingly in the benchmark task used, convnets are best, and outperform humans
  33. In the regression problems they take photos of people and estimate the horizontal position of the face, vertical position, and size.  This is all done separately <why?  Ah, because the sorting depends on the output variable, so you can only sort according to one… although it seems like a pretty simple extension could handle higher-dimensional outputs>
  34. Take face pictures from a number of data sets (a total of 64,471) and were “… automatically pre-processed through a pose-normalization and pose-reintroduction step. Basically they are all centered and then from there shifted and zoomed according to a distribution.  This way, they know what the x,y,z values they are estimating are
  35. Because of the size of the corpus and images themselves, its difficult to apply algs like SVMs directly, so they use hierarchical SFA and GSFA (which they also call HSFA <- great>)
    1. They also do a hierarchical version of PCA, which sort of does the opposite thing of SFA.  The 120 HPCA features used explain 88% of the variance
  36. Used a few different post-dimension reduction classifiers, including SVM
  37. The slow features of the data gets organized in a more orderly fashion as go up in hierarchy
  38. “… GSFA was 4% to 13% more accurate than the basic reordering of samples employing standard SFA.  In turn, reordering was at least 10% better than HPCA for this dataset.”
  39. Only 5 HSFA features are used, whereas 54 for HPCA. “This can be explained because PCA is sensitive to many factors that are irrelevant to solve the regression problem, such as the vertical position of the face, its scale, the background lighting, etc. Thus, the information that encodes the horizontal position of a face is mixed with other information and distributed over many principal components, whereas it is more concentrated in the slowest components of SFA.”
  40. Mixed and serial (straight SFA) outperformed the sliding window graphs <they were surprised but I’m not, at least with regards to mixed as it regular sliding window just seems like a strange construction).  The serial was actually better than the mixed, although the difference wasn’t significant
  41. They call these approaches “implicitly supervised” because the construction of the graph depends on the supervised labels, but the algorithm never sees those labels explicitly
  42. “The experimental results demonstrate that the larger number of connections considered by GSFA indeed provides a more robust learning than standard SFA.”
  43. Knock unsupervised dimension reduction by doing dimension reduction that doesn’t necessarily help in the task you are actually interested in <But this is only “implictly” supervised, by the same logic fully supervised dimension reduction would be better yet.>
  44. Being able to simply specify a graph means there is no need to exhaustively harvest data from a graph you may already have, as is the case in standard SFA
  45. GSFA has a tendency to overfit because it is not regularized, and is sensitive (in a bad way) to multiple types of data being used

Gaussian Process Optimization in the Bandit Setting: No Regeret and Experimental Design. Srinivas, Krause, Kakade, Seeger. IC

<I did see this presented in person in ICML 2010.>

  1. Analyzes an algorithm called GP-UCB
  2. Regret is bounded in terms of information gain
  3. <I think a general problem with the approach is the computational (and even moreso) memory requirements involved in using GPs.  I gave them up years ago because they kept running out of memory.>
  4. Deals with noisy setting
  5. Information gain is bounded by use of submodularity argument
  6. Says HOO has issues with high-dimensional domains, which is contrary to what I remember from the paper.  Anyway, argues that measures of smoothness other than Lipshitz (or Holder) can give better results (based on what kernel you use in the GP)
  7. Mentions that there is a difference bettween experimental design and optimization but I am not familiar with the distinctions
  8. GP-UCB is a distinct method of using GPs for optimization; there is a small literature on different algorithms with a number of citations in the paper
  9. Discusses RKHS, Reproducing Kernel Hilbert Spaces (something I’ve heard of  before but have no idea what it is)
    1. There is something called the RKHS norms which is a measure of smoothness
  10. Greedily selecting points that maximize information gain is approximately (a constant fraction of) optimal <they cite guestrins paper for this>
  11. The GP-UCB algorithm, however, doesn’t just try to maximize information globally.  Since we are performing optimization we only care about areas where the maximum may lie.  There is some scaling constant that is used to balance gaining information vs finding the optimum.  It must be selected correctly or the algorithm will converge prematurely
  12. There is a major problem with this approach which is what I thought I remembered hearing in the talk.  Given a bunch of potential points, there is math that tells you which one to sample next.  But in continuous optimization there is an infinitely large set of points and this approach has no way to deal with it, turning the global optimization problem into another global optimization problem (which they claim is easier, though.  I don’t really buy that)
  13. Putting together the computational, memory, and most fundamental issue that it doesn’t resolve the global optimization problem, I think this needs to be seriously improved upon before its considered seriously.
  14. For example in the experimental results section, they discretize a 1-d optimization problem into 1000 points.  This works fine for 1-D but clearly this won’t scale in any reasonable manner, one of the things they say the algorithm is good at.  For larger dimensions it is necessary to resort to some other method than uniform discretization

Reliable Effective Terascale Linear Learning. John Langford. Talk

  1. His method (Vowpal Wabbit) takes a couple of seconds to train a 500~mb compressed file and gives the best result on the dataset (very large, extremely sparse)
  2. The method is online
  3. This talk is bad.  
  4. He says his stuff is fast but isn’t comparing it to anything (for example, compare to c45 decision trees). His defining of what features are in the database is also unusual and is causing problems.  He is also using unusual terminology for things like weights
  5. I think a takeaway is that you can’t show up to a talk and be like “my stuff is the most awesome stuff.  period.” because it will just make people skeptical.  You also can’t say “we beat the future” because that claim doesn’t make any sense
    1. Another way to say this is don’t love your method too much
  6. He throws out acronyms like MPI and RCV1, but nobody knows what that means, and he didn’t define them
  7. What are nodes?  (I think it means the number of machines its parallelized onto, but I don’t think he said it)
  8. Says hadoop is bad because it takes a minute to start up a task, this uses something new called allreduce
    1. Basic idea is each node starts with a number and then you add all numbers and propagate the sum to all
    2. Gave some other reasons why hadoop is no good
  9. Hadoop is nice because it moves the program to data (which is good when the data is huge)
  10. Says all the algorithms that run at scale do gradient descent (Don’t know if I totally agree with that, but ok)
  11. Says if units aren’t all at same scale it messes up the gradient.
    1. This is commonly fixed by a Newton update, which requires a Hessian, which you can’t do on large data
  12. Method runs as a mixture of adaptive and batch
  13. Learning rate is defined for each dimension, and is related to the previous gradients in that dimension
  14. Normalizes data
  15. Also leverages importance weighting (like what occurs in boosting), and has a very effective way of using them
    1. Says this also helps having an aggressive learning rate
  16. Says L-BFGS is very important to know
  17. There are a large number of pieces to this learning system and its really tough to maintain whats going on.  It would be better to more thoroughly explain some parts and leave others completely unexplained
  18. Says VW lead to a more effective spam filter as well
    1. Says spam is adversarial in that it changes over time and in response to filters
  19. The problem being solved is batch
  20. The method is robust and gives good results on most data sets

Non-Linear Monte-Carlo Search in Civilization II. S.R.K. Branavan David Silver Regina Barzilay. IJCAI 2011

  1. Based on the title this must be the most awesomest paper ever.  Its the culmination of my entire existence on earth.
  2. Apply non-linear regression within MCTS to estimate Q-function from random rollouts.
  3. Because it generalizes, it doesn’t need many rollouts to make an estimate
  4. Work also pulls out information from the manual (which seems to be the focus of  a separate paper)
  5. The non-linear FA is much more effective than a linear one
  6. Result beats built-in AI 78% of the time.
    1. That is awesome, but this is AI that dates back to 1996, and was designed to work on home-PCs of the era (33 mhz), so we might expect it isn’t amazing.
    2. They actually used FreeCiv, so this isn’t totally right (still designed for 100 mhz machines), and the AI was constrained not to cheat as it does in the real Civ (which is very fair).  I imagine the AI in freeCiv is much better than Civ2, though.
    3. They do, however, treat unfinished games as a loss, so draws are awarded to the AI
  7. The branching factor in the game is extremely huge, so I’m amazed anything works at all.
  8. This is similar to the Go work that used FAs to learn an evaluation function, but there linear FAs were used
  9. The VFA (value function approximator) is built locally in time (from a particular state)
    1. Global VFA was effective in Backgammon, but not Go or Civ, which are larger
  10. Elements of manual relevant to current game state are modeled as hidden variables in the non-linear VFA
  11. Use 4-layer NN for VFA, trained using rollouts.  Layers are as follows:
    1. Game state, action, and game manual
    2. Linguistic model of manual
    3. Fixed features that perform predefined transformations on first 2 layers
    4. Value function
  12. Rollouts train all of these, including the linguistic model
  13. They argue nonlinear VFAs generalize better from a small number of rollouts – thats nonintuitive to me since the hypothesis space of linear VFAs is smaller and should be learnt more quickly (even if its accuracy is worse at the end)
  14. ANNs are helpful because they can process the input in stages
  15. There is already some literature on RL in civ, but this is the first that plays the entire game, as opposed to some subset of it, or only selecting from options
  16. Use game rules and opponent actions as part of the black-box transition function
  17. Manual text that is extracted and fed into the ANN depends on the situation
  18. Its funny they dont at least carry over the ANN from the last step as an initial point to start learning, I guess it biases samples and random sampling is better.
  19. Stanford parser used on the manual
  20. Rollouts are 20 steps long, and uses game score as the evaluation function.  500 rollouts are performed
  21. Almost half a million features fed into the ANN
  22. Exploration is epslion-greedy
  23. Played through a 100 step game in 1.5 hours, not so bad
  24. Compare to 3 other methods:
    1. MCTS: each item in the game computes its search tree indep, this is the only alg that explicitly builds a search tree
    2. Linear MC – similar to the linear VFA method used for Go
    3. Non-linear MC is basically the same method as in the paper, except no manual
    4. Random-Text same as proposed method, but manual has scrambled word order (but same word content)
  25. The other comparison methods are much less effective against the AI (next best is non-linear MC with about 30%).
    1. Regular MCTS didn’t win a single game
  26. Its worth mentioning that they use the Civ2 manual to play freeciv.  While they are very similar, they are not exactly the same game.
  27. There is another paper “Learning to Win by Reading Manuals in a Monte-Carlo Framework” that deals more with just dealing with the manual.

Gaussian Process Bandits: An Experimental Design Approach. Srinivas, Krause, Kakade, Seeger

  • A short (4 page) paper
  • Analysis of upper confidence bound on Gaussian process optimization
  • Smootheness is encoded by covariance function
  • The Gaussian Bandit UCB requires a discrete set of test points, which seems strange
  • There are some notation abuses that are difficult to understand, for example their definition of a no-regret algorithm and the statement that UCB is a no regret algorithm
  • The regret bound in discrete UCB is O(sqrt(kt)).  For the infinite arm case, K is replaced by the bound for the maximum possible information gain due to sampling
    • The say this connects GP optimization w/ optimal experimental design, should read more about this
  • Here there is noise on the reward observations, it is assumed to be Gaussian
  • Although it is a regret algorithm it looks like there is a probability of failure delta
  • Information gain is submodular – more information is gained when the total number of points sampled is low (although there has to be cases where this isn’t true)
  • There is a different regret bound here that also has sqrtish flavor

Issues in Using FAs for RL. Thrun, Schwartz

  • Discusses failure cases w/FAs in QL
  • Generalization err in FA can lead to systematic overestimation of values, can lead to an expectation of agent failing to learn optimal policy
  • Gets bounds on necessary accuracy of FA, TD factor used
  • Driven by empirical difficulties encountered when using FAs
  • Even if error introduced by FA is zero-mean, the max operator makes the error to be above zero, so issue makes value function tend to overestimate value, gives expectation as to what that amount can be
  • Empirically, using high gammas (over 0.9) lead to more severe failures
  • Say memory based methods may be more reliable than other forms of regression, have better behavior when amount of experience is large
  • Also say TD methods should help mitigate this due to eligibility trace
  • FAs baised toward worst case estimates can help the issue… how many of these exist?