Category Archives: Bandits

Behavioral and Neuropshysiological Correlates of Regret in Rat Decision-Making on a Neuroeconomic Task. Steiner, Redish. Nature Neuroscience 2014

  1. Deals with regret as described in the RL literature as rats make decisions (they make a choice they can’t go back on, but may have gotten something better with a different choice)
  2. “In humans, the orbitofrontal cortex is active during expressions of regret, and humans with damage to the orbitofrontal cortex do not express regret.  In rats and nonhuman primates, both the orbitofrontal cortex and the ventral striatum have been implicated in reward computations.”
  3. In situations of high regret <presumably, the reward of the other option is presented, but not accessible> “… rats looked backwards toward the lost option, cells within orbitofrontal cortex and ventral striatum represented the missed action, rats were more likely to wait for the long delay, and rats rushed through eating the food after that delay.
  4. “Dissapointment is the realization that a realized outcome is due to one’s own mistaken action “
  5. <Didn’t get to finish – posting for spring cleaning>

Deterministic Policy Gradient Algorithms. Silver, Lever, Heess, Degris, Wierstra, Riedmiller. ICML 2014

  1. Deterministic policy gradient for continuous action MDPs
    1. Deterministic in that the action selected by the algorithm is deterministic as opposed to a stochastic
    2. Not a function of stochasticity in the domain <so far>
  2. “The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function.  This simple form means that the deterministic policy gradient can be estimated much more effectively than the usual stochastic policy gradient.”
  3. Exploration is driven by off-policy actor-critic
  4. This method is shown to beat stochastic policy gradient in high-D spaces
  5.  “It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).  However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.”
  6. The determinstic policy gradient is simply equivalent to the stochastic policy gradient as policy variance approaches 0.
  7. The basic difference between the deterministic vs stochastic policy gradient is that the first only integrates over state, while the latter also integrates over actions (so doing stochastic requires more samples <and similarly is more error-prone>)
  8. The benefit of stochastic policy gradient is that it has a nice way of doing exploration built-in.  Here an actor-critic setup is used to drive exploration: “We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient.  We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the gradient.”
  9. Experiments on a high-D bandit, some other stuff, and octopus arm
  10. “… compatible <in that the critic does not introduce bias> function approximators are linear in ‘features’ of the stochastic policy….”
  11. Discusses off-policy actor critic, which is a combination of policy gradient and TD, provides an approximation of the true gradient
    1. Has to use importance sampling to compensate for the fact that the state distribution that is being observed and the one the alogrithm would like to generate are different.
  12. “In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step.  Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximizing Q.”
  13. The estimate of the action parameterization-value gradient depends on the distribution of states visited, but it turns out the gradient of the state distribution does not have to be calculated
  14. Naturally, as in the case of stochastic case, a differentiable estimate of the Q function must be used.
    1. In general, this will not preserve the gradient of the true value function (it is called compatible) but there are classes of FAs that will
  15. Start with a description of gradient SARSA
  16. Then move to off-policy deterministic actor critic
  17. Linear FAs are compatible, and can be effective if they only have to locally dictate how to adjust parameters <but is it used only locally?>
    1. Seems like it can be linear in any set of basis functions, though
  18. Minimizing squared-error
  19. “To summarise, a compatible off-policy deterministic actor-critic (COPDAC) algorithm consists of two components.  The critic is a linear function approximator that estimates the action-value from features [math]… This may be learnt off-policy from samples of a behaviour policy β(a|s), for example using Q-learning or gradient Q-learning.  The actor then updates its parameters in the direction of the critic’s action-value gradient.”
  20. Although off-policy QL may diverge when using linear VFA, there are now methods that are safer, which is what is used here
  21. Computational complexity is mn where m = |A| and n is the number of policy parameters <which may be |S||A|?>
  22. On to experimental results
  23. High-D (D = 10, 25, 50) quadratic bandit.  Seems like a pretty trivial problem – perf in the 50D case converges at around 1000 samples.  Stochastic is still almost at exactly the same performance it started with at that point
  24. Then work in mountain car, pendulum, puddle world <at least in the first two, exploration isn’t trivial, although not extremely difficult>
  25. <Because the VFA is being done linearly, this doesn’t solve the problem of engineering features that allow the problem to be solvable, which is a fundamental issue in continuous RL>
  26. In octopus, reward is the distance from the arm to the target, so there is a nice smooth landscape to optimize on
    1. State space is simplified to 6-D
    2. VFA is done by ANN
  27. <Discussion>
  28. “Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.  Unfortunately this makes stochastic policy gradient harder to estimate, because the policy gradient ∇θπθ(a|s) changes more rapidly near the mean.  Indeed, the variance of the stochastic policy gradient for a Gaussian policy N(μ, σ2) is proportional to 1/σ2 (…), which grows to infinity as the policy becomes deterministic.  This problem is compounded in high dimensions, as illustrated by the continuous bandit task.  The stochastic actor-critic estimates the stochastic policy gradient in … The inner integral …[math], is computed by sampling a high dimensional action space.  In contrast, the deterministic policy.”

Optimal Learning for Sequential Sampling with Non-parametric Beliefs. Barut, Powell. Journal of Global Optimization?

[this is based on a quick 2nd read through, so less notes than usual]

  1. Aggregates over a set of kernel functions in order to do ranking and selection more effectively 
  2. The weighing scheme used is shown to be optimal under independent kernel estimators (need to find out what that means exactly)
  3. Uses knowledge gradient to select sampling points (roughly uses GPs and then estimates gradient of the bounds, and then attempts to sample where the gradient is highest, but a separate optimization algorithm is needed to find those locations)
  4. Proposed policy is shown to be asymptotically optimal
  5. This paper is concerned with searching over a finite number of options (computationally, restricted to some thousands)
  6. Similarity is measured by a kernel.  Although there is no explicit use of lipshitz smoothness or concavity, kernel metrics must be Holder
  7. In the off line setting, the problem addressed is called ranking and selection, but in on-line its called multi armed bandit
  8. References for earliest papers on stochastic search
  9. Noise is assumed to be normally distributed
  10. Policy is 1-step optimal, and converges to optimal at the limit
  11. In their empirical results they use 0-mean 1D gaussian processes.  They test this method, random sampling, and another GP-based method (which attempts to find the length-scale online (SKO)? theirs just weighs a bunch of provided kernels, finding the best one)
    1. For differing kernels, used linear fits but with different bandwidths
    2. Their stuff does very well, although with larger length-scales SKO works better.  For GPs that actually have very short length-scales, KGCB is better.
  12. Also, SKO is less general, and only works well on certain classes of functions (certain covariance functions cause trouble, for example).  Seems to be a little more fragile.
  13. I respect them for pointing out situations where SKO outperforms KGCB
  14. In an energy storage/retrieval problem, their method outperforms SKO, although with larger amounts of noise the bounds begin to overlap.  In this problem KGCB converged more quickly, while in some of the earlier results SKO converged more rapidly, so depends on the domain.
  15. “Although our policy performs very well in the numerical experiments, there is a caveat.  Kernel estimation is known to suffer from the curse of dimensionality as the MSE proportional to h^d where h is the bandwidth and d is the number of dimensions.  If observations lie in high dimensional spaces, non-parametric estimation is known to have poor performance.  Because of these reasons, the efficiency of our estimation method also degenerates in higher dimensions.  Additive models might be used to handle this curse but this requires making more assumptions on the structure of the function.”

Cross-entropy optimization applied to humanoid walking

  • Physics simulation by PyOde.
  • 10,000 rollouts per planning step, with rollouts 100 steps long (certainly could have done many fewer).
  • Reward is the velocity of the “hip” along the x-axis to left, a large negative reward occurs when any section aside from a foot touches the ground (which is also terminal)
    • In the interest of full disclosure, the algorithm did cause the dude to fall down later in the simulation, but I’m very confident that could have been prevented with more rollouts.
  • There are 7 joints being controlled (each rendered as a linear section), and the action is a torque applied at each
  • I think the state can be most compactly represented in 22 dimensions (the great thing about the method is it doesn’t matter!), being an x,y position and velocity at the shoulder, and each other part of the body can be encoded in terms of an angle and angular velocity relative to another joint in the skeleton.
  • Optimization is done according to a very large gaussian (700 dimensions).  This method works well, but actually computing the covariance matrix is very expensive.  The physics engine is also pretty heavy-weight and I had to do expensive operations in order to get it to work in this setting – the experiment took about a day and a half.

I think it would be interesting to try and apply the method to walking on an uneven floor.

Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization. Deautels, Krause, Burdick. ICML 2012.

  1. Another paper that I didn’t read very closely, but here are the basic ideas
  2. Basic idea is dealing with bandits where you have to make a number of decisions and then find out the results of all of them after the selection (batch instead of single arm selection)
  3. An algorithm called GP-BUCB is developed, the regret of which only increases by a constant factor independent of the batch size B
  4. Like vanilla GP-UCB, this algorithm needs the arms to continuous, but then discretized for a max operation
  5. The trick for the algorithm leverages a property of GPs, where the reduction in variance depends only on where samples were taken from, and not what the actual values of the samples were.
    1. Based on this, it is possible to hypothesize what the new bounds on the algorithm will be once the sample comes back, and based on that, a decision is made.
  6. Computations of the variance of the function dominate the running time
  7. They have some math to make sure bounds remain correct during batch selection, so that it does not become “overconfident”
  8. Applications are some Synthetic problems Automated Vaccine Design, Spinal Cord Therapy (neat)

On Local Regret. Bowling, Zinkevich. ICML 2012.

  1. Didn’t read this totally carefully, so this is just a skeleton of the paper
  2. Deals with local regret, which is a generalization of normal regret that measures performance relative to “nearby” options as opposed to globally
  3. An algorithm is presented that minimizes local regret with arbitrary locality graphs, and show how graph structure can be used to speed learning
  4. The only assumption is that the action set has some defined notion of locality, but otherwise makes no assumptions (such as convexity or smoothness)
    1. This allows the method to be applied to tasks that were previously considered intractible
    2. I still don’t understand how this can work if there are infinite options with no smoothness, will try to go over it later
  5. They discuss a few different types of regret: internal, swap, and external, and rank the difficulty.  They then introduce local versions of those types of regret
  6. Ah they do say that it is impossible to have no regret in some infinite cases.  That makes more sense, but the earlier statement is a bit misleading then
  7. Local regret is a generalization of regret, because when all options are related, local regret becomes normal regret

Gaussian Process Optimization in the Bandit Setting: No Regeret and Experimental Design. Srinivas, Krause, Kakade, Seeger. IC

<I did see this presented in person in ICML 2010.>

  1. Analyzes an algorithm called GP-UCB
  2. Regret is bounded in terms of information gain
  3. <I think a general problem with the approach is the computational (and even moreso) memory requirements involved in using GPs.  I gave them up years ago because they kept running out of memory.>
  4. Deals with noisy setting
  5. Information gain is bounded by use of submodularity argument
  6. Says HOO has issues with high-dimensional domains, which is contrary to what I remember from the paper.  Anyway, argues that measures of smoothness other than Lipshitz (or Holder) can give better results (based on what kernel you use in the GP)
  7. Mentions that there is a difference bettween experimental design and optimization but I am not familiar with the distinctions
  8. GP-UCB is a distinct method of using GPs for optimization; there is a small literature on different algorithms with a number of citations in the paper
  9. Discusses RKHS, Reproducing Kernel Hilbert Spaces (something I’ve heard of  before but have no idea what it is)
    1. There is something called the RKHS norms which is a measure of smoothness
  10. Greedily selecting points that maximize information gain is approximately (a constant fraction of) optimal <they cite guestrins paper for this>
  11. The GP-UCB algorithm, however, doesn’t just try to maximize information globally.  Since we are performing optimization we only care about areas where the maximum may lie.  There is some scaling constant that is used to balance gaining information vs finding the optimum.  It must be selected correctly or the algorithm will converge prematurely
  12. There is a major problem with this approach which is what I thought I remembered hearing in the talk.  Given a bunch of potential points, there is math that tells you which one to sample next.  But in continuous optimization there is an infinitely large set of points and this approach has no way to deal with it, turning the global optimization problem into another global optimization problem (which they claim is easier, though.  I don’t really buy that)
  13. Putting together the computational, memory, and most fundamental issue that it doesn’t resolve the global optimization problem, I think this needs to be seriously improved upon before its considered seriously.
  14. For example in the experimental results section, they discretize a 1-d optimization problem into 1000 points.  This works fine for 1-D but clearly this won’t scale in any reasonable manner, one of the things they say the algorithm is good at.  For larger dimensions it is necessary to resort to some other method than uniform discretization

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers


  1. Main goal of the work is to “interchange” the discretization components of:
    1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
      1. Underlying component is Incremental Regression Tree Induction (IRTI)
    2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
      1. Underlying component is HOO
  2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
  3. Also include the idea of transposition tables
  4. Work examines behavior, computation time, computational complexity, and memory of both algs
  5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations


Ch 1: Introduction

  1. In terms of doing the actual action planning, talks of two possible options:
    1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
    2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
  2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originially TLS was coupled with IRTI and HOLOP with STL)
    1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
    2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
    3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
    4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
  3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
  4. The Basic Research Questions for the Thesis are:
    1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
    2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
    3. How can the information retreived from simulations be reused for TLS
    4. Can a transposition tree increase the performance of simulation based systems in continuous environments
    5. Which combination of techniques is best
    6. What is the memory cost of these algorithms
    7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

  1. Discusses use of Zobrish hashing for high-performance transition tables.  I’ve never heard of it.

Ch 3: Automatic Decomposition

  1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
  2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
  3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
  4. Discussion of scaling of the exploration factor (commonly called C)
    1. The propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
    2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
    3. But they also mention that if vmin, vmax are known for the problem it can be used, so thats all fine.  If you don’t know that you need to resort to something like above
  5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option

Ch 4: Multi-step Optimization in Continuous Environments

  1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
  2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
  3. If a sample causes a split in a leaf in an internal leaf, the trees below that point must be discarded
    1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
  4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
  5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>

Ch 5: Continuous Transposition Tree

  1. Two methods for constructing transposition tables are introduced
  2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
  3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
  4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
  5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
    1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
  6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
  7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to those that has the highest value according to a particular computation, I think it is basically like UCB, which allows for exploration.
  8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
  9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>

Ch 6: Experiments and Results

  1. UCT here isn’t the normal version of UCT, somewhat between HOSTL and IRTI
  2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
  3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but choses best found), and a version of UCT that uses unifom discretization
  4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but anyway as more time is given IRTI will beat UCT anyway)
  5. Good parameterizations are found for each problem separately.
  6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, vanilla-MC actually outperfoms UCT
  7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
  8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
  9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
    1. Here, RMTL (original tree-learning search) performs best by a wide margin
    2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
    3. This is under time constraints
  10. <p. 35They test MTT by itself which I’m not exactly groking because I thought it was a technique for allowing tranposition tables to be used and not a policy, but I’m missing something>
  11. Have very nice experiments of sample distributions on 1-step optimization problems
  12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess cause they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because it requires a test confirming statistical significance to split>
  13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Experiments and Results

  1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
  2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
  3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

Two basic problems in finite stochastic optimization. Sébastien Bubeck. Talk

Section 1: Stochastic Optimization

  1. Setting: finite X, non-Bayesian, non-minimax, nonparametric noise, non-asymptotic analysis.
    1. How do we define optimality in this context?
  2. There are 2 parameters to the learner: n evaluations and k options, we want to learn v_1, …, v_k reward distributions of the options, want option with max mean, at each step 1,…, n the learner outputs an option I_t
  3. for delta_i being the gap between option i and optimal, H = Sum_i 1/delta_i^2
  4. You always need at least omega(H/logk) evaluations, so any algorithm that accomplishes this is optimal (Audibert, Bubeck, Munos 2010)
  5. According to this criteria, uniform sampling is actually not so bad.
    1. O(K Log K / min_i delta_i^2) – this is ok on easy problems but on hard ones is quite difficult
  6. Multi-armed bandit introduced by Thompson in ’33 (I’m familiar with the second citation he gives – Robbins ’52)
  7. Bandit survey to appear check webpage
  8. After n steps, UCB has a polynomial prb of failure, while uniform sampling seems much better. Why does this not show up in practice?
  9. He introduces a modified UCB (UCB-E)with a different bias term: sqrt((cn/H)/T_i(t))
    1. This algorithm has a better exponentially small probability of error
    2. This algorithm is optimal
    3. It finds the best option in time O(H logH), but requires knowledge of H
  10. Successive rejects algorithm:
    1. For each phase, uniformly sample, and then reject the lowest avg val. arm
    2. This has a pretty good regret bound (didn’t have time to copy)
  11. In empirical results shows Successive rejects as much better than Hoeffding races
  12. What if we want to find m-best options instead of single best?
    1. For 1-best you can work by throwing away only the worst
    2. For m-best you also want to throw away the worst, but also the best arms when we are confident about them
  13. For the m-best there is a successive rejects alg, but it is successive accepts and rejects.
  14. Current work is on Combinatorial Stochastic Optimization
    1. Find the best concept in a given concept class.

Section 2: Optimal Discovery with Probabilistic Expert Advice

  1. In a graph, you can ask whether a node is important or not
  2. You can define a subset of the graph and sample a node from it an ask if it is optimal
  3. Goal is to quickly define important nodes
    1. Part of this problem is you only need to find each interesting node once; hearing about it again later is wasteful
  4. Part of the theory is from Good-Turing when working on Enigma
  5. Algorithm is called Good-UCB, looks alot like regular UCB (this is demonstrated empirically as well, and is tight the whole way, not just at the end)
    1. Theorem says this is just as good as using an Oracle; uniform sampling does not achieve this
  6. One of the examples for this is prime numbers, small exploration constant here works best

Learning, Inference, and Control for Robotics and Sustainable Energy. J. Zico Kolter. Talk

  1. Gives example of little dog – easy to model kinematics, but full model is difficult with uncertain terrain, similar example with driving on poor surfaces
  2. Can make the same argument about wind turbines (hard to model) or what happens in te home (hard to control)
  3. But data is easy to get
  4. 2 parts: data driven learning and control for dynamic tasks, and data driven control for sustainable energy


  1. Say its very diificult to accurately make a model of robot dynamics from pure physics, so better to use data to (help) make a model
  2. Say many planning problems can be helped just by looking at the sign of derivative terms. Do gradient descent just on that
  3. Does this form of policy gradient to teach little dog to climb steps in about 5 minutes
  4. Same issue with drift-parking. Cant use a dynamic model based on phystics because it misses particular bits, so you should combine observed data
  5. Idea is dynamics is hard to model, but maneuver is repeatable over short horizons.  So parts can be open-loop
  6. Result is molti-model LQR:
    1. Use predictions errors over data to estimate model variance
    2. Use variance- aware method (new iterative LQR method) to compute optimal controls


  1. Generating energy from wind-turbines with a data-driven control approach (control right pitch of blades)
  2. The models we have for wind dynamics is not accurate, and operates in very restricted conditions.  They really suck
  3. Because of this online optimization is important.  Go about doing stochastic optimization
  4. Care about data efficiency, satisfied with local optimim
  5. “Trust region policy search” – use second order (Hessian) info to optimize.  Update param values by trust-region
    1. Need to estimate Hessian, which is difficult, but can do important sampling on previous results to reduce the # of samples needed
    2. Hessian may be indefinite, so use a trus region solver – fits a polynomial only locally as opposed to globally.  This can be solved exactly
    3. Use something based on variance of gaussian used to sample to pick region
  6. Beats up REINFORCE badly (but even back in the day REINFORCE was known to be a very sample inefficient algo), but indepenedent of that it does quickly climb up to the optimal region
  7. Idea is to use the power consumption of the entire home at the power meter coming in instead of monitorig each outlet independently
  8. Uses HMMs to model whether the state of each device is on or off.  Problem is current algs can’t deal with input sizes as large as what occurs in a home
  9. Do spectral clustering on the data to identify what is actually happening in the house
  10. Need a new alg to do tractable inference.  It is a convex approx inference methods that can be quickly solved for hudreds of thousands of variables