Category Archives: Sample Based Planning

Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015

  1. Working on high-D continuous RL
  2. Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”
  3. Use MPC/control related methods that were done back in ’04 but revisited here and can be used for real-time control now
  4. Test in “extended” cart-pole <all this means here is the start state is randomized> and quadcopter
  5. Don’t try to do MCTS, because it is expensive.  Instead use gradient optimization
  6. Instead of the normal O(n^3) cost for GPs, this is O(m^2 n), where m < n
  7. “However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such model-predictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-time-varying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
  8. Looks like the type of control they use has to linearize the model locally
  9. “For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
  10. Use an off-the-shelf convex solver for doing the MPC optimization
  11. Use warm starts for replanning
  12. The optimization converges in a handful of steps
  13. <Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>
  14. Although the cart-pole is a simple task, they learn it in less than 5 episodes
    1. <But why no error bars, especially when this experiment probably takes a few seconds to run?  That seems crazy for a paper from 2015; it is probably fine, but it makes me wonder if it sometimes fails to get a good policy>
  15. Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem
    1. 8D state, 2D action
  16. For the quadcopter, training data (from a real quadcopter?) is fed in, and then it is run in simulation
  17. “By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”
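The O(m^2 n) cost in item 6 comes from working through m inducing points instead of the full n-by-n kernel matrix. A minimal sketch of the idea, using a subset-of-regressors predictive mean rather than the authors' actual model (the functions `rbf` and `sparse_gp_predict`, and all parameter values, are my own illustrative choices):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # squared-exponential kernel between row-vector sets A (n,d) and B (m,d)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def sparse_gp_predict(X, y, Z, Xs, noise=1e-2):
    """Subset-of-regressors predictive mean using m inducing inputs Z.

    The linear solve is over an m x m system, and forming that system
    costs O(m^2 n), versus O(n^3) for an exact GP on all n points."""
    Kzx = rbf(Z, X)                       # (m, n)
    A = noise * rbf(Z, Z) + Kzx @ Kzx.T   # (m, m)
    w = np.linalg.solve(A, Kzx @ y)       # (m,)
    return rbf(Xs, Z) @ w                 # predictive mean at test inputs Xs
```

With 20 inducing points this recovers a smooth 1-D function from 200 noisy-free samples to within a few percent, which is the regime the paper exploits for real-time model learning.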

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Guo, Singh, Lee, Lewis, Wang. NIPS 2014

  1. Model-based Atari planners are better than the famous DQN, but very slow
  2. Goal here is to build a real-time MCTS agent better than DQN
  3. Use UCT to train NN, 300 max depth, 10k trajectories (minor modifications for a couple of games)
  4. A few particular approaches:
    1. Do 800 episodes in the game under UCT’s policy, and use the values computed by UCT as regression targets to train the NN
    2. Use UCT as above, but train the NN as a classifier that picks UCT’s action
      1. One issue is that the policy generated by these approaches will in general be different from that found by UCT during search, so the values will not be for the policy used
    3. So they propose an “interleaved” approach: UCT is still used to compute values at the states visited, but the actions actually taken are chosen by the NN, so the training data stays on-policy for the NN
  5. Learn in 84×84 greyscale
  6. Use same NN arch as DQN
  7. UCT with classification is as good as or better than DQN; UCT with regression is worse
  8. To be consistent with the DQN paper, they leave in 5% exploration when recording scores, although it doesn’t make any sense to do that when evaluating performance
  9. UCT without any realtime constraints really beats up on everything (roughly double score of best)
  10. Also, the emulator runs at 60 fps, not 30 that might be expected <the 60 is consistent with atari hardware, I checked>
  11. Interleaved method performance jumped 10% when doubling the number of trajectories
  12. <I thought they were going to try a DYNA 2 approach, wonder how it would work>
  13. They observed some horizon effects in UCT where some good behavior requires really long timescales that couldn’t be reasoned about
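The tree policy that the UCT runs in items 3-4 rely on is UCB1 applied at each tree node. A minimal sketch of that selection rule (the exploration constant and function names are my own, not from the paper):

```python
import math

def ucb1_action(counts, values, c=1.4):
    """Pick the action maximizing mean value plus an exploration bonus.

    counts[a] is the number of visits to action a, values[a] its mean
    return.  Untried actions are preferred unconditionally."""
    total = sum(counts)
    best_a, best_score = None, float("-inf")
    for a in range(len(counts)):
        if counts[a] == 0:
            return a  # always expand untried actions first
        score = values[a] + c * math.sqrt(math.log(total) / counts[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

The bonus term is what makes a rarely tried action look attractive even when its mean return is lower, which is the exploration side of the tradeoff these offline UCT runs spend their 10k trajectories on.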

Improving UCT Planning via Approximate Homomorphisms. Jian, Singh, Lewis. AAMAS 2014

  1. Improves performance of UCT by:
    1. finding local abstractions in the DAG built by UCT
    2. using approximate homomorphisms (as opposed to true homomorphisms which preserve optimal policies, but are rare and computationally difficult to develop)
  2. Derive a lower bound on the performance of the abstraction method when used with UCT
    1. <If the bounds are specific for UCT it might be moot anyway because of the horrible worst-case performance of the algorithm>
  3. Also, homomorphisms are global, but the paper is concerned with UCT, which is generally applied when global methods are too expensive
  4. “One interesting overall insight obtained herein is that the more computationally limited UCT is, the coarser are the best-performing abstractions.”
  5. “A homomorphism is a perfect abstraction in the sense that the induced MDP is equivalent to the original MDP for planning purposes.”
  6. Here, they just glob states together into clusters and use that set as an abstracted state (counts average returns apply to set as a whole)
  7. <OK, think I get the idea, not so relevant for now but may come back to it>
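The clustering in item 6 amounts to keeping UCT's visit counts and mean returns per cluster rather than per ground state. A minimal sketch of that bookkeeping (the class and its interface are my own illustration, not the paper's code):

```python
from collections import defaultdict

class AbstractNodeStats:
    """Share visit counts and mean returns across all states in a cluster."""

    def __init__(self, phi):
        self.phi = phi               # phi: ground state -> cluster id
        self.n = defaultdict(int)    # visits per cluster
        self.q = defaultdict(float)  # mean return per cluster

    def update(self, state, ret):
        c = self.phi(state)
        self.n[c] += 1
        self.q[c] += (ret - self.q[c]) / self.n[c]  # incremental mean

    def value(self, state):
        return self.q[self.phi(state)]
```

The coarser phi is, the more data each estimate pools, which fits the paper's observation that the more computationally limited UCT is, the coarser the best-performing abstractions are.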

Monte-Carlo Exploration for Deterministic Planning. Nakhost, Muller. IJCAI 2009.

  1. MC methods lead to breakthroughs in Go and general game playing
  2. Introduces planner ARVAND
  3. Local search methods are the most effective techniques for satisficing planning
    1. But they have trouble dealing with local minima and plateaus where the heuristic function is flat in a large neighborhood
  4. Generally, these types of search algorithms are greedy, so a heuristic that isn’t very good can lead search to areas that are ultimately unpromising
  5. Mentions exploration/exploitation tradeoff
  6. Monte-Carlo Random Walk (MRW) is based on two things:
    1. Action generation is orders of magnitude faster than state-of-the-art heuristic evaluation, so it often works well (in terms of wall-clock time) to try and move further and then evaluate the result with the heuristic less frequently
    2. Results from games (Go) show that exploration is also important
  7. Mention randomization in motion planning, and some other stuff
  8. Mention other methods for escaping local minima/plateaus (FF, and something called enforced hill climbing/EHC)
    1. An extension to EHC, Marvin, keeps track of sequences that lead to exits from plateaus and attempts to apply them again later
  9. Arvand may use heuristics generated by FF, so can be like FF with randomized exploration
  10. Variants of MRW may move away from chance exploration to bias action selection to what was effective previously
  11. # of random walks, how long to make them are parameters
  12. Randomized exploration runs into trouble if the #actions is huge (>1000), or many random trajectories lead to dead ends.  Pruning is important in these cases
    1. They discuss something that has a couple of names, which comes down to softmax action selection
    2. In this paper, ARVAND starts with chance random walks, but falls back on these if that doesn’t make progress
  13. Empirical results discuss performance of this algorithm and competitors thoroughly, but I don’t have the background to make sense of it
  14. “To the best of our knowledge, this is the first successful use of Monte-Carlo search as the main search method of a classical planner. Contrary to greedy local search methods that try to immediately exploit their local knowledge, MRW planning explores the local search space by using random walks.”

Towards a Second Generation Random Walk Planner: An Experimental Exploration. Nakhost, Muller. IJCAI 2013

  1. Focuses on best-first planners that use randomization to break out of regions of the search space that have a flat heuristic value
  2. Four major insights claimed
    1. Better to evaluate states frequently as opposed to just at the end of search
    2. Better to allow parameters to intelligently adjust on the fly
    3. Better to bias action selection based on current state as opposed to the simple heuristic of how useful it was throughout the entire history
    4. “Even simple forms of random walk planning can compete with GBFS.”
  3. Satisficing planners work almost exclusively based on heuristics.  IPC 2011 had 25 of 27 entries in that category.
    1. Because they use heuristics they don’t really explore; when heuristics are bad, the lack of search driven by other methods means that solution quality is poor
  4. The same authors, in 2009 introduced Monte-Carlo Random Walks, which run finite-length random walks, and then choose one course of action based on leaf evaluation
  5. Mention RRTs, as well as Roamer.
  6. The planner they discuss is called Arvand-2013, built on top of the Fast Downward planning system
  7. A fair amount of discussion of parameter values that worked well for particular different domains from IPC
  8. <Too results specific based on IPC domains, skipping>
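Insight 3 above (and the softmax fallback mentioned for the 2009 ARVAND) comes down to sampling actions with probability proportional to exp(score/temperature) instead of uniformly. A quick sketch (the names and the temperature value are mine):

```python
import math
import random

def softmax_sample(actions, score, temp=1.0, rng=random):
    """Sample an action with probability proportional to exp(score(a)/temp)."""
    weights = [math.exp(score(a) / temp) for a in actions]
    r = rng.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]  # numerical safety for floating-point round-off
```

Lowering temp makes the walks greedier toward high-scoring actions; raising it recovers the pure chance random walks that both Arvand papers start from.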

Rollout Algorithms for Constrained Dynamic Programming. Bertsekas. LIDS (tech report?) 2005.

<This is a tech report?>

<Seems like it is dealing with deterministic case>

  1. Based on the abstract, seems similar to his other work that deals with heuristics and rollouts
  2. “Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm also produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic.”
  3. <Important!> “If the base heuristic is a DP policy, i.e., it generates state/control sequences via feedback functions μk, k=0,1,…,N-1, so that at any state xk, it applies the control μk(xk), then the algorithm can be viewed as a policy improvement step of the policy iteration method.  Based on the generic results for policy iteration, it then follows that the rollout algorithm’s cost <doing cost minimization in this setting> is no worse than the one of the base heuristic.  This cost improvement property has been extended in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimization>] to base heuristics that are not DP policies, but rather satisfy an assumption called sequential improvement, which will be discussed shortly within the constrained context of this paper.”
  4. The base heuristic is more like a heuristic policy (as opposed to a value) in that it produces a “partial trajectory”, where the idea seems to be that if you plan to some depth you continue the rollout with the heuristic to give a better estimate of cost
    1. <This fits well with the random policy stuff>
    2. <Fits with Go literature>
  5. There is a given set of trajectories <a priori?>; the planned trajectory followed by the trajectory resulting from the heuristic may not be a member of this set of trajectories.
    1. <I think the paper is set up this way so that it can be used for constraint satisfaction; only trajectories satisfying the constraint are in this set; in this paper they deal with constrained DP as opposed to the more common unconstrained DP.>
  6. “We will show under an appropriate extension of the notion of sequential improvement, that the rollout algorithm produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic. We note, however, that it does not appear possible to relate our rollout algorithm to concepts of policy iteration, and in fact we are not aware of an analog of the policy iteration method for constrained DP problems [my emphasis]…”
    1. <So this paper is probably less useful at the moment than his other work because of the focus on constrained DP… may skim the rest of this paper>
  7. For constrained rollouts, when performing action selection, it only considers actions at each time step that will keep the rollout satisfying the constraints
    1. At each decision point, examines all possible actions, and then does heuristic rollouts from each resulting state
    2. Takes the action that had the lowest cost and was still feasible, according to the information generated by the heuristic policy
  8. <Ah yes, this paper is the constrained version of that discussed in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimization>] which I’m already familiar with>
    1. Ok so at this point I’m just going to skim and mine for references, as was recommended to me
  9. In some cases the heuristic will not produce policies that satisfy the constraints, in which case you are in trouble, but they introduce the idea of a heuristic being sequentially consistent, which basically means the policy itself is Markovian (they then say in this case you can just call the heuristic a policy)
  10. A sequentially improving base heuristic has a cost that keeps reducing as the rollout progresses <this actually does not seem trivial to achieve, but they say any sequentially consistent algorithm is also sequentially improving, so I guess it is>, and also that trajectories must be consistent
  11. <I suppose somewhat obviously> that if the base heuristic is sequentially improving, doing deeper rollouts with this heuristic will improve the solution quality
  12. “If the base heuristic is sequentially improving, then the rollout algorithm produces a trajectory … that is feasible and has cost that is no larger than the cost of the trajectory generated by the base heuristic starting from x0.”
  13. Then moves on to what you can do if the trajectory is not feasible
  14. As well as the case where doing trees (multiple partial trajectories) instead of rollouts
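Items 7.1-7.2 describe one decision step of the constrained rollout: complete each candidate action with the base heuristic, discard completions that violate the constraints, and take the cheapest surviving action. A toy sketch (all names here, and the toy domain in the test, are my own, not Bertsekas's notation):

```python
def constrained_rollout_action(state, actions, step, base_heuristic,
                               feasible, cost):
    """Pick the action whose heuristic completion is feasible and cheapest.

    base_heuristic(s) must return a full trajectory (list of states)
    starting from s.  Returns None if no action has a feasible completion."""
    best_action, best_cost = None, float("inf")
    for a in actions(state):
        traj = [state] + base_heuristic(step(state, a))
        if feasible(traj) and cost(traj) < best_cost:
            best_action, best_cost = a, cost(traj)
    return best_action
```

This is exactly why the base heuristic producing feasible trajectories matters (item 9): if every completion violates the constraints, the step above has nothing to choose from.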

HC-Search: Learning Heuristics and Cost Functions for Structured Prediction. Rao Doppa, Fern, Tadepalli. AAAI 2013

Won best paper award

  1. “Structured prediction is the problem of learning a function from structured inputs to structured outputs.”
  2. This paper deals with structured prediction applied to search (although it is not the first such paper)
  3. “Given a structured input, the framework uses a search procedure guided by a learned heuristic H to uncover high quality candidate outputs and then uses a separate learned cost function C to select a final prediction among those outputs.”
  4. The regret can be decomposed into the inability of H to find good results, and the inability of C to select the best output <not sure what these two things mean exactly yet>
  5. Based on these two factors of regret, regret is minimized first by “training H to quickly uncover high quality outputs via imitation learning, and then training C to correctly rank the outputs generated via H according to their true losses.”
  6. <Claim> Excellent empirical results <this is probably true otherwise it would not win best paper award>
  7. Structured prediction is just supervised learning.  <Although it doesn’t seem like it is always part of structured learning> They leverage a cost function <which I suppose must be equivalent to an error metric>
  8. In general minimizing the cost function is intractable, and in practice “… heuristic inference methods such as loopy belief propagation and variational inference have shown some success in practice.  However, the learning algorithms generally assume exact inference and their behavior in the context of heuristic inference is not well understood.”
  9. <It seems like the problem is discrete, because they say that if the optimization problem is a tree then it is tractable, and also…>
  10. “Another approach to dealing with the Argmin problem is to eliminate it altogether and instead attempt to learn a classifier that can greedily produce a structured output by making a series of discrete decisions (…)  Unfortunately, in many problems, some decisions are difficult to make by a greedy classifier, but are crucial for good performance.”
  11. Mention some method which seems to be state of the art at the moment, but mention that it has a number of serious limitations that will be addressed here
  12. <I honestly am not sure what type of paper I’m reading here so far.  Is it search, optimization, or supervised learning?>
  13. The decomposition of the problem into H and C helps: “… the performance of the approaches with a single function can be arbitrarily bad when compared to that of HC-Search in the worst case.”
  14. In practice, HC-Search also works better <so it seems like a case where the assumptions in the theory hold up>
  15. “Finally, we note that our methodology can be viewed as similar in spirit to Re-Ranking (Collins 2000), which uses a generative model to propose a k-best list of outputs, which are then ranked by a separate ranking function. In contrast, we search in efficient search spaces guided by a learned heuristic that has minimal restrictions on its representation.”
  16. <Seems like> the setting is like standard supervised learning, except you are also provided with a cost function that must be optimized <I suppose most supervised learning algorithms assert what cost they are optimizing, such as MSE, which might not be the case in the problem defined here>
  17. There is a sort of forward model given <seems like it can do multistep look ahead, starting with x and proposing different possible solutions y>
  18. <So it seems like there is a sort of combinatoric search problem>
  19. Both the learned heuristic H and the learned cost function C are mappings from X × Y to the reals.
    1. In addition there is some search strategy <seems like here they just assume greedy>
  20. Given some x and some time bound, the search procedure is run using H, and the best item found in terms of C during the whole search is returned
  21. They “… consider the feasibility of an exact solution to this learning problem in the simplest setting of greedy search using linear heuristic and cost functions represented by their weight vectors wH and wC respectively.”
  22. They “… consider the HC-Search Consistency Problem, where the input is a training set of structured examples, and we must decide whether or not there exists wH and wC such that HC-Search using greedy search will achieve zero loss on the training set.”
    1. This problem is NP-hard.  In fact, the problems of finding a zero-loss heuristic and a zero-loss cost function are both NP-hard
  23. Even though the problem is NP-hard, the decomposition into H and C still helps
  24. Because the problem is intractable, they resort to greedy stage-wise search for H and C
  25. “Most generally, learning a heuristic can be viewed as a Reinforcement Learning (RL) problem where the heuristic is viewed as a policy for guiding “search actions” and rewards are received for uncovering high quality outputs. In fact, this approach has been explored for structured prediction in the case of greedy search (Wick et al. 2009) and was shown to be effective given a carefully designed reward function and action space. While this is a viable approach, general purpose RL can be quite sensitive to the algorithm parameters and specific definition of the reward function and actions, which can make designing an effective learner quite challenging. Indeed, recent work (Jiang et al. 2012), has shown that generic RL algorithms can struggle for some structured prediction problems, even with significant effort put forth by the designer.”
    1. They go with a non-RL approach, imitation learning
  26. Their formulation of imitation learning is distinct from supervised learning in that they only care that it helps the heuristic make correct choices.  Basically the only concern is that it gets the ordering of the values of the options correct, as opposed to getting the values themselves right
    1. This type of strategy is called rank-based
    2. “Most common search procedures such as greedy search, beam search, and best-first search fall under this category.”
  27. The rank-based learner they used is called “the online passive-aggressive algorithm”
  28. The cost-function learner <seems> to be trained to accurately predict the data from the distribution that A and H yield
  29. Their experimental results cover large standard test sets, such as handwriting recognition, scene labeling and some others
  30. “The results in the scene labeling domain are the most significant, improving the error rate from 27.05 [the previous state of the art] to 19.71.”
  31. “One of the advantages of our approach compared to many frameworks for structured prediction is the ability to use more expressive feature spaces without paying a huge computational price.”
  32. Outperforms other methods basically across the board <are there other state of the art methods here that aren’t listed?>
  33. Plain C-Search (as opposed to HC-Search) doesn’t work as well; even with 3rd-order features it performs worse than HC-Search with 2nd-order features.  The difference in performance can be attributed to the fact that HC-Search optimizes for the distribution found by H, which plain C-Search does not
  34. When given the true loss function instead of the learned one, the improvements in performance weren’t significant
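The test-time procedure in items 3 and 20 (greedy search guided by H, with C choosing the final prediction from everything visited) can be sketched as follows; this is a toy with my own names, while the real system learns H and C as described above:

```python
def hc_search(x0, successors, H, C, budget):
    """Greedy search guided by heuristic H; return the visited output
    that the separate cost function C likes best (lower is better)."""
    current, visited = x0, [x0]
    for _ in range(budget):
        succs = successors(current)
        if not succs:
            break
        current = min(succs, key=H)  # H only guides where search goes
        visited.append(current)
    return min(visited, key=C)       # C only ranks what was uncovered
```

The split matters exactly as item 4 says: H's job is to get good outputs into the visited set, and C's job is to rank them, so each source of regret can be attacked separately.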

Cross Entropy optimization applied to humanoid locomotion in a noisy domain with a deterministic model.

In this experiment, the true domain has a large amount of noise; between 10-50 newtons of force is randomly applied to one randomly selected joint at each time step.  During the rollouts, there is no such noise, so the planner must deal with unpredictable perturbations of the domain.

Cross-Entropy Motion Planning. Kobilarov. International Journal of Robotics Research. 2012.

  1. Not the same research as presented in
  2. This approach is for the noiseless setting
  3. The previous paper deals with using cross-entropy directly, while this paper discusses a combination of cross-entropy and RRT*
  4. The additional RRT* component is intended to help deal with the more difficult regions of planning: “In particular, more complicated obstacles such as those associated with narrow passages significantly shrink the feasible regions in trajectory space and thus, in the absence of a good prior, can render the method intractable due to the large number of rejected samples.”
  5. Considers open-loop planning, although mentions there are two ways to parameterize trajectories:
    1. As a sequence of actions: “Conditions for resolution completeness of planning with such primitives [encoding of sequence of actions with time duration for each action] have been established (Yershov and LaValle 2011).”
    2. As a sequence of states
    3. The paper outlines methods for doing both, as either may be better depending on the domain
  6. “In general, the problem cannot be solved in closed form since both the dynamics and constraints can be non-linear.  Gradient-based optimization is not suitable unless a good starting guess is chosen since the constraints impose many local minima.  In addition, constraints corresponding to arbitrary obstacles can be non-smooth and require special differentiation (Clarke et al. 1998) to guarantee convergence.”
  7. In sampling based motion planning, a graph is produced that is a finite approximation to the infinite set of feasible trajectories, so the original problem can be solved approximately through the graph
  8. Care about optimality at the limit.  PRMs approach optimality, but this occurs at an exponentially slow rate (defined how exactly?) due to the increasing number of edges+vertices “which in higher dimensions becomes computationally intractable.”
    1. So this paper cares about RRTs and not PRMs
  9. RRT* has a problem of uniformly distributing samples, and most of these samples don’t help at all.  By using cross-entropy optimization on top, it attempts to allocate samples along the path where the optimal trajectory may lie
    1. CE optimizes the method of sampling vertices
    2. RRT* is further optimized to always attempt to connect newly added nodes to the goal
  10. “Even though general theoretical convergence of the CE method has been shown [citations] actual rates of convergence, sample complexity, or precise performance guarantees remain open problems.”
  11. The difference between RRT* and cross-ent is that one samples in state space, the other in parameterized trajectories (such as action sequences)
  12. CE was originally designed for rare-event estimation. “The rare event of interest in this work is finding a parameter Z with a real-valued cost J(Z) which happens to be very close to the cost of an optimal parameter Z*… the rare-event estimation is equivalent to the global optimization of J(Z)”
  13. Uses Gaussian Mixture Models as the basis for CE optimization, because it is easy to do with EM
    1. Although it can only capture as many local regions as components in the mixture, each Gaussian component can be regarded as an approximation of a local second-order model of the objective function centered at each mean (by considering the covariance as the inverse Hessian of the cost – I would like a citation/further discussion of this)
  14. Discusses connection to CE and CMA as well as another algorithm called EDA.  The connection to CMA was already discussed before in the CMA-ES paper.  Says the two are virtually identical in the Gaussian case.
  15. “In case when very few elite samples are available a classical GMM EM algorithm will fail to estimate correct means and covariances.”  So they add additional noise, as is common in Cross-Ent
  16. Gives examples of double integrator (actually in 2D, with obstacles), driving (Dubins?), and helicopter navigation
  17. In the helicopter domain, 85% of rollouts with regular CE are thrown out, so RRT* is added to make samples more useful
  18. In TCE RRT* (There is also an SCE proposed), examines correlations of states across entire trajectories
  19. In the helicopter, shows expected optimal trajectory as a result of 3 mixture components (I’m not sure what dimension, though).  The result is very complex in terms of the resulting path in the x,y space.
  20. Discusses other distributions that may make a good tradeoff between efficiency and accuracy.
  21. In a higher dimensional double-integrator domain with obstacles, the two algs here SCE-RRT*, and TCE-RRT* outperform RRT*.
  22. Use a domain I haven’t heard of, which is a weird version of a plane that has Dubins steering in x,y and a double integrator in z.
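The CE loop in items 12-15 (sample trajectory parameters, keep the elite fraction, refit the sampling distribution, and inject extra noise so it doesn't collapse) looks roughly like this with a single diagonal Gaussian standing in for the paper's GMM; all parameter values are my own:

```python
import numpy as np

def cross_entropy_minimize(J, mu0, sigma0, iters=30, pop=100,
                           elite_frac=0.1, extra_noise=1e-3, seed=0):
    """Minimize J by iteratively refitting a diagonal Gaussian to the elites."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    sigma = np.asarray(sigma0, dtype=float)
    n_elite = max(2, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, mu.size))
        costs = np.array([J(s) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]  # lowest-cost samples
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + extra_noise  # noise injection (item 15)
    return mu
```

In the motion-planning setting J(Z) would be the cost of the trajectory decoded from parameter vector Z (with infeasible trajectories given infinite cost), which is where the 85% rejection rate in item 17 bites.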

Sparse Roadmap Spanners. Dobson, Krontiris, Bekris. Workshop on the Algorithmic Foundations of Robotics (WAFR) 2012

  1. PRMs approach optimality at the limit, but of course this is not actually practical in terms of time or memory requirements
  2. If optimality is relaxed, near-optimal roadmaps can be made that are much more sparse
  3. The approach here asymptotically:
    1. Is complete
    2. Near optimal
    3. Probability of adding nodes converges to zero as time passes
  4. The implication is that finite size data structures can plan near optimally
  5. Empirical results back these claims up
  6. RRT isn’t focused on here, even though it can plan in dynamic domains, because the goal is preprocessing with PRMs
  7. Fixed-k PRM, where each point is connected to a fixed number k of nearest points, isn’t optimal, but if k is scaled logarithmically with the number of nodes, it is.  Delta-PRM, which connects all points in a delta-ball, is asymptotically optimal, I suppose because the number of points in the ball increases as the number of samples increases
    1. These methods just keep adding nodes and vertices at each iteration, and there is no way to know when to stop, so the graph simply grows and grows
  8. The sequential roadmap spanner algorithm produces smaller maps by utilizing graph spanners: subgraphs whose node-to-node traversal cost is allowed to be within some multiplicative factor of the cost in the full graph, so the graph remains small but its suboptimality is bounded
    1. In practice the graphs tend to be much smaller but don’t sacrifice much in terms of optimality
  9. This paper introduces the SPArse Roadmap Spanner algorithm.  It has the properties:
    1. Of being probabilistically complete
    2. Can connect any 2 points with a path of length at most t·c* + 4·delta.  t and delta are parameters to the algorithm, c* is the cost of the optimum path between query points in Cfree
    3. Converges to a finite size roadmap
  10. They say that, to the best of their knowledge, this is the first approach that is near optimal and produces compact results – I wonder how Brafman’s paper from ICAPS compares?
  11. The approach has a natural stopping criterion (not sure what it is yet though)
  12. The alg builds 2 graphs in parallel, one sparse and one dense
    1. There are limits on edge lengths – the sparse graph has longer edges
  13. Uses visibility-based navigation through the sparse graphs with guard and bridge nodes
  14. This method has a tiny fraction of the number of nodes delta-prm makes, and the path quality is very close
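The spanner test behind item 8 (only keep an edge if the sparse graph can't already connect its endpoints within a stretch factor t of the edge's weight) can be sketched as a generic greedy t-spanner over an explicit graph; this is not the SPARS algorithm itself, and the names are mine:

```python
import heapq

def shortest_dist(adj, src, dst):
    # Dijkstra over a dict-of-dicts adjacency; returns inf if unreachable
    dist, frontier = {src: 0.0}, [(0.0, src)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(frontier, (nd, v))
    return float("inf")

def greedy_spanner(nodes, edges, t):
    """Keep an edge (w, u, v) only if the spanner's current u-v distance
    exceeds t * w, so all pairwise distances stay within a factor t."""
    adj = {v: {} for v in nodes}
    for w, u, v in sorted(edges):  # consider shortest edges first
        if shortest_dist(adj, u, v) > t * w:
            adj[u][v] = w
            adj[v][u] = w
    return adj
```

Larger t prunes more edges at the price of longer detours, which is the size-versus-path-quality tradeoff item 14 reports.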