Category Archives: Tree Search

Rollout Algorithms for Constrained Dynamic Programming. Bertsekas. LIDS 2005

<This is a tech report?>

  1. Based on the abstract, seems similar to http://www.mit.edu/~jnt/Papers/J066-97-rollout.pdf, and his other work that deals with heuristics and rollouts
  2. “Under suitable assumptions, we show that if the base heuristic produces a feasible solution, the rollout algorithm also produces a feasible solution, whose cost is no worse than the cost corresponding to the base heuristic.”
  3. <Important!> “If the base heuristic is a DP policy, i.e., it generates state/control sequences via feedback functions μk, k=0,1,…,N-1, so that at any state xk, it applies the control μk(xk), then the algorithm can be viewed as a policy improvement step of the policy iteration method.  Based on the generic results for policy iteration, it then follows that the rollout algorithm’s cost <doing cost minimization in this setting> is no worse than the one of the base heuristic.  This cost improvement property has been extended in [BTW97 <Bertsekas, Tsitsiklis, Wu. Rollout algorithms for combinatorial optimization>] to base heuristics that are not DP policies, but rather satisfy an assumption called sequential improvement, which will be discussed shortly within the constrained context of this paper.”
  4. <Spring cleaning – Didn’t get to finish>

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Guo, Singh, Lee, Lewis, Wang. NIPS 2014

  1. Model-based Atari planners are better than the famous DQN, but very slow
  2. Goal here is to build a real-time MCTS agent better than DQN
  3. Use UCT to train NN, 300 max depth, 10k trajectories (minor modifications for a couple of games)
  4. A few particular approaches:
    1. Run 800 episodes in the game with UCT’s policy, and train the NN by regression on the values computed by UCT
    2. Use UCT as above, but train the NN as a classifier that picks the action chosen by UCT
      1. One issue is that the policy generated by these approaches will in general be different from that found by UCT during search, so the values will not be for the policy used
    3. So they propose an “interleaved” approach: UCT is still used to compute values, but part of the time the actions taken during data collection follow the NN policy, so the data better matches the policy that will actually be deployed (hedged sketch after this list)
  5. Learn in 84×84 greyscale
  6. Use same NN arch as DQN
  7. UCT with classification is as good as or better than DQN; UCT with regression is worse
  8. To be consistent with the DQN paper, they leave in 5% exploration when recording scores, although it doesn’t make much sense to do that when evaluating performance
  9. UCT without any realtime constraints really beats up on everything (roughly double score of best)
  10. Also, the emulator runs at 60 fps, not 30 that might be expected <the 60 is consistent with atari hardware, I checked>
  11. Interleaved method performance jumped 10% when doubling trajectories
  12. <I thought they were going to try a DYNA 2 approach, wonder how it would work>
  13. They observed some horizon effects in UCT where some good behavior requires really long timescales that couldn’t be reasoned about
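
<Hedged sketch of the three data-collection schemes as I read them; uct_search, policy_net, and the env interface are assumptions, not the authors’ code.>

```python
import random

def collect_uct_training_data(env, uct_search, policy_net, num_episodes,
                              interleave_prob=0.0):
    """Collect (state, UCT action values, UCT action) tuples for training a NN.
    Assumed interfaces: uct_search(state) -> (action_values, best_action),
    policy_net(state) -> action, env.reset() -> state,
    env.step(action) -> (state, reward, done)."""
    dataset = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action_values, uct_action = uct_search(state)       # slow, offline UCT
            dataset.append((state, action_values, uct_action))  # regression / classification targets
            # interleaved variant: take some actions with the NN so the data
            # distribution matches the policy that will actually be deployed
            act = policy_net(state) if random.random() < interleave_prob else uct_action
            state, _, done = env.step(act)
    return dataset
```

The regression variant fits the NN to action_values, the classification variant fits it to uct_action as a label, and interleave_prob > 0 gives something like the interleaved scheme.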

Improving UCT Planning via Approximate Homomorphisms. Jian, Singh, Lewis. AAMAS 2014

  1. Improves performance of UCT by:
    1. finding local abstractions in the DAG built by UCT
    2. using approximate homomorphisms (as opposed to true homomorphisms which preserve optimal policies, but are rare and computationally difficult to develop)
  2. Derive a lower bound on the performance of the abstraction method when used with UCT
    1. <If the bounds are specific for UCT it might be moot anyway because of the horrible worst-case performance of the algorithm>
  3. Also, homomorphisms are global abstractions, but the paper is concerned with UCT, which is generally applied when global methods are too expensive
  4. “One interesting overall insight obtained herein is that the more computationally limited UCT is, the coarser are the best-performing abstractions.”
  5. “A homomorphism is a perfect abstraction in the sense that the induced MDP is equivalent to the original MDP for planning purposes.”
  6. Here, they just glob states together into clusters and use each cluster as an abstract state (visit counts and average returns apply to the set as a whole; quick sketch after this list)
  7. <OK, think I get the idea, not so relevant for now but may come back to it>
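
<Quick sketch of the aggregation idea as I read it; abstraction() is an assumed clustering function, not the paper’s construction.>

```python
class AbstractedNodeStats:
    """Visit counts and return sums are kept per abstract cluster, so every
    ground state mapped to the same cluster shares one set of UCT statistics."""

    def __init__(self, abstraction):
        self.abstraction = abstraction   # maps ground state -> cluster id
        self.visits, self.returns = {}, {}

    def update(self, state, ret):
        c = self.abstraction(state)
        self.visits[c] = self.visits.get(c, 0) + 1
        self.returns[c] = self.returns.get(c, 0.0) + ret

    def value(self, state):
        c = self.abstraction(state)
        return self.returns.get(c, 0.0) / max(self.visits.get(c, 0), 1)
```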

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers

Abstract

  1. Main goal of the work is to “interchange” the discretization components of:
    1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
      1. Underlying component is Incremental Regression Tree Induction (IRTI)
    2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
      1. Underlying component is HOO
  2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
  3. Also include the idea of transposition tables
  4. Work examines behavior, computation time, computational complexity, and memory of both algs
  5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations

cool.

Ch 1: Introduction

  1. In terms of doing the actual action planning, talks of two possible options:
    1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
    2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
  2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originally TLS was coupled with IRTI and HOLOP with STL)
    1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
    2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
    3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
    4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
  3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
  4. The Basic Research Questions for the Thesis are:
    1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP?
    2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS?
    3. How can the information retrieved from simulations be reused for TLS?
    4. Can a transposition tree increase the performance of simulation-based systems in continuous environments?
    5. Which combination of techniques is best?
    6. What is the memory cost of these algorithms?
    7. What is the computational time complexity of the algorithms?

Ch 2: Preliminary Knowledge

  1. Discusses use of Zobrist hashing for high-performance transposition tables.  I’ve never heard of it.
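
<Not from the thesis: a minimal sketch of standard Zobrist hashing, since it was new to me.>

```python
import random

class ZobristHasher:
    """Each (square, piece) pair gets a fixed random 64-bit key; a position's
    hash is the XOR of the keys of its occupied squares."""

    def __init__(self, num_squares, num_piece_types, seed=0):
        rng = random.Random(seed)
        self.keys = [[rng.getrandbits(64) for _ in range(num_piece_types)]
                     for _ in range(num_squares)]

    def full_hash(self, board):
        # board: dict mapping square index -> piece type index
        h = 0
        for square, piece in board.items():
            h ^= self.keys[square][piece]
        return h

    def toggle(self, h, square, piece):
        # XOR-ing the same key adds or removes a piece, so the hash is
        # updated incrementally in O(1) per move
        return h ^ self.keys[square][piece]
```

The O(1) incremental update is what makes it attractive for high-performance transposition tables.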

Ch 3: Automatic Decomposition

  1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
  2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
  3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
  4. Discussion of scaling of the exploration factor (commonly called C)
    1. They propose scaling it according to the range of values seen from each node, then modifying it by a factor called k (rough sketch after this list)
    2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
    3. But they also mention that if vmin, vmax are known for the problem they can be used, so that’s all fine.  If you don’t know them, you need to resort to something like the above
  5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option
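
<Rough sketch of the exploration-scaling rule described above, as I read it; the node fields and the exact form are my guesses, not the thesis’s pseudocode.>

```python
import math

def ucb_score(child, parent_visits, k=0.2, v_min=None, v_max=None):
    """UCB-style score where the exploration term is scaled by a value range:
    either a known [v_min, v_max] for the problem, or the range of returns
    observed at this node, modified by the factor k.  Assumed fields:
    child.total_return, child.visits, child.min_return, child.max_return."""
    mean = child.total_return / child.visits
    if v_min is None or v_max is None:
        v_min, v_max = child.min_return, child.max_return   # fall back to observed range
    bonus = k * (v_max - v_min) * math.sqrt(math.log(parent_visits) / child.visits)
    return mean + bonus
```

Note that with the observed-range fallback, a node where every return so far is identical gets zero exploration bonus, which is exactly the failure case in the p. 16 note above.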

Ch 4: Multi-step Optimization in Continuous Environments

  1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
  2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
  3. If a sample causes a split in a leaf of one of the internal trees, the trees below that point must be discarded
    1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans; small sketch after this list)
  4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
  5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>
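
<Small sketch of what sequence (open-loop) planning amounts to, under an assumed generative-model interface; not the thesis’s code.>

```python
def evaluate_sequence(env_model, start_state, flat_actions, horizon, action_dim):
    """Score one flattened plan: a point in a (horizon * action_dim)-dimensional
    space is decoded into a sequence of actions and rolled through the model.
    Assumed interface: env_model.step(state, action) -> (next_state, reward)."""
    state, total_reward = start_state, 0.0
    for t in range(horizon):
        action = flat_actions[t * action_dim:(t + 1) * action_dim]
        state, reward = env_model.step(state, action)
        total_reward += reward
    return total_reward
```

An optimizer over the flat space (HOO in HOSTL, IRTI in RSTL) treats this as a black-box objective; the plan is open-loop because no action depends on the states actually reached.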

Ch 5: Continuous Transposition Tree

  1. Two methods for constructing transposition tables are introduced
  2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
  3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
  4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
  5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space (rough structural sketch after this list)
    1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
  6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
  7. When MTTs are used, from root to leaf states are traversed according to the query state (naturally), while actions are followed according to whichever has the highest value under a particular computation; I think it is basically like UCB, which allows for exploration.
  8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
  9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>
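
<Rough structural sketch of how I picture an LTT; the state-tree and action-tree objects are assumed interfaces, not the thesis’s classes.>

```python
class LevelBasedTranspositionTree:
    """A decision tree over the state space whose leaves each root a separate
    decision tree over the action space (the LTT structure of Ch. 5).
    Assumed interfaces: state_tree.leaf_for(state), state_tree.maybe_split(leaf),
    and action trees with select_action() / update(action, ret)."""

    def __init__(self, state_tree, make_action_tree):
        self.state_tree = state_tree            # decomposes the state space
        self.make_action_tree = make_action_tree
        self.action_trees = {}                  # state-leaf id -> action tree

    def select_action(self, state):
        leaf = self.state_tree.leaf_for(state)  # descend the state tree
        tree = self.action_trees.get(leaf.id)
        if tree is None:
            tree = self.action_trees[leaf.id] = self.make_action_tree()
        return leaf, tree.select_action()       # UCB-style choice inside the leaf

    def update(self, leaf, action, ret):
        self.action_trees[leaf.id].update(action, ret)
        if self.state_tree.maybe_split(leaf):
            # splitting a state leaf invalidates its action tree; its samples
            # can be replayed into the new leaves' trees fairly cheaply
            self.action_trees.pop(leaf.id, None)
```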

Ch 6: Experiments and Results

  1. UCT here isn’t the normal version of UCT; it sits somewhere between HOSTL and IRTI
  2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
  3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but chooses best found), and a version of UCT that uses uniform discretization
  4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, though given more samples IRTI should eventually beat UCT)
  5. Good parameterizations are found for each problem separately.
  6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, and vanilla-MC actually outperforms UCT
  7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
  8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
  9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
    1. Here, RMTL (original tree-learning search) performs best by a wide margin
    2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
    3. This is under time constraints
  10. <p. 35 They test MTT by itself, which I’m not exactly grokking because I thought it was a technique for allowing transposition tables to be used and not a policy, but I’m missing something>
  11. Have very nice experiments of sample distributions on 1-step optimization problems
  12. The HOO-family of algorithms have by far the worst memory use.  The regression-tree algorithms use the least memory <I guess because they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because a split requires a test confirming statistical significance>
  13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Conclusions

  1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
  2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
  3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

Programming a Computer for Playing Chess. Shannon. Philosophical Magazine 1950

I feel myself approaching this paper more as scripture than a scientific paper.
  1. “Although perhaps of no practical importance, the question is of theoretical interest, and it is hoped that a satisfactory solution of this problem will act as a wedge in attacking other problems of a similar  nature and of greater significance.”  He proceeds to list a number of areas that may gain from this research.
  2. “…the proper procedure involves general principles, something of the nature of judgement, and considerable trial and error, rather than a strict, unalterable computing process.”
    1. How the hell did this guy know this back then
  3. <immediately following>”Finally, the solutions of these problems are not merely right or wrong but may have a continuous range of ‘quality’ from best down to worst.  We might be satisfied with a machine that designed good filters even though they were not always the best possible”
  4. Discusses evaluation functions
  5. Discusses brute-force minimax search, and that this is not feasible, and likewise a table lookup for all policies 
  6. Defines a strategy in the same way we do a policy
  7. “Although in chess there is no known simple and exact evaluating function f(P), and probably never will be because of the arbitrary and complicated nature of the rules of the game, it is still possible to perform an approximate evaluation of a position.  Any good chess player must, in fact, be able to perform such a position evaluation.”
  8. Discusses rules of thumb for chess, and how they can be combined to form an evaluation function
  9. Evaluation functions should only be performed for an entire variation (a searched sequence of moves), to fully allow for inclusion of opponent responses
  10. Describes minimax search with cutoffs, where the evaluation function is applied at the cutoff depth (minimal sketch after this list)
  11. It’s interesting to see how some of the ideas here are totally still relevant today, and other points seem archaic.  For example, he goes to great pains describing how to record a move in a computer program, or check for available moves, which is now taken for granted and considered trivial.
  12. Shannon was able to checkmate a random opponent in 4 or 5 moves, usually
  13. Discusses search methods based on what masters do, which involve doing many plies when simple, but otherwise pruning extensively and only searching a small number of plies ahead
  14. Stochasticity can be introduced so the machine randomly chooses between moves that look nearly the best, in order to make the opponent’s decisions more difficult
  15. “The chief weakness is that the machine will not learn by mistakes.  The only way to improve its play is by improving the program.  Some thought has been given to designing a program which is self-improving but, although it appears to be possible, the methods thought of so far do not seem to be very practical.  One possibility is to have a higher level program which changes the terms and coefficients involved in the evaluation function depending on the results of games the machine has played.  Slam variations [Some?-Transcription error? I don’t think this is original] might be introduced in these terms and the values selected to give the greatest percentage of ‘wins’.”
  16. Argues that a machine that tries to play based on principles matching board configurations to analogous famous maneuvers “could not be constructed”:
    1. “Although there are various books analyzing combination play and the middle game, they are written for human consumption, not for computing machines.  It is possible to give a person one or two specific examples of a general situation and have him understand and apply the general principles involved.  With a computer an exact and completely explicit characterization of the situation must be given with all limitations, special cases, etc. taken into account.”
  17. “It is not being suggested that we should design the strategy in our own image.  Rather it should be matched to the capacities and weaknesses of the computer.  The computer is strong in speed and accuracy and weak in analytical abilities and recognition.  Hence, it should make more use of brutal calculations than humans, but with possible variations increasing by a factor of 10^3 every move, a little selection goes a long way toward improving blind trial and error.”
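
<Minimal sketch of the depth-limited minimax Shannon describes (his type A strategy), with the evaluation function applied at the cutoff; the game-interface functions are assumptions.>

```python
def minimax(position, depth, evaluate, legal_moves, apply_move, maximizing):
    """Search all variations to a fixed depth; apply the evaluation function
    f(P) at the cutoff (or at terminal positions)."""
    moves = legal_moves(position)
    if depth == 0 or not moves:
        return evaluate(position), None
    best_value = float("-inf") if maximizing else float("inf")
    best_move = None
    for m in moves:
        value, _ = minimax(apply_move(position, m), depth - 1,
                           evaluate, legal_moves, apply_move, not maximizing)
        if (maximizing and value > best_value) or (not maximizing and value < best_value):
            best_value, best_move = value, m
    return best_value, best_move
```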

On-line Policy Improvement Using Monte-Carlo Search. Tesauro, Galperin. NIPS 1996

The basic takeaway of this paper is that doing even limited rollouts can significantly improve the performance of poor policies
  1. For domains with stochasticity, Monte-Carlo estimations are necessary (so the implication is the policy is deterministic); a minimal sketch of the rollout decision rule follows this list
  2. Discusses root parallelization and pruning heuristics
  3. On average, rollouts in backgammon need to be around 30 steps long to play to completion
  4. There are generally about 20 legal moves that can be considered at each step, differences in values of initial actions generally vary by 0.01, when scores range from 1 to 3 (or negative) for a win, gammon, and backgammon, respectively
  5. Based on this, using pure Monte-Carlo sampling it would be necessary to perform hundreds of thousands of rollouts
    1. With pruning, roughly a million decisions have to be made to come to a result, typical tournament level human players take roughly 10 seconds
  6. Adding rollouts makes linear policies go from -0.5 to essentially 0 (the opponent is most basic configuration of TD-Gammon 2.1 with no lookahead)
  7. Next experiments do limited length rollouts (7 or 11 steps) and then use ANN for “equity” (evaluation) function
  8. Points to a paper by Shannon from 1950 that discusses rollouts
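
<Minimal sketch of the rollout decision rule; simulate() hides the base policy and the dice and is an assumed interface.>

```python
def monte_carlo_action(state, candidate_moves, simulate, num_rollouts):
    """Score each candidate move by averaging the outcomes of many rollouts
    that follow the base policy to completion, then play the best one.
    Assumed interface: simulate(state, move) -> final equity (stochastic)."""
    best_move, best_value = None, float("-inf")
    for move in candidate_moves:
        value = sum(simulate(state, move) for _ in range(num_rollouts)) / num_rollouts
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```

The truncated variant in the later experiments would stop each rollout after 7 or 11 steps and plug in the ANN equity function instead of playing to completion.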

Rollout Algorithms for Stochastic Scheduling Problems. Bertsekas, Castanon. Decision and Control. 1998.

This paper is great
I am reading this paper because I saw a reference to it elsewhere, where it says if 1-ply rollouts are used (choose one action, then follow policy pi for the remainder of the rollout), the result of that method is provably better than just pi under mild assumptions.  This is relevant to the real-time vanilla MC work
  1. Considers stochastic scheduling problems which I believe means choosing a sequence of decisions to make.  The size of the decision space is combinatorial.
  2. Paper considers the quiz problem:
    1. You are on a game show, and there are a bunch of questions with different payouts.  Your goal is to choose the ordering of the questions that maximizes your cumulative payout until failure
  3. Discuss heuristics for this setting and rollout algorithms based on these heuristics
    1. The performance of these rollout policies ends up being near-optimal, and much better than the heuristics themselves
  4. In the basic quiz setting, questions should just be answered in decreasing order of p_i*v_i/(1-p_i).  The greedy policy of answering in descending p_i*v_i is suboptimal
  5. With small changes, however, the problem becomes very difficult.  Examples are:
    1. The total number of questions you are allowed to attempt is smaller than the entire number of available questions (N)
    2. Constraints on when in time each question may be selected.
    3. Ordering/Precedence constraints on questions
    4. Sequence-dependent rewards, where rewards depend on history
  6. In all of these variants there is an optimal open-loop policy (I suppose because you either get the question right and proceed, or make a mistake, and the episode terminates)
  7. There are also other variants of the quiz problem listed that introduce stochasticity; these are solvable via DP in principle, but doing so is intractable
  8. In this paper, they propose suboptimal solutions that are computationally tractable
    1. These solutions are based on rollout algorithms, which they say are based on policy iteration
  9. Rollout algorithms initially proposed by Bertsekas and Tsitsiklis, and Wu in 96,97 (Learning to Act Using Real-Time Dynamic Programming, Rollout Algorithms for Combinatorial Optimization)
  10. Generally, rollout algorithms are capable of magnifying the effectiveness of any given heuristic algorithm through sequential application, by enhancing the policy improvement step of the underlying policy iteration that is occurring
  11. The rollout method described uses a heuristic value (defined for each question), and then greedily chooses questions based on the next best heuristic value.  The cost of this is O(MN), where there are N questions, but you are restricted to choosing a total of M of them (toy sketch of the one-step rollout after this list)
    1. I think the point is this is different from just computing the heuristic values (N times) and then sorting them according to that.
  12. The superiority of the rollout method as compared to the base heuristic is guaranteed when the heuristic is sequentially consistent, which means that basically the order in which the heuristic chooses things is independent of what has already been chosen (aside from the fact that it won’t try to choose something that has already been selected, of course).
    1. This is discussed more in the Bertsekas, Tsitsiklis, Wu paper
  13. They also discuss a property which is called sequentially improving, part of which guarantees that the rewards of schedules produced are monotonically nonincreasing
  14. Also talk about other rollout methods, such as m-step lookahead that enumerates all possible sequences of length m, and then follows the heuristic after that, and chooses the best initial action
    1. This is similar to how UCT is used in practice
    2. There is also another heuristic for how to make this selection cheaper (at the potential cost of performance)
  15. They run experiments on 7 different algs (2 Heuristics, with no lookahead, 1-step lookahead, and 2-step lookahead):
  16. Just doing 1-step rollouts leads to a very significant improvement (almost doubling) in the performance of heuristics relative to optimal; adding another step of lookahead only helped slightly (sometimes results were equivalent to just 1-step lookahead).  For the more difficult problems the second step of lookahead was significantly helpful
  17. Rollouts help fix heuristics that are myopic
  18. Section 4 generalizes the setting to what is basically a MDP (undiscounted, it seems)
  19. Doing one step of policy iteration given a policy pi for an MDP is better than just following pi.
    1. This is equivalent to saying argmax_a Q^{pi}(s,a) is better than pi(s)
  20. Cites Tesauro’s work of Monte-Carlo rollouts in Backgammon
  21. There is some discussion of doing rollouts with the noise parameters clamped to certain values
  22. They call multiple trajectories, each with different noise, “scenarios”
  23. <I’m sort of losing this paper towards the end>
  24. There are then results for more complex quiz domains
    1. The results are comparable to the first batch of empirical results
  25. There is a domain where the questions have graph constraints (like an MDP), and there the benefit of doing rollouts is actually reduced as the heuristics work pretty well by themselves.
    1. Still doing rollouts helps, and brings the results very close to optimal, while the computational costs are drastically reduced
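
<Toy sketch, not the paper’s code: the basic quiz problem with the suboptimal greedy base heuristic and a one-step rollout on top of it.  p and v are lists of per-question success probabilities and values; m is the number of questions you may attempt.>

```python
def expected_reward(order, p, v):
    """Expected payout of answering questions in this order: you keep going
    until the first wrong answer, so question i pays v[i] only if every
    question up to and including it is answered correctly."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= p[i]
        total += survive * v[i]
    return total

def greedy_heuristic(remaining, p, v):
    # base heuristic: answer in decreasing p_i * v_i (suboptimal in general;
    # the optimal unconstrained index is p_i * v_i / (1 - p_i))
    return sorted(remaining, key=lambda i: p[i] * v[i], reverse=True)

def one_step_rollout(questions, p, v, m):
    """One-step rollout: try each remaining question first, complete the rest
    of the schedule with the base heuristic, and commit to the best first choice."""
    schedule, remaining = [], set(questions)
    while remaining and len(schedule) < m:
        best_i, best_val = None, float("-inf")
        for i in remaining:
            rest = greedy_heuristic(remaining - {i}, p, v)[:m - len(schedule) - 1]
            val = expected_reward(schedule + [i] + rest, p, v)
            if val > best_val:
                best_i, best_val = i, val
        schedule.append(best_i)
        remaining.remove(best_i)
    return schedule
```

This is the sense in which a rollout “magnifies” the base heuristic: each decision gets to peek at a full heuristic completion before committing.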

Motion Planning and Optimal Control of Robotic Systems. Marin Kobilarov. Talk

Read through of the actual paper is here: http://wp.me/pglXz-kD

  1. Seems to be strongly motivated by real problems, lots of examples on real robots, or realistic simulations
  2. Continuous vs discrete mechanics (one involves integration, the other summation, Euler Lagrange vs discrete Lagrange)
    1. Discrete mechanics uses discrete trajectories, based on snapshots across time, as opposed to continuous time in trajectories
    2. Gives an example of the continuous and discrete dynamics in modelling pendulum motion
      1. In this case, demonstrates that the Runge-Kutta approximation dissipates energy over time, so the pendulum stops moving, while the discrete version doesn’t
  3. He is interested in discrete geometric mechanics
    1. Uses vector fields, flow maps, and covariant differentials(?)
    2. Transformations make the space linear
  4. Gives an example of what he calls the snakeboard (skateboard where axles rotate) and shows RK integration is very poor compared to the discrete mechanical equations
  5. Computational costs of these methods are very low
  6. Discrete Geometric Optimal Control.   
    1. Equations for this are very simple, and very similar to the equations for the original dynamics.
    2. The difference is the Lagrangian
  7. When working with nonlinear systems, must use iterative methods, like Newton-type root finding
    1. Usually takes only a few (ten or so) iterations, can be done real-time
    2. Regularity of the Lagrangian for control means there is an optimal control solution
  8. Normal methods don’t deal well with collisions or other nonsmooth dynamics
    1. This is still an area of active research
  9. The state-of-the-art method of planning in nonsmooth domains (such as with obstacles) are RRTs or PRMs (probabilistic roadmaps)
    1. A drawback is that these don’t really converge to optimal; they are only good at finding some path to the goal
  10. Proposes random-sampling in trajectory space (adaptive sampling and stochastic optimization, Cross-Entropy for rare-event estimation); a generic CE sketch follows this list
  11. Trajectories can be parameterized in two ways: by piecewise constant control, or by specific states visited.
    1. For control sequences, trajectory estimated by discrete mechanics, for state parametrization, need optimal control
  12. Reformulates the problem as optimization, instead of points try to find small regions around those points
  13. Try to estimate volume of low-cost trajectories via MCMC
  14. When estimating densities can work with combinations of Gaussians, for example.
  15. SCE-RRT* algorithm
  16. Combining sensing and control: how do you plan and try to gain as much information as possible about the environment within a restricted amount of time?
    1. Implemented this on a model helicopter for slam-style learning/planning.  Planned to end up on full-sized helicopters later.
  17. Conclusions:
    1. Discrete mechanics leads to provably good numerical algs
    2. Traj optimization should exploit state space geometry
    3. <missed the rest>
  18. Future Work:
    1. Numerical geom dynamics and control (robust real-time numerics)
    2. planning and control, uncertainty and robustness
    3. info-theoretic bounds on performance/efficiency/complexity
    4. He cares about applications to autonomous robots (surface, sea, air, space)
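
<Generic cross-entropy optimization sketch over a flattened trajectory parameterization (controls or waypoints); this is the textbook CE method under my assumptions, not Kobilarov’s exact algorithm.>

```python
import numpy as np

def cross_entropy_plan(score, dim, iters=50, pop=100, elite_frac=0.1, seed=0):
    """Maximize an assumed black-box trajectory score(z) over R^dim by
    repeatedly sampling candidates, keeping the elite fraction, and refitting
    a diagonal Gaussian sampling density to them."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))    # candidate trajectories
        scores = np.array([score(z) for z in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]      # best-scoring fraction
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```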

Score Bounded Monte-Carlo Tree Search. Cazenave, Saffidine. Computers and Games

  1. Designed for settings where there are more than 2 possible outcomes in a game (they don’t mention scores though; they mention games that can also end in a draw)
  2. Apply the algorithm to Seki and Go
  3. They discuss MCTS as done with Go in practice, but say finding specific sequences of play in that setting doesn’t work well
    1. Solvers can be used to correct this issue, as they don’t use estimates based on evaluation functions, but are instead based on real (partial) minimax solutions
  4. There are methods for propagating optimistic and pessimistic bounds (small sketch of the backup, as I read it, after this list)
    1. Mention B* which I am not familiar with
    2. The optimistic and pessimistic bounds are always correct (the true value is always between them)
    3. Bounds seem to be backed up in the same way as in FS3
  5. They point out how to make the algorithm more efficient in terms of doing computation only when the bounds in the children change
    1. It is possible to do a similar thing in FS3 where we restart rollouts from a point deeper in the tree if its bound hasn’t changed (can treat the root as the last node whose bounds didn’t change). We knew about this but didn’t go in that direction just to keep the algorithm simpler
  6. This algorithm looks like FS3 without the alpha-beta style cutoffs; it only cuts nodes whose bounds are disjoint from the bounds of the parent 
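
<Small sketch of how I understand the pessimistic/optimistic bound backup and the pruning test; node fields are assumptions.>

```python
def backup_bounds(node):
    """At a max node the node's bounds are the max of the children's bounds,
    at a min node the min; the true value always stays inside [pess, opt].
    Assumed fields: node.children, node.is_max, node.pess, node.opt."""
    if not node.children:
        return
    agg = max if node.is_max else min
    node.pess = agg(c.pess for c in node.children)
    node.opt = agg(c.opt for c in node.children)

def prunable(child, parent):
    # a child whose bound interval is disjoint from the parent's can be cut:
    # e.g. at a max node, a child whose optimistic bound is below the
    # parent's pessimistic bound can never affect the final value
    return child.opt < parent.pess if parent.is_max else child.pess > parent.opt
```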

A Biped Walking Pattern Generator based on “Half-Steps” for Dimensionality Reduction. Perrin, Stasse, Lamiraux, Yoshida. ICRA 2011

(Again, reading up on ICRA)

  1. The basic goal is to find “half-steps” (I suppose half a gait, which they can do symmetrically for the other side?).  At the start and end of each, the robot is statically stable and motionless
  2. Since planning in half-steps is half the size of the normal problem, they can do offline planning much more quickly than previously possible (1 hr)
  3. They use RRT* to get a feasible (no self-collision, no collision with obstacles) 2D sequence of half-steps (this planning takes about 14 seconds)
  4. Once RRT finds the raw sequence, it is smoothed into a fluid gait – they still need to maintain feasibility, but this is pretty quick – about 12 secs.
  5. <Much of this is particulars to gaits so much is skipped>
  6. The half-step is broken into an upward and downward part