Symbolic Dynamic Programming for Continuous State and Action MDPs. Zamani, Sanner, Fang. AAAI 2012

1. Paper presents a way to optimally (and in closed-form) solve continuous MDPs, does Symbolic Dynamic Programming (SDP)
2. Functions in multivariate continuous state and action, discrete noise (what's that mean in this case?), piecewise linear reward.
3. There is limited work that deals with exact solutions.  In control, there is LQR/LQG (linear quadratic with gaussian noise), but in LQR transition dynamics or reward cannot be piecewise linear
4. The paper deals with 3 domains: Mars Rover, Inventory Control, and Reservoir Management
5. States can be composed of discrete and continuous factors
6. (not surprisingly) requires a complete model of the MDP
7. Because CSA-MDPs (continuous state and action MDPs) are naturally factored (there is a reference for this), they can be dealt with as a dynamic Bayes net (DBN)
1. Assume no synchronic arcs (all dependencies are strictly forward in time), which allows for a relatively simple description of the transition dynamics
8. The solutions here are optimal discounted finite-horizon
9. The conditional probability functions are described in terms of piecewise linear equations with the following properties:
1. Conditioned only on action, current state, and prev state vars (this is weird, why current state?)
2. They are deterministic (can be encoded with Dirac deltas); this means Gaussian noise in the dynamics, for example, is no good.
3. They are piecewise linear
10. Can sort of handle quadratic rewards, but this is restricted a bit (has to be univariate quadratic?)
11. Describe value iteration for continuous MDPs
12. Dynamics get represented as a bunch of disjoint linear and quadratic functions
13. <its so cold in this room that I can’t even concentrate on what I’m reading, probably missing stuff here>
14. Integration is required, but it turns out that in this setting it is solvable in a particular closed form, so it isn’t an issue
15. This is very cool and important, but a general point to make is that this does exact solving, so it is brittle in a way.  A region that isn’t linear or quadratic just ends up being unsolvable.  Methods that are less exact can deal with that sort of problem
16. The max operator is commutative? What is the casemax operator?
17. It is claimed that the solver is “efficient”, but I’d like to know what that means in particular
18. Solutions require linear checking/pruning in the XADD (I think that's the structure that represents the solution) or problems blow up extremely quickly
19. Claims here that this approach produces the first exact solution for a couple of canonical OR problems.  A very neat paper; I wish I understood it better.  AAAI 2012 seems to have the very prominent talks on videolectures, but that's it.
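To pin down what the casemax operator does (re: point 16), here is a minimal numeric sketch I put together. The "case function" representation and the pointwise-max semantics are from the paper, but the toy f and g, and all the names, are my own illustration; the real work does this symbolically with XADDs, not by evaluating at points.

```python
# Hedged sketch: represent a piecewise linear function as a list of
# (condition, linear function) pairs, and give casemax its pointwise-max
# meaning by evaluating both case functions at a point.

def case_eval(cases, x):
    """Evaluate a case function: first partition whose condition holds."""
    for cond, f in cases:
        if cond(x):
            return f(x)
    raise ValueError("x falls outside all partitions")

def casemax_eval(cases_a, cases_b, x):
    """Pointwise max of two case functions (the meaning of casemax)."""
    return max(case_eval(cases_a, x), case_eval(cases_b, x))

# Two toy piecewise linear functions on the real line.
f = [(lambda x: x < 0, lambda x: -x),      # f(x) = -x for x < 0
     (lambda x: x >= 0, lambda x: 2 * x)]  # f(x) = 2x otherwise
g = [(lambda x: True, lambda x: x + 1)]    # g(x) = x + 1 everywhere
```

Since max itself is commutative, casemax (pointwise max) is too, which I think answers the commutativity question in point 16.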

Masters Thesis: Automatic Decomposition of Continuous Action and State Spaces in Simulation-Based Planning. Colin Schepers

Abstract

1. Main goal of the work is to “interchange” the discretization components of:
1. Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
1. Underlying component is Incremental Regression Tree Induction (IRTI)
2. HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
1. Underlying component is HOO
2. TLS has the issue of throwing away data when new information arrives and trees must be discarded
3. Also include the idea of transposition tables
4. Work examines behavior, computation time, computational complexity, and memory of both algs
5. There are also some new algorithms introduced that extend these algorithms that have better performance in certain situations

cool.

Ch 1: Introduction

1. In terms of doing the actual action planning, talks of two possible options:
1. Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
2. Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)
2. IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originally TLS was coupled with IRTI and HOLOP with STL)
1. IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
2. STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
3. IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
4. MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work
3. The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this
4. The Basic Research Questions for the Thesis are:
1. Can the IRTI component of TLS be replaced by the HOO component of HOLOP
2. Can the HOO component of HOLOP be replaced by the IRTI component of TLS
3. How can the information retrieved from simulations be reused for TLS
4. Can a transposition tree increase the performance of simulation based systems in continuous environments
5. Which combination of techniques is best
6. What is the memory cost of these algorithms
7. What is the computational time complexity of the algorithms

Ch 2: Preliminary Knowledge

1. Discusses use of Zobrist hashing for high-performance transposition tables.  I’ve never heard of it.
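Having now looked it up, here is a minimal Zobrist hashing sketch for the standard discrete-board case (the thesis's actual contribution is adapting transposition tables to continuous spaces, which this does not cover; sizes and names are my own toy choices):

```python
# Hedged sketch of Zobrist hashing: each (square, piece) pair gets a
# fixed random bitstring; a position's hash is the XOR of the bitstrings
# of its occupied squares. Since XOR is its own inverse, the hash can be
# updated incrementally as pieces move, which is why it is fast.
import random

random.seed(0)
N_SQUARES, N_PIECES = 64, 12
table = [[random.getrandbits(64) for _ in range(N_PIECES)]
         for _ in range(N_SQUARES)]

def zobrist_hash(position):
    """position: iterable of (square, piece) pairs."""
    h = 0
    for square, piece in position:
        h ^= table[square][piece]
    return h

def move_piece(h, piece, src, dst):
    """Incremental update: XOR out the piece at src, XOR it in at dst."""
    return h ^ table[src][piece] ^ table[dst][piece]
```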

Ch 3: Automatic Decomposition

1. IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain.  If that split yields two different leaves that are statistically significantly different (F-test), the split is made
2. <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
3. <p. 14 in an extended version of the HOO paper on Arxiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts.  This can make the algorithm much more efficient (log n per step instead of n)>
4. Discussion of scaling of the exploration factor (commonly called C)
1. They propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called k
2. <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
3. But they also mention that if vmin, vmax are known for the problem they can be used, so that's all fine.  If you don't know them, you need to resort to something like the above
5. Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples.  Capping tree depth isn’t mentioned, but is also a reasonable option
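My toy version of the IRTI split rule from point 1: score every candidate threshold by variance reduction (standing in for information gain on continuous targets) and accept the split only if the gain clears a threshold. The thesis uses a proper F-test for the acceptance check; my `min_gain` constant is a crude stand-in for that.

```python
# Hedged sketch of an IRTI-style split search on 1-D regression data.
# Returns (threshold, gain) for the best accepted split, or None,
# meaning "do not split this leaf".

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys, min_gain=0.1):
    base = variance(ys)
    best = None
    for t in sorted(set(xs))[1:]:                  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        # weighted child variance; gain = parent variance minus this
        child = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(ys)
        gain = base - child
        if gain > min_gain and (best is None or gain > best[1]):
            best = (t, gain)
    return best
```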

Ch 4: Multi-step Optimization in Continuous Environments

1. In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
2. When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
3. If a sample causes a split in a leaf of an internal tree, the trees below that point must be discarded
1. Replay can be used once trees are discarded, but this is somewhat expensive.  It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)
4. <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
5. <In general, the way the material is presented respects the original algorithms this paper builds on too much.  It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over.  It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that.  It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>
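Sequence planning (as in STL/HOLOP above) clicked for me with a toy sketch: flatten the whole action sequence into one point in a box and optimize over that box. Everything here is my own hypothetical illustration, including the 1-D simulator and the use of uniform random search where HOSTL/HOLOP would use HOO over the same box.

```python
# Hedged sketch of sequence planning: treat an n-step action sequence as
# one point in an n-dimensional box, score it with a simulator rollout,
# and keep the best sequence found.
import random

def rollout_return(seq, start=0.0, target=3.0):
    """Toy 1-D simulator: each action in [-1, 1] moves the state."""
    s = start
    total = 0.0
    for a in seq:
        s += a
        total -= abs(s - target)  # reward: negative distance to target
    return total

def plan_sequence(horizon, n_samples=2000, seed=1):
    rng = random.Random(seed)
    best_seq, best_val = None, float("-inf")
    for _ in range(n_samples):
        seq = [rng.uniform(-1, 1) for _ in range(horizon)]
        v = rollout_return(seq)
        if v > best_val:
            best_seq, best_val = seq, v
    return best_seq, best_val
```

The drawback mentioned in the notes shows up directly: the plan is a fixed open-loop sequence, so there is no way to react to the state actually reached.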

Ch 5: Continuous Transposition Tree

1. Two methods for constructing transposition tables are introduced
2. In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
3. <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.”  Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
4. One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
5. In LTT a decision tree decomposes the state space.  From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
1. Still suffers from the need to discard action trees on splits in the state-tree leaves, but its easy to do replay in this setting
6. In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
7. When MTTs are used, from root to leaf, states are traversed according to the query state (naturally), while actions are followed according to whichever has the highest value under a particular computation; I think it is basically like UCB, which allows for exploration.
8. The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
9. <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>
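Since the thesis is light on diagrams, here is my mental model of an MTT node as code. Only the idea that a node covers a joint state x action box and a split refines either side comes from the thesis; the class name, the halving rule, and the fields are my own sketch.

```python
# Hedged sketch of a Mixed Transposition Tree node: each node covers a
# box in the joint state x action space, and splitting halves the box
# along either a state dimension or an action dimension. Descent follows
# the query state; on the action side a selection rule (UCB-like in the
# thesis) would pick among children.
class MTTNode:
    def __init__(self, state_box, action_box):
        self.state_box = state_box    # list of (lo, hi) per state dim
        self.action_box = action_box  # list of (lo, hi) per action dim
        self.children = None
        self.visits, self.value = 0, 0.0

    def split(self, kind, dim):
        """Split along state ('s') or action ('a') dimension dim."""
        box = self.state_box if kind == "s" else self.action_box
        lo, hi = box[dim]
        mid = (lo + hi) / 2
        kids = []
        for lo2, hi2 in ((lo, mid), (mid, hi)):
            sb, ab = list(self.state_box), list(self.action_box)
            (sb if kind == "s" else ab)[dim] = (lo2, hi2)
            kids.append(MTTNode(sb, ab))
        self.children = kids
        return kids

    def contains_state(self, s):
        return all(lo <= x <= hi
                   for x, (lo, hi) in zip(s, self.state_box))
```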

Ch 6: Experiments and Results

1. UCT here isn’t the normal version of UCT; it is somewhere between HOSTL and IRTI
2. A couple of the test domains are regular 1-step optimization problems.  Two more are multi step problems.  One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
3. For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but chooses best found), and a version of UCT that uses uniform discretization
4. In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near-optimal performance; the regret plots are most illustrative of performance.  IRTI is worse than HOO in both cases (in one case better than UCT by the end of the experiment and in one case worse, but as more time is given IRTI will beat UCT anyway)
5. Good parameterizations are found for each problem separately.
6. When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an nlogn version can be implemented, though>.  When time is the main constraint IRTI performs best, and vanilla-MC actually outperforms UCT
7. In multistep experiments, all four algorithms (2×2 grid plus random and vanilla mc) are examined
8. In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
9. In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
1. Here, RMTL (original tree-learning search) performs best by a wide margin
2. The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
3. This is under time constraints
10. <p. 35 They test MTT by itself, which I’m not exactly grokking because I thought it was a technique for allowing transposition tables to be used and not a policy, but I’m missing something>
11. Have very nice experiments of sample distributions on 1-step optimization problems
12. The HOO family of algorithms has by far the worst memory use.  The regression-tree algorithms use the least memory <I guess because they are constantly throwing trees out?  Even if they are doing replay, the trees are probably much more compact than the HOO versions because a split requires a test confirming statistical significance>
13. In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive.  It is because UCT ultimately ends up building a larger tree>

Ch 7: Conclusions

1. It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
2. HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL(TLS) is best when there are time constraints as it is very efficient because trees are built fairly compactly
3. While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that

Model-Based RL for Evolving Soccer Strategies. Wiering, Salustowicz, Schmidhuber

From  Computational Intelligence in Games, 2001

1. Paper studies model building. Claim in abstract is even that incomplete models can help find good policies
2. Approach is a combination of CMACs and prioritized sweeping
3. They didn’t use the official Robocup simulator because complexity made evaluation of approaches difficult, so they spun their own.
1. Their version is drastically simplified, which almost certainly makes model learning easy
4. This is full soccer so reward is 1 when a goal is scored
5. They claim that in continuous spaces, model learning is most successful when building local function approximators (but they don't explain that adequately here)
6. Looks like they build a different model for each tiling?
7. They have to do some weird hacks to get agents to be able to share data in a way that doesn't ruin the policies
1. They cite MBIE as something similar to what they aren’t doing, but the connection isn’t exactly right
8. They have to do random restarts on policies before they arrive at something that can be learned from
9. They compare Q(lambda) to PIPE, probabilistic incremental program evolution
10. It looks like they give PIPE 5x as much training data?
11. The “CMAC model” algorithm is best, and PIPE is worst, regular CMAC in the middle
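To remind myself how the CMAC part works, a tile-coding sketch; this is my own toy (tiling counts, offsets, and the 1-D input are mine), not the paper's code, and the paper layers its local models on top of something like this.

```python
# Hedged sketch of a CMAC / tile-coding approximator: several
# overlapping uniform grids ("tilings"), each offset slightly. An input
# activates one tile per tiling, and its value is the sum of those
# tiles' weights, so learning generalizes only locally.
class CMAC:
    def __init__(self, n_tilings=4, tiles_per_dim=8, lo=0.0, hi=1.0):
        self.n_tilings = n_tilings
        self.tiles = tiles_per_dim
        self.lo, self.width = lo, (hi - lo)
        self.weights = [[0.0] * (tiles_per_dim + 1)
                        for _ in range(n_tilings)]

    def _active(self, x):
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.tiles)  # shift each tiling
            idx = int((x - self.lo) / self.width * self.tiles + offset)
            yield t, min(idx, self.tiles)

    def predict(self, x):
        return sum(self.weights[t][i] for t, i in self._active(x))

    def update(self, x, target, alpha=0.1):
        err = target - self.predict(x)
        for t, i in self._active(x):
            self.weights[t][i] += alpha * err / self.n_tilings
```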

Model-based Reinforcement Learning in a Complex Domain. Kalyanakrishnan, Stone, Liu. Robocode symposium 2007.

1. Seems to be from RoboCup-2007: Robot Soccer World Cup XI, a symposium.
2. The approach seems to be a blend of generative model (they have a hand written model of the transition function) and model-learning (the terminal and reward functions):
1. “the next state s’ is computed by applying a simple rule to the current state s. The rule simply assumes the players do not change their positions between s and s’.”  This works because the behavior of the players was modified slightly from the original publication to move less.
2. “In our model, the termination and reward predictors are trained through supervised learning using the observed experiences.”
3. Action-value estimates seem to be estimated using CMACs:
1. “Q is now updated using transitions that are simulated using M (lines 18-36). This is accomplished by generating trajectories of depth depth using M, beginning with some random start state…”
2. The method is described as being similar to experience replay.
4. They try adding rollouts, but inaccuracies in the model degrade performance
5. They try the algorithm with totally myopic actions (hold unless hold is expected to be terminal, otherwise kick), and the performance is worse than the full learning approach but still not bad.  Indicates short rollouts probably work ok.
6. Also discuss relation to dyna-q

Animating the Dead: Computational Necromancy with Reinforcement Learning. Bill Smart, et al. 2008

1. Do manifold learning
2. Give up guarantees of global optimality
3. Results on swimmer are “easy” for the algorithm because its easy to find the manifold
1. Domains with cyclic actions tend to exist on a ring-type manifold/high-dimensional cylinder which connects back when the gait loops back to a certain point
4. Manifolds are best in domains with constraints, gives up performance on arbitrary MDPs
5. Works with reward function, not just path planning
6. Control works by fitting a locally quadratic value function and planning a path through state space that gives good reward
7. Robustness of controller depends on smoothness of domains
8. Needs to be “bootstrapped” to reasonably good policies to learn in some domains:
1. In swimmer domain things are smooth enough that it can start out doing random stuff and improve on that
2. In the walker domain it has to start with a policy that can at least stand upright

Learning Inadmissible Heuristics During Search. Jordan T. Thayer, Austin Dionne, Wheeler Ruml

1. Goal is to transform admissible heuristics to more effective (faster solve times with potentially suboptimal results) heuristics online.  This has been done before, but only offline
2. The method used to change heuristic is TD.  John mentioned this feels like dealing with the Bellman residual – first thing I thought of too.
3. Uses 4 features: the cost of arriving at node n from the root, an estimate of the cost-to-go from n to a terminal state, depth(n) (the number of transitions needed to get from the root to n), and d(n) (the estimated number of transitions needed to get from n to the goal)
4. Doesn’t seem to be a very significant performance difference between online and offline methods tested
5. They ran experiments on 8-tile puzzle.  Also, vacuum world, which is close to my heart.
6. Actual CPU time using the real heuristic is better than the learned one; after that, offline LMS(?) is best.  The comparisons are a bit apples and oranges, though, it seems
7. Extremely wide error bars, hard to draw conclusions from the empirical results
8. They give an example where the proposed approach fails because of the single step updates
9. Talk about search along multiple heuristics, and that this approach doesn’t function directly in that setting
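A sketch of what I think the single-step TD update on a learned heuristic correction looks like: keep the admissible h as a base, learn linear weights over search features, and nudge them toward the one-step lookahead value as nodes are expanded. The feature choice and constants here are mine, not the paper's exact formulation.

```python
# Hedged sketch of learning an inadmissible heuristic online. The
# corrected estimate is h plus a learned linear term over features,
# and the weights get single-step TD-style updates toward
# step_cost + estimate(successor).
class LearnedHeuristic:
    def __init__(self, n_features, alpha=0.05):
        self.w = [0.0] * n_features   # correction weights
        self.alpha = alpha

    def estimate(self, h, features):
        # base admissible h plus a learned (possibly inadmissible) term
        return h + sum(w * f for w, f in zip(self.w, features))

    def td_update(self, h, features, step_cost, h_next, features_next):
        target = step_cost + self.estimate(h_next, features_next)
        err = target - self.estimate(h, features)
        for i, f in enumerate(features):
            self.w[i] += self.alpha * err * f
```

This also makes the failure mode in point 8 concrete: a single bad one-step target keeps pulling the weights, which is the Bellman-residual flavor John mentioned.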

Trade-Offs in Sampling-Based Adversarial Planning. Raghuram Ramanujan, Bart Selman.

1. Interested in the cases where UCT does well and Alpha-Beta doesn’t, or vice versa
2. There is an optimal setting for exploration/exploitation in the search tree, where:
1. Exploration means searching the tree wide
2. Exploitation mean searching the tree deep
3. UCT will outperform minimax if running on a small fixed number of available samples.  Minimax will beat UCT if it has access to a good evaluation metric and UCT doesn't
4. They say that given a good evaluation metric, the standard UCT method for propagating information up the search tree can be improved via a minimax style rule
5. They talk about stopping the UCT rollout when it hits a new node, which I've not heard of before, but it's in line with what we've been talking about with FS^3
6. Minimax is good in games with many “trap states”, like chess (or Mancala, used in this paper).  Apparently, it's still better than UCT in chess.
1. From other reading, when doing minimax search in chess they can keep the number of expansions down to about 1.2 per level.  Of course, 1 expansion is the minimum possible, so it's quite close.
7. They run experiments on Mancala – since it's deterministic, they use randomized start states
8. In their experiments, UCT is allowed to expand at most the number of nodes alpha-beta would expand, given the same situation
9. It's funny they're using alpha-beta, since we know much better versions exist (AB-SSS*) and this is a planning conference
10. They talk about running UCT with the evaluation function available instead of random play outs, but my understanding was that's what the top Go algorithms use anyway
11. The constant attached to the bias parameter has a pretty significant impact in the experiments here
1. Means exploration/exploitation has to be closely balanced
12. Say that the conventional wisdom is that forward pruning is worse than doing wide searches, but the results in UCT (which prunes aggressively) contradicts this
13. If running random rollouts from the leaves to evaluate, it's better to only do a small number of them and spend more time on full UCT rollouts of the search tree
1. This is because if the UCT tree itself is expanded, the random evaluations get more accurate, because they start closer to the endgame
14. Talk about minimax backups vs averaging backups.  When random rollouts are used to evaluate (go) because of lack of a good evaluation function, averaging is better.  When there is a good evaluation function, minimax backups are better
1. Also depends on the number of nodes expanded
15. Remember – all these results are just on Mancala
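The backup-rule contrast in point 14 fits in a few lines (my own toy illustration, not the paper's code):

```python
# Hedged sketch contrasting the two backup rules discussed: averaging
# backs up the mean of child values (standard UCT), while a
# minimax-style backup takes the max at our nodes and min at the
# opponent's, propagating the decisive line instead of diluting it.
def backup_average(child_values):
    return sum(child_values) / len(child_values)

def backup_minimax(child_values, our_turn):
    return max(child_values) if our_turn else min(child_values)
```

For example, with child evaluations [0.9, -0.8, -0.7] at our node, averaging says the position is slightly bad (-0.2) while the minimax backup says it is winning (0.9), since we get to pick the good move; with noisy random-rollout evaluations, though, trusting that single max is exactly what goes wrong.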

Ensemble Monte-Carlo Planning: An Empirical Study. Alan Fern, Paul Lewis.

1. There are already some papers on this topic, I am familiar with one from the games group at Maastricht?
2. About parallelizing MCTS
3. The extension here seems to be on more domains
4. Say the 3 main findings are that parallelization helps when there are time constraints, that single-core ensembles perform better than vanilla given a fixed amount of memory, and that in no case does ensemble learning seem to produce poorer results
5. Root parallelization where all agents run entirely separately and only combine results for the root is the simplest method and requires almost no synchronization
6. In the method called “Ensemble UCT”, many small trees are searched out on one process, and then the action taken is based on a weighted vote of those trees
7. Say this is essentially equivalent to applying bagging to RL.  Seems correct.
1. When the predictor has high variance, bagging can significantly help results
2. UCT can also have high variance, I’ve read this is true of many rollout based algorithms
8. Root parallelization always seemed to be about the best
9. Mention bias/variance tradeoff
10. Mention UCT takes too long to converge to values in domains that are highly stochastic, so they used a modified version of it for some of the domains
11. In some cases, the number of trajectories run was pretty small (one domain 128)
12. I don’t like their naming conventions
13. Shows some cases where root parallelization doesn’t help when the total sum of trajectories is held constant
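Root parallelization as I understand it, sketched below; merging by visit-weighted mean is one obvious choice for combining root statistics, not necessarily exactly what the paper does (weighted voting over trees is also mentioned).

```python
# Hedged sketch of root parallelization: each ensemble member builds its
# own tree independently; only the root statistics are merged, and the
# action with the best merged mean value is chosen.
def merge_roots(roots):
    """roots: list of dicts mapping action -> (visits, total_value)."""
    merged = {}
    for root in roots:
        for a, (n, v) in root.items():
            n0, v0 = merged.get(a, (0, 0.0))
            merged[a] = (n0 + n, v0 + v)
    # pick the action with the best merged mean value
    return max(merged, key=lambda a: merged[a][1] / merged[a][0])
```

The bagging analogy in point 7 is visible here: each tree is a high-variance estimator and the merge averages the variance down, with almost no synchronization needed.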

Gaussian Process Bandits: An Experimental Design Approach. Srinivas, Krause, Kakade, Seeger

• A short (4 page) paper
• Analysis of upper confidence bound on Gaussian process optimization
• Smoothness is encoded by the covariance function
• The Gaussian Bandit UCB requires a discrete set of test points, which seems strange
• There are some notation abuses that are difficult to understand, for example their definition of a no-regret algorithm and the statement that UCB is a no regret algorithm
• The regret bound in discrete UCB is O(sqrt(KT)).  For the infinite-arm case, K is replaced by the bound on the maximum possible information gain due to sampling
• They say this connects GP optimization w/ optimal experimental design; should read more about this
• Here there is noise on the reward observations, it is assumed to be Gaussian
• Although it is a no-regret algorithm, it looks like there is a probability of failure delta
• Information gain is submodular – more information is gained when the total number of points sampled is low (although there have to be cases where this isn’t true)
• There is a different regret bound here that also has sqrtish flavor
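The selection rule itself (pick the point maximizing posterior mean plus a multiple of posterior stddev, over a discrete candidate set, as in the paper's setting) is simple once the GP posterior is in hand. Here is a from-scratch sketch; the kernel, lengthscale, beta, and noise values are my own toy settings, and the naive linear solves are only sensible for tiny examples.

```python
# Hedged sketch of the GP-UCB acquisition rule over discrete candidates.
import math

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel; ls controls assumed smoothness."""
    return math.exp(-(a - b) ** 2 / (2 * ls ** 2))

def solve(A, b):
    """Gauss-Jordan solve of A x = b, fine for the tiny systems here."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_ucb_pick(X, y, candidates, beta=2.0, noise=1e-4):
    """Pick the candidate maximizing posterior mean + beta * stddev."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)
    best, best_score = None, float("-inf")
    for x in candidates:
        k = [rbf(x, xi) for xi in X]
        mu = sum(a * ki for a, ki in zip(alpha, k))       # posterior mean
        v = solve(K, k)
        var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
        score = mu + beta * math.sqrt(var)                # UCB score
        if score > best_score:
            best, best_score = x, score
    return best
```

With beta > 0 the rule favors points that are either promising or uncertain, which is where the exploration comes from; with beta = 0 it degenerates to pure exploitation of the posterior mean.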

Regret Bounds for Gaussian Process Bandit Problems. Grunewalder, Audibert, Opper, Shawe-Taylor

• Results are for the case where there is no noise
• Gives a regret bound for GPs
• Lots of math and I’m not really reading this carefully
• The bounds also assume that the function being maximized is a GP, and that the prior of the GP (mean and covariance) is known
• Other global optimization algorithms don’t have these constraints
• More math I’m skipping
• Regret bound presented is related to sqrt(T), but is a bit more complex:
• sqrt(T)(sqrt(a_T+2log2)-sqrt(a_T))
• This bound is a log-factor off optimal, which I think HOO is as well given its assumptions
• Discuss extending results to domains where rewards are stochastic