**Abstract**

- Main goal of the work is to “interchange” the discretization components of:
- Tree Learning Search, which discretizes according to a tree of trees where each tree further decomposes the action space
- Underlying component is Incremental Regression Tree Induction (IRTI)

- HOLOP, where the tree decomposes a space that corresponds to a sequence of actions
- Underlying component is HOO

- TLS has the issue of throwing away data when new information arrives and trees must be discarded
- Also include the idea of transposition tables
- Work examines behavior, computation time, computational complexity, and memory of both algs
- Some new algorithms are also introduced that extend these and perform better in certain situations

cool.

**Ch 1: Introduction**

- In terms of doing the actual action planning, talks of two possible options:
- Meta Tree Learning (MTL) is the more traditional approach to MCTS, called meta tree because it constructs a tree of trees
- Sequence Tree Learning (STL) explodes the sequence into a large space where each dimension corresponds to one step in the sequence (what HOLOP does)

- IRTI and HOO can be combined with either MTL or STL to get 4 planning algorithms (originally TLS coupled IRTI with MTL, and HOLOP coupled HOO with STL)
- IRTI x MTL = Regression-based Meta Tree Learning (RMTL), very similar to TLS
- STL x HOO = Hierarchical Optimistic Sequence Tree Learning (HOSTL), very similar to HOLOP
- IRTI x STL = Regression-based Sequence Tree Learning (RSTL), introduced in this work
- MTL x HOO = Hierarchical Optimistic Meta Tree Learning (HOMTL), also introduced in this work

- The novelty of using transposition tables here is due to the fact that the spaces considered here are continuous, not discrete as is the case with most transposition tables. Two methods are proposed to do this.

**The Basic Research Questions for the Thesis are:**

- Can the IRTI component of TLS be replaced by the HOO component of HOLOP
- Can the HOO component of HOLOP be replaced by the IRTI component of TLS
- How can the information retrieved from simulations be reused for TLS
- Can a transposition tree increase the performance of simulation based systems in continuous environments
- Which combination of techniques is best
- What is the memory cost of these algorithms
- What is the computational time complexity of the algorithms

**Ch 2: Preliminary Knowledge**

- Discusses use of Zobrist hashing for high-performance transposition tables. I’ve never heard of it.
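
For reference, a minimal sketch of Zobrist hashing as it is usually described for discrete board games; it is not taken from the thesis. The names `make_zobrist_table`, `zobrist_hash`, and `update_hash`, and the board encoding (a list holding a piece index or `None` per square), are assumptions made for illustration.

```python
import random


def make_zobrist_table(num_squares, num_piece_types, seed=0):
    """One random 64-bit key per (square, piece type) pair."""
    rng = random.Random(seed)
    return [[rng.getrandbits(64) for _ in range(num_piece_types)]
            for _ in range(num_squares)]


def zobrist_hash(board, table):
    """Hash a full board: XOR together the keys of all occupied squares.

    `board[s]` is a piece index, or None for an empty square (assumed encoding).
    """
    h = 0
    for square, piece in enumerate(board):
        if piece is not None:
            h ^= table[square][piece]
    return h


def update_hash(h, table, square, old_piece, new_piece):
    """Incrementally update the hash when one square changes.

    XOR is its own inverse, so XOR-ing the old key removes it and
    XOR-ing the new key adds it, without rehashing the whole board.
    """
    if old_piece is not None:
        h ^= table[square][old_piece]
    if new_piece is not None:
        h ^= table[square][new_piece]
    return h
```

The incremental update is what makes the scheme fast: a move costs a couple of XORs. It also shows why the idea does not carry over directly to continuous spaces, where there is no finite set of (square, piece) pairs to assign keys to.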

**Ch 3: Automatic Decomposition**

- IRTI checks all possible tests (splits) in a leaf node, and takes the one with the highest information gain. If that split yields two different leaves that are statistically significantly different (F-test), the split is made
- <p. 14 the rule used by HOO to calculate b-scores is very reminiscent of UCB as well>
- <p. 14 in an extended version of the HOO paper on arXiv, a version of the algorithm is presented where n_0 doesn’t have to be recalculated at each step, if the total number of pulls is known before sampling starts. This can make the algorithm *much* more efficient (log n per step instead of n)>
- Discussion of scaling of the exploration factor (commonly called *C*)
- They propose scaling it according to the range of values seen from each node, then modifying it by a factor which is called *k*
- <p. 16, in practice, I definitely buy that this helps, but in certain situations this rule will cause problems (when all rewards observed in a node are the same, there will be no bias, for example)>
- But they also mention that if vmin, vmax are known for the problem they can be used, so that’s all fine. If you don’t know them you need to resort to something like the above
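
A small sketch of the range-scaled exploration factor discussed above, written as a UCB1-style score. The exact formula in the thesis may differ; the function name `ucb_score`, its signature, and the choice `C = k * (v_max - v_min)` are assumptions based only on these notes.

```python
import math


def ucb_score(mean, n_child, n_parent, v_min, v_max, k=1.0):
    """UCB1-style score with the exploration factor scaled by the reward range.

    v_min / v_max are either the known bounds of the problem or the smallest
    and largest rewards observed from this node (assumed interpretation).
    """
    if n_child == 0:
        return math.inf  # unvisited children are tried first
    c = k * (v_max - v_min)  # hypothetical form of the scaled factor C
    return mean + c * math.sqrt(math.log(n_parent) / n_child)
```

With observed bounds, `ucb_score(0.5, 3, 10, 0.5, 0.5)` has a zero exploration term, which is the degenerate case noted above: if every reward seen in a node is identical, nothing pushes the search toward less-visited children.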

- Discuss ways of conserving memory, such as not allowing leaves to split that have too few samples. Capping tree depth isn’t mentioned, but is also a reasonable option
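
The IRTI split rule and the minimum-sample guard can be sketched roughly as follows for a one-dimensional input. This is an illustration under stated assumptions, not the thesis's implementation: candidates are ranked by the two-group F statistic itself (for a fixed sample set this gives the same ordering as variance reduction, which may or may not match the thesis's "information gain"), and `f_critical=4.0` is an arbitrary placeholder where a real implementation would look up the F-distribution quantile for (1, n-2) degrees of freedom. The names `f_statistic`, `best_split`, and `min_leaf` are hypothetical.

```python
def f_statistic(left, right):
    """One-way ANOVA F statistic for two groups of reward samples."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    mean_l = sum(left) / n_l
    mean_r = sum(right) / n_r
    grand = (sum(left) + sum(right)) / n
    # Between-group sum of squares (1 degree of freedom for two groups).
    between = n_l * (mean_l - grand) ** 2 + n_r * (mean_r - grand) ** 2
    # Within-group sum of squares (n - 2 degrees of freedom).
    within = (sum((y - mean_l) ** 2 for y in left)
              + sum((y - mean_r) ** 2 for y in right))
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return between / (within / (n - 2))


def best_split(samples, min_leaf=5, f_critical=4.0):
    """Return a split threshold for a leaf, or None if no split is justified.

    samples: list of (x, reward) pairs with scalar x.
    min_leaf: each side of a split must keep at least this many samples,
              which also caps memory by refusing to split sparse leaves.
    f_critical: placeholder significance threshold (assumption, see lead-in).
    """
    samples = sorted(samples)
    best = None  # (F value, threshold)
    for i in range(min_leaf, len(samples) - min_leaf + 1):
        if samples[i - 1][0] == samples[i][0]:
            continue  # cannot place a threshold between equal x values
        left = [r for _, r in samples[:i]]
        right = [r for _, r in samples[i:]]
        f = f_statistic(left, right)
        if best is None or f > best[0]:
            best = (f, (samples[i - 1][0] + samples[i][0]) / 2)
    if best is not None and best[0] > f_critical:
        return best[1]
    return None
```

Because a split only happens when the two sides differ significantly, flat regions of the reward function stay as single leaves, which is consistent with the compact trees reported later in the experiments.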

**Ch 4: Multi-step Optimization in Continuous Environments**

- In normal MCTS, an edge represents 1 action, and a node represents 1 state. In Regression-based Meta Tree Learning, an edge represents a range of actions, and a node represents a range of states
- When selecting an action, UCT rules are used (basically just ignore that edges and nodes represent a range and use the samples)
- If a sample causes a split in a leaf of an internal tree (one with further trees below it), the trees below that point must be discarded
- Replay can be used once trees are discarded, but this is somewhat expensive. It is cheaper to do sequence planning and pass values into the children on splits (sequence planning, though, has the drawback that it can’t form closed-loop plans)

- <p. 21 It isn’t the case that the given splitting rule for HOLOP/HOSTL that prioritizes early actions has to be used, but it is one method that intuitively makes sense, and also doesn’t ruin the original proofs of HOO>
- <In general, the way the material is presented respects the original algorithms this paper builds on too much. It basically talks about TLS and HOLOP (which are both unique in both aspects of how to do decomposition as well as plan over sequences) and then crosses them over. It would be easier for most readers not already familiar with the history of previous publications to present for example the Meta Tree Learning algorithms and then the Sequence Tree Learning algorithms, or something like that. It starts with the corners of a 2×2 table describing the attributes of the algorithms instead of starting with a row or column>
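
To make the sequence-space idea concrete, here is a sketch of one possible splitting rule that favors early actions. It is a hypothetical rule of that kind, not necessarily the one in the thesis or in HOLOP: each dimension's width is weighted by `gamma ** step`, so earlier steps are refined first. The names `choose_split_dimension` and `split_cell`, the flat step-by-step layout of dimensions, and the default `gamma` are assumptions.

```python
def choose_split_dimension(lower, upper, action_dim, gamma=0.95):
    """Pick which dimension of a sequence hyper-rectangle to split.

    The cell covers a whole action sequence; dimension d belongs to
    time step d // action_dim (assumed layout). Width is discounted by
    gamma ** step, so early actions are split before later ones.
    """
    best_dim, best_score = None, -1.0
    for d, (lo, hi) in enumerate(zip(lower, upper)):
        step = d // action_dim
        score = (gamma ** step) * (hi - lo)
        if score > best_score:
            best_dim, best_score = d, score
    return best_dim


def split_cell(lower, upper, dim):
    """Halve a cell along `dim`, returning two (lower, upper) pairs."""
    mid = (lower[dim] + upper[dim]) / 2
    left_upper = list(upper)
    left_upper[dim] = mid
    right_lower = list(lower)
    right_lower[dim] = mid
    return (list(lower), left_upper), (right_lower, list(upper))
```

For a three-step plan with one action dimension per step, the first split lands on step 0; once that dimension has been halved, the discounted width of step 1 becomes the largest and it is split next.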

**Ch 5: Continuous Transposition Tree**

- Two methods for constructing transposition tables are introduced
- In Level-based Transposition Trees, there is a clear distinction between the discretizations over the state and action spaces
- <p. 24 “This process [of descending a decision tree over the state space] is expected to be less computationally expensive because it does not require any complex computations like the UCT formula.” Is there really any significant difference between the costs of both aside from constant level operations which are very cheap anyway?>
- One important feature of using the transposition table is that it allows planning to be stored – otherwise planning always has to be redone from every current state (of course it can also make planning cheaper further down the tree)
- In LTT a decision tree decomposes the state space. From each leaf in that tree over the state space, another decision tree is rooted that decomposes the action space
- Still suffers from the need to discard action trees on splits in the state-tree leaves, but it’s easy to do replay in this setting

- In a Mixed Transposition Tree (MTT), the tree is built both in terms of states and actions (as opposed to states and then actions, as in LTT); a node represents both spaces and an edge from parent to child represents a split in either the state or action dimension
- When MTTs are used, state splits are traversed from root to leaf according to the query state (naturally), while action splits are followed toward the child with the highest value under a particular computation. I think it is basically like UCB, which allows for exploration.
- The advantage MTT has over LTT is that trees do not have to be discarded and rebuilt
- <In general, this thesis has a lot of verbiage, but is short on diagrams or other items that would express points succinctly, sometimes the material is presented in a way that is a bit vague and the only way to resolve the issue is to go back to pseudocode and grok it>
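
As a rough picture of the MTT descent described above, the following sketch follows the query state at state splits and a UCB-like rule at action splits. It reflects only my reading of these notes, not the thesis's pseudocode; the `Node` layout, the `ucb` formula, and the `descend` function are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str = "leaf"        # "leaf", "state" (split on state), or "action"
    dim: int = 0              # dimension being split
    threshold: float = 0.0    # split point along that dimension
    low: Optional["Node"] = None
    high: Optional["Node"] = None
    visits: int = 0
    mean: float = 0.0


def ucb(child, parent_visits, c=1.0):
    """UCB-like value of a child (assumed form of the selection rule)."""
    if child.visits == 0:
        return math.inf
    return child.mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def descend(root, state, c=1.0):
    """Walk from root to a leaf of a mixed state/action tree.

    State splits are resolved by the query state; action splits are
    resolved by the UCB-like rule. Returns the leaf and the action
    bounds (per action dimension) implied by the path taken.
    """
    node, bounds = root, {}
    while node.kind != "leaf":
        if node.kind == "state":
            child = node.low if state[node.dim] < node.threshold else node.high
        else:
            parent_visits = max(node.visits, 1)
            child = max((node.low, node.high),
                        key=lambda ch: ucb(ch, parent_visits, c))
            lo, hi = bounds.get(node.dim, (-math.inf, math.inf))
            if child is node.low:
                bounds[node.dim] = (lo, node.threshold)
            else:
                bounds[node.dim] = (node.threshold, hi)
        node = child
    return node, bounds
```

Since a state split and an action split are just two kinds of internal node in the same tree, a new split anywhere only replaces one leaf, which is why nothing needs to be discarded and rebuilt as in LTT.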

**Ch 6: Experiments and Results**

- UCT here isn’t the normal version of UCT; it sits somewhere between HOSTL and IRTI
- A couple of the test domains are regular 1-step optimization problems. Two more are multi step problems. One is navigating in a circle (agent chooses angle to move in), and the other is cart-pole
- For the 1-step optimization problems, comparisons are also made with a random agent, a vanilla MC agent (random but chooses the best found), and a version of UCT that uses uniform discretization
- In 1-step optimization, UCT learns a good policy quickly, but HOO eventually outperforms it with near-optimal performance; the regret plots are most illustrative of performance. IRTI is worse than HOO in both cases (better than UCT by the end of the experiment in one case and worse in the other, but given more time IRTI will beat UCT anyway)
- Good parameterizations are found for each problem separately.
- When constraining the algorithms on time instead of samples, HOO performs worst (due to polynomial time complexity) <as mentioned, an n log n version can be implemented, though>. When time is the main constraint IRTI performs best, and vanilla MC actually outperforms UCT
- In multistep experiments, all four algorithms of the 2×2 grid, plus random and vanilla MC, are examined
- In donut world, when samples are constrained, HOSTL (HOLOP) performs best, IRTI x STL is 2nd best
- In stark contrast, in the cart-pole domain, HOSTL is worst besides random (even vanilla MC is better).
- Here, RMTL (original tree-learning search) performs best by a wide margin
- The domain used here is more challenging than the one I’ve used as it doesn’t allow for much deviation of the pole
- This is under time constraints
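
Since the regret plots carry most of the comparison in the 1-step problems, here is a minimal sketch of how cumulative regret is typically computed from a run. The function name `cumulative_regret` and the assumption that the optimal value is known for the test function are mine, not details from the thesis.

```python
def cumulative_regret(rewards, optimal_value):
    """Running sum of (optimal value - obtained reward) over a run.

    rewards: rewards obtained at each sample, in order.
    optimal_value: best achievable expected reward (assumed known).
    """
    total, curve = 0.0, []
    for r in rewards:
        total += optimal_value - r
        curve.append(total)
    return curve
```

On such a curve, an algorithm that learns quickly but plateaus at a suboptimal arm keeps a constant positive slope, while one that converges to near-optimal play flattens out. That is the pattern described above for UCT versus HOO.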

- <p. 35 They test MTT by itself, which I’m not exactly grokking because I thought it was a technique for allowing transposition tables to be used and not a policy, but I’m missing something>
- Have very nice experiments of sample distributions on 1-step optimization problems
- The HOO-family of algorithms have by far the worst memory use. The regression-tree algorithms use the least memory <I guess because they are constantly throwing trees out? Even if they are doing replay, the trees are probably much more compact than the HOO versions because splitting requires a test confirming statistical significance>
- In terms of measured computation time, not surprisingly HOO is slowest, but what is interesting is that IRTI is faster than UCT <Impressive. It is because UCT ultimately ends up building a larger tree>

**Ch 7: Experiments and Results**

- It is actually unclear whether transposition tables in the manner used are helpful (sometimes they help and sometimes they do not)
- HOSTL (HOLOP) is best when the domain isn’t time constrained, but RMTL (TLS) is best under time constraints; it is very efficient because its trees are built fairly compactly
- While HOO-based algorithms used the most memory, there were no issues of exhausting memory during experiments or anything like that