So now we have at least two good bandit algorithms that work over continuous spaces. Ali’s recommendation was to leverage one of these algorithms to build a UCT-like algorithm: we still do tree-based rollouts, but instead of using a UCB-like method for choosing actions, we use a HOO-like (or geometric) method. In this framework, each state has its own bandit problem, and the value of an action taken at that state is the “reward” the bandit algorithm receives.
I had already implemented UCT once before, and had presented the paper twice, so I did one quick re-read and then got to coding. UCT is very simple: it places a UCB algorithm at each <state, depth> node (the depth is necessary if we assume finite-length rollouts). At each step of a simulated rollout, UCT asks the UCB at the current <state, depth> node which action to take, follows that action, and continues the rollout. The observed discounted return of that action at that <state, depth> is then rolled back into that bandit algorithm. In this experiment I did the same thing, except I put HOO into each of the <state, depth> nodes and used it to choose actions. Since in this setting we are assuming discrete states, we can reuse the HOO estimators. There are a number of parameters here: the ρ and v_1 of HOO, as well as the number of rollouts to do from the start state, and the maximum depth of the rollouts.
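A minimal sketch of what this looks like in code. All names here are my own, and the HOO below is deliberately simplified (binary interval splits over actions in [0, 1], midpoint play), not a faithful reimplementation of the paper's algorithm; `ToyChain` is just a stand-in domain so the sketch runs end to end.

```python
import math

class HOO:
    """Simplified 1-D HOO over actions in [0, 1], binary interval splits."""
    def __init__(self, rho=0.5, v1=100.0):
        self.rho, self.v1 = rho, v1
        # node layout: [lo, hi, depth, count, mean, left, right]
        self.root = [0.0, 1.0, 0, 0, 0.0, None, None]
        self.n = 0  # total pulls of this bandit

    def _b_value(self, node):
        lo, hi, depth, count, mean, left, right = node
        if count == 0:
            return float('inf')  # unvisited regions are maximally optimistic
        u = mean + math.sqrt(2 * math.log(max(self.n, 1)) / count) \
            + self.v1 * self.rho ** depth
        kids = [self._b_value(c) for c in (left, right) if c is not None]
        return min(u, max(kids)) if kids else u

    def select(self):
        """Walk down the max-B path, expand the leaf reached, play its midpoint."""
        path, node = [], self.root
        while True:
            path.append(node)
            lo, hi, depth, _, _, left, right = node
            if left is None:  # leaf: split it and play the midpoint of its region
                node[5] = [lo, (lo + hi) / 2, depth + 1, 0, 0.0, None, None]
                node[6] = [(lo + hi) / 2, hi, depth + 1, 0, 0.0, None, None]
                return (lo + hi) / 2, path
            node = left if self._b_value(left) >= self._b_value(right) else right

    def update(self, path, reward):
        """Roll the observed return back into every node on the selected path."""
        self.n += 1
        for node in path:
            node[3] += 1
            node[4] += (reward - node[4]) / node[3]

def hoo_rollout(env, state, depth, max_depth, bandits, gamma=0.95):
    """One UCT-style rollout with a HOO bandit at each (state, depth) node."""
    if depth == max_depth or env.is_terminal(state):
        return 0.0
    key = (state, depth)
    if key not in bandits:
        bandits[key] = HOO()
    action, path = bandits[key].select()
    next_state, reward = env.step(state, action)  # action normalized to [0, 1]
    ret = reward + gamma * hoo_rollout(env, next_state, depth + 1,
                                       max_depth, bandits, gamma)
    bandits[key].update(path, ret)  # discounted return is the bandit's reward
    return ret

class ToyChain:
    """Hypothetical stand-in domain: walk right along a line, goal at 3."""
    def is_terminal(self, s):
        return s >= 3
    def step(self, s, a):
        s2 = s + 1 if a > 0.5 else s
        return s2, (1.0 if s2 >= 3 else -0.1)
```

In the gridworld experiment, the environment's `step` would map the [0, 1] action back to an angle in [0, 360) and sample the resulting square.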
I implemented a domain where this algorithm could be used – I think this was actually Ali’s idea as well. Basically, it’s a king’s-move gridworld with a discrete state set and a continuous action that can range from 0–360 (degrees).
The way it works is that an angle “points to” at most two possible squares (for example, action=20 points to either the square immediately to the right or to right+up, while action=45 points only to right+up). The closer the action is to one square versus the other, the more likely it is to end up in the square it points more toward. For example, 22.5 is equally likely to go right or right+up, but action 5 will go right 89% of the time and right+up 11% of the time; it will go nowhere else.
Any action that moves only up/down/left/right incurs a penalty of -1.0 and any diagonal action incurs a penalty of -1.5, so optimally, diagonal moves should be used only when the goal is diagonal from the agent, but should still be used in those cases. Reaching the goal grants a reward of +1.0.
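To make the transition rule concrete, here is how I'd compute the two candidate squares and their probabilities from an angle (names are my own; the rule is linear interpolation between the two bracketing multiples of 45°, which reproduces the 89%/11% split for action 5 described above):

```python
import math

# The eight king's moves, keyed by their angle in degrees.
DIRS = {0: (1, 0), 45: (1, 1), 90: (0, 1), 135: (-1, 1),
        180: (-1, 0), 225: (-1, -1), 270: (0, -1), 315: (1, -1)}

def transition_probs(angle):
    """Return [((dx, dy), prob), ...] for a continuous action in degrees."""
    angle %= 360.0
    lo = 45.0 * math.floor(angle / 45.0)  # bracketing multiples of 45 degrees
    hi = (lo + 45.0) % 360.0
    w = (angle - lo) / 45.0               # 0 -> all mass on lo, 1 -> all on hi
    lo_dir, hi_dir = DIRS[int(lo)], DIRS[int(hi)]
    if w == 0.0:
        return [(lo_dir, 1.0)]            # exact multiple of 45: one square only
    return [(lo_dir, 1.0 - w), (hi_dir, w)]

def step_cost(dx, dy):
    """-1.5 for diagonal moves, -1.0 for cardinal moves."""
    return -1.5 if dx != 0 and dy != 0 else -1.0
```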
Based on all this, I ran an experiment using the same HOO parameter settings as before (ρ=0.5, v_1=100) in a continuous-action gridworld that was 15×15, with the goal location at (12, 12) and the start location at (0, 0). I allowed only one rollout from each state, with a maximum depth of 50, to better demonstrate the rate of learning (since allowing more rollouts is analogous to allowing more experience per action). The result is in the image below:
Since the optimal policy is simple (take action 45 twelve times), it is easy to compute the optimal accumulated reward, which comes out to 1 - 1.5*12 = -17. The best observed accumulated reward was at trajectory 96, with an accumulated reward of -20. I found this pretty impressive, since there are 15*15*50 = 11,250 bandit estimators.
I think going forward, we should try and think about what the theoretical bounds of HOO mean in terms of theoretical bounds for leveraging HOO by this method. Compared to other continuous-action approaches I’ve seen, this is much simpler, and it may come out with very good bounds. Another nice property is that HOO is already set up to work in multidimensional settings, so we could do more than just 1D actions (for example, angle and distance, or something). I still don’t see any way to use this in the continuous-state setting as well, unless you throw out all the data generated from a rollout as soon as it’s used, which means it’s not really learning at all, just planning.
The last relevant issue I can think of right now is whether it’s possible to utilize the geometric approach here. It has beaten HOO in past experiments, but in those experiments the Lipschitz constant was given, whereas here we don’t know what to use. In deterministic settings it is probably easy enough to learn the Lipschitz constant, but in the stochastic setting it will probably be much more difficult (and may not be practical at all). Maybe we can just arbitrarily pick a number and go with it, as is often done with parameter settings in algorithms.
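For the deterministic case, the “learn the constant” idea could be as simple as tracking the largest observed slope between sampled points, which gives a lower bound on the true constant (this is my own sketch, not something from the geometric-approach paper):

```python
def empirical_lipschitz(samples):
    """Lower-bound estimate of a Lipschitz constant from noiseless
    (x, f(x)) samples: the maximum observed slope over all pairs."""
    best = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            (x1, y1), (x2, y2) = samples[i], samples[j]
            if x1 != x2:
                best = max(best, abs(y1 - y2) / abs(x1 - x2))
    return best
```

Under noise this estimator breaks down (observed slopes between nearby points blow up), which is exactly why the stochastic setting looks much harder.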