In my qual, I discussed an approach to RL focused on developing a system that could be deployed successfully on robotic platforms. The system presented there seemed quite effective and efficient in the experiments I ran, but one piece didn't work.
The algorithm, which I called "TCP" (Tree-based Cached Planning), is designed to answer policy queries very quickly even when planning takes much longer. The idea is to decompose the state space with a tree and to store, at each leaf, a recommended action for the region that leaf represents. The first method I tried (splitting in the same manner used by MRE/KD-Trees) was successful when a generative model was already available, and it also worked reasonably well with epsilon-greedy exploration. When using MRE, however, the results tended to be unstable, varying dramatically between pretty good and terrible behavior.
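To make the caching idea concrete, here is a minimal sketch of what a tree-based policy cache could look like. This is an illustration under my own assumptions, not the actual implementation: the node fields and the `query_policy` helper are hypothetical names, and the tree splits one state dimension per internal node, KD-tree style, with each leaf caching the planner's recommended action for its region.

```python
# Hypothetical sketch of a tree-based policy cache (TCP-style).
# Internal nodes split the state space along one dimension; leaves
# cache the action the planner recommended for their region.

class Node:
    def __init__(self, action=None):
        self.dim = None        # split dimension (internal nodes only)
        self.threshold = None  # split value along that dimension
        self.left = None       # subtree for states with s[dim] <  threshold
        self.right = None      # subtree for states with s[dim] >= threshold
        self.action = action   # cached policy action (leaves only)

    def is_leaf(self):
        return self.left is None

def query_policy(root, state):
    """Return the cached action for `state` in O(tree depth) time,
    so the agent can act immediately while planning runs in the background."""
    node = root
    while not node.is_leaf():
        node = node.left if state[node.dim] < node.threshold else node.right
    return node.action
```

The point of the structure is that a policy query is just a short tree descent, which is why the agent can react quickly even when refreshing the cached actions via planning is slow.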
After thinking about the issue for a while, I came up with a possible explanation for why the previous method was unsuccessful. In KD-Trees we split a region when we believe we have enough samples there that we can partition it into smaller volumes and still have enough samples in each to make an accurate decision about the function being approximated.
Using this method of splitting to represent the policy makes less sense, because the policy is computed by performing rollouts using the estimated transition and reward functions (with Smaxes thrown in). We should therefore split a cell of the policy tree only when we have enough information to make an accurate decision about what the policy looks like there. That is, we need precision for a whole sequence of T(s,a) and R(s,a) estimates, as opposed to just one.
Based on this idea, the algorithm changes slightly. The policy tree is informed of the fraction of times Smax was encountered during planning (i.e., during rollouts) from a particular state. If that fraction drops below a predetermined value, we decide that we know the region well enough to partition it for more refined decisions, and we make a cut there.
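The splitting rule above can be sketched as a small amount of per-leaf bookkeeping. This is a hypothetical sketch, not the real code: the `LeafStats` class and its method names are my own, and it assumes each leaf simply counts Smax encounters against total state visits during rollouts.

```python
# Hypothetical sketch of the Smax-based splitting rule: each leaf tracks
# how often rollouts from its region hit the optimistic Smax state, and
# the leaf is only refined once that fraction falls below a threshold,
# i.e. once the model in that region is known well enough.

SPLIT_THRESHOLD = 0.05  # cut when Smax visits are below 5% of all visits

class LeafStats:
    def __init__(self):
        self.total_visits = 0
        self.smax_visits = 0

    def record(self, visited_smax):
        """Called once per state visit during rollouts from this leaf."""
        self.total_visits += 1
        self.smax_visits += int(visited_smax)

    def should_split(self):
        """Split only when we rarely fall back on the optimistic Smax,
        meaning the estimated T and R along rollouts are trustworthy."""
        if self.total_visits == 0:
            return False
        return self.smax_visits / self.total_visits < SPLIT_THRESHOLD
```

The design choice this captures is that a cut is a claim of confidence about a whole string of model estimates, so the trigger is model certainty along rollouts rather than raw sample counts in the region.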
Here’s a video of the algorithm running in the Double Integrator domain. The cutting threshold was set at 5%, so a cut is made whenever visits to Smax account for less than 5% of all state visits during planning.
Also, a graph of the performance:
The performance here is not terribly far from the performance without cached planning. With TCP the average cumulative reward seems to converge to around -4.0. If we replan at every step with the same number of trajectories per planning step (100), performance is around -2.8, so there is a performance hit, as we would expect, but I think it is within reason. Although it is not quite a fair comparison, replanning at every step converges after about 10 trials, whereas here convergence seems to happen after about 75 trials.
I still have to test this out in larger domains, but if it holds up (not a trivial thing to hope for) we can move on to trying it on a robot.