Big-ol’-Bandit and RMax in Double Integrator

Since Big-ol’-Bandit (BoB) was implemented, and seemed to work nicely, there were a few more issues to take care of: learning and performance.

BoB is only a planning algorithm, so in order to use it in domains where the dynamics are unknown, some method of exploration and model building must be used.  I decided on using Multi-Resulution Exploration (MRE) in order to drive exploration, and used the same tree MRE built for deciding knownness in order to build an estimate of the dynamics.

In terms of the dynamics estimation, the method seemed to work best when estimating deltas in states, and using linear regression in the leaves (as opposed to just averaging).

If the ultimate goal is for use in real time robotics control, major work had to be done to speed up planning.  I did this by parallelizing the planning and then aggregating the results to make a decision.  How this works is there is a server that farms out works to clients.  What the server tells the client is what state the domain is in now, and what, if any updates need to be made to the client’s model (if new information came in since last time the client reported).  The client then takes that information, and does a series of BoB rollouts and reports back with the actions taken and the rewards.  The server then tells the client either to do this again, or to update its model and plan from a new state.  A new state is encountered once enough clients have reported back the server takes that data, builds itself a HOO tree from the data the clients returned, and then makes a greedy decision.

The way it is implemented it is completely distributed and very fault tolerant.  Clients can enter or leave planning at any time and no synchronization is required (the downside is sometimes clients do planning on stale data, but not a very big deal).

When I run this system on my mini-cluster of donated computers, I am able to get 50,000 steps of planning (1000 rollouts of depth 50) in well under a second, when the model can be queried directly (pure planning), or in about 2 seconds when using the constructed model (learning).  Although I was previously skeptical about getting BoB running for real-time control, I think this slightly different version of the algorithm may be capable of doing it.

In order to test everything out I compared the performance of this method to RMax in the double integrator domain, where the action was perturbed by a small amount of noise to introduce stochasticity.

MRE as written in the paper seemed to over-explore in this domain, so I simply set the desired knowness to be at least 3 cuts in each (state and action) dimension, which is analogous to saying that once there are enough samples to cut each dimension into 8 pieces, the estimates are reasonably accurate.

For all versions of RMax, M=5, because using a value less than that often lead to premature convergence with a very poor model, so Rmax as displayed is pretty nicely tuned to converge as quickly as possible with as high performance as the discretized model can accommodate.   In both BoB and Rmax planning is done out to 50 steps, the results are averages of 5 runs of each algorithm.

Rmax with 5 chops (per state and action dimension) quickly converges, but it is to a result that is quite poor.  BoB/MRE takes about another 10 trials to more or less converge to its behavior, but the result is very close to (LQR) optimal (which is actually not optimal with respect to the domain as defined as LQR has no notion of gamma or discrete time).  At any rate using a larger number of chops makes RMax take much longer to converge to good behavior (didn’t happen for most runs by the time 30 trials were experienced), and from what I saw the performance of RMax for 8 or 10 chops when it did converge was pretty much the same as BoB/MRE.

I should also point out that the limiting factor in the performance for BoB was not the accuracy of the model but the amount of planning that was done.  This can be determined because the performance of BoB with the same amount of planning when the actual domain dynamics were used was essentially identical to the results here.

I’m going to try out other simulated domains soon, I’m excited to try this out on a robot when ready.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: