Black Box Policy Search in the Inverted Pendulum


While working on the comparison between action sequence search and policy parameter search, I noticed a few things:

  1. Action sequence planning with replanning seemed the most robust
  2. Action sequence planning without replanning seemed to be about as bad as random
  3. In the policy parameter search, using the parameter setting that produced the single best run (I call this greedy) seemed to work better than asking the HOO algorithm to find the best parameterization (I call this hoo-greedy, found by greedily following the means down the tree; see the sketch after this list)
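
Since greedy vs. hoo-greedy comes up repeatedly below, here is a minimal sketch of the distinction, assuming HOO is run over a one-dimensional parameter interval. `HooNode`, its field names, and the sample bookkeeping are my stand-ins, not the actual experiment code:

```python
class HooNode:
    """One node of the HOO tree; covers the parameter interval [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.total_reward = 0.0  # sum of rewards from rollouts through this node
        self.visits = 0          # number of such rollouts
        self.children = []       # empty at a leaf, two nodes after a split

    @property
    def mean(self):
        return self.total_reward / self.visits if self.visits else float("-inf")


def greedy_params(samples):
    """'greedy': of all (params, reward) pairs the search evaluated,
    return the parameter setting whose single run scored highest."""
    best_params, _best_reward = max(samples, key=lambda s: s[1])
    return best_params


def hoo_greedy_params(root):
    """'hoo-greedy': follow the child with the higher empirical mean
    reward down the tree, and return the midpoint of the leaf interval."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.mean)
    return 0.5 * (node.low + node.high)
```

The key difference: greedy trusts one lucky rollout, while hoo-greedy trusts averages, which is exactly why noise should affect the two differently.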

This left me with a number of questions, one of the bigger ones being what the impact of noise is on the greedy vs. hoo-greedy policies.  I ran all four methods (action sequence and policy parameter search, each with and without replanning) in the inverted pendulum domain with varying amounts of noise.  The full action range in the domain is from -50 to 50 units, and in these experiments each action was corrupted by uniformly distributed noise, with the noise range varied from 0 up to +/- 4 units across runs.  The results are below:
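
To make the corruption model concrete, here is a minimal sketch. The function name is mine, and the clipping back into the legal action range is my assumption about how out-of-range actions would be handled, not something confirmed by the experiments:

```python
import random

ACTION_MIN, ACTION_MAX = -50.0, 50.0  # full action range in the pendulum domain

def corrupt(action, noise_level):
    """Add noise drawn uniformly from [-noise_level, +noise_level];
    noise_level is varied from 0 up to 4 units across experiments.
    Clipping to the legal range is an assumption on my part."""
    noisy = action + random.uniform(-noise_level, noise_level)
    return max(ACTION_MIN, min(ACTION_MAX, noisy))
```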

[Figure: Action sequence search with replanning]

[Figure: Action sequence search without replanning]

[Figure: Policy parameter search with replanning]

[Figure: Policy parameter search without replanning]

So from here, the new basic takeaways in the inverted pendulum domain are:

  1. Action sequence search with replanning performs close to optimally with either the greedy or hoo-greedy results, regardless of noise (see the sketch after this list for what I mean by replanning)
  2. Parameter search with replanning performs close to random either way.  This is good to see, because in the double-integrator domain the results for parameter search and action sequence search with replanning were right on top of each other, so this domain actually separates the two.
  3. Parameter search without replanning only seems to work when the greedy parameter settings are chosen, and even then only in domains with no noise at all.  Even very small amounts of noise seem to cause trouble in this domain.
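
Since replanning vs. no replanning is the axis all of these plots vary, here is roughly what the two execution schemes look like. `plan_sequence` and the `env.state`/`env.step` interface are hypothetical stand-ins for whatever planner and simulator are actually used:

```python
def run_open_loop(env, plan_sequence, horizon):
    """Without replanning: plan one action sequence from the start
    state, then execute the whole sequence open-loop."""
    actions = plan_sequence(env.state, horizon)
    total_reward = 0.0
    for a in actions:
        _, reward = env.step(a)
        total_reward += reward
    return total_reward


def run_with_replanning(env, plan_sequence, horizon):
    """With replanning: replan from the current state at every step
    and execute only the first action of each fresh plan."""
    total_reward = 0.0
    for _ in range(horizon):
        actions = plan_sequence(env.state, horizon)
        _, reward = env.step(actions[0])
        total_reward += reward
    return total_reward
```

The replanning variant gets feedback from the actual (noisy) state at every step, which is consistent with it being the most robust method above.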

Aside from that, I’m not sure what else to say.  Unfortunately, I can’t really think of anything concrete to pull from the experiments here and in the double integrator, so I think I’ll keep running experiments and see what other things come up.


Update: a similar graph, with everything rolled into one, for the double integrator domain.  It’s a bit of a mess, but action sequence search with replanning using the HOO-suggested action still comes out on top.  There are some gnarly error bars because I only ran a few trials of each, as they take a bit of time:

[Figure: Performance of all the approaches in the double integrator]


Another update: here are the results of a similar experiment in the Ball-Beam domain:

[Figure: Action sequence search with replanning]

[Figure: Action sequence search without replanning]

[Figure: Policy parameter search with replanning]

[Figure: Policy parameter search without replanning]


Another update: here are the results of a similar experiment in the Bicycle domain:

[Figure: Action sequence search with replanning]

[Figure: Action sequence search without replanning]

[Figure: Policy parameter search with replanning]

[Figure: Policy parameter search without replanning]
