Our last email ended with Michael saying:
Have you (or could you) fill in a 2×2 table that looks something like this?
1a. Behavior = parameterized policy, planning performed = compile time
1b. Behavior = parameterized policy, planning performed = run time (sequential)
2a. Behavior = action sequence, planning performed = compile time
2b. Behavior = action sequence, planning performed = run time (sequential)
I’d call 1a “classical” policy search and 2b HOLOP. I’d compare them in terms of computation performed beforehand, computation performed during the run, and effectiveness (in terms of reward). I’d hypothesize that the a’s beat the b’s in terms of computational demands during the run, but the b’s beat the a’s in terms of computational demands before the run. I’d also expect to see a smaller difference between the effectiveness of 1b and 2b than of 1a and 2a. That is, given that you are going to do run-time decision making, the less expensive action-sequence representation is sufficient for realistic problems. Ideally, the run-time resources needed for 2b are not substantially greater than those of 1a, the compile-time resources are much lower, and the effectiveness is similar.
I just finished up implementing and running these different approaches. Here’s a picture:
Some more explanation: the x-axis represents the number of double integrators (DIs) simultaneously controlled, and the y-axis represents the average cumulative reward per agent. Episodes were 200 steps long, and the planners were given access to 100 rollouts from a generative model for each step in the episode. 1a and 2a took all 200 × 100 = 20,000 rollouts upfront and attempted to find an optimal policy, while 1b and 2b replanned from scratch at each step, using 100 rollouts per step.
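To make the budget split concrete, here is a rough sketch of the two schedules. The plan(), behavior(), and env interfaces are stand-ins for whatever the real experiment code used, not the actual implementation:

    # Rough sketch of the two rollout budgets (hypothetical interfaces).
    EPISODE_LENGTH = 200
    ROLLOUTS_PER_STEP = 100

    # 1a / 2a: one big optimization before the episode starts.
    COMPILE_TIME_BUDGET = EPISODE_LENGTH * ROLLOUTS_PER_STEP   # 20,000 rollouts

    def compile_time_episode(env, model, plan):
        # All planning happens up front; the resulting behavior (policy or
        # fixed action sequence) is then just executed.
        behavior = plan(model, budget=COMPILE_TIME_BUDGET)
        state, total = env.reset(), 0.0
        for _ in range(EPISODE_LENGTH):
            state, reward = env.step(behavior(state))   # no further planning
            total += reward
        return total

    # 1b / 2b: replan from scratch before every single step.
    def run_time_episode(env, model, plan):
        state, total = env.reset(), 0.0
        for _ in range(EPISODE_LENGTH):
            behavior = plan(model, state=state, budget=ROLLOUTS_PER_STEP)
            state, reward = env.step(behavior(state))
            total += reward
        return total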
All algorithms used HOO to search the parameter space. In 1a and 1b, HOO was used to find the weights of a neural net encoding the agent's policy. The net had |S|+1 inputs, 2|S| hidden nodes, and |A| output nodes, so the number of weights grows polynomially with the number of DIs controlled. 1a and 1b yielded very poor (close to random) performance when the final weights were selected by greedily descending the HOO tree by mean value. The performance shown in the graph instead comes from simply taking the weights from whichever run yielded the maximum reward, which is safe to do in a deterministic domain but not in a stochastic one (which this was).
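A minimal sketch of that parameterization, assuming tanh units, a bias input, and a simple flat-vector weight encoding (only the layer sizes come from the text; the rest is assumption):

    import numpy as np

    def make_policy(weights, n_state, n_action):
        # HOO proposes one flat weight vector; unpack it into two layers.
        n_in, n_hid = n_state + 1, 2 * n_state
        W1 = weights[: n_in * n_hid].reshape(n_in, n_hid)
        W2 = weights[n_in * n_hid :].reshape(n_hid, n_action)

        def policy(state):
            x = np.append(state, 1.0)      # |S| state values plus a bias input
            h = np.tanh(x @ W1)            # 2|S| hidden units
            return np.tanh(h @ W2)         # |A| continuous actions
        return policy

    # Weight count is quadratic in |S|, and |S| grows linearly with the
    # number of DIs (position and velocity for each one).
    def n_weights(n_state, n_action):
        return (n_state + 1) * 2 * n_state + 2 * n_state * n_action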
For 2a and 2b, an action sequence was found by HOO/HOLOP, and the “best” sequence was then selected by greedily following nodes down the HOO tree, which yielded better results in that case.
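In pseudo-Python, one planning call in the 2b setup looks roughly like the following; hoo_search, greedy_descent, and the model interface are hypothetical names standing in for the real code:

    # HOO searches the space of length-H action sequences (a box in
    # R^{H*|A|}), scoring each candidate with a single model rollout.
    def holop_plan(model, state, horizon, budget=100):
        def score(action_sequence):
            # Cumulative reward of executing the sequence from `state`.
            return model.rollout(state, action_sequence)

        tree = hoo_search(score, dims=horizon * model.action_dim, budget=budget)
        best_sequence = greedy_descent(tree)     # follow highest-mean child down
        return best_sequence[: model.action_dim] # execute only the first action

In 2b this is called before every step (replanning from scratch); in 2a it is called once with the full budget and the whole sequence is executed open-loop.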
The most interesting result, I find, is that classical policy search (1a) starts out with the best performance but quickly degrades to random performance as the number of parameters describing the policy grows. 1b and 2b, which do replanning, are much more tolerant of the increasing difficulty of the task, probably because they aren't trying to find a good representation of the policy globally and can recover from a few poor policy searches with later ones. 2a, which is essentially HOLOP without replanning, is quite bad, but it had very low variance in returns (not graphed here), with close to constant performance, and at least managed to stay above random performance.
I know this isn't the graph you asked for, since you wanted to compare computation performed before and during the run, but I'm not sure how to really show that. With the a's, very close to 100% of the computation is done beforehand and very close to 0% during the run, and vice versa with the b's, so I don't think it's really interesting to graph.