Ideas from Rollout sampling approximate policy iteration

Rollout sampling approximate policy iteration: Christos Dimitrakakis, Michail Lagoudakis (Machine Learning, 2008).

  1. Builds on work of Lagoudakis and Parr (2003), which suggests an approximate policy iteration algorithm represented as a classifier, which bypasses the need for any representation of the value function.
    1. Algorithm works by producing a new classifier at each iteration which uses data which is generated during rollouts from a generative model
    2. Called Rollout Classification Policy Iteration (RPCI).
    3. Algorithm is in the class of approximate policy iteration (API) algorithms, which attempts to do policy iteration in the case where the size of the MDP is very large.
    4. Guarantees of normal policy iteration do not exist with API
  2. This approach requires a large number of samples, and this paper attempts to reduce the number of samples/computation required
  3. Paper claims to:
    1. Suggest an improved statistical test for identifying early and with high confidence states with dominating actions
    2. Obtain up to an order of magnitude improvement over previous algorithms in terms of sample complexity
  4. In RCPI states are examples and actions are class labels, a policy maps states to actions.
    1. Rollouts are done from a state to split the actions between the optimal and suboptimal actions
    2. An action is said to be best based on pairwise t-test
    3. Once this is done the policy is set and then policy iteration is done again in this manner
    4. Algorithm terminates when performance of the new policy is equivalent to the old policy
    5. Since this is an approximation, there is no guarantee of converging to an optimal policy
  5. Savings in algorithm in this paper over RCPI is that RCPI does the same number of rollouts from each state
  6. Views set of rollout states as a bandit problem where each state is an arm.  A sample of that state is a rollout that is conducted from that arm
  7. Goodness of training data should satisfy:
    1. States are sampled only as needed to produce training data w/o wasting rollouts
    2. With high probability the discovered action labels indicate correct labels
    3. Training data covers the state space enough to provide a good representation of the policy
  8. Algorithm maintains a pool of states, S_R, from which sampling is performed, here it is assumed the states in S_R are sampled from the full space uniformly.
  9. In order to have good sample efficiency, the goal is to identify which state to sample from at any time step, as well as a good stopping to stop sampling that state, as well as when to terminate the algorithm altogether
    1. There are a few different (3) metrics for choosing which state to sample from next
  10. Once a state is considered  “good” or known, it is removed from the state pool S_R
  11. <sorta skipping the proofs right now>
  12. Proof basically says its possible to bound how likely the classification on actions is likely to be correct by the point it is removed from the pool.
Tagged , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: