Rollout sampling approximate policy iteration: Christos Dimitrakakis, Michail Lagoudakis (Machine Learning, 2008).

- Builds on work of Lagoudakis and Parr (2003), which suggests an approximate policy iteration algorithm that represents the policy as a classifier, bypassing the need for any representation of the value function.
- Algorithm works by producing a new classifier at each iteration, trained on data generated during rollouts from a generative model
- Called Rollout Classification Policy Iteration (RCPI).
- Algorithm is in the class of approximate policy iteration (API) algorithms, which attempt to do policy iteration when the state space of the MDP is very large.
- The convergence guarantees of exact policy iteration do not carry over to API

- This approach requires a large number of samples; this paper attempts to reduce the number of samples and the computation required
- Paper claims to:
- Suggest an improved statistical test for identifying, early and with high confidence, states that have a dominating action
- Obtain up to an order of magnitude improvement over previous algorithms in terms of sample complexity

- In RCPI states are training examples and actions are class labels; a policy maps states to actions.
- Rollouts are performed from a state to separate the optimal action from the suboptimal ones
- An action is declared best based on a pairwise t-test against each of the other actions
- Once states are labeled this way, a new policy is trained on them and the next policy iteration step proceeds in the same manner
- Algorithm terminates when performance of the new policy is equivalent to the old policy
- Since this is an approximation, there is no guarantee of converging to an optimal policy
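The labeling step above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `env_step` (a generative model), the horizon, rollout count, and t-threshold are all hypothetical placeholders, and the dominance check uses a plain Welch t statistic.

```python
import math
import statistics

def rollout_return(env_step, policy, state, action, gamma=0.95, horizon=30):
    """Estimate the return of taking `action` in `state`, then following `policy`.
    `env_step(state, action) -> (next_state, reward)` is a hypothetical generative model."""
    s, a, total, discount = state, action, 0.0, 1.0
    for _ in range(horizon):
        s, r = env_step(s, a)
        total += discount * r
        discount *= gamma
        a = policy(s)
    return total

def dominating_action(env_step, policy, state, actions, n_rollouts=30, t_threshold=2.0):
    """Label `state` with an action only if it dominates every other action
    under a pairwise Welch t-test; otherwise return None (state stays unlabeled)."""
    samples = {a: [rollout_return(env_step, policy, state, a) for _ in range(n_rollouts)]
               for a in actions}
    means = {a: statistics.fmean(v) for a, v in samples.items()}
    best = max(means, key=means.get)
    for a in actions:
        if a == best:
            continue
        v1 = statistics.variance(samples[best])
        v2 = statistics.variance(samples[a])
        # Welch t statistic; small epsilon guards against zero variance
        t = (means[best] - means[a]) / math.sqrt(v1 / n_rollouts + v2 / n_rollouts + 1e-12)
        if t < t_threshold:
            return None  # best action does not clearly dominate this one
    return best
```

If no action passes the test, the state contributes no training example, which is exactly the case the paper's sampling scheme tries to detect cheaply.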

- The savings of the algorithm in this paper over RCPI come from the fact that RCPI performs the same number of rollouts from every state
- Views the set of rollout states as a bandit problem where each *state* is an arm; sampling an arm means conducting a rollout from that state
- Good training data should satisfy:
- States are sampled only as needed to produce training data w/o wasting rollouts
- With high probability the discovered action labels indicate correct labels
- Training data covers the state space enough to provide a good representation of the policy

- Algorithm maintains a pool of states, *S_R*, from which sampling is performed; here it is assumed the states in *S_R* are sampled uniformly from the full state space
- For good sample efficiency, the goal is to identify which state to sample from at each step, a good stopping rule for when to stop sampling a given state, and when to terminate the algorithm altogether
- The paper considers three different metrics for choosing which state to sample from next
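One way to realize the bandit view is a UCB-style score over the pool. This scoring rule is an illustrative assumption, not one of the paper's three metrics: it favors states with few rollouts or with a still-ambiguous best-vs-runner-up value gap.

```python
import math

def select_state(pool_stats, total_rollouts, c=1.0):
    """Pick the next state to roll out from.

    `pool_stats` maps state -> (n_rollouts, empirical_gap), where
    empirical_gap is the best-vs-second-best estimated action value.
    Hypothetical bookkeeping; the paper's actual criteria differ."""
    def score(item):
        _, (n, gap) = item
        if n == 0:
            return float("inf")  # never-sampled states go first
        exploration = c * math.sqrt(math.log(max(total_rollouts, 2)) / n)
        return exploration - gap  # small gap + few samples => sample again
    return max(pool_stats.items(), key=score)[0]
```

States whose best action is already clearly separated (large gap) get a low score, so rollouts concentrate on the still-ambiguous arms.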

- Once a state is considered “good” or known, it is removed from the state pool *S_R*
- *<sorta skipping the proofs right now>* The proof basically says it is possible to bound the probability that a state's action label is correct by the time the state is removed from the pool
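The flavor of that bound can be sketched with a Hoeffding-style removal check. This is my own illustrative sketch, not the paper's actual bound: remove a state once the empirical gap exceeds twice a confidence radius, so the label is correct with probability at least 1 − delta.

```python
import math

def can_remove(n_rollouts, gap, delta=0.05, value_range=1.0):
    """Hoeffding-style check (illustrative assumption, not the paper's bound):
    with `n_rollouts` per action and empirical best-vs-runner-up gap `gap`
    on returns bounded in an interval of width `value_range`, the best-action
    label is correct w.p. >= 1 - delta once the gap clears 2x the radius."""
    radius = value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n_rollouts))
    return gap > 2.0 * radius
```

With many rollouts even a modest gap justifies removal, while a small gap after few rollouts keeps the state in the pool.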