### An Empirical Evaluation of Interval Estimation for Markov Decision Processes; Strehl, Littman:

- MBIE was already shown to be PAC in bandit problems; here it is used to build confidence intervals (CIs) on the parameters of an MDP model. These are used to find an upper bound on the maximum estimated reward.
- It was already shown that the naive approach (pull each arm a fixed number of times, then exploit) and interval-estimation approaches are PAC for bandit problems; here epsilon-greedy is also shown to be PAC.
- Rmax is analogous to the naive solution to bandit problems; interval estimation (IE) was previously used by Kaelbling in the context of Q-learning.
- MBIE is slightly different in that it uses formal CIs on the distributions for each <s,a>. Like Rmax, it keeps empirical estimates T^ and R^ of the true T and R. There is an exploration bonus like Rmax, but the bonus is based on how much better the outcomes could be as compared to the outcomes that were already experienced.
- MBIE looks for an optimistic estimated transition function, T~(s,a,•) that leads to the highest value policy, where T~(s,a,•) is inside the CI held for (s,a):
- Q(s,a)=R^(s,a) + max_{T~(s,a,•)} γ ∑_{s’} T~(s,a,s’) max_{a’} Q(s’,a’)
- Why isn’t a corresponding optimistic R~(s,a) also used? Oh, never mind, I suppose it’s because of the use of an sMax state, which can be assumed to be transitioned to if the CI is too loose.
- There is math on how to build the distribution T~(s,a,•)
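The construction behind that math can be sketched in code. A minimal sketch, assuming the standard "shift mass to the best successor" scheme: given the empirical distribution T^(s,a,•), a CI width ε, and current value estimates for successor states, the value-maximizing T~ inside the L1 ball moves up to ε/2 probability mass onto the highest-value successor and takes it away from the lowest-value ones. All names here are illustrative.

```python
def optimistic_transition(T_hat, values, eps):
    """Return a distribution within L1 distance eps of T_hat that
    maximizes expected successor value: add up to eps/2 mass to the
    highest-value successor, removing it from the lowest-value ones."""
    T = list(T_hat)
    best = max(range(len(values)), key=lambda s: values[s])
    add = min(eps / 2.0, 1.0 - T[best])  # can't exceed probability 1
    T[best] += add
    # Take the same total mass back from the worst successors first.
    for s in sorted(range(len(values)), key=lambda s: values[s]):
        if add <= 1e-12:
            break
        if s == best:
            continue
        take = min(T[s], add)
        T[s] -= take
        add -= take
    return T
```

For example, with T^ = [0.5, 0.5, 0.0], successor values [0, 1, 2], and ε = 0.4, the result shifts 0.2 mass from the worst state to the best, staying a valid distribution within the L1 ball.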

- This particular Q function is a contraction mapping.
- Beats up on Rmax in the SixArms domain

### A Theoretical Analysis of Model-Based Interval Estimation; Strehl, Littman:

- MBIE is PAC – it achieves near optimal expected performance with high probability.
- Earlier proofs of PACness in RL assumed restarts were required, but it was later noted that this isn’t necessary. Instead, the focus can be on whether, after a polynomial number of steps, an algorithm can output a near-optimal policy starting from the current state.
- There is yet another metric: the *sample complexity of exploration* of an algorithm is the number of timesteps over the course of a run on which the algorithm is not executing an epsilon-optimal policy.
- A performance metric introduced here is the *average loss*, which is defined in terms of the actual rewards received by an algorithm while it is running (as opposed to what it would get in the future).
- Consider an algorithm that is run for T steps. Define the *instantaneous loss* as il(t) = V*(s_t) – ∑_{i=t}^{T} γ^{i-t} r_i, which is the difference between the optimal value function at s_t and the actual discounted return of the agent from time t until the end of the trial.
- The *average loss* is l = (1/T) ∑_{t=1}^{T} il(t)
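These two loss definitions are easy to compute directly; a small sketch with hypothetical rewards and V* values:

```python
def instantaneous_loss(V_star, rewards, t, gamma):
    """il(t) = V*(s_t) minus the discounted return actually
    received from step t to the end of the run."""
    ret = sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))
    return V_star[t] - ret

def average_loss(V_star, rewards, gamma):
    """Average of il(t) over the T steps of the run."""
    T = len(rewards)
    return sum(instantaneous_loss(V_star, rewards, t, gamma)
               for t in range(T)) / T
```

With γ = 0.5, rewards [1, 1], and V* = [2, 1]: il(0) = 2 − (1 + 0.5) = 0.5, il(1) = 1 − 1 = 0, so the average loss is 0.25.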

- An algo that is PAC in terms of sample complexity is also PAC in terms of average loss.
- In this paper, it says MBIE also uses an upper bound on the reward function, which didn’t seem to be the case in the first paper based on my reading.
- In terms of the true and estimated transition functions, with probability 1 – δ, the L1 distance between T(s,a,•) and T^(s,a,•) will be at most:
- ε = sqrt( 2 [ln(2^|S| – 2) – ln(δ)] / n(s,a) ), where n(s,a) is the count of (s,a) samples
- This fact is used to build the transition confidence intervals
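The CI width can be computed directly from that formula; a minimal sketch (function name is mine):

```python
import math

def l1_confidence_width(n_samples, num_states, delta):
    """Width of the L1 confidence interval around the empirical
    T^(s,a,.) after n_samples observations of (s,a), holding with
    probability 1 - delta."""
    return math.sqrt(
        2.0 * (math.log(2 ** num_states - 2) - math.log(delta))
        / n_samples
    )
```

As expected, the interval tightens as n(s,a) grows (at a 1/sqrt(n) rate) and widens as δ shrinks or as |S| grows.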

- A CI C is called consistent if the mean of the true distribution lies within the CI. A setting of parameters can be calculated that guarantees the CIs are consistent with some probability.
- A guarantee on how much truncating a trajectory can impact the true infinite duration value is given.
- A slight improvement on the simulation lemma is given.
- The optimistic Q~ is always at least as large as the true optimal Q (with high probability).
- MBIE is proved PAC.

Good summary. One element missing is that the first paper gives a polynomial time algorithm for finding the highest scoring MDP within a given L1 distance of an MDP model.