## Ideas from: MBIE (model-based interval estimation)

### An Empirical Evaluation of Interval Estimation for Markov Decision Processes; Strehl, Littman:

1. MBIE was already shown to be PAC in bandit problems; here it is used to build confidence intervals (CIs) on the parameters of a model of an MDP, which are then used to find an upper bound on the maximum estimated reward.
2. It was already shown that the naive approach (just pull each arm a certain number of times and then exploit) and interval estimation are PAC for bandit problems; here epsilon-greedy is also shown to be PAC.
3. Rmax is analogous to the naive solution to bandit problems; interval estimation (IE) was previously used by Kaelbling in a Q-learning setting.
4. MBIE is slightly different in that it uses formal CIs on the distributions for each ⟨s,a⟩. Like Rmax, it keeps estimates T^ and R^ of the true T and R. There is an exploration bonus like Rmax's, but the bonus is based on how much better the outcomes could be as compared to the outcomes that were already experienced.
5. MBIE looks for an optimistic estimated transition function, T~(s,a,•) that leads to the highest value policy, where T~(s,a,•) is inside the CI held for (s,a):
1. Q(s,a)=R^(s,a) + max_{T~(s,a,•)} γ ∑_{s’} T~(s,a,s’) max_{a’} Q(s’,a’)
2. Why isn’t a corresponding optimistic R~(s,a) also used? – Oh, nevermind, I suppose it’s because of the use of an sMax, which the agent can be assumed to transition to if the CI is too loose
3. There is math on how to build the distribution T~(s,a,•)
6. This particular Q-function update is a contraction mapping.
7. Beats up on Rmax in the SixArms domain
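The optimistic construction in item 5 can be sketched in code. This is a minimal sketch (the function name and argument names are mine): given current value estimates `V` over successor states and the empirical distribution `T_hat` = T^(s,a,•), shift up to ε/2 of probability mass onto the highest-value successor and take the same amount from the lowest-value successors, so the result stays within L1 distance ε of T^ while maximizing the expected next-state value.

```python
import numpy as np

def optimistic_transition(T_hat, V, eps):
    """Return a distribution within L1 distance eps of T_hat that
    maximizes the expected value of the next state under V.

    Adds up to eps/2 mass to the highest-value successor, and removes
    the same total from successors in increasing order of value."""
    T = T_hat.astype(float).copy()
    best = int(np.argmax(V))
    gain = min(eps / 2.0, 1.0 - T[best])  # mass we can add to the best state
    T[best] += gain
    # take the added mass back from the lowest-value successors first
    for s in np.argsort(V):
        if s == best or gain <= 0:
            continue
        taken = min(T[s], gain)
        T[s] -= taken
        gain -= taken
    return T
```

The L1 distance moved is (mass added) + (mass removed) ≤ 2 · ε/2 = ε, so the result stays inside the confidence interval by construction.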

### A Theoretical Analysis of Model-Based Interval Estimation; Strehl, Littman:

1. MBIE is PAC – it achieves near optimal expected performance with high probability.
2. Earlier proofs of PAC-ness in RL assumed restarts, but it was later noted that this isn’t necessary. Instead, the focus can be on whether, after a polynomial number of steps, an algorithm can output a near-optimal policy starting from its current state.
3. There is yet another metric: the sample complexity of exploration of an algorithm is the number of timesteps over the course of a run at which the algorithm is not executing an epsilon-optimal policy.
4. A performance metric introduced here is the average loss, which is defined in terms of actual rewards received by an algorithm while it is running (as opposed to what it would get in the future).
1. Consider an algorithm that is run for T steps. Define the instantaneous loss as il(t) = V*(s_t) – ∑_{i=t to T} γ^{i-t} r_i , which is the difference between the optimal value function at s_t and the actual discounted return of the agent from time t until the end of the trial.
2. The average loss is l = 1/T ∑_{t=1 to T} il(t)
5. An algorithm that is PAC in terms of sample complexity is also PAC in terms of average loss.
6. In this paper, it says MBIE also uses the upper bound on the reward function, which didn’t seem to be the case in the first paper based on my reading.
7. In terms of the true and estimated transition functions, with probability 1−δ, the L1 distance between T(s,a,•) and T^(s,a,•) will be at most:
1. ε = sqrt( 2 [ln(2^|S| − 2) − ln(δ)] / n(s,a) ) , where n(s,a) is the count of (s,a) samples
2. This fact is used to build the transition confidence intervals
8. A CI C is called consistent if the mean of the true distribution is within the CI; a setting of parameters can be calculated that guarantees the CIs are consistent with some probability.
9. A guarantee on how much truncating a trajectory can impact the true infinite duration value is given.
10. A slight improvement on the simulation lemma is given.
11. The optimistic Q~ is always greater than the true Q.
12. MBIE is proved PAC.
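The average-loss metric from items 4.1–4.2 is easy to compute from a trajectory. A small sketch (names are mine; `v_star[t]` holds V*(s_t) for the state visited at step t, `rewards[t]` the reward received at step t):

```python
def average_loss(v_star, rewards, gamma):
    """Average loss over a T-step run: for each t, the instantaneous
    loss is V*(s_t) minus the actual discounted return collected from
    step t to the end of the run; the result averages over all t."""
    T = len(rewards)
    total = 0.0
    for t in range(T):
        # actual discounted return from t to the end of the trial
        ret = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        total += v_star[t] - ret
    return total / T
```

Note that late timesteps have their returns truncated by the end of the run, which is why the paper's guarantee on the effect of truncating a trajectory (item 9) matters for relating this quantity to true infinite-horizon values.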
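The CI width from item 7 is a one-liner; a sketch (function and argument names are mine) that makes the 1/sqrt(n(s,a)) shrinkage explicit:

```python
import math

def l1_ci_width(n_states, n_sa, delta):
    """L1 confidence interval width for T(s,a,.) after n_sa samples:
    with probability 1 - delta, the empirical distribution is within
    this L1 distance of the true one."""
    return math.sqrt(
        2.0 * (math.log(2.0 ** n_states - 2.0) - math.log(delta)) / n_sa
    )
```

Quadrupling the number of samples of (s,a) halves the interval width, which is what drives the exploration bonus toward zero for well-visited pairs.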

## One thought on “Ideas from: MBIE (model-based interval estimation)”

1. Michael Littman says:

Good summary. One element missing is that the first paper gives a polynomial time algorithm for finding the highest scoring MDP within a given L1 distance of an MDP model.