Ideas from Kernel-Based Reinforcement Learning: Dirk Ormoneit, Saunak Sen

Approximate value iteration (VI) converges to a finite fixed point
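To make the fixed-point claim concrete, here is a minimal sketch of kernel-averaged value iteration over a batch of sampled transitions. It is written in the spirit of the paper, not as its exact algorithm: the triangular kernel, the uniform-weight fallback, the toy 1-D environment in `make_samples`, and all function names are assumptions made for illustration.

```python
import random

def kernel_weights(x, centers, bandwidth):
    # Assumed kernel: triangular profile of the scaled distance, normalized to sum to 1.
    raw = [max(0.0, 1.0 - abs(x - c) / bandwidth) for c in centers]
    total = sum(raw)
    if total == 0.0:
        # Assumption: fall back to uniform weights when no sample is within the bandwidth.
        return [1.0 / len(centers)] * len(centers)
    return [w / total for w in raw]

def make_samples(n_per_action=30, seed=0):
    # Hypothetical toy problem: states in [0, 1], reward 1 for landing above 0.9.
    rng = random.Random(seed)
    samples = {}
    for action, step in (("left", -0.1), ("right", 0.1)):
        transitions = []
        for _ in range(n_per_action):
            x = rng.random()
            y = min(1.0, max(0.0, x + step + rng.gauss(0.0, 0.02)))
            transitions.append((x, 1.0 if y > 0.9 else 0.0, y))
        samples[action] = transitions
    return samples

def kernel_value_iteration(samples, gamma=0.9, bandwidth=0.2, tol=1e-10, max_sweeps=10_000):
    """samples maps each action to a list of (x, reward, y) transitions.

    Returns the values at all sampled successor states and the number of sweeps used.
    """
    actions = sorted(samples)
    # Global list of successor states; index[a][s] locates sample s of action a in it.
    successors, index = [], {}
    for a in actions:
        index[a] = []
        for (_, _, y) in samples[a]:
            index[a].append(len(successors))
            successors.append(y)
    # Kernel weights of every successor state against each action's start states.
    weights = {
        a: [kernel_weights(y, [x for (x, _, _) in samples[a]], bandwidth) for y in successors]
        for a in actions
    }
    values = [0.0] * len(successors)
    for sweep in range(1, max_sweeps + 1):
        new_values = []
        for j in range(len(successors)):
            backups = []
            for a in actions:
                backups.append(sum(
                    w * (r + gamma * values[index[a][s]])
                    for s, (w, (_, r, _)) in enumerate(zip(weights[a][j], samples[a]))
                ))
            new_values.append(max(backups))
        change = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if change < tol:
            return values, sweep
    return values, max_sweeps
```

Because the weights are non-negative and sum to one, each backup is a weighted average followed by a max, so with a discount below 1 the update is a contraction and the loop stops well before `max_sweeps`. Calling `kernel_value_iteration(make_samples())` returns the converged values and the sweep count.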

There are some assumptions: Lipschitz continuity, finite covariance of T(y|x, a), i.i.d. transitions, and a kernel built from a Lipschitz “mother kernel” (I don’t know what that is, but my guess is almost anything reasonable will satisfy it)

Classic bias/variance tradeoff that comes along with a varying bandwidth (high width = low variance, high bias)
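The tradeoff is easy to see numerically. The sketch below is a plain kernel-weighted regression, not anything taken from the paper: the target function sin(2πx), the noise level, the evaluation point 0.25, the triangular kernel, and the function names are all assumptions chosen for illustration.

```python
import math
import random

def triangular_kernel(z):
    # Assumed kernel profile on [0, 1]; zero beyond the bandwidth.
    return max(0.0, 1.0 - z)

def kernel_estimate(x, xs, ys, bandwidth):
    # Weighted average of ys, with weights from the scaled distance |x - x_i| / bandwidth.
    weights = [triangular_kernel(abs(x - xi) / bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        return float("nan")  # no sample fell inside the window
    return sum(w * y for w, y in zip(weights, ys)) / total

def bias_and_variance(bandwidth, n_points=200, n_trials=300, x0=0.25, seed=0):
    # Repeats the estimate at x0 over fresh noisy datasets and summarizes the spread.
    rng = random.Random(seed)
    truth = math.sin(2 * math.pi * x0)
    estimates = []
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_points)]
        ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.5) for x in xs]
        estimates.append(kernel_estimate(x0, xs, ys, bandwidth))
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean - truth, variance
```

Comparing `bias_and_variance(0.05)` with `bias_and_variance(0.4)` should show the wide bandwidth flattening the peak at 0.25 (larger bias) while averaging over many more points (smaller variance).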

Requirement that the bandwidth shrinks to zero over time, but not so quickly that the variance increases sharply while it shrinks (so the bandwidth should be at least partly a function of the amount of data)

If a reasonable shrinking rate is chosen, the estimated value function converges to the optimal value function

A formula for the optimal shrinkage rate is given; not surprisingly, it is exponential in the dimension

Using the max operator over a random estimate of the Q function in the Bellman equation leads to a biased estimator (an observed maximum state-action value may come from a suboptimal action)
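A quick simulation shows the effect. This is a sketch under assumed conditions (independent Gaussian noise on each action's value estimate, made-up true values), with hypothetical names throughout; it is not the paper's analysis.

```python
import random

def simulate_max_bias(true_q, noise_std=1.0, n_trials=20000, seed=0):
    """Returns (average of the max over noisy estimates,
    fraction of trials where the max came from a suboptimal action)."""
    rng = random.Random(seed)
    best = max(range(len(true_q)), key=lambda a: true_q[a])
    total, wrong = 0.0, 0
    for _ in range(n_trials):
        # Each action's estimate is its true value plus independent noise (assumption).
        noisy = [q + rng.gauss(0.0, noise_std) for q in true_q]
        winner = max(range(len(noisy)), key=lambda a: noisy[a])
        total += noisy[winner]
        wrong += winner != best
    return total / n_trials, wrong / n_trials
```

With `simulate_max_bias([1.0, 0.8, 0.8, 0.8])` the average maximum should come out well above the true best value of 1.0, and a large share of the maxima should belong to one of the 0.8 actions. With a single action there is nothing to maximize over, and the average stays close to the true value.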

Unless priors are used, the curse of dimensionality can’t be broken

the “mother kernel” thing is interesting; several of us have tried to find out what it could mean, with no success. very odd that they would use obscure terminology and not define it!

I am the second author of the paper, and it’s been over a decade since we worked on it, so pardon the rust. The mother kernel is a Lipschitz continuous function from [0,1] -> R+ that integrates to 1. See page 172.

I am sorry.. Page 172 in what? Any book?

Not sure – I found versions starting at page 1 or page 757 in Intelligent Computing

Page 172 from the Ormoneit and Sen paper (in the appendix).