Regularized Off-Policy TD-Learning. Liu, Mahadevan, Liu. NIPS 2012

At this point, I don’t have time to do this paper justice, so consider this pretty close to an abstract.

  1. Another regularized TD paper
  2. The algorithm, called RO-TD, is off-policy and converges
  3. On the other hand, “Although TD converges when samples are drawn ‘on-policy’ … it can be shown to be divergent when samples are drawn ‘off-policy’.”
  4. Mentions LARS-TD and LCP-TD
  5. “In this paper, the off-policy TD learning problem is formulated from the stochastic optimization perspective. A novel objective function is proposed based on the linear equation formulation of the TDC algorithm [used in LS-TD]. The optimization problem underlying off-policy TD methods, such as TDC, is reformulated as a convex-concave saddle-point stochastic approximation problem, which is both convex and incrementally solvable.”
  6. <Not sure if this is simpler or more complex than the other L1 regularizations of LSTD>
  7. Gives a well-known example domain where LSTD and LARS-TD diverge but RO-TD and TDC converge
  8. Seems like TDC actually outperforms RO-TD, but then in the next example, TDC fails and RO-TD and LARS perform properly
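
To make the saddle-point idea in item 5 a bit more concrete, here is a rough sketch of my own, not the paper's actual update rules. It assumes the objective has the generic form min_x ||Ax - b||_2 + rho * ||x||_1 and uses the standard identity ||z||_2 = max over ||y||_2 <= 1 of y^T z to turn it into a convex-concave problem. The matrix `A`, vector `b`, the step size, and all function names are hypothetical; in the stochastic setting `A` and `b` would be replaced by per-sample estimates, and the paper may use a different primal-dual scheme.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink each entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def project_unit_ball(y):
    """Euclidean projection onto the set {y : ||y||_2 <= 1}."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def saddle_point_sketch(A, b, rho=0.01, step=0.05, iters=500):
    """Illustrative primal-dual iteration (an assumption, not the paper's method) for
    min_x max_{||y||_2 <= 1}  y^T (b - A x) + rho * ||x||_1.
    Returns the running average of the primal iterates and the last dual iterate."""
    x = np.zeros(A.shape[1])
    y = np.zeros(A.shape[0])
    x_avg = np.zeros_like(x)
    for t in range(1, iters + 1):
        grad_x = -A.T @ y        # gradient of y^T (b - A x) with respect to x
        grad_y = b - A @ x       # gradient of y^T (b - A x) with respect to y
        # Descent step on x, followed by the L1 proximal (shrinkage) step.
        x = soft_threshold(x - step * grad_x, step * rho)
        # Ascent step on y, followed by projection back onto the unit ball.
        y = project_unit_ball(y + step * grad_y)
        x_avg += (x - x_avg) / t  # running average of primal iterates
    return x_avg, y
```

Each iteration only needs matrix-vector products, which is the sense in which this kind of formulation is "incrementally solvable"; the L1 term is handled by the shrinkage step rather than by solving a full regularization path as LARS-style methods do.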
