- Trying to do robotics control directly from high-dimensional sensor readings. The trick is that with robots experience is costly so you need to use data very efficiently <the deepmind guys were able to run each game for the equivalent of over a month because its just in simulation>
- Review 2 methods for extracting features:
**deep auto-encoders**and**slow feature analysis** - Mention deep learning’s deep Q network
- Until DQN, using FAs has been largely unsuccessful, and has generally worked only on toy domains <although Tesauro’s TD Gammon is a significant exception>
- “In the last years several researchers have also considered end-to-end learning of behavioral policies p, represented by general function approximators. First attempts towards this include an actor-critic formulation termed NFQ-CA [14], in which both the policy p and the Q-function are represented using neural networks and learning proceeds by back-propagating the error from the Q-network directly into the policy network—which was, however, only applied to low dimensional control problems.”
- Also cites deterministic policy gradient https://aresearch.wordpress.com/2014/06/09/deterministic-policy-gradient-algorithms-silver-lever-heess-degris-wierstra-riedmiller-icml-2014/, which worked on a 30-DOF arm
- Cites deep fitted Q (same lab, papers from a couple of years prior that also was used on slot cars and an inverted pendulum from raw camera images
- The first part of deep fitted Q is unsupervised, setup as an autoencoder
- In order for the RL to work, it is important that the problem has low “intrinsic dimensionality” (the manifold it lies on)
- For example, the image a camera gets from a robotic inverted pendulum can be arbitrarily large, but the underlying dimension of the problem is just 2

- They do not assume a pomdp; assume that the high-dimensional input completely describes the problem <although this is almost certainly not actually the case with at least the slot car, but that is not a ding on their work it just makes it more impressive>
- The representation chosen for the state depends on an optimization problem
- Train an autoencoder to minimize MSE
- DFQ is a clustering-based approach<?>
- There are some problems with DFQ that they will discuss later, but:
- Representation learned by DFQ isn’t reward-based, so representation doesn’t care about that
- “learning auto-encoders for inputs with high variability (i.e. many objects of relevance) can be hard” Although says regularization can help this, and cites a number of other papers that deal with this

- “Despite these drawbacks DFQ also comes with advantages over end-to-end learning: since the auto-encoder merely learns to reconstruct sampled observations and it can be fed with samples generated by any sampling policy for any task, is thus less susceptible to non-stationarity of the training data.”
- Mentions things like proto-value functions, laplacian eigenmaps etc for discrete RL, relationship between SFA and LEM and PVF
- “From a theoretical point of view, SFA and PVF both approximate subspace-invariant features. This classification has its origin in the the analysis of approximation errors in linear RL [39]. Here subspace-invariant features induce no errors when the future reward is propagated back in time. It can be shown that under these conditions the least-squares temporal difference algorithm (LSTD, [9]) is equivalent to supervised least-squares regression of the true value function [5]. However, this is only possible for the class of RL-tasks with a self-adjoint transition model. As this class is very rare, both SFA and PVF substitute a self-adjoint approximation of the transition model to compute almost subspace-invariant representations.4 An analysis of the optimal solution5 shows that SFA approximates eigenfunctions of the symmetrized transition operator [50]. Moreover, with a Gaussian prior for the reward, one can show that SFA representations minimize a bound on the expected LSTD error of all tasks in the same environment [5]. However, as the solution depends on the sampling distribution, straight forward application for transfer learning is less obvious than in the case of PVF. Future works may rectify this with some sensible importance sampling, though.”
- When the data completely covers the space, PVF and SFA are pretty equivalent, but when data is generated from a random walk, PVF fails and SFA still manages to do almost as well as when there is perfect coverage
- In less artificial settings/domains there is less of a performance gap between PFV and SFA but it still exists, SFA seems to do a better job generating features for LSTD and LSPI
- In these problems, the representation learned must:
- Maintain Markov property (no partial observability)
- Be able to represent the value function well enough to bootstrap from it and improve the policy
- Generalize well (especially to cases that may not be well covered by the training distribution)
- Low dimensional

- “In stochastic environments one can only compare probability distributions over future states based on a random policy, called diffusion distances. It can be shown that SFA approximates eigenfunctions of the symmetrized transition operator, which encode diffusion distances [5]. SFA features are therefore a good representation for nonlinear RL algorithms as well.”
- <Depends how you define “stochastic environments” I assume they mean environments where there is no choice in action selection at all because otherwise what is written is incorrect>

- “In summary, SFA representations X seem in principle the better choice for both linear and non-linear RL: nonlinear SFA extracts eigenfunctions of the transition model Pp, which are the same for every isomorphic observation space Z, encode a diffusion metric that generalizes to states with similar futures and approximates a Fourier basis of the (unknown) underlying state space.”
- Empirical work shows SFA works best when exposed to data that gets uniform coverage
- “a Fourier basis as approximated by SFA grows exponential in the underlying state dimensionality. Linear algorithms, which depend on this basis to approximate the value function, are therefore restricted to low dimensional problems with few or no variables unrelated to the task. Non-linear RL algorithms, on the other hand, could work in principle well with only the first few SFA features of each state-dimension/variable. e. The order in which these variables are encoded as SFA features, however, depends on the slowness of that variable. This can in practice lead to absurd effects. Take our example of a wheeled robot, living in a naturally lit room. The underlying state space that the robot can control is three-dimensional, but the image will also depend on illumination, that is, the position of the sun.”
- Describes this as being vulnerable to
*slow distractors*

- Describes this as being vulnerable to
- “Also, auto-encoders minimize the squared error over all input dimensions of Z equally. This can produce incomplete representations if a robot, for example, combines observations from a camera with measurements from multiple joints. Due to the large number of pixels, small improvements in the reconstruction of the image can outweigh large improvements in the reconstruction of the joint positions.”
- Compares results from a deep auto encoder and one with a NN that has objectives of slowness and predictability. The latter produces a much better output <but this is really a garbage comparison because the prior is based on actual physical experiments and the latter is based on an extremely oversimplified clean simulation.>
- “Take the example of uncontrollable distractors like blinking lights or activity outside a window. Each distractor is an independent variable of the isomorphic state S, and to learn an isomorphic representation X requires thus samples from all possible combinations of these variables. The required training samples grow exponentially in the number of distractors.”
- <The XSENS suit has tons of blinking lights on it. I was thinking about this as a potential distractor for the methods we are working on…>

- Moves on to a discussion of “factored representations and symbolic RL” at the end of the paper
- Basically discuss object oriented RL (discuss it in terms of relational RL)