related to https://aresearch.wordpress.com/2015/04/18/learning-task-specific-state-representations-by-maximizing-slowness-and-predictability-jonschkowski-brock-international-workshop-on-evolutionary-and-reinforcement-learning-for-autonomous-robot-s/
- Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
- Tested on a simulated slot car task and a simulated navigation task, both with distractors
- How can a low-dimensional representation relevant to the task at hand be extracted from high-dimensional sensor data?
- The visual input in the experiments is 300-D
- From the perspective of RL
- “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
- “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
- State representation learning can be done by:
- Deep autoencoders
- SFA (and its similarity to proto-value functions)
- Predictability / Predictive actions. Points to a Michael Bowling paper <I haven’t read – will check out>
- Bunch of nice references
- The Robotic priors they care about (these are all defined mathematically later):
- Simplicity: For any task, only a small number of properties matter
- Temporal coherence: important properties change gradually over time
- Proportionality: Amount of change in important properties is proportional to action magnitude
- Causality: important properties and actions determine the reward
- Repeatability: the analogue of causality for state transitions rather than rewards; the same action taken in similar states should lead to similar state changes
- These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
- Even in robotics these sometimes don’t hold (for example, a robot running into a wall experiences an abrupt change in velocity, and once it is pushed against the wall proportionality no longer holds because no amount of pushing will move it further)
- They set learning up as an optimization problem that includes loss terms for each of the priors above (just a linear combination)
- <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions which we dont have access to>
- Formally they would need to compare all pairs of samples, leading to an O(n²) loss, but they restrict comparisons to a temporal window
- The mapping from observations to states is linear (a sketch of the combined objective follows below)
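To make the objective concrete, here is a minimal NumPy sketch of how I read it: a linear map W from observations to a low-dimensional state, plus temporal coherence, proportionality, causality and repeatability terms combined linearly, with same-action pairs compared only within a temporal window. The weights, window size and pairing details here are my own assumptions, not the authors’ exact implementation.

```python
# Minimal sketch of the robotic-priors objective (assumed details, not the paper's code).
import numpy as np

def priors_loss(W, obs, actions, rewards, window=50,
                w_temp=1.0, w_prop=1.0, w_caus=1.0, w_rep=1.0):
    """obs: (T, D) observations; actions: (T,) discrete action ids; rewards: (T,)."""
    s = obs @ W                       # (T, d) learned states via the linear mapping
    ds = s[1:] - s[:-1]               # state changes Delta s_t
    a, r = actions[:-1], rewards[1:]  # action taken at t, reward received at t+1

    # Temporal coherence: states should change slowly over time.
    loss_temp = np.mean(np.sum(ds ** 2, axis=1))

    loss_prop = loss_caus = loss_rep = 0.0
    n_prop = n_caus = n_rep = 0
    T = len(ds)
    for t1 in range(T):
        for t2 in range(t1 + 1, min(t1 + window, T)):  # restrict pairs to a window
            if a[t1] != a[t2]:
                continue  # the pairwise priors compare time steps with the same action
            d1, d2 = ds[t1], ds[t2]
            dist2 = np.sum((s[t2] - s[t1]) ** 2)
            # Proportionality: same action -> similar magnitude of state change.
            loss_prop += (np.linalg.norm(d2) - np.linalg.norm(d1)) ** 2
            n_prop += 1
            # Causality: same action but different reward -> states should be far apart.
            if r[t1] != r[t2]:
                loss_caus += np.exp(-dist2)
                n_caus += 1
            # Repeatability: same action in similar states -> similar state change.
            loss_rep += np.exp(-dist2) * np.sum((d2 - d1) ** 2)
            n_rep += 1

    loss_prop /= max(n_prop, 1)
    loss_caus /= max(n_caus, 1)
    loss_rep /= max(n_rep, 1)
    return (w_temp * loss_temp + w_prop * loss_prop +
            w_caus * loss_caus + w_rep * loss_rep)
```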
- Epsilon-greedy exploration, with a bias toward repeating the previous action (sketched below)
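A tiny sketch of that exploration scheme as I understand it; the repeat probability is an assumed parameter, not a value from the paper.

```python
# Epsilon-greedy with a bias toward repeating the previous action (assumed parameters).
import numpy as np

def choose_action(q_values, prev_action, epsilon=0.1, repeat_prob=0.5, rng=np.random):
    if prev_action is not None and rng.rand() < repeat_prob:
        return prev_action                    # bias: repeat what we just did
    if rng.rand() < epsilon:
        return rng.randint(len(q_values))     # explore: uniform random action
    return int(np.argmax(q_values))           # exploit: greedy action
```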
- Have distractors in their simulations
- In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
- Representations learned are highly similar
- Learned in the course of 5,000 observations
- The observations in the experiments are flat vectors of 300 pixel values (or 100), not 300×300 images
- For the simulated slot car task, the state sample matrix has rank 4. Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
- Ideally the distractor shouldn’t show up at all, but because of stochasticity and the limited number of samples, weight can be placed on it to explain events it isn’t actually related to (this kind of rank/eigenvalue analysis is sketched below)
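This kind of analysis is easy to reproduce on any learned representation by inspecting the singular values of the centered matrix of state samples; large singular values are directions the representation actually uses, and small but non-zero ones are the kind of residual weight that ends up on the distractor. Here `states` is assumed to be an (n_samples, n_features) array of learned states; the tolerance is my own choice.

```python
# Effective rank of a matrix of learned state samples (assumed input shape and tolerance).
import numpy as np

def effective_rank(states, tol=1e-2):
    centered = states - states.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    print("singular values:", np.round(singular_values, 3))
    # Count directions whose singular value is non-negligible relative to the largest.
    return int(np.sum(singular_values > tol * singular_values[0]))
```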
- They then do an RL comparison based on a number of different methods for learning the representation (5 features extracted by each):
- Their approach
- SFA
- PCA
- Raw 300D representation
- (They also compare to ground truth representation)
- Uses Neural Fitted Q iteration for RL (a generic sketch is below)
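For reference, a compact sketch of Neural Fitted Q iteration on a batch of transitions, using scikit-learn’s MLPRegressor as the Q-function with one output per discrete action. This is a generic NFQ implementation under my own assumptions (architecture, discount, iteration count), not the authors’ training setup, and it ignores terminal-state handling for brevity.

```python
# Generic Neural Fitted Q iteration sketch (assumed hyperparameters).
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q(states, actions, rewards, next_states, n_actions,
             gamma=0.9, iterations=20):
    q_net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500)

    # Initialize by regressing immediate rewards for the taken actions.
    targets = np.zeros((len(states), n_actions))
    targets[np.arange(len(states)), actions] = rewards
    q_net.fit(states, targets)

    for _ in range(iterations):
        q_next = q_net.predict(next_states)              # Q_k(s', .)
        targets = q_net.predict(states)                  # keep current values ...
        targets[np.arange(len(states)), actions] = (     # ... update only taken actions
            rewards + gamma * q_next.max(axis=1))
        q_net.fit(states, targets)                       # refit Q_{k+1} on the whole batch
    return q_net  # greedy policy: argmax over q_net.predict(state)
```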

- SFA features perform very poorly; their approach performs about as well as operating from the ground-truth representation
- Further investigation of the same results shows that their method has very good generalization properties
- They conjecture that the primary difference between their approach and the other dimensionality reduction methods is that the other methods didn’t learn to disregard the distractors