Structure in the Space of Value Functions. Foster, Dayan. Machine Learning 2002.

Considers the case where the dynamics are fixed but the rewards may change; the method needs optimal value functions for at least a few instantiations of the rewards before it can be used
Standard hierarchical RL does temporal abstraction.
1. The approach here instead performs structural abstraction, which is state aggregation
A common method is to group states together, which is often lossy (Michael has a paper on when this isn’t lossy)
Here they propose augmenting state information with structural abstraction, as opposed to throwing away the original state and only using the aggregated state
1. This gets rid of problems of sub-optimal solutions or partial observability
Generally structural abstraction is done top-down, but here a bottom-up approach is proposed
1. Do unsupervised learning on collections of optimal value functions <how is that unsupervised then?>
“…value functions, particularly for a few different goals, contain substantial information about the functional texture of the underlying MDP, amounting, in many cases, to an appropriate structural decomposition of the problem.”
When doing unsupervised learning on the structure, some parameters to the classifier are goal-independent and some are goal-dependent
<Not really focusing on how they do their unsupervised learning, as I think it can be treated mostly as a black-box for conceptual purposes; based on model fitting via EM of Gaussian mixture model>
<Their boundaries are surprisingly sharp given that they are using mixtures of Gaussians; not sure how they got that to work. There are only a couple of cells that are incorrectly labelled>
When increasing resolution (same world topology with a finer grid) basically the same results fall out
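To make the "fragment states using a few optimal value functions" idea concrete for myself, here is a minimal sketch. It is not the paper's model (they fit a Gaussian mixture by EM); plain k-means on the per-goal value vectors is swapped in as a stand-in. The two-room grid, the discount of 0.5, the reward of 1 at the goal with deterministic moves, and all function names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical two-room grid: '#' is a wall, '.' is free; one doorway joins the rooms.
GRID = [
    "#######",
    "#..#..#",
    "#.....#",
    "#..#..#",
    "#######",
]
GAMMA = 0.5  # assumed discount factor


def free_cells(grid):
    return [(r, c) for r, row in enumerate(grid)
            for c, ch in enumerate(row) if ch == "."]


def optimal_values(grid, goal, gamma=GAMMA):
    """Assuming reward 1 at the goal and deterministic moves,
    V*(s) = gamma ** (shortest-path distance from s to the goal)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if grid[nxt[0]][nxt[1]] == "." and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return {s: gamma ** d for s, d in dist.items()}


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iters=20):
    """Plain k-means (a stand-in for the paper's EM-fitted mixture)."""
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid (ties go to the lower index).
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def fragment(grid, goals):
    """Label each free cell using only its optimal values under each goal."""
    cells = free_cells(grid)
    values = [optimal_values(grid, g) for g in goals]
    points = [tuple(v[s] for v in values) for s in cells]
    seeds = [points[cells.index(g)] for g in goals]  # seed clusters at the goals
    return dict(zip(cells, kmeans(points, seeds)))


# Two goals, one in each room, as in the "just 2 goals" setting.
labels = fragment(GRID, goals=[(1, 1), (1, 5)])
```

On this toy grid the two rooms end up in different fragments, with the doorway cell joining one of them. Note the clustering never sees coordinates or walls, only the value functions, which is the sense in which the structure is read off the values.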
RL approach used here is actor-critic
For critic, the state and chunk/fragmentation label are used (they get different learning rates)
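A rough sketch of what a critic over both the state and its fragment label could look like: a tabular TD(0) critic whose value estimate is the sum of a per-state weight and a per-fragment weight, each with its own learning rate. The additive form, the specific learning rates, and every name below are my assumptions; the notes above only say that both inputs are used and that they get different learning rates.

```python
def make_critic(n_states, n_fragments):
    """One weight per state plus one weight per fragment label."""
    return {"state": [0.0] * n_states, "frag": [0.0] * n_fragments}


def value(critic, s, frag_of):
    # Assumed additive form: state-specific term plus shared fragment term.
    return critic["state"][s] + critic["frag"][frag_of[s]]


def td_update(critic, s, r, s_next, frag_of, gamma=0.9,
              lr_state=0.1, lr_frag=0.5, terminal=False):
    """One TD(0) step; the two weight sets use different learning rates.
    The rate values here are illustrative, not taken from the paper."""
    bootstrap = 0.0 if terminal else gamma * value(critic, s_next, frag_of)
    delta = r + bootstrap - value(critic, s, frag_of)
    critic["state"][s] += lr_state * delta
    critic["frag"][frag_of[s]] += lr_frag * delta
    return delta


# Four states; states 0-1 share fragment 0, states 2-3 share fragment 1.
frag_of = [0, 0, 1, 1]
critic = make_critic(n_states=4, n_fragments=2)
delta = td_update(critic, s=0, r=1.0, s_next=1, frag_of=frag_of)
```

After this single update on state 0, state 1 (same fragment, never visited) already has a non-zero value estimate while states 2 and 3 do not. That is the kind of generalisation the fragment label buys, and because the original state is still an input, nothing is lost to aggregation.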
They find a fragmentation using just 2 states as goals, and then evaluate performance on domains where any other state can be the goal (so the train and test domains are separate)
The fragmentation found helps: it outperforms both no fragmentation and a random fragmentation
So this example had just 1 level of fragmentation; it's possible to have multiple levels of fragmentation: “A natural generalisation is to a hierarchical fragmentation in which states are free to belong simultaneously to many fragments at different levels, reflecting, for example, their position in rooms, in particular parts of rooms, and the like.”
Here they consider a 2-level hierarchy
“The solution to the problem of fitting a hierarchy of contributing experts is under-constrained and so, … we first train at the higher level and then train lower levels on the residual error from the higher level. The hierarchical model incorporates prior information that the flat model did not: that there should be a certain number of low-level fragments tied to each high-level fragment.”
The fragmentation found from the hierarchical decomposition seems to make sense. Learning is, as you would expect, sped up considerably
<I still don’t grok how it's unsupervised. Too sleep deprived to go back and figure that out; I think the basic points are what matter for right now anyway.>
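The "train the higher level first, then train the lower level on the residual" recipe from the quote can be sketched with a much simpler model than the paper's experts: one constant per fragment at each level. The piecewise-constant fit, the toy numbers, and the function names are assumptions for illustration only.

```python
def fit_means(targets, labels):
    """Least-squares constant per label: the mean of the targets with that label."""
    sums, counts = {}, {}
    for t, lab in zip(targets, labels):
        sums[lab] = sums.get(lab, 0.0) + t
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def fit_two_level(values, high, low):
    """Fit the high level first, then fit the low level to what is left over."""
    high_means = fit_means(values, high)
    residual = [v - high_means[h] for v, h in zip(values, high)]
    low_means = fit_means(residual, low)
    return high_means, low_means


def predict(high_means, low_means, high, low):
    return [high_means[h] + low_means[l] for h, l in zip(high, low)]


# Toy value function over 8 states: 2 high-level fragments ("rooms"),
# each tied to 2 low-level fragments ("parts of rooms").
values = [1.0, 1.0, 3.0, 3.0, 7.0, 7.0, 9.0, 9.0]
high = [0, 0, 0, 0, 1, 1, 1, 1]
low = [0, 0, 1, 1, 2, 2, 3, 3]

high_means, low_means = fit_two_level(values, high, low)
high_only = [high_means[h] for h in high]
both = predict(high_means, low_means, high, low)

sse_high = sum((v - p) ** 2 for v, p in zip(values, high_only))
sse_both = sum((v - p) ** 2 for v, p in zip(values, both))
```

Fitting in this order sidesteps the under-constrained part: if both levels were fit jointly, any constant could be shifted between a high-level fragment and its low-level children without changing the predictions.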
Mentions a few issues with the unsupervised learning they are doing
Here there is no risk of partial observability, but the hypothesis space searched while learning in any particular instantiation of the problem is larger than in the flat problem
1. <This may not be a problem as it seems to help speed up learning even though it is more complex>

Ari Weinstein's Research
