Using Relative Novelty to Identify Useful Temporal Abstractions in Reinforcement Learning. Simsek, Barto. ICML 2004.

  1. “… states that allow the agent to transition to a different region of the state space are useful subgoals, and propose a method for identifying them using the concept of relative novelty.
  2. Many earlier papers deal with planning when given a hierarchical abstraction, but the goal here is to find the abstraction
  3. HEXQ uses the heuristic of figuring out the frequency of feature change, with a layer of hierarchy for each variable, and each child operates a smaller sub-MDP
  4. Other methods are finding common subpolicies among collections of tasks
  5. Others are to measure visit count, reward gradient, or Maxflow
  6. The algorithm we propose also identifies subgoals in the state space and creates temporally-extended activities that take the agent efficiently to these subgoals. Our subgoals are states that allow the agent to transition to a part of the state space that is otherwise unavailable or difficult to reach from its current region. We call them access states. Some examples are a doorway between two rooms…”
  7. These subtasks can aid exploration <when it is undirected>, as well as transfer
    1. <The more I think of it, doing an experiment with the combination lock feels like it makes sense – the type of exploration people do has implications to the usefulness of hierarchy>
  8. Relative novelty is the metric used to determine whether to set a state as a subgoal
  9. And therefore the algorithm is called the Relative Novelty algorithm (RN)
  10. The algorithm only considers structure and does not consider reward
  11. Can be used in a sort of anytime mode <?>
  12. Novelty is defined as visitation to a state/region not “visited recently”
  13. Access states are targets, non-access states are non-targets
  14. The novelty of a state is defined as 1/sqrt(n_s)  where n_s is the number of times s has been visited
    1. This can also be defined over a set of states, and in that case, it is simply in terms of the mean visit count
  15. They are measured in a “short-term measure of novelty.”  This requires that computations aren’t based on the entire history, so some form of forgetting is needed.  They mention that decay can be used, but they just do a periodic reset – in episodic domains the end of the episode is a natural place to reset
  16. The relative novelty of a state is the ratio of the novelty of itself and following states to preceding states
    1. The amount of history that is factored in this way is a parameter
  17. Because the novelty is based on the trajectory (and not the actual structure of the domain) this measure will change from visitation to visitation, but targets should most often have high novelty
  18. <They have a figure depicting relative novelty score vs proportion of observed scores with different curves for target vs non-target.  Its very strange because the two curves really sit on each other for almost the whole graph, with a small amount of separation in the middle.  There are no error bars, so especially because of the closeness of the two curves and the fact that they are characteristically similar means the graph is of limited use. I imagine this can’t be the metric they are using to separate them, as they are mostly non-separable by this criteria>
  19. They cast detecting subgoals as classification problem
  20. They just come up with a thresholding rule
  21. Limitations of using the approach in domains where episodic experience is representative of the task <does this mean it must have good coverage?  I’m actually not clear on why this is necessary>
  22. Propose setting the threshold by measuring ROC
  23. When creating options, the initiation set is are the set of states visited “shortly” before novelty measured above threshold.
  24. Options are learned by experience replay with a reward function designed for that option
  25. Experimental results in Rooms and Taxi
  26. The competitor algorithm is Q-learning with e-greedy.  Naturally vanilla QL loses
  27. Subgoals identified by the method are what you would expect
  28. They note it would be nice to compare other methods that develop subgoals
  29. Mentions an algorithm Q-Cut and that that algorithm and this are trying to do the same thing, but that Q-Cut does so globally, and approximates this via a local method
    1. Naturally, as a global method, Q-Cut is expensive, with a O(n^3) computational (and probably memory) cost
    2. <They say this alg is O(1), but I think that is not accurate, as the above does all states for O(n^3), but in this algorithm is O(1) for each transition in each episode, so applied to all states its at least O(n), and then the above doesn’t have to be recomputed whereas this does, so as the number of episodes grows, the amortized cost of Q-Cut is constant, while this is linear>

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: