Intrinsically Motivated Hierarchical Skill Learning in Structured Environments. Vigorito, Barto. IEEE Transactions on Autonomous Mental Development 2010


  1. Covers intrinsic motivation for learning in hierarchical environments
  2. “Using Bayesian network structure-learning techniques and structured dynamic programming algorithms, we show that reinforcement learning agents can learn incrementally and autonomously both the causal structure of their environment and a hierarchy of skills that exploit this structure.”
  3. Motivation is “ensemble” <what I think of as transfer> learning
  4. Mentions structured value iteration for VI in domains with compact representations
  5. Factored representations lead to sparseness
  6. Naturally, mention VISA by Jonsson, Barto
  7. But VISA needs a model of the domain to work from (in the form of DBN).  The goal here is to do model building so that assumption can be removed
  8. Produces results that are recursively optimal but may not be hierarchically optimal – full solution may be suboptimal, even though policy for each option is optimal
  9. Optimize DBN according to the Bayesian Information Criterion (BIC)
    1. Doing this optimally is intractable, they use a greedy heuristic
    2. Basically involves building decision trees using BIC (use a chi-squared test)
  10. Actions are purely explorative; are selected s.t. leaf (leaves) of <s,a> maximize change in entropy in leaf
  11. In original VISA, the algorithm <ironically> doesn’t explore thoroughly enough: action selection is myopic, improving knowledge of the DBN only as far as the current state is concerned, so there is no real directed exploration in the way that RMAX does it. They mention a few possible fixes <although it’s not clear if any of them are “smart” in an RMAX sense>
    1. Mention work by Schmidhuber for intrinsic motivation (and its problems) <but why not RMAX as it does things correctly?>
    2. Then mention stuff by Barto that is better, but isn’t designed for factored domains
    3. <Oh, they will go into KWIK, Carlos’ stuff later in related work>
  12. To do learning, maintain a set C of what are called controllable variables: variables that the agent knows how to set to any value they can take
  13. When choosing an action, look for features whose model may change if more information is gathered.  Then make sure that the feature’s ancestors in the DBN are controllable.  If so, set up a plan to try and change the feature
  14. So how do you know when to stop refining your model for a feature in stochastic environments?
  15. <It’s a little unclear to me how this happens.  There is some expectation for the probability that the agent will be able to change a feature to a particular value.  If its observed success ratio falls some margin below that, it is abandoned?  This doesn’t make sense so I must not understand it correctly>
  16. Looks like they do the RMAX trick, but the feature they pick to assign value to is the one that has controllable sources and the highest potential change in entropy <I thought Bayes info – is it the same?>
  17. “When options happen to be created prematurely and are malformed, their lack of utility is discovered fairly quickly by the agent when it attempts to use those options in its experimental plans and they fail repeatedly. These options will be removed from the agent’s skill set until the agent performs more experiments relevant to discovering their structure, at which point they will be re-created and tested in further experiments. Once a correct option is learned, its empirical success rate will on average match its expected success rate, and the option will remain in the agent’s skill set to be used in all further experiments.”
  18. Experimental results
  19. The light box domain has 20 “lights” or binary variables: ~1 million raw states, 20 million <s,a> pairs
    1. Lights are separated into categories
      1. Circular, which are controlled directly by their switch
      2. Triangular, which are turned on if a particular set of circular lights are on (with own respective switch)
      3. Rectangular which depend on triangular (with own respective switch)
      4. Diamond which depend on rectangular (with own respective switch)
    2. Stochastic; actions “work” with p=0.9
    3. “However, if an action is taken to toggle a light whose dependencies are not currently satisfied, the entire domain is reset to all lights being off.”
      1. No way random actions can get anywhere
    4. Learning must be active so that exploration can be done in an intelligent directed manner
  20. <Seems to learn effective, but not “correct”, policies: dependencies that are flat (1, 2, 3 all needed on to turn on 4, but 1, 2, 3 independent of each other) end up being constructed serially, so it looks like they are dependent (1, then 2, then 3, even though they could be done in any order)>
  21. Not surprisingly, global (directed) exploration is the only thing that works in this domain
  22. Planning time with options + primitives stays flat with increasing problem complexity (level in the hierarchy: circular, triangular, rectangular, diamond), while planning with primitives only has exponentially increasing cost with increasing complexity
  23. Mention KWIK algorithms for DBNs, and knock them for requiring a bounded in-degree (and having cost exponential in it), but you must have that in general problems to get an optimal solution; the greedy approach here only works in submodular domains
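
To make the light box numbers and dynamics concrete, here is a minimal sketch of the domain as described in the notes above. The state/action counts follow directly from 20 binary lights with one toggle action per light. Everything else is an assumption for illustration: the 5-light dependency graph `PARENTS` is hypothetical (it is not the paper's actual layout), the names `step` and `p_success` are mine, and I assume a failed action (the other 10% of the time) leaves the state unchanged.

```python
import random

N_LIGHTS = 20
N_STATES = 2 ** N_LIGHTS               # 1,048,576 raw states (~1 million)
N_STATE_ACTIONS = N_STATES * N_LIGHTS  # 20,971,520 <s,a> pairs (~20 million)

# Hypothetical toy graph (NOT the paper's layout): lights 0-2 play the role of
# "circular" lights with no dependencies, light 3 is "triangular" and needs
# 0, 1 and 2 on, and light 4 is "rectangular" and needs 3 on.
PARENTS = {0: (), 1: (), 2: (), 3: (0, 1, 2), 4: (3,)}


def step(state, action, p_success=0.9, rng=random):
    """Apply the toggle action for light `action` to a tuple of booleans.

    If the light's dependencies are not all on, the whole domain resets to
    all lights off. Otherwise the toggle works with probability `p_success`;
    on failure the state is left unchanged (an assumption of this sketch).
    """
    if not all(state[p] for p in PARENTS[action]):
        return (False,) * len(state)
    if rng.random() < p_success:
        return state[:action] + (not state[action],) + state[action + 1:]
    return state
```

With this reset rule, a single toggle of a light whose parents are not all on wipes out all progress, which is why undirected exploration struggles as the hierarchy gets deeper.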