Category Archives: Dynamic Bayes Nets

Learning probability distributions over partially-ordered human everyday activities. Tenorth, De la Torre, Beetz. ICRA 2013

  1. Attempt “… to learn the partially ordered structure inherent in human everyday activities from observations by exploiting variability in the data.”
  2. Learns the full joint probability distribution over the actions that make up a task, their partial ordering, and their parameters (a simplified sketch of the ordering part appears after this list)
  3. Can be used for classification, but also for figuring out which actions are relevant to a task, which objects are used, whether the task was done correctly, or what is typical for an individual
  4. Use synthetic data as well as TUM Kitchen and CMU MMAC
  5. Uses Bayesian Logic Networks (another paper with author overlap uses the same approach)
  6. Common alternate approaches are HMMs, conditional random fields (CRFs), or suffix trees
    1. But these are most effective when the ordering of the subtasks is pretty fixed
    2. Also, the Markov assumption of HMMs doesn’t really hold in the way the data is often represented, and may require all of the history information
  7. Also some other approaches for representing partial ordering
    1. <Whatever this means> “All these approaches focus only on the ordering among atomic action entities, while our system learns a distribution over the order as well as the action parameters.”
  8. Literature on partially ordered plans requires lots of prior information, and has been applied to synthetic problems
  9. Working off the TUM dataset, they needed to resort to bagging and injecting noise into the data in order to get enough samples
  10. Needs annotation / data labeling
  11. Learns, for example, that after making brownies (the CMU MMAC dataset) some people like to put the frying pan in the sink, and others on the stove
  12. Performance of this approach is much better than that of CRFs, and is more robust to variability in how people undertake tasks
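
The paper’s actual model is a Bayesian Logic Network, which I won’t try to reproduce here, but the core idea of learning a distribution over the ordering from observations can be illustrated with a much simpler sketch: count, for each pair of actions, how often one precedes the other across episodes. Everything below (the helper name, the toy episodes) is mine, not from the paper.

```python
# Minimal illustration (NOT the paper's BLN machinery): estimate how strongly
# ordered each pair of actions is by counting precedences across episodes.
from collections import defaultdict
from itertools import combinations

def pairwise_order_probs(episodes):
    """episodes: list of action-label sequences, e.g. [["take_bowl", "stir", ...], ...]"""
    before = defaultdict(int)  # before[(a, b)]: episodes where a's first occurrence precedes b's
    both = defaultdict(int)    # both[(a, b)]:   episodes containing both a and b
    for ep in episodes:
        first_idx = {}
        for i, act in enumerate(ep):
            first_idx.setdefault(act, i)
        for a, b in combinations(sorted(first_idx), 2):
            both[(a, b)] += 1
            if first_idx[a] < first_idx[b]:
                before[(a, b)] += 1
    # P(a precedes b | both occur): ~1 or ~0 means a hard ordering constraint,
    # ~0.5 means the pair is effectively unordered (the "partial" part).
    return {(a, b): before[(a, b)] / n for (a, b), n in both.items()}

episodes = [
    ["take_bowl", "crack_egg", "stir", "pour"],
    ["take_bowl", "stir", "crack_egg", "pour"],   # egg/stir order varies across people
    ["take_bowl", "crack_egg", "stir", "pour"],
]
print(pairwise_order_probs(episodes))
```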

Intrinsically Motivated Hierarchical Skill Learning in Structured Environments. Vigorito, Barto. IEEE Transactions on Autonomous Mental Development 2010

  1. Covers intrinsic motivation for learning in hierarchical environments
  2. “Using Bayesian network structure-learning techniques and structured dynamic programming algorithms, we show that reinforcement learning agents can learn incrementally and autonomously both the causal structure of their environment and a hierarchy of skills that exploit this structure.”
  3. Motivation is “ensemble” <what I think of as transfer> learning
  4. Mentions structured value iteration for VI in domains with compact representations
  5. Factored representations lead to sparseness
  6. Naturally, mention VISA by Jonsson, Barto
  7. But VISA needs a model of the domain to work from (in the form of a DBN).  The goal here is to do model building so that this assumption can be removed
  8. Produces results that are recursively optimal but may not be hierarchically optimal – full solution may be suboptimal, even though policy for each option is optimal
  9. Optimize DBN according to Bayes Information Criterion (BIC)
    1. Doing this optimally is intractable, they use a greedy heuristic
    2. Basically involves building decision trees using BIC (with a chi-squared test); a sketch of BIC-scoring a candidate split appears after this list
  10. Actions are purely explorative; they are selected such that the leaf (or leaves) reached by <s,a> maximize the change in entropy at that leaf
  11. In the original VISA, the algorithm <ironically> doesn’t explore thoroughly enough: action selection is myopic, improving knowledge of the DBN only as far as the current state is concerned, so there is no real directed exploration in the way RMAX does it.  They mention a few possible fixes <although it’s not clear if any of them are “smart” in an RMAX sense>
    1. Mention work by Schmidhuber for intrinsic motivation (and its problems) <but why not RMAX as it does things correctly?>
    2. Then mention stuff by Barto that is better, but isn’t designed for factored domains
    3. <Oh, they will go into KWIK, Carlos’ stuff later in related work>
  12. To do learning, maintain a set C of what are called controllable variables: variables that the agent knows how to set to any value the feature can take
  13. When choosing an action, look for features whose models may change if more information is gathered.  Then make sure that the feature’s ancestors in the DBN are controllable.  If so, set up a plan to try to change the feature
  14. So how do you know when to stop refining your model for a feature in stochastic environments?
  15. <It’s a little unclear to me how this happens.  There is some expectation for the probability that the agent will be able to change a feature to a particular value.  If it fails to do so at a rate some amount below that, is it abandoned?  This doesn’t make sense, so I must not understand it correctly>
  16. Looks like they do the RMAX trick, but the feature they pick to assign value to is the one that has controllable sources and the highest potential change in entropy <I thought it was the Bayes information criterion; is it the same thing?>
  17. “When options happen to be created prematurely and are malformed, their lack of utility is discovered fairly quickly by the agent when it attempts to use those options in its experimental plans and they fail repeatedly. These options will be removed from the agent’s skill set until the agent performs more experiments relevant to discovering their structure, at which point they will be re-created and tested in further experiments. Once a correct option is learned, its empirical success rate will on average match its expected success rate, and the option will remain in the agent’s skill set to be used in all further experiments.”
  18. Experimental results
  19. The light box domain has 20 “lights” or binary variables: ~1 million raw states, 20 million <s,a> pairs (a sketch of the domain appears after this list)
    1. Lights are separated into categories
      1. Circular, which are controlled directly by their switch
      2. Triangular, which are turned on if a particular set of circular lights are on (with own respective switch)
      3. Rectangular which depend on triangular (with own respective switch)
      4. Diamond which depend on rectangular (with own respective switch)
    2. Stochastic; actions “work” with p=0.9
    3. “However, if an action is taken to toggle a light whose dependencies are not currently satisfied, the entire domain is reset to all lights being off.”
      1. No way random actions can get anywhere
    4. Learning must be active so that exploration can be done in an intelligent directed manner
  20. <Seems to learn effective, but not “correct” policies; dependencies that are flat (1, 2, 3 all needed on to turn on 4, but 1, 2, 3 independent of each other) end up being constructed serially, so it looks like they are dependent (1, then 2, then 3, even though they could be done in any order)>
  21. Not surprisingly, global (directed) exploration is the only thing that works in this domain
  22. Planning time with options+primitives is flat with increasing problem complexity (level in the hierarchy: circular, triangular, rectangular, diamond), while planning with primitives only has exponentially increasing cost with increasing complexity
  23. Mentions KWIK algorithms for DBNs, knocking them for allowing only limited in-degree (with cost exponential in it), but you must have that in general problems to get an optimal solution; the greedy approach here only works in submodular domains
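
Two sketches for my own reference. First, the BIC-driven refinement in item 9: a simplified version of scoring a single candidate decision-tree split for one feature’s transition model. The paper’s actual procedure (including the chi-squared test) is more involved; the helper names and the exact penalty form below are my assumptions.

```python
# Greedy BIC refinement sketch (my simplification, not the paper's exact algorithm):
# compare the BIC score of a leaf's outcome distribution before and after splitting
# on a candidate parent feature; keep the split only if the score improves.
import math
from collections import Counter

def leaf_log_likelihood(outcomes):
    """Log-likelihood of outcomes under their empirical (maximum-likelihood) distribution."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return sum(c * math.log(c / n) for c in counts.values())

def bic(outcomes, n_values, total_n):
    """BIC = log-likelihood - (free parameters / 2) * log(total sample size)."""
    return leaf_log_likelihood(outcomes) - ((n_values - 1) / 2) * math.log(total_n)

def split_gain(samples, candidate, n_values=2):
    """samples: list of (state_dict, next_value_of_target). Score splitting on `candidate`."""
    total_n = len(samples)
    unsplit = bic([nv for _, nv in samples], n_values, total_n)
    split_score = 0.0
    for val in {s[candidate] for s, _ in samples}:
        subset = [nv for s, nv in samples if s[candidate] == val]
        split_score += bic(subset, n_values, total_n)
    return split_score - unsplit   # > 0 means refining the tree on `candidate` is worth it

# Toy usage: the target's next value perfectly tracks light_0, so the split pays off.
samples = [({"light_0": 1}, 1), ({"light_0": 1}, 1), ({"light_0": 0}, 0), ({"light_0": 0}, 0)]
print(split_gain(samples, "light_0"))
```

Second, the light box domain from item 19, as I understand it from the description above: 20 binary lights in four tiers, each with its own toggle action, toggles succeeding with p=0.9, and a global reset to all-off whenever a toggle is attempted on a light whose dependencies aren’t satisfied. The particular tier sizes and dependency sets are placeholders I made up; the paper defines its own.

```python
# Light box sketch: 20 binary lights. Circular lights (no dependencies) toggle
# directly; triangular depend on circular, rectangular on triangular, diamond on
# rectangular. A toggle succeeds with p = 0.9; toggling a light whose dependencies
# are not all on resets the whole domain. Tier split and dependency sets are assumed.
import random

# lights 0-8: circular, 9-14: triangular, 15-18: rectangular, 19: diamond (assumed split)
DEPENDENCIES = {
    9: {0, 1}, 10: {2, 3}, 11: {4}, 12: {5, 6}, 13: {7}, 14: {8},   # triangular
    15: {9, 10}, 16: {11, 12}, 17: {13}, 18: {14},                  # rectangular
    19: {15, 16, 17, 18},                                           # diamond
}

class LightBox:
    def __init__(self, n_lights=20, p_success=0.9, seed=None):
        self.n = n_lights
        self.p = p_success
        self.rng = random.Random(seed)
        self.state = [0] * n_lights

    def step(self, action):
        """action = index of the light whose switch is toggled; returns the new state."""
        deps = DEPENDENCIES.get(action, set())
        if not all(self.state[d] for d in deps):
            self.state = [0] * self.n          # unsatisfied dependency: global reset
        elif self.rng.random() < self.p:       # otherwise the toggle succeeds with p
            self.state[action] ^= 1
        return tuple(self.state)

env = LightBox(seed=0)
for a in [0, 1, 9, 19]:   # the last toggle hits unmet dependencies and wipes out all progress
    print(a, env.step(a))
```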