Autonomous Robot Skill Acquisition. Konidaris. Dissertation

  1. (x) Dissertation is concerned with skill acquisition in high-dimensional continuous spaces
  2. (xi) Based on Hierarchical RL
  3. Proposes skill chaining “a general skill discovery method for continuous domains”
  4. Introduces abstraction selection, which selects “skill specific compact representations from a library of representations when creating a new skill”
    1. This can be combined with skill chaining
  5. (xii) Finally, it generally takes a long time to develop effective policies, which is prohibitive when working on robots.  The algorithm CST is introduced, which creates skill trees from human demonstration
  6. (13) Dissertation relies on linear TD(λ) VFA with Fourier bases?
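For my own reference, a minimal sketch of a Fourier basis VFA feature map (my own illustration, not code from the dissertation): an order-n basis over a d-dimensional state normalized to [0, 1]^d uses one cosine feature per coefficient vector in {0, …, n}^d, giving (n + 1)^d features — order 5 over pinball’s 4-D state gives 6^4 = 1296, matching the per-action count reported later.

```python
import numpy as np
from itertools import product

def fourier_basis(state, order):
    """Fourier basis features for a state normalized to [0, 1]^d.

    One feature cos(pi * c . s) for every coefficient vector c with
    entries in {0, ..., order}, giving (order + 1)^d features total.
    """
    d = len(state)
    coeffs = np.array(list(product(range(order + 1), repeat=d)))
    return np.cos(np.pi * coeffs @ np.asarray(state, dtype=float))
```

The normalization of the state to the unit hypercube is assumed here; the c = 0 vector always yields a constant feature of 1.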
  7. (16) Then moves on to policy search – not clear which he’s going with yet
    1. Says right now the most effective form of direct search over parameters for policy search is the cross-entropy method, per Mannor’s and Szita’s papers
    2. (18) Discusses policy gradient
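A hedged sketch of cross-entropy policy search (my own toy version, not the dissertation’s or those papers’ implementations): sample parameter vectors from a Gaussian, score each by estimated return, then refit the Gaussian to the elite fraction. The `score` callable is a placeholder for whatever return estimate the domain provides.

```python
import numpy as np

def cross_entropy_search(score, dim, iters=50, pop=100, elite_frac=0.2, seed=0):
    """Toy cross-entropy method: iteratively refit a diagonal Gaussian
    over policy parameters to the highest-scoring samples."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        thetas = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([score(t) for t in thetas])
        elites = thetas[np.argsort(scores)[-n_elite:]]  # top elite_frac
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # jitter avoids premature collapse
    return mu
```

In the RL setting `score` would roll out the parameterized policy and return an episode-return estimate, which makes each iteration expensive — consistent with the note that policy development is slow on robots.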
  8. (20) Options, TD in resulting semi-MDPs
  9. (21) Options here only deal with subgoal options, listing of different ways in the literature of selecting goal states
    1. Visit frequency and reward gradient
    2. Visit frequency on successful trajectories
    3. Variable change frequency (how many components of a state vector change there?)
    4. Novelty? Salience?
    5. Clustering and value gradients
    6. Local graph partitioning
    7. Min cuts
    8. Causal decomposition
    9. Trajectory analysis with DBNs
    10. Examining commonalities in policies in a single state space
  10. (23) Discusses deliberative vs reactive architectures, seems relevant, but citation comes from a book so it would take me time to fish out
  11. Things like the subsumption architecture were effective, but require real engineering to get everything to work, not learning
  12. (26) Says PEGASUS was used for Ng’s helicopter flight – forgot that
  13. (29) An overview of other similar methods from robotics, does not seem entirely relevant to my current research direction so I won’t list here
  14. (32) In summary of the current state of learning in robotics: there were no preceding methods that identify target skills (I suppose he means in the form of options as opposed to global policies).  They also rely on human expertise to engineer the state space, and generally on expert training
    1. Skills that are learned are generally not integrated back into control architecture
  15. (35) Options are useful in transfer learning
  16. (36) Also says options are good for working in continuous spaces, as regressors (and I suppose VFAs) only have to fit the region of the MDP where the option is defined, as opposed to globally
  17. Some items must be reconsidered when using options in continuous domains
    1. A target state won’t work, because in general it will be impossible to hit any real-valued state exactly.  Simple definitions of regions are also problematic because interesting goal states are often hard to reach; a wide region may allow for targets that don’t actually result in a state that helps planning
    2. Difficult to work on statistics of continuous spaces (this is why I went with open-loop methods in my research)
    3. A similar issue is that of initiation sets.  How do you figure out from which start states the desired goal state is reachable?
    4. How do you represent the value function? Should be compact enough but rich enough to actually represent V*
    5. Characterization – options are useful across problems if solutions to the various problems involve the same states.  How do you define whether solution trajectories have the same states in them when working in continuous spaces?
  18. (38) Skill chaining.  Seems like it’s defined to work only for goal-based episodic tasks
    1. Looks like it works backwards from the goal state, identifying a region from which the goal can be reached reliably.  From there, I suppose the idea is to define a preceding region whose own goal states lie inside the region covered by the subsequent option
  19. (38) Can find the initialization states of the options by running trajectories according to the option’s policy, and seeing where the initial state leads to success and where it leads to failure.  Then these initial states and binary results can be fed into a classifier to determine what an appropriate set of start states is
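A rough sketch of that classifier step, assuming a plain logistic regression fit by gradient ascent (the function and its parameters below are my own illustration; the dissertation’s exact setup may differ):

```python
import numpy as np

def learn_initiation_set(states, successes, lr=0.5, steps=2000):
    """Fit a logistic-regression classifier mapping trial start states
    to success/failure of the option's policy; the learned positive
    region serves as the option's initiation set."""
    X = np.hstack([np.asarray(states, dtype=float),
                   np.ones((len(states), 1))])  # append bias column
    y = np.asarray(successes, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted success prob
        w += lr * X.T @ (y - p) / len(y)      # log-likelihood gradient
    return lambda s: 1.0 / (1.0 + np.exp(-(np.append(s, 1.0) @ w))) > 0.5
```

The returned predicate answers the membership query “may this option be initiated from state s?”, which is all the option framework needs at decision time.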
  20. (41) Can also make skill trees instead of simple chains – specifically branches are not allowed to overlap
  21. (42) Instead of simply working from end to beginning, can also have other intermediate states targeted by the various heuristics that exist for defining “interesting” intermediate goal states
  22. (44) Results in Pinball are done with SARSA with linear VFA and Fourier basis (1296 basis functions per action)
    1. “Option policy learning was accomplished using Q-learning” Not clear why one was SARSA and one was QL and exactly what the distinction is
    2. Although I think SARSA was used to determine when to select the option and QL was used to find the Q-function in each individual policy?
    3. Initialization sets were learned via logistic regression
    4. Props for detailing parameters and implementation, but it looks like this is not a simple method to run and tune
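My understanding of the SARSA/Q-learning distinction, as a minimal tabular sketch (tabular for clarity; the dissertation uses linear VFA): SARSA is on-policy and bootstraps from the action actually taken next, Q-learning is off-policy and bootstraps from the greedy action.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action a2 actually taken
    in the next state s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy next action."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
```

That would fit my guess above: the exploring agent choosing among options updates on-policy (SARSA), while each option’s internal policy can be learned off-policy (Q-learning) from whatever trajectories pass through it.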
  23. (53) Mentions similarity of LQR trees
  24. (54) His method is in the “full” RL learning setting
  25. (55) Pinball is low-dimensional, but robotics is very large, so other techniques will be needed to move to larger problems
  26. (56) Now moving to real-time high dimensional domains
  27. Often even if a domain is high-dimensional, subproblems don’t rely on all dimensions and can therefore be addressed more simply
  28. (57) Argue that for large tasks, behaving in the flat domain is intractable, but also that learning/building abstractions can also be prohibitively expensive, especially in robotics domains.
    1. Proposal is to provide the agent with a library of abstractions which it can use
    2. This section focuses on abstraction selection; given that this library exists, how do you choose the appropriate one?
  29. Abstraction selection is framed as a model selection problem
    1. By model selection, he means basis function selection
    2. Discusses Bayesian Information Criterion as a method for doing this, not noting the particulars of this
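A minimal sketch of BIC-based selection under a Gaussian-noise least-squares assumption (the `select_abstraction` helper and its signature are my own naming, not from the dissertation):

```python
import numpy as np

def bic(rss, n, k):
    """BIC for a least-squares fit with Gaussian noise:
    n * ln(RSS / n) + k * ln(n).  Lower is better; the k * ln(n) term
    penalizes abstractions with more basis functions."""
    return n * np.log(rss / n) + k * np.log(n)

def select_abstraction(fits, n):
    """Pick the abstraction whose fit minimizes BIC; `fits` is a list of
    (residual_sum_of_squares, num_basis_functions) pairs over n samples."""
    return min(range(len(fits)), key=lambda i: bic(fits[i][0], n, fits[i][1]))
```

The point of the penalty term is exactly the trade-off in the text: a richer basis-function set will almost always fit the data a bit better, so the criterion only prefers it when the fit improvement outweighs the added complexity.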
  30. (64) Costs are quadratic, but it must be ok as he runs it on a robot – ah initial experiments are on the combinatoric “playroom” domain
  31. (74) Lots of stuff on tuning, basis function selection that is good to write about but not particularly relevant for me
  32. (80) Abstraction selection also not relevant exactly at the moment but probably will be later on
  33. (84) If we are given demonstrations, how do we take advantage of that information
  34. In particular, given that we consider skill chaining, how do we turn these complete trajectories into smaller pieces that are simpler and reusable?  This is called the multiple changepoint detection problem. Some ways to do so are:
    1. When the best state abstraction changes
    2. When the value function gets too complex to represent with a single option
  35. Here CST is introduced, an incremental, constant-cost algorithm that solves this problem
  36. (85) Changepoints based on Viterbi (HMM)
  37. (88) Changepoint detection done according to sets of basis functions associated w/each abstraction as models, the target variable is the return
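CST itself runs an online Viterbi-style pass over an HMM whose states are the candidate models; as a much-simplified offline stand-in (entirely my own toy, not CST), here is a single-changepoint search that splits a trajectory where two separate linear fits of the target (the return) leave the least residual error:

```python
import numpy as np

def best_changepoint(x, y, min_len=3):
    """Toy single-changepoint search: try every split index and keep the
    one where fitting a separate line to each side leaves the smallest
    total residual sum of squares."""
    def rss(xs, ys):
        A = np.vstack([xs, np.ones_like(xs)]).T          # [x, 1] design
        resid = ys - A @ np.linalg.lstsq(A, ys, rcond=None)[0]
        return float(resid @ resid)
    best_t, best_err = None, np.inf
    for t in range(min_len, len(x) - min_len):
        err = rss(x[:t], y[:t]) + rss(x[t:], y[t:])
        if err < best_err:
            best_t, best_err = t, err
    return best_t
```

The real algorithm differs in the ways the notes above describe: it is incremental, it compares whole model classes (one basis-function set per abstraction) rather than two lines, and it handles multiple changepoints.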
  38. (92) Segmenting should be done with a low-order FA because the amount of data is usually fairly small
  39. (101) This was performed successfully on pinball (beats other logical alternatives) and a real (humanoid) robot
  40. (108) Now moving onto robot problem without human demonstration
  41. (111) For task-level planning it considers a discrete state space
  42. Planning on robot is online and real time
  43. (123) Robotics domains are to his own admission fairly simple
  44. (126) Work here uses options but not option hierarchies, algs like MAXQ and HAM build hierarchies though.
    1. Comments that it’s difficult to build such hierarchies just from interactions with the environment?
    2. Also, the size of the state space increases w/the size of the hierarchy
  45. Ideally:
    1. As levels of the hierarchy increase, problem complexity reduces
    2. Each level forms an MDP
    3. All levels (except maybe the lowest) are discrete
    4. Each state either has a symbolic representation or can be converted to one
    5. Each level (except perhaps the first) has enough information to allow planning without a model of the actual environment
  46. Good stuff
