Learning Policies for Contextual Submodular Prediction. Ross, Zhou, Yue, Dey, Bagnell. (ICML 2013) Talk.


From: http://techtalks.tv/talks/learning-policies-for-contextual-submodular-prediction/58145/

  1. Related to RL but not an RL talk – has applications to lots of domains
  2. For robot motion planning, a common method is to start with some start state, produce a simple seed trajectory (perhaps based on features of the start state), and then iteratively correct the trajectory so that it is out of collision via some local optimization.  If you can't do this, get a new seed trajectory and start again.
  3. The problem: due to time constraints, you can only work from a small number of seeds, and you want to find any collision-free path
  4. Related tasks: predicting news articles to read based on search history
  5. Both seed trajectories and recommended articles should have some sort of diversity
  6. These tasks are monotone submodular: adding seeds or recommended articles can only help, but the marginal benefit diminishes (see the greedy sketch after this list)
  7. The greedy approach gets within a (1 - 1/e) ~ 63% factor of optimal
  8. Extends standard machine learning from making single predictions to making lists of predictions
  9. Here you don't know what the evaluation function is, but you can learn which greedy choices work well on the training set (based on features) and apply the same rules to the test set
    1. More specifically, make guesses based on environmental features as well as features of the current list of seeds
  10. SCP: Submodular Contextual Policy
  11. Iterative approach (a toy sketch of this loop appears after the list):
    1. Start with some policy for constructing lists (perhaps greedy), which proposes seeds
    2. Based on the set of seeds, build a new dataset from the environmental features together with features of the items currently in the list
    3. The learner tries to predict the marginal benefit of adding a particular seed from both sets of features above
    4. (Later seeds have to be upweighted)
    5. This is then run through a standard cost-sensitive learner (in the RL-reduction style), which gives a new policy for selecting seeds
    6. Return to step 1
  12. Still achieves ~63% of optimal, plus an extra error term from the supervised learning step <mentions that if it's a no-regret learner that term goes away, but what does that mean in this context?>
    1. Basically treats list generation as a bandit task
    2. Competes with the result from an oracle <pretty amazing>
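
To make items 6 and 7 concrete, here is a minimal sketch of greedy list-building under a toy monotone submodular objective. The coverage function is my own stand-in, not an example from the talk: each candidate seed covers a set of scenarios, and a list's value is the number of distinct scenarios covered.

    # Toy monotone submodular objective: a list's value is the number of
    # distinct scenarios its items cover. Adding an item never hurts, but
    # the marginal benefit shrinks as more scenarios are already covered.

    def coverage_value(selected, covers):
        covered = set()
        for item in selected:
            covered |= covers[item]
        return len(covered)

    def greedy_list(candidates, covers, budget):
        # Repeatedly add the candidate with the largest marginal benefit.
        # For monotone submodular objectives, the greedy list achieves at
        # least a (1 - 1/e) ~ 63% fraction of the optimal list's value.
        selected = []
        for _ in range(budget):
            base = coverage_value(selected, covers)
            best = max(candidates,
                       key=lambda c: coverage_value(selected + [c], covers) - base)
            selected.append(best)
        return selected

    # Example: "c" is redundant given "a", so greedy picks "a" then "b".
    covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
    print(greedy_list(list(covers), covers, budget=2))  # -> ['a', 'b']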
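
And here is a rough sketch of the iterative training loop in item 11. It is a toy under loud assumptions: the features in featurize, the marginal-benefit callback, and the use of a plain regressor (sklearn's SGDRegressor) in place of a proper cost-sensitive learner are all my stand-ins, not the paper's actual construction.

    import random
    from sklearn.linear_model import SGDRegressor

    def featurize(env, current, candidate):
        # Stand-in features: the environment's score for this candidate,
        # the current list length, and whether the candidate is already in.
        return [env[candidate], len(current), float(candidate in current)]

    def train_scp(envs, candidates, benefit, budget, n_iters=10):
        policy = SGDRegressor()  # stand-in for the cost-sensitive learner
        trained = False
        for _ in range(n_iters):
            X, y, w = [], [], []
            for env in envs:
                current = []
                for pos in range(budget):
                    # Step 1: roll out the current policy to propose seeds
                    # (random picks before the first fit).
                    if trained:
                        pick = max(candidates, key=lambda c: policy.predict(
                            [featurize(env, current, c)])[0])
                    else:
                        pick = random.choice(candidates)
                    # Steps 2-4: record (features, marginal benefit) for each
                    # candidate at this position, upweighting later positions.
                    for c in candidates:
                        X.append(featurize(env, current, c))
                        y.append(benefit(env, current, c))
                        w.append(pos + 1)
                    current.append(pick)
            # Step 5: refit to get a new seed-selection policy, then repeat.
            policy.fit(X, y, sample_weight=w)
            trained = True
        return policy

    # Toy usage: each environment scores the candidate seeds; repeats add nothing.
    envs = [{"a": 3.0, "b": 2.0, "c": 1.0}, {"a": 1.0, "b": 3.0, "c": 2.0}]
    marginal = lambda env, current, c: 0.0 if c in current else env[c]
    policy = train_scp(envs, ["a", "b", "c"], marginal, budget=2)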