Category Archives: Continuous state spaces

Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization. Andersson, Heintz, Doherty. AAAI 2015

  1. Working on high-D continuous RL
  2. Builds a model with sparse Gaussian processes, and then does local (re)planning “by solving it as a constrained optimization problem”
  3. Use MPC/control related methods that were done back in ’04 but revisited here and can be used for real-time control now
  4. Test in “extended” cart-pole <all this means here is the start state is randomized> and  quadcopter
  5. Don’t try to do MCTS, because it is expensive.  Instead use gradient optimization
  6. Instead of normal O(n^3) costs for GPs, this has O(m^2n), whre m < n
  7. “However, as only the immediately preceding time steps are coupled through the equality constraints induced by the dynamics model, the stage-wise nature of such modelpredictive control problems result in a block-diagonal structure in the Karush-Kuhn-Tucker optimality conditions that admit efficient solution. There has recently been several highly optimized convex solvers for such stage-wise problems, on both linear (Wang and Boyd 2010) and linear-timevarying (LTV) (Ferreau et al. 2013; Domahidi et al. 2012) dynamics models.”
  8. Looks like the type of control they use has to linearize the model locally
  9. “For the tasks in this paper we only use quadratic objectives, linear state-action constraints and ignore second order approximations.”
  10. Use an off-the shelf convex solver for doing the MPC optimization
  11. Use warm starts for replanning
  12. The optimization converges in a handful of steps
  13. <Say they didn’t need to do exploration at all for the tasks they considered, but it looks like they have a pure random action period at first>
  14. Although the cart-pole is a simple task, they learn it in less than 5 episodes
    1. <But why no error bars, especially when this experiment probably takes a few seconds to run.  This is crazy in a paper from 2015, although it is probably fine it makes me wonder if it sometimes fails to get a good policy>
  15. Use some domain knowledge to make learning the dynamics for the quadcopter a lower-dimensional problem
    1. 8D state, 2D action
  16. For quadcopter there is training data from a real quadcopter? fed in and then it is run in simulation
  17. “By combining sparse Gaussian process models with recent efficient stage-wise solvers from approximate optimal control we showed that it is feasible to solve challenging problems in real-time.”

Autonomous reinforcement learning on raw visual input data in a real world application. Lange, Riedmiller, Voigtlander. Neural Networks (IJCNN) 2012

  1. Learn slot car racing from raw camera input
  2. Deep-fitted Q
  3. The slot car problem is interesting in particular because as being rewarded for speed, the best policies are the closest to failure (losing control of the car)
  4. Time resolution at about 0.25s
  5. Camera input is 52×80 = 4160d (greyscale it seems?)
  6. They do pretraining: “The size of the input layer is 52×80 = 4160 neurons, one for each pixel provided by the digital camera. The input layer is followed by two hidden layers with 7×7 convolutional kernels each. The first convolutional layer has the same size as the input layer, whereas the second reduces each dimension by a factor of two, resulting in 1 fourth of the original size. The convolutional layers are followed by seven fully connected layers, each reducing the number of its predecessor by a factor of 2. In its basic version the coding layer consists of 2 neurons.
    Then the symmetric structure expands the coding layer towards the output layer, which shall reproduce the input and accordingly consists of 4160 neurons.”
  7. “Training: Training of deep networks per se is a challenge: the size of the network implies high computational
    effort; the deep layered architecture causes problems with vanishing gradient information. We use a special two-stage training procedure, layer-wise pretraining [8], [29] followed by a fine-tuning phase of the complete network. As the learning rule for both phases, we use Rprop [34], which has the advantage to be very fast and robust against parameter choice at the same time. This is particularly important since one cannot afford to do a vast search for parameters, since training times of those large networks are pretty long.”
  8. Training is done with a set of 7,000 images, with the car moving a constant speed.
  9. “For the layer-wise pretraining, in each stage 200 epochs of Rprop training were performed.”
  10. “Let us emphasis the fundamental importance of having the feature space already partially unfolded after pretraining. A partially unfolded feature space indicates at least some information getting past this bottle-neck layer, although errors in corresponding reconstructions are still large. Only because the autoencoder is able to distinguish at least a few images in its code layer it is possible to calculate meaningful derivatives in
    the finetuning phase that allow to further “pull” the activations in the right directions to further unfold the feature space.”
  11. “Altogether, training the deep encoder network takes about 12 hours on an 8-core CPU with 16 parallel threads.”
  12. Because still images don’t include all stat information (position but not velocity) they need to use some method to allow the other parts of the state information in.  One option is to make the state the previous two frames, the other is to do a Kohonen Map <I assume this is what they do>
  13. “As the spatial resolution is non uniform in the feature space spanned by the deep encoder, a difference in the feature space is not necessarily a consistent measure for the dynamics of the system. Hence, another transformation, a Kohonen map … is introduced to linearize that space and to capture the a priori known topology, in this case a ring.”
    1. <Need to read this part over again>
  14. Use a paricular form of value FA I haven’t heard of before
  15. It seems this VFA may be an averager, not based strictly on an NN
  16. 4 discrete actions
  17. “Learning was done by first collecting a number of ’baseline’ tuples, which was done by driving 3 rounds with the constant safe action. This was followed by an exploration phase using an epsioln-greedy policy with epsilon= 0.1 for another 50 episodes. Then the exploration rate was set to 0 (pure exploitation). This was done until an overall of 130 episodes was finished. After each episode, the cluster-based Fitted-Q was performed until the values did not change any more. Altogether, the overall interaction time with the real system was a bit less than 30 minutes.”

Predictive Projections. Sprague. IJCAI 09.

  1. algorithm for linear “projections in which accurate predictions of future states can be made using simple nearest neighbor style learning.”
  2. Experiments on simulated pendulum and real roomba robot
  3. Considers video data
  4. Based on earlier work (by Tesauro among others) in “which nearest neighbor learning is recast in a probabilistic framework that allows for gradient based optimization of the distance metric.  These existing algorithms discover projections of the training data under which nearby points are likely to have the same class label or similar regression targets.”
    1. Neighborhood Components Analysis (NCA) – for nearest neighbor classification
  5. “It would be difficult to directly find an A that minimizes the k-nearest-neighbor classification error rate, because the number of errors will be a highly discontinuous function of A; very small changes in A may change the set of nearest neighbors for some points. The innovation behind the NCA algorithm is to recast nearest neighbor learning in a probabilistic framework. In this framework, expected error is a continuous, differentiable function of A and thus may be minimized using gradient based techniques.”
  6. NCA was originally designed for classification but was later extended to regression
  7. A minor tweak is needed to get NCA for regression to work well on predicting vectors when the problem is stochastic
  8. Here they use conjugate gradient to do the optimization
  9. The approach can scale well; a paper from ’07 was able to run on 60k training points
  10. “Following this gradient acts to reduce the objective in two different ways. It adjusts A to be more predictive by increasing the probability that a neighbor will be chosen if it successfully predicts the next state. It also adjusts A to be more predictable, by moving target states together whenever there is a high probability they will be chosen to predict each other.”
  11. Important to whiten data first
  12. To deal with actions, partition data by action but constrain to use same projection <how is this different than just throwing out the action data>
  13. Only parameters to the algorithm are size of projection matrix A and its initial values
  14. “The predictive projections algorithm as described above may not perform well in cases where the effects of different actions are restricted to specific state dimensions. Since there is no explicit penalty for failing to predict some dimensions, the algorithm may minimize the objective function by finding an A which is not full rank, thus accurately predicting some dimensions while discarding others.”
    1. Although there is a potential fixes for this
  15. For the applications this method is coupled with LSPI (along with RBFs)
  16. Algorithm deals well with Lagoudakis+Parrs’ LSPI
  17. For the roomba experiment they put a camera on top – all the robot needs to do is to continue to move forwards without hitting the wall
    1. <Seems like not a terribly hard task just turn when the wall takes up most of the view?>
  18. temp
  19. Image is 20×20: 1200-d once color is factored in
    1. Then this is reduced via PCA to 25d
    2. Then their approach takes that to 2d (A is initialized to the first 2 principal components)
  20. Their approach basically recovers translation and rotation, while PCA recovers amount of wall present and lighting (their approach can use this to make a good policy, but that can’t be done linearly from what PCA produces)
  21. “To our knowledge, the proto-value function framework has not been applied to the type of noisy, high dimensional control problems addressed in this paper. It seems likely that the neighborhood calculations required for constructing the diffusion model could be dominated by noise dimensions, particularly in very noisy tasks such as the modified pendulum domain described above. In that case, the PVF approach and
    predictive projections would be complementary: The PP algorithm could find a low dimensional state projection that contains relevant state information, and the PVF algorithm could then be used to discover a set of appropriate basis functions in that space”
  22. “Another closely related project is the basis iteration algorithm described in [Sprague, 2007]. This algorithm also uses gradient based metric learning to discover an appropriate projection, but it focuses directly on finding a metric that allows for accurate estimation of the optimal value function. It accomplishes this by iterating value function estimation with updates to the projection matrix. This algorithm has the advantage of incorporating reward information, but it depends on starting with an initial projection that enables a reasonableestimate of the optimal value function.”

State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Jonschowski, Brock. RSS 20Re10Q

related to

  1. Uses the fact that robots interact with the physical world to set constraints on how state representations are learnt
  2. Test on simulated slot car and simulated navigation task, with distractors
  3. How to extract a low dimensional representation relevant to the task being undertaken from high dimensional sensor data?
  4. The visual input in the experiments is 300-D
  5. From the perspective of RL
  6. “According to Bengio et al. [1], the key to successful representation learning is the incorporation of “many general priors about the world around us.” They proposed a list of generic priors for artificial intelligence and argue that refining this list and incorporating it into a method for representation learning will bring us closer to artificial intelligence.”
  7. “State representation learning is an instance of representation learning for interactive problems with the goal to find a mapping from observations to states that allows choosing the right actions. Note that this problem is more difficult than the standard dimensionality reduction problem, addressed by multi-dimensional scaling [14] and other methods [23, 29, 6] because they require knowledge of distances or neighborhood relationships between data samples in state space. The robot, on the other hand, does not know about semantic similarity of sensory input beforehand. In order to know which observations correspond to similar situations with respect to the task, it has to solve the reinforcement learning problem (see Section III), which it cannot solve without a suitable state representation.”
  8. State representation learning can be done by:
    1. Deep autoencoders
    2. SFA (and its similarity to proto-value functions)
    3. Predictability / Predictive actions.  Points to a Michael Bowling paper <I haven’t read – will check out>
  9. Bunch of nice references
  10. The Robotic priors they care about (these are all defined mathematically later):
    1. Simplicity: For any task, only a small number of properties matter
    2. Temporal coherence: important properties change gradually over time
    3. Proportionality: Amount of change in important properties is proportional to action magnitude
    4. Causality: important properties and actions determine the reward
    5. Repeatability: Same as causality but in terms of transition not reward
  11. These properties hold for robotics and physical systems but aren’t necessarily appropriate to all domains
    1. Even in robotics these sometimes don’t hold (for example, a robot running into a wall will have an abrupt change in velocity; once pushed against the wall proportionality doesn’t hold because no amount of pushing will allow the robot to move
  12. They set learning up as an optimization problem that includes loss terms for each of the priors above (just a linear combination)
  13. <these priors aren’t directly applicable to the mocap work I am doing now because they involve actions which we dont have access to>
  14. Formally they would need to compare all samples, leading to n^2 loss, but they restrict this to a window
  15. Linear mapping from observations to states
  16. Epsilon greedy exploration, although there is a bias to repeat the previous action
  17. Have distractors in their simulations
  18. In the navigation task they demonstrate an invariance to perspective by using either an overhead or first-person view from the robot
    1. Representations learned are highly similar
  19. Learned in the course of 5,000 observations
  20. The observations in the experiment are 300 pixels or 100 (not 300×300)
  21. For the simulated slot car task, the state sample matrix has rank 4.  Two large eigenvalues correspond to the position of the controlled car, and two smaller eigenvalues correspond to the position of the distractor car
    1. Ideally the distractor shouldn’t show up at all, but because of things like stochasticity and limited samples, weight can be placed on it to explain events that it is not related to
  22. They then to an RL comparison based on a number of different methods for learning the representation (5 features extracted by)
    1. Their approach
    2. SFA
    3. PCA
    4. Raw 300D representation
    5. (They also compare to ground truth representation)
  23. Use neural fitted Q
  24. screenshot
  25. SFA features are really terrible, their approach is pretty similar to operating from the ground truth
  26. With further investigation on the same results they demonstrate that their method has very good generalization properties
  27. Conjecture that the primary difference between their approach and the other dimension reduction methods are that they didn’t learn to disregard the distractors

Deterministic Policy Gradient Algorithms. Silver, Lever, Heess, Degris, Wierstra, Riedmiller. ICML 2014

  1. Deterministic policy gradient for continuous action MDPs
    1. Deterministic in that the action selected by the algorithm is deterministic as opposed to a stochastic
    2. Not a function of stochasticity in the domain <so far>
  2. “The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function.  This simple form means that the deterministic policy gradient can be estimated much more effectively than the usual stochastic policy gradient.”
  3. Exploration is driven by off-policy actor-critic
  4. This method is shown to beat stochastic policy gradient in high-D spaces
  5.  “It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010).  However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function.”
  6. The determinstic policy gradient is simply equivalent to the stochastic policy gradient as policy variance approaches 0.
  7. The basic difference between the deterministic vs stochastic policy gradient is that the first only integrates over state, while the latter also integrates over actions (so doing stochastic requires more samples <and similarly is more error-prone>)
  8. The benefit of stochastic policy gradient is that it has a nice way of doing exploration built-in.  Here an actor-critic setup is used to drive exploration: “We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient.  We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the gradient.”
  9. Experiments on a high-D bandit, some other stuff, and octopus arm
  10. “… compatible <in that the critic does not introduce bias> function approximators are linear in ‘features’ of the stochastic policy….”
  11. Discusses off-policy actor critic, which is a combination of policy gradient and TD, provides an approximation of the true gradient
    1. Has to use importance sampling to compensate for the fact that the state distribution that is being observed and the one the alogrithm would like to generate are different.
  12. “In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step.  Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximizing Q.”
  13. The estimate of the action parameterization-value gradient depends on the distribution of states visited, but it turns out the gradient of the state distribution does not have to be calculated
  14. Naturally, as in the case of stochastic case, a differentiable estimate of the Q function must be used.
    1. In general, this will not preserve the gradient of the true value function (it is called compatible) but there are classes of FAs that will
  15. Start with a description of gradient SARSA
  16. Then move to off-policy deterministic actor critic
  17. Linear FAs are compatible, and can be effective if they only have to locally dictate how to adjust parameters <but is it used only locally?>
    1. Seems like it can be linear in any set of basis functions, though
  18. Minimizing squared-error
  19. “To summarise, a compatible off-policy deterministic actor-critic (COPDAC) algorithm consists of two components.  The critic is a linear function approximator that estimates the action-value from features [math]… This may be learnt off-policy from samples of a behaviour policy β(a|s), for example using Q-learning or gradient Q-learning.  The actor then updates its parameters in the direction of the critic’s action-value gradient.”
  20. Although off-policy QL may diverge when using linear VFA, there are now methods that are safer, which is what is used here
  21. Computational complexity is mn where m = |A| and n is the number of policy parameters <which may be |S||A|?>
  22. On to experimental results
  23. High-D (D = 10, 25, 50) quadratic bandit.  Seems like a pretty trivial problem – perf in the 50D case converges at around 1000 samples.  Stochastic is still almost at exactly the same performance it started with at that point
  24. Then work in mountain car, pendulum, puddle world <at least in the first two, exploration isn’t trivial, although not extremely difficult>
  25. <Because the VFA is being done linearly, this doesn’t solve the problem of engineering features that allow the problem to be solvable, which is a fundamental issue in continuous RL>
  26. In octopus, reward is the distance from the arm to the target, so there is a nice smooth landscape to optimize on
    1. State space is simplified to 6-D
    2. VFA is done by ANN
  27. <Discussion>
  28. “Using a stochastic policy gradient algorithm, the policy becomes more deterministic as the algorithm homes in on a good strategy.  Unfortunately this makes stochastic policy gradient harder to estimate, because the policy gradient ∇θπθ(a|s) changes more rapidly near the mean.  Indeed, the variance of the stochastic policy gradient for a Gaussian policy N(μ, σ2) is proportional to 1/σ2 (…), which grows to infinity as the policy becomes deterministic.  This problem is compounded in high dimensions, as illustrated by the continuous bandit task.  The stochastic actor-critic estimates the stochastic policy gradient in … The inner integral …[math], is computed by sampling a high dimensional action space.  In contrast, the deterministic policy.”

Finite-Sample Analysis of Lasso-TD. Ghavamzadeh, Lazaric, Munos, Hoffman. ICML 2011.

Brief notes from a very light read – look at the authors and you know that you can either read this paper in 5 minutes, or 5 weeks

  1. Theoretical results encompass both LARS-TD and LC-TD.  They call their method Lasso-TD.
  2. Methods form a contraction mapping, so they converge
  3. A Lasso-TD solution always exists (although it may not be unique) and has a unique value function
  4. Performance bounds are derived “estimation error with the rate O(||α||1 n -1/4).”  Additional assumptions improve this to O(||α||0 n -1/2)
  5. Theorem connects “… prediction performance of Lasso-TD (which is an L1-regularized fixed point problem) and the L1-norm of the best approximation of V in Fn.”
  6. As number of features d increases wrt number of samples n, performance of vanilla LSTD gets continually worse, while Lasso-TD only degrades at (~O(log d))
  7. “… in order for Lasso-TD to be effective, a sparse approximation of the value function must exist.”
    1. They go on to talk about some properties of an MDP that would cause this property to exist
  8. Domains with smooth rewards and dynamics will have smooth value functions.
    1. “From approximation theory, we know that piecewise-smooth functions can be accurately approximated by localized features, such as wavelets (…).  As a result, if we define F as the linear function space spanned by a set of d wavelets, any value function is likely to be well-approximated by functions in F with a relatively small number of non-zero coefficients.”
  9. Also mentions LSTD-RP (LSTD with random projections <on my reading list>).
    1. In the general unrestricted case, LSTD-RP works better than Lasso-TD.
    2. When sparsity does exist, however, Lasso-TD works better

Dimension reduction and its application to model-based exploration in continuous spaces. Nouri, Littman. Machine Learning 2010

Just a note that I read this paper while I didn’t have ability to take notes, so notes are pretty sparse here.  If I go down the road of dimensionality reduction / feature selection in RL I should cite this, but probably the Parr, Ng papers on L1 regularization in TD are more relevant at the moment.

  1. Addresses exploration in high dimensional domains
  2. Uses a continuous value of knowness <C-KWIK?> to drive exploration smoothly, is closer in spirit to MBIE.  This is better than the standard binary known/unknown labels that R-Max uses
  3. Does learning according to KNN with RBFs, learns RBF parameters (σ) for each component of the output vector independently, which allows for more compact representations
  4. Performance degrades very slowly as a function of problem size, generalization and accuracy are very good as related to methods that do not attempt dimension reduction

Qualitative Hybrid Control of Dynamic Bipedal Walking. Ramamoorthy, Kuipers. RSS 2006.

  1. Approach is concerned with dynamic walking on uneven terrain
  2. A common approach is to formulate the walking problem as a linear system which can be solved with a number of methods.  This is problematic because it reduces energy efficiency, overly constrains the type of gaits that can be found, and requires constant compensation for what the body naturally does (that introduces non linearity) which introduces other forms of complexity
  3. Dynamic walkers utilize, instead of overcome the inherent nonlinearities involved in legged locomotion
  4. The idea is to only introduce small corrections into the trajectories that will naturally occur in the system
  5. Discuss motivation based on studies of how animals walk, and how humans learn to walk.  Idea is that planning is done in some simpler subspace and then mapped back to the original problem.  This paper is motivated by trying to find simple methods of walking on uneven terrain that can then be mapped back to more complex versions of the problem.
  6. Talk about Poincare maps and limit cycles for walking on smooth terrain, but these approaches don’t transfer well to walking on uneven terrain.
  7. Method here proposes to use trajectory segments that can be searched over in order to walk on complex terrain
  8. They present their mathematical representation of a pendulum walker – three point masses, each at the hip and feet
  9. In the swing phase, one foot is planted and there are only two moving masses, which is actually equivalent to the acrobot (only the hip is actuated).  The distinction is that it is assumed the legs can retract or extend instantaneously to clear obstacles.
  10. Here, impulses are only provided during double support (when both feet contact the ground), so the problem is very highly constrained.
  11. It is modeled as a discrete system, with systems being when one or two feet are planted on the ground
  12. Talk about the fact that a driven pendulum is chaotic, so getting everything to synchronize correctly (between the two separate systems of swinging and standing, and dealing with complications due to obstacles) may be difficult
    1. The transition to chaos is gradual, “As the driving amplitude is smoothly varied from a small value, only a few trajectories near critical points and the separatrix are affected.”
  13. They want to try and avoid situations that could become chaotic, do so by maintaining constraints that keep the system non-chaotic
  14. Again, forces are only applied during the double support phase.  After that the leading leg is retracted (and presumably later expanded)
  15. The method requires solution of a sequential quadratic program, and the method is admittedly not appropriate for on-line use, and then using some method of storing the policy
  16. The result settles into a stable limit cycle when walking on flat terrain
  17. Have some experiments walking on terrain that is flat, inclining, and of random heights

STOMP: Stochastic Trajectory Optimization for Motion Planning. Kalakrishnan, Chitta, Theodorou, Pastor, Schaal

  1. Another planner that uses optimization to minimize cost
  2. Performs local search by starting with an initial candidate trajectory and then introducing noise to sample nearby it:
    1. STOMP is an algorithm that performs local optimization, i.e. it finds a locally optimum[al] trajectory rather than a gobal one.  Hence, performance will vary depending on the initial trajectory used for optimization.  STOMP cannot be expected to solve typical motion planning benchmark problems like the alpha puzzle in a reasonable amount of time.

  3. Compares itself to the algorithm CHOMP, which is similar but CHOMP is a gradient-based method.  Because STOMP is not a gradient method, and has some randomized exploration, it has a better ability to escape local optima than CHOMP
  4. Because it is not a gradient method, STOMP can be used with wider classes of cost functions (such as those hard constraints on angles or torques which many method have trouble with), and more general constraints
    1. Hard limits can also cause local optima, so while the method doesn’t just break in these settings, it is possible convergence will not be to optimal resutls
  5. Basic motion planning algorithms generally require smoothing and other further transformations after a solution is found in order to maintain energy efficiency, minimize jerk (and motor wear), among humans abrupt changes caused by classical motion planning methods can be disconcerting.
    1. A result from a classical motion planner could be used as the initial proposal for this algorithm
  6. In contrast to the classical motion planning approach (which they call sampling-based), are optimization based planners which minimize cost
  7. Seems like this method also uses inverse kinematics, runs on a domain that does is *not* underactuated or exhibit drift: “Specifically, we consider trajectories of a fixed duration T, discretized into N waypoints, equally spaced in time.”
  8. Also does an EM-style update, and weighs samples according to their goodness: “The trajectory updates can be considered safer than a standard gradient descent update rule.  The new trajectory is essentially a convex combination of the noisy trajectories which already been evaluated, i.e. there are no unexpected jumps to unexplored parts of the state space due to a noisy gradient evaluation.”
  9. The only open parameter to the algorithm is the exploration noise
  10. In settings where optimization is done in a multidimensional space, each dimension is dealt with independently
  11. They do a  grasping task on a PR2 in simulation and on the real robot
    1. CHOMP fails to produce an effective plan about 30% of the time, while STOMP is always successful, probably due to the bit of randomized exploration STOMP does.
  12. They mention a paper that introduces noise into CHOMP to help get around local minima, but that it was difficult to use and implement, with many parameters that needed to be tuned

Progress in Algorithmic Motion Planning and Opportunities at the Intersection with Perceptual Science. Kostas Bekris talk.

  1. Interests:
    1. High-quality plans that
    2. Respect of physical constraints of the system
    3. Real time constraints are important
      1. Real-time planning may require short planning horizons which can lead to situations where collision cannot be avoided
    4. Multi-agent planning
    5. Interaction with humans
  2. Methods cited for control theory: LQR, partial feedback linearization, HJB and pnotyagin’s principle
  3. In late 70s, PSPACE-Hardness was established (Reif ’79, Schwartz & Sharir 82, 84)
  4. Cites Canny’s work (I just grabbed his thesis from the library), talked about roadmaps, but the algorithm presented was doubly-exponential, never even implemented.
  5. PRMs were introduced by Kavarki, Bekris’ advisor in 1996.  An efficient way of producing roadmaps
  6. In the general case, planning without respecting dynamics and then trying to smooth the path so it complies with dynamics is not possible
  7. He is interested in finding more computationally near-optimal results as opposed to optimal results at the limit with infinitely dense graphs
    1. Do dense planning and then prune
  8. His general goal is PAC-style planning
    1. The results in the field are almost entirely at the limit
  9. Under differential constraints, PRMs are difficult because of the BVP problem, RRTs dont have that problem
  10. Methods of encouraging efficient exploration (otherwise acrobot for example just hangs down with random actions)
  11. Can do replanning in domains that are non-static
  12. Multi-agent path planning is poly-time (non-optimal) in discrete domains
  13. Future work in area of co-robotics (robot, human interaction), maybe from perspective of game theory, but want pareto optimality, introduces a number of other problems, of course
  14. They may get a BAXTER(!)
  15. Minimax regret: Savage ’51, Niehas ’48
    1. Good for human interaction, because minimax results agree more with human intuition than game-theoretic results