## Ideas from Decision Theoretic Planning

Notes on "Decision-Theoretic Planning: Structural Assumptions and Computational Leverage" (Boutilier, Dean & Hanks, 1999):

• Fully observable MDPs can be solved in time polynomial in |S|, |A|, and the size of the input representation; the most commonly used algorithms are value iteration and policy iteration (p. 24).
• Finite-horizon problems require O(|S|² |A|) computation per iteration; infinite-horizon problems require O(|S|² |A| + |S|³) per iteration (the cubic term comes from solving the linear system for exact policy evaluation). Both converge in a polynomial number of iterations.
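As a sketch of the O(|S|² |A|)-per-iteration sweep, here is tabular value iteration on a toy 2-state, 2-action discounted MDP (all transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

# Toy MDP (numbers invented): P[a][s][s'] = transition probability,
# R[s][a] = expected immediate reward, gamma = discount factor.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # One sweep costs O(|S|^2 |A|): an |S| x |S| matrix-vector
        # product for each action.  Q[a][s] = R[s][a] + gamma * P_a V.
        Q = R.T + gamma * np.einsum('ast,t->as', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

V, policy = value_iteration(P, R, gamma)
```

The estimates converge linearly (the Bellman operator is a gamma-contraction), matching the convergence note on p. 29 below.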
• POMDPs are exponentially hard in both |S| and the horizon length (p. 25).
• The problem of finding a policy that maximizes (or approximately maximizes) expected discounted reward over an infinite horizon is undecidable.
• Littman (1996) examines a special case of POMDP with boolean rewards: given that the rewards associated with all states are non-negative, the goal is to find an infinite-horizon policy with nonzero total reward.
• In probabilistic goal-oriented planning for POMDPs, search is done over probability distributions over states rather than over states themselves. Even the simplest problem in this domain is undecidable. Why? The set of states may be finite, but the set of distributions over those states is not, and in the worst case the agent may have to search an infinite number of plans before it can determine whether a solution exists (p. 25).
• Optimal policies for finite-horizon problems are generally non-stationary (even when the MDP itself is stationary), whereas optimal policies for infinite-horizon problems can be taken to be stationary (p. 29).
• In value iteration, the sequence of estimated value functions converges linearly to the optimum (p. 29).
• Policy iteration can converge quadratically, and in practice it typically converges in fewer iterations than value iteration (p. 30).
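A sketch of policy iteration on a toy 2-state, 2-action MDP (numbers invented), making the O(|S|³) policy-evaluation step explicit as a linear solve:

```python
import numpy as np

# Toy MDP (numbers invented): P[a][s][s'] transitions, R[s][a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        # This linear solve is the O(|S|^3) step.
        P_pi = P[pi, np.arange(n_states)]        # |S| x |S| matrix
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead, O(|S|^2 |A|).
        Q = R.T + gamma * np.einsum('ast,t->as', P, V)
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

V, policy = policy_iteration(P, R, gamma)
```

Because each evaluation is exact, the outer loop usually terminates in very few improvement steps on small problems.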
• To act well in a POMDP, it is necessary to look not only at the most recent observation but at the entire observation history of the current trajectory. History-dependent policies grow exponentially in size with the horizon length. Instead, one can maintain a probability distribution over states (a belief state), and compute a policy from this distribution (p. 31).
• When using belief states, the problem can be viewed as a fully observable MDP over the (continuous) space of beliefs.
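The belief-state bookkeeping is a Bayes-filter update: predict the belief forward through the transition model, then condition on the observation. A minimal sketch for a toy 2-state POMDP with a single fixed action (all probabilities invented):

```python
import numpy as np

# Toy POMDP pieces (numbers invented), for one fixed action:
# T[s][s'] = transition probability, O[s'][o] = observation probability
# in the resulting state s'.
T = np.array([[0.7, 0.3],
              [0.1, 0.9]])
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def belief_update(b, T, O, obs):
    """b'(s') is proportional to O(obs|s') * sum_s T(s'|s) b(s)."""
    predicted = b @ T              # push belief through the dynamics
    unnorm = predicted * O[:, obs]  # weight by observation likelihood
    return unnorm / unnorm.sum()    # renormalize to a distribution

b = np.array([0.5, 0.5])
b = belief_update(b, T, O, obs=0)
```

Repeating this update along a trajectory compresses the entire history into a single distribution, which is why the belief MDP view works.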
• RTDP (Barto et al., 1995) is an online, asynchronous value iteration algorithm: only the states visited along simulated or actual trajectories are backed up (p. 36).
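A sketch of trial-based RTDP on a toy 2-state, 2-action MDP (numbers invented): each trial simulates a trajectory from the start state, applying a Bellman backup only to the state currently being visited.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (numbers invented): P[a][s][s'] transitions, R[s][a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def rtdp(P, R, gamma, start, n_trials=200, horizon=30):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_trials):
        s = start
        for _ in range(horizon):
            # Asynchronous backup of the current state only.
            Q = R[s] + gamma * P[:, s] @ V
            a = int(Q.argmax())
            V[s] = Q[a]
            # Simulate the greedy action to pick the next state.
            s = rng.choice(n_states, p=P[a, s])
    return V

V = rtdp(P, R, gamma, start=0)
```

Because backups concentrate on states the policy actually visits, RTDP can ignore unreachable parts of the state space; on this tiny example every state is visited, so the estimates approach the full value-iteration fixed point.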
• In a DBN, diachronic arcs go across time slices (from t−1 to t), and synchronic arcs stay within the same time slice (from t to t).
• It is possible to restructure the graph so that a process that required synchronic arcs can be modeled without them, but this amounts to graph-restructuring trickery rather than a fundamental change (p. 50).
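To make the factored-transition idea concrete: with only diachronic arcs, the joint transition distribution factorizes into one small CPT per variable. A toy sketch with two boolean state variables (all CPT numbers invented):

```python
from itertools import product

# Hypothetical 2-variable boolean DBN with diachronic arcs only, so
#   P(x', y' | x, y) = P(x' | x, y) * P(y' | x, y).
p_x = {  # P(X_t = 1 | X_{t-1}, Y_{t-1})
    (0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}
p_y = {  # P(Y_t = 1 | X_{t-1}, Y_{t-1})
    (0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 0.6}

def transition_prob(prev, nxt):
    """Joint next-state probability as a product of per-variable CPTs."""
    px = p_x[prev] if nxt[0] == 1 else 1 - p_x[prev]
    py = p_y[prev] if nxt[1] == 1 else 1 - p_y[prev]
    return px * py

# The two 4-entry CPTs implicitly define the full 4x4 transition matrix
# that a flat MDP encoding would store explicitly.
rows = [[transition_prob(p, n) for n in product((0, 1), repeat=2)]
        for p in product((0, 1), repeat=2)]
```

The compactness win is that n boolean variables need n small CPTs instead of a 2ⁿ × 2ⁿ matrix; a synchronic arc (say Y_t depending on X_t) would break this particular product form.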
• Aggregation techniques can be used to group regions of the state space together, allowing computation to be done more efficiently at a coarser (less granular) level. The worst-case cost of computing the value function is cubic in the size of the largest cluster (p. 58).
• Goal regression is the process of achieving subgoals, each of which brings the state of the system closer to the goal state, without undoing the work a previous subgoal accomplished (p. 60).
• A technique called model minimization attempts to find a minimal (or at least smaller) representation of the MDP that is equivalent to the original; the idea is borrowed from state-machine minimization (p. 70).
• Problem decomposition takes the MDP and breaks it up into a number of independent MDPs, each of which addresses some aspect of the original problem (p. 71).
• There is an algorithm that identifies the recurrent classes of an MDP represented as a 2DBN (2-stage dynamic Bayes net; Boutilier & Puterman, 1995) (p. 75).
• Can break up an MDP into a set of parallel MDPs that run together (p. 79).