- Policies considered are constrained by hierarchies of partially specified machines
- Is intended to allow for incorporation of prior knowledge and create a smaller search space
- Algorithms presented are provably convergent
- Uses HAMs or hierarchical abstract machines, which constrain policies that can be selected and whose transitions may invoke lower level machines
- A HAM is a program which constrains the set of actions the agent can choose from in each state
- In a maze navigation task it may select a fixed action until the end of a hallway is encountered
- This machine in turn can call lower level machines to do other tasks like simple obstacle avoidance

- Mentioned this architecture is similar to Brooks’ subsumption architechture
- At the highest level, this example machine just selects a direction at intersections, but other operation is automatic
- A level below this is four machines, each traverse a hallway in a particular direction
- Below this, a number of strategies for getting around obstacles while going in that direction exist
- Finally are the primitive actions
- The machine also uses a richer feature representation that describes near obstacles and hallway intersections

- The idea is that a HAM applied to an MDP composes that MDP into a new domain. Essentially:
- The new set of states is the cross product of states in H (the HAM) and M (the MDP)
- For each state in HoM where H is an action state (there are two types of states in H), the transition functions of H and M are combined
- For each stat in HoM where H is a choice state, actions only change machine state
- The rewards are taken only from primitive actions in M

- Policies that are optimal in HoM correspond to an optimal policy in M, and there is a way to derive the optimal policy for M from the optimal policy for HoM
- The complexity of HoM is the number of choice states in HoM
- This will usually be smaller than the number of choices that have to be selected for M

- They introduce a variant of QL that learns directly in the reduced space, but doesn’t have to perform the explicit creation of HoM
- Keeps track of
- Current state in H, as well as M
- The last action selected
- The states of H and M at the time the last action was selected
- The reward (as well as discount) accumulated since the last action selection
- The Q-table is extended as a SxMxA (actual states x machine states x actions)

- The algorithm converges to optimal
- This stuff is orthogonal to state aggregation, which can introduce partial observability issues if not done carefully (domain must also have certain features to allow aggregation to happen safely)
- A number of references worth reading, may not be relevant but important to thoroughly check whats out there
- Says the options framework is a subset of what is described here