##
Category Archives: nonstationary

- Deals with multi-armed bandit setting with adaptive opponents (explicitly mentions HIV as an example)
- Regret is measured w.r.t. best history-dependent strategy
- Reward functions are allowed to be arbitrary functions of the history
- The discuss operating under a class of constraints on the opponent (must be too difficult without). Constraint can be unknown?
- There are some existing algorithms, but they are intractable or nonimplementable
- They say the model here is a special case of “sleeping bandits”… whats that?
- Talk about equivalence classes of histories, the method can express anything from fully adversarial to vanilla bandits
- When only one model is considered the regret is sqrt-ish. The main contribution is a tractable algorithm that has sub-linear regret. Comparisons to optimal hist0ry-based strategy
- They call their history-based regret phi-regret. Phi is the equivalence relation on histories
- Phi-regret is always at least as big as vanilla regret. When there is only one equivalence class they are equal

- Stochastic phi-regret is minimized by using vanilla regret algorithms for each history class
- Resulting algorithms are called phi-ucb and phi-exp3

- “Thus we show that there always exist an environment such that whatever the strategy of the player is, a particular opponent will lead to visit all history-classes uniformly in expectation.”
- Players can depend on phi, opponent can depend on phi and agent. Analysis is of worst (maximal regret?) opponent against the best agent of worst phi
- Consider case where opponent can use a number of equivalence relations defined by multiple class functions, which is the phi_theta regret
- They say this case in particular is related to sleeping bandits

- Basic idea is to use multiple experts, but the performance for each expert is worse than when they are used alone because they may not get the data they want
- Point out regret term with theta is log(theta)^(1/2), and is because all models give rewards which provide information about all models, so the learning is faster than might be initially expected

**Tagged** ensemble
- Uses deictic (not familiar with this) representations that lead to partial observability, but seems to also allow for generalization
- “The goal of the paper is… to study the effects of representational choices and the topology of the environment on an already well-understood RL algorithm.”
- They claim dealing with nonstationarity is an instance of a more general generalization problem. I can see how this is true, but it isn’t how I always think about it
- The representation used here is in terms of logical predicates that describe the state, this allows for more generalization than just enumerating states
- Predicates describe properties of objects, n-ary predicates describe relationships among objects

- Problems arise when the relational language can’t adequately distinguish between two configurations that require different policies
- “The difficult, then is ensuring that the relational representation captures enough of the state description to function effectively, while discarding enough to generalize well”

- Strategy employed is of state envelopes, that the agent maintains an explicit representation of states near the envelope and disregards anything outside it (basically makes the agent work in a local fashion). Here the envelope is centered around the agent
- Discuss a gridworld with multiple types of terrain
- Say the classic discrete state representation doesn’t work if environment (such as goal location changes)
- Results applied to Q-learning. A very low γ is used, which sort of creates an envelope around the state anyway, so naturally the representation with the smaller state space will learn faster, as I don’t think it can learn much better with the small γ
- Doesn’t seem that nonstationary to me, because with their method they are representing everything that is needed to remove nonstationarity, whereas it seems like QL is deprived that information (such as where the goal is)

- Hidden Mode MDPs (HM-MDPs) are a model of nonstationary model where the environment dynamics change over time according to a MDP
- Although general POMDP algorithms can be used, more effective algorithms which leverage special cases of HM-MDPs can be more effective in these domains
- Proposed algorithm works by decomposing Q() into a number of components, simplifying the operation, allowing for a special form of VI
- In HM-MDPs, environmental distributions are restricted to a fixed number of unobservable modes; each mode specifies an MDP.
- States, however, are fully observable

- Proposes keeping track of belief state of mode
- As most POMP literature, seems this algorithm assumes complete knowledge of the environmental distributions
- Says VI can’t be computed explicitly because it needs to compute all infinitely many belief states. Offhand, I can’t figure out why you can’t just do VI for each model and then blend those values when you are in a belief state over models?
- Their VI algorithm looks like it enumerates all belief states that are encountered during the course of computing VI? This would be finite, but also very large.
- Guess they could do this because they know initial distribution over modes

- They then give a couple of methods of doing these computations more cheaply
- Compares their performance vs standard POMDP planner in terms of fixed computation time