Category Archives: Bayesian Networks

Learning probability distributions over partially-ordered human everyday activities. Tenorth, De la Torre, Beetz. ICRA 2013

  1. Attempt “… to learn the partially ordered structure inherent in human everyday activities from observations by exploiting variability in the data.”
  2. Learns full joint probability over actions that make up a task, their partial ordering, and their parameters
  3. Can be used for classification, but also figuring out what actions are relevant to a task, what objects are used, if it was done correctly, or what is typical for an individual
  4. Use synthetic data as well as TUM Kitchen and CMU MMAC
  5. Use Bayesian Logic Nets (another paper has author overlap and uses the same approach)
  6. Common alternate approaches are HMMs, conditional random fields (CRFs), or suffix trees
    1. But these are most effective when the ordering of the subtasks are pretty fixed
    2. Also Markov assumption of HMM doesn’t really hold in the way data is often represented and may require all history information
  7. Also some other approaches for representing partial ordering
    1. <Whatever this means> “All these approaches focus only on the ordering among atomic action entities, while our system learns a distribution over the order as well as the action parameters.”
  8. Literature on partially ordered plans require lots of prior information, and have been applied to synthetic problems
  9. Working off the TUM dataset, they needed to resort to bagging and data noisification techniques in order to get enough samples
  10. Needs annotation / data labeling
  11. Learns, for example, that after making brownies (the CMU MMAC dataset) some people like to put the frying pan in the sink, and others on the stove
  12. Performance of this approach is much better than that of CRFs, and is more robust to variability in how people undertake tasks

Symbolic Dynamic Programming for Continuous State and Action MDPs. Zamani, Sanner, Fang. AAAI 2012

  1. Paper presents a way to optimally (and in closed-form) solve continuous MDPs, does Symbolic Dynamic Programming (SPD)
  2. Functions in multivariate continuous state and action, discrete noise (whats that mean in this case?), piecewise linear reward.
  3. There is limited work that deals with exact solutions.  In control, there is LQR/LQG (linear quadratic with gaussian noise), but in LQR transition dynamics or reward cannot be piecewise linear
  4. The paper deals with 2 domains: Mars Rover, Inventory Control, and Resivior Management
  5. States can be composed of discrete and continuous factors
  6. (not surprisingly) requires a complete model of the MDP
  7. Because CSA-MDPS (continuous state action mdps) are naturally factored (there is a reference for this) , they can be dealt with as a dynamic bayes net (DBN)
    1. Assume no synchronic arcs (all dependencies are strictly forward in time), which allows for a relatively simple description of the transition dynamics
  8. The solutions here are optimal discounted finite-horizon
  9. The conditional probability functions are described in terms of piecewise linear equations with the following properties:
    1. Conditioned only on action, current state, and prev state vars (this is weird, why current state?)
    2. The are deterministic (can be encoded with dirac deltas), this means gaussian noise in the dynamics, for example is no good.
    3. They are piecewise linear
  10. Can sort of handle quadratic rewards, but this is restricted a bit (has to be univariate quadratic?)
  11. Describe value iteration for continuous MDPs
  12. Dynamics get represented as a bunch of disjoint linear and quadratic functions
  13. <its so cold in this room that I can’t even concentrate on what I’m reading, probably missing stuff here>
  14. Integration is required, but it turns out that in this setting it is solvable in a particular closed form, so it isn’t an issue
  15. This is very cool and important, but a general point to make is that this does exact solving, so it is brittle in a way.  A region that isn’t linear or quadratic just ends up being unsolvable.  Methods that are less exact can deal with that sort of problem
  16. The max operator is commutative? What is the casemax operator?
  17. It is claimed that the solver is “efficient”, but I’d like to know what that means in particular
  18. Solutions require linear checking/pruning in the XADD (I think thats the structure that represents the solution) or problems blow up extremely quickly
  19. Claims here that this approach produces the first exact solution for a couple of canonical OR problems.  A very neat paper, I wish I understood it better AAAI 2012 seems to have the very prominent talks on videolectures but thats it.
Tagged ,

Submodularity and Factored Models

From Near-Optimal Nonmyopic Value of Information in Graphical Models (Krause, Guestrin):

Any submodular function F over a finite set V must follow the following:

F(A U B) – F(A) => F(A’ U B) – F(A’)

(Where as you would imagine U is union)

Where A is a proper subset of A’ and A’ is a subset of V, and B not in A.

Now information gain on the xor function is not submodular, because in the case where xor(X Y) = Z, and P(X=1) = P(Y=1) = 1/2, then for entropy H(.):

H(Z|X) – H(Z) > H(Z|X U Y) – H(Z | Y)

1 > 0

To get around this, they use a clever structuring of the sets which is:

For R, S which are disjoint subsets of V, s.t. variables in S are indep. given R, and for A which is a subset of (S U R), a there is a  function (related to the information gain) which is submodular:

G(A) = H(R) – H(R \ A | A)

Now this is pretty nice, in the case of a one step DBN, R can be the set of features at a time step t, and S can be the set of features at time step (t+1), as all elements at time (t+1) are inependent given those at t, as long as there are no synchronic arcs.

The problem, as I see it for our uses is that that is a function of the entropy at t, but we care about the entropy at (t+1), so we need the Rs to be Ss in G() in order to just be able to use that function for what we need.

There is still useful stuff here though, for example, if we assume that a domain is submodular, and it is, then a greedy algorithm doing feautre selection is at most (1 – 1/e) worse than the optimal set.


Thoughts on KWIK Learning the Stock Game

In my previous post, I wrote about using entropy to find the structure of the stock game.  In discussions afterward, it came up whether it was provable that the method described was KWIK (Knows What It Knows – Li,  Littman, Walsh – ICML 2008).

István pointed out that this might be difficult due to the log factor in the equation for entropy, but I think I figured out a way around it; the basic answer is to just raise the equation H(X) to 2^{H(X)}, which then removes the log factor.  I wrote a quick breakdown of the math steps, its available here: Removing the log from the equation for entropy [note – Binomial should read Bernoulli].

From there I was unsure what to do, as it didn’t seem there was really anything simple to do as a next step in that line of thinking.  Reading over the KWIK paper again though gave me exactly what I was looking for; estimating the probability of a Bernoulli distribution is KWIK learnable.

So the algorithm I used was just using the conditional Bernoulli distributions to compute the entropy.  The question is, does anything weird happen based on the calculations done with those estimated probabilities to remove the KWIK bound?  Basically, does an accurate estimation of the probability imply an accurate estimation of the entropy?  More specifically, can the product of a number of KWIK-learned values be not KWIK anymore for some reason?

Tagged , , , ,

Solving the Stock Game in 300 Steps

In my last post, I wrote about using entropy to find the order of a few extremely simplified simulations of Pitfall.  This post will use the same approach, but will instead be used to find the structure of the Stock Game, which I am familiar from “Efficient Structure Learning in Factored-state MDPs”, Alexander L. Strehl, Carlos Diuk and Michael L. Littman. AAAI 2007.

The game is another extremely simplified simulation, this time of the dynamics of a stock market.  Basically the system is composed of a number of sectors, with each sector composed of a number of stocks.  The stocks across sectors are independent of each other, but within a sector, the probability of any stock rising is defined as (if it doesnt rise, it falls):

p(rising) = 0.1 + 0.8*(#stocksRoseInSector/#stocksDroppedInSector)

Where the number that rose/dropped is based on the state in the previous step.

So stocks in a sector have predictive power over each other, but not across sectors.  In order to most efficiently learn a model the dynamics of the market, this structure needs to be discerned.  Additionally, the reward per turn is: For each sector, if that sector is owned, +1 for each stock that went up, and -1 for each stock that went down.  If the sector is not owned, 0 reward is earned from that sector.

Just like before, I looked at the difference in conditional entropies.  In this case, however, the environment is stochastic, so I couldn’t just wait until the entropy goes to zero to stop.  Additionally, it is undesirable to take any conditioning statement just because it reduces entropy by some small amount, which can easily happen due to noise.

In this case, I considered that variable_2 has predictive power over variable_1 if conditioning on variable_2 leads to a 10% or greater reduction in entropy of variable_1.

Here, I did not specify any maximum (or minimum) number of parents allowed for each item.  Additionally, I also did not restrict the model to search over only structures that were order 1; it was free to search further in the past to reduce the entropy even more.  In this case, since the actual process is order 1, it is nice that the algorithm finds a solution that is also order 1.

Even though I ruined the punchline with the title, this approach seemed to work quite well, learning the dynamics and reward structure of the stock game with 3 sectors and 2 stocks/sector in about 300 steps of simulation (actions were selected randomly).   As you would expect, the output of the program looks like:

Stock:  2_0:, predictors:  [('2_0', -1), ('2_1', -1)]
Stock:  2_1:, predictors:  [('2_0', -1), ('2_1', -1)]
Stock:  1_1:, predictors:  [('1_0', -1), ('1_1', -1)]
Stock:  1_0:, predictors:  [('1_0', -1), ('1_1', -1)]
Stock:  0_0:, predictors:  [('0_1', -1), ('0_0', -1)]
Stock:  0_1:, predictors:  [('0_1', -1), ('0_0', -1)]
Reward predictors:  [('1_0', -1), ('own_2', -1), ('own_0', -1),
    ('2_0', -1), ('own_1', -1), ('0_0', -1), ('2_1', -1),
    ('0_1', -1), ('1_1', -1)]

Where stock x_y is the yth stock in sector x, the -1 just means the predictor is from the previous time step, and own_w means that sector w is owned.  It correctly isolates all the stocks into the right sectors and identifies that the reward function depends on all stocks and all ownership values.

I think this is good because the published result from Strehl et al seems to learn the structure at about 1,500 steps of simulation, where this algorithm can accomplish the same thing in 1/5 the time of SLF-Rmax.  Basically, we can just use this algorithm to learn the structure and then just pass that along to factored-Rmax, which takes right off.

About the empirical computational costs, this approach will solve the 2×2 (2 sector, 2 stocks/sector) in less than 20 iterations, and seems to need about 5,000 to solve the 3×3, so I’m not sure yet exactly how the number of steps required grows as the size of the state space grows exponentially (4 for 2×2, 63 for 3×2, 512 for 3×3), but I’m sure I can figure that out simply enough .  Those are kind of scary numbers when I think of them in terms of Pitfall though… Not sure if the other things I discussed would be any better, my guess is not though, as computing the entropy of one distribution is just linear in the number of samples (and constant in memory), as opposed to quadratic (potentially in both) for many other methods.

Ideas from Decision Theoretic Planning

Decision Theoretic Planning, Boutilier, 1999:

  • Fully Observable MDPs can be solved in time poly to |state|*|action|*|inputs|, most commonly used are value, policy iteration (p. 24).
    • Finite-horizon requires O(|S|^2 |A|), computation per iteration, infinite horizon is O(|S|^2 |A| + |S|^3) per iteration. Both converge in a polynomial number of iterations.
  • POMDPs are exponentially hard in |state| and |horizon| (p.25).
    • The problem of finding a policy that maximizes (or approximately maximizes) expected discounted reward over an infinite horizon is undecidable.
    • Littman (1996) examines a special case of POMDP where boolean rewards are given and the goal is to find an infinite horizon policy with nonzero total reward given that the rewards associated with all states are non-negative.
  • In probabilistic goal-oriented planning for POMDPs, search is usually done for a probability distribution over states. Even simplest problem in this domain is undecidable. Why? The set of states may be finite, but the set of distributions over the states is not, and worst case the agent may have to search an infinite number of plans before being able to determine whether a solution exists (p. 25)
  • Policies under finite-horizon problems are generally non-stationary (even if the MDP is), whereas policies under infinite-horizon problems are stationary (p.29)
  • In value iteration, the sequence of estimated value functions converges linearly to the optimum (p. 29).
  • Policy iteration can converge quadratically, but in practice converges faster than value iteration (p.30).
  • To best make decisions in POMDPs, it is necessary to look not only at the most recent observation, but the entire history of the current trajectory. History dependent policies grow in size exponential to the horizon length. Can instead use a probability distribution over states (belief states), and from this distribution a policy can be computed (p. 31).
    • When using belief states, the problem becomes one that can be viewed as a fully observable MDP.
  • RTDP (Barto et al 1995) is an online asynchronous value iteration algorithm (p. 36).
  • In a DBN, diachronic arcs go across time (from t-1 to t), and synchronic arcs stay in the same time frame (from t to t).
  • It is possible to restructure the graphs so that a process that required synchronic arcs can be modeled without synchronic arcs, but this is basically just graph restructuring trickery (p. 50).
  • Aggregation techniques can be used to group regions of the state space together to allow for more efficient (less granular) computation to be done.  Worst-case costs for value function calculation is cubic in the size of the largest cluster (p. 58).
  • Goal regression is the process of achieving subgoals, each of which brings the state of the system closer to the goal state, without undoing the work a previous subgoal accomplished (p. 60).
  • A technique called model minimization attempts to find a minimal (or at least smaller) representation of the MDP that is equivalent.  Takes idea from state machines (p. 70).
  • Problem decomposition takes the MDP and breaks it up into a number of independent MDPs, each of which has a state which solves some aspect of the original MDP (p. 71).
  • There is an algorithm that identifies recurrent classes of an MDP as represented by a 2DBN (2-step dynamic bayes net, Boutilier, Puterman 1995) (p. 75).
  • Can break up an MDP into a set of parallel MDPs that run together (p. 79).
Tagged ,

Ideas from Learning the Structure of Dynamic Probabilistic Networks

From Learning the Structure of Dynamic Probabilistic Networks by Freidman, Murphy, Russell:

scores combine the likelihood of the data according to the network … with some penalty relating to the complexity of the network. When learning the structure of PNs, a complexity penalty is essential since the maximum-likelihood network is usually the completely connected network.

Well, one nice thing about my post on Estimating the order of an infinite state MDP with Gaussian Processes is that because of the regression, variables that don’t help predict the desired value increase the error.  This error itself acts as a regularizer that metrics for scoring learned DBNs need to have added artificially.

Paper also discusses Friedman’s (Learning belief networks in the presence of
missing values and hidden variables, 1997)  Structural EM (SEM) algorithm, which can modify the structure of the DBN being learned during the M-step.

Ideas from: Operations for Learning with Graphical Models

From Operations for Learning with Graphical Models, Wray Buntine:

  • Gibbs is a type of MCMC
  • Gibbs is really simulated annealing with constant temperature = 1 (p.201).
  • EM and Gibbs are really the same thing (p.208).