Category Archives: Information Theory

RealKrimp — Finding Hyperintervals that Compress with MDL for Real-Valued Data. Witteveen, Duivesteijn, Knobbe, Grunwald. Advances in Intelligent Data Analysis 2014

Ah, it's from a symposium.

An implementation exists here

  1. The idea behind the minimum description length (MDL) principle is that induction can be done by compression.
  2. Here they take a popular MDL algorithm, Krimp, and extend it to real-valued data
  3. “Krimp seeks frequent itemsets: attributes that co-occur unusually often in the dataset. Krimp employs a mining scheme to heuristically find itemsets that compress the data well, gauged by a decoding function based on the Minimum Description Length Principle.”
  4. RealKRIMP “…finds interesting hyperintervals in real-valued datasets.”
  5. “The Minimum Description Length (MDL) principle [2,3] can be seen as the more practical cousin of Kolmogorov complexity [4]. The main insight is that patterns in a dataset can be used to compress that dataset, and that this idea can be used to infer which patterns are particularly relevant in a dataset by gauging how well they compress: the authors of [1] summarize it by the slogan Induction by Compression. Many data mining problems can be practically solved by compression.”
  6. “An important piece of mathematical background for the application of MDL in data mining, which is relevant for both Krimp and RealKrimp, is the Kraft Inequality, relating code lengths and probabilities.”  They extend the Kraft Inequality to continuous spaces.
  7. <Ok skipping most – interesting but tight on time.>
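As a reminder of what the Kraft inequality says (this sketch is mine, not from the paper): a binary prefix code with codeword lengths l_1, …, l_n exists if and only if the sum of 2^(-l_i) is at most 1.

```python
def kraft_sum(code_lengths):
    """Kraft sum: a binary prefix code with these codeword lengths
    exists if and only if this sum is <= 1."""
    return sum(2.0 ** -l for l in code_lengths)

# Lengths of the prefix code {0, 10, 110, 111}: the sum is exactly 1 (a complete code).
print(kraft_sum([1, 2, 3, 3]))  # 1.0
# Three codewords of lengths 1, 1, 2 cannot form a prefix code.
print(kraft_sum([1, 1, 2]))     # 1.25
```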

Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization. Golovin, Krause, Ray. Talk (NIPS 2010)


  1. Basic idea is to consider submodular functions that are stochastic and unfold over time.  So instead of making all decisions in one shot (before execution begins), it is better to wait to see how the first decision unfolds before making the second, etc.
    1. Gives the example of influence in social networks – if you want to get people to plug a product, you give it to a few people for free, and watch how influence spreads from the first person before deciding on the second
  2. The setting described here is a generalization of standard submodularity
  3. The analysis here allows many results from standard submodular optimization to carry over to the adaptive setting
  4. Instead of maximizing margin, maximizes expected margin based on current state, but does this only for the first selection, and the results of the first selection on state then become conditions for the next choice
  5. Gets the standard (1 – 1/e) bounds
  6. Because adaptive, can now do things like “take the minimal number of actions needed to achieve X”, which you couldn’t do with standard submodularity (there, you can only say “maximize the benefit of N choices”)
    1. This is a log-approximation ln(n)+1 for stochastic set cover
  7. Another application: optimal decision trees
    1. “Generalized binary search” which is equivalent to max info gain
    2. This problem is adaptive submodular <I thought things like xor in decision trees aren’t submodular, but maybe because we are talking w.r.t. information gain it is?>
    3. Give bounds for this approximation
    4. “Results require that tests are exact (no noise)!”
  8. If trying to do decision trees in noisy setting, can do a Bayesian version
    1. But generalized binary search, maxing info gain, and maxing value of info aren’t adaptive submodular in the noisy setting, where they can require exponentially more tests than optimal <Should really read this paper>
  9. Trick is to transform noisy problem into normal problem by making outcome part of the hypothesis
    1. Only need to distinguish between noisy hypotheses that lead to different decisions
  10. Gives bounds for this Bayesian active learning for noisy domains
  11. Gives an example of testing different psychological/economic models of how people perform decision making
    1. For this setting, turns out many common approaches actually do much worse than random
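As a reference point for the (1 – 1/e) bounds mentioned above, here is the classic non-adaptive greedy routine on a toy coverage function (my own sketch, not from the talk; the set system is made up):

```python
def greedy_max_coverage(candidate_sets, k):
    """Greedy maximization of a coverage function (monotone submodular):
    repeatedly pick the set with the largest marginal gain.  Guaranteed to
    cover at least a (1 - 1/e) fraction of what the best k sets cover."""
    covered, chosen = set(), []
    for _ in range(k):
        # Marginal gain of each candidate given what is already covered.
        gain, idx = max((len(s - covered), i) for i, s in enumerate(candidate_sets))
        if gain == 0:
            break
        chosen.append(idx)
        covered |= candidate_sets[idx]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, 2)
print(chosen, covered)  # [2, 0] {1, 2, 3, 4, 5, 6}
```

The adaptive version replaces the deterministic marginal gain with an expected gain conditioned on the outcomes observed so far.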

Predictive Coding and the Slowness Principle: An Information-Theoretic Approach. Creutzig, Sprekeler. Neural Computation 2008.

  1. “Here, we use the information bottleneck method to state an information-theoretic objective function for temporally local predictive coding.  We then show that the linear case of SFA can be interpreted as a variant of predictive coding that maximizes the mutual information between the current output of the system and the input signal at the next time step.”
  2. “One approach for the self-organized formation of invariant representation is based on the observation that objects are unlikely to change or disappear completely from one moment to the next.”
  3. SFA reproduces some invariances in the visual system as well as “… properties of complex cells in primary visual cortex (…).”
    1. Including place cells
  4. Predictive coding basically means model building of the environment
    1. A big part of this biologically is “redundancy reduction”
  5. This paper sets up an information-theoretic framework for predictive coding
  6. “We focus on gaussian input signals and linear mapping.  In this case, the optimization problem underlying the information bottleneck can be reduced to a eigenvalue problem (…).  We show that the solution to this problem is similar to linear slow feature analysis, thereby providing a link between the learning principles of slowness and predictive coding.”
  7. In the information bottleneck, “One seeks to capture those components of a random variable X that can explain observed values of another variable R.  This task is achieved by compressing the variable into its compressed representation Y while preserving as much information as possible about R.  The trade-off between these two targets is controlled by the trade-off parameter β.”
  8. The problem is solved by minimizing a Lagrangian, where the first term minimizes the complexity of the mapping and the second maximizes accuracy of the representation
  9. “From the point of view of clustering, the information bottleneck method finds a quantization, or partition, of X that preserves as much mutual information as possible about R.”
  10. IB has been used for document clustering, neural code analysis, gene expression analysis, and extraction of speech features.
  11. “In particular, in case of a linear mapping between gaussian variables, the optimal functions are the solution of an eigenvalue problem (…).”
  12. “SFA is guaranteed to find the slowest features first, whereas TLPC [temporally local predictive coding] finds the most predictive components first.  For example, a very fast component can be very predictive, for example, if the value at t+1 is the negative of the current value (…).  Hence, from the TLPC point of view, the absolute deviation from random fluctuations rather than slowness is relevant.”  Although this distinction is only really relevant for discrete data with fine time resolution; in continuous spaces they should be much more closely related
    1. The relationship between them is λ_i^TLPC = λ_i^SFA – (1/4)(λ_i^SFA)^2, for eigenvalues λ_i
  13. TLPC weighs each feature while SFA does not.  
  14. TLPC “… and SFA find the same components in the same order <when assumptions hold?>.  The difference is that TLPC allows quantifying the components in terms of predictive information… slow feature analysis accredits the same amplitude to all components, while TLPC gives higher weights to slower components according to their predictive power.”
    1. <But in SFA you do get the eigenvalues of the eigenvectors that can be used for weighing?  The goal is to find the lowest-eigenvalued eigenvectors.  Why can’t you use those weights, or do they simply have a less direct interpretation?>
  15. “The relationship between predictive coding and temporal invariance learning has also been suggested in other work, for example, by Shaw (2006), who argued that temporal invariance learning is equivalent to predictive coding if the input signals are generated from Ornstein-Uhlenbeck processes.”
  16. “In one regard, temporally local predictive coding differs from slow feature analysis. The information bottleneck approach is continuous in terms of the trade-off parameter β, and new eigenvectors appear as second-order phase transitions. The weighting of the eigenvectors is different in that it depends on their eigenvalue (see Figure 3). This can be important when analyzing or modeling sensory systems where available bandwidth and, hence, resulting signal-to-noise ratio, is a limiting factor. For temporally local predictive coding, available bandwidth, such as number of neurons, should be attributed according to relative amplitude, whereas slow feature analysis accredits the same bandwidth to all features.”
  17. <Immediately following> “We emphasize that our approach is not directly applicable to many real-world problems. Our derivation is restricted to gaussian variables and linear mappings.
  18. The restriction on the immediate past implies that SFA does not maximize predictive information for non-Markovian processes.”  <They claim follow-up work on this was in progress at time of publication>

Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Srinivas, Krause, Kakade, Seeger. ICML 2010

<I did see this presented in person in ICML 2010.>

  1. Analyzes an algorithm called GP-UCB
  2. Regret is bounded in terms of information gain
  3. <I think a general problem with the approach is the computational (and even more so, memory) requirements involved in using GPs.  I gave them up years ago because they kept running out of memory.>
  4. Deals with noisy setting
  5. Information gain is bounded by use of submodularity argument
  6. Says HOO has issues with high-dimensional domains, which is contrary to what I remember from the paper.  Anyway, argues that measures of smoothness other than Lipschitz (or Hölder) can give better results (based on what kernel you use in the GP)
  7. Mentions that there is a difference between experimental design and optimization, but I am not familiar with the distinctions
  8. GP-UCB is a distinct method of using GPs for optimization; there is a small literature on different algorithms with a number of citations in the paper
  9. Discusses RKHS, Reproducing Kernel Hilbert Spaces (something I’ve heard of before but have no idea what it is)
    1. There is something called the RKHS norm, which is a measure of smoothness
  10. Greedily selecting points that maximize information gain is approximately (a constant fraction of) optimal <they cite Guestrin’s paper for this>
  11. The GP-UCB algorithm, however, doesn’t just try to maximize information globally.  Since we are performing optimization we only care about areas where the maximum may lie.  There is some scaling constant that is used to balance gaining information vs finding the optimum.  It must be selected correctly or the algorithm will converge prematurely
  12. There is a major problem with this approach, which is what I thought I remembered hearing in the talk.  Given a bunch of potential points, there is math that tells you which one to sample next.  But in continuous optimization there is an infinitely large set of points, and this approach has no way to deal with that, turning the global optimization problem into another global optimization problem (which they claim is easier, though I don’t really buy that)
  13. Putting together the computational and memory issues, and the more fundamental issue that it doesn’t resolve the global optimization problem, I think this needs to be seriously improved upon before it can be taken seriously.
  14. For example, in the experimental results section they discretize a 1-D optimization problem into 1000 points.  This works fine in 1-D, but clearly won’t scale in any reasonable manner to higher dimensions, which is one of the settings they say the algorithm is good at.  For larger dimensions it would be necessary to resort to some method other than uniform discretization
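A bare-bones sketch of how GP-UCB proceeds on such a discretized 1-D problem (my own illustration, not their code; the objective, kernel length-scale, noise level, and the constant β are arbitrary choices here, and in the paper β follows a specific schedule):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.5 * x      # made-up "unknown" objective
grid = np.linspace(0.0, 2.0, 200)          # uniform discretization, as in their 1-D experiments
noise = 0.1

def rbf(a, b, ell=0.2):
    """Squared-exponential kernel matrix between point sets a and b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

X = [0.5]
y = [f(0.5) + noise * rng.normal()]
for t in range(15):
    Xa, ya = np.array(X), np.array(y)
    K = rbf(Xa, Xa) + noise**2 * np.eye(len(Xa))
    Ks = rbf(grid, Xa)
    mu = Ks @ np.linalg.solve(K, ya)                                # posterior mean on grid
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))  # posterior variance
    beta = 2.0                                                      # exploration constant (must be tuned)
    ucb = mu + np.sqrt(beta * np.clip(var, 0.0, None))
    x_next = float(grid[np.argmax(ucb)])   # sample where the upper confidence bound is highest
    X.append(x_next)
    y.append(f(x_next) + noise * rng.normal())

best = max(X, key=f)                       # incumbent guess at the maximizer
```

Note that selecting the argmax over the grid is itself a global search, which is exactly the complaint in point 12.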

Submodularity and Factored Models

From Near-Optimal Nonmyopic Value of Information in Graphical Models (Krause, Guestrin):

Any submodular function F over a finite set V must satisfy the following:

F(A U B) – F(A) >= F(A' U B) – F(A')

(Where, as you would imagine, U is union)

where A is a subset of A', A' is a subset of V, and B is not in A'.

Now information gain on the xor function is not submodular. In the case where xor(X, Y) = Z and P(X=1) = P(Y=1) = 1/2, submodularity would require (for entropy H(.)) that the gain from observing X alone be at least the gain from adding X after Y:

H(Z) – H(Z|X) >= H(Z|Y) – H(Z|X,Y)

But H(Z) = H(Z|X) = H(Z|Y) = 1 while H(Z|X,Y) = 0, so this reads

0 >= 1

which fails: the marginal gain of X grows as the conditioning set grows.
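The xor counterexample can be checked numerically; a quick sketch (my own; index 0 is X, 1 is Y, 2 is Z):

```python
from itertools import product
from math import log2
from collections import defaultdict

# Joint distribution of (X, Y, Z) with Z = xor(X, Y) and X, Y fair coins.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(idxs):
    """Joint entropy (bits) of the variables at the given indices."""
    marg = defaultdict(float)
    for key, p in joint.items():
        marg[tuple(key[i] for i in idxs)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def gain(obs):
    """Information gain about Z (index 2) from observing the indices in obs."""
    return H((2,)) - (H(tuple(obs) + (2,)) - H(tuple(obs)))

lhs = gain((0,)) - gain(())       # marginal gain of X on its own
rhs = gain((0, 1)) - gain((1,))   # marginal gain of X after already observing Y
print(lhs, rhs)                   # 0.0 1.0 -- the gain grows with the set: not submodular
```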

To get around this, they use a clever structuring of the sets which is:

For R, S which are disjoint subsets of V, s.t. variables in S are indep. given R, and for A which is a subset of (S U R), there is a function (related to the information gain) which is submodular:

G(A) = H(R) – H(R \ A | A)

Now this is pretty nice: in the case of a one-step DBN, R can be the set of features at time step t, and S the set of features at time step (t+1), as all elements at time (t+1) are independent given those at t, as long as there are no synchronic arcs.

The problem, as I see it, for our uses is that G is a function of the entropy at t, but we care about the entropy at (t+1), so we would need the Rs to be Ss in G() in order to use that function for what we need.

There is still useful stuff here, though: for example, if we assume that a domain is submodular, and it is, then a greedy algorithm doing feature selection achieves at least a (1 – 1/e) fraction of the value of the optimal set.


Thoughts on KWIK Learning the Stock Game

In my previous post, I wrote about using entropy to find the structure of the stock game.  In discussions afterward, it came up whether it was provable that the method described was KWIK (Knows What It Knows – Li, Littman, Walsh – ICML 2008).

István pointed out that this might be difficult due to the log factor in the equation for entropy, but I think I figured out a way around it; the basic answer is to exponentiate, working with 2^{H(X)} instead of H(X), which removes the log factor.  I wrote a quick breakdown of the math steps; it’s available here: Removing the log from the equation for entropy [note – Binomial should read Bernoulli].

From there I was unsure what to do, as there didn’t seem to be anything simple to do as a next step in that line of thinking.  Reading over the KWIK paper again, though, gave me exactly what I was looking for: estimating the probability of a Bernoulli distribution is KWIK learnable.

So the algorithm I used was just using the conditional Bernoulli distributions to compute the entropy.  The question is: does anything weird happen in the calculations done with those estimated probabilities that breaks the KWIK bound?  Basically, does an accurate estimate of the probability imply an accurate estimate of the entropy?  More specifically, can the product of a number of KWIK-learned values fail to be KWIK for some reason?
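To make the "remove the log" step concrete: for a Bernoulli(p), 2^{H(p)} = p^{-p} (1-p)^{-(1-p)}, which contains no logarithms.  A quick sanity check (my own sketch; the linked write-up is the real derivation):

```python
from math import log2

def H(p):
    """Entropy (bits) of a Bernoulli(p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def two_to_H(p):
    """2**H(p) rewritten without logs: p**-p * (1-p)**-(1-p)."""
    return 1.0 if p in (0.0, 1.0) else p ** -p * (1 - p) ** -(1 - p)

# The two forms agree across the range of p.
for p in (0.1, 0.3, 0.5, 0.9):
    assert abs(2 ** H(p) - two_to_H(p)) < 1e-12
print(round(two_to_H(0.5), 10))  # 2.0 -- a fair coin has "perplexity" 2
```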


Solving the Stock Game in 300 Steps

In my last post, I wrote about using entropy to find the order of a few extremely simplified simulations of Pitfall.  This post will use the same approach, but will instead be used to find the structure of the Stock Game, which I know from “Efficient Structure Learning in Factored-State MDPs,” Alexander L. Strehl, Carlos Diuk and Michael L. Littman, AAAI 2007.

The game is another extremely simplified simulation, this time of the dynamics of a stock market.  Basically the system is composed of a number of sectors, with each sector composed of a number of stocks.  The stocks across sectors are independent of each other, but within a sector, the probability of any stock rising is defined as (if it doesn’t rise, it falls):

p(rising) = 0.1 + 0.8*(#stocksRoseInSector/#stocksInSector)

Where the number that rose/dropped is based on the state in the previous step.

So stocks in a sector have predictive power over each other, but not across sectors.  In order to most efficiently learn a model of the dynamics of the market, this structure needs to be discerned.  Additionally, the reward per turn is: for each sector, if that sector is owned, +1 for each stock that went up and -1 for each stock that went down.  If the sector is not owned, 0 reward is earned from that sector.
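A minimal simulator for one sector under this rule (my own sketch; I'm assuming the rise probability uses the fraction of the sector's stocks that rose last step, so it stays within [0.1, 0.9]):

```python
import random

def step_sector(stocks, base=0.1, weight=0.8):
    """One transition of a sector: each stock independently rises (1) with
    probability base + weight * (fraction of the sector that rose last step),
    otherwise falls (0)."""
    p_rise = base + weight * sum(stocks) / len(stocks)
    return [1 if random.random() < p_rise else 0 for _ in stocks]

random.seed(0)
sector = [1, 0]            # last step: one stock rose, one fell
for _ in range(5):
    sector = step_sector(sector)
print(sector)
```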

Just like before, I looked at the difference in conditional entropies.  In this case, however, the environment is stochastic, so I couldn’t just wait until the entropy goes to zero to stop.  Additionally, it is undesirable to take any conditioning statement just because it reduces entropy by some small amount, which can easily happen due to noise.

In this case, I considered that variable_2 has predictive power over variable_1 if conditioning on variable_2 leads to a 10% or greater reduction in entropy of variable_1.

Here, I did not specify any maximum (or minimum) number of parents allowed for each item.  Additionally, I also did not restrict the model to search over only structures that were order 1; it was free to search further in the past to reduce the entropy even more.  In this case, since the actual process is order 1, it is nice that the algorithm finds a solution that is also order 1.

Even though I ruined the punchline with the title, this approach seemed to work quite well, learning the dynamics and reward structure of the stock game with 3 sectors and 2 stocks/sector in about 300 steps of simulation (actions were selected randomly).   As you would expect, the output of the program looks like:

Stock:  2_0:, predictors:  [('2_0', -1), ('2_1', -1)]
Stock:  2_1:, predictors:  [('2_0', -1), ('2_1', -1)]
Stock:  1_1:, predictors:  [('1_0', -1), ('1_1', -1)]
Stock:  1_0:, predictors:  [('1_0', -1), ('1_1', -1)]
Stock:  0_0:, predictors:  [('0_1', -1), ('0_0', -1)]
Stock:  0_1:, predictors:  [('0_1', -1), ('0_0', -1)]
Reward predictors:  [('1_0', -1), ('own_2', -1), ('own_0', -1),
    ('2_0', -1), ('own_1', -1), ('0_0', -1), ('2_1', -1),
    ('0_1', -1), ('1_1', -1)]

Where stock x_y is the yth stock in sector x, the -1 just means the predictor is from the previous time step, and own_w means that sector w is owned.  It correctly isolates all the stocks into the right sectors and identifies that the reward function depends on all stocks and all ownership values.

I think this is good because the published result from Strehl et al. seems to learn the structure in about 1,500 steps of simulation, so this algorithm can accomplish the same thing in 1/5 the time of SLF-Rmax.  Basically, we can just use this algorithm to learn the structure and then pass that along to factored-Rmax, which takes right off.

About the empirical computational costs: this approach will solve the 2×2 (2 sectors, 2 stocks/sector) in less than 20 iterations, and seems to need about 5,000 to solve the 3×3, so I’m not sure yet exactly how the number of steps required grows as the size of the state space grows exponentially (16 states for 2×2, 64 for 3×2, 512 for 3×3), but I’m sure I can figure that out simply enough.  Those are kind of scary numbers when I think of them in terms of Pitfall, though… Not sure if the other things I discussed would be any better; my guess is not, as computing the entropy of one distribution is just linear in the number of samples (and constant in memory), as opposed to quadratic (potentially in both) for many other methods.

Using entropy to find the order of a process

In a previous post, I wrote about using regression to infer the order of a process, which was motivated by trying to find an approach to solving Pitfall.

The idea was that the same approach could be used to not only find the order of a process, but also the causal structure, as would be described in a temporal Bayesian network.

But, after thinking about it, it seems like there are more principled ways of doing the same thing, especially because everything in Pitfall is integer (not real) valued, and deterministic given the right model.  In line with my attempts to sneak information theory into model building, it seemed like we could use entropy to decide whether or not to include a variable in the prediction of a target value.

In a gross simplification, I considered a simulation of the crocodile where the data is just a list of  {0,1}*, where 0 indicates a closed mouth and 1 an open mouth.  Assume that the sequence actually looks like {0^k, 1^k}*.  In this case, we know from inspection that k is the order, but how can entropy help us?

Consider the case where k = 5, so our sequence looks like: [0,0,0,0,0,1,1,1,1,1,0,…]

If we are given no data and need to predict the next state, basically we’re hosed.  We know from the data (imagine it’s a very long sequence, or that its length is some multiple of 10) that there is an equal probability of the next state being a 0 or 1.  In this case, as with a fair coin, the entropy of the prediction is 1.

Now consider being given data which is the state at the present.  Consider the case it is 0.  If we are looking at one of the “first” 4 zeros in the sequence above, the next value will be 0.  If we are looking at the “fifth” zero in the sequence, then the next value will be 1.  In the case where the data is split (4/5), (1/5), the entropy is approximately 0.722.

You can imagine that as more history of the trajectory is given, the entropy will decrease further until it is 0.

I wrote up a quick experiment of the above scenario, and this is what the result looks like:

Order: 0 Entropy: 1.0
Order: 1 Entropy: 0.721928094887
Order: 2 Entropy: 0.649022499567
Order: 3 Entropy: 0.550977500433
Order: 4 Entropy: 0.4
Order: 5 Entropy: 0.0
Order: 6 Entropy: 0.0

Just as you would expect.  I think using information gain instead of the direct entropy will give you basically the same result (just with differences instead).
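A compact version of this experiment (my reconstruction, not the original code): estimate H(next symbol | previous k symbols) from empirical counts and sweep k.

```python
from math import log2
from collections import Counter, defaultdict

def conditional_entropy(seq, k):
    """Estimate H(next | previous k symbols) in bits from empirical counts."""
    contexts = defaultdict(Counter)
    for i in range(k, len(seq)):
        contexts[tuple(seq[i - k:i])][seq[i]] += 1
    n = len(seq) - k
    h = 0.0
    for counts in contexts.values():
        total = sum(counts.values())
        # Weight each context's entropy by how often that context occurs.
        h += (total / n) * -sum(c / total * log2(c / total) for c in counts.values())
    return h

seq = ([0] * 5 + [1] * 5) * 200   # the k = 5 crocodile sequence
for k in range(7):
    print(k, round(conditional_entropy(seq, k), 3))
# Entropy falls with k and hits 0.0 once k reaches the true order, 5.
```

The same function run on the vine sequence below gives the second table.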

Similarly, consider another gross simplification of the vine swinging.  Think of this as a process that ranges back and forth between some minimum and maximum values.  In this case, I modeled it as a process that looks like: [0,1,2,3,4,3,2,1,0,…]

In this case, we know it’s a second-order Markov process because if we know only the current value, we don’t know if it is increasing or decreasing, but with 2 steps of history we know both.  Running the same experiment on this process yields:

Order: 0 Entropy: 2.25
Order: 1 Entropy: 0.75
Order: 2 Entropy: 0.0
Order: 3 Entropy: 0.0

Note that the entropy in a discrete case is bounded by log_2(#possible values).  In this case, log_2(5) ≈ 2.32.  The 2.25 is slightly lower because 0 and 4 occur less often than 1, 2, 3, so the distribution is almost, but not quite, uniform.
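The 2.25 figure checks out: over one period of the cycle [0,1,2,3,4,3,2,1], the values 0 and 4 each appear once in eight steps and 1, 2, 3 twice each.

```python
from math import log2

# Symbol frequencies over one period of [0, 1, 2, 3, 4, 3, 2, 1].
probs = [1/8, 2/8, 2/8, 2/8, 1/8]
h = -sum(p * log2(p) for p in probs)
print(h)  # 2.25
```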

I think that’s pretty cool, because it gives some more justification to basically the same thing I was doing earlier in a very ad-hoc manner.  Also, this is computationally cheaper than doing most forms of regression.  Of course, this can be used in an OO manner by attempting to measure the entropy not only with respect to the order of the object itself, but also by including the state of other objects.