In a previous post, I wrote about using regression to infer the order of a process, which was motivated by trying to find an approach to solving Pitfall.

The idea was that the same approach could be used to not only find the order of a process, but also the causal structure, as would be described in a temporal Bayesian network.

But, after thinking about it, it seems like there are more principled ways of doing the same thing, especially because everything in Pitfall is integer (not real) valued, and deterministic given the right model. In line with my attempts to sneak information theory into model building, it seemed like we could use entropy to decide whether or not to include a variable in the prediction of a target value.

In a gross simplification, I considered a simulation of the crocodile where the data is just a list of {0,1}*, where 0 indicates a closed mouth and 1 an open mouth. Assume that the sequence actually looks like {0^k, 1^k}*. In this case, we know from inspection that k is the order, but how can entropy help us?

Consider the case where k = 5, so our sequnce looks like: [0,0,0,0,0,1,1,1,1,1,0,…]

If we are given no data, and need to predict the next state, basically we’re hosed. We know from the data (imagine its a very long sequence, or that the length is some multiple of 10) that there is an equal probability of the next state being a 0 or 1. In this case, as with an even coin, the entropy of prediction is = 1.

Now consider being given data which is the state at the present. Consider the case it is 0. If we are looking at one of the “first” 4 zeros in the sequence above, the next value will be 0. If we are looking at the “fifth” zero in the sequence, then the next value will be 1. In the case where the data is split (4/5), (1/5), the entropy is approximately 0.722.

You can imagine that as more history of the trajectory is given, the entropy will decrease further until it is 0.

I wrote up a quick experiment of the above scenario, and this is what the result looks like:

Order: 0 Entropy: 1.0 Order: 1 Entropy: 0.721928094887 Order: 2 Entropy: 0.649022499567 Order: 3 Entropy: 0.550977500433 Order: 4 Entropy: 0.4 Order: 5 Entropy: 0.0 Order: 6 Entropy: 0.0

Just as you would expect. I think using information gain instead of the direct entropy will give you basically the same result (just with differences instead).

Similarly, consider another gross simplification of the vine swinging. Think of this as a process that ranges back and forth between some minimum and maximum values. In this case, it modeled it as a process that looks like: [0,1,2,3,4,3,2,1,0,…]

In this case, we know its a 2-order Markov process because if we know only the current value, we don’t know if it is increasing or decreasing, but with 2 steps of history we know both. Running the same experiment on this process yields:

Order: 0 Entropy: 2.25 Order: 1 Entropy: 0.75 Order: 2 Entropy: 0.0 Order: 3 Entropy: 0.0

Note that the entropy in a discrete case is bounded by log_2(#possible values). In this case, log_2(5) = 2.32. The 2.25 is slightly lower because 0 and 4 occur less often than 1,2,3, so it is almost, but not quite uniform.

I think thats pretty cool, because it gives some more justification to basically the same thing I was doing earlier in a very ad-hoc manner. Also, this is computationally cheaper than doing most forms of regression. Of course, this can be used in an OO manner by attempting to measure the entropy not only by the order of the object itself, but also by the including the state of other objects.

nice. can you clarify the advantages of entropy over a simpler auto-correlation?

Well, from what I understand, correlation isn’t a great thing to test since it only measures the degree that two variables are linearly related, which means that it is possible for two variables to have very strong functional relationships, but in many cases the measure of correlation will still be low, or even 0.

For example, x and sin(x) would have 0 correlation, even if there is a very strong dependence between the two variables. That isn’t very good. Right?

Sorry, “autocorrelation” means identifying repeating patterns by correlating the pattern with shifted versions of itself.