In my last post, I wrote about using entropy to find the order of a few extremely simplified simulations of Pitfall. This post uses the same approach, but applies it to finding the structure of the Stock Game, which I am familiar with from "Efficient Structure Learning in Factored-state MDPs," Alexander L. Strehl, Carlos Diuk, and Michael L. Littman, AAAI 2007.

The game is another extremely simplified simulation, this time of the dynamics of a stock market. The system is composed of a number of sectors, each containing a number of stocks. Stocks in different sectors are independent of each other, but within a sector, the probability of any stock rising is (if a stock doesn't rise, it falls):

p(rising) = 0.1 + 0.8*(#stocksRoseInSector/#stocksInSector)

Where the number of stocks that rose is taken from the state at the previous step.

So stocks in a sector have predictive power over each other, but not across sectors. To learn a model of the market dynamics most efficiently, this structure needs to be discerned. Additionally, the reward per turn is: for each owned sector, +1 for each stock in it that went up and -1 for each stock that went down; a sector that is not owned yields 0 reward.
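To make the dynamics and reward concrete, here is a minimal sketch of the game as I've described it (the function names and state representation are my own; I assume the rise probability uses the fraction of stocks in the sector that rose on the previous step):

```python
import random

def step(state, num_sectors, stocks_per_sector):
    """Advance the market one step. Each stock rises with probability
    0.1 + 0.8 * (fraction of stocks in its sector that rose last step).
    state maps (sector, stock_index) -> 1 if it rose, else 0."""
    new_state = {}
    for s in range(num_sectors):
        rose = sum(state[(s, i)] for i in range(stocks_per_sector))
        p_rise = 0.1 + 0.8 * rose / stocks_per_sector
        for i in range(stocks_per_sector):
            new_state[(s, i)] = 1 if random.random() < p_rise else 0
    return new_state

def reward(state, owned):
    """+1 per rising stock and -1 per falling stock, counted only in
    sectors the agent owns; unowned sectors contribute 0."""
    total = 0
    for (sector, _), rose in state.items():
        if owned[sector]:
            total += 1 if rose else -1
    return total
```

Note that the probability is always between 0.1 and 0.9, so no stock's behavior is ever fully determined by its sector.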

Just like before, I looked at the difference in conditional entropies. In this case, however, the environment is stochastic, so I couldn't just wait for the entropy to reach zero before stopping. Additionally, it is undesirable to accept a conditioning variable just because it reduces entropy by some small amount, which can easily happen due to noise.

In this case, I considered variable_2 to have predictive power over variable_1 if conditioning on variable_2 leads to a 10% or greater reduction in the entropy of variable_1.

Here, I did not specify any maximum (or minimum) number of parents allowed for each variable. I also did not restrict the search to order-1 structures; the algorithm was free to look further into the past to reduce the entropy even more. Since the actual process is order 1, it is nice that the algorithm finds a solution that is also order 1.
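A sketch of that search over lagged parents, as I understand it: greedily condition on additional (variable, lag) pairs as long as each one clears the 10% reduction threshold. The function names, candidate ordering, and max-lag cap here are my own assumptions, not the exact implementation:

```python
from collections import Counter
from math import log2

def cond_entropy(xs, contexts):
    """H(X | context), where contexts is a list of tuples paired with xs."""
    n = len(xs)
    groups = {}
    for x, c in zip(xs, contexts):
        groups.setdefault(c, []).append(x)
    def ent(vals):
        counts = Counter(vals)
        m = len(vals)
        return -sum(k / m * log2(k / m) for k in counts.values())
    return sum(len(g) / n * ent(g) for g in groups.values())

def greedy_parents(target, candidates, max_lag=2, threshold=0.1):
    """target: time series of the variable to predict.
    candidates: dict name -> time series of a potential parent.
    Returns (name, -lag) pairs, adding any lagged variable whose
    inclusion cuts the remaining conditional entropy by >= threshold."""
    T = len(target)
    xs = target[max_lag:]          # predict target[t] from lagged values
    contexts = [() for _ in xs]    # conditioning tuple per sample
    parents = []
    best_h = cond_entropy(xs, contexts)
    improved = True
    while improved and best_h > 0:
        improved = False
        for name, series in candidates.items():
            for lag in range(1, max_lag + 1):
                if (name, -lag) in parents:
                    continue
                new_ctx = [c + (series[t - lag],)
                           for c, t in zip(contexts, range(max_lag, T))]
                h = cond_entropy(xs, new_ctx)
                if best_h > 0 and (best_h - h) / best_h >= threshold:
                    parents.append((name, -lag))
                    contexts, best_h = new_ctx, h
                    improved = True
    return parents
```

On a variable that deterministically copies another variable's previous value, this returns just that one lag-1 parent, matching the order-1 structures reported below.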

Even though I ruined the punchline with the title, this approach worked quite well, learning the dynamics and reward structure of the stock game with 3 sectors and 2 stocks/sector in about 300 steps of simulation (actions were selected randomly). As you would expect, the output of the program looks like:

Stock: 2_0:, predictors: [('2_0', -1), ('2_1', -1)]
Stock: 2_1:, predictors: [('2_0', -1), ('2_1', -1)]
Stock: 1_1:, predictors: [('1_0', -1), ('1_1', -1)]
Stock: 1_0:, predictors: [('1_0', -1), ('1_1', -1)]
Stock: 0_0:, predictors: [('0_1', -1), ('0_0', -1)]
Stock: 0_1:, predictors: [('0_1', -1), ('0_0', -1)]
Reward predictors: [('1_0', -1), ('own_2', -1), ('own_0', -1),
('2_0', -1), ('own_1', -1), ('0_0', -1), ('2_1', -1),
('0_1', -1), ('1_1', -1)]

Where stock *x*_*y* is the *y*th stock in sector *x*, the -1 means the predictor is from the previous time step, and own_*w* means that sector *w* is owned. The algorithm correctly groups all the stocks into the right sectors and identifies that the reward function depends on all stocks and all ownership values.

I think this is good because the published result from Strehl et al. seems to learn the structure in about 1,500 steps of simulation, so this algorithm accomplishes the same thing in 1/5 the time of SLF-Rmax. We can use this algorithm to learn the structure and then just hand it to factored-Rmax, which takes off from there.

As for empirical computational costs: this approach solves the 2×2 game (2 sectors, 2 stocks/sector) in fewer than 20 iterations, but seems to need about 5,000 to solve the 3×3. So I'm not sure yet exactly how the number of steps required grows as the size of the state space grows exponentially (16 states for 2×2, 64 for 3×2, 512 for 3×3), but I'm sure I can figure that out simply enough. Those are kind of scary numbers when I think of them in terms of Pitfall, though… I'm not sure whether the other approaches I discussed would be any better; my guess is not, since computing the entropy of one distribution is linear in the number of samples (and constant in memory), as opposed to quadratic (potentially in both) for many other methods.