From Near-Optimal Nonmyopic Value of Information in Graphical Models (Krause, Guestrin):
Any submodular function F over a finite set V must satisfy the following diminishing-returns property:

F(A U {B}) – F(A) >= F(A' U {B}) – F(A')

(Where, as you would imagine, U is union)

for every A that is a subset of A', A' a subset of V, and every element B in V \ A' (so B belongs to neither set). Note the inequality goes this way: adding B to the smaller set gains at least as much as adding it to the larger one.
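As a sanity check, the diminishing-returns inequality can be verified by brute force on small ground sets. Here is a minimal sketch (the `is_submodular` helper and the coverage example are mine, not from the paper) that checks every A, A', B triple:

```python
from itertools import chain, combinations

def powerset(v):
    """All subsets of v, as frozensets."""
    s = list(v)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def is_submodular(f, v):
    """Check f(A u {B}) - f(A) >= f(A' u {B}) - f(A') for all A <= A' <= V, B not in A'."""
    for a_prime in powerset(v):
        for a in powerset(a_prime):
            for b in v - a_prime:
                gain_small = f(a | {b}) - f(a)
                gain_large = f(a_prime | {b}) - f(a_prime)
                if gain_small < gain_large - 1e-12:
                    return False
    return True

# Example: a coverage function F(A) = |union of the sets indexed by A|,
# a standard submodular function.
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
cover = lambda a: len(set().union(*[sets[i] for i in a]))
print(is_submodular(cover, set(sets)))  # True
```

Brute force is exponential, of course; this is only useful for toy instances.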
Now information gain is not submodular in general. Take xor(X, Y) = Z with X, Y independent and P(X=1) = P(Y=1) = 1/2. Then for entropy H(.), the gain from observing Y is larger when X is already known than when nothing is known:

H(Z | X) – H(Z | X, Y) > H(Z) – H(Z | Y)
1 – 0 > 1 – 1
1 > 0

which is the opposite of diminishing returns (either variable alone tells you nothing about Z; together they determine it).
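The xor counterexample is easy to confirm numerically. A small sketch (variable indices 0, 1, 2 for X, Y, Z and the helper names are mine):

```python
import math
from itertools import product

# Joint distribution of (X, Y, Z) with Z = X xor Y and X, Y independent fair bits.
joint = {}
for x, y in product([0, 1], repeat=2):
    joint[(x, y, x ^ y)] = 0.25

def H(p):
    """Entropy in bits of a distribution {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(idx):
    """Marginal distribution over the coordinates in idx of (X, Y, Z)."""
    m = {}
    for outcome, q in joint.items():
        key = tuple(outcome[i] for i in idx)
        m[key] = m.get(key, 0.0) + q
    return m

def cond_H(target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(target + given)) - H(marginal(given))

# Z is index 2, X is 0, Y is 1:
print(cond_H([2], [0]) - cond_H([2], [0, 1]))  # gain of Y once X is known: 1.0
print(H(marginal([2])) - cond_H([2], [1]))     # gain of Y on its own: 0.0
```

The gain of Y given X (1 bit) strictly exceeds the gain of Y alone (0 bits), violating submodularity.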
To get around this, they use a clever structuring of the sets which is:
For R, S which are disjoint subsets of V, such that the variables in S are independent given R, and for A which is a subset of (S U R), there is a function (related to the information gain) which is submodular:
G(A) = H(R) – H(R \ A | A)
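To see G in action, here is a toy instance I made up (not from the paper): R = {Z} and S = {X1, X2}, where each Xi is a noisy copy of Z with flip probability 0.2, so the variables in S are conditionally independent given R, as the theorem requires. Brute force confirms G is submodular here:

```python
import math
from itertools import chain, combinations, product

# Toy naive-Bayes model: indices 0 = Z, 1 = X1, 2 = X2.
# Each Xi equals Z with prob 0.8, so X1, X2 are independent given Z.
joint = {}
for z, x1, x2 in product([0, 1], repeat=3):
    p = 0.5
    for x in (x1, x2):
        p *= 0.8 if x == z else 0.2
    joint[(z, x1, x2)] = p

def H_of(idx):
    """Joint entropy in bits of the variables at the given indices."""
    m = {}
    for outcome, q in joint.items():
        key = tuple(outcome[i] for i in sorted(idx))
        m[key] = m.get(key, 0.0) + q
    return -sum(q * math.log2(q) for q in m.values() if q > 0)

R = {0}            # Z
V = {0, 1, 2}      # Z, X1, X2

def G(a):
    """G(A) = H(R) - H(R \\ A | A), using H(R\\A | A) = H((R\\A) u A) - H(A)."""
    rest = R - a
    return H_of(R) - (H_of(rest | a) - H_of(a))

def powerset(v):
    s = list(v)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

# Brute-force diminishing-returns check over all A <= A' <= V, b not in A'.
ok = all(G(a | {b}) - G(a) >= G(ap | {b}) - G(ap) - 1e-12
         for ap in powerset(V) for a in powerset(ap) for b in V - ap)
print(ok)  # True
```

Note that once A contains all of R, G(A) tops out at H(R), which matches the intuition that G measures how much of R's entropy A accounts for.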
Now this is pretty nice: in the case of a one-step DBN, R can be the set of features at time step t, and S can be the set of features at time step (t+1), since all elements at time (t+1) are independent given those at t, as long as there are no synchronic arcs.
The problem, as I see it, for our uses is that G is then a function of the entropy at time t, but we care about the entropy at (t+1), so we would need the roles of R and S swapped in G() in order to use that function directly for what we need.
There is still useful stuff here though: for example, if the objective for a domain is monotone submodular, and here it is, then a greedy algorithm doing feature selection achieves at least a (1 – 1/e) fraction of the value of the optimal set (the classic Nemhauser-Wolsey-Fisher guarantee).
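The greedy algorithm behind that guarantee is just "repeatedly add the element with the largest marginal gain". A minimal sketch, again using a made-up coverage objective as a stand-in for the information-gain objective:

```python
def greedy_select(f, v, k):
    """Greedily maximize a monotone submodular f under the constraint |A| <= k.

    For such f, the greedy set's value is at least (1 - 1/e) of the optimum.
    """
    chosen = set()
    for _ in range(k):
        candidates = v - chosen
        if not candidates:
            break
        best = max(candidates, key=lambda b: f(chosen | {b}) - f(chosen))
        chosen.add(best)
    return chosen

# Example with a coverage objective (monotone and submodular):
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"a", "b", "c"}}
cover = lambda a: len(set().union(*[sets[i] for i in a]))
picked = greedy_select(cover, set(sets), 2)
print(picked, cover(picked))  # covers all 4 elements with 2 picks
```

Greedy first grabs set 4 (gain 3), then set 3 (gain 1), covering everything; here greedy happens to be optimal, but in general it is only guaranteed the (1 – 1/e) factor.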