From Near-Optimal Nonmyopic Value of Information in Graphical Models (Krause, Guestrin):

Any submodular function *F* over a finite set *V* must satisfy the following diminishing-returns property:

*F*(*A* U *B*) – *F*(*A*) >= *F*(*A’* U *B*) – *F*(*A’*)

(where, as you would imagine, U is union)

Where *A* is a subset of *A’*, *A’* is a subset of *V*, and *B* is an element of *V* not in *A’*.
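To make the diminishing-returns condition concrete, here is a minimal Python check on a toy coverage function (my own example, not from the paper); set coverage is a classic submodular function:

```python
def coverage(A, sets):
    """F(A) = size of the union of the sets chosen by A (submodular)."""
    covered = set()
    for i in A:
        covered |= sets[i]
    return len(covered)

# Hypothetical ground sets: three overlapping sets.
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

A = frozenset({1})           # A is a subset of A'
A_prime = frozenset({1, 2})
B = 3                        # B is not in A'

# Marginal gain of adding B at the smaller set vs. the larger set.
gain_small = coverage(A | {B}, sets) - coverage(A, sets)
gain_large = coverage(A_prime | {B}, sets) - coverage(A_prime, sets)

assert gain_small >= gain_large  # F(A U B) - F(A) >= F(A' U B) - F(A')
print(gain_small, gain_large)    # 3 2
```

The marginal value of *B* shrinks as the base set grows, which is exactly the inequality above.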

Now information gain on the *xor* function is not submodular: take *xor*(*X*, *Y*) = *Z* with *X*, *Y* independent and *P*(*X*=1) = *P*(*Y*=1) = 1/2. Then for entropy *H*(·), the marginal gain of observing *Y* once *X* is already known exceeds the marginal gain of observing *Y* alone:

*H*(*Z*|*X*) – *H*(*Z*|*X*, *Y*) > *H*(*Z*) – *H*(*Z*|*Y*)

1 > 0

which is the opposite of diminishing returns.
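A quick numeric sanity check of this counterexample (my own sketch, not code from the paper), enumerating the four equally likely worlds:

```python
from collections import defaultdict
from math import log2

# The four equally likely worlds (x, y, z) with z = x xor y.
worlds = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
p = 1 / len(worlds)

def H(idx_target, idx_given):
    """Conditional entropy H(target | given); indices pick world components."""
    pg, ptg = defaultdict(float), defaultdict(float)
    for w in worlds:
        g = tuple(w[i] for i in idx_given)
        t = tuple(w[i] for i in idx_target)
        pg[g] += p
        ptg[(t, g)] += p
    return -sum(q * log2(q / pg[g]) for (t, g), q in ptg.items() if q > 0)

X, Y, Z = 0, 1, 2
# Marginal gain of observing Y after X is already observed:
gain_after_X = H([Z], [X]) - H([Z], [X, Y])
# Marginal gain of observing Y on its own:
gain_first = H([Z], []) - H([Z], [Y])
print(gain_after_X, gain_first)  # 1.0 0.0 -- the later gain is larger
```

Observing *Y* alone tells you nothing about *Z*, but after *X* it tells you everything, so information gain violates submodularity here.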

To get around this, they use a clever structuring of the sets:

For *R*, *S* which are disjoint subsets of *V*, s.t. the variables in *S* are independent given *R*, and for any *A* which is a subset of (*S* U *R*), there is a function (related to the information gain) which is submodular:

*G*(*A*) = *H*(*R*) – *H*(*R* \ *A* | *A*)
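As a sanity check, here is a brute-force numerical verification of *G*'s submodularity on a hypothetical toy model: one variable in *R* and two conditionally independent noisy observations of it in *S* (the model and the noise level are my own choices, not from the paper):

```python
from collections import defaultdict
from math import log2
from itertools import product, combinations

# World tuples are (r, s1, s2): s1, s2 are independent noisy copies of r.
# R = {0}, S = {1, 2}, so S's variables are independent given R.
eps = 0.1  # hypothetical observation noise
worlds = {}
for r, s1, s2 in product((0, 1), repeat=3):
    p = 0.5
    p *= (1 - eps) if s1 == r else eps
    p *= (1 - eps) if s2 == r else eps
    worlds[(r, s1, s2)] = p

def cond_entropy(target, given):
    pg, ptg = defaultdict(float), defaultdict(float)
    for w, p in worlds.items():
        g = tuple(w[i] for i in given)
        t = tuple(w[i] for i in target)
        pg[g] += p
        ptg[(t, g)] += p
    return -sum(q * log2(q / pg[g]) for (t, g), q in ptg.items() if q > 0)

R = frozenset({0})
V = frozenset({0, 1, 2})

def G(A):
    # G(A) = H(R) - H(R \ A | A)
    return cond_entropy(sorted(R), []) - cond_entropy(sorted(R - A), sorted(A))

def subsets(s):
    return [frozenset(c) for k in range(len(s) + 1) for c in combinations(s, k)]

# Check G(A U {b}) - G(A) >= G(A' U {b}) - G(A') for all A within A', b not in A'.
ok = all(
    G(A | {b}) - G(A) >= G(Ap | {b}) - G(Ap) - 1e-12
    for Ap in subsets(V) for A in subsets(Ap) for b in V - Ap
)
print(ok)  # True
```

Observing the *R* variable itself drives *G* to *H*(*R*) immediately, and additional noisy observations give diminishing gains, consistent with the theorem.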

Now this is pretty nice: in the case of a one-step DBN (dynamic Bayesian network), *R* can be the set of features at a time step *t*, and *S* can be the set of features at time step (*t*+1), as all elements at time (*t*+1) are independent given those at *t*, as long as there are no synchronic arcs.

The problem, as I see it for our uses, is that *G* measures information about *R*, i.e. the entropy at time *t*, but we care about the entropy at (*t*+1), so we would need the roles of the *R*s and *S*s swapped in *G*(·) in order to just be able to use that function for what we need.

There is still useful stuff here though: for example, if we assume that a domain's objective is monotone submodular, and ours is, then a greedy algorithm doing feature selection achieves at least a (1 – 1/*e*) fraction of the value of the optimal set.
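A sketch of that guarantee on a toy coverage problem (the "features" and the choice of coverage as the objective are hypothetical, not from the paper). Greedy happens to match the optimum here; in general it is guaranteed at least (1 – 1/e) ≈ 0.63 of it:

```python
from itertools import combinations
from math import e

# Hypothetical features, each "covering" some ground elements.
sets = {
    "f1": {1, 2, 3}, "f2": {3, 4}, "f3": {4, 5, 6}, "f4": {1, 6}, "f5": {7},
}

def F(A):
    """Monotone submodular objective: number of elements covered by A."""
    return len(set().union(*(sets[a] for a in A)))

def greedy(k):
    """Pick k features, each time taking the one with the largest gain."""
    chosen = []
    for _ in range(k):
        best = max(sorted(set(sets) - set(chosen)), key=lambda a: F(chosen + [a]))
        chosen.append(best)
    return chosen

k = 2
greedy_val = F(greedy(k))
opt_val = max(F(list(c)) for c in combinations(sets, k))  # brute force
assert greedy_val >= (1 - 1 / e) * opt_val  # the Nemhauser et al. bound
print(greedy_val, opt_val)  # 6 6
```

The guarantee (due to Nemhauser, Wolsey, and Fisher) needs *F* monotone, submodular, and *F*(∅) = 0, all of which hold for coverage.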