Category Archives: Submodularity

How Optimized Environmental Sensing helps address Information Overload. Guestrin. Talk 2009

  1. When working with non-submodular functions there is <usually?> a trick that allows the problem to be solved.
    1. That is SATURATE
    2. Basically it’s as good as you can do
    3. Works by artificially allowing more sensors than allowed and then truncating the actual selection
    4. Of course this isn’t for all problems, but a subset of problems that are not natively submodular
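A toy sketch of how I understand SATURATE (my own simplified version, not the authors’ code: the real algorithm takes multiple objectives F_i, binary-searches a saturation level c, and greedily covers the truncated objective while allowing an α-times-larger selection than the budget before truncating):

```python
def saturate(ground, objectives, k, alpha=1.0, eps=1e-6):
    """Toy SATURATE sketch: binary-search the saturation level c; for
    each c, greedily cover the truncated (and hence submodular)
    objective F_c(A) = mean_i min(F_i(A), c), allowing up to alpha*k
    picks.  The real algorithm uses alpha > 1 (a logarithmic factor)."""
    def f_c(A, c):
        return sum(min(F(A), c) for F in objectives) / len(objectives)

    lo, hi = 0.0, min(F(ground) for F in objectives)
    best = set()
    while hi - lo > eps:
        c = (lo + hi) / 2
        A = set()
        # greedy partial cover of the truncated objective
        while (f_c(A, c) < c - eps and len(A) < alpha * k
               and A != ground):
            x = max(ground - A, key=lambda e: f_c(A | {e}, c))
            if f_c(A | {x}, c) <= f_c(A, c) + eps:
                break                      # no element helps any more
            A.add(x)
        if f_c(A, c) >= c - eps:           # level c is achievable
            lo, best = c, A
        else:
            hi = c
    return best
```

Don’t read too much into the constants; this is only meant to show the truncate-then-greedy structure.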

Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization. Golovin, Krause, Ray. Talk (NIPS 2010)


  1. Basic idea is to consider submodular functions, but where outcomes are stochastic and unfold over time.  So instead of making all decisions in one shot (before execution begins), it is better to wait and see how the first decision unfolds before making the second, etc.
    1. Gives the example of influence in social networks – if you want to get people to plug a product you give to a few people for free, see how influence is spreading from the first person before deciding the second
  2. The setting described here is a generalization of standard submodular optimization
  3. Analysis here allows many results from standard submodular optimization to carry over to the adaptive setting
  4. Instead of maximizing the marginal gain, maximizes the expected marginal gain based on the current state, but does this only for the first selection; the results of the first selection on the state then become conditions for the next choice
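A minimal sketch of the adaptive greedy policy described here, under a toy model where the uncertainty is a finite set of hypotheses (realizations) with a prior; all names below are mine, not the paper’s:

```python
def adaptive_greedy(items, hypotheses, prior, f, budget, truth):
    """Toy adaptive greedy: pick the item with the largest *expected*
    marginal gain under the posterior over realizations, observe its
    outcome under the hidden true realization, condition, repeat.
    hypotheses: list of dicts item -> outcome; prior: their weights;
    f: value of a dict of observed (item, outcome) pairs."""
    obs = {}                                    # observations so far
    for _ in range(budget):
        # posterior: keep hypotheses consistent with what we've seen
        post = [(h, w) for h, w in zip(hypotheses, prior)
                if all(h[i] == o for i, o in obs.items())]
        z = sum(w for _, w in post)
        def exp_gain(x):
            return sum(w * (f({**obs, x: h[x]}) - f(obs))
                       for h, w in post) / z
        x = max((i for i in items if i not in obs), key=exp_gain)
        obs[x] = truth[x]       # observe how this pick actually unfolds
    return obs
```

With two equally likely hypotheses that agree on one item and disagree on the rest, the policy first picks the item whose payoff is certain, then conditions on what it saw before choosing again.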
  5. Gets standard bounds of 1-1/e
  6. Because adaptive, can now do things like “take the minimal number of actions needed to achieve X”, which you couldn’t do with standard submodularity (there, you can just say “maximize the benefit of N choices”)
    1. This is a log-approximation ln(n)+1 for stochastic set cover
  7. Another application is optimal decision trees
    1. “Generalized binary search” which is equivalent to max info gain
    2. This problem is adaptive submodular <I thought things like xor in decision trees arent submodular, but maybe because we are talking w.r.t. information gain it is?>
    3. Give bounds for this approximation
    4. “Results require that tests are exact (no noise)!”
  8. If trying to do decision trees in noisy setting, can do a Bayesian version
    1. But generalized binary search, maximizing info gain, and maximizing value of info aren’t adaptive submodular in the noisy setting.  In noisy settings they require exponentially more tests than optimum <Should really read this paper>
  9. Trick is to transform noisy problem into normal problem by making outcome part of the hypothesis
    1. Only need to distinguish between noisy hypotheses that lead to different decisions
  10. Gives bounds for this Bayesian active learning for noisy domains
  11. Gives an example of testing different psychological/economic models of how people perform decision making
    1. For this setting, turns out many common approaches actually do much worse than random

Near-optimal Nonmyopic Value of Information in Graphical Models. Krause, Guestrin. UAI 2005.

<Working off the arxiv version, although it looks the same as the UAI one>

  1. For submodular functions, gives proof of a tractable approach for solving combinatoric problems that is within a factor of (1-1/e-ε) for ε > 0.
    1. Shows this is pretty tight, and that nothing can be better than (1-1/e) unless P=NP
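The tractable approach is the classic greedy algorithm of Nemhauser et al.; a minimal sketch of the cardinality-constrained case (the toy coverage objective in the test is mine, standing in for information gain):

```python
def greedy_max(ground, f, k):
    """Greedy maximization of a monotone submodular set function f
    under a cardinality constraint k: repeatedly add the element with
    the largest marginal gain.  Nemhauser et al. (1978) show this
    achieves at least a (1 - 1/e) fraction of the optimal value."""
    A = set()
    for _ in range(k):
        x = max(ground - A, key=lambda e: f(A | {e}) - f(A))
        A.add(x)
    return A
```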
  2. In terms of sensor networks, the goal is to maximize information gain, or equivalently, “decrease uncertainty about unobserved variables”
  3. A common criterion for node selection in sensor placement is to pick the location that has maximal entropy, given the previously selected locations.  This method has a fault, however, in that it will push sensed locations to the edges of the space.  The problem with this, in particular, is that sensors then have about half of their possible observation area (if it’s spherical) clipped by the boundaries
    1. That’s why the approach here, which is to minimize the entropy of unsensed locations, is better; it ensures more even coverage by sensors
  4. A set function is submodular if <among other possible things> it satisfies the diminishing returns property, that is, adding a new element to a set yields at least as large a gain in F as adding that same element to any superset of the set
    1. The marginal increase of X at A is F(A ∪ {X}) − F(A)
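The diminishing-returns definition can be checked by brute force on small ground sets; a sketch (my own helper, exponential in the ground-set size, for illustration only):

```python
from itertools import combinations

def is_submodular(ground, f):
    """Brute-force check of diminishing returns: for every A subset of
    B and every x outside B, the marginal gain of x at A must be at
    least the marginal gain of x at B."""
    subsets = [frozenset(s) for r in range(len(ground) + 1)
               for s in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                for x in ground - B:
                    if f(A | {x}) - f(A) < f(B | {x}) - f(B) - 1e-12:
                        return False
    return True
```

Coverage-style functions pass this check; something like f(A) = |A|² (where elements help more in larger sets) fails it.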
  5. Exact maximization of submodular functions is NP hard (by reduction from max-cover)
  6. F also must be non-decreasing (like information, per the “information never hurts” principle; adding more sensors can never reduce the amount of information you have)
    1. So dealing with entropy is submodular.  Information gain is as well
  7. Basic theoretical results are from G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 1978.
  8. Information gain is not always submodular <xor pattern in decision trees>, but given independence assumptions it is
  9. “A convenient property of submodular functions is that they are closed under positive linear combinations. This allows us, for example, to have temperature models for different times of the day, and select sensors that are most informative on average for all models.”
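A one-liner makes the closure property concrete (the toy per-model objectives in the test are mine, standing in for the paper’s per-time-of-day temperature models):

```python
def combine(fs, weights):
    """Positive linear combination of set functions: if every f in fs
    is submodular, the (nonnegatively) weighted sum is submodular too,
    e.g. averaging how informative a sensor set is across models for
    different times of day."""
    return lambda A: sum(w * f(A) for f, w in zip(fs, weights))
```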
  10. Versions with budgets, individual unique costs per item
  11. Unit-cost case has quadratic cost, the budgeted case is O(n^5), but if willing to take another hit on accuracy this can drop down to cubic
  12. Has links to simulated annealing
  13. <Skipping the proof that the greedy bound is tight with any non-exponential solution>
  14. The ε in the proof comes from the fact that accurately computing the necessary conditional entropies is #P-hard <I’m actually not familiar with what that means offhand, but based on context it must mean harder than poly>, so a method of estimating entropies is used instead.
  15. <on to experimental results>
  16. In the first experimental result they place temperature sensors around their lab.  The quality of predictions with their method using 5 sensors is better than 20 sensors placed by the standard entropy approach (that pushes sensors to boundaries and wastes part of their field)
  17. They had to lay down tiles to segment the continuous space of the lab <although in some segments there is more than one sensor, so I’m not sure exactly what they are doing.  Oh, it looks like there are pre-selected candidate positions, which have line-of-sight connectivity, so the hierarchy doesn’t impact that graph but sits on it.  Still not totally sure what’s going on>
  18. Even when the assumptions of conditional independence are violated, the results are still good, worse than standard entropy with less than 5 sensors, but better with more than that
  19. Graph that shows linear run time <I thought it was poly, but they mention a simple and complex model, so one may be linear and the other poly?>
  20. Next results are on highway data.  Like in the temp results used a hierarchical model <although I’m still not sure what that means exactly>
  21. Here, they compared placement based on information gain, entropy, and random choice.  Random actually yielded more information gain than the entropy-based approach: “This indicates that maximizing entropy and information gain appear to be conflicting goals, resulting in qualitatively different behaviors.”