Won best paper award

- “Structured prediction is the problem of learning a function from structured inputs to structured outputs.”
- This paper deals with structured prediction applied to search (although it is not the first such paper)
- “Given a structured input, the framework uses a search procedure guided by a learned heuristic *H* to uncover high quality candidate outputs and then uses a separate learned cost function *C* to select a final prediction among those outputs.”
- The regret can be decomposed into the inability of *H* to find good outputs, and then *C* failing to select the best output found <not sure what these two things mean exactly yet>
- Based on these two factors of regret, regret is minimized first by “training H to quickly uncover high quality outputs via imitation learning, and then training C to correctly rank the outputs generated via H according to their true losses.”
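- <To make this concrete, a minimal sketch of the overall loop as I understand it; `successors`, `H`, and `C` are my own stand-ins for illustration, not the paper’s code:>

```python
# Minimal sketch of HC-Search-style inference (an assumption, not the
# paper's implementation): a search guided by H uncovers candidates,
# and C picks the final prediction among everything uncovered.

def hc_search(x, y0, H, C, successors, max_steps=50):
    """Greedy search guided by H; final pick made by C."""
    visited = [y0]
    y = y0
    for _ in range(max_steps):          # time bound on the search
        neighbors = successors(y)
        if not neighbors:
            break
        # H guides which candidate to expand next (lower = better here)
        y = min(neighbors, key=lambda cand: H(x, cand))
        visited.append(y)
    # C selects the final prediction among all uncovered outputs
    return min(visited, key=lambda cand: C(x, cand))
```

- <This makes the two regret terms visible: if *H* never steers the search to a good output, it is not in `visited`; if it is there but *C* misranks it, the final `min` picks the wrong one.>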
- <Claim> Excellent empirical results <this is probably true otherwise it would not win best paper award>
- Structured prediction is just supervised learning <although it doesn’t seem like this framing is always part of supervised learning>. They leverage a cost function <which I suppose must be equivalent to an error metric>
- In general minimizing the cost function is intractable, and in practice “… heuristic inference methods such as loopy belief propagation and variational inference have shown some success in practice. However, the learning algorithms generally assume exact inference and their behavior in the context of heuristic inference is not well understood.”
- <It seems like the problem is discrete, because they say that if the optimization problem is a tree then it is tractable, and also…>
- “Another approach to dealing with the Argmin problem is to eliminate it altogether and instead attempt to learn a classifier that can greedily produce a structured output by making a series of discrete decisions (…) Unfortunately, in many problems, some decisions are difficult to make by a greedy classifier, but are crucial for good performance.”
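- <The greedy-classifier alternative they describe can be sketched like this; the local `score` function is my own assumption for illustration:>

```python
# Hedged sketch of the greedy-classifier approach quoted above: build a
# structured output one discrete decision at a time, conditioning each
# decision on the ones already made.

def greedy_decode(x, labels, score):
    """Label each position by the locally best choice given earlier ones."""
    output, prev = [], None
    for i in range(len(x)):
        best = max(labels, key=lambda lab: score(x, i, prev, lab))
        output.append(best)
        prev = best          # later decisions condition on this choice,
    return output            # so an early mistake cannot be undone
```

- <The quoted weakness is visible here: there is no backtracking, so one bad local decision is locked in.>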
- They mention a method that seems to be state of the art at the moment, but note that it has a number of serious limitations that will be addressed here
- <I honestly am not sure what type of paper I’m reading here so far. Is it search, optimization, or supervised learning?>
- The decomposition of the problem into *H* and *C* matters: “… the performance of the approaches with a single function can be arbitrarily bad when compared to that of *HC*-Search in the worst case.”
- In practice, *HC*-Search also works better <so it seems like a case where the assumptions in the theory hold up>
- “Finally, we note that our methodology can be viewed as similar in spirit to Re-Ranking (Collins 2000), which uses a generative model to propose a k-best list of outputs, which are then ranked by a separate ranking function. In contrast, we search in efficient search spaces guided by a learned heuristic that has minimal restrictions on its representation.”
- <Seems like> the setting is like standard supervised learning, except you are also provided with a cost function that must be optimized <I suppose most supervised learning algorithms assert what cost they are optimizing, such as MSE, which might not be the case in the problem defined here>
- There is a sort of forward model given <seems like it can do multi-step lookahead, starting with *x* and proposing different possible solutions *y*>
- <So it seems like there is a sort of combinatorial search problem>
- Both the learned heuristic *H* and the learned cost function *C* are mappings from *X × Y* to the reals
- In addition there is some search strategy *A* <seems like here they just assume greedy>

- Given some *x* and some time bound, *A* is run using *H*, and the best item found in terms of *C* during the whole search is returned
- They “… consider the feasibility of an exact solution to this learning problem in the simplest setting of greedy search using linear heuristic and cost functions represented by their weight vectors *w*_{H} and *w*_{C} respectively.”
- They “… consider the *HC-Search Consistency Problem*, where the input is a training set of structured examples, and we must decide whether or not there exists *w*_{H} and *w*_{C} such that *HC*-Search using greedy search will achieve zero loss on the training set.” **This problem is NP-Hard.** In fact, both sub-problems (finding a zero-loss heuristic and finding a zero-loss cost function) are NP-hard
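- <Spelling out the linear setting from the consistency analysis; the joint feature map `phi` is an assumed stand-in:>

```python
# The consistency analysis assumes linear heuristic and cost functions:
# H(x, y) = w_H · phi(x, y) and C(x, y) = w_C · phi(x, y),
# where phi is a joint feature map over input/output pairs (assumed here).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_linear_scorer(w, phi):
    """Return a scoring function (x, y) -> w · phi(x, y)."""
    return lambda x, y: dot(w, phi(x, y))
```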

- Even though the problem is NP-hard, the decomposition into *H* and *C* still helps
- Because the problem is intractable, they resort to greedy stage-wise learning of *H* and *C*
- “Most generally, learning a heuristic can be viewed as a Reinforcement Learning (RL) problem where the heuristic is viewed as a policy for guiding “search actions” and rewards are received for uncovering high quality outputs. In fact, this approach has been explored for structured prediction in the case of greedy search (Wick et al. 2009) and was shown to be effective given a carefully designed reward function and action space. While this is a viable approach, general purpose RL can be quite sensitive to the algorithm parameters and specific definition of the reward function and actions, which can make designing an effective learner quite challenging. Indeed, recent work (Jiang et al. 2012) has shown that generic RL algorithms can struggle for some structured prediction problems, even with significant effort put forth by the designer.”
- They go with a non-RL approach, imitation learning

- Their formulation of imitation learning is distinct from supervised learning in that they only care that it helps the heuristic make correct choices. Basically the only concern is that it gets the ordering of the values of the options correct, as opposed to getting the values themselves right
- This type of strategy is called *rank-based*: “Most common search procedures such as greedy search, beam search, and best-first search fall under this category.”

- The rank-based learner they used is called “the online passive-aggressive algorithm”
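- <A sketch of a single passive-aggressive update applied to the rank-based idea above; this is the generic PA ranking update (Crammer et al. 2006), not necessarily the paper’s exact variant:>

```python
# Hedged sketch of an online passive-aggressive (PA) update on one
# ranking constraint: since only the ordering of candidates matters,
# each update makes the smallest change to w so the better candidate
# outscores the worse one by a margin of 1.

def pa_rank_update(w, feats_better, feats_worse):
    """One PA update enforcing score(better) >= score(worse) + 1."""
    diff = [a - b for a, b in zip(feats_better, feats_worse)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    loss = max(0.0, 1.0 - margin)        # hinge loss on the ordering
    norm_sq = sum(d * d for d in diff)
    if loss == 0.0 or norm_sq == 0.0:
        return w                          # ordering already correct
    tau = loss / norm_sq                  # PA step size
    return [wi + tau * di for wi, di in zip(w, diff)]
```

- <Note how this matches the stated goal: the update fires only when the ordering is wrong (or within the margin), never to fit the absolute scores.>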
- The cost-function learner <seems> to be trained to accurately predict on the distribution of outputs that *A* and *H* yield
- Their experimental results cover large standard test sets, such as handwriting recognition, scene labeling and some others
- “The results in the scene labeling domain are the most significant improving the error rate from 27.05 [the previous state of the art] to 19.71.”
- “One of the advantages of our approach compared to many frameworks for structured prediction is the ability to use more expressive feature spaces without paying a huge computational price.”
- Outperforms other methods basically across the board <are there other state of the art methods here that aren’t listed?>
- Plain *C*-Search (as opposed to *HC*-Search) doesn’t work as well; even with 3rd-order features it performs worse than *HC*-Search with 2nd-order features. The difference in performance can be attributed to the fact that *HC*-Search optimizes *C* for the distribution of outputs found by *H*, which plain *C*-Search does not
- When using the true loss function in place of the learned heuristic *H*, the improvements in performance weren’t significant