## Hashing with graphs. Liu Wang Kumar Chang. ICML 2011

Basically, efficiently computes the graph Laplacian and then thresholds weights to be binary (-1, or +1), but also sets up a way to generalize beyond points in the corpus, which standard Laplacian eigenmaps cannot do (thats one of the benefits of SFA – it behaves as a Laplacian eigenmap run through a function approximator).

1. considers efficient hashing for high dimensional datasubdivides the rankings on a low d manifold
2. graph based. Uses anchor graphs
3. allows for constant time hashing by extrapolating graph laplacian eigenvectors to eigenfunctions
4. Also shows hierarchical threshold learning where each eigenfunction produces multiple bits increasing accuracy
5. For knn search
6. Kd trees are usually ineffective in high d data
7. Instead of kd trees use hashing
8. Early basis of work is on locality sensitive hashing but that approach has some problems that are addressed here
1. One problem is that approach is not data dependent
2. One type is called spectral hashing but that isn’t considered here
9. A problem with existing unsupervised hashing methods is they need a global distance measure. When dealing with manifolds you really want a different metric which this algorithm finds
1. To do this exactly it becomes a quadratic cost which is prohibitive
10. @Anchor graphs approximate the manifold by using a subset of points called anchors
1. Allows for linear time algorithm
11. The eigenvectors of the anchor graph laplacian can be generalized to functions in constant time
12. Hashing on anchor graphs capture semantics so points close in hamming space produced by agh will also have similar semantic labels
13. This works so well that similarity found by this approach outperforms a standard exhaustive search based on l2 norm
14. In an anchor graph you select some small number Kand then apply k means to the points to find the k anchors.
1. All points then have distance computed to all anchors as opposed to pairwise from all points to all points
15. This creates an approximate adjacency matrix that has properties of
1. Nonnegative
2. Sparse
3. Low rank
4. Is doubly stochastic meaning the graph laplacian L = I-\hat{A}, where the last part is simply the anchor approximation of adjacency matrix
16. These properties mean that eigen function extensions of the graph laplacian are possible
17. Because it’s low rank doing eigen decomposition is cheap
18. This process only computes values for points in corpus so generalization is needed in order to hash new data points
19. Eigenvectors are transformed to eigenfunctions by nystrom method which I’ve heard of but don’t know about
20. Setting up the eigenfunctions seems pretty simple
21. this allows for hashing that isn’t dependent on size of corpus
22. To generate r-bit codes r graph laplacian eigenvectors are used but you don’t simply just go up in r because if data is on a manifold you want to stay on lower eigenvectors.
1. A simple hierarchical scheme is introduced here that resamples low valued eigenvectors
23. the hierarchical scheme is that each subsequent bit subdivides the remaining points in that set.
1. More specifically, thresholding is still used, but the threshold is adjusted such that it splits the remaining points in the set in half.  This isn’t done in a way that continuously subdivides points – the opposite actually.  If two nearby points are split by a preceeding bit, it attempts to place them in the same following bit
2. By doing a 2-step bit assignment, only the lower half of eigenvectors are used (twice), which are better structured for this type of task
3. Hashing can be done in more than 2 steps of course, but a tradeoff is made between whether there is any useful information in adding new thresholds to old eigenvectors, or to using new eigenvectors
24. Empirical performance is excellent compared to competitors

Notes from the video at http://techtalks.tv/talks/hashing-with-graphs/54308/, separate slides at (http://www.ee.columbia.edu/~wliu/ICML11_agh_talk.pdf)

1. <Ok> The way to do NN lookup is to start with the hash code of the query item, and then flip a small number of bits, and see if those items exist in the database.  Because you only flip a few constant number of bits, the NN lookup is also constant cost
2. Concerned with large amounts of high-D data
3. Don’t preserve distance across all data points; just among a neighborhood <This is good in the case when data lies in a manifold because the global distance metric (L2) can actually lead to much poorer results>
1. Because only looking locally <perhaps at a constant number of neighbors?> the graph that represents connectivity between points is very sparse, <perhaps just a small constant number of entries per row in transition matrix>
4. Then work with the graph Laplacian of this sparse transition matrix
5. Constraints when doing the hashing are
1. #bits on/off should be balanced (this means bits contain maximum information>
2. bits <in the entire vector?> should be uncorrelated (no redundancy)
6. Doing this generally is NP-Hard, but doing it through spectral methods <the sparseification mentioned earlier> makes the problem tractable, although now it doesn’t optimally satisfy the problem
7. Even doing this through the sparsified Laplacian is still to expensive in terms of size of corpus, goal is to get costs from polynomial to linear during training, and from linear to constant for lookup
8. Earlier approach of spectral hashing makes assumptions that data is uniformly iid between dimensions, so cant capture manifold information, ignores other sorts of useful structure in data
9. Pretty neat illustration of what is going on <although I’m not sure what the distinction between “GH” and “our approach” is>
10. In order to avoid costs associated with full size of data, use anchor graphs, which is an aggressively subsampled graph (code for doing the anchor graphs was released)
11. Corpus is compared in a nearest-neighbor manner to the anchor points, and those points of the full corpus can now be related to each other in terms of their relationships to anchors
12. From this a low-rank affinity <similarity?> matrix is built.   Basically math for Laplacian eigenmaps.
1. Rank is at most # of anchor points.
2. Easy to solve because low rank
13. To evaluate a new data point, multiply it by eigenvectors of nearby anchor points <?>
14. “The higher eigenvec/eigenfunc corresponding to the higher graph Laplacian eigenvalue is of low quality for partitioning. […]  We propose two-layer hashing to revisit the lower (smoother) eigenfuncs to generate multiple hash bits.”
1. <Low-valued eigenvectors are reused?>