## Diffusion Kernels and Other Discrete Input Spaces. Kondor, Lafferty. ICML 2002.

1. Extends use of kernels from real-valued data to discrete data
2. Particular focus is on graphs which gets a particular type of kernel called the diffusion kernel “… which are based on the heat equation and can be regarded as the discretization of the familiar Gaussian kernel of Euclidian space.”
3. Kernels allow for a mapping into Hilbert space where the kernel is an inner product.  With respect to a basis a datapoint is mapped to many (potentially infinite) number of features – this expansion makes a number of operation simpler
4. “… adjacency graphs are sometimes used when data is expected to be confined to a manifold of lower dimensionality that the original space.”
5. Kernels must be symmetric and positive semi-definite
1. “Constructing appropriate positive definite kernels is not a simple task, and this has largely been the reason why, with a few exceptions, kernel methods have mostly been confined to Euclidian spaces…”
6. When working in discrete spaces, the most common approach with kernels has been to find some mapping from the discrete to continuous spaces (for example, transforming a text to a bag/count of words)
7. The kernels they propose for graphs they call diffusion kernels
8. Start out with a kernel for discrete data called exponential kernels
9. In discrete case, positive semi-definiteness means:
1. x ∈ Xx’ ∈ X  fx fx’ K(x,x’) ≥ 0
10. In the discrete case, for ﬁnite X , the kernel can be uniquely represented by an |X| × |X| matrix (which we shall denote by the same letter K) with rows and columns indexed by the elements of X , and related to the kernel by Kxx’ = K(x,x’).”
11. Defines the exponential of a square matrix – the exponential of any symmetric matrix is symmetric positive semi-definite so it may be used as a kernel
12. Tensor products and conjugacy <that passed me by>
13. Onto diffusion kernels on graphs
14. For a graph Γ, the heat equation H <is a matrix> s.t. Hij = :
1. 1 if i is connected by an edge to j
2. di if i=j, where di is the degree of vertex i
3. 0 otherwise
15. The negative of this is the graph Laplacian of Γ
16. H can be regarded as an approximator, and H/h2 approaches the continuous laplacian as h → 0
1. In physics this equation is used to approximate the diffusion of heat (and other substances) through a continuous media.
17. The resulting kernels HKβ are called diffusion kernels or heat kernels
18. “To the best of our knowledge, diffusion kernels have not been proposed previously for direct use in kernel-based learning algorithms.”
19. This extends simply to unweighted graphs as well as symmetric graphs
1. <But what about assymetric graphs?>
20. Equations for unweighted <I suppose undirected> graphs, closed chains, trees
1. These can be used “… as building blocks for tensor product kernels.”
2. For example, bit (or over any discrete alphabet) strings can be composed in this way to generate a Hamming distance
3. Other methods of string distances for looking for  contiguous substrings.  The kernel used approaches a result obtained by dynamic programming
21. In the first set of experiments diffusion kernels are used for classification “For such problems, it is often quite unnatural to encode the data as vectors in Euclidean space to allow the use of standard kernels. However as our experiments show, even simple diffusion kernels on the hypercube, as described in Section 4.4, can result in good performance for such data.”
1. “Data sets “having a majority of categorical variables were chose; any continuous features were ignored.”
2. Used “voted perceptron” for training <don’t know what that is>
3. Diffusion and Hamming kernels were used.  “The reduction in error rate varies, but the simple diffusion kernel generally performs well.”
4. <I’m not sure what conclusion I’m supposed to draw from these results – where are the comparisons to other algorithms?  Compares well to Hamming distance?  I would have liked to see results from other more impressive approaches.>
22. “We showed how the explicit calculation of diffusion kernels is possible for specific families of graphs, and how the kernels correspond to standard Gaussian kernels in a continuous limit.”
23. “While the tensor product construction allows one to incrementally build up more powerful kernels from simple components, explicit formulas will be difficult to come by in general.”