- “We present a computational method for extracting simple descriptions of high dimensional data sets in the form of simplicial complexes.”
- Discusses a tool
*Mapper*for doing TDA - “The purpose of this paper is to introduce a new method for the qualitative analysis, simplification and visualization of high dimensional data sets”
- One application is to help visualization in cases where data is extremely high-dimensional, and standard forms of visualization and dimension reduction don’t produce a reasonable result
- “Our construction provides a coordinatization not by using real valued coordinate functions, but by providing a more discrete and combinatorial object, a simplicial complex, to which the data set maps and which can represent the data set in a useful way”
- “In the simplest case one can imagine reducing high dimensional data sets to a graph which has nodes corresponding to clusters in the data”
- “Our method is based on topological ideas, by which we roughly mean that it preserves a notion of nearness, but can distort large scale distances. This is often a desirable property, because while distance functions often encode a notion of similarity or nearness, the large scale distances often carry little meaning.”
- <Reminds me a little of the intuition behind t-sne>

- Basic idea is called a
*partial clustering*where subsets of the data are clustered. If subsets overlap, clusters may also overlap which can be used to build a “simplicial complex” <basically a way of gluing together points so each face has a certain number of vertices, and then connecting the faces>- If simplices and topologies are robust to changes in data subset divisions, then the results are real

- “We do not attempt to obtain a fully accurate representation of a data set, but rather a low-dimensional image which is easy to understand, and which can point to areas of interest. Note that it is implicit in the method that one fixes a parameter space, and its dimension will be an upper bound on the dimension of the simplicial complex one studies.”
- Unlike other methods of dimension reduction (like isomap, mds) this method is less sensitive to metric
- <skipping most of the math because it uses terms Im not familiar with>
- “The main idea in passing from the topological version to the statistical version is that clustering should be regarded as the statistical version of the geometric notion of partitioning a space into its connected components.”
**Assumes there is a mapping from data points to Reals (the***filter)*, and that interpoint distances can be measured- “Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any conditions on the clustering algorithm. Thus any domain-specific clustering algorithm can be used.”
- Cluster based on vector of interpoint distances (this is the representation used, not Euclidian distances)
- Also want clustering algorithm that doesn’t require # of clusters to be specified in advance
- <I would have thought they would use Chinese-restaurant-processes, but they use something called single-linkage clustering>

- “Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any conditions on the clustering algorithm. Thus any domain-specific clustering algorithm can be used.”
- Seems like mapper can handle higher-dimensional filters

- “The outcome of Mapper is highly dependent on the function(s) chosen to partition (filter) the data set. In this section we identify a few functions which carry interesting geometric information about data sets in general.”
- “projection pursuit methods“
- Can use eigenfunctions of the Laplacian as filter functions
- Also discuss using meshes and computing distance based on dijkstras over the mesh
- “Different poses of the same shape have qualitatively similar Mapper results however different shapes produce significantly different results. This suggests that certain intrinsic information about the shapes, which is invariant to pose, is being retained by our procedure.”
- I think this falls out primarily from the distance metric used?

- Clusters different 3d models nicely, even though it samples down to just 100 points for each model, and many are similar (horse, camel, cat)

Advertisements