Diffusion Maps
Student: Kye Taylor
Dates of Involvement: 2007-2008
Faculty Advisor: Francois Meyer
Given a high-dimensional dataset, one would like to be able to represent the data using fewer parameters while preserving the relevant information. If we assume the original data actually lies on a lower-dimensional manifold embedded in a high-dimensional feature space, then recently popularized approaches based on graph theory and differential geometry allow us to ``learn'' the underlying manifold that generates the data.
Sometimes called Diffusion Maps or Laplacian Eigenmaps, these manifold-learning techniques preserve the local proximity between data points by first constructing a graph representation of the underlying manifold with vertices and edges. The vertices, or nodes, represent the data points, and the edges connecting the vertices represent the similarities between adjacent nodes. If properly normalized, these edge weights can be interpreted as transition probabilities for a random walk on the graph.
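As a concrete sketch of this graph construction, the following uses a Gaussian kernel with a hand-picked bandwidth `epsilon` (the kernel choice and the function name `transition_matrix` are illustrative assumptions, not necessarily the exact construction used in the project):

```python
import numpy as np

def transition_matrix(X, epsilon):
    """Build a row-stochastic (Markov) matrix from pairwise
    Gaussian similarities.

    X is an (n, d) array of data points; epsilon is the kernel
    bandwidth, a tuning parameter chosen by hand here.
    """
    # Squared pairwise distances between all points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian edge weights: nearby points get weights close to 1,
    # distant points get weights close to 0.
    W = np.exp(-sq_dists / epsilon)
    # Row-normalize so each row sums to 1: entry (i, j) is then the
    # probability of stepping from node i to node j in one step of
    # the random walk on the graph.
    P = W / W.sum(axis=1, keepdims=True)
    return P
```

Each row of the result is a probability distribution over the other nodes, which is exactly the random-walk interpretation described above.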
After representing the graph with a matrix, the spectral properties of this matrix are used to embed the data points into a lower-dimensional space and to gain insight into the geometry of the dataset. Though these methods perform exceptionally well on clean, well-sampled data, problems arise with the addition of noise, or when the data lies on multiple submanifolds. I propose to examine this dilemma and hope to offer solutions that detect and remove noise, and that provide insight when multiple manifolds are intrinsic to the data.
Update:
This Spring, I have explored several attempts to accomplish my goal of removing noise or detecting the existence of multiple submanifolds in a dataset. The first is based on the idea that, locally, a $k$-manifold resembles $k$-dimensional Euclidean space. Using this fact, small neighborhoods of data points are projected onto a $k$-dimensional linear subspace that is tangent to the manifold, as estimated from the neighborhood.$^1$ The noise removal then proceeds by projecting the points back onto the manifold so that each new point minimizes the error between a physically realizable trajectory point and a replacement point. Although this method denoises data that originates from a time series of observations of a dynamical system, data that originates from a steady-state solution or a static manifold resists improvement.
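The local-linearity idea can be sketched with local PCA: each point is replaced by its projection onto the $k$-dimensional tangent plane estimated from its nearest neighbors. This is only the tangent-projection half of the method; the trajectory-consistency step described above is not included, and the function name and neighborhood size are my own choices.

```python
import numpy as np

def project_to_tangent(X, k, n_neighbors=10):
    """Project each point of X onto the k-dimensional tangent
    plane estimated from its n_neighbors nearest neighbors.
    """
    n = len(X)
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    X_new = np.empty_like(X)
    for i in range(n):
        idx = np.argsort(sq[i])[:n_neighbors]  # nearest neighbors
        nbhd = X[idx]
        mu = nbhd.mean(axis=0)
        # The top-k principal directions of the centered
        # neighborhood span the estimated tangent space.
        _, _, Vt = np.linalg.svd(nbhd - mu, full_matrices=False)
        T = Vt[:k]                             # (k, d) basis rows
        # Project point i onto the affine tangent plane through mu.
        X_new[i] = mu + (X[i] - mu) @ T.T @ T
    return X_new
```

On a noisy plane in $\mathbb{R}^3$, for instance, this collapses most of the out-of-plane noise while leaving the in-plane coordinates essentially unchanged.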
To appreciate this algorithm's performance, consider a corrupted signal originating from the Hénon map that has been embedded in $\mathbb{R}^9$. I have plotted the first through third coordinates in the following plots.
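Such a signal can be generated by iterating the Hénon map and delay-embedding its first coordinate into $\mathbb{R}^9$; a sketch follows, assuming the classic parameters $a=1.4$, $b=0.3$ (the exact parameters and embedding used for the plots are not stated above):

```python
import numpy as np

def henon_series(n, a=1.4, b=0.3, x0=0.1, y0=0.1):
    """Generate n samples of the x-coordinate of the Henon map
    x' = 1 - a*x^2 + y,  y' = b*x, with the classic parameters."""
    xs = np.empty(n)
    x, y = x0, y0
    for i in range(n):
        x, y = 1.0 - a * x * x + y, b * x
        xs[i] = x
    return xs

def delay_embed(series, dim):
    """Delay-embed a scalar time series into R^dim:
    row i is (s[i], s[i+1], ..., s[i+dim-1])."""
    n = len(series) - dim + 1
    return np.stack([series[i:i + n] for i in range(dim)], axis=1)

# Clean trajectory in R^9; noise would be added on top of this.
X = delay_embed(henon_series(1000), 9)
```

The corrupted signal in the plots would then be this embedding plus observational noise.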


If I assume $k=3$, the algorithm discussed above yields the following modified signal after eight iterations.

Another attempt I have explored hinges on the fact that the pairwise-similarity matrix defined on the dataset represents a diffusion kernel on the manifold that produced the data. The eigenfunctions of this operator can be interpreted as a harmonic basis for functions defined on the manifold.$^2$ Using the smoothest basis functions, I was able to construct a lowpass filter that removes noise from data residing on periodic manifolds.
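A sketch of such a graph lowpass filter: expand each coordinate of the data in the eigenbasis of a symmetrically normalized Gaussian kernel and keep only the smoothest modes. The normalization and the names `lowpass_denoise` and `n_modes` are my assumptions, not necessarily those of the project.

```python
import numpy as np

def lowpass_denoise(X, epsilon, n_modes):
    """Smooth data by expanding each coordinate of X in the
    eigenbasis of a symmetrically normalized Gaussian kernel and
    retaining only the n_modes smoothest eigenvectors.
    """
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-sq / epsilon)
    d = W.sum(axis=1)
    # Symmetric normalization keeps the matrix symmetric, so eigh
    # gives an orthonormal eigenbasis, ordered by smoothness.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, U = np.linalg.eigh(S)
    # eigh returns ascending eigenvalues; the largest correspond
    # to the smoothest modes, so reverse and truncate.
    U = U[:, ::-1][:, :n_modes]
    # Project each coordinate function onto the retained modes.
    return U @ (U.T @ X)
```

Retaining all $n$ modes reproduces the data exactly (the basis is complete); truncating acts as the lowpass filter, discarding the oscillatory components where the noise lives.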
For example, consider data that originates from a torus. In the following figure, 10% added Gaussian noise has been used to distort the clean signal on the left. On the right is the result produced by the lowpass filter.
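For reference, data of this kind can be generated by sampling the two torus angles uniformly and adding Gaussian noise scaled to the minor radius; the radii and sampling scheme below are illustrative, not the ones behind the figure.

```python
import numpy as np

def noisy_torus(n, R=2.0, r=1.0, noise=0.1, seed=0):
    """Sample n points from a torus in R^3 (major radius R, minor
    radius r), then add Gaussian noise with standard deviation
    `noise` times the minor radius to every coordinate."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0, 2 * np.pi, n)  # angle around the tube
    v = rng.uniform(0, 2 * np.pi, n)  # angle around the axis
    clean = np.stack([(R + r * np.cos(u)) * np.cos(v),
                      (R + r * np.cos(u)) * np.sin(v),
                      r * np.sin(u)], axis=1)
    return clean + noise * r * rng.normal(size=clean.shape)
```

Note that uniform sampling in the angles is already non-uniform on the surface (points crowd the inner rim), which is related to the sampling issue discussed next.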


The improvement is most evident when examining the boundary of the torus. These results have led me to consider iterating this procedure, but because the sampling of the manifold is still very non-uniform, convergence cannot be guaranteed. Thus, I have been researching ways to resample the manifold produced by the lowpass filter and then repeat the process of computing the eigenfunctions and building the filter. In the future, I also plan to investigate what limits this algorithm near the boundaries of non-periodic manifolds.
