Locality-sensitive hashing (LSH)
In order to preform nearest neighbour search (NN) on large and high dimensional datasets you must overcome 2 big problems:
- ineffectiveness of distance measures in high dimensions
- computational intractability of exhaustive search on large datasets
By employing hash tables w/ a novel hashing function that optimises for collisions we reduce the search space and mitigate the curse of dimensionality for NN search.
Figure 1: The curse of dimensionality visualized: In high dimensions, every point is equally far away!.
Hash Function
Given a vector $x$ in $\mathbb{R^d}$ and a projection matrix $\mathrm{P}$ in $\mathbb{R^{r\times d}}$ we can formulate a hashing function $h(x)$:
where $h(x)$ is a vector in the $\mathbb{R^r}$ space and $\mathrm{P}$ is random matrix parameterised by $r$ $(r\ll d)$.
The key here is that $x$ is referenced in a smaller dimension than before $r$ vs $d$ .
Hash Table
Given a hashing function that maximises collisions and a dictionary data structure we can build a hash table $D$:
where $D$ is a mapping of $h(x)$ to $[x_1,…,x_n]$ where each $x$ has a degree of similarity.
Formally, we can find $x$ in $D$ in $O(1)$ compared to $O(n)$ for exhaustive search.
Nearest Neighbour
In order to preform NN in the pathological case, it would be $O(n)$ search w/ no upper bound on $n$. However, in the LSH case, we can put an upper bound on $n$ and guarantee that: $n_{LSH}\ll n_{NN}$ as we are guaranteed to never have to iterate the whole search space:
- Reduce search space by querying $\ell \leftarrow D : h(x)$
- Preform classical NN search on the drastically reduced search space $\ell$ where the upper bound on $n$ is $\ell$
Conclusion
In order to preform NN on large and high dimensional datasets you must overcome 2 big problems:
- ineffectiveness of distance measures in high dimensions
- computational intractability of exhaustive search on large datasets
By employing hash tables w/ a novel hashing function that optimises for collisions we reduce the search space and mitigate the curse of dimensionality for NN search.
Permalink: locality-sensitive-hashing-lsh
Tags: