You are given a collection of 1 million images provided by a customer. The images are pictures of different t-shirt designs and are not labeled. The customer - who sells clothes online - has asked you to build a system for her that given a picture of a t-shirt finds all other t-shirts in her collection that look visually similar to it. The customer is going to use your search functionality to power her website, so individual searches should be fast.
We need a way to semantically describe the images in order to do visual-simularity search on them. If we can find a way to semantically embed the images, then we can do nearest-neighbor search in the embedding space to return query results.
We can use a pre-trained CNN as a feature extractor for out images. Most CNN architectures are two parts: a feature extractor and a classifier. If we just use the feature extractor we can get a high dimensional semantic representation of the images.
We can validate this hypothesis by visualizing the embeddings in 2D or 3D space and verify that semantically simular items are clustered together
Because this is the t-shirt domain and the pre-trained model is probably an ImageNet model w/ might want to adjust the final weights of the network to fit our domain. So it could be beneficial to curate a small dataset of shirts and readjust the weights.
Nearest Neighbor Search
Now that we have a semantic representation of images, we can now preform nearest neighbor search.
First we need build our NN index. We need to do a forward pass of every image in our corpus through the image extractor and save the embeddings in some type of index/DB.
Using any distance measure, cosine for example, take an query image, forward pass it through the feature extractor and do KNN w/ the index.
Curse of Dimensionality
This naive implementation works in theory, but in practice we have two bottlenecks. First 1NN performance will degrade linearly w/ the size of the dataset. The second more sinster issue is that in the high dimension embedding space (typically 1-4k dimensions for modern CNNs) the distance measures break down: everything is equally far apart.
Approximate Nearest Neighbors
One way to mitigate the dimensionality problem and the size problem is to use an index that is to use a some type of semantic clustering. One technique is called Latent Semantic Hashing which uses a Random Projection Dimensionality Reduction technique to cluster the search space by a hashing function. This allows you to preform 1NN lookups in sub-linear time and to preserve the neighborhood distances between points.
Once your image search is working it would be cool to do Text -> Image search, so somehow, project your query into the same high-dimensional space as the images. For example, query “Giraffe” and giraffe pictures would appear.