by Josh Engels
In the traditional near neighbor problem, we are given a dataset \(D\) of \(N\) \(d\)-dimensional vectors. Our goal is to build an index that can quickly find the closest vector in \(D\) to an arbitrary \(d\)-dimensional query vector \(q\).
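Before introducing any index structures, it helps to see the problem in its simplest form. The following is a minimal brute-force sketch (not any particular library's API): it scans all of \(D\) for every query, which is exactly the \(O(Nd)\) cost that near neighbor indexes try to avoid.

```python
import numpy as np

def nearest_neighbor(D, q):
    """Return the index of the vector in D closest to query q (Euclidean)."""
    dists = np.linalg.norm(D - q, axis=1)  # distance from q to every row of D
    return int(np.argmin(dists))

# Tiny example: N = 4 vectors in d = 2 dimensions.
D = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
q = np.array([0.9, 0.1])
print(nearest_neighbor(D, q))  # → 1
```

An index earns its keep by answering the same question while examining far fewer than all \(N\) vectors.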
Many important search problems can be cast into this framework via an embedding model: a machine learning model trained to map the objects we want to search over (images, music, documents, products) into vectors (“embeddings”) such that similar items have embeddings that are close together. By embedding every item in the collection, and doing the same for a query, we can find the items most similar to the query by finding the embeddings closest to the query embedding.
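To make the embedding-search pipeline concrete, here is a toy end-to-end sketch. The `embed` function below is a stand-in for a real embedding model (it just counts characters so the example runs on its own); everything else mirrors the pipeline described above: embed the corpus once, embed the query, and rank by similarity.

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: a bag-of-characters vector,
    # normalized so that dot product equals cosine similarity.
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# "Index" the corpus by embedding every item up front.
corpus = ["cat", "car", "dog"]
embeddings = np.stack([embed(doc) for doc in corpus])

def search(query, k=1):
    qv = embed(query)
    sims = embeddings @ qv           # cosine similarity against every item
    top = np.argsort(-sims)[:k]      # indices of the k most similar items
    return [corpus[i] for i in top]

print(search("cats"))  # → ['cat']
```

In a real system the toy `embed` would be replaced by a trained neural model, and the exhaustive similarity scan by one of the approximate indexes discussed next.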
The research community has focused especially on the approximate near neighbor search problem, which is much more tractable than exact near neighbor search as \(N\) and \(d\) grow large. Competition is fierce: graph-based algorithms like HNSW, LSH-based algorithms like FLASH and our own FLINNG, and quantization-based algorithms like those in FAISS all face off in different benchmarks.
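To give a flavor of how these approximate methods cut the search cost, here is a minimal random-hyperplane LSH sketch in numpy. It illustrates the general LSH idea only, not the specific algorithms used by FLASH or FLINNG: vectors with the same sign pattern against a few random hyperplanes land in the same bucket, so a query only has to scan its own bucket rather than all of \(D\).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_planes = 8, 4
planes = rng.standard_normal((n_planes, d))  # random hyperplanes

def lsh_key(v):
    # Bucket key = which side of each hyperplane v falls on.
    return tuple((planes @ v > 0).astype(int))

# Build: hash every dataset vector into a bucket.
D = rng.standard_normal((100, d))
buckets = defaultdict(list)
for i, v in enumerate(D):
    buckets[lsh_key(v)].append(i)

# Query: scan only the query's bucket instead of all 100 vectors.
q = D[0].copy()  # a copy of D[0], so it hashes to the same bucket
candidates = buckets[lsh_key(q)]
best = min(candidates, key=lambda i: np.linalg.norm(D[i] - q))
print(best)  # → 0
```

The search is "approximate" because a true near neighbor can land in a different bucket; practical LSH schemes use multiple hash tables to drive that miss probability down, trading memory for recall.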