Inside FAISS: Billion-Scale Similarity Search
1. Everything is a Vector
In modern AI, images, text, and audio are not understood by computers as "cat" or "symphony". Instead, they are converted into lists of numbers called embeddings or vectors.
These vectors live in a high-dimensional geometric space. The core idea is simple: items that are semantically similar are placed close together in this space.
1. The Challenge
Computers don't have eyes or ears. They only understand numbers. How do we explain the concept of a 'Dog' or a 'Cat' to a machine?
2. Vectorization
We use Neural Networks to transform raw data (pixels, text) into a list of numbers. This list is the 'Embedding'. It captures the semantic meaning.
3. The Vector Space
These numbers act as coordinates (X, Y, Z...). We can plot every image or word as a single point in a high-dimensional space.
4. Similarity is Distance
In this space, concepts that are similar are placed close together. 'Cat' is close to 'Dog', but far from 'Car'. We measure this using distance (Euclidean or Cosine).
Above, we visualize a simplified 2D slice of this space. Imagine this in 1024 dimensions. Finding the "nearest neighbor" for a query point is equivalent to finding the most similar item in your database.
2. The NN Problem at Scale
Formally, given a query and a database of vectors, exact nearest-neighbor search solves:
The naive Brute Force solution evaluates every distance explicitly. Cost per query: time and memory to hold the database in full precision. For SIFT descriptors (, 4 bytes per float), that is 512 GB of RAM and one billion distance evaluations per search, a non-starter for real-time systems like web-scale retrieval or live LLM memory.
The rest of this article explores the two escape routes FAISS exploits:
- Partitioning: skip most of at query time (IVF).
- Compression: make each comparison cheap, and make the database fit in RAM (Product Quantization).
Production systems combine both.
3. FAISS to the Rescue
FAISS (Facebook AI Similarity Search) introduces approximate search methods. By sacrificing a tiny bit of accuracy (maybe you get the 2nd best match instead of the 1st), we can speed up search by orders of magnitude.
Let's explore two key indexing strategies:
- Flat: The baseline brute-force. Slow but accurate.
- IVF (Inverted File):Partitions the space into "Voronoi cells". We only look in the most promising cells.
Try it yourself. Increase the dataset size and see how different indexes perform.
Interactive Index Benchmark
Compare the performance of different indexing algorithms on random datasets.
4. Partitioning with IVF
The Inverted File (IVF)index works like a library classification system. Instead of looking at every book to find a specific title, you go to the "Science Fiction" section first.
Mathematically, it uses K-Means clustering to partition the vector space into Voronoi cells. When we search, we first identify which cell our query vector falls into (using a "coarse quantizer"), and then we only calculate distances for vectors inside that cell (and perhaps a few neighboring ones).
IVF Clustering1. Raw Data
K-means partitions random vectors into the Voronoi cells that IVF searches.
Start: Here is our raw dataset. 80 vectors in a 2D space. Searching this linearly (Brute Force) would require calculating distance to 80 points.
5. Compressing with Product Quantization
IVF makes search fast by skipping most of the database, but leaves every vector uncompressed. One billion SIFT descriptors still cost 512 GB of RAM. Product Quantization (PQ), introduced by Jégou, Douze and Schmid (2011), is the compression trick FAISS builds on to shrink each vector to 8 bytes while keeping distance estimates meaningful.
Same centroids, a different job
§4 used centroids to partition the search space. PQ uses them to compresseach vector. Same K-Means math, new purpose: the centroid's index becomes the vector. If your codebook has centroids, any vector collapses to one integer in .
A pixel-sized analogy
A 24-bit RGB pixel can be any of 16.7 million colors. A GIF gives up some of that range: it picks a 256-color palette and replaces every pixel with an 8-bit index into it. Storage drops 3×, the picture still looks like the picture. That palette is a codebook; the index is a code.
A 128-D SIFT descriptor is just a very long pixel. We want the same trick: find a palette of representative vectors, replace every descriptor with its nearest palette index.
How many bits does an index cost?
To label centroids uniquely you need enough bits to distinguish all of them. Each bit doubles the number of labels, so the code length is bits. Concretely:
- centroids → bits per code (1 byte)
- centroids → bits
- centroids → 64 bits (8 bytes)
More centroids means the quantizer discriminates finer detail, and the code grows accordingly. The codebook designer is playing a bit-budget game.
The paper's aggressive target
Jégou et al. want each 128-D SIFT descriptor to become a 64-bit code, which is 0.5 bit per dimension, a 64× compression ratio against the 512-byte original. Half a bit per dimension is ambitious: it means the quantizer has to encode all the variation along each axis with a single binary-ish choice.
Hitting 64 bits with one flat codebook means picking centroids. That is roughly 18 quintillion, more than the number of grains of sand on Earth. Let's feel that number before we dismiss it.
Flat codebook feasibility
Pick a vector dimension and a bit budget per dimension to see the centroid count and storage footprint a single flat quantizer would need.
Why 264 is not a codebook
Three walls hit simultaneously when gets that large:
- Storage.Each centroid is 128 floats × 4 bytes = 512 bytes. Times centroids gives about 9.4 zettabytes, more than all cloud storage on Earth in 2025. You cannot persist the codebook, let alone page through it at query time.
- Training data.Lloyd's algorithm needs at least tens of samples per centroid to converge. That is training vectors. No dataset in existence is that large.
- Query cost.To encode one descriptor you compare it to every centroid. 18 quintillion distance computations per vector is not a latency you can ship.
The flat codebook dies on all three fronts. The paper's move is to keep the effective vocabulary the same size while shrinking the stored codebook by many orders of magnitude.
The factoring trick
Back to the GIF analogy. Instead of finding one 256-color palette that works for the whole image, cut the image into 8 tiles and give each tile its own 256-color palette. Each tile is now one byte; the image is 8 bytes; every tile had access to a full 256-color palette tuned to its own pixels. PQ does exactly this for vectors.
Formally: split into sub-vectors and learn one small sub-quantizer per block. The product quantizer is the tuple of their outputs:
With and centroids per block, the effective codebook has possible combinations (same order of magnitude as the impossible flat case), but we only store centroid vectors in each. Total codebook storage is (paper §II.B, Table I), small enough to live in L2 cache.
Each encoded vector fits in , one byte per block index. Ratio: 512 B to 8 B, a 64× drop in memory, with a codebook that actually fits on the machine.
The walkthrough below builds the full index on the paper's reference SIFT setting (), turning each concept above into a concrete operation on real numbers.
1. One descriptor, zoomed all the way in
A 128-D vector is 128 floats. Zoom into a single one and you find IEEE 754: 1 sign bit, 8 exponent bits, 23 mantissa bits.
Four bytes per dimension, 512 bytes per vector.
2. Cut it into 8 sub-vectors
Same 128 numbers, regrouped into 8 contiguous blocks of 16 dimensions each.
Each block gets its own colour because each one will get its own codebook.
3. Now multiply: a tiny demo database
A codebook needs a population, not a single vector. This demo uses 4 example vectors; the paper trains on millions.
Every database vector gets the same 8-block split, and each block contributes one point per vector to its own subspace.
4. Focus on one codebook: u₁
Each subspace gets its own codebook, trained alone. That independence is the product in Product Quantization.
To see the whole recipe, we zoom into just one subvector, u₁, rendered in 2-D for readability (the paper's subspaces live in d* = D/m = 16 dimensions, with D = 128 and m = 8). Four training vectors is a pedagogical cartoon; real PQ trains each subspace on roughly 10⁶ descriptors.
Lloyd's algorithm runs k-means on this subspace alone: assign every point to its nearest centroid, move each centroid to its cluster mean, repeat until the within-block distortion MSE(q₁) = 𝔼‖u₁(x) − q₁(u₁(x))‖² stops falling.
The 6 centroids you end with (k* = 256 in the paper) form the codebook C₁. Encoding any future vector's first block is then a nearest-centroid lookup: emit the index i of c₁,ᵢ.
5. Same thing, 8 times in parallel
All 8 subspaces are quantized the same way at once, each block over its own slice of the data with its own codebook.
Lloyd's runs per block, independently (k* = 256 clusters per codebook in the paper, shown here with 6).
6. Encoding a brand-new vector
A query x arrives at search time, it was never in the training set.
Split it into 8 sub-vectors and, in each plot, snap to the nearest centroid. The 8 indices concatenate into the 8-byte code.
7. Decoding and per-block error
Reconstruction is a table lookup: replace each index by its centroid.
The gap between the unknown raw point and its centroid is the per-block quantization error; the total reconstruction error is their sum of squares.
8. Distance without decoding: SDC vs ADC
Two ways to compute query-to-database distance from codes.
SDC quantises both points and reads centroid-to-centroid distances. ADC keeps the query in full precision and reads raw-to-centroid distances.
ADC has a tighter error bound; FAISS defaults to it.
Once encoded, how do we measure distances without decoding every vector back to its 512 bytes? The paper gives two options.
- SDC (symmetric) quantizes both the query and the database vector and reads a pairwise centroid distance from a precomputed table per sub-quantizer.
- ADC (asymmetric) leaves the query in full precision and precomputes only query-to-centroid distances per block, then sums one lookup per block:
Per-query cost is similar (Jégou et al., Table II), but ADC has a tighter error bound: the mean squared distance error satisfies for ADC(Eq 18) versus for SDC. FAISS defaults to ADC.
6. Combining: IVFPQ
IVFPQ (also called IVFADC in the paper) stacks the two previous techniques: a coarse quantizer prunes the database to a handful of cells, and a product quantizer compresses what remains to 8 bytes per vector. One subtle choice makes the combination work: PQ encodes not the raw vector but its residual with respect to the coarse centroid.
Indexing a database vector :
- coarse quantize , the nearest of the coarse centroids;
- compute the residual ;
- encode the residual with the product quantizer, giving 8 byte codes ;
- append the pair to the inverted list of cell , where is the database index of and is the m-byte PQ code produced in step 3.
Residuals concentrate near zero, so the PQ codebook spends its bits modeling variation the coarse quantizer did not already capture. Same byte budget, better reconstruction.
Searching for a query :
- find the nearest coarse centroids to ;
- compute the query residual for each selected cell;
- precompute the distance Look-Up Table (LUT). Index selects the PQ sub-block ( in our setup), and index selects one of the centroids of that sub-block's codebook (). Each entry is the squared distance between the query's sub-block residual and that centroid:
- scan the inverted list: for each stored entry with PQ codes (one sub-block index per ), read the precomputed distances from the LUT and sum them:table reads, additions per candidate. No float multiplications, no square roots.
- keep a fixed-capacity max-heap of size holding the best candidates seen so far, keyed by their ADC distance. While the heap is not full, push every scanned candidate. Once full, compare each new ADC distance against the root (the current worst of the best-): if smaller, pop the root and push the candidate, otherwise discard it. The heap is shared across the probed inverted lists. When the scan is done, drain the heap in ascending order to get the approximate nearest neighbors of .
The walkthrough below runs the full pipeline with real arithmetic at every stage.
1. Train the coarse quantizer
Run k-means on a training set to learn coarse centroids. They carve into Voronoi cells, and these are the neighborhoods of IVFADC.
Paper setting: for SIFT (Jégou et al. §IV.A). We render 8 cells here so each one stays readable.
2. From raw vectors to residuals
For every training vector , subtract its nearest coarse centroid:
Subtracting the centroid re-centers each cell's points on the origin of its own frame. Pooled together, all cells share one centered residual distribution, and that is what the product quantizer learns on.
3. Why residuals quantize better
Residuals carry much less energy than raw vectors: on average. Same centroid budget, smaller region to cover, finer quantization steps.
Analogy: video compression. First send a coarse frame, then spend your bits on the difference. You encode motion and detail, not the whole picture from scratch.
Toggle the overlay in the visual to superimpose both histograms and read the variance ratio.
4. Train PQ on residuals
Split each residual into sub-vectors . Run Lloyd's on each block independently, with one codebook per block.
The codebooks model variation the coarse quantizer already subtracted away. Same 8-byte footprint, sharper reconstructions. Training ends here; the next three chapters consume these codebooks at speed.
5. Assign the coarse bucket
A new database vector arrives. Compare it to every coarse centroid, pick the nearest. Call its index . This index identifies the inverted list where will be stored.
6. Encode the residual with PQ
Compute , split into eight sub-vectors, snap each to its nearest centroid in . Emit the 8 block indices .
bits per block, 8 blocks, one byte each. Eight bytes of code locate inside its inverted list, with no need to store the raw coordinates.
7. Append to the inverted list
Append to the inverted list . The raw coordinates of are discarded. Memory per vector:
Click a Voronoi cell on the left to inspect the entries stored in its list.
8. Query arrives, probe w cells
Your true nearest neighbour can land near a cell boundary. To catch it, check the closest coarse centroids to , not just the top one (multiple assignment).
Adjust w on the left to probe more cells and see the pruning ratio update live.
9. Build the ADC lookup tables
For each probed cell , compute a query-specific residual , split it into 8 blocks, then for (one row per sub-block) and (one column per centroid), fill
8 × 256 = 2,048 cells per LUT. Built once per probed cell, reused across every candidate in its inverted list.
Why is constant across the whole list?
The query residual relative to a specific coarse cell is
- is the query, fixed for this search.
- is the center of the coarse bucket we are currently scanning, fixed for this specific inverted list.
- Therefore is identical for every comparison we make within that list.
Why is the LUT reused?
For a candidate inside that list we estimate
Using the residual decomposition ,
- is the query's offset from the cell center: same for every in this list.
- (with ) is the candidate's offset from the cell center: different for every , but it is just a selection of centroids from the codebooks. For sub-block , the candidate stores an index , so the per-block contribution is
Key realization. The specific index changes from candidate to candidate, but the set of possible centroids is the same for the whole database, and is the same for the whole list. So we pre-compute the distance from the query sub-vector to every possible centroid and store it in the LUT. When the scan moves to the next candidate, there is no math: we look up the index the candidate already points to (say ) and read .
“For each subquantizer the distances between the partial residual vector and all the centroids of are preliminarily computed and stored.” (Jégou et al., 2011)
| Component | Status during list scan |
|---|---|
| Query | Fixed |
| Cell center | Fixed |
| Query residual | Fixed |
| Sub-codebook centroids | Fixed (permanent) |
| LUT values | Fixed (computed once at start of list scan) |
| Candidate | Changes (iterating through the list) |
| Candidate indices | Changes (used to look up the fixed LUT values) |
10. Scan: 8 lookups, 7 adds
Each stored code is 8 table indices. The ADC distance estimate is:
No float multiplications, no square roots. The scan loop is eight random reads from an L1-resident table plus seven adds per candidate. That is the inner loop of IVFADC.
11. Merge top-K across w lists
A fixed-capacity max-heap of size tracks the current best. Whenever a new estimate beats the heap root, the root is popped and the candidate is pushed. After scanning all lists, drain the heap in ascending order; those are the approximate nearest neighbours.
12. Why it scales
Scanning only entries instead of , with each candidate costing 8 look-ups and 7 adds. Training runs once; encoding and search run forever.
Dispatched on GPU with WarpSelect (Johnson et al. 2017, §4.2), the same pipeline handles SIFT1B at 17.7 µsper query on a single Titan X.
Assuming balanced inverted lists, each query scans only about entries instead of . Jégou et al. (§IV.A) recommend between 1,000 and 1,000,000 for SIFT; their Table V uses and with . With at , that is roughly 7.8 million scanned entries per query, each costing 8 LUT look-ups and 7 additions.
On CPU, the 2011 paper reports 8.8 ms per query on a 1-million-vector GIST benchmark with these parameters (Table V). Billion-scale CPU latencies are an order of magnitude higher; the sub-millisecond regime belongs to the GPU implementation of Johnson, Douze and Jégou (2017), covered in §7.
7. Accelerating IVFPQ on GPU
Going from CPU to GPU shifts the bottleneck. The hard question is no longer “how fast can we sum these numbers” but “how fast can we stream thousands of tiny PQ codes from DRAM into the compute units.” The answer is about memory bandwidth, not arithmetic.
Each inverted list is one aisle in a huge warehouse. A query is a shopping list split into features. The LUT is a tiny price sheet, one row per feature. A CPU sends one worker down the aisle; a GPU sends a warp: 32 threads moving in lock-step.
Four tricks turn that crew into a billion-vector scanner:
- Coalesced reads. Transpose the list so the warp picks up 32 sub-codes in one 128-byte DRAM burst instead of 32 scattered ones.
- Shared-memory LUT. Copy the table into SRAM once per list. Every thread then reads it at ~25 cycles instead of ~400 (§5.3).
- Warp scan. One thread per vector, 32 distances per tick, partial sums held in registers.
- WarpSelect. Top-K without a global lock: warp queue, then block merge, then global merge (§4.2).
Step through the four tricks below; click any figure to zoom in on the labels.
Memory layout
How the inverted list is stored decides how many DRAM transactions a warp costs.
Takeaway.Row-major puts all codes of one vector together. Transposed groups all codes of one sub-quantizer together. With the transpose, 32 threads asking for sub-q j land on contiguous bytes and the memory controller answers with a single 128-byte transaction.
Scaling past one GPU is an explicit knob (§5.4): replication splits the query stream across devices, sharding splits the database across devices, and the two compose to GPUs.
8. Real-World Use Cases
FAISS isn't just for benchmarks. Here are six production scenarios where vector search with FAISS powers real applications:
Semantic Search
Find documents by meaning, not just keywords. Query embeddings retrieve the most relevant passages from millions of documents.
index.search(embed("climate change effects"), k=10)Image Similarity
Reverse image search and visual deduplication. Find visually similar products, detect copyright infringement, or organize photo libraries.
index.search(clip_embed(query_image), k=5)
RAG / LLM Memory
Retrieval-Augmented Generation grounds LLM responses in real data. FAISS serves as the fast retrieval backbone for context injection.
context = index.search(embed(user_query), k=3) llm.generate(prompt + context)
Recommendation Systems
Item-to-item and user-to-item recommendations. Embed user preferences and product features, then find nearest neighbors.
similar = index.search(item_embedding, k=20)
Document Deduplication
Detect near-duplicate documents at scale. Hash or embed content, then use range search to find items within a similarity threshold.
lims, D, I = index.range_search(emb, thresh=0.95)
Anomaly Detection
Score outliers by distance to their nearest neighbors. Points far from any cluster are likely anomalies or novel inputs.
D, _ = index.search(point, k=5) anomaly_score = D.mean()
9. Conclusion
Four ideas, layered on top of each other. Flat gives perfect recall at impossible cost. IVF partitions the space so most of it can be skipped at query time. PQ compresses every vector into roughly 8 bytes so a billion of them fit in RAM. IVFPQ combines the two, and on a single GPU the pipeline returns top-K in microseconds.
The move underneath is always the same: embed your data into a geometry where close means similar, then pick the index whose trade-offs match your constraints (latency, memory, recall). Graph indexes (HNSW, NSG) and hardware-aware accelerators (FastScan, IMI) reach the same goal through different tricks and will get their own articles.
Once the geometry and the trade-offs click, billion-scale nearest-neighbor search stops being a research problem and starts being a dial you turn.
References
- Billion-scale similarity search with GPUs (Johnson et al., 2017) - The original FAISS paper.
- Product quantization for nearest neighbor search (Jégou et al., 2011) - The foundation of PQ.
- Least squares quantization in PCM (Lloyd, 1982, IEEE Trans. Inf. Theory) - Lloyd optimality conditions used in §5.
- Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search (Liu et al., 2015) - Residual quantization.
- FAISS GitHub Repository
Get in touch
We hope you enjoyed this article. If you have questions, found a bug, or just want to chat about AI and engineering, we'd love to hear from you.
Future articles?
We're planning more deep dives into:
• Graph Neural Networks
• Transformer Architecture
• System Design for ML