Inside FAISS: Billion-Scale Similarity Search

January 23, 2026 | 45 min read | By Tom Salembien | Discussed on Hacker News

FAISS, Vector Search, Optimization, Machine Learning

Author's note

Before going further, two papers are worth reading. Douze et al. (2024), "The Faiss library" is the reference for the library as a whole; Johnson, Douze & Jégou (2019), "Billion-scale similarity search with GPUs" is the canonical work behind its GPU search. Both are excellent and remain the references we point to for FAISS.

This article is a humble visual companion: a few hand-picked parts of the design that we wanted to illustrate from our own perspective, with interactive schemas to make the geometry tangible. The actual FAISS implementation is far more optimized, complete, and complex than what we cover here, with many more methods, tuning knobs, and engineering details than fit in a single read.

Treat this as an entry point. For the full picture, the two papers above are the source of truth.

1. Everything is a Vector

In modern AI, images, text, and audio are not understood by computers as "cat" or "symphony". Instead, they are converted into lists of numbers called embeddings or vectors.

These vectors live in a high-dimensional geometric space. The core idea is simple: items that are semantically similar are placed close together in this space.

Above, we visualize a simplified 2D slice of this space. Imagine this in 1024 dimensions. Finding the "nearest neighbor" for a query point is equivalent to finding the most similar item in your database.

2. The NN Problem at Scale

Formally, given a query $x \in \mathbb{R}^{D}$ and a database $Y = \{y_{1}, \dots, y_{n}\}$ of $n$ vectors, exact nearest-neighbor search solves:

$NN(x) = \arg\min_{y \in Y} d(x, y), \quad d(x, y) = \lVert x - y \rVert_{2}$

The naive brute-force solution evaluates every distance explicitly. Cost per query: $O(nD)$ time and $O(nD)$ memory to hold the database in full precision. For $n = 10^{9}$ SIFT descriptors ($D = 128$, 4 bytes per float), that is 512 GB of RAM and one billion distance evaluations per search, a non-starter for real-time systems like web-scale retrieval or live LLM memory.
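To make the cost concrete, here is a minimal pure-Python sketch of the exhaustive search and of the memory arithmetic quoted above. The function names and the toy database are ours, chosen for illustration; a real system would use vectorized code rather than a Python loop.

```python
import math

def brute_force_nn(query, database):
    """Return (index, distance) of the exact nearest neighbor under L2."""
    best_idx, best_dist = -1, math.inf
    for idx, y in enumerate(database):      # n iterations ...
        dist = math.dist(query, y)          # ... each costing O(D)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist

def flat_memory_bytes(n, dim, bytes_per_float=4):
    """Bytes needed to hold n full-precision vectors of dimension dim."""
    return n * dim * bytes_per_float

# Toy database of three 2-D points (illustrative values only).
database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
nearest = brute_force_nn((0.9, 1.2), database)

# One billion 128-D float32 descriptors, in decimal gigabytes.
sift_1b_gb = flat_memory_bytes(10**9, 128) / 10**9
```

Both numbers in the text fall out directly: the loop touches every one of the $n$ vectors, and `sift_1b_gb` evaluates to 512.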

The rest of this article explores the two escape routes FAISS exploits:

Partitioning: skip most of $Y$ at query time (IVF).
Compression: make each comparison cheap, and make the database fit in RAM (Product Quantization).

Production systems combine both.

3. FAISS to the Rescue

FAISS (Facebook AI Similarity Search) implements approximate search methods. By sacrificing a tiny bit of accuracy (maybe you get the 2nd best match instead of the 1st), we can speed up search by orders of magnitude.

Let's explore two key indexing strategies:

Flat: The baseline brute-force. Slow but accurate.
IVF (Inverted File): Partitions the space into "Voronoi cells". We only look in the most promising cells.


4. Partitioning with IVF

The Inverted File (IVF) index works like a library classification system. Instead of looking at every book to find a specific title, you go to the "Science Fiction" section first.

Mathematically, it uses K-Means clustering to partition the vector space into Voronoi cells. When we search, we first identify which cell our query vector falls into (using a "coarse quantizer"), and then we only calculate distances for vectors inside that cell (and perhaps a few neighboring ones).
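The idea fits in a few lines of plain Python. The sketch below is a toy under stated assumptions, not the FAISS implementation: it runs a bare-bones Lloyd loop on synthetic 2-D blobs, builds one inverted list per cell, and scans only the `nprobe` closest cells (the helper names and the data are ours).

```python
import math
import random

def nearest_index(point, centroids):
    """Index of the centroid closest to point (the coarse quantizer)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd iterations; an empty cell keeps its previous centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest_index(p, centroids)].append(p)
        for c, members in enumerate(cells):
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def build_ivf(points, centroids):
    """Inverted lists: cell id -> ids of the vectors assigned to that cell."""
    lists = [[] for _ in centroids]
    for idx, p in enumerate(points):
        lists[nearest_index(p, centroids)].append(idx)
    return lists

def ivf_search(query, points, centroids, lists, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    if not candidates:
        return None, 0
    best = min(candidates, key=lambda idx: math.dist(query, points[idx]))
    return best, len(candidates)

# Synthetic data: three Gaussian blobs of 50 points each (illustrative only).
rng = random.Random(1)
blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(50)]

centroids = kmeans(points, k=3)
lists = build_ivf(points, centroids)
query = (9.5, 10.2)
found, scanned = ivf_search(query, points, centroids, lists, nprobe=1)
exact = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
```

`scanned` reports how many vectors were actually compared. Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of cells, the search degenerates to brute force and always returns `exact`.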

5. Compressing with Product Quantization

IVF makes search fast by skipping most of the database, but leaves every vector uncompressed. One billion SIFT descriptors still cost 512 GB of RAM. Product Quantization (PQ), introduced by Jégou, Douze and Schmid (2011), is the compression trick FAISS builds on to shrink each vector to 8 bytes while keeping distance estimates meaningful.

Same centroids, a different job

§4 used centroids to partition the search space. PQ uses them to compress each vector. Same K-Means math, new purpose: the centroid's index becomes the vector. If your codebook has $k$ centroids, any vector collapses to one integer in $\{0, 1, \dots, k - 1\}$.

A pixel-sized analogy

A 24-bit RGB pixel can be any of 16.7 million colors. A GIF gives up some of that range: it picks a 256-color palette and replaces every pixel with an 8-bit index into it. Storage drops 3×, the picture still looks like the picture. That palette is a codebook; the index is a code.

A 128-D SIFT descriptor is just a very long pixel. We want the same trick: find a palette of representative vectors, replace every descriptor with its nearest palette index.

How many bits does an index cost?

To label $k$ centroids uniquely you need enough bits to distinguish all of them. Each bit doubles the number of labels, so the code length is $\lceil \log_{2} k \rceil$ bits. Concretely:

$k = 256$ centroids → $\log_{2} 256 = 8$ bits per code (1 byte)
$k = 1{,}024$ centroids → $\log_{2} 1024 = 10$ bits
$k = 2^{64}$ centroids → 64 bits (8 bytes)

More centroids means the quantizer discriminates finer detail, and the code grows accordingly. The codebook designer is playing a bit-budget game.
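The bit budget is easy to check by hand or in code. A small sketch (the helper name `code_bits` is ours) that uses integer arithmetic so that very large $k$ does not run into floating-point rounding:

```python
def code_bits(k):
    """Bits needed to give each of k centroids a distinct label: ceil(log2(k))."""
    if k < 1:
        raise ValueError("need at least one centroid")
    return (k - 1).bit_length()   # exact integer arithmetic, no float log

# The three codebook sizes discussed in the text.
bits = {k: code_bits(k) for k in (256, 1024, 2**64)}
```

Here `bits` maps 256 to 8, 1024 to 10, and $2^{64}$ to 64. A codebook size that is not a power of two rounds up: `code_bits(257)` is 9.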

The paper's aggressive target

Jégou et al. want each 128-D SIFT descriptor to become a 64-bit code, which is 0.5 bit per dimension, a 64× compression ratio against the 512-byte original. Half a bit per dimension is ambitious: the code cannot afford even one binary choice per axis, so the quantizer has to capture the joint variation of several dimensions at once.

Hitting 64 bits with one flat codebook means picking $k = 2^{64} \approx 1.8 \times 10^{19}$ centroids. That is roughly 18 quintillion, more than the number of grains of sand on Earth. Let's feel that number before we dismiss it.

Why 2⁶⁴ is not a codebook

Three walls hit simultaneously when $k$ gets that large:

Storage. Each centroid is 128 floats × 4 bytes = 512 bytes. Times $2^{64}$ centroids gives about 9.4 zettabytes, more than all cloud storage on Earth in 2025. You cannot persist the codebook, let alone page through it at query time.
Training data. Lloyd's algorithm needs at least tens of samples per centroid to converge. That is $\gtrsim 30 \cdot 2^{64}$ training vectors. No dataset in existence is that large.
Query cost. To encode one descriptor you compare it to every centroid. 18 quintillion distance computations per vector is not a latency you can ship.

The flat codebook dies on all three fronts. The paper's move is to keep the effective vocabulary the same size while shrinking the stored codebook by many orders of magnitude.
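The three walls are plain arithmetic, so we can let Python's arbitrary-precision integers do it. The constant names are ours, and the factor 30 is simply the "tens of samples per centroid" rule of thumb from the list above.

```python
DIM = 128                 # SIFT descriptor dimension
BYTES_PER_FLOAT = 4       # float32
K_FLAT = 2**64            # centroids needed for a flat 64-bit code

centroid_bytes = DIM * BYTES_PER_FLOAT           # 512 bytes per centroid
codebook_bytes = K_FLAT * centroid_bytes         # 2**73 bytes in total
codebook_zettabytes = codebook_bytes / 10**21    # decimal zettabytes

min_training_vectors = 30 * K_FLAT               # about 30 samples per centroid
distances_per_encode = K_FLAT                    # one comparison per centroid
```

`codebook_zettabytes` comes out at roughly 9.44, matching the storage figure above.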

The factoring trick

Back to the GIF analogy. Instead of finding one 256-color palette that works for the whole image, cut the image into 8 tiles and give each tile its own 256-color palette. Each tile is now one byte; the image is 8 bytes; every tile had access to a full 256-color palette tuned to its own pixels. PQ does exactly this for vectors.

Formally: split $x$ into $m$ sub-vectors $u_{j} \in \mathbb{R}^{D/m}$ and learn one small sub-quantizer $q_{j}$ per block. The product quantizer is the tuple of their outputs:

$q(x) = (q_{1}(u_{1}(x)), q_{2}(u_{2}(x)), \dots, q_{m}(u_{m}(x)))$

With $m = 8$ and $k^{*} = 256$ centroids per block, the effective codebook has $(k^{*})^{m} = 256^{8} = 2^{64} \approx 1.8 \times 10^{19}$ possible combinations (exactly as many as the impossible flat case), but we only store $m \cdot k^{*} = 2{,}048$ centroid vectors, each in $\mathbb{R}^{16}$. With $D^{*} = D/m = 16$ dimensions per block, total codebook storage is $m k^{*} D^{*} \cdot 4\,\text{B} = 128\,\text{KB}$ (paper §II.B, Table I), small enough to live in L2 cache.

Each encoded vector fits in $m \log_{2} k^{*} = 8 \times 8 = 64$ bits, or 8 bytes: one byte per block index. Ratio: 512 B to 8 B, a 64× drop in memory, with a codebook that actually fits on the machine.
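Encoding and decoding with a product quantizer can be sketched in a few lines. This is a toy under loud assumptions: dimensions are shrunk so it runs instantly, and each sub-codebook is sampled from the training sub-vectors instead of being trained with K-Means, which a real index would do. All names are ours.

```python
import math
import random

def split(vec, m):
    """Cut a D-dimensional vector into m contiguous sub-vectors of size D/m."""
    step = len(vec) // m
    return [vec[j * step:(j + 1) * step] for j in range(m)]

def pq_encode(vec, codebooks):
    """One centroid index per block: the PQ code of vec."""
    blocks = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda i: math.dist(u, cb[i]))
            for u, cb in zip(blocks, codebooks)]

def pq_decode(code, codebooks):
    """Concatenate the selected centroids to get the reconstruction."""
    return [x for i, cb in zip(code, codebooks) for x in cb[i]]

# Toy sizes; the article's reference setting is D=128, m=8, k*=256.
D, M, K_STAR = 8, 4, 4
rng = random.Random(0)
train = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(200)]

# Simplification (assumption): sample K_STAR training sub-vectors per block
# instead of running K-Means on each block.
codebooks = [rng.sample([split(v, M)[j] for v in train], K_STAR) for j in range(M)]

code = pq_encode(train[0], codebooks)     # M small integers
approx = pq_decode(code, codebooks)       # D floats, a lossy reconstruction

# Storage arithmetic for the article's reference setting.
codebook_bytes = 8 * 256 * (128 // 8) * 4     # m * k* * D* * 4 bytes
code_bytes = (8 * 8) // 8                     # m * log2(k*) bits, in bytes
```

`codebook_bytes` is 131,072 (128 KB) and `code_bytes` is 8, the two figures quoted above. The stored vector is just `code`; `approx` is only ever materialized if you need an explicit reconstruction.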

The walkthrough below builds the full index on the paper's reference SIFT setting ( $D = 128, m = 8, k^{*} = 256$ ), turning each concept above into a concrete operation on real numbers.

Once encoded, how do we measure distances without decoding every vector back to its 512 bytes? The paper gives two options.

SDC (symmetric) quantizes both the query and the database vector and reads a pairwise centroid distance from a precomputed $k^{*} \times k^{*}$ table per sub-quantizer.
ADC (asymmetric) leaves the query in full precision and precomputes only $k^{*}$ query-to-centroid distances per block, then sums one lookup per block:
$\tilde{d}_{ADC}(x, y)^{2} = \sum_{j=1}^{m} d(u_{j}(x), q_{j}(u_{j}(y)))^{2}$

Per-query cost is similar (Jégou et al., Table II), but ADC has a tighter error bound: the mean squared distance error satisfies $MSDE(q) \leq MSE(q)$ for ADC (Eq. 18), versus a bound of $2\,MSE(q)$ for SDC. FAISS defaults to ADC.
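A tiny worked instance of ADC helps here. The sketch below uses hand-written codebooks with $m = 2$ blocks and $k^{*} = 2$ centroids each (values chosen by us so the arithmetic can be checked by eye), builds the per-query table once, and then scores a stored code with lookups only.

```python
def split(vec, m):
    """Cut a vector into m contiguous sub-vectors."""
    step = len(vec) // m
    return [vec[j * step:(j + 1) * step] for j in range(m)]

def build_lut(query, codebooks):
    """lut[j][i] = squared distance from the query's block j to centroid i of block j."""
    blocks = split(query, len(codebooks))
    return [[sum((a - b) ** 2 for a, b in zip(u, c)) for c in cb]
            for u, cb in zip(blocks, codebooks)]

def adc_distance_sq(lut, code):
    """Sum one table entry per block: m lookups, m - 1 additions."""
    return sum(lut[j][i] for j, i in enumerate(code))

# Illustrative codebooks: 2 blocks of 2 dimensions, 2 centroids per block.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # block 0
    [(0.0, 0.0), (2.0, 2.0)],   # block 1
]
query = (1.0, 0.0, 2.0, 1.0)    # kept in full precision (the "asymmetric" part)
code = [1, 1]                   # a stored vector quantized to (1, 1 | 2, 2)

lut = build_lut(query, codebooks)
approx_sq = adc_distance_sq(lut, code)
```

The table is `[[1.0, 1.0], [5.0, 1.0]]` and `approx_sq` is 2.0, which equals the exact squared distance between the query and the reconstruction (1, 1, 2, 2). The table costs $m \cdot k^{*}$ distance evaluations per query; every database code after that costs only lookups and additions.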

6. Combining: IVFPQ

IVFPQ (also called IVFADC in the paper) stacks the two previous techniques: a coarse quantizer prunes the database to a handful of cells, and a product quantizer compresses what remains to 8 bytes per vector. One subtle choice makes the combination work: PQ encodes not the raw vector but its residual with respect to the coarse centroid.

Indexing a database vector $y$:

1. coarse quantize $c = q_{c}(y)$, the nearest of the $k'$ coarse centroids;
2. compute the residual $r(y) = y - q_{c}(y)$;
3. encode the residual with the product quantizer, giving an $m$-byte code $(i_{1}, \dots, i_{m})$ (8 bytes in our setup);
4. append the pair $(id, codes)$ to the inverted list of cell $c$, where $id$ is the database index of $y$ and $codes = (i_{1}, \dots, i_{m})$ is the $m$-byte PQ code produced in step 3.

Residuals concentrate near zero, so the PQ codebook spends its bits modeling variation the coarse quantizer did not already capture. Same byte budget, better reconstruction.
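The indexing steps above translate almost line for line into code. The sketch below is ours and deliberately tiny: two hand-written coarse centroids in 4-D and a two-block PQ with two centroids per block, all illustrative values rather than trained ones.

```python
import math

def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

def split(vec, m):
    """Cut a vector into m contiguous sub-vectors."""
    step = len(vec) // m
    return [vec[j * step:(j + 1) * step] for j in range(m)]

def ivfpq_add(vec_id, vec, coarse, pq_codebooks, inverted_lists):
    """Index one vector: coarse cell, residual, PQ code, append to the cell's list."""
    cell = nearest(vec, coarse)                                    # coarse quantize
    residual = tuple(a - b for a, b in zip(vec, coarse[cell]))     # y - q_c(y)
    blocks = split(residual, len(pq_codebooks))
    code = tuple(nearest(u, cb) for u, cb in zip(blocks, pq_codebooks))  # PQ-encode residual
    inverted_lists[cell].append((vec_id, code))                    # store (id, codes)
    return cell, code

# Illustrative, hand-picked centroids (assumptions, not trained values).
coarse = [(0.0, 0.0, 0.0, 0.0), (10.0, 10.0, 10.0, 10.0)]
pq_codebooks = [
    [(-1.0, -1.0), (1.0, 1.0)],   # sub-quantizer for dims 0-1 of the residual
    [(-1.0, -1.0), (1.0, 1.0)],   # sub-quantizer for dims 2-3 of the residual
]
inverted_lists = [[] for _ in coarse]

ivfpq_add(0, (11.0, 11.0, 9.0, 9.0), coarse, pq_codebooks, inverted_lists)
ivfpq_add(1, (-1.0, -1.0, 1.0, 1.0), coarse, pq_codebooks, inverted_lists)
```

The two vectors sit far apart in the raw space, yet their residuals are both of magnitude 1, so one small PQ codebook centred on zero serves every cell. That is the practical payoff of encoding residuals.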

Searching for a query $x$:

1. find the $w$ nearest coarse centroids to $x$;
2. for each selected cell with centroid $c$, compute the query residual $r(x) = x - c$ (for the nearest cell, $c = q_{c}(x)$);
3. precompute the $m \times k^{*}$ distance Look-Up Table (LUT) for that residual. Index $j \in \{1, \dots, m\}$ selects the PQ sub-block ($m = 8$ in our setup), and index $i \in \{0, \dots, k^{*} - 1\}$ selects one of the $k^{*}$ centroids of that sub-block's codebook ($k^{*} = 256$). Each entry is the squared distance between the query's sub-block residual and that centroid:
$LUT[j][i] = \lVert u_{j}(r(x)) - c_{j,i} \rVert^{2}$
4. scan the inverted list: for each stored entry with PQ codes $(i_{1}, \dots, i_{m})$ (one sub-block index per $j$), read the $m$ precomputed distances from the LUT and sum them:
$\tilde{d}_{ADC}^{2} = \sum_{j=1}^{m} LUT[j][i_{j}]$
That is $m$ table reads and $m - 1$ additions per candidate. No float multiplications, no square roots.
5. keep a fixed-capacity max-heap of size $K$ holding the best candidates seen so far, keyed by their ADC distance. While the heap is not full, push every scanned candidate. Once full, compare each new ADC distance against the root (the current worst of the best-$K$): if smaller, pop the root and push the candidate, otherwise discard it. The heap is shared across the $w$ probed inverted lists. When the scan is done, drain the heap in ascending order to get the $K$ approximate nearest neighbors of $x$.
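The bounded max-heap of the last search step is worth seeing in code. Python's `heapq` is a min-heap, so this sketch (function name and sample distances are ours) stores negated distances to keep the current worst candidate at the root.

```python
import heapq

def top_k(candidates, k):
    """candidates: iterable of (distance, id). Returns the k smallest, ascending."""
    if k <= 0:
        return []
    heap = []                                     # holds (-distance, id); root = current worst
    for dist, vec_id in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (-dist, vec_id))         # heap not full: always push
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, vec_id))      # pop the root, push the candidate
        # otherwise the candidate is worse than everything kept: discard it
    return sorted((-neg, vec_id) for neg, vec_id in heap)

# ADC distances as they might come out of a list scan (illustrative values).
scanned = [(4.0, 10), (1.5, 11), (9.0, 12), (0.5, 13), (3.0, 14)]
best = top_k(scanned, k=3)
```

`best` is `[(0.5, 13), (1.5, 11), (3.0, 14)]`. Because most candidates lose the comparison against the root, the common case is a single comparison and no heap update, which keeps selection cheap next to the scan itself.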

The walkthrough below runs the full pipeline with real arithmetic at every stage.

Assuming balanced inverted lists, each query scans only about $n \cdot w / k'$ entries instead of $n$. Jégou et al. (§IV.A) recommend $k'$ between 1,000 and 1,000,000 for SIFT; their Table V uses $k' = 1024$ and $k' = 8192$ with $w \in \{1, 8, 64\}$. With $k' = 1024$, $w = 8$ at $n = 10^{9}$, that is roughly 7.8 million scanned entries per query, each costing 8 LUT look-ups and 7 additions.
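The scan budget is a one-line formula, reproduced here as a small sketch (function and variable names are ours) that assumes perfectly balanced lists, as the paragraph above does.

```python
def scanned_entries(n, w, k_coarse):
    """Expected list entries scanned per query when the k' lists are balanced."""
    return n * w / k_coarse

M = 8  # PQ sub-blocks per code in the reference setting

per_query = scanned_entries(n=10**9, w=8, k_coarse=1024)
lut_reads = per_query * M          # m lookups per scanned entry
additions = per_query * (M - 1)    # m - 1 additions per scanned entry
```

`per_query` is 7,812,500, which is under 1% of the billion-vector database. Real lists are rarely balanced, so actual scan counts vary from query to query.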

On CPU, the 2011 paper reports 8.8 ms per query on a 1-million-vector GIST benchmark with these parameters (Table V). Billion-scale CPU latencies are an order of magnitude higher; the sub-millisecond regime belongs to the GPU implementation of Johnson, Douze and Jégou (2017), covered in §7.

7. Accelerating IVFPQ on GPU

Going from CPU to GPU shifts the bottleneck. The hard question is no longer “how fast can we sum these numbers” but “how fast can we stream thousands of tiny PQ codes from DRAM into the compute units.” The answer is about memory bandwidth, not arithmetic.

Mental model

Each inverted list is one aisle in a huge warehouse. A query is a shopping list split into $m$ features. The LUT is a tiny price sheet, one row per feature. A CPU sends one worker down the aisle; a GPU sends a warp: 32 threads moving in lock-step.

Four tricks turn that crew into a billion-vector scanner:

Coalesced reads. Transpose the list so the warp picks up 32 sub-codes in one 128-byte DRAM burst instead of 32 scattered ones.
Shared-memory LUT. Copy the $m \times k^{*}$ table into SRAM once per list. Every thread then reads it at ~25 cycles instead of ~400 (§5.3).
Warp scan. One thread per vector, 32 distances per tick, partial sums held in registers.
WarpSelect. Top-K without a global lock: warp queue, then block merge, then global merge (§4.2).


Scaling past one GPU is an explicit knob (§5.4): replication splits the query stream across $R$ devices, sharding splits the database across $S$ devices, and the two compose to $S \times R$ GPUs.
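Sharding and replication are easy to mimic on a CPU. The sketch below is only an analogy under our own assumptions: each "device" is a Python list searched by brute force rather than a GPU running IVFPQ, and replicas are picked round-robin, which is one simple policy among many.

```python
import heapq
import itertools
import math

def search_shard(query, shard, k):
    """Exact top-k inside one shard; a shard is a list of (id, vector) pairs."""
    return heapq.nsmallest(k, ((math.dist(query, vec), vec_id) for vec_id, vec in shard))

def sharded_search(query, shards, k):
    """Every shard answers on its own; the sorted partial lists are then merged."""
    partial = [search_shard(query, shard, k) for shard in shards]
    return list(itertools.islice(heapq.merge(*partial), k))

def pick_replica(query_number, num_replicas):
    """Replication: spread the query stream round-robin over identical copies."""
    return query_number % num_replicas

# Six 1-D vectors split over S = 2 shards (illustrative data).
database = [(i, (float(i),)) for i in range(6)]
shards = [database[0::2], database[1::2]]
result = sharded_search((0.0,), shards, k=3)
```

`result` is `[(0.0, 0), (1.0, 1), (2.0, 2)]`, identical to searching the unsplit database. Sharding adds capacity at the price of a merge step per query; replication adds throughput without touching per-query latency.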

8. Real-World Use Cases

FAISS isn't just for benchmarks. Here are six production scenarios where vector search with FAISS powers real applications:

Semantic Search

Find documents by meaning, not just keywords. Query embeddings retrieve the most relevant passages from millions of documents.

index.search(embed("climate change effects"), k=10)

Image Similarity

Reverse image search and visual deduplication. Find visually similar products, detect copyright infringement, or organize photo libraries.

index.search(clip_embed(query_image), k=5)

RAG / LLM Memory

Retrieval-Augmented Generation grounds LLM responses in real data. FAISS serves as the fast retrieval backbone for context injection.

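_, I = index.search(embed(user_query), k=3)  # search returns (distances, ids)
context = "\n".join(passages[i] for i in I[0])  # passages: hypothetical id -> text store
llm.generate(prompt + context)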

Recommendation Systems

Item-to-item and user-to-item recommendations. Embed user preferences and product features, then find nearest neighbors.

similar = index.search(item_embedding, k=20)

Document Deduplication

Detect near-duplicate documents at scale. Hash or embed content, then use range search to find items within a similarity threshold.

lims, D, I = index.range_search(emb, thresh=0.95)

Anomaly Detection

Score outliers by distance to their nearest neighbors. Points far from any cluster are likely anomalies or novel inputs.

D, _ = index.search(point, k=5)
anomaly_score = D.mean()

9. Conclusion

Four ideas, layered on top of each other. Flat gives perfect recall at impossible cost. IVF partitions the space so most of it can be skipped at query time. PQ compresses every vector into roughly 8 bytes so a billion of them fit in RAM. IVFPQ combines the two, and on GPU the pipeline reaches the sub-millisecond regime.

The move underneath is always the same: embed your data into a geometry where close means similar, then pick the index whose trade-offs match your constraints (latency, memory, recall). Graph indexes (HNSW, NSG) and other refinements (the SIMD-oriented FastScan, the inverted multi-index IMI) reach the same goal through different tricks and will get their own articles.

Once the geometry and the trade-offs click, billion-scale nearest-neighbor search stops being a research problem and starts being a dial you turn.

References

Billion-scale similarity search with GPUs (Johnson et al., 2017; journal version 2019) - The original FAISS GPU paper.
Product quantization for nearest neighbor search (Jégou et al., 2011) - The foundation of PQ.
Least squares quantization in PCM (Lloyd, 1982, IEEE Trans. Inf. Theory) - Lloyd optimality conditions used in §5.
Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search (Liu et al., 2015) - Residual quantization.
FAISS GitHub Repository

