
Inside FAISS: Billion-Scale Similarity Search

January 23, 2026 · 15 min read · By Tom Salembien
Tags: FAISS, Vector Search, Optimization, Machine Learning

1. Everything is a Vector

In modern AI, images, text, and audio are not understood by computers as "cat" or "symphony". Instead, they are converted into lists of numbers called embeddings or vectors.

These vectors live in a high-dimensional geometric space. The core idea is simple: items that are semantically similar are placed close together in this space.

[Figure: a 2D embedding space. "Cat" and "Dog" sit close together, far from "Pizza" and "Car".]

Step 1: The Challenge

Computers don't have eyes or ears. They only understand numbers. How do we explain the concept of a 'Dog' or a 'Cat' to a machine?

Step 2: Vectorization

We use neural networks to transform raw data (pixels, text) into a list of numbers. This list is the 'Embedding'. It captures the semantic meaning.

Step 3: The Vector Space

These numbers act as coordinates (X, Y, Z...). We can plot every image or word as a single point in a high-dimensional space.

Step 4: Similarity is Distance

In this space, concepts that are similar are placed close together. 'Cat' is close to 'Dog', but far from 'Car'. We measure this using distance metrics (Euclidean or cosine).

Above, we visualize a simplified 2D slice of this space. Imagine this in 1024 dimensions. Finding the "nearest neighbor" for a query point is equivalent to finding the most similar item in your database.

2. The Scale Problem

The simplest way to find the nearest neighbor is Brute Force (or Exact Search). You take your query vector and calculate the distance to every single other vector in the database.

This works perfectly for 10,000 items. But what about 1 billion? If you have 1B items, you perform 1B distance calculations for every search. That's O(N). It's too slow for real-time applications like Google Search or ChatGPT memory.
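To make the O(N) cost concrete, here is a minimal brute-force search in pure NumPy. The dataset is random and all sizes are illustrative:

```python
# Brute-force (exact) nearest-neighbor search in pure NumPy.
# Every query is compared against every database vector: O(N) per query.
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)   # 10k vectors, 64 dims
query = rng.standard_normal((1, 64)).astype(np.float32)

# Squared L2 distance from the query to all 10,000 vectors at once.
dists = ((db - query) ** 2).sum(axis=1)
nearest = int(np.argmin(dists))   # index of the closest vector

print(nearest, float(dists[nearest]))
```

At 10,000 vectors this runs in milliseconds; at 1 billion, the same loop becomes the bottleneck.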

3. FAISS to the Rescue

FAISS (Facebook AI Similarity Search) introduces approximate search methods. By sacrificing a tiny bit of accuracy (maybe you get the 2nd best match instead of the 1st), we can speed up search by orders of magnitude.

Let's explore three key indexing strategies:

  • Flat: The baseline brute-force. Slow but accurate.
  • IVF (Inverted File): Partitions the space into "Voronoi cells". We only look in the most promising cells.
  • HNSW (Hierarchical Navigable Small World): A multi-layered graph structure. Like a highway system for vectors.

[Interactive benchmark: increase the dataset size and compare how the different index types perform on random data.]

4. Deep Dive into Optimizations

Let's look under the hood. How do these algorithms actually work to achieve such speed?

IVF: Divide and Conquer

The Inverted File (IVF) index works like a library classification system. Instead of looking at every book to find a specific title, you go to the "Science Fiction" section first.

Mathematically, it uses K-Means clustering to partition the vector space into Voronoi cells. When we search, we first identify which cell our query vector falls into (using a "coarse quantizer"), and then we only calculate distances for vectors inside that cell (and perhaps a few neighboring ones).

[IVF clustering animation: 80 raw 2D vectors are organized into search cells. Searching them linearly (brute force) would require a distance calculation to all 80 points.]

HNSW: The Highway System

Hierarchical Navigable Small World (HNSW) graphs are currently the state of the art for vector search. They solve the "small world" navigation problem using a multi-layered graph.

Think of it like a map with different zoom levels. The top layer has very few cities (nodes) connected by long highways. You start there to quickly get to the right region. Then you drop down to regional roads (middle layers) and finally to local streets (base layer) to find the exact address.

[Diagram: HNSW layers. Layer 2 is the top layer with few nodes, Layer 1 the middle layer, Layer 0 the base layer containing all nodes. Search starts at the top and drills down.]

HNSW builds a multi-layer graph, like an express highway system:
• Layer 2: high-speed connections, long jumps.
• Layer 1: regional roads.
• Layer 0: local streets (contains all the data).
The search starts at the top, finds the best entry point, then moves down to refine the result. This logarithmic scaling is what makes it so fast.

Product Quantization: Compressing Space

Even with fast indexing, storing billions of vectors requires massive amounts of RAM. Product Quantization (PQ) is a lossy compression technique that solves this.

It splits a high-dimensional vector (e.g., 1024 floats) into smaller chunks (e.g., 8 chunks of 128 floats). Each chunk is then approximated by the ID of a "centroid" in that sub-space. Instead of storing the full float values, we just store the centroid IDs (often 1 byte each).

[Diagram: a high-dimensional vector is split into 4 sub-vectors; each sub-vector is mapped to the ID of its closest centroid (e.g., 100, 123, 146, 169), and the compressed PQ code is just that list of IDs.]

Compression: instead of storing 1,024 floats (4 KB), we split the vector into M sub-vectors. Each sub-vector is replaced by the ID of its nearest centroid (a small integer, usually 1 byte). This can reduce memory usage by 90-95% while maintaining good search accuracy.

Putting it Together: IVF + PQ

In practice, billion-scale systems often combine these methods. IVF-PQ is the industry standard for efficient memory usage and fast search. We use IVF to narrow the search space, and PQ to compress the vectors so they fit in RAM.

Here is the complete end-to-end journey of a vector in an IVF-PQ index.

Step 1: The Scale Problem

Searching millions of high-dimensional vectors linearly is too slow. We need a way to narrow down the search space and compress the data at the same time.

Step 2: Inverted File (IVF)

First, we cluster the entire dataset into 'cells' (Voronoi regions). We store vectors in buckets based on their nearest cluster centroid. At search time, we only look at a few promising buckets.

Step 3: Product Quantization (PQ): Splitting

Even with IVF, storing full float vectors (e.g., 1024 dims) is heavy. PQ splits each high-dimensional vector into M smaller sub-vectors.

Step 4: PQ: Quantization (Codebooks)

Each sub-vector is replaced by the ID of the nearest 'sub-centroid' from a learned codebook. We no longer store the vector, just a short list of IDs (e.g., 4 bytes vs 4 KB).

Step 5: The Stored Index

This is what we actually store in RAM: an inverted list where each cell contains the compressed PQ codes of its vectors. It's tiny and fast.

Step 6: Search, Coarse Step (IVF)

When a query comes in, we first compare it to the main IVF centroids. We pick the closest one(s) (e.g., nprobe=1) and ignore the rest of the dataset.

Step 7: Search, Fine Step (PQ)

Inside the target cell, we don't reconstruct vectors. We compare the query to the codebooks directly to estimate distances. This is extremely fast via lookup tables.

5. Advanced Indexing Methods

Beyond the basics, FAISS implements several advanced algorithms that push the boundaries of scale and speed. Let's explore some of the most powerful techniques in modern vector search.

Inverted Multi-Index (IMI)

Standard IVF has a limitation: if you want 65,536 cells, you need to train 65,536 centroids. The Inverted Multi-Index cleverly uses the Cartesian product of two smaller codebooks.

Instead of one index with 2^16 cells, IMI uses two quantizers with 2^8 = 256 centroids each. Their combination gives us 2^16 = 65,536 cells, but we only train 2 × 256 = 512 centroids!

🏙️ Analogy: Like a 2D zip code system! Instead of 65,536 unique codes, use 256 "row codes" × 256 "column codes". Same coverage, far fewer codes to train!

[Interactive demo: a 2D multi-index grid with one quantizer per axis. With 2×4-bit quantizers, the query probes 4 of 256 cells, a search fraction of just 1.56%.]

Why IMI is Powerful

With 2×4-bit quantizers, IMI covers 256 cells while training only 2 × 16 = 32 centroids: 8× fewer than a standard IVF with 256 centroids.

How It Works

  1. Split the vector into 2 sub-vectors.
  2. Quantize each to its nearest centroid.
  3. The cell is the pair (centroid1, centroid2).
  4. Search probes nearby cells in the 2D grid.

FAISS Factory String

IMI2x4,PQ16

Multi-index with 2×4 bits = 256 cells, combined with PQ encoding.

Best For

  • ✓ Very large-scale (100M+ vectors)
  • ✓ When training many centroids is expensive
  • ✓ Fast/less-accurate operating points

NSG: Navigating Spreading-out Graph

While HNSW uses multiple layers, NSG (Navigating Spreading-out Graph) achieves similar performance with a single-layer graph. It guarantees a "monotonic" search path from any starting point to any target.

The key insight is the "navigating node": a central hub that can reach any other node. Combined with carefully chosen edges that "spread out" in all directions, NSG achieves excellent search efficiency.

🗺️ Analogy: Like navigating a social network: start from a central "hub" person, ask "who do you know that's closer to the target?", and repeat until you arrive.

[Interactive demo: click to place a query; the search walks from the navigating node toward it, counting visited nodes along the way. More neighbors per node means faster search but more memory.]

NSG vs HNSW

NSG: single layer, monotonic search, smaller memory footprint.
HNSW: multiple layers, logarithmic jumps, faster but more memory.

Key Insight

NSG guarantees a monotonic path from the navigating node to any other node. The "spreading-out" property ensures good coverage with minimal edges.

6. Advanced Quantization

Product Quantization is powerful, but researchers have developed even more sophisticated compression techniques. These methods achieve better accuracy at the same memory budget.

Residual Quantization (RQ)

Instead of splitting the vector into independent parts like PQ, Residual Quantization encodes the entire vector iteratively. Each stage encodes the "residual" (error) left by the previous stage.

Think of it like progressive refinement: first approximation → compute error → encode error → repeat. This often achieves better reconstruction quality than PQ at the same byte budget.

🎯 Analogy: Like GPS refining your location: first country, then city, then street, then building. Each stage encodes the "error" from the previous stage, progressively improving accuracy.

[Interactive demo: a 256-byte vector is encoded stage by stage; the reconstruction error shrinks after each of the 5 stages, while the compressed code grows by 1 byte per stage.]

💡 Key Insight

Unlike PQ, which splits the vector into independent parts, RQ iteratively refines the entire vector. Each stage focuses on what the previous stages "missed" (the residual). This often achieves better accuracy at the same compression level.

PQ: split → encode parts.
RQ: encode → refine error.

FAISS supports several additive quantizers: RQ (Residual Quantizer), LSQ (Local Search Quantizer), and PRQ (Product Residual Quantizer). Use the factory string RQ5x8_Nqint8 for a 5-stage residual quantizer with 8-bit norm quantization.
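To see the refinement idea in isolation, here is a toy residual quantizer in plain NumPy. It is illustrative only (a tiny k-means, small codebooks), not the FAISS trainer:

```python
# Toy residual quantization: each stage runs k-means on the residuals left by
# the previous stage, so the reconstruction error shrinks stage by stage.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 16)).astype(np.float32)

def kmeans(data, k, iters=10):
    # minimal Lloyd's algorithm; FAISS uses a much more careful implementation
    cent = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        assign = ((data[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = data[assign == j]
            if len(pts):
                cent[j] = pts.mean(0)
    return cent

residual = X.copy()
errors = []
for stage in range(4):                    # 4 stages -> 4 one-byte-ish codes
    cent = kmeans(residual, k=64)
    assign = ((residual[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
    residual = residual - cent[assign]    # what this stage "missed"
    errors.append(float((residual ** 2).sum(1).mean()))

print(errors)   # mean squared reconstruction error after each stage
```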

7. Hardware Acceleration

Software optimizations only go so far. To truly scale, we need to leverage modern hardware: GPUs for massive parallelism, and SIMD instructions for CPU efficiency.

GPU Implementation

Vector search is embarrassingly parallel: every distance calculation is independent. GPUs with thousands of cores can process these calculations simultaneously, achieving a 5-10× speedup over CPUs.

FAISS GPU supports GpuIndexFlat, GpuIndexIVFFlat, and GpuIndexIVFPQ. The conversion is simple: faiss.index_cpu_to_gpu(res, 0, index).

[Interactive demo: a CPU (8 cores) and a GPU (5,120 CUDA cores) search the same dataset of 100K-10M vectors; at one setting the CPU takes 16.0 ms (6,250 QPS) while the GPU takes 12.8 ms (7,813 QPS).]

Massive Parallelism
Massive Parallelism

GPUs process thousands of vectors in parallel. Each CUDA core handles a portion of the distance calculations.

High Bandwidth

HBM2 memory provides ~900 GB/s bandwidth vs ~50 GB/s for DDR4. Critical for loading vectors quickly.

Fast k-Selection

Custom parallel merge sort for finding top-k results without sorting all distances.

Parallel k-Selection (Top-10 from 32 Threads)

[Diagram: 32 threads each produce a local top-10 list; pairs of lists are merged in parallel rounds (32 → 16 → ... → 1) into the final top-10.]

Each thread finds its local top-10, then a parallel merge combines the results: O(log n) merge depth instead of O(n) sequential.

FAISS GPU Usage

import faiss
import numpy as np

# Random demo data (requires faiss-gpu and a CUDA device)
d = 128
vectors = np.random.rand(100_000, d).astype('float32')
queries = np.random.rand(10, d).astype('float32')

# Create CPU index
index_cpu = faiss.IndexFlatL2(d)
index_cpu.add(vectors)

# Move to GPU
res = faiss.StandardGpuResources()
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)

# Search on GPU (5-10x faster!)
D, I = index_gpu.search(queries, k=10)

FastScan: SIMD-Accelerated PQ

On CPU, FastScan uses SIMD (Single Instruction, Multiple Data) instructions to compute PQ distances much faster. By using 4-bit codes, lookup tables fit perfectly in CPU registers, and AVX2/AVX512 can process multiple vectors simultaneously.

This technique, described in the "Quicker ADC" paper, achieves up to a 4× speedup over scalar PQ distance computation. Enable it with the fs suffix: PQ16x4fs.

🚀 Key Idea: Process multiple vectors' lookups simultaneously using SIMD registers. AVX2 can pack 32 bytes (32 lookups) into one instruction!

[Interactive demo: 8 lookup tables of 16 pre-computed distances each (a 4-bit code indexes 16 centroids per sub-quantizer). One AVX2 register (256 bits = 32 bytes) processes 4 vectors × 8 sub-quantizers = 32 lookups per batch, cutting 32 scalar operations down to 8 SIMD operations.]

Why 4-bit codes?

  • 16 values fit in a lookup table.
  • The table fits in L1 cache.
  • 2 codes pack into 1 byte.
  • Perfect for SIMD gather instructions.
index = faiss.index_factory(128, "IVF1024,PQ16x4fs")

fs = FastScan mode with 4-bit codes. Up to 4× faster than regular PQ!

8. Choosing the Right Index

With so many options, how do you choose? Here's a practical guide based on your constraints:

Need Exact Results?

Use Flat. It's the only option that guarantees 100% recall.

Have Plenty of RAM?

Use HNSW32. Fastest search, but stores full vectors + graph.

Memory Constrained?

Use OPQ16_64,IVF65536,PQ16. Compresses to ~16 bytes/vector.

Billion-Scale?

Use OPQ16,IVF1048576_HNSW,PQ16x4fsr with GPU training.

The FAISS index factory makes it easy to experiment. Pass a string like faiss.index_factory(128, "IVF1024,PQ16x4fs") and FAISS builds the entire index structure for you.

[Interactive index factory builder: click components (presets, preprocessing, coarse quantizer, encoding) to assemble a factory string and see the resulting faiss.index_factory(d, "...") call.]

9. Real-World Use Cases

FAISS isn't just for benchmarks. Here are six production scenarios where vector search with FAISS powers real applications:

Semantic Search

Find documents by meaning, not just keywords. Query embeddings retrieve the most relevant passages from millions of documents.

index.search(embed("climate change effects"), k=10)

Image Similarity

Reverse image search and visual deduplication. Find visually similar products, detect copyright infringement, or organize photo libraries.

index.search(clip_embed(query_image), k=5)

RAG / LLM Memory

Retrieval-Augmented Generation grounds LLM responses in real data. FAISS serves as the fast retrieval backbone for context injection.

context = index.search(embed(user_query), k=3)
llm.generate(prompt + context)

Recommendation Systems

Item-to-item and user-to-item recommendations. Embed user preferences and product features, then find nearest neighbors.

similar = index.search(item_embedding, k=20)

Document Deduplication

Detect near-duplicate documents at scale. Hash or embed content, then use range search to find items within a similarity threshold.

lims, D, I = index.range_search(emb, thresh=0.95)

Anomaly Detection

Score outliers by distance to their nearest neighbors. Points far from any cluster are likely anomalies or novel inputs.

D, _ = index.search(point, k=5)
anomaly_score = D.mean()

Conclusion

Optimization is a trade-off. Flat indexes are perfect but unscalable. IVF offers a great balance of speed and memory. HNSW is the speed king but consumes more RAM.

Advanced techniques like IMI, Residual Quantization, and FastScan push the boundaries even further. GPU acceleration provides massive speedups for large-scale deployments.

By understanding the geometry of our data and the trade-offs of each method, we can navigate billion-scale datasets in milliseconds.


Get in Touch

I hope you enjoyed this article. If you have questions, found a bug, or just want to chat about AI and engineering, I'd love to hear from you.

Future Articles?

I'm planning more deep dives into:
• Graph Neural Networks
• Transformer Architecture
• System Design for ML
