Inside FAISS: Billion-Scale Similarity Search
1. Everything is a Vector
In modern AI, images, text, and audio are not understood by computers as "cat" or "symphony". Instead, they are converted into lists of numbers called embeddings or vectors.
These vectors live in a high-dimensional geometric space. The core idea is simple: items that are semantically similar are placed close together in this space.
1. The Challenge
Computers don't have eyes or ears. They only understand numbers. How do we explain the concept of a 'Dog' or a 'Cat' to a machine?
2. Vectorization
We use Neural Networks to transform raw data (pixels, text) into a list of numbers. This list is the 'Embedding'. It captures the semantic meaning.
3. The Vector Space
These numbers act as coordinates (X, Y, Z...). We can plot every image or word as a single point in a high-dimensional space.
4. Similarity is Distance
In this space, concepts that are similar are placed close together. 'Cat' is close to 'Dog', but far from 'Car'. We measure this using distance (Euclidean or Cosine).
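To make the two distance measures concrete, here is a minimal pure-Python sketch. The 2D "embeddings" for cat, dog, and car are made-up illustrative values, not real model outputs:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means pointing in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

cat = [0.9, 0.8]   # hypothetical 2D embeddings
dog = [0.8, 0.9]
car = [-0.9, 0.1]

# 'cat' is closer to 'dog' than to 'car' under both measures.
assert euclidean(cat, dog) < euclidean(cat, car)
assert cosine_similarity(cat, dog) > cosine_similarity(cat, car)
```

Note that Euclidean distance cares about magnitude while cosine only cares about direction; for normalized embeddings the two rankings coincide.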
Above, we visualize a simplified 2D slice of this space. Imagine this in 1024 dimensions. Finding the "nearest neighbor" for a query point is equivalent to finding the most similar item in your database.
2. The Scale Problem
The simplest way to find the nearest neighbor is Brute Force (or Exact Search). You take your query vector and calculate the distance to every single other vector in the database.
This works perfectly for 10,000 items. But what about 1 billion? If you have 1B items, you perform 1B distance calculations for every search. That's O(N). It's too slow for real-time applications like Google Search or ChatGPT memory.
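The O(N) scan is simple enough to write in a few lines. This sketch uses squared L2 distance and plain Python lists; it is the baseline every index in this article tries to beat:

```python
import random

def brute_force_knn(query, database, k):
    # O(N): compute the distance to every vector, keep the k smallest.
    scored = []
    for i, vec in enumerate(database):
        d = sum((q - v) ** 2 for q, v in zip(query, vec))  # squared L2
        scored.append((d, i))
    scored.sort()
    return scored[:k]

random.seed(0)
db = [[random.random() for _ in range(8)] for _ in range(10_000)]
query = db[42]  # a database vector is its own nearest neighbor

top = brute_force_knn(query, db, k=3)
assert top[0] == (0.0, 42)  # exact search always finds the true neighbor
```

At 10,000 vectors this runs in well under a second; at a billion vectors the same loop becomes the bottleneck the rest of this article is about.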
3. FAISS to the Rescue
FAISS (Facebook AI Similarity Search) introduces approximate search methods. By sacrificing a tiny bit of accuracy (maybe you get the 2nd best match instead of the 1st), we can speed up search by orders of magnitude.
Let's explore three key indexing strategies:
- Flat: The baseline brute-force. Slow but accurate.
- IVF (Inverted File): Partitions the space into "Voronoi cells". We only look in the most promising cells.
- HNSW (Hierarchical Navigable Small World): A multi-layered graph structure. Like a highway system for vectors.
Try it yourself. Increase the dataset size and see how different indexes perform.
Interactive Index Benchmark
Compare the performance of different indexing algorithms on random datasets.
4. Deep Dive into Optimizations
Let's look under the hood. How do these algorithms actually work to achieve such speed?
IVF: Divide and Conquer
The Inverted File (IVF) index works like a library classification system. Instead of scanning every book to find a specific title, you go to the "Science Fiction" section first.
Mathematically, it uses K-Means clustering to partition the vector space into Voronoi cells. When we search, we first identify which cell our query vector falls into (using a "coarse quantizer"), and then we only calculate distances for vectors inside that cell (and perhaps a few neighboring ones).
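The two-step search described above can be sketched in pure Python. This is a toy, not FAISS: a few rounds of Lloyd's k-means stand in for the trained coarse quantizer, and `nprobe` controls how many cells are scanned:

```python
import random

random.seed(1)
dim, n_cells = 2, 4

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy "training": a few Lloyd's k-means rounds to place the cell centroids.
data = [[random.random() for _ in range(dim)] for _ in range(400)]
centroids = random.sample(data, n_cells)
for _ in range(10):
    cells = [[] for _ in range(n_cells)]
    for v in data:
        cells[min(range(n_cells), key=lambda c: dist2(v, centroids[c]))].append(v)
    centroids = [
        [sum(col) / len(cell) for col in zip(*cell)] if cell else centroids[c]
        for c, cell in enumerate(cells)
    ]

# Build the inverted lists: each vector goes into its nearest cell's bucket.
buckets = [[] for _ in range(n_cells)]
for i, v in enumerate(data):
    buckets[min(range(n_cells), key=lambda c: dist2(v, centroids[c]))].append(i)

def ivf_search(query, nprobe=1):
    # Coarse step: rank cells by centroid distance, scan only the top nprobe.
    order = sorted(range(n_cells), key=lambda c: dist2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return min(candidates, key=lambda i: dist2(query, data[i]))

q = data[7]
assert ivf_search(q, nprobe=1) == 7        # found in its own cell
assert ivf_search(q, nprobe=n_cells) == 7  # probing every cell is exact search
```

With `nprobe=1` only a quarter of the dataset (on average) is scanned; raising `nprobe` trades speed back for recall, which is exactly the knob FAISS exposes on `IndexIVFFlat`.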
IVF Clustering Animation
Watch how random data is organized into efficient search cells.
Step 1, Raw Data: here is our raw dataset, 80 vectors in a 2D space. Searching this linearly (brute force) would require calculating the distance to all 80 points.
HNSW: The Highway System
Hierarchical Navigable Small World (HNSW) graphs are currently the state of the art for vector search. They solve the "small world" navigation problem using a multi-layered graph.
Think of it like a map with different zoom levels. The top layer has very few cities (nodes) connected by long highways. You start there to quickly get to the right region. Then you drop down to regional roads (middle layers) and finally to local streets (base layer) to find the exact address.
HNSW builds a multi-layer graph. Like an express highway system:
• Layer 2: High-speed connections, long jumps.
• Layer 1: Regional roads.
• Layer 0: Local streets (contains all data).
The search starts at the top, finds the best entry point, then moves down to refine the search. This logarithmic scaling is what makes it so fast.
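The core move at every layer is the same greedy descent. This pure-Python sketch builds only a single layer (a crude k-NN graph, where real HNSW builds the graph incrementally and adds sparser layers on top) but shows the "hop to the closest neighbor until stuck" logic:

```python
import random

random.seed(2)
points = [[random.random(), random.random()] for _ in range(200)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Build a crude k-NN graph: each point links to its 8 nearest neighbors.
# (Real HNSW builds this incrementally and adds sparser upper layers.)
k = 8
neighbors = []
for i, p in enumerate(points):
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist2(p, points[j]))
    neighbors.append(order[:k])

def greedy_search(query, entry=0):
    # Walk the graph: hop to whichever neighbor is closest to the query,
    # and stop when no neighbor improves on the current node.
    current, hops = entry, 0
    while True:
        hops += 1
        best = min(neighbors[current], key=lambda j: dist2(query, points[j]))
        if dist2(query, points[best]) < dist2(query, points[current]):
            current = best
        else:
            return current, hops

q = points[150]
found, hops = greedy_search(q)
# The walk touches a handful of nodes, not all 200, and stops at a local
# minimum: no neighbor of `found` is closer to the query than `found` is.
assert hops <= len(points)
assert all(dist2(q, points[j]) >= dist2(q, points[found]) for j in neighbors[found])
```

A single flat graph like this can get stuck in a local minimum far from the true neighbor; HNSW's upper layers exist precisely to drop the search into the right region before this local walk begins.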
Product Quantization: Compressing Space
Even with fast indexing, storing billions of vectors requires massive amounts of RAM. Product Quantization (PQ) is a lossy compression technique that solves this.
It splits a high-dimensional vector (e.g., 1024 floats) into smaller chunks (e.g., 8 chunks of 128 floats). Each chunk is then approximated by the ID of a "centroid" in that sub-space. Instead of storing the full float values, we just store the centroid IDs (often 1 byte each).
Compression: Instead of storing 1,024 float numbers (4KB), we split the vector into M sub-vectors.
Each sub-vector is replaced by the ID of its nearest centroid (a simple integer, usually 1 byte).
This can reduce memory usage by 90-95% while maintaining good search accuracy.
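A toy PQ encoder makes the split-and-quantize idea tangible. Here random training samples stand in for proper k-means codebooks (real PQ trains each sub-space codebook with k-means), but the encode/decode mechanics are the same:

```python
import random

random.seed(3)
dim, M, ksub = 8, 4, 16   # 8-dim vectors, 4 sub-vectors of 2 dims, 16 centroids each
dsub = dim // M

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# "Train" one codebook per sub-space; random samples stand in for k-means here.
train = [[random.random() for _ in range(dim)] for _ in range(500)]
codebooks = [
    [v[m * dsub:(m + 1) * dsub] for v in random.sample(train, ksub)]
    for m in range(M)
]

def encode(vec):
    # Each sub-vector is replaced by the ID of its nearest sub-centroid.
    return [
        min(range(ksub), key=lambda c: dist2(vec[m * dsub:(m + 1) * dsub], codebooks[m][c]))
        for m in range(M)
    ]

def decode(code):
    # Reconstruction: concatenate the chosen centroids (lossy!).
    out = []
    for m, c in enumerate(code):
        out.extend(codebooks[m][c])
    return out

v = train[0]
code = encode(v)       # M small integers instead of dim floats
approx = decode(code)
assert len(code) == M
assert dist2(v, approx) < dist2(v, [0.0] * dim)  # far better than a trivial guess
```

With 4-bit codes (ksub=16), each sub-vector costs half a byte, which is where the 90-95% memory reduction comes from.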
Putting it Together: IVF + PQ
In practice, billion-scale systems often combine these methods. IVF-PQ is the industry standard for efficient memory usage and fast search. We use IVF to narrow the search space, and PQ to compress the vectors so they fit in RAM.
Here is the complete end-to-end journey of a vector in an IVF-PQ index.
1. The Scale Problem
Searching millions of high-dimensional vectors linearly is too slow. We need a way to narrow down the search space and compress the data simultaneously.
2. Inverted File (IVF)
First, we cluster the entire dataset into 'cells' (Voronoi regions). We store vectors in buckets based on their nearest cluster centroid. At search time, we only look at a few promising buckets.
3. Product Quantization (PQ): Splitting
Even with IVF, storing full float vectors (e.g., 1024 dims) is heavy. PQ splits each high-dimensional vector into 'M' smaller sub-vectors.
4. PQ: Quantization (Codebooks)
Each sub-vector is replaced by the ID of the nearest 'sub-centroid' from a learned Codebook. We don't store the vector anymore, just a short list of IDs (e.g., 4 bytes vs 4KB).
5. The Stored Index
This is what we actually store in RAM: An Inverted List where each cell contains the Compressed PQ Codes of its vectors. It's tiny and fast.
6. Search: Coarse Step (IVF)
When a Query comes in, we first compare it to the main IVF Centroids. We pick the closest one(s) (e.g., 'nprobe=1'). We ignore the rest of the dataset.
7. Search: Fine Step (PQ)
Inside the target cell, we don't reconstruct vectors. We compare the Query to the Codebooks directly to estimate distance. This is extremely fast via lookup tables.
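Step 7 is known as asymmetric distance computation (ADC). The trick: per query, precompute a small table of distances from each query sub-vector to every sub-centroid; the distance to any stored code is then just M table lookups and additions. A pure-Python sketch with hypothetical codebooks and codes:

```python
import random

random.seed(4)
dim, M, ksub = 8, 4, 16
dsub = dim // M

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical codebooks and PQ codes for the vectors in one IVF cell.
codebooks = [[[random.random() for _ in range(dsub)] for _ in range(ksub)]
             for _ in range(M)]
cell_codes = [[random.randrange(ksub) for _ in range(M)] for _ in range(100)]

def adc_search(query):
    # Once per query: distance from each query sub-vector to every
    # sub-centroid. Only M * ksub = 64 real distance computations.
    tables = [
        [dist2(query[m * dsub:(m + 1) * dsub], codebooks[m][c]) for c in range(ksub)]
        for m in range(M)
    ]
    # Per stored vector: M table lookups + adds, no vector reconstruction.
    best, best_d = None, float("inf")
    for i, code in enumerate(cell_codes):
        d = sum(tables[m][code[m]] for m in range(M))
        if d < best_d:
            best, best_d = i, d
    return best, best_d

def reconstruct(code):
    out = []
    for m, c in enumerate(code):
        out.extend(codebooks[m][c])
    return out

# Sanity check: the ADC estimate equals the exact squared distance
# to the (lossy) reconstructed vector.
q = [random.random() for _ in range(dim)]
i, d = adc_search(q)
assert abs(d - dist2(q, reconstruct(cell_codes[i]))) < 1e-9
```

Because the per-vector work is pure table lookups, this inner loop is also what FastScan (covered later) accelerates with SIMD.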
5. Advanced Indexing Methods
Beyond the basics, FAISS implements several advanced algorithms that push the boundaries of scale and speed. Let's explore some of the most powerful techniques in modern vector search.
Inverted Multi-Index (IMI)
Standard IVF has a limitation: if you want 65,536 cells, you need to train 65,536 centroids. The Inverted Multi-Index cleverly uses a Cartesian product of two smaller codebooks.
Instead of one index with 2¹⁶ cells, IMI uses two indexes with 2⁸ cells each. The combination gives us 2⁸ × 2⁸ = 65,536 cells, but we only train 2 × 256 = 512 centroids!
🏙️ Analogy: like a 2D zip-code system! Instead of 65,536 unique codes, use 256 "row codes" × 256 "column codes". Same coverage, far fewer codes to train.
How It Works
1. Split the vector into 2 sub-vectors.
2. Quantize each sub-vector to its nearest centroid.
3. The cell is the pair (centroid₁, centroid₂).
4. Search probes nearby cells in the 2D grid.
FAISS Factory String
IMI2x4,PQ16 — a multi-index with 2×4 bits = 256 cells, combined with PQ encoding.
Best For
- ✓ Very large-scale datasets (100M+ vectors)
- ✓ When training many centroids is expensive
- ✓ Fast, less accurate operating points
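The cell-assignment step is simple to sketch. This toy uses two codebooks of 16 centroids each (real IMI typically uses 2^8 or more per half), so 32 trained centroids address 256 distinct cells:

```python
import random

random.seed(5)
dim, half, ksub = 8, 4, 16   # two codebooks of 16 centroids -> 16 * 16 = 256 cells

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical codebooks, one per vector half (real IMI trains these).
codebook1 = [[random.random() for _ in range(half)] for _ in range(ksub)]
codebook2 = [[random.random() for _ in range(half)] for _ in range(ksub)]

def imi_cell(vec):
    # Quantize each half independently; the cell ID is the pair of IDs.
    c1 = min(range(ksub), key=lambda c: dist2(vec[:half], codebook1[c]))
    c2 = min(range(ksub), key=lambda c: dist2(vec[half:], codebook2[c]))
    return (c1, c2)

v = [random.random() for _ in range(dim)]
c1, c2 = imi_cell(v)
assert 0 <= c1 < ksub and 0 <= c2 < ksub
# 2 * 16 = 32 trained centroids address 16 * 16 = 256 distinct cells.
assert ksub * ksub == 256
```

Searching then probes cells near (c1, c2) in this 2D grid, expanding outward in order of estimated distance.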
NSG: Navigating Spreading-out Graph
While HNSW uses multiple layers, NSG (Navigating Spreading-out Graph) achieves similar performance with a single-layer graph. It guarantees a "monotonic" search path from any starting point to any target.
The key insight is the "navigating node": a central hub that can reach any other node. Combined with carefully chosen edges that "spread out" in all directions, NSG achieves excellent search efficiency.
🗺️ Analogy: like navigating a social network. Start from a central "hub" person, ask "who do you know that's closer to the target?", and repeat until you find them.
Key Insight
NSG guarantees a monotonic path to any node from the navigating node. The "spreading-out" property ensures good coverage with minimal edges, though more neighbors per node means faster search at the cost of more memory.
6. Advanced Quantization
Product Quantization is powerful, but researchers have developed even more sophisticated compression techniques. These methods achieve better accuracy at the same memory budget.
Residual Quantization (RQ)
Instead of splitting the vector into independent parts like PQ, Residual Quantization (RQ) encodes the entire vector iteratively. Each stage encodes the "residual" (error) left by the previous stage.
Think of it like progressive refinement: first approximation → compute error → encode error → repeat. This often achieves better reconstruction quality than PQ at the same byte budget.
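The encode loop is short enough to sketch in pure Python. The codebooks here are random Gaussians rather than trained stages (real RQ trains each stage on the residuals of the previous ones); each codebook also includes the zero vector so a stage can "pass", which guarantees the error never grows:

```python
import random

random.seed(6)
dim, stages, ksub = 8, 4, 16

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical per-stage codebooks (real RQ trains each stage on the
# residuals left by the previous stages). Entry 0 is the zero vector.
codebooks = [
    [[0.0] * dim] + [[random.gauss(0, 0.5) for _ in range(dim)]
                     for _ in range(ksub - 1)]
    for _ in range(stages)
]

def rq_encode(vec):
    residual, code, errs = list(vec), [], []
    for book in codebooks:
        # Encode the current residual, then subtract the chosen centroid.
        c = min(range(ksub), key=lambda j: dist2(residual, book[j]))
        code.append(c)
        residual = [r - x for r, x in zip(residual, book[c])]
        errs.append(dist2(residual, [0.0] * dim))
    return code, errs

v = [random.gauss(0, 1) for _ in range(dim)]
code, errs = rq_encode(v)

assert len(code) == stages  # stages small integers encode the whole vector
# Each stage encodes what the previous stages missed, so the
# reconstruction error can only shrink (or stay flat) stage by stage.
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errs, errs[1:]))
```

Decoding is the reverse: sum the chosen centroid from every stage. Unlike PQ, every stage sees the full dimensionality of the vector.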
🎯 Analogy: like GPS refining your location: first country, then city, then street, then building. Each stage encodes the "error" of the previous stage, progressively improving accuracy.
💡 Key Insight
Unlike PQ, which splits the vector into independent parts, RQ iteratively refines the entire vector. Each stage focuses on what the previous stages "missed" (the residual). This often achieves better accuracy at the same compression level.
FAISS supports several additive quantizers: RQ (Residual Quantizer), LSQ (Local Search Quantizer), and PRQ (Product Residual Quantizer). Use the factory string RQ5x8_Nqint8 for a 5-stage residual quantizer with 8-bit norm quantization.
7. Hardware Acceleration
Software optimizations only go so far. To truly scale, we need to leverage modern hardware: GPUs for massive parallelism, and SIMD instructions for CPU efficiency.
GPU Implementation
Vector search is embarrassingly parallel: every distance calculation is independent. GPUs with thousands of cores can process these calculations simultaneously, achieving 5-10× speedups over CPUs.
FAISS GPU supports GpuIndexFlat, GpuIndexIVFFlat, and GpuIndexIVFPQ. The conversion is a single call: faiss.index_cpu_to_gpu(res, 0, index).
GPU vs CPU Performance
Massive Parallelism
GPUs process thousands of vectors in parallel. Each CUDA core handles a portion of the distance calculations.
High Bandwidth
HBM2 memory provides ~900 GB/s bandwidth vs ~50 GB/s for DDR4. Critical for loading vectors quickly.
Fast k-Selection
Custom parallel merge sort for finding top-k results without sorting all distances.
Parallel k-Selection (Top-10 from 32 threads)
Each thread finds its local top-10, then a parallel merge combines the results: O(log n) merge depth instead of an O(n) sequential pass.
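The same local-then-merge structure can be sketched sequentially in Python with `heapq` (the real FAISS version runs the chunks in parallel on GPU warps; this only shows why the merge is correct):

```python
import heapq
import random

random.seed(7)
distances = [random.random() for _ in range(100_000)]

def topk_merge(dists, k=10, chunks=32):
    # Each "thread" keeps only its local top-k (a tiny heap), then the
    # 32 * k local winners are merged. No global sort of all N values.
    size = len(dists) // chunks
    local_winners = [
        heapq.nsmallest(k, dists[i * size:(i + 1) * size])
        for i in range(chunks)
    ]
    return heapq.nsmallest(k, [d for loc in local_winners for d in loc])

# The merge is exact: any global top-k element is also a top-k element
# of its own chunk, so it survives into the merge step.
assert topk_merge(distances) == sorted(distances)[:10]
```

The correctness argument in the final comment is the key property: partial selection loses nothing, which is why FAISS can avoid sorting a billion distances per query.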
FAISS GPU Usage
```python
import faiss

# Create CPU index
index_cpu = faiss.IndexFlatL2(128)
index_cpu.add(vectors)

# Move to GPU
res = faiss.StandardGpuResources()
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)

# Search on GPU (5-10x faster!)
D, I = index_gpu.search(queries, k=10)
```
FastScan: SIMD-Accelerated PQ
On CPU, FastScan uses SIMD (Single Instruction, Multiple Data) instructions to compute PQ distances much faster. By using 4-bit codes, lookup tables fit perfectly in CPU registers, and AVX2/AVX512 can process multiple vectors simultaneously.
This technique, described in the "Quicker ADC" paper, achieves up to 4× speedup over scalar PQ distance computation. Enable it with the fs suffix: PQ16x4fs.
🚀 Key Idea: process multiple vectors' lookups simultaneously in SIMD registers. AVX2 can pack 32 bytes (32 lookups) into one instruction, and each lookup table holds 16 pre-computed distances (a 4-bit code indexes 16 centroids).
Why 4-bit codes?
- 16 values fit in a lookup table
- The table fits in L1 cache
- 2 codes pack into 1 byte
- Perfect for SIMD gather instructions
index = faiss.index_factory(128, "IVF1024,PQ16x4fs") — the fs suffix enables FastScan mode with 4-bit codes, up to 4× faster than regular PQ.
8. Choosing the Right Index
With so many options, how do you choose? Here's a practical guide based on your constraints:
Need Exact Results?
Use Flat. It's the only option that guarantees 100% recall.
Have Plenty of RAM?
Use HNSW32. Fastest search, but stores full vectors + graph.
Memory Constrained?
Use OPQ16_64,IVF65536,PQ16. Compresses to ~16 bytes/vector.
Billion-Scale?
Use OPQ16,IVF1048576_HNSW,PQ16x4fsr with GPU training.
The FAISS index factory makes it easy to experiment. Pass a string like faiss.index_factory(128, "IVF1024,PQ16x4fs") and FAISS builds the entire index structure for you.
Try the interactive builder below to understand how different components combine:
Index Factory Builder
Build FAISS index strings interactively
9. Real-World Use Cases
FAISS isn't just for benchmarks. Here are six production scenarios where vector search with FAISS powers real applications:
Semantic Search
Find documents by meaning, not just keywords. Query embeddings retrieve the most relevant passages from millions of documents.
index.search(embed("climate change effects"), k=10)
Image Similarity
Reverse image search and visual deduplication. Find visually similar products, detect copyright infringement, or organize photo libraries.
index.search(clip_embed(query_image), k=5)
RAG / LLM Memory
Retrieval-Augmented Generation grounds LLM responses in real data. FAISS serves as the fast retrieval backbone for context injection.
context = index.search(embed(user_query), k=3)
llm.generate(prompt + context)
Recommendation Systems
Item-to-item and user-to-item recommendations. Embed user preferences and product features, then find nearest neighbors.
similar = index.search(item_embedding, k=20)
Document Deduplication
Detect near-duplicate documents at scale. Hash or embed content, then use range search to find items within a similarity threshold.
lims, D, I = index.range_search(emb, thresh=0.95)
Anomaly Detection
Score outliers by distance to their nearest neighbors. Points far from any cluster are likely anomalies or novel inputs.
D, _ = index.search(point, k=5)
anomaly_score = D.mean()
Conclusion
Optimization is a trade-off. Flat indexes are perfect but unscalable. IVF offers a great balance of speed and memory. HNSW is the speed king but consumes more RAM.
Advanced techniques like IMI, Residual Quantization, and FastScan push the boundaries even further. GPU acceleration provides massive speedups for large-scale deployments.
By understanding the geometry of our data and the trade-offs of each method, we can navigate billion-scale datasets in milliseconds.
References
- Billion-scale similarity search with GPUs (Johnson et al., 2017) - The original FAISS paper.
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (Malkov & Yashunin, 2016) - The HNSW algorithm.
- Product quantization for nearest neighbor search (Jegou et al., 2011) - The foundation of PQ.
- Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph (Fu et al., 2019) - The NSG algorithm.
- Quicker ADC: Unlocking the Hidden Potential of Product Quantization with SIMD (André et al., 2019) - FastScan technique.
- Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search (Liu et al., 2015) - Residual quantization.
- The inverted multi-index (Babenko & Lempitsky, 2012) - IMI paper.
- FAISS GitHub Repository
Get in Touch
I hope you enjoyed this article. If you have questions, found a bug, or just want to chat about AI and engineering, I'd love to hear from you.
Future Articles?
I'm planning more deep dives into:
• Graph Neural Networks
• Transformer Architecture
• System Design for ML