How ScaNN Got Good at Billion-Scale MIPS: AVQ and SOAR
Introduction
Approximate nearest neighbor (ANN) search asks: given a query vector q and a large database X, return the k database vectors that are most similar to q without comparing q against every x ∈ X. Exact search is easy to define but too expensive at modern scale, so ANN systems trade a little recall for much lower latency and memory cost.
This post is about the maximum inner product search (MIPS) version of ANN, where similarity is q · x. That is the retrieval regime behind dense retrieval, recommendation, and extreme classification. ScaNN is one of the main systems for this setting: it partitions the database, scores the partitions to find the promising ones, and then reranks only the vectors inside them using compressed codes. It first uses approximate inner products to narrow the candidate set, then uses exact inner products on a much smaller subset.
The two papers discussed here are Accelerating Large-Scale Inference with Anisotropic Vector Quantization and SOAR: Improved Indexing for Approximate Nearest Neighbor Search.
There are other major ANN families too, including graph-based methods like HNSW and hashing-based methods like LSH, but the goal here is narrower: to explain the quantization-based line of ideas behind ScaNN.
Quantization is the step that makes those compressed codes possible. Instead of storing each vector, or each residual, in full precision, the index replaces it with a short code that points to a rough approximation. This is not just rounding each coordinate independently. In ANN systems, this is usually a learned encoding problem: the system builds codebooks from data and stores, for each vector, the index of the nearest codeword or combination of codewords. That loses some accuracy, but it saves memory and makes approximate scoring much faster.
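As a toy illustration of that encode-and-decode step, here is a minimal sketch using numpy and scikit-learn's KMeans (illustrative tools for this post, not what ScaNN actually uses): learn a codebook from the data, then store only the index of each vector's nearest codeword.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 64)).astype(np.float32)   # toy database

# Learn a 256-entry codebook from the data itself.
km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(X)

codes = km.labels_                       # one small integer id stored per vector
X_approx = km.cluster_centers_[codes]    # the rough approximation the id points to

# The memory win: 64 floats per vector become a single codeword id,
# at the cost of the reconstruction error below.
print(float(np.mean((X - X_approx) ** 2)))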
ScaNN-style search in pseudocode
BuildIndex(X):
    learn partition centroids C
    for each x in X:
        p = partition(x, C)
        code = quantize(x - centroid(p))
        store (x, code) in p

Search(q, k):
    for each centroid c in C:
        partition_score[c] = q · c
    P = top_partitions(partition_score, leaves_to_search)
    candidates = []
    for each partition p in P:
        for each (x, code) in p:
            approx_score = q · centroid(p) + approx_inner_product(q, code)
            add (x, approx_score) to candidates
    R = top_candidates(candidates, candidates_to_rerank)
    for each x in R:
        exact_score[x] = q · x
    return top_k(exact_score, k)
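The same Search flow in rough, runnable form: a numpy sketch that assumes the index pieces (partition centroids, per-bucket PQ codebooks, per-partition id and code arrays, plus the raw vectors for reranking) already exist. Names, shapes, and default parameters here are illustrative for this post, not ScaNN's actual API.

import numpy as np

def search(q, centroids, codebooks, partitions, X,
           leaves_to_search=10, candidates_to_rerank=100, k=10):
    """Toy ScaNN-style search: score partitions, scan PQ codes, rerank exactly.

    centroids:  (C, d) partition centroids
    codebooks:  (m, K, d // m) one codebook per PQ bucket (assumes d % m == 0)
    partitions: list of (ids, codes) pairs, codes of shape (n_p, m) codeword ids
    X:          (n, d) original vectors, used only for the final exact rerank
    """
    m, K, sub = codebooks.shape

    # 1) Score every partition centroid and keep the most promising ones.
    partition_scores = centroids @ q
    top_parts = np.argsort(-partition_scores)[:leaves_to_search]

    # 2) Build query-side lookup tables: the inner product of each block of q
    #    with every codeword in that bucket, so scoring a code is m table lookups.
    q_blocks = q.reshape(m, sub)
    luts = np.einsum('ms,mks->mk', q_blocks, codebooks)        # (m, K)

    cand_ids, cand_scores = [], []
    for p in top_parts:
        ids, codes = partitions[p]
        if len(ids) == 0:
            continue
        # approx q·x = q·centroid(p) + sum of per-bucket table lookups
        approx = partition_scores[p] + luts[np.arange(m), codes].sum(axis=1)
        cand_ids.append(ids)
        cand_scores.append(approx)

    cand_ids = np.concatenate(cand_ids)
    cand_scores = np.concatenate(cand_scores)

    # 3) Rerank a short list with exact inner products.
    short = cand_ids[np.argsort(-cand_scores)[:candidates_to_rerank]]
    exact = X[short] @ q
    return short[np.argsort(-exact)[:k]]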
Anisotropic Vector Quantization (AVQ): the standard error is the wrong objective
Quantization-based ANN systems were historically trained to minimize ordinary reconstruction error, ||x - code||². AVQ’s first move is to argue this is the wrong objective for MIPS.
In MIPS, only the top-scoring pairs affect the answer. Errors on database points that are plausible top matches matter much more than errors on low-scoring pairs — reconstructing a vector well in every direction wastes bits on regions of space that will never appear in a top-k list.
The score-aware loss
AVQ replaces reconstruction loss with a score-aware loss. The idea is simple: if a query would give x a high score, then quantization should preserve that score carefully; if a query would never retrieve x, errors matter much less.
Let r = x - code, where code stands for the decoded approximation of x rather than the stored ID. Then q · r is the score error caused by quantization, and w(q · x) says how important that query-datapoint pair is:
l(x, code) = E_q[ w(q · x) · (q · r)² ]
So AVQ is not asking “Did we reconstruct the vector well everywhere?” It is asking “Did we preserve the inner products that affect the top of the ranking?”
For a fixed function w and a query distribution, this can be rewritten as a quadratic form:
l(x, code) = r^T A_x r
A_x = E_q[w(q · x) qq^T]
Under the spherical-query assumption used in the paper, A_x is rotationally symmetric about x, so its eigenvectors are the direction of x and everything orthogonal to it. After rescaling, the quadratic form then collapses into an explicit decomposition that looks like:
||r_perp||² + λ ||r_parallel||²
where r_parallel is the projection of r onto x, r_perp is the orthogonal remainder, and λ > 1 whenever the weight function favors high-scoring pairs.
That is the core AVQ idea: error parallel to x is more harmful for MIPS than error orthogonal to x, so the quantizer should pay more attention to the parallel component. That is why the method is called anisotropic.
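A quick numerical check of that claim: the sketch below estimates the score-aware loss by Monte Carlo with spherical queries and a weight that only rewards high-scoring pairs (this particular weight is an illustrative choice, not the paper's), then compares a residual parallel to x against an orthogonal residual of the same size.

import numpy as np

rng = np.random.default_rng(0)
d = 64
x = np.zeros(d)
x[0] = 1.0                                   # datapoint along the first axis

# Spherical queries plus a weight that concentrates on queries scoring x highly.
q = rng.standard_normal((100_000, d))
q /= np.linalg.norm(q, axis=1, keepdims=True)
w = np.maximum(q @ x, 0.0) ** 3              # illustrative choice of w

def score_aware_loss(r):
    # Monte Carlo estimate of E_q[ w(q · x) (q · r)^2 ]
    return float(np.mean(w * (q @ r) ** 2))

eps = 0.1
r_parallel = eps * x                         # quantization error along x
r_perp = np.zeros(d)
r_perp[1] = eps                              # same-size error orthogonal to x

print(score_aware_loss(r_parallel))          # several times larger ...
print(score_aware_loss(r_perp))              # ... than this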
How the PQ codes work
For the compressed codes, ScaNN still uses the usual product-quantization structure. It splits the residual vector into buckets, learns a small codebook for each bucket, and stores one codeword ID per bucket. The codebooks are trained with a Lloyd-style alternation, like k-means, except that both the assignment step and the codeword update are carried out under the anisotropic loss rather than plain squared error.
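Structurally, the encoding side looks like the sketch below. For brevity it trains each bucket's codebook with plain k-means on random stand-in "residuals"; the real training replaces the squared-error steps with the anisotropic loss and runs on actual residuals x - centroid(p). All sizes here are illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, m, K = 64, 8, 16                          # dims, PQ buckets, codewords per bucket
residuals = rng.standard_normal((10_000, d)).astype(np.float32)

sub = d // m
blocks = residuals.reshape(-1, m, sub)       # split each residual into m blocks

codebooks = np.zeros((m, K, sub), dtype=np.float32)
codes = np.zeros((len(residuals), m), dtype=np.uint8)   # m small ids per vector

for j in range(m):
    km = KMeans(n_clusters=K, n_init=1, random_state=0).fit(blocks[:, j, :])
    codebooks[j] = km.cluster_centers_
    codes[:, j] = km.labels_                 # one codeword ID per bucket

# Decoding concatenates the chosen codeword from each bucket.
approx = codebooks[np.arange(m), codes].reshape(len(residuals), d)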
Spilling with Orthogonality-Amplified Residuals (SOAR): redundancy that actually helps
Even with a well-trained quantizer, a partitioned index can still miss true neighbors. If a point x is truly relevant to a query q but its residual r = x - c has a large component along q, the centroid score q · c understates x's relevance and the partition containing x won't be visited. One classical fix is to replicate x in a second partition.
SOAR’s claim is that naive redundancy barely helps. Storing x at its second-closest centroid, or building two VQ indices with different random seeds, produces highly correlated failures: if the primary partition ranks x poorly for q, the backup usually does too. The backup centroid is a slightly worse version of the first one.
The same geometric trick, applied to assignments
SOAR asks the same kind of question as AVQ, but at the level of partition assignments rather than codewords. Let r = x - c_primary be the residual to the primary centroid and r' = x - c_spill the residual to a backup centroid. If q aligns too strongly with r, then q · c_primary = q · x - q · r becomes too small and the primary partition may be skipped.
Under the same uniform-query-on-the-sphere assumption, and using the weight function w(t) = |t|^λ applied to the true score t = q · x, the expected weighted score error for the spilled assignment reduces to a form proportional to:
||r'||² + λ ||proj_r r'||²
In practice that means: keep the spilled centroid reasonably close to the point, but penalize candidate residuals that are too parallel to the original residual. This has the same shape as the AVQ objective: a baseline squared-error term plus an extra penalty on one special direction. In AVQ, the expensive direction was the component of the quantization error parallel to x. In SOAR, the expensive direction is the component of the backup residual parallel to the primary residual r.
The result is an orthogonality-amplified backup assignment that is explicitly trained to compensate for the primary assignment’s blind spots rather than duplicating them.
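A sketch of what that assignment rule can look like in code; the λ value and the detail of excluding only the primary partition are simplifications made for this post, not the paper's exact procedure.

import numpy as np

def pick_spill_partition(x, centroids, primary, lam=1.0):
    """Choose a backup partition for x with a SOAR-style score:
    ||r'||^2 plus an extra penalty on the part of r' parallel to the
    primary residual r, so the backup covers the primary's blind spot."""
    r = x - centroids[primary]                    # primary residual
    r_hat = r / (np.linalg.norm(r) + 1e-12)

    resid = x - centroids                         # residual to every candidate centroid
    parallel = resid @ r_hat                      # component along the primary residual
    scores = (resid ** 2).sum(axis=1) + lam * parallel ** 2

    scores[primary] = np.inf                      # don't reuse the primary partition
    return int(np.argmin(scores))

Compared with simply taking the second-closest centroid, the only change is the extra parallel-penalty term, which pushes the backup residual toward directions the primary residual does not cover.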
The Shared Idea
Both papers exploit the same geometric fact: for inner-product ranking, not all components of a residual are equal. Whichever residual you’re analyzing — the quantization error relative to the datapoint (AVQ) or the secondary assignment relative to the primary residual (SOAR) — the parallel component is the expensive one.
The uniform-query assumption is the same simplification in both papers, and the same caveat: real query distributions are rarely that clean. In both cases the assumption makes the objective decomposable into something you can actually train against, and the empirical gains suggest the approximation is good enough to be useful.
Time per Query and Index Memory
Let n be the number of database vectors, d the embedding dimension, C the number of partitions, L the number of partitions searched per query, m the number of PQ buckets, K the codebook size per bucket, R the rerank set size, and ρ the average number of extra spilled assignments per vector. Without spilling, ρ = 0.
Per-query time has four main pieces:
- score all partition centroids: O(Cd)
- build the query-side PQ lookup tables: O(Kd)
- scan compressed codes in the visited partitions: O(Bm)
- rerank the short list exactly: O(Rd)
If the index is roughly balanced, then the number of scanned compressed codes is approximately
B ≈ (1 + ρ) · L · n / C
so a reasonable approximation is
T_query = O(Cd + Kd + Bm + Rd)
The index memory is dominated by three pieces: the partition centroids O(Cd), the PQ codebooks O(Kd), and the stored PQ codes O((1 + ρ)nm). So a simple size estimate is
M_index = O(Cd + Kd + (1 + ρ)nm)
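Plugging in illustrative numbers makes it easy to see which term dominates; every value below is made up for the example, not a tuning recommendation.

# Back-of-the-envelope estimate of per-query work and index size.
n   = 100_000_000   # database vectors
d   = 128           # embedding dimension
C   = 50_000        # partitions
L   = 100           # partitions searched per query
m   = 32            # PQ buckets per vector
K   = 16            # codewords per bucket
R   = 1_000         # vectors reranked exactly
rho = 1.0           # roughly one extra spilled assignment per vector

B = (1 + rho) * L * n / C              # compressed codes scanned per query

print(f"centroid scoring: {C * d:>15,.0f}")
print(f"LUT build:        {K * d:>15,.0f}")
print(f"code scan:        {B * m:>15,.0f}")
print(f"exact rerank:     {R * d:>15,.0f}")

# Index size, ignoring vector ids and the full-precision vectors kept for reranking.
index_bytes = C * d * 4 + K * d * 4 + (1 + rho) * n * m   # one byte per code if K <= 256
print(f"index size ~ {index_bytes / 1e9:.1f} GB")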
Takeaway
For MIPS-style retrieval at scale, two of the biggest practical gains have come from the same geometric observation applied twice:
- When you compress a vector, spend bits preferentially on the direction that changes inner-product ranking.
- When you replicate a vector across partitions, make the second copy’s residual point somewhere the first copy’s residual doesn’t.
Both steps end up penalizing a specific kind of parallel residual, and together they’re a big part of why VQ-based MIPS indices work as well as they do at scale.