
Vector Quantization

Quantization trades a small amount of recall for large reductions in memory and faster search. All three methods in vectlite use a 2-stage pipeline: a fast quantized scan picks candidates, then exact float32 distance computation rescores the top hits.
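The two-stage shape of the pipeline can be sketched in plain numpy. This is a conceptual illustration, not vectlite's implementation: a cheap coarse representation (here just the sign of each component) ranks everything, and only a short candidate list gets exact float32 scoring.

```python
import numpy as np

def two_stage_search(query, vectors, coarse_fn, k=10, multiplier=4):
    """Generic 2-stage search: coarse quantized scan -> exact rescore.

    coarse_fn maps float32 vectors to a cheap representation whose
    dot products approximate the true similarity.
    """
    q_c = coarse_fn(query)
    v_c = coarse_fn(vectors)
    # Stage 1: fast approximate scan picks k * multiplier candidates
    approx = v_c @ q_c
    candidates = np.argsort(-approx)[: k * multiplier]
    # Stage 2: exact float32 rescore of the shortlist only
    exact = vectors[candidates] @ query
    order = np.argsort(-exact)[:k]
    return candidates[order]

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
top = two_stage_search(q, vecs, coarse_fn=np.sign, k=5)
```

The expensive exact computation touches only `k * multiplier` vectors instead of the whole collection, which is where the speedup comes from.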

Quantization parameters persist in a .vdb.quant sidecar file and reload automatically on open. The quantized index is rebuilt automatically on inserts, upserts, and bulk ingestion.

What gets compressed

The compression numbers describe the in-memory candidate index used during search. The .vdb file on disk still keeps the original float32 vectors so the exact-rescoring stage can do its job. If you want to reduce disk footprint, quantization is the wrong tool — look at column-level compaction or external storage.

Methods at a glance

| Method | Memory | Recall | Best for |
|--------|--------|--------|----------|
| scalar (int8) | ~4x reduction | ~99% | The safe default — minimal recall loss. |
| binary | ~32x reduction | 90-98% (depends on dataset) | Normalized embeddings, very large collections. |
| product | configurable (typ. 8-64x) | 85-97% (depends on params) | Massive datasets where memory dominates. |
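The memory columns follow from simple per-vector arithmetic. As a back-of-the-envelope check for a 768-dim collection (illustrative arithmetic, not a vectlite API):

```python
dim = 768
raw = dim * 4       # float32: 4 bytes per component -> 3072 bytes
scalar = dim * 1    # int8: 1 byte per component -> 768 bytes
binary = dim // 8   # 1 bit per component -> 96 bytes
pq = 16             # e.g. 16 sub-vector codes at 1 byte each -> 16 bytes

print(raw // scalar, raw // binary, raw // pq)  # 4 32 192
```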

Scalar quantization (int8)

4x compression. Calibrates per-dimension min/max ranges and stores each component as a signed byte.

Python

db.enable_quantization("scalar")
# Search is transparently faster; recall is essentially unchanged
results = db.search(query, k=10)

Node.js

db.enableQuantization('scalar')
const results = db.search(query, { k: 10 })
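Conceptually, per-dimension min/max calibration maps each component onto the signed-byte range. A minimal numpy sketch of that idea (illustrative only, not vectlite's implementation):

```python
import numpy as np

def calibrate(vectors):
    """Per-dimension min/max ranges from a sample of the data."""
    return vectors.min(axis=0), vectors.max(axis=0)

def quantize(v, lo, hi):
    """Map each component onto [-128, 127] as a signed byte."""
    scale = (hi - lo) / 255.0
    return np.clip(np.round((v - lo) / scale) - 128, -128, 127).astype(np.int8)

def dequantize(q, lo, hi):
    """Approximate inverse; error is bounded by half a quantization step."""
    scale = (hi - lo) / 255.0
    return (q.astype(np.float32) + 128) * scale + lo

rng = np.random.default_rng(1)
vecs = rng.standard_normal((100, 8)).astype(np.float32)
lo, hi = calibrate(vecs)
q8 = quantize(vecs[0], lo, hi)
err = np.abs(dequantize(q8, lo, hi) - vecs[0]).max()
```

Each dimension gets its own scale, so one outlier dimension does not degrade the precision of all the others.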

Binary quantization

32x compression. Each component becomes a single bit (sign). Best for L2-normalized embeddings (most modern text encoders).

The 2-stage pipeline fetches rescore_multiplier × k candidates via fast Hamming-distance scan, then rescores with exact float32 cosine.

Python

db.enable_quantization("binary", rescore_multiplier=10)

Node.js

db.enableQuantization('binary', { rescoreMultiplier: 10 })

Tune rescore_multiplier if you see recall drop: higher values fetch more candidates from the quantized scan for exact float32 rescoring, improving recall at the cost of search speed.
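The sign-bit encoding and Hamming-distance scan can be sketched in numpy (a conceptual illustration under the assumption of L2-normalized vectors, not vectlite internals):

```python
import numpy as np

def to_bits(vectors):
    """Sign bit per component, packed 8 per byte (32x smaller than float32)."""
    return np.packbits(vectors > 0, axis=-1)

def hamming(packed_query, packed_db):
    """Bit-level Hamming distance via XOR + popcount."""
    xor = np.bitwise_xor(packed_db, packed_query)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def binary_search(query, vectors, k=10, rescore_multiplier=10):
    """Stage 1: Hamming scan; stage 2: exact rescore of the shortlist."""
    cand = np.argsort(hamming(to_bits(query), to_bits(vectors)))[: k * rescore_multiplier]
    # For L2-normalized vectors the dot product is the cosine similarity
    exact = vectors[cand] @ query
    return cand[np.argsort(-exact)[:k]]
```

With a large enough rescore_multiplier the candidate set covers the whole collection and results match brute-force exactly; shrinking it trades recall for speed.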

Product quantization (PQ)

Configurable compression via k-means sub-vector clustering. Trains codebooks on your data — call enable_quantization() after you have inserted a representative sample (e.g. after the first 10k records).

"pq" is an alias for "product", and method names are parsed case-insensitively. Omit num_sub_vectors and vectlite picks a dimension-compatible default; or call db.valid_num_sub_vectors() / db.validNumSubVectors() to list the legal values for your dimension before choosing one.

Python

# Dimension-aware default for num_sub_vectors
db.enable_quantization("pq")

# Or set it explicitly
db.enable_quantization(
    "product",
    num_sub_vectors=16,  # split the vector into 16 chunks
    num_centroids=256,   # 256 centroids per chunk (1 byte per chunk)
)

# Inspect legal values for the current dimension
print(db.valid_num_sub_vectors()) # e.g. [2, 4, 8, 16, 24, 32, 48, ...] for dim=384

Node.js

db.enableQuantization('pq')
db.enableQuantization('product', { numSubVectors: 16, numCentroids: 256 })
console.log(db.validNumSubVectors())

For a 768-dim float32 vector: 768×4 = 3072 bytes raw → 16 bytes quantized in the candidate index = 192x compression of the in-memory index entry. The .vdb keeps the original floats for rescoring.
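The mechanics behind that compression can be sketched in a few lines of numpy: split each vector into chunks, cluster each chunk position with k-means, and store one centroid id per chunk. This is an illustrative toy (small dimensions, naive k-means), not vectlite's trainer:

```python
import numpy as np

def train_pq(vectors, num_sub_vectors=4, num_centroids=16, iters=10, seed=0):
    """Train one k-means codebook per sub-vector chunk."""
    rng = np.random.default_rng(seed)
    chunks = np.split(vectors, num_sub_vectors, axis=1)
    books = []
    for c in chunks:
        cents = c[rng.choice(len(c), num_centroids, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid, then recompute means
            d = ((c[:, None, :] - cents[None]) ** 2).sum(-1)
            lab = d.argmin(1)
            for j in range(num_centroids):
                if (lab == j).any():
                    cents[j] = c[lab == j].mean(0)
        books.append(cents)
    return books

def encode(v, books):
    """One centroid id per chunk (a single byte when num_centroids <= 256)."""
    parts = np.split(v, len(books))
    return np.array([((b - p) ** 2).sum(1).argmin() for p, b in zip(parts, books)],
                    dtype=np.uint8)

rng = np.random.default_rng(2)
vecs = rng.standard_normal((500, 32)).astype(np.float32)
books = train_pq(vecs)
code = encode(vecs[0], books)  # 4 bytes instead of 32 * 4 = 128 raw
```

Because the codebooks are trained on your data, the quality of the compression depends on having a representative sample before enabling PQ, which is why the docs suggest waiting for the first batch of records.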

Inspecting state

is_quantized / isQuantized is a method (since 0.9.1) — call it with parentheses. quantization_method / quantizationMethod stays a property.

Python

print(db.is_quantized())       # True
print(db.quantization_method)  # "scalar", "binary", or "product"
db.disable_quantization()      # remove sidecar and revert to raw vectors

Node.js

console.log(db.isQuantized())
console.log(db.quantizationMethod)
db.disableQuantization()

Recommendation flow

  1. <100k records: don't bother, raw vectors are fine.
  2. 100k-1M records: scalar is the no-brainer default.
  3. 1M-10M records, normalized embeddings: try binary with rescore_multiplier=10. If recall is acceptable, ship it.
  4. >10M records or memory-bound: product with num_sub_vectors=16-32. Tune on your eval set.
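The decision flow above condenses into a tiny helper. This is hypothetical convenience code, not part of vectlite, and the fallback for non-normalized mid-size collections is an assumption (scalar, as the doc's safe default):

```python
def pick_method(num_records, normalized=False, memory_bound=False):
    """Map collection size to the quantization method suggested above."""
    if num_records < 100_000:
        return None  # raw vectors are fine
    if num_records < 1_000_000:
        return "scalar"
    if num_records <= 10_000_000 and not memory_bound:
        # binary suits normalized embeddings; pair it with rescore_multiplier=10
        return "binary" if normalized else "scalar"
    return "product"  # tune num_sub_vectors on your eval set

method = pick_method(5_000_000, normalized=True)  # "binary"
```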

Combining with payload indexes

Quantization and payload indexes are orthogonal — use both together. The filter narrows candidates via the payload index, then the quantized scan ranks them, then the float32 rescore picks the top-k.
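In pure-Python terms the three stages compose like this. A conceptual sketch, assuming the payload index has already produced a set of allowed ids, and using a sign-bit scan as a stand-in for the quantized stage:

```python
import numpy as np

def filtered_search(query, vectors, allowed_ids, k=10, multiplier=4):
    """Payload filter -> quantized rank -> exact float32 rescore."""
    ids = np.fromiter(allowed_ids, dtype=np.int64)
    subset = vectors[ids]
    # Stage 2: cheap approximate ranking of only the filtered candidates
    approx = np.sign(subset) @ np.sign(query)
    cand = ids[np.argsort(-approx)[: k * multiplier]]
    # Stage 3: exact rescore of the shortlist picks the final top-k
    exact = vectors[cand] @ query
    return cand[np.argsort(-exact)[:k]]

rng = np.random.default_rng(4)
vecs = rng.standard_normal((200, 16)).astype(np.float32)
qv = rng.standard_normal(16).astype(np.float32)
hits = filtered_search(qv, vecs, allowed_ids=range(0, 200, 2), k=5)
```

Filtering first means the quantized scan only touches rows that can actually appear in the result, so a selective filter makes the whole query cheaper, not more expensive.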