# Vector Quantization
Quantization trades a small amount of recall for large reductions in memory and faster search. All three methods in vectlite use a 2-stage pipeline: a fast quantized scan picks candidates, then exact float32 distance computation rescores the top hits.
Quantization parameters persist in a `.vdb.quant` sidecar file and reload automatically on open. The quantized index rebuilds automatically on inserts, upserts, and bulk ingestion.
The compression numbers describe the in-memory candidate index used during search. The .vdb file on disk still keeps the original float32 vectors so the exact-rescoring stage can do its job. If you want to reduce disk footprint, quantization is the wrong tool — look at column-level compaction or external storage.
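As a rough illustration of the 2-stage flow (a sketch, not vectlite internals; `two_stage_search` and `approx_scores` are made-up names for the example):

```python
import numpy as np

def two_stage_search(query, approx_scores, raw_vectors, k, rescore_multiplier=10):
    """Sketch of the 2-stage pipeline: quantized scan, then exact rescore.

    approx_scores are the fast, lossy distances from the quantized index
    (lower = closer); raw_vectors are the original float32 rows.
    """
    # Stage 1: keep rescore_multiplier * k candidates from the fast scan.
    n = min(len(approx_scores), rescore_multiplier * k)
    candidates = np.argpartition(approx_scores, n - 1)[:n]

    # Stage 2: exact float32 distances on the candidates only.
    exact = np.linalg.norm(raw_vectors[candidates] - query, axis=1)
    best = np.argsort(exact)[:k]
    return candidates[best]
```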
## Methods at a glance
| Method | Memory | Recall | Best for |
|---|---|---|---|
| `scalar` (int8) | ~4x reduction | ~99% | The safe default; minimal recall loss. |
| `binary` | ~32x reduction | 90-98% (depends on dataset) | Normalized embeddings, very large collections. |
| `product` | configurable (typ. 8-64x) | 85-97% (depends on params) | Massive datasets where memory dominates. |
## Scalar quantization (int8)
4x compression. Calibrates per-dimension min/max ranges and stores each component as a signed byte.
**Python**

```python
db.enable_quantization("scalar")

# Search is transparently faster; recall is essentially unchanged
results = db.search(query, k=10)
```
**Node.js**

```js
db.enableQuantization('scalar')
const results = db.search(query, { k: 10 })
```
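Under the hood, scalar quantization is essentially one affine map per dimension. A minimal numpy sketch of the idea (not vectlite's actual implementation):

```python
import numpy as np

def calibrate(vectors):
    """Per-dimension min/max ranges from the stored vectors."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0          # one int8 step per dimension
    scale[scale == 0] = 1.0            # guard constant dimensions
    return lo, scale

def encode_int8(vectors, lo, scale):
    """Map each float32 component to a signed byte in [-128, 127]."""
    q = np.round((vectors - lo) / scale) - 128
    return np.clip(q, -128, 127).astype(np.int8)

def decode(q, lo, scale):
    """Approximate reconstruction for the fast candidate scan."""
    return (q.astype(np.float32) + 128) * scale + lo

vecs = np.random.rand(1000, 384).astype(np.float32)
lo, scale = calibrate(vecs)
codes = encode_int8(vecs, lo, scale)
print(codes.nbytes / vecs.nbytes)      # 0.25 -> the ~4x reduction
```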
## Binary quantization
32x compression. Each component becomes a single bit (sign). Best for L2-normalized embeddings (most modern text encoders).
The 2-stage pipeline fetches `rescore_multiplier` × k candidates via a fast Hamming-distance scan, then rescores them with exact float32 cosine.
**Python**

```python
db.enable_quantization("binary", rescore_multiplier=10)
```
**Node.js**

```js
db.enableQuantization('binary', { rescoreMultiplier: 10 })
```
If recall drops, raise `rescore_multiplier`: more candidates survive the quantized scan into the exact float32 rescore, improving recall at the cost of search speed.
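To make the binary pipeline concrete, here is a minimal numpy sketch of sign-bit packing, the Hamming scan, and the exact cosine rescore (illustrative only, not vectlite internals):

```python
import numpy as np

def to_bits(v):
    """Pack sign bits: 1 bit per component, 32x smaller than float32."""
    return np.packbits(v > 0, axis=-1)

def binary_search(query, vectors, bits, k, rescore_multiplier=10):
    # Stage 1: Hamming distance between packed sign bits (fast scan).
    q_bits = to_bits(query)
    hamming = np.unpackbits(bits ^ q_bits, axis=-1).sum(axis=-1)
    n = min(len(vectors), rescore_multiplier * k)
    cand = np.argpartition(hamming, n - 1)[:n]

    # Stage 2: exact float32 cosine on the candidates only.
    sims = vectors[cand] @ query / (
        np.linalg.norm(vectors[cand], axis=1) * np.linalg.norm(query))
    return cand[np.argsort(-sims)[:k]]

vecs = np.random.randn(10000, 384).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # L2-normalize
bits = to_bits(vecs)
top = binary_search(vecs[0], vecs, bits, k=10)
```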
## Product quantization (PQ)
Configurable compression via k-means sub-vector clustering. PQ trains its codebooks on your data, so call `enable_quantization()` only after you have inserted a representative sample (e.g. after the first 10k records).
"pq" is an alias for "product", and method names are parsed case-insensitively. Omit num_sub_vectors and vectlite picks a dimension-compatible default; or call db.valid_num_sub_vectors() / db.validNumSubVectors() to list the legal values for your dimension before choosing one.
**Python**

```python
# Dimension-aware default for num_sub_vectors
db.enable_quantization("pq")

# Or set it explicitly
db.enable_quantization(
    "product",
    num_sub_vectors=16,   # split the vector into 16 chunks
    num_centroids=256,    # 256 centroids per chunk (1 byte per chunk)
)

# Inspect legal values for the current dimension
print(db.valid_num_sub_vectors())  # e.g. [2, 4, 8, 16, 24, 32, 48, ...] for dim=384
```
**Node.js**

```js
db.enableQuantization('pq')
db.enableQuantization('product', { numSubVectors: 16, numCentroids: 256 })
console.log(db.validNumSubVectors())
```
For a 768-dim float32 vector with `num_sub_vectors=16`: 768 × 4 = 3072 bytes raw → 16 one-byte codes in the candidate index, a 192x compression of the in-memory index entry. The `.vdb` file keeps the original floats for rescoring.
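To see where the 16 bytes come from, here is a compact PQ-encoding sketch using scikit-learn's KMeans on the same 768-dim / 16-chunk setup (a sketch of the technique, not vectlite's trainer):

```python
import numpy as np
from sklearn.cluster import KMeans

dim, num_sub_vectors, num_centroids = 768, 16, 256
sub_dim = dim // num_sub_vectors                      # 48 dims per chunk

train = np.random.randn(10000, dim).astype(np.float32)
chunks = train.reshape(-1, num_sub_vectors, sub_dim)  # (n, 16, 48)

# One codebook per chunk position; 256 centroids means each code fits in a byte.
codebooks = [
    KMeans(n_clusters=num_centroids, n_init=1).fit(chunks[:, i, :])
    for i in range(num_sub_vectors)
]
codes = np.stack(
    [cb.predict(chunks[:, i, :]) for i, cb in enumerate(codebooks)],
    axis=1,
).astype(np.uint8)                                    # (n, 16): 16 bytes/vector

print(train[0].nbytes, "->", codes[0].nbytes)         # 3072 -> 16 bytes (192x)
```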
## Inspecting state
`is_quantized` / `isQuantized` is a method (since 0.9.1), so call it with parentheses. `quantization_method` / `quantizationMethod` stays a property.
**Python**

```python
print(db.is_quantized())       # True
print(db.quantization_method)  # "scalar", "binary", or "product"

db.disable_quantization()      # remove sidecar and revert to raw vectors
```
**Node.js**

```js
console.log(db.isQuantized())
console.log(db.quantizationMethod)
db.disableQuantization()
```
## Recommendation flow
- <100k records: don't bother, raw vectors are fine.
- 100k-1M records: `scalar` is the no-brainer default.
- 1M-10M records, normalized embeddings: try `binary` with `rescore_multiplier=10`. If recall is acceptable, ship it.
- >10M records or memory-bound: `product` with `num_sub_vectors=16-32`. Tune on your eval set (see the helper sketch below).
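If you prefer this flow as code, a tiny helper captures it; `choose_quantization` is a hypothetical name for illustration, not part of the vectlite API:

```python
def choose_quantization(n_records: int, normalized: bool) -> str | None:
    """Encode the recommendation flow above. Hypothetical helper."""
    if n_records < 100_000:
        return None                      # raw vectors are fine
    if n_records < 1_000_000:
        return "scalar"
    if n_records < 10_000_000 and normalized:
        return "binary"                  # with rescore_multiplier=10
    return "product"                     # tune num_sub_vectors on your eval set
```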
## Combining with payload indexes
Quantization and payload indexes are orthogonal — use both together. The filter narrows candidates via the payload index, then the quantized scan ranks them, then the float32 rescore picks the top-k.
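Conceptually, the combined flow looks like this numpy sketch (binary quantization assumed for stage 1; none of these names are vectlite API):

```python
import numpy as np

def filtered_search(query, vectors, bits, allowed_ids, k, rescore_multiplier=10):
    """Payload filter -> quantized scan -> exact rescore (conceptual sketch)."""
    # Stage 0: the payload index has already narrowed the rows to allowed_ids.
    ids = np.asarray(allowed_ids)

    # Stage 1: quantized (Hamming) scan ranks only the filtered rows.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(bits[ids] ^ q_bits, axis=-1).sum(axis=-1)
    n = min(len(ids), rescore_multiplier * k)
    cand = ids[np.argpartition(hamming, n - 1)[:n]]

    # Stage 2: exact float32 rescore picks the final top-k
    # (dot product == cosine if the vectors are L2-normalized).
    exact = vectors[cand] @ query
    return cand[np.argsort(-exact)[:k]]
```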