
Zero-Copy Bulk Ingest

Added in vectlite 0.12.0.

For large ingestion jobs, the dict-based bulk_ingest(list_of_dicts) / bulkIngest(records) path is no longer the fastest option. It still works (and is preserved for compatibility), but every record pays for Python → Rust object marshalling in the Python binding, or a JSON serialise / deserialise round-trip in the Node binding.

0.12.0 adds bulk_ingest_array (Python) and bulkIngestArray (Node), which take the vector data as a single contiguous native array (NumPy float32 for Python, Float32Array for Node) and hand it to the Rust core by reference. This path is 10–30× faster than the dict path on large batches.
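If your existing pipeline already produces dict-style records, the inputs for the array path can be derived from them in one pass. The sketch below is a minimal illustration using only NumPy; the record keys "id", "vector" and "metadata" and the helper name records_to_arrays are assumptions made for the example, not part of the vectlite API.

```python
import numpy as np

def records_to_arrays(records, dim):
    """Split dict-style records into parallel ids / vectors / metadata.

    The "id", "vector" and "metadata" keys are hypothetical; adjust them
    to whatever your existing records actually use.
    """
    ids = [r["id"] for r in records]
    metadata = [r.get("metadata", {}) for r in records]
    # One contiguous float32 block instead of N small Python lists.
    vectors = np.asarray([r["vector"] for r in records], dtype=np.float32)
    if vectors.shape != (len(records), dim):
        raise ValueError(f"expected shape ({len(records)}, {dim}), got {vectors.shape}")
    return ids, vectors, metadata

records = [
    {"id": "doc-0", "vector": [0.0, 1.0, 2.0], "metadata": {"lang": "en"}},
    {"id": "doc-1", "vector": [3.0, 4.0, 5.0]},
]
ids, vectors, metadata = records_to_arrays(records, dim=3)
```

The three return values line up with the ids, vectors and metadata arguments described in the sections that follow.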

Python (NumPy)

import numpy as np
import vectlite

N, D = 50_000, 384

ids = [f"doc-{i}" for i in range(N)]
# embeddings: your precomputed vectors (N rows of D floats), defined elsewhere
vectors = np.asarray(embeddings, dtype=np.float32) # shape (N, D), C-contiguous
metadata = [{"source": "blog", "lang": "en"} for _ in range(N)]

with vectlite.open("knowledge.vdb", dimension=D) as db:
    db.bulk_ingest_array(ids, vectors, metadata)

Things to know:

  • vectors must be dtype=np.float32 and shape (N, D) where D == db.dimension. Pass np.ascontiguousarray(...) if your array came from a slice that isn't C-contiguous.
  • metadata is optional. Skip it if you only need ids + vectors.
  • The GIL is released during the Rust call (since 0.12.0 on all write paths), so other Python threads make progress.
  • numpy >= 1.23 is now a runtime dependency of the Python package — no extra install step.
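The dtype, shape and contiguity requirements above can be checked before the call. This is a minimal sketch using only NumPy; prepare_vectors is a hypothetical helper name, not something vectlite provides.

```python
import numpy as np

def prepare_vectors(arr, dim):
    """Return a float32, C-contiguous array of shape (N, dim).

    np.ascontiguousarray copies only when the dtype or memory layout
    actually needs to change.
    """
    out = np.ascontiguousarray(arr, dtype=np.float32)
    if out.ndim != 2 or out.shape[1] != dim:
        raise ValueError(f"expected shape (N, {dim}), got {out.shape}")
    return out

# A column slice of a wider float64 matrix: wrong dtype and not C-contiguous.
wide = np.arange(24, dtype=np.float64).reshape(4, 6)
sliced = wide[:, :3]
vectors = prepare_vectors(sliced, dim=3)
```

Running the check up front turns a layout problem into a clear Python-side error message instead of a failure inside the ingest call.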

All HNSW knobs available on bulk_ingest (m, ef_construction, ef_search, parallel_insert_threshold, tombstone_rebuild_pct, segment_size_threshold) are accepted as kwargs.

Node (Float32Array)

const { open } = require('vectlite')

const N = 50_000
const D = 384

const ids = []
const flat = new Float32Array(N * D) // row-major

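// embeddings: your precomputed vectors, one array of D numbers per record
for (let i = 0; i < N; i++) {
  ids.push(`doc-${i}`)
  flat.set(embeddings[i], i * D) // copy record i into its D-slot window
}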

const db = open('knowledge.vdb', { dimension: D })
db.bulkIngestArray(ids, flat, D, {
  metadata: ids.map(() => ({ source: 'blog', lang: 'en' })),
})

Things to know:

  • flat must be a Float32Array of length N * D, row-major (one record per D slots).
  • napi-rs gives the Rust side a direct reference into the underlying ArrayBuffer — there's no JSON.stringify of the vector data.
  • The third argument is the vector dimension, not the record count.
  • options.metadata is an array parallel to ids; pass undefined to skip.
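The row-major layout expected for flat is the same one NumPy calls C order, so the offset arithmetic can be sketched in NumPy terms. This is only an illustration of the layout rule; the record helper is hypothetical.

```python
import numpy as np

N, D = 4, 3
matrix = np.arange(N * D, dtype=np.float32).reshape(N, D)  # shape (N, D)
flat = matrix.ravel()  # row-major (C order), length N * D

def record(flat, i, dim):
    """Return record i from a row-major flat buffer."""
    return flat[i * dim:(i + 1) * dim]

third = record(flat, 2, D)  # the same values as matrix[2]
```

Record i always occupies slots i * D through (i + 1) * D - 1, which is why the Node loop above writes each embedding at offset i * D.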

All HNSW knobs available on bulkIngest (m, efConstruction, efSearch, parallelInsertThreshold, tombstoneRebuildPct, segmentSizeThreshold) are accepted in options.

When NOT to use it

  • Small batches (a few hundred records). The marshalling cost is negligible there; use insert / upsert directly. Since 0.12.0, single-record db.insert() and db.upsert() route to the incremental WAL path in both bindings, so there is no full HNSW rebuild per insert. That change fixed a long-standing 150 vec/s ceiling.
  • Records without a precomputed vector, e.g. when you're using upsert_text with a built-in embedder. The array path is for vectors that already exist as floats.
  • Heterogeneous vector lengths. All rows in the same call must match db.dimension.
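For the last case, it can help to find the offending rows before building the array at all. A minimal sketch in plain Python; mismatched_rows is a hypothetical helper, not part of vectlite.

```python
def mismatched_rows(rows, dim):
    """Return (index, length) for every row whose length differs from dim."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != dim]

rows = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8], [0.9]]
bad = mismatched_rows(rows, dim=3)
if bad:
    print(f"{len(bad)} rows do not match dimension 3: {bad}")
```

Rows reported here need to be re-embedded or dropped before the batch goes through the array path.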

HNSW segments

0.12.0 also segments the HNSW index LSM-tree style: new inserts land in the active segment until it hits segment_size_threshold (default 50_000), then a new segment is started. This caps per-insert HNSW cost at O(log segment_size) instead of O(log total) — streaming throughput stays flat as the corpus grows.
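As a rough back-of-the-envelope model of the rule above, the segment count grows with the corpus while each segment stays bounded. The sketch assumes that a new segment is opened only when the next vector arrives after the active one is full, and that no deletes, rebuilds or merges have changed the layout; neither assumption is confirmed here, so treat the result as an estimate rather than what vectlite will report.

```python
import math

def estimated_segments(total_vectors, segment_size_threshold=50_000):
    """Estimate the segment count for a pure-insert workload.

    Model only: it ignores deletes, rebuilds and any merging, and says
    nothing about what an empty index reports.
    """
    return math.ceil(total_vectors / segment_size_threshold)

for total in (10_000, 50_000, 50_001, 1_000_000):
    print(total, estimated_segments(total))
```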

Inspect the segment count:

Python:

print(db.ann_segment_count())              # default namespace
print(db.ann_segment_count(vector_name="colbert"))

Node:

console.log(db.annSegmentCount())
console.log(db.annSegmentCount(undefined, 'colbert'))

The manifest format has been bumped to ANN3. Old ANN1 / ANN2 databases still load: they report as 1-segment indexes and re-segment naturally as new inserts land.

See also