Text Analyzers

The vectlite.analyzers module provides a configurable text processing pipeline for generating sparse term vectors. This is useful for fine-tuning BM25 keyword search behavior.

info

Analyzers are currently available in the Python binding only.

Basic Usage

from vectlite.analyzers import Analyzer

analyzer = Analyzer().lowercase().stopwords("en").stemmer("english")
terms = analyzer.sparse_terms("How to authenticate users with SSO")
# {'authent': 0.333, 'user': 0.333, 'sso': 0.333}
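The weights in the output above appear to be normalized term frequencies: three terms survive the pipeline, so each gets a weight of 1/3. A minimal pure-Python sketch of that weighting scheme follows; this is an illustration only, not vectlite's implementation, and the library's actual formula may differ (for example, raw counts or BM25-style components).

```python
from collections import Counter

def sparse_terms(tokens):
    """Map a token list to {term: normalized frequency} weights.

    Illustrative sketch only: assumes weights are term frequencies
    divided by the total token count.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

print(sparse_terms(["authent", "user", "sso"]))
# each of the three terms gets weight 1/3 ≈ 0.333
```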

Pipeline Steps

The analyzer applies steps in the order they are added:
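Conceptually, an analyzer like this is a chain of token-list transforms run in registration order, with each step returning `self` to allow method chaining. A hedged sketch of that design (class and method names here are illustrative, not vectlite's internals):

```python
import re

class Pipeline:
    """Minimal sketch of an ordered token pipeline (illustrative only)."""

    def __init__(self):
        # Default tokenizer: split text into alphanumeric runs.
        self.tokenize = lambda text: re.findall(r"[A-Za-z0-9]+", text)
        self.steps = []

    def add(self, step):
        self.steps.append(step)  # steps run in the order they are added
        return self              # returning self enables method chaining

    def run(self, text):
        tokens = self.tokenize(text)
        for step in self.steps:
            tokens = step(tokens)
        return tokens

p = (Pipeline()
     .add(lambda ts: [t.lower() for t in ts])                       # lowercase
     .add(lambda ts: [t for t in ts if t not in {"how", "to", "with"}]))  # stopwords
print(p.run("How to authenticate users with SSO"))
# ['authenticate', 'users', 'sso']
```

Because each step consumes the previous step's output, reordering calls (say, stemming before stopword removal) can change the result.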

Tokenizer

Replace the default tokenizer (alphanumeric word splitting):

analyzer = Analyzer().tokenizer(lambda text: text.split("-"))
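The default "alphanumeric word splitting" can be approximated with a regex. The sketch below shows what that behavior likely resembles; the exact pattern vectlite uses is not documented here and may differ (e.g. Unicode word characters).

```python
import re

def default_tokenize(text):
    # Split on anything that is not an ASCII letter or digit.
    return re.findall(r"[A-Za-z0-9]+", text)

print(default_tokenize("user-friendly API, v2"))
# ['user', 'friendly', 'API', 'v2']
```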

Lowercase

Convert all tokens to lowercase:

analyzer = Analyzer().lowercase()

Stopwords

Remove common words. Built-in lists for English and French:

analyzer = Analyzer().stopwords("en")                       # English stopwords
analyzer = Analyzer().stopwords("fr")                       # French stopwords
analyzer = Analyzer().stopwords({"my", "custom", "words"})  # Custom set

Stemmer

Reduce words to their root form using Snowball stemming. Requires the PyStemmer package:

pip install PyStemmer

analyzer = Analyzer().stemmer("english")

N-grams

Generate character n-grams from tokens:

analyzer = Analyzer().ngrams(3)
# "hello" -> ["hel", "ell", "llo"]
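Character n-grams are often used for substring and typo-tolerant matching. The sliding-window logic behind the example above can be sketched in a few lines (an illustration, not vectlite's code):

```python
def char_ngrams(token, n):
    # Slide a window of width n across the token, one character at a time.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("hello", 3))
# ['hel', 'ell', 'llo']
```

Note that tokens shorter than n produce no n-grams under this scheme; whether vectlite emits the short token itself in that case is not specified here.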

Custom Filters

Add any function that transforms a token list:

def remove_short(tokens):
    return [t for t in tokens if len(t) > 2]

analyzer = Analyzer().filter(remove_short)

Weighted Fields

Generate sparse vectors from multiple text fields with different weights:

analyzer = Analyzer().lowercase().stopwords("en")

terms = analyzer.sparse_terms_weighted(
    fields={"title": "Auth Setup Guide", "body": "How to configure SSO for your organization"},
    weights={"title": 2.0, "body": 1.0},
)
</terms>
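One plausible way to combine fields is to compute each field's normalized term weights, scale them by the field weight, and sum weights for terms that appear in more than one field. The sketch below works under that assumption; vectlite's actual combination formula is not documented here. The `analyze` parameter is a hypothetical stand-in for the analyzer's text-to-tokens step.

```python
from collections import Counter

def sparse_terms_weighted(fields, weights, analyze):
    """Merge per-field term weights, scaling each field by its weight.

    Assumption: per-field weights are normalized term frequencies,
    and terms shared across fields sum their contributions.
    """
    merged = {}
    for name, text in fields.items():
        counts = Counter(analyze(text))
        total = sum(counts.values())
        for term, count in counts.items():
            merged[term] = merged.get(term, 0.0) + weights[name] * count / total
    return merged

terms = sparse_terms_weighted(
    fields={"title": "auth setup", "body": "auth guide"},
    weights={"title": 2.0, "body": 1.0},
    analyze=str.split,
)
# "auth" appears in both fields, so its weight is 2.0*0.5 + 1.0*0.5 = 1.5
```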

Using with Search

Pass analyzer-generated terms to the search API. Use the same analyzer configuration for indexing and querying so that document and query terms match:

analyzer = Analyzer().lowercase().stopwords("en").stemmer("english")

# Index
terms = analyzer.sparse_terms("How to configure SSO authentication")
db.upsert("doc1", embedding, {"text": "..."}, sparse=terms)

# Search
query_terms = analyzer.sparse_terms("SSO setup guide")
results = db.search(query_embedding, sparse=query_terms, k=10)