circuitforge-core/docs/modules/text.md
pyr0ball 383897f990
2026-04-24 15:23:16 -07:00


text

Text processing utilities: normalization, truncation, chunking, and token estimation, shared across all products that manipulate text before or after LLM inference.

from circuitforge_core.text import normalize, chunk, truncate, estimate_tokens

normalize(text: str) -> str

Strips excess whitespace, normalizes unicode (NFC), and removes null bytes and control characters that can cause downstream issues with SQLite FTS5 or LLM tokenizers.

from circuitforge_core.text import normalize

clean = normalize("  Hello\u00a0world\x00  ")
# → "Hello world"
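
To make the three steps concrete, here is a minimal sketch of what such a normalizer could look like using only the standard library. The name `normalize_sketch` is hypothetical and this is not the actual `circuitforge_core` implementation; in particular, collapsing every whitespace run (including U+00A0) to a single space is an assumption inferred from the example above.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize(); not the library's real code."""
    # 1. Canonical composition (NFC), e.g. "e" + combining acute -> "é".
    text = unicodedata.normalize("NFC", text)
    # 2. Drop null bytes and other control characters (Unicode category Cc),
    #    keeping newline and tab so step 3 can treat them as whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # 3. Collapse whitespace runs to one space; for str patterns, \s also
    #    matches the non-breaking space U+00A0.
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the input from the example above, this sketch also yields `"Hello world"`.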

truncate(text: str, max_tokens: int, model: str = "default") -> str

Truncates text to approximately max_tokens tokens, breaking at sentence or paragraph boundaries where possible. Uses a simple byte-based heuristic (1 token ≈ 4 bytes) unless a specific tokenizer is requested through the model argument.

excerpt = truncate(long_doc, max_tokens=2048)
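
The byte-budget idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `truncate_sketch` and its `bytes_per_token` parameter are hypothetical, and only sentence endings (`.`, `!`, `?` followed by whitespace) are treated as preferred break points.

```python
import re


def truncate_sketch(text: str, max_tokens: int, bytes_per_token: int = 4) -> str:
    """Hypothetical stand-in for truncate(); uses the 1 token ≈ 4 bytes heuristic."""
    budget = max_tokens * bytes_per_token
    encoded = text.encode("utf-8")
    if len(encoded) <= budget:
        return text  # already within budget
    # Cut at the byte budget; errors="ignore" drops a trailing partial character.
    cut = encoded[:budget].decode("utf-8", errors="ignore")
    # Prefer the last sentence end that is followed by whitespace inside the cut.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s)", cut)]
    if ends:
        return cut[: ends[-1]]
    # No sentence boundary found: fall back to the hard cut.
    return cut.rstrip()
```

With a budget of 5 tokens (20 bytes), `"First sentence. Second sentence is longer."` is cut back to `"First sentence."` rather than ending mid-word.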

chunk(text: str, chunk_size: int, overlap: int = 0) -> list[str]

Splits text into chunks for RAG (retrieval-augmented generation) pipelines, with optional overlap between consecutive chunks (overlap defaults to 0, meaning no overlap). Respects paragraph boundaries.

chunks = chunk(article_text, chunk_size=512, overlap=64)
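
As a rough illustration of paragraph-aware chunking with overlap, the sketch below greedily packs whole paragraphs into each chunk and seeds the next chunk with the tail of the previous one. It rests on assumptions the text above does not confirm: `chunk_sketch` is a hypothetical name, sizes are measured in characters, paragraphs are separated by blank lines, and a paragraph longer than `chunk_size` is kept whole rather than split.

```python
def chunk_sketch(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Hypothetical stand-in for chunk(); sizes are in characters (assumption)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            # Seed the next chunk with the last `overlap` characters.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that in this sketch the overlap tail is added on top of the paragraph text, so a chunk can exceed `chunk_size` by up to `overlap` characters.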

estimate_tokens(text: str, model: str = "default") -> int

Estimates token count without loading a full tokenizer. Accurate enough for context window budget planning (within ~10%).
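
A tokenizer-free estimate can be as simple as the sketch below. It assumes the same 1 token ≈ 4 bytes heuristic described for truncate; whether estimate_tokens uses exactly this rule is not stated here, and `estimate_tokens_sketch` and `bytes_per_token` are hypothetical names.

```python
import math


def estimate_tokens_sketch(text: str, bytes_per_token: int = 4) -> int:
    """Hypothetical stand-in for estimate_tokens(); pure byte-count heuristic."""
    # Count UTF-8 bytes rather than characters so multi-byte text is not
    # underestimated, then round up so short non-empty text costs one token.
    n_bytes = len(text.encode("utf-8"))
    return math.ceil(n_bytes / bytes_per_token)
```

A typical use is a budget check before calling a model, for example comparing the estimate for a prompt plus the tokens reserved for the reply against the context window size.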

FTS5 helpers

SQLite FTS5 has quirks with special characters in MATCH expressions. The text module provides helpers used by the recipe engine and other FTS5 consumers:

from circuitforge_core.text import fts_quote, strip_apostrophes

# Always double-quote FTS5 terms: bare tokens break on brand names
tokens = ["chicken", "breast", "lemon"]
query = " ".join(fts_quote(term) for term in tokens)
# → '"chicken" "breast" "lemon"'

# Strip apostrophes before FTS5 queries
clean = strip_apostrophes("O'Doul's")
# → "ODouls"

!!! warning "FTS5 gotcha"
    Always quote ALL terms in MATCH expressions. Bare tokens break on brand names (e.g., O'Doul's), plant-based ingredient names, and anything with punctuation.
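
For readers who want to see what the quoting involves, here is a small sketch of both helpers and how they might be combined into a MATCH string. The `_sketch` functions and `build_match_query` are hypothetical and are not the module's actual code. The one rule taken from SQLite itself is that an FTS5 string is wrapped in double quotes and any embedded double quote is escaped by doubling it.

```python
def fts_quote_sketch(term: str) -> str:
    """Hypothetical stand-in for fts_quote(): wrap a term as an FTS5 string."""
    # Embedded double quotes are escaped SQL-style by doubling them.
    return '"' + term.replace('"', '""') + '"'


def strip_apostrophes_sketch(text: str) -> str:
    """Hypothetical stand-in for strip_apostrophes(): drop straight apostrophes."""
    return text.replace("'", "")


def build_match_query(tokens: list[str]) -> str:
    """Quote every token so punctuation is never parsed as FTS5 syntax."""
    return " ".join(
        fts_quote_sketch(strip_apostrophes_sketch(tok)) for tok in tokens
    )
```

The resulting string is meant to be passed as a bound parameter (for example `... WHERE recipes MATCH ?`, with `recipes` as a placeholder table name) rather than interpolated into the SQL text.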