# text

Text processing utilities: normalization, truncation, chunking, and token estimation, shared across all products that manipulate text before or after LLM inference.

```python
from circuitforge_core.text import normalize, chunk, truncate, estimate_tokens
```

## `normalize(text: str) -> str`

Strips excess whitespace, normalizes unicode (NFC), and removes null bytes and control characters that can cause downstream issues with SQLite FTS5 or LLM tokenizers.

```python
from circuitforge_core.text import normalize

clean = normalize(" Hello\u00a0world\x00 ")
# → "Hello world"
```

## `truncate(text: str, max_tokens: int, model: str = "default") -> str`

Truncates text to approximately `max_tokens` tokens, breaking at sentence or paragraph boundaries where possible. Uses a simple byte-based heuristic (1 token ≈ 4 bytes) unless a specific model tokenizer is requested.

```python
excerpt = truncate(long_doc, max_tokens=2048)
```

## `chunk(text: str, chunk_size: int, overlap: int = 0) -> list[str]`

Splits text into chunks for RAG (retrieval-augmented generation) pipelines, with consecutive chunks overlapping when `overlap` is greater than zero. Respects paragraph boundaries.

```python
chunks = chunk(article_text, chunk_size=512, overlap=64)
```

## `estimate_tokens(text: str, model: str = "default") -> int`

Estimates token count without loading a full tokenizer. The estimate is within ~10% of the true count, which is accurate enough for context window budget planning.

## FTS5 helpers

SQLite FTS5 has quirks with special characters in MATCH expressions. The `text` module provides helpers used by the recipe engine and other FTS5 consumers:

```python
from circuitforge_core.text import fts_quote, strip_apostrophes

# Always double-quote FTS5 terms; bare tokens break on brand names
query = " ".join(fts_quote(term) for term in tokens)
# → '"chicken" "breast" "lemon"'

# Strip apostrophes before FTS5 queries
clean = strip_apostrophes("O'Doul's")
# → "ODouls"
```

!!! warning "FTS5 gotcha"
    Always quote ALL terms in MATCH expressions. Bare tokens break on brand names (e.g., `O'Doul's`), plant-based ingredient names, and anything with punctuation.
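
To make the quoting rule above concrete, here is a minimal, standalone sketch of how the two steps might be combined. The functions `strip_apostrophes_sketch`, `fts_quote_sketch`, and `build_match_query` are hypothetical stand-ins written for illustration. They are not the actual `circuitforge_core.text` implementations, whose behavior may differ (for example, in which apostrophe characters are handled). The only FTS5 rule assumed here is that a double-quoted string escapes an embedded double quote by doubling it.

```python
def strip_apostrophes_sketch(term: str) -> str:
    """Hypothetical helper: remove straight and curly apostrophes from a term."""
    return term.replace("'", "").replace("\u2019", "")


def fts_quote_sketch(term: str) -> str:
    """Hypothetical helper: wrap a term in double quotes for FTS5.

    Embedded double quotes are doubled, which is how FTS5 escapes them
    inside a quoted string.
    """
    return '"' + term.replace('"', '""') + '"'


def build_match_query(tokens: list[str]) -> str:
    """Clean and quote every token, then join with spaces (implicit AND in FTS5)."""
    cleaned = (strip_apostrophes_sketch(t) for t in tokens)
    # Skip tokens that are empty after cleaning so no empty phrase is emitted.
    return " ".join(fts_quote_sketch(t) for t in cleaned if t)


query = build_match_query(["O'Doul's", "non-alcoholic", "beer"])
print(query)  # → "ODouls" "non-alcoholic" "beer"
```

The resulting string is meant to be bound as a parameter to a `MATCH ?` clause rather than interpolated into the SQL text, so the SQL layer never has to parse the user's terms.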