# text

Text processing utilities. Normalization, truncation, chunking, and token estimation, shared across all products that manipulate text before or after LLM inference.

```python
from circuitforge_core.text import normalize, chunk, truncate, estimate_tokens
```

## `normalize(text: str) -> str`

Strips excess whitespace, normalizes unicode (NFC), and removes null bytes and control characters that can cause downstream issues with SQLite FTS5 or LLM tokenizers.

```python
from circuitforge_core.text import normalize

clean = normalize(" Hello\u00a0world\x00 ")
# → "Hello world"
```

## `truncate(text: str, max_tokens: int, model: str = "default") -> str`

Truncates text to approximately `max_tokens` tokens, breaking at sentence or paragraph boundaries where possible. Uses a simple byte-based heuristic (1 token ≈ 4 bytes) unless a specific model tokenizer is requested.

```python
excerpt = truncate(long_doc, max_tokens=2048)
```

## `chunk(text: str, chunk_size: int, overlap: int = 0) -> list[str]`

Splits text into overlapping chunks for RAG (retrieval-augmented generation) pipelines. Respects paragraph boundaries.

```python
chunks = chunk(article_text, chunk_size=512, overlap=64)
```

## `estimate_tokens(text: str, model: str = "default") -> int`

Estimates token count without loading a full tokenizer. Accurate enough for context window budget planning (within ~10%).

## FTS5 helpers

SQLite FTS5 has quirks with special characters in MATCH expressions. The `text` module provides helpers used by the recipe engine and other FTS5 consumers:

```python
from circuitforge_core.text import fts_quote, strip_apostrophes

# Always double-quote FTS5 terms: bare tokens break on brand names
query = " ".join(fts_quote(term) for term in tokens)
# → '"chicken" "breast" "lemon"'

# Strip apostrophes before FTS5 queries
clean = strip_apostrophes("O'Doul's")
# → "ODouls"
```

!!! warning "FTS5 gotcha"
    Always quote ALL terms in MATCH expressions. Bare tokens break on brand names (e.g., `O'Doul's`), plant-based ingredient names, and anything with punctuation.

---

## LLM inference service

`circuitforge_core.text.app` is a self-contained FastAPI inference server. It exposes a local LLM (or PII classifier) over HTTP so that products can call it via `CF_TEXT_URL` without bundling heavy ML dependencies themselves.

### What are you running?

Three independent paths; pick one before installing:

| Path | Use case | Extra |
|---|---|---|
| **LLM inference** | Chat, completion, summarisation using a GGUF or HuggingFace model | `text-llamacpp` or `text-transformers` |
| **VLM inference** | Vision-language model that accepts images alongside text | `text-llamacpp` (GGUF with `--mmproj`) or `text-transformers` |
| **Classifier / PII filter** | NER-based PII detection and redaction | `text-transformers` |

---

### LLM inference (GGUF via llama.cpp)

```bash
pip install "circuitforge-core[text-llamacpp]"
```

```bash
python -m circuitforge_core.text.app \
    --model /path/to/model.gguf \
    --port 8006 \
    --gpu-id 0
```

4-bit quantisation (GGUF files ending in `q4_k_m`, `q4_0`, etc.) runs well on 6–8 GB VRAM. Unquantised (`f16`) weights require more.

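Once the server has been launched, a caller may want to wait for the model to finish loading before sending requests. The sketch below polls `GET /health` (listed under API endpoints) until it reports `"status": "ok"`. The helper names `fetch_health` and `wait_until_ready`, the retry defaults, and the assumption that the endpoint is unreachable or reports a different status while the model loads are illustrative and not part of cf-text itself.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str, timeout: float = 2.0) -> dict:
    """GET {base_url}/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict], attempts: int = 30, delay: float = 1.0
) -> dict:
    """Call `fetch` until it returns a payload whose status is "ok".

    Hypothetical helper: assumes the server either refuses connections
    or reports a non-"ok" status until the model is loaded.
    """
    last_error = None
    for _ in range(attempts):
        try:
            health = fetch()
            if health.get("status") == "ok":
                return health
        except OSError as exc:  # URLError and refused connections are OSErrors
            last_error = exc
        time.sleep(delay)
    raise TimeoutError(f"cf-text not ready after {attempts} attempts") from last_error
```

A typical call would be `wait_until_ready(lambda: fetch_health("http://localhost:8006"))`. Passing the fetch function in as an argument keeps the retry loop testable without a running server.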
Multi-GPU (splits across two GPUs via `device_map=auto`):

```bash
python -m circuitforge_core.text.app \
    --model /path/to/large-model \
    --port 8006 \
    --gpu-ids 0,1
```

---

### LLM inference (HuggingFace transformers)

```bash
pip install "circuitforge-core[text-transformers]"

# 4-bit quantisation (bitsandbytes):
pip install "circuitforge-core[text-transformers-4bit]"
```

```bash
python -m circuitforge_core.text.app \
    --model /path/to/model-or-hf-repo \
    --backend transformers \
    --port 8006
```

---

### VLM inference (GGUF with mmproj)

LLaVA-style models (LLaVA, BakLLaVA, llava-phi) require a separate projector file (`--mmproj`):

```bash
python -m circuitforge_core.text.app \
    --model /path/to/llava-model.gguf \
    --mmproj /path/to/mmproj.gguf \
    --port 8006 \
    --gpu-id 0
```

Embedded VLMs (Qwen2-VL, MiniCPM-V, Moondream) have the projector baked in, so no `--mmproj` is needed.

Sending images via the multimodal API:

```json
POST /chat
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,"}},
        {"type": "text", "text": "What is in this document?"}
      ]
    }
  ]
}
```

Sending an image to a text-only model returns HTTP 422.

---

### Classifier / PII filter

```bash
pip install "circuitforge-core[text-transformers]"
```

```bash
python -m circuitforge_core.text.app \
    --backend classifier \
    --model dslim/bert-base-NER \
    --port 8006
```

Recommended model for English PII detection: `dslim/bert-base-NER`. Substituting other HuggingFace NER models is supported.

Calling the filter endpoint:

```json
POST /filter
{
  "text": "Please contact John Smith at john@example.com.",
  "mode": "redact"
}
```

Modes: `redact` (replace spans with `[REDACTED]`), `detect` (return boolean), `spans` (return span list with labels and confidence).

---

### Mock mode (no model required)

```bash
CF_TEXT_MOCK=1 python -m circuitforge_core.text.app --port 8006
```

Returns deterministic canned responses for all endpoints. No GPU, no model download. Suitable for CI and integration testing.

---

### Configuration

| Variable | Default | Description |
|---|---|---|
| `CF_TEXT_URL` | (none) | URL products use to reach cf-text (e.g. `http://localhost:8006`) |
| `CF_TEXT_MOCK` | (none) | Set to `1` to enable mock mode |

CLI flags: `--model`, `--backend` (`llamacpp`/`transformers`/`classifier`/`mock`), `--port`, `--gpu-id`, `--gpu-ids`, `--mmproj`.

---

### API endpoints

| Endpoint | Backend | Description |
|---|---|---|
| `GET /health` | all | `{"status":"ok","model":str,"backend":str,"vram_mb":int}` |
| `POST /generate` | text-gen | Single prompt completion |
| `POST /chat` | text-gen | OpenAI-compatible chat (supports multimodal content blocks) |
| `POST /v1/chat/completions` | text-gen | OpenAI-compatible alias for `/chat` |
| `POST /filter` | classifier | PII detection and redaction |

---

### Connecting from a product

```bash
CF_TEXT_URL=http://localhost:8006
```

Products using cf-core's LLM router pick this up automatically when the `text` backend is enabled in `config/llm.yaml`.
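
For code that calls the service directly instead of going through the LLM router, a minimal sketch using only the standard library might look like the following. It reads `CF_TEXT_URL` from the environment and posts a text content block to `/chat` in the same shape as the multimodal example above. The function names `build_chat_request` and `chat`, the fallback URL, and the timeout are assumptions made for illustration. The response schema is not documented here, so the decoded JSON is returned untouched.

```python
import json
import os
import urllib.request
from typing import Optional


def build_chat_request(
    prompt: str, base_url: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST /chat request carrying a single user text block."""
    # Assumption: fall back to the example port when CF_TEXT_URL is unset.
    base = base_url or os.environ.get("CF_TEXT_URL", "http://localhost:8006")
    payload = {
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ]
    }
    return urllib.request.Request(
        f"{base.rstrip('/')}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> dict:
    """Send the prompt to cf-text and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return json.load(resp)
```

Running the server with `CF_TEXT_MOCK=1` is a convenient way to exercise `chat("hello")` without a GPU or a model download.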