# Architecture

## Overview

```
Browser (Vue 3 SPA)
    |
nginx (static + /api proxy)
    |
FastAPI backend
    ├── BM25Index (in-process, rank-bm25)
    ├── Retriever (BM25 + optional vector)
    ├── Synthesizer (LLMRouter → Ollama)
    └── SQLite (page_chunks + metadata) + sqlite-vec (vectors)
```

## Ingest pipeline

```
PDF / EPUB file
    │
    ├─ PDFExtractor (pdfminer + OCR fallback)   ← circuitforge_core
    │   or
    └─ EPUBExtractor (BeautifulSoup + heading chunking)
    │
text_clean.py (strip artifacts)
    │
INSERT INTO page_chunks
    │
Ollama embed (batches of 64)   ← BYOK gate
    │
sqlite-vec upsert
```

## Retrieval

Hybrid search merges BM25 and semantic results with a 50/50 score blend:

1. BM25 queries the in-process index (no round-trip to DB)
2. Semantic query embeds the user query via Ollama, fetches `top_k * 20` nearest vectors, filters by `doc_id` in Python
3. Hits are merged: BM25 scores and vector scores combined; BM25 hits take priority
4. Top `k` results are ranked, then adjacent pages (page ± 1) are fetched to restore context for mid-sentence chunk boundaries

## Storage

| File | Format | Contents |
|------|--------|---------|
| `pagepiper.db` | SQLite | `documents`, `page_chunks`, `chat_feedback` |
| `pagepiper_vecs.db` | sqlite-vec | `page_vecs` virtual table + `page_vecs_meta` |

The vector database stores one row per page chunk. If the embedding model changes, Pagepiper detects the dimension mismatch at startup (reads `CREATE VIRTUAL TABLE` DDL from `sqlite_master`), deletes the vec DB, and queues a background re-embed.

## Licensing boundary

| Component | License |
|-----------|---------|
| BM25 search, ingest pipeline, library API | MIT |
| Hybrid vector search, RAG chat, embedding | BSL 1.1 (BYOK unlocked on Free tier) |
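
As an illustration of the hybrid merge described under Retrieval, the sketch below blends two score maps 50/50 and breaks ties in favor of BM25 hits. It is not Pagepiper's actual code: the function names `blend_hits` and `_normalize`, the divide-by-max normalization, and the reading of "BM25 hits take priority" as a tie-break are all assumptions made for the example. Vector scores are assumed to be similarities (higher is better), not distances.

```python
from typing import Dict, List, Tuple


def _normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Scale scores into [0, 1] by dividing by the maximum (assumed scheme)."""
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {chunk_id: 0.0 for chunk_id in scores}
    return {chunk_id: value / top for chunk_id, value in scores.items()}


def blend_hits(
    bm25: Dict[int, float],
    vector: Dict[int, float],
    k: int,
    bm25_weight: float = 0.5,
) -> List[Tuple[int, float]]:
    """Blend BM25 and vector scores per chunk id and return the top k.

    A chunk missing from one source contributes 0.0 for that source.
    """
    bm25_n = _normalize(bm25)
    vec_n = _normalize(vector)
    merged = []
    for chunk_id in set(bm25_n) | set(vec_n):
        score = (
            bm25_weight * bm25_n.get(chunk_id, 0.0)
            + (1.0 - bm25_weight) * vec_n.get(chunk_id, 0.0)
        )
        merged.append((chunk_id, score, chunk_id in bm25_n))
    # Highest blended score first; on ties, BM25 hits come first (assumption),
    # then the lower chunk id so the ordering is deterministic.
    merged.sort(key=lambda item: (-item[1], not item[2], item[0]))
    return [(chunk_id, score) for chunk_id, score, _ in merged[:k]]
```

With `bm25 = {1: 4.0, 2: 2.0}` and `vector = {2: 0.9, 3: 0.9}`, chunk 2 ranks first because both sources found it, and chunk 1 outranks chunk 3 on the BM25 tie-break.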
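
The startup dimension check described under Storage could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: the helper names (`parse_dim`, `stored_dim`, `needs_reembed`) are hypothetical, and the DDL is assumed to declare the vector column as `float[N]`, which is how sqlite-vec `vec0` tables are commonly declared.

```python
import re
import sqlite3
from typing import Optional

# Matches a vec0-style column declaration such as "embedding float[768]".
_DIM_RE = re.compile(r"float\[(\d+)\]", re.IGNORECASE)


def parse_dim(ddl: str) -> Optional[int]:
    """Extract the vector dimension from a CREATE VIRTUAL TABLE statement."""
    match = _DIM_RE.search(ddl)
    return int(match.group(1)) if match else None


def stored_dim(conn: sqlite3.Connection, table: str = "page_vecs") -> Optional[int]:
    """Read the table's DDL from sqlite_master and return its dimension.

    Returns None if the table does not exist or no dimension is found.
    Reading sqlite_master does not require loading the sqlite-vec extension.
    """
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return parse_dim(row[0])


def needs_reembed(conn: sqlite3.Connection, model_dim: int) -> bool:
    """True when a stored dimension exists and differs from the model's."""
    dim = stored_dim(conn)
    return dim is not None and dim != model_dim
```

A caller would run `needs_reembed` against `pagepiper_vecs.db` at startup and, if it returns `True`, delete the file and queue the background re-embed.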