Architecture

Overview

Browser (Vue 3 SPA)
        |
  nginx (static + /api proxy)
        |
  FastAPI backend
    ├── BM25Index (in-process, rank-bm25)
    ├── Retriever (BM25 + optional vector)
    ├── Synthesizer (LLMRouter → Ollama)
    └── SQLite (page_chunks + metadata)
              +
         sqlite-vec (vectors)

Ingest pipeline

PDF / EPUB file
    │
    ├─ PDFExtractor (pdfminer + OCR fallback)  ← circuitforge_core
    │   or
    └─ EPUBExtractor (BeautifulSoup + heading chunking)
            │
     text_clean.py (strip artifacts)
            │
     INSERT INTO page_chunks
            │
     Ollama embed (batches of 64)   ← BYOK gate
            │
     sqlite-vec upsert
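
The embedding step in the pipeline above can be sketched as follows. This is a minimal illustration, not Pagepiper's actual code: `batched`, `embed_chunks`, and the `embed_batch` callable are hypothetical names, and `embed_batch` stands in for whatever function sends one batch of texts to Ollama. Only the batch size of 64 comes from the diagram.

```python
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 64  # batch size stated in the ingest diagram


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed chunks one batch at a time, preserving input order.

    `embed_batch` is a hypothetical stand-in for the Ollama embedding call.
    """
    vectors: list[list[float]] = []
    for batch in batched(chunks):
        result = embed_batch(batch)
        # Guard against an embedder that drops or duplicates inputs.
        if len(result) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(result)
    return vectors
```

Passing the embedder in as a callable keeps the batching logic testable without a running Ollama instance.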

Retrieval

Hybrid search merges BM25 and semantic results with a 50/50 score blend:

1. BM25 queries the in-process index, with no round trip to the database.
2. The semantic path embeds the user query via Ollama, fetches the top_k * 20 nearest vectors, and filters them by doc_id in Python.
3. Hits are merged by combining BM25 and vector scores; BM25 hits take priority.
4. The top k results are ranked, then adjacent pages (page ± 1) are fetched to restore context where a chunk boundary falls mid-sentence.
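
The merge and context-expansion steps above might look roughly like the sketch below. Several details are assumptions rather than documented behavior: the `Hit` type and function names are hypothetical, scores are max-normalized before blending, vector scores are treated as similarities (higher is better), "BM25 hits take priority" is interpreted as a tie-break, and pages are assumed to be numbered from 1.

```python
from dataclasses import dataclass
from typing import Optional

Key = tuple[int, int]  # (doc_id, page)


@dataclass(frozen=True)
class Hit:
    doc_id: int
    page: int
    score: float  # assumed: higher means more relevant


def _normalize(hits: list[Hit]) -> dict[Key, float]:
    """Scale scores so the best hit is 1.0 (assumed normalization)."""
    top = max((h.score for h in hits), default=0.0)
    if top <= 0:
        return {(h.doc_id, h.page): 0.0 for h in hits}
    return {(h.doc_id, h.page): h.score / top for h in hits}


def hybrid_merge(
    bm25_hits: list[Hit],
    vector_hits: list[Hit],
    k: int,
    doc_id: Optional[int] = None,
    bm25_weight: float = 0.5,  # the 50/50 blend
) -> list[tuple[Key, float]]:
    """Blend BM25 and vector scores and return the top k (key, score) pairs."""
    if doc_id is not None:
        # Vector hits are over-fetched, then filtered by doc_id in Python.
        vector_hits = [h for h in vector_hits if h.doc_id == doc_id]
    bm25 = _normalize(bm25_hits)
    vec = _normalize(vector_hits)
    blended = {
        key: bm25_weight * bm25.get(key, 0.0) + (1 - bm25_weight) * vec.get(key, 0.0)
        for key in bm25.keys() | vec.keys()
    }
    # On equal scores, keys found by BM25 sort first (assumed tie-break).
    ranked = sorted(blended, key=lambda key: (-blended[key], key not in bm25, key))
    return [(key, blended[key]) for key in ranked[:k]]


def with_adjacent_pages(keys: list[Key]) -> list[Key]:
    """Add page - 1 and page + 1 for each hit, without duplicates."""
    seen: set[Key] = set()
    expanded: list[Key] = []
    for doc_id, page in keys:
        for p in (page - 1, page, page + 1):
            if p >= 1 and (doc_id, p) not in seen:  # assumed 1-based pages
                seen.add((doc_id, p))
                expanded.append((doc_id, p))
    return expanded
```

A caller would pass the keys returned by `hybrid_merge` to `with_adjacent_pages` and then load the text for each resulting page.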

Storage

File	Format	Contents
`pagepiper.db`	SQLite	`documents`, `page_chunks`, `chat_feedback`
`pagepiper_vecs.db`	sqlite-vec	`page_vecs` virtual table + `page_vecs_meta`

The vector database stores one row per page chunk. If the embedding model changes, Pagepiper detects the dimension mismatch at startup by reading the `CREATE VIRTUAL TABLE` DDL from `sqlite_master`. It then deletes the vector database and queues a background re-embed.
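
A minimal sketch of the startup check, assuming the `page_vecs` table is declared with sqlite-vec's `float[N]` column syntax (for example `embedding float[768]`; the column name is an assumption). The function names are hypothetical, and the delete and re-embed steps are left out.

```python
import re
import sqlite3
from typing import Optional

# Matches the dimension in a sqlite-vec column declaration such as float[768].
_DIM_RE = re.compile(r"float\s*\[\s*(\d+)\s*\]", re.IGNORECASE)


def parse_vec_dimension(ddl: Optional[str]) -> Optional[int]:
    """Extract the vector dimension from CREATE VIRTUAL TABLE DDL, if present."""
    match = _DIM_RE.search(ddl or "")
    return int(match.group(1)) if match else None


def stored_dimension(conn: sqlite3.Connection, table: str = "page_vecs") -> Optional[int]:
    """Read the table's DDL from sqlite_master and return its dimension."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    return parse_vec_dimension(row[0]) if row else None


def needs_reembed(conn: sqlite3.Connection, model_dim: int, table: str = "page_vecs") -> bool:
    """True when a stored dimension exists and differs from the model's."""
    stored = stored_dimension(conn, table)
    return stored is not None and stored != model_dim
```

Reading the DDL as text means the check does not need to query any vectors, so it stays cheap even for a large library.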

Licensing boundary

Component	License
BM25 search, ingest pipeline, library API	MIT
Hybrid vector search, RAG chat, embedding	BSL 1.1 (BYOK unlocked on Free tier)
