# Architecture

## Overview

```
Browser (Vue 3 SPA)
        |
nginx (static + /api proxy)
        |
FastAPI backend
  ├── BM25Index (in-process, rank-bm25)
  ├── Retriever (BM25 + optional vector)
  ├── Synthesizer (LLMRouter → Ollama)
  └── SQLite (page_chunks + metadata)
        +
      sqlite-vec (vectors)
```

## Ingest pipeline

```
PDF / EPUB file
    │
    ├─ PDFExtractor (pdfminer + OCR fallback)  ← circuitforge_core
    │     or
    └─ EPUBExtractor (BeautifulSoup + heading chunking)
    │
text_clean.py (strip artifacts)
    │
INSERT INTO page_chunks
    │
Ollama embed (batches of 64)  ← BYOK gate
    │
sqlite-vec upsert
```
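The embedding step in the diagram above can be sketched as a simple batching loop. This is a minimal illustration, not Pagepiper's actual code: `batched`, `embed_chunks`, and the `embed_fn` callback are hypothetical names, and `embed_fn` stands in for whatever call sends a batch of texts to Ollama. Only the batch size of 64 comes from the diagram.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 64  # batch size named in the ingest diagram


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(
    chunks: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed all chunks batch by batch.

    `embed_fn` is a hypothetical stand-in for the real embedding call;
    it must return one vector per input text, in order.
    """
    vectors: List[List[float]] = []
    for batch in batched(chunks):
        result = embed_fn(batch)
        if len(result) != len(batch):
            raise ValueError("embedder returned the wrong number of vectors")
        vectors.extend(result)
    return vectors
```

Passing the embedder in as a callback keeps the batching logic testable without a running Ollama instance.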

## Retrieval

Hybrid search merges BM25 and semantic results with a 50/50 score blend:

1. The BM25 query runs against the in-process index, with no round-trip to the database.
2. The semantic query embeds the user query via Ollama, fetches the `top_k * 20` nearest vectors, and filters them by `doc_id` in Python.
3. Hits are merged: BM25 and vector scores are combined, and BM25 hits take priority.
4. The top `k` results are ranked, then adjacent pages (page ± 1) are fetched to restore context where a chunk boundary falls mid-sentence.
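The merge and neighbor-fetch steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, both score sets are min-max normalized before blending, vector scores are treated as similarities (higher is better), "BM25 hits take priority" is read as a tie-break, and pages are assumed to be numbered from 1.

```python
from typing import Dict, List, Tuple


def normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Min-max normalize scores into [0, 1] (an assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def hybrid_merge(
    bm25: Dict[int, float],
    vector: Dict[int, float],
    k: int,
    bm25_weight: float = 0.5,  # 50/50 blend, as described above
) -> List[Tuple[int, float]]:
    """Blend two score maps keyed by chunk id and return the top k."""
    b, v = normalize(bm25), normalize(vector)
    blended = {
        cid: bm25_weight * b.get(cid, 0.0) + (1.0 - bm25_weight) * v.get(cid, 0.0)
        for cid in b.keys() | v.keys()
    }
    # Sort by blended score (descending); on ties, BM25 hits come first
    # (assumed meaning of "priority"), then lower chunk id for stability.
    ranked = sorted(
        blended.items(),
        key=lambda item: (-item[1], item[0] not in bm25, item[0]),
    )
    return ranked[:k]


def with_neighbors(page: int, page_count: int) -> List[int]:
    """Return page - 1, page, page + 1, clipped to the document (1-based)."""
    return [p for p in (page - 1, page, page + 1) if 1 <= p <= page_count]
```

For example, `with_neighbors(1, 10)` yields pages 1 and 2 only, since there is no page 0 to fetch.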

## Storage

| File | Format | Contents |
|------|--------|---------|
| `pagepiper.db` | SQLite | `documents`, `page_chunks`, `chat_feedback` |
| `pagepiper_vecs.db` | sqlite-vec | `page_vecs` virtual table + `page_vecs_meta` |

The vector database stores one row per page chunk. If the embedding model changes, Pagepiper detects the dimension mismatch at startup (reads `CREATE VIRTUAL TABLE` DDL from `sqlite_master`), deletes the vec DB, and queues a background re-embed.
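The startup check described above might look roughly like the sketch below. It is an assumption-laden illustration: the helper names are hypothetical, and it assumes the stored DDL declares the vector column in sqlite-vec's `float[N]` form (for example `embedding float[768]`; the column name is a guess). Reading `sqlite_master` needs only the standard `sqlite3` module.

```python
import re
import sqlite3
from typing import Optional

# Matches the dimension in a declaration such as "embedding float[768]".
_DIM_PATTERN = re.compile(r"float\[(\d+)\]", re.IGNORECASE)


def read_table_ddl(conn: sqlite3.Connection, table: str) -> Optional[str]:
    """Return the stored CREATE statement for `table`, or None if absent."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    return row[0] if row else None


def parse_vec_dimension(ddl: str) -> Optional[int]:
    """Extract the vector dimension from the DDL, or None if not found."""
    match = _DIM_PATTERN.search(ddl)
    return int(match.group(1)) if match else None


def needs_reembed(ddl: Optional[str], model_dim: int) -> bool:
    """True when a stored dimension exists and differs from the model's."""
    stored = parse_vec_dimension(ddl) if ddl else None
    return stored is not None and stored != model_dim
```

A caller would pass the result of `read_table_ddl(conn, "page_vecs")` to `needs_reembed` together with the current model's embedding size, and delete and rebuild the vector database when it returns `True`.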

## Licensing boundary

| Component | License |
|-----------|---------|
| BM25 search, ingest pipeline, library API | MIT |
| Hybrid vector search, RAG chat, embedding | BSL 1.1 (BYOK unlocked on Free tier) |