Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
instead of post-filtering an already-small global pool (fixes wrong-book
results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
snippet 200 → 400 chars
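The doc-filter fix above can be sketched as follows. This is a minimal illustration, not the retriever's actual code: `search_within_doc`, the `knn` callable, and the `(doc_id, page, distance)` row shape are hypothetical names chosen for the example. Only the `top_k * 20` oversampling factor and the Python-side filter come from the notes above.

```python
from typing import Callable, List, Tuple

# Hypothetical row shape: (doc_id, page, distance), ordered nearest-first.
Hit = Tuple[str, int, float]


def search_within_doc(
    knn: Callable[[int], List[Hit]],
    doc_id: str,
    top_k: int = 10,
    oversample: int = 20,
) -> List[Hit]:
    """Fetch top_k * oversample global hits, then keep only one document's."""
    # Asking for a larger global pool first means hits from the target
    # document are still present after hits from other books are dropped.
    candidates = knn(top_k * oversample)
    return [hit for hit in candidates if hit[0] == doc_id][:top_k]
```

With `oversample=1` the global pool can consist entirely of other documents, which is the wrong-book failure the fix addresses.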
Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
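A rough sketch of what a cleaner like `clean_paragraph()` might do is shown below. The regular expressions are assumptions made for illustration; the real rules in scripts/text_clean.py may differ, and piracy-stamp handling is left out because its patterns are not described here.

```python
import re

# Hypothetical patterns; the project's actual rules may differ.
_WATERMARK = re.compile(r"ABC Amber LIT Converter[^\n]*", re.IGNORECASE)
_URL = re.compile(r"(?:https?://)?(?:www\.)?processtext\.com\S*", re.IGNORECASE)
# A line holding nothing but a short number is treated as a page number.
_BARE_PAGE_NUMBER = re.compile(r"^[ \t]*\d{1,4}[ \t]*$", re.MULTILINE)


def clean_paragraph(text: str) -> str:
    """Remove converter watermarks, processtext.com URLs, and bare page numbers."""
    for pattern in (_WATERMARK, _URL, _BARE_PAGE_NUMBER):
        text = pattern.sub("", text)
    # Collapse the whitespace left behind and drop lines that became empty.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Numbers inside a sentence are left alone, since only lines consisting solely of digits match the page-number pattern.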
Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
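The startup check can be pictured roughly like this. It is a sketch under stated assumptions: the `vec_meta` table, its `embed_dims` key, and the function names are hypothetical, and a plain SQLite table stands in for whatever the project actually stores alongside its sqlite-vec data.

```python
import sqlite3
import threading
from pathlib import Path
from typing import Callable, Iterable, Optional


def stored_dims(db_path: Path) -> Optional[int]:
    """Return the dimension recorded in the vec DB, or None if unknown."""
    if not db_path.exists():
        return None
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical metadata table; the real schema may differ.
        row = conn.execute(
            "SELECT value FROM vec_meta WHERE key = 'embed_dims'"
        ).fetchone()
    except sqlite3.OperationalError:  # metadata table missing
        return None
    finally:
        conn.close()
    return int(row[0]) if row else None


def check_vec_schema(db_path: Path, expected_dims: int) -> bool:
    """Delete a stale vec DB; return True when a re-embed is needed."""
    found = stored_dims(db_path)
    if found is None or found == expected_dims:
        return False
    db_path.unlink()
    return True


def start_reembed(
    doc_ids: Iterable[str], embed_one: Callable[[str], None]
) -> threading.Thread:
    """Re-embed documents one at a time on a single background thread."""

    def worker() -> None:
        for doc_id in doc_ids:  # sequential: only one writer touches SQLite
            embed_one(doc_id)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Running the re-embed on one thread, one document after another, is what avoids concurrent writers competing for the SQLite lock.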
cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
allocate() fires and keeps the Ollama model warm between requests
Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
two-phase progress: indeterminate animation during extraction,
determinate "Embedding N/M pages" bar once vectors start landing
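The two-phase display logic can be summarized in a few lines. The component itself is written in Vue; the Python below only restates the decision it makes. The `status` and `page_count` fields, the `"processing"` value, and the "Extracting text" label are assumptions for the example. Only `vec_count` and the "Embedding N/M pages" wording come from the notes above.

```python
from typing import Optional


def progress_view(status: str, vec_count: int, page_count: Optional[int]) -> dict:
    """Decide which progress indicator a document card should show."""
    if status != "processing":
        return {"mode": "hidden"}
    if vec_count == 0 or not page_count:
        # Extraction phase: no vectors have landed yet, so the total is unknown.
        return {"mode": "indeterminate", "label": "Extracting text"}
    done = min(vec_count, page_count)
    return {
        "mode": "determinate",
        "label": f"Embedding {done}/{page_count} pages",
        "percent": round(100 * done / page_count),
    }
```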
Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
| File or directory |
|---|
| app |
| config |
| docker/web |
| docs |
| migrations |
| scripts |
| tests |
| web |
| .env.cloud.example |
| .env.example |
| .gitignore |
| compose.cloud.yml |
| compose.override.yml.example |
| compose.yml |
| Dockerfile |
| environment.yml |
| manage.sh |
| pyproject.toml |
| README.md |
Pagepiper
v0.1.0 | Self-hosted PDF and EPUB search for your personal library
Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With Ollama configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.
Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.
Try it: pagepiper.circuitforge.tech
Features
| Feature | Free tier | Paid (BYOK) |
|---|---|---|
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
| Directory scan for existing files | Yes | Yes |
| BM25 full-text search (no LLM required) | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
| LLM chat with page-level citations | No | Yes (local Ollama) |
| Thumbs up / down feedback on answers | No | Yes |
BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.
BM25 (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. k-NN (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
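For readers who want to see the ranking idea concretely, here is a minimal sketch of the standard Okapi BM25 formula. It is not Pagepiper's BM25Index; the function name, the pre-tokenized inputs, and the parameter values `k1=1.5` and `b=0.75` are common textbook choices assumed for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each tokenized document against the query terms (docs must be non-empty)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that never mentions a query term scores zero, which is why keyword search misses paraphrases that vector search can still find.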
Tech Stack
- Backend: FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
- Frontend: Vue 3 SPA served by nginx
- Embedding model: nomic-embed-text via Ollama (1024-dim, optional)
- Chat LLM: mistral:7b via Ollama (optional, any Ollama model works)
- Deployment: Docker Compose
Quick Start (Self-Hosting)
Prerequisites
- Docker and Docker Compose
- PDFs or EPUBs you want to search
- Optional: Ollama for semantic search and RAG (retrieval-augmented generation) chat
1. Clone the repo
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
2. Configure
cp .env.example .env
Open .env and set your paths:
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs
# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data
To unlock hybrid search and LLM chat, uncomment and set the Ollama block:
PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
3. Start
./manage.sh start
Open http://localhost:8521.
4. Add documents
Two ways to add files:
Upload via browser (easiest for small collections): Click Upload in the Library view and select a PDF or EPUB. The file saves to data/uploads/ and begins indexing automatically.
Scan a directory (best for large collections): Set PAGEPIPER_BOOKS_DIR in your .env to a folder of PDFs/EPUBs, then click Scan in the Library view. Pagepiper finds all files recursively and queues them for indexing.
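The recursive scan can be approximated with the standard library. This is a sketch of the idea rather than Pagepiper's scanner; `find_documents` and `SUPPORTED` are hypothetical names.

```python
from pathlib import Path
from typing import List

# Extensions the scan looks for, compared case-insensitively.
SUPPORTED = {".pdf", ".epub"}


def find_documents(books_dir: Path) -> List[Path]:
    """Return every PDF and EPUB under books_dir, including subfolders."""
    return sorted(
        path
        for path in books_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED
    )
```

Calling `find_documents(Path("/path/to/your/pdfs"))` would list the files a scan is expected to queue, which can help when checking that PAGEPIPER_BOOKS_DIR points at the right folder.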
5. Search and chat
Switch to the Chat tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
Ollama Setup (optional)
Install Ollama from ollama.com, then pull the models:
ollama pull mistral:7b
ollama pull nomic-embed-text
On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:
OLLAMA_HOST=0.0.0.0 ollama serve
On Docker Desktop (Linux or Mac), host.docker.internal resolves automatically. No extra network config needed.
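To confirm that a given base URL actually reaches Ollama, and to see how many dimensions the embedding model returns, a small probe like the one below may help. It assumes Ollama's `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns an `embedding` list. The function names and the injectable `post` parameter are hypothetical conveniences for this example.

```python
import json
import urllib.request
from typing import Callable


def _post_json(url: str, payload: dict) -> dict:
    """Send a JSON POST request and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def embedding_dims(
    base_url: str, model: str, post: Callable[[str, dict], dict] = _post_json
) -> int:
    """Return the length of the vector the model produces for a short prompt."""
    body = post(
        f"{base_url.rstrip('/')}/api/embeddings",
        {"model": model, "prompt": "dimension probe"},
    )
    return len(body["embedding"])
```

Running `embedding_dims("http://localhost:11434", "nomic-embed-text")` from the machine or container that hosts the API fails with a connection error if Ollama is unreachable. Otherwise it returns a number that can be compared against PAGEPIPER_EMBED_DIMS.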
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PAGEPIPER_BOOKS_DIR | ./books | Host directory to scan for PDFs and EPUBs |
| PAGEPIPER_DATA_DIR | ./data | SQLite index and uploaded files live here |
| PAGEPIPER_OLLAMA_URL | (unset) | Ollama base URL; leave blank for BM25-only mode |
| PAGEPIPER_EMBED_MODEL | nomic-embed-text | Ollama embedding model (1024-dim default) |
| PAGEPIPER_EMBED_DIMS | 1024 | Must match the embedding model's output dimensions |
| PAGEPIPER_CHAT_MODEL | mistral:7b | Ollama chat model; any Ollama model name works |
| PAGEPIPER_CHAT_FEEDBACK | (unset) | Set to true to enable thumbs up/down on chat answers |
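As an illustration of how these variables and defaults fit together, the sketch below loads them into a small settings object. The `Settings` class and its field names are hypothetical and are not taken from Pagepiper's code. Only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    books_dir: str
    data_dir: str
    ollama_url: Optional[str]
    embed_model: str
    embed_dims: int
    chat_model: str
    chat_feedback: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Read PAGEPIPER_* variables, falling back to the documented defaults."""
        return cls(
            books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
            data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
            # Unset or blank means BM25-only mode.
            ollama_url=env.get("PAGEPIPER_OLLAMA_URL") or None,
            embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
            embed_dims=int(env.get("PAGEPIPER_EMBED_DIMS", "1024")),
            chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
            chat_feedback=env.get("PAGEPIPER_CHAT_FEEDBACK", "").strip().lower() == "true",
        )
```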
Management
./manage.sh start # Build and start (dev)
./manage.sh stop # Stop
./manage.sh restart # Restart
./manage.sh status # Show container status
./manage.sh logs [svc] # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open # Open the UI in your browser
./manage.sh build # Rebuild images without cache
./manage.sh cloud:start # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build
Cloud Managed Instance
The cloud deployment runs at pagepiper.circuitforge.tech and at menagerie.circuitforge.tech/pagepiper. It uses compose.cloud.yml with LLM inference routed through the cf-orch coordinator.
To run your own cloud-style deployment:
cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start
The cloud instance listens on port 8533. The API is internal-only; nginx proxies /api/ to the backend.
Data and Backups
The data/ directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
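If you want to copy the index while Pagepiper is running, SQLite's online backup API is one option that is safer than copying the file mid-write. The sketch below uses Python's built-in `sqlite3` module. The function name is hypothetical, and the database filename inside data/ is not stated in this README, so any path you pass is your own assumption to verify.

```python
import sqlite3
from pathlib import Path


def backup_index(src: Path, dest: Path) -> None:
    """Copy a live SQLite database to dest using the online backup API."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    source = sqlite3.connect(src)
    target = sqlite3.connect(dest)
    try:
        # Copies every page of the source database into the target.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

Uploaded files under data/uploads/ are ordinary files and can be copied with any file-level backup tool.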
Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.
Licensing
Pagepiper uses a split license:
- MIT: BM25 full-text search, document library management, ingest pipeline, EPUB support
- BSL 1.1: Hybrid vector search (embedding + k-NN), RAG chat, LLM integration
BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.
License keys: circuitforge.tech
Contributing
Issues and PRs welcome at git.opensourcesolarpunk.com/Circuit-Forge/pagepiper.
The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.