Self-hosted document library manager with BM25 keyword search and RAG chat with page-level citations

pyr0ball e52bdb5128 feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00

Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter instead of post-filtering an already-small global pool (fixes wrong-book results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation snippet 200 → 400 chars

Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks, processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py

Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch, deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed

cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so allocate() fires and keeps the Ollama model warm between requests

Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows two-phase progress: indeterminate animation during extraction, determinate "Embedding N/M pages" bar once vectors start landing

Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
app	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
config	chore: initial pagepiper repo scaffold	2026-05-04 16:54:08 -07:00
docker/web	fix: use http_host for proxy Host header to preserve port in redirects	2026-05-05 12:04:56 -07:00
docs	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
migrations	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
scripts	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
tests	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
web	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
.env.cloud.example	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
.env.example	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
.gitignore	fix(scaffold): split api:8522/web:8521, fix nginx proxy to host.docker.internal	2026-05-04 17:02:41 -07:00
compose.cloud.yml	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
compose.override.yml.example	chore: initial pagepiper repo scaffold	2026-05-04 16:54:08 -07:00
compose.yml	fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT	2026-05-05 11:46:45 -07:00
Dockerfile	fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT	2026-05-05 11:46:45 -07:00
environment.yml	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00
manage.sh	feat(deploy): add cloud deploy config for pagepiper.circuitforge.tech	2026-05-05 07:12:48 -07:00
pyproject.toml	chore: initial pagepiper repo scaffold	2026-05-04 16:54:08 -07:00
README.md	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI	2026-05-06 08:25:58 -07:00

README.md

Pagepiper

v0.1.0 | Self-hosted PDF and EPUB search for your personal library

Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With Ollama configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.

Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.

Try it: pagepiper.circuitforge.tech

Features

Feature	Free tier	Paid (BYOK)
PDF and EPUB upload via browser drag-and-drop	Yes	Yes
Directory scan for existing files	Yes	Yes
BM25 full-text search (no LLM required)	Yes	Yes
Unlimited local ingestion	Yes	Yes
Hybrid BM25 + k-NN vector search	No	Yes (local Ollama)
LLM chat with page-level citations	No	Yes (local Ollama)
Thumbs up / down feedback on answers	No	Yes

BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.

BM25 (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. k-NN (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
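
To make the keyword ranking concrete, here is a minimal BM25 sketch in plain Python. It is not Pagepiper's BM25Index, whose internals are not described here; the whitespace tokenizer, the function names, and the k1 and b values are illustrative assumptions (1.5 and 0.75 are common defaults for the formula).

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: naive lowercase/whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A passage that contains none of the query words scores zero, which is why pure keyword search misses paraphrases that vector search can still find.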

Tech Stack

Backend: FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
Frontend: Vue 3 SPA served by nginx
Embedding model: nomic-embed-text via Ollama (1024-dim, optional)
Chat LLM: mistral:7b via Ollama (optional, any Ollama model works)
Deployment: Docker Compose

Quick Start (Self-Hosting)

Prerequisites

Docker and Docker Compose
PDFs or EPUBs you want to search
Optional: Ollama for semantic search and RAG (retrieval-augmented generation) chat

1. Clone the repo

git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper

2. Configure

cp .env.example .env

Open .env and set your paths:

# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data

To unlock hybrid search and LLM chat, uncomment and set the Ollama block:

PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
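
As a rough illustration of how these variables might be consumed, the sketch below reads them from the environment and treats an unset or blank PAGEPIPER_OLLAMA_URL as BM25-only mode. The Settings class and load_settings function are hypothetical names, not Pagepiper's actual config code; the defaults are the ones listed under Environment Variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names are illustrative.
    books_dir: str
    data_dir: str
    ollama_url: Optional[str]
    chat_model: str
    embed_model: str

    @property
    def bm25_only(self) -> bool:
        # Without an Ollama URL there are no embeddings and no chat model.
        return self.ollama_url is None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    url = env.get("PAGEPIPER_OLLAMA_URL", "").strip()
    return Settings(
        books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
        data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
        ollama_url=url or None,
        chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
        embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
    )
```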

3. Start

./manage.sh start

Open http://localhost:8521.

4. Add documents

Two ways to add files:

Upload via browser (easiest for small collections): Click Upload in the Library view and select a PDF or EPUB. The file saves to data/uploads/ and begins indexing automatically.

Scan a directory (best for large collections): Set PAGEPIPER_BOOKS_DIR in your .env to a folder of PDFs/EPUBs, then click Scan in the Library view. Pagepiper finds all files recursively and queues them for indexing.
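
If you want to preview what a recursive scan would pick up before clicking Scan, a small standalone script can approximate it. This is a sketch, not the scanner Pagepiper ships: find_documents is a hypothetical helper, and matching on the .pdf and .epub extensions (case-insensitive) is an assumption.

```python
from pathlib import Path

# Assumption: files are selected by extension, ignoring case.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(root) -> list[Path]:
    """Return every PDF/EPUB under root, searching subdirectories too."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    import sys

    for doc in find_documents(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(doc)
```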

5. Search and chat

Switch to the Chat tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
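
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how hybrid search can work in general, not a description of Pagepiper's retriever: in the real app sqlite-vec performs the k-NN lookup, and the fusion method, function names, and the constant c = 60 are all illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn(query_vec: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force k-nearest neighbors by cosine similarity."""
    ranked = sorted(vectors, key=lambda pid: cosine(query_vec, vectors[pid]), reverse=True)
    return ranked[:k]


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in any list rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (c + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))
```

A page that appears in both the BM25 list and the k-NN list accumulates two contributions, so it outranks pages found by only one method.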

Ollama Setup (optional)

Install Ollama from ollama.com, then pull the models:

ollama pull mistral:7b
ollama pull nomic-embed-text
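
To confirm both models are actually available, you can query Ollama's /api/tags endpoint, which lists locally installed models. The helper names below are hypothetical, and the sketch assumes Ollama reports untagged pulls with a :latest suffix (for example nomic-embed-text:latest).

```python
import json
from urllib.request import urlopen


def normalize(name: str) -> str:
    # Assumption: a model pulled without a tag is listed as "<name>:latest".
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Compare the /api/tags response against the models we need."""
    installed = {normalize(m["name"]) for m in tags_payload.get("models", [])}
    return [name for name in required if normalize(name) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    missing = missing_models(fetch_tags(), ["mistral:7b", "nomic-embed-text"])
    print("all models present" if not missing else f"missing: {missing}")
```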

On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:

OLLAMA_HOST=0.0.0.0 ollama serve

On Docker Desktop (Linux or Mac), host.docker.internal resolves automatically. No extra network config needed.

Environment Variables

Variable	Default	Description
`PAGEPIPER_BOOKS_DIR`	`./books`	Host directory to scan for PDFs and EPUBs
`PAGEPIPER_DATA_DIR`	`./data`	SQLite index and uploaded files live here
`PAGEPIPER_OLLAMA_URL`	(unset)	Ollama base URL; leave blank for BM25-only mode
`PAGEPIPER_EMBED_MODEL`	`nomic-embed-text`	Ollama embedding model (1024-dim default)
`PAGEPIPER_EMBED_DIMS`	`1024`	Must match the embedding model's output dimensions
`PAGEPIPER_CHAT_MODEL`	`mistral:7b`	Ollama chat model; any Ollama model name works
`PAGEPIPER_CHAT_FEEDBACK`	(unset)	Set to `true` to enable thumbs up/down on chat answers
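
Because PAGEPIPER_EMBED_DIMS must match the model's output, it can be worth measuring the vector length your Ollama instance actually returns before ingesting a large library. The sketch below requests one embedding from Ollama's /api/embeddings endpoint and compares its length to the expected value; the function names are hypothetical, and this is a standalone check rather than Pagepiper's own startup validation.

```python
import json
from urllib.request import Request, urlopen


def embedding_length(payload: dict) -> int:
    # /api/embeddings returns {"embedding": [...]};
    # the newer /api/embed returns {"embeddings": [[...]]}.
    if "embedding" in payload:
        return len(payload["embedding"])
    return len(payload["embeddings"][0])


def check_dims(payload: dict, expected: int) -> int:
    actual = embedding_length(payload)
    if actual != expected:
        raise ValueError(f"model returns {actual} dims, PAGEPIPER_EMBED_DIMS is {expected}")
    return actual


def fetch_embedding(base_url: str, model: str, text: str = "dimension probe") -> dict:
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = fetch_embedding("http://localhost:11434", "nomic-embed-text")
    print("embedding length:", embedding_length(payload))
```

If the printed length differs from your PAGEPIPER_EMBED_DIMS value, set the variable to the measured length.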

Management

./manage.sh start          # Build and start (dev)
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open           # Open the UI in your browser
./manage.sh build          # Rebuild images without cache

./manage.sh cloud:start    # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build

Cloud Managed Instance

The cloud deployment runs at pagepiper.circuitforge.tech and at menagerie.circuitforge.tech/pagepiper. It uses compose.cloud.yml with LLM inference routed through the cf-orch coordinator.

To run your own cloud-style deployment:

cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start

The cloud instance listens on port 8533. The API is internal-only; nginx proxies /api/ to the backend.

Data and Backups

The data/ directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
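
For the backup itself, copying a SQLite file while the app is writing to it can produce an inconsistent copy. A safer approach is Python's sqlite3 online backup API, sketched below. The database filename is not documented here, so the paths in the usage example are hypothetical placeholders; uploaded files under data/uploads/ still need to be copied separately.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src: Path, dest: Path) -> None:
    """Copy a live SQLite database using the online backup API."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    source = sqlite3.connect(src)
    target = sqlite3.connect(dest)
    try:
        # Connection.backup takes a consistent snapshot even if src is in use.
        source.backup(target)
    finally:
        target.close()
        source.close()


if __name__ == "__main__":
    # Hypothetical paths: substitute the actual database file in your data/ directory.
    backup_sqlite(Path("data/index.db"), Path("backups/index.db"))
```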

Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.
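
If you are scripting ingestion, you can watch progress the same way the UI does, by polling GET /api/library/{doc_id}/status, which the latest commit message says returns vec_count. The sketch below is an assumption-heavy illustration: only vec_count is confirmed by the source, while the status field, its "processing" value, and the function name are hypothetical. The HTTP call is injected as fetch_status so the loop works with any client.

```python
import time
from typing import Callable


def wait_until_indexed(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,  # the UI polls every 3 s
    max_polls: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a status callable until the document is no longer processing."""
    for _ in range(max_polls):
        status = fetch_status()
        # Hypothetical field: "status" == "processing" while indexing runs.
        if status.get("status") != "processing":
            return status
        print(f"embedded {status.get('vec_count', 0)} pages so far")
        sleep(interval)
    raise TimeoutError("document did not finish indexing in time")
```

In practice fetch_status would wrap an HTTP GET against your instance and return the parsed JSON body.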

Licensing

Pagepiper uses a split license:

MIT: BM25 full-text search, document library management, ingest pipeline, EPUB support
BSL 1.1: Hybrid vector search (embedding + k-NN), RAG chat, LLM integration

BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.

License keys: circuitforge.tech

Contributing

Issues and PRs welcome at git.opensourcesolarpunk.com/Circuit-Forge/pagepiper.

The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.