Self-hosted document library manager with BM25 keyword search and RAG chat with page-level citations
pyr0ball 347b391c6e fix: prevent LLM hallucination when retrieval returns low-signal results
- Strengthen synthesizer system prompt: hard 'respond with exactly' constraint
  instead of soft 'say so'; removes any wiggle room for the model to supplement
  from training data
- Add early return in synthesize() when chunks is empty (belt-and-suspenders
  alongside the existing guard in chat.py)
- Add MIN_SIGNAL threshold (0.01) in retriever: if the top combined score is
  below the threshold, return empty so the caller's no-results path fires instead
  of sending noise chunks to the LLM
2026-05-06 10:17:51 -07:00
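
The guard described in the commit message above might look roughly like the following. This is a minimal sketch, not the code in the repository: only the MIN_SIGNAL value (0.01) and the synthesize() name come from the commit message. The Scored type, the filter_low_signal helper, and the NO_ANSWER wording are hypothetical.

```python
from typing import NamedTuple

MIN_SIGNAL = 0.01  # threshold value taken from the commit message
NO_ANSWER = "I could not find that in your library."  # hypothetical wording


class Scored(NamedTuple):
    chunk_id: str
    score: float  # combined retrieval score, higher is better (assumed)


def filter_low_signal(results: list[Scored], min_signal: float = MIN_SIGNAL) -> list[Scored]:
    """Return results ranked by score, or [] when even the best hit is noise."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    if not ranked or ranked[0].score < min_signal:
        return []  # lets the caller's no-results path fire
    return ranked


def synthesize(question: str, chunks: list[Scored]) -> str:
    """Stand-in synthesizer showing only the early return on empty input."""
    if not chunks:
        return NO_ANSWER  # never ask the LLM to answer with nothing to cite
    # A real implementation would call the LLM here; the sketch lists sources.
    return "Answer drawn from: " + ", ".join(c.chunk_id for c in chunks)
```

Checking only the top score keeps the guard cheap: if the best candidate is below the threshold, everything ranked beneath it is too.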
app fix: prevent LLM hallucination when retrieval returns low-signal results 2026-05-06 10:17:51 -07:00
config chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
docker/web fix: use http_host for proxy Host header to preserve port in redirects 2026-05-05 12:04:56 -07:00
docs docs: add Playwright screenshots for library and chat views 2026-05-06 08:39:57 -07:00
migrations feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
scripts feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
tests feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
web feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.env.cloud.example feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.env.example feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.gitignore fix(scaffold): split api:8522/web:8521, fix nginx proxy to host.docker.internal 2026-05-04 17:02:41 -07:00
compose.cloud.yml feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
compose.override.yml.example chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
compose.yml fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT 2026-05-05 11:46:45 -07:00
Dockerfile fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT 2026-05-05 11:46:45 -07:00
environment.yml feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
manage.sh feat(deploy): add cloud deploy config for pagepiper.circuitforge.tech 2026-05-05 07:12:48 -07:00
mkdocs.yml docs: add MkDocs site (getting-started, user-guide, reference) 2026-05-06 08:33:37 -07:00
pyproject.toml chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
README.md docs(readme): landing page rewrite — screenshots, quick start, formats table, tiers, Forgejo-primary, split license 2026-05-06 08:51:38 -07:00

Pagepiper

Search your document library. Get answers with exact page citations.

License: MIT / BSL 1.1

Self-hosted PDF and EPUB search with BM25 (Best Match 25) full-text indexing and LLM (large language model) synthesis. Drop your documents in, ask a question, get an answer that tells you exactly which page to turn to.

Built for TTRPG (tabletop roleplaying game) players who are tired of ctrl-F'ing through six-hundred-page rulebooks. Works equally well for legal research, technical manuals, academic papers, or any personal document library you want to query in plain language.

No cloud required. Your files stay on your machine.


Screenshots

Library

Library view — documents listed with ingest status and page counts

Chat with citations

Chat view — answer with source document and page number for every claim


Why Pagepiper?

  • Your library, not ours. Documents are indexed and stored locally. Nothing is sent to a third-party service unless you explicitly configure a cloud LLM.
  • Works without an LLM. BM25 full-text search runs entirely inside the Docker container. No Ollama, no API key, no GPU required for keyword search.
  • Answers cite their sources. Every LLM response includes the document name and page number it drew from. You can verify or dispute every answer.
  • Hybrid search when you want it. Connect a local Ollama instance to unlock semantic (vector) search that finds relevant passages even when your question doesn't use the exact words in the text.
  • Open ingest pipeline. The indexing and search layer is MIT-licensed. Add support for new formats, improve the PDF parser, contribute — the community benefits directly.
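
To make the keyword-search bullet concrete, here is a minimal sketch of BM25 ranking over pages keyed by (document, page), which is what makes page-level citations possible. It is a textbook Okapi BM25 with a non-negative IDF variant, not Pagepiper's own index; the tokenizer, the K1 and B values, and the bm25_rank name are assumptions.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults; assumed, not Pagepiper's settings

PageKey = tuple[str, int]  # (document name, page number)


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, pages: dict[PageKey, str]) -> list[tuple[PageKey, float]]:
    """Rank pages by BM25 score; pages sharing no query term are omitted."""
    tokens = {key: tokenize(text) for key, text in pages.items()}
    n_pages = len(tokens)
    if n_pages == 0:
        return []
    avg_len = (sum(len(t) for t in tokens.values()) / n_pages) or 1.0
    # Document frequency: number of pages containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores: dict[PageKey, float] = {}
    for key, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_pages - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = K1 * (1 - B + B * len(toks) / avg_len)
            score += idf * tf[term] * (K1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[key] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each indexed unit is a single page, the top hit already carries the document name and page number a citation needs.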

Quick Start

Prerequisites: Docker and Docker Compose. Optionally, Ollama for LLM-synthesized answers and semantic search.

git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
cp .env.example .env
./manage.sh start

Open http://localhost:8521.

Configure

Open .env and set your paths:

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=./data

# Directory to scan for existing PDFs/EPUBs (used by the Scan button)
PAGEPIPER_BOOKS_DIR=/path/to/your/documents

To unlock LLM synthesis and semantic search, add your Ollama endpoint:

PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
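
As an illustration of how these variables could gate the optional features, the sketch below reads them into a small settings object. The variable names come from the .env examples above; the Settings class, load_settings, and the two feature flags are hypothetical and are not taken from the Pagepiper codebase.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_dir: Path
    books_dir: Optional[Path]
    ollama_url: Optional[str]
    chat_model: Optional[str]
    embed_model: Optional[str]

    @property
    def llm_enabled(self) -> bool:
        """LLM synthesis needs an endpoint and a chat model (assumed rule)."""
        return bool(self.ollama_url and self.chat_model)

    @property
    def vector_enabled(self) -> bool:
        """Semantic search needs an endpoint and an embedding model (assumed rule)."""
        return bool(self.ollama_url and self.embed_model)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment-style key/value pairs."""

    def opt(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None  # treat empty strings as unset

    books = opt("PAGEPIPER_BOOKS_DIR")
    return Settings(
        data_dir=Path(opt("PAGEPIPER_DATA_DIR") or "./data"),
        books_dir=Path(books) if books else None,
        ollama_url=opt("PAGEPIPER_OLLAMA_URL"),
        chat_model=opt("PAGEPIPER_CHAT_MODEL"),
        embed_model=opt("PAGEPIPER_EMBED_MODEL"),
    )
```

With no Ollama variables set, both flags are false and only keyword search would be offered.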

Add documents

Upload via browser — click Upload in the Library view. Files are saved to data/uploads/ and indexed automatically.

Scan a directory — set PAGEPIPER_BOOKS_DIR in .env, then click Scan. Pagepiper finds all PDF and EPUB files recursively and queues them for indexing.
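
A recursive scan of this kind can be sketched in a few lines with pathlib. The find_documents name and the case-insensitive suffix match are assumptions; the actual scanner may filter differently.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".epub"}  # formats listed in this README


def find_documents(books_dir: Path) -> list[Path]:
    """Return every PDF/EPUB under books_dir, searching subdirectories too."""
    return sorted(
        path
        for path in books_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Calling find_documents(Path("/path/to/your/documents")) would return the files a scan could hand to the ingest queue.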


Supported Formats

| Format | Ingest | Page-level citations |
|--------|--------|----------------------|
| PDF | Yes | Yes |
| EPUB | Yes | Yes (chapter/location) |
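
The difference between PDF and EPUB citations can be pictured as one record with two optional locators. This is a hypothetical data structure for illustration; the Citation class and its label format are not taken from the codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    document: str
    page: Optional[int] = None      # PDF: page number
    location: Optional[str] = None  # EPUB: chapter or location label

    def label(self) -> str:
        """Render a human-readable source reference."""
        if self.page is not None:
            return f"{self.document}, p. {self.page}"
        if self.location:
            return f"{self.document}, {self.location}"
        return self.document
```

For example, Citation("rules.pdf", page=212).label() yields "rules.pdf, p. 212", while an EPUB source falls back to its chapter label.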

Stack

| Layer | Technology |
|-------|------------|
| Backend API | FastAPI + SQLite |
| Full-text search | BM25 (custom index, no external service) |
| Vector search | sqlite-vec + Ollama embeddings (optional) |
| LLM synthesis | Ollama (local, any model) |
| Frontend | Vue 3 SPA served by nginx |
| Deployment | Docker Compose |

Default ports: Web UI 8521, API 8540.


Management

./manage.sh start          # Build and start
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (pass 'api' or 'web' to filter)
./manage.sh open           # Open UI in browser
./manage.sh build          # Rebuild images
./manage.sh test           # Run test suite

Tiers

| Feature | Free | Paid (BYOK) |
|---------|------|-------------|
| PDF and EPUB upload | Yes | Yes |
| Directory scan | Yes | Yes |
| BM25 full-text search | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + vector search | | Yes (local Ollama) |
| LLM synthesis with page citations | | Yes (local Ollama) |

BYOK means you supply your own Ollama instance. No cloud API keys, no usage metering.


Forgejo-primary

Pagepiper is developed and hosted at git.opensourcesolarpunk.com/Circuit-Forge/pagepiper. GitHub mirrors exist for discoverability only. File issues and submit pull requests on Forgejo.


License

Pagepiper uses a split license:

  • MIT: Document ingest pipeline, BM25 full-text index, library management, EPUB support — the core discovery and retrieval layer.
  • BSL 1.1 (Business Source License): Hybrid vector search, LLM synthesis, RAG (retrieval-augmented generation) chat interface — free for personal non-commercial self-hosting; commercial use or SaaS re-hosting requires a license. Converts to MIT after four years.

A Circuit Forge LLC product. Privacy · Safety · Accessibility — co-equal, non-negotiable.