Self-hosted document library manager with BM25 keyword search and RAG chat with page-level citations
pyr0ball bcd321367e feat: GET /api/library/sample-chunks for Avocet embed bench (closes #6)
Returns up to N randomly sampled page chunks (default 50, max 200) with
chunk_id, doc_id, page_number, and text fields. No tier gate — internal
tool endpoint for same-host corpus benchmarking. Returns [] on empty library.
2026-05-13 23:01:16 -07:00
app feat: GET /api/library/sample-chunks for Avocet embed bench (closes #6) 2026-05-13 23:01:16 -07:00
config chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
docker/web fix: use http_host for proxy Host header to preserve port in redirects 2026-05-05 12:04:56 -07:00
docs docs: add Playwright screenshots for library and chat views 2026-05-06 08:39:57 -07:00
migrations feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
scripts feat: encryption at rest infrastructure for cloud user data (closes #5) 2026-05-13 18:35:17 -07:00
tests feat: GET /api/library/sample-chunks for Avocet embed bench (closes #6) 2026-05-13 23:01:16 -07:00
web feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.env.cloud.example feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.env.example feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
.gitignore fix(scaffold): split api:8522/web:8521, fix nginx proxy to host.docker.internal 2026-05-04 17:02:41 -07:00
compose.cloud.yml feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
compose.override.yml.example chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
compose.yml fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT 2026-05-05 11:46:45 -07:00
Dockerfile fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT 2026-05-05 11:46:45 -07:00
environment.yml feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
manage.sh chore: standardize cloud commands to hyphen syntax + add update command 2026-05-13 16:12:12 -07:00
mkdocs.yml docs: add MkDocs site (getting-started, user-guide, reference) 2026-05-06 08:33:37 -07:00
pyproject.toml chore: initial pagepiper repo scaffold 2026-05-04 16:54:08 -07:00
README.md docs(readme): landing page rewrite — screenshots, quick start, formats table, tiers, Forgejo-primary, split license 2026-05-06 08:51:38 -07:00

Pagepiper

Search your document library. Get answers with exact page citations.

License: MIT / BSL 1.1

Self-hosted PDF and EPUB search with BM25 (Best Match 25) full-text indexing and LLM (large language model) synthesis. Drop your documents in, ask a question, get an answer that tells you exactly which page to turn to.

Built for TTRPG (tabletop roleplaying game) players who are tired of ctrl-F'ing through six-hundred-page rulebooks. Works equally well for legal research, technical manuals, academic papers, or any personal document library you want to query in plain language.

No cloud required. Your files stay on your machine.


Screenshots

Library

[Screenshot: Library view, with documents listed alongside ingest status and page counts]

Chat with citations

[Screenshot: Chat view, showing an answer with the source document and page number for every claim]


Why Pagepiper?

  • Your library, not ours. Documents are indexed and stored locally. Nothing is sent to a third-party service unless you explicitly configure a cloud LLM.
  • Works without an LLM. BM25 full-text search runs entirely inside the Docker container. No Ollama, no API key, no GPU required for keyword search.
  • Answers cite their sources. Every LLM response includes the document name and page number it drew from. You can verify or dispute every answer.
  • Hybrid search when you want it. Connect a local Ollama instance to unlock semantic (vector) search that finds relevant passages even when your question doesn't use the exact words in the text.
  • Open ingest pipeline. The indexing and search layer is MIT-licensed. Add support for new formats, improve the PDF parser, contribute — the community benefits directly.
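To make the keyword-search claim concrete, here is a minimal sketch of Okapi BM25 scoring over a handful of pages. It is a generic textbook version, not Pagepiper's actual index: the function names, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a real indexer would do more)."""
    return text.lower().split()


def bm25_scores(query: str, pages: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each page against the query with the Okapi BM25 formula."""
    docs = [tokenize(p) for p in pages]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many pages contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Long pages are penalized relative to the average page length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / (avg_len or 1))
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical page texts, one string per page.
pages = [
    "a grappled creature has speed zero",
    "fireball deals fire damage in a radius",
    "the grappled condition ends if the grappler is incapacitated",
]
scores = bm25_scores("grappled speed", pages)
best_page = max(range(len(pages)), key=scores.__getitem__)
```

Nothing here needs a model or a network call, which is why this kind of search can run entirely inside the container.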

Quick Start

Prerequisites: Docker and Docker Compose. Ollama is optional and only needed for LLM-synthesized answers and semantic search.

git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
cp .env.example .env
./manage.sh start

Open http://localhost:8521.

Configure

Open .env and set your paths:

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=./data

# Directory to scan for existing PDFs/EPUBs (used by the Scan button)
PAGEPIPER_BOOKS_DIR=/path/to/your/documents

To unlock LLM synthesis and semantic search, add your Ollama endpoint:

PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
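As a rough illustration of how these variables could gate features, the sketch below reads them into a settings object. The variable names come from the examples above; the `PagepiperSettings` class, the `load_settings` helper, the `./data` fallback, and the enable/disable rules are assumptions for illustration, not Pagepiper's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PagepiperSettings:
    """Hypothetical container for the .env values shown above."""
    data_dir: str
    books_dir: Optional[str]
    ollama_url: Optional[str]
    chat_model: Optional[str]
    embed_model: Optional[str]

    @property
    def llm_enabled(self) -> bool:
        # Assumption: synthesis needs both an endpoint and a chat model.
        return bool(self.ollama_url and self.chat_model)

    @property
    def vector_search_enabled(self) -> bool:
        # Assumption: semantic search needs an endpoint and an embedding model.
        return bool(self.ollama_url and self.embed_model)


def load_settings(env: Mapping[str, str] = os.environ) -> PagepiperSettings:
    """Read settings from a mapping, treating blank values as unset."""
    def get(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return PagepiperSettings(
        data_dir=get("PAGEPIPER_DATA_DIR") or "./data",
        books_dir=get("PAGEPIPER_BOOKS_DIR"),
        ollama_url=get("PAGEPIPER_OLLAMA_URL"),
        chat_model=get("PAGEPIPER_CHAT_MODEL"),
        embed_model=get("PAGEPIPER_EMBED_MODEL"),
    )


keyword_only = load_settings({"PAGEPIPER_DATA_DIR": "./data"})
with_ollama = load_settings({
    "PAGEPIPER_OLLAMA_URL": "http://localhost:11434",
    "PAGEPIPER_CHAT_MODEL": "mistral:7b",
    "PAGEPIPER_EMBED_MODEL": "nomic-embed-text",
})
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.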

Add documents

Upload via browser — click Upload in the Library view. Files save to data/uploads/ and index automatically.

Scan a directory — set PAGEPIPER_BOOKS_DIR in .env, then click Scan. Pagepiper finds all files recursively and queues them.
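A recursive scan of this kind can be sketched in a few lines. The `find_documents` helper and the suffix set are hypothetical; the real scanner may filter differently or handle symlinks and hidden files in its own way.

```python
import tempfile
from pathlib import Path

# Assumption: only the two formats listed in this README are picked up.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(books_dir: str) -> list[Path]:
    """Return every PDF/EPUB under books_dir, searching subdirectories too."""
    root = Path(books_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "core").mkdir()
    (Path(tmp) / "core" / "rules.PDF").write_bytes(b"")
    (Path(tmp) / "novel.epub").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_bytes(b"")
    found = [p.relative_to(tmp).as_posix() for p in find_documents(tmp)]
```

Matching on the lowercased suffix means files such as `rules.PDF` are not skipped.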


Supported Formats

Format   Ingest   Page-level citations
PDF      Yes      Yes
EPUB     Yes      Yes (chapter/location)
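Page-level citations depend on every indexed chunk remembering where it came from. The sketch below shows one way that could look, using the `chunk_id`, `doc_id`, `page_number`, and `text` fields named in the sample-chunks endpoint description. The field types, the `titles` lookup, and both helper functions are assumptions; for EPUBs, `page_number` would stand in for a chapter or location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One retrievable chunk; the field types are assumed, not confirmed."""
    chunk_id: int
    doc_id: int
    page_number: int
    text: str


def format_citation(chunk: PageChunk, titles: dict[int, str]) -> str:
    """Render a human-readable source reference for a chunk."""
    title = titles.get(chunk.doc_id, f"document {chunk.doc_id}")
    return f"{title}, p. {chunk.page_number}"


def build_context(chunks: list[PageChunk], titles: dict[int, str]) -> str:
    """Label each retrieved chunk with its source so an answer can cite it."""
    return "\n\n".join(
        f"[{format_citation(c, titles)}]\n{c.text}" for c in chunks
    )


# Hypothetical library contents.
titles = {1: "Player's Handbook"}
hits = [
    PageChunk(chunk_id=10, doc_id=1, page_number=195, text="Grappling rules..."),
    PageChunk(chunk_id=11, doc_id=2, page_number=7, text="Errata..."),
]
context = build_context(hits, titles)
```

Because each chunk carries its own page number, a citation can be produced for plain BM25 results as well as for LLM-synthesized answers.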

Stack

Layer              Technology
Backend API        FastAPI + SQLite
Full-text search   BM25 (custom index, no external service)
Vector search      sqlite-vec + Ollama embeddings (optional)
LLM synthesis      Ollama (local, any model)
Frontend           Vue 3 SPA served by nginx
Deployment         Docker Compose
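This README does not say how the BM25 and vector result lists are merged for hybrid search. One common approach is reciprocal rank fusion (RRF), sketched here purely as an illustration: the function name, the chunk IDs, and the `k = 60` constant are assumptions, and Pagepiper may combine results differently.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks near the top.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists, best match first.
bm25_ranking = ["c3", "c1", "c7"]
vector_ranking = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF uses only rank positions, so BM25 scores and embedding distances never need to be put on a common scale.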

Default ports: Web UI 8521, API 8540.


Management

./manage.sh start          # Build and start
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (pass 'api' or 'web' to filter)
./manage.sh open           # Open UI in browser
./manage.sh build          # Rebuild images
./manage.sh test           # Run test suite

Tiers

Feature                             Free   Paid (BYOK)
PDF and EPUB upload                 Yes    Yes
Directory scan                      Yes    Yes
BM25 full-text search               Yes    Yes
Unlimited local ingestion           Yes    Yes
Hybrid BM25 + vector search         -      Yes (local Ollama)
LLM synthesis with page citations   -      Yes (local Ollama)
BYOK (bring your own key) here means you supply your own Ollama instance. No cloud API keys, no usage metering.


Forgejo-primary

Pagepiper is developed and hosted at git.opensourcesolarpunk.com/Circuit-Forge/pagepiper. GitHub mirrors exist for discoverability only. File issues and submit pull requests on Forgejo.


License

Pagepiper uses a split license:

  • MIT: Document ingest pipeline, BM25 full-text index, library management, EPUB support — the core discovery and retrieval layer.
  • BSL 1.1 (Business Source License): Hybrid vector search, LLM synthesis, RAG (retrieval-augmented generation) chat interface — free for personal non-commercial self-hosting; commercial use or SaaS re-hosting requires a license. Converts to MIT after four years.

A Circuit Forge LLC product. Privacy · Safety · Accessibility — co-equal, non-negotiable.