Self-hosted document library manager with BM25 keyword search and RAG chat with page-level citations

Pagepiper

v0.1.0 | Self-hosted PDF and EPUB search for your personal library

Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With Ollama configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.

Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.

Try it: pagepiper.circuitforge.tech


Features

| Feature                                       | Free tier | Paid (BYOK)        |
|-----------------------------------------------|-----------|--------------------|
| PDF and EPUB upload via browser drag-and-drop | Yes       | Yes                |
| Directory scan for existing files             | Yes       | Yes                |
| BM25 full-text search (no LLM required)       | Yes       | Yes                |
| Unlimited local ingestion                     | Yes       | Yes                |
| Hybrid BM25 + k-NN vector search              | No        | Yes (local Ollama) |
| LLM chat with page-level citations            | No        | Yes (local Ollama) |
| Thumbs up / down feedback on answers          | No        | Yes                |

BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.

BM25 (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. k-NN (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
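
To make the keyword-ranking idea concrete, here is a minimal Python sketch of the standard Okapi BM25 formula. It is not Pagepiper's BM25Index; the function names, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (assumption; a real index likely does more).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that never mentions a query term scores zero, and repeating the term raises the score with diminishing returns. That is why BM25 needs no model and no embeddings.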


Tech Stack

  • Backend: FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
  • Frontend: Vue 3 SPA served by nginx
  • Embedding model: nomic-embed-text via Ollama (1024-dim, optional)
  • Chat LLM: mistral:7b via Ollama (optional, any Ollama model works)
  • Deployment: Docker Compose
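
The vector side of the stack can be pictured as a brute-force nearest-neighbor search over embeddings. The sketch below is pure Python for illustration only: Pagepiper stores vectors via sqlite-vec, and the function names, passage IDs, and toy 2-dimensional vectors here are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn(query_vec: list[float], passages: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k passage IDs most similar to the query, best first."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in passages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

In a real deployment the query vector would come from the embedding model and the passage vectors from the index; only the ranking step is shown here.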

Quick Start (Self-Hosting)

Prerequisites

  • Docker and Docker Compose
  • PDFs or EPUBs you want to search
  • Optional: Ollama for semantic search and RAG (retrieval-augmented generation) chat

1. Clone the repo

git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper

2. Configure

cp .env.example .env

Open .env and set your paths:

# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data

To unlock hybrid search and LLM chat, uncomment and set the Ollama block:

PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text

3. Start

./manage.sh start

Open http://localhost:8521.

4. Add documents

Two ways to add files:

Upload via browser (easiest for small collections): Click Upload in the Library view and select a PDF or EPUB. The file saves to data/uploads/ and begins indexing automatically.

Scan a directory (best for large collections): Set PAGEPIPER_BOOKS_DIR in your .env to a folder of PDFs/EPUBs, then click Scan in the Library view. Pagepiper finds all files recursively and queues them for indexing.
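
A recursive scan like the one described above can be sketched in a few lines of Python. This is an assumption about the general approach, not Pagepiper's scanner; find_documents and SUPPORTED_SUFFIXES are hypothetical names.

```python
from pathlib import Path

# File extensions treated as indexable documents (assumed from the README).
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(books_dir: str) -> list[Path]:
    """Recursively collect PDF and EPUB files under books_dir, sorted by path."""
    root = Path(books_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Matching on the lowercased suffix picks up files named like BOOK.PDF as well as book.pdf.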

5. Search and chat

Switch to the Chat tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
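
Page-number citations depend on each retrieved passage carrying its source title and page into the prompt. The sketch below shows one way to assemble such a prompt; the Passage fields, the prompt wording, and the sample title and page number are assumptions, not Pagepiper's actual prompt.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    title: str  # document title
    page: int   # page the passage was extracted from
    text: str   # passage text


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Format retrieved passages as numbered, page-tagged sources for the LLM."""
    sources = "\n\n".join(
        f"[{i}] {p.title}, p. {p.page}\n{p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )
```

Because every source line includes its page, the UI can map a bracketed citation in the answer back to a specific page.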


Ollama Setup (optional)

Install Ollama from ollama.com, then pull the models:

ollama pull mistral:7b
ollama pull nomic-embed-text

On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:

OLLAMA_HOST=0.0.0.0 ollama serve

On Docker Desktop (Linux or Mac), host.docker.internal resolves automatically. No extra network config needed.
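
To check that Ollama is reachable at a given base URL and that both models are pulled, you can query Ollama's /api/tags endpoint, which lists local models. The helper names below are hypothetical, and this is a standalone diagnostic sketch rather than part of Pagepiper.

```python
import json
import urllib.request
from typing import Optional


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint."""
    return base_url.rstrip("/") + "/api/tags"


def list_ollama_models(base_url: str, timeout: float = 3.0) -> Optional[list[str]]:
    """Return the names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers
        # malformed URLs and invalid JSON.
        return None
    return [m["name"] for m in payload.get("models", [])]
```

Run it from the same network context as the API container; a None result usually means the URL or the listening interface is wrong.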


Environment Variables

| Variable                | Default          | Description                                          |
|-------------------------|------------------|------------------------------------------------------|
| PAGEPIPER_BOOKS_DIR     | ./books          | Host directory to scan for PDFs and EPUBs            |
| PAGEPIPER_DATA_DIR      | ./data           | SQLite index and uploaded files live here            |
| PAGEPIPER_OLLAMA_URL    | (unset)          | Ollama base URL; leave blank for BM25-only mode      |
| PAGEPIPER_EMBED_MODEL   | nomic-embed-text | Ollama embedding model (1024-dim default)            |
| PAGEPIPER_EMBED_DIMS    | 1024             | Must match the embedding model's output dimensions   |
| PAGEPIPER_CHAT_MODEL    | mistral:7b       | Ollama chat model; any Ollama model name works       |
| PAGEPIPER_CHAT_FEEDBACK | (unset)          | Set to true to enable thumbs up/down on chat answers |
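
The variables and defaults listed above could be read in Python roughly as follows. The Settings class and load_settings function are hypothetical; only the variable names and default values come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    books_dir: str
    data_dir: str
    ollama_url: Optional[str]
    embed_model: str
    embed_dims: int
    chat_model: str
    chat_feedback: bool

    @property
    def hybrid_enabled(self) -> bool:
        # Without an Ollama URL the app stays in BM25-only mode.
        return self.ollama_url is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read PAGEPIPER_* variables, falling back to the documented defaults."""
    return Settings(
        books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
        data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
        # A blank or missing URL is treated as unset.
        ollama_url=env.get("PAGEPIPER_OLLAMA_URL", "").strip() or None,
        embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
        embed_dims=int(env.get("PAGEPIPER_EMBED_DIMS", "1024")),
        chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
        chat_feedback=env.get("PAGEPIPER_CHAT_FEEDBACK", "").strip().lower() == "true",
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests.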

Management

./manage.sh start          # Build and start (dev)
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open           # Open the UI in your browser
./manage.sh build          # Rebuild images without cache

./manage.sh cloud:start    # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build

Cloud Managed Instance

The cloud deployment runs at pagepiper.circuitforge.tech and at menagerie.circuitforge.tech/pagepiper. It uses compose.cloud.yml with LLM inference routed through the cf-orch coordinator.

To run your own cloud-style deployment:

cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start

The cloud instance listens on port 8533. The API is internal-only; nginx proxies /api/ to the backend.


Data and Backups

The data/ directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
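
Copying a SQLite file while the app is writing to it can produce an inconsistent copy. A safer option is SQLite's online backup API, sketched below with Python's standard sqlite3 module. The function name is hypothetical, and the index database's file name inside data/ is not documented here, so both paths are left as parameters.

```python
import sqlite3


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies all pages consistently, even if the
        # source database is in use by another connection.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Uploaded files under data/uploads/ are ordinary files and can be copied with any regular backup tool.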

Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.


Licensing

Pagepiper uses a split license:

  • MIT: BM25 full-text search, document library management, ingest pipeline, EPUB support
  • BSL (Business Source License) 1.1: Hybrid vector search (embedding + k-NN), RAG chat, LLM integration

BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.

License keys: circuitforge.tech


Contributing

Issues and PRs welcome at git.opensourcesolarpunk.com/Circuit-Forge/pagepiper.

The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.