# Pagepiper

**Search your document library. Get answers with exact page citations.**

[![Status](https://img.shields.io/badge/status-beta-blue)](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper)
[![License: MIT / BSL 1.1](https://img.shields.io/badge/license-MIT%20%2F%20BSL%201.1-blue)](LICENSE)
[![Version](https://img.shields.io/badge/version-v0.1.0-orange)](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper/releases)

Self-hosted PDF and EPUB search with BM25 (Best Match 25) full-text indexing and LLM (large language model) synthesis. Drop your documents in, ask a question, get an answer that tells you exactly which page to turn to.

Built for TTRPG (tabletop roleplaying game) players who are tired of ctrl-F'ing through six-hundred-page rulebooks. Works equally well for legal research, technical manuals, academic papers, or any personal document library you want to query in plain language.

No cloud required. Your files stay on your machine.

---
## Screenshots
### Library
![Library view — documents listed with ingest status and page counts](docs/screenshots/01-library.png)
### Chat with citations
![Chat view — answer with source document and page number for every claim](docs/screenshots/02-chat.png)
---
## Why Pagepiper?
- **Your library, not ours.** Documents are indexed and stored locally. Nothing is sent to a third-party service unless you explicitly configure a cloud LLM.
- **Works without an LLM.** BM25 full-text search runs entirely inside the Docker container. No Ollama, no API key, no GPU required for keyword search.
- **Answers cite their sources.** Every LLM response includes the document name and page number it drew from. You can verify or dispute every answer.
- **Hybrid search when you want it.** Connect a local Ollama instance to unlock semantic (vector) search that finds relevant passages even when your question doesn't use the exact words in the text.
- **Open ingest pipeline.** The indexing and search layer is MIT-licensed. Add support for a new format or improve the PDF parser, and the whole community benefits directly.
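
To make the keyword-search and page-citation points above concrete, here is a minimal sketch of BM25 scoring over individual pages. It is not Pagepiper's actual index, which is not shown in this README. It assumes the standard Okapi BM25 formula with the Lucene-style IDF, the common defaults `k1=1.5` and `b=0.75`, and a naive tokenizer; `tokenize` and `bm25_search` are hypothetical names. Each hit keeps its document name and page number, which is what makes a citation possible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (naive; an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(pages, query, k1=1.5, b=0.75):
    """Score every page against the query.

    pages: list of (document, page_number, text) tuples.
    Returns [(score, document, page_number)], best match first.
    """
    if not pages:
        return []
    tokenized = [tokenize(text) for _, _, text in pages]
    n = len(pages)
    avg_len = sum(len(tokens) for tokens in tokenized) / n

    # Document frequency: on how many pages each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    results = []
    for (doc, page, _), tokens in zip(pages, tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            results.append((score, doc, page))
    return sorted(results, reverse=True)


# Usage with made-up sample pages: each hit carries its citation.
sample_pages = [
    ("Rulebook", 12, "Grappling rules: a grapple check is an opposed roll"),
    ("Rulebook", 57, "Spell slots recover after a long rest"),
    ("Bestiary", 3, "The owlbear can grapple a target on a hit"),
]
for score, doc, page in bm25_search(sample_pages, "grapple check"):
    print(f"{doc}, p. {page} (score {score:.2f})")
```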
---
## Quick Start
**Prerequisites:** [Docker](https://docs.docker.com/get-docker/) and Docker Compose. Optionally, [Ollama](https://ollama.com) for LLM-synthesized answers and semantic search.
```bash
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
cp .env.example .env
./manage.sh start
```
Open [http://localhost:8521](http://localhost:8521).
### Configure
Open `.env` and set your paths:
```dotenv
# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=./data
# Directory to scan for existing PDFs/EPUBs (used by the Scan button)
PAGEPIPER_BOOKS_DIR=/path/to/your/documents
```
To unlock LLM synthesis and semantic search, add your Ollama endpoint:
```dotenv
PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
```
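
One way to sanity-check these settings before starting the stack is to read them back and compare the configured models with what the Ollama server reports. The sketch below is a standalone script, not part of Pagepiper. `read_env`, `with_tag`, `missing_models`, and `installed_models` are hypothetical helper names. It assumes Ollama's `GET /api/tags` endpoint, which lists locally pulled models, and that untagged model names resolve to the `:latest` tag.

```python
import json
import urllib.request


def read_env(text):
    """Parse KEY=VALUE lines from .env-style text, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def with_tag(name):
    """Treat an untagged model name as ':latest' (assumed Ollama convention)."""
    return name if ":" in name else name + ":latest"


def missing_models(settings, installed):
    """Return the configured chat/embed models that are not in the installed list."""
    wanted = [
        settings.get("PAGEPIPER_CHAT_MODEL"),
        settings.get("PAGEPIPER_EMBED_MODEL"),
    ]
    have = {with_tag(name) for name in installed}
    return [model for model in wanted if model and with_tag(model) not in have]


def installed_models(base_url, timeout=5):
    """Ask a running Ollama server which models it has pulled (needs network access)."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return [model["name"] for model in json.load(response)["models"]]
```

With Ollama running, `missing_models(settings, installed_models(settings["PAGEPIPER_OLLAMA_URL"]))` should return an empty list once both models have been pulled, where `settings` comes from `read_env(open(".env").read())`.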
### Add documents
**Upload via browser:** click **Upload** in the Library view. Files save to `data/uploads/` and index automatically.

**Scan a directory:** set `PAGEPIPER_BOOKS_DIR` in `.env`, then click **Scan**. Pagepiper finds all files recursively and queues them.
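
To preview roughly what a scan would pick up, a recursive walk like the one below can help. This is a sketch, not Pagepiper's scanner: `find_documents` is a hypothetical name, and matching on the `.pdf` and `.epub` extensions (case-insensitively) is an assumption based on the formats this README lists as supported.

```python
from pathlib import Path

# Assumed from the supported formats; the real scanner may match files differently.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(books_dir):
    """Recursively collect PDF and EPUB files under books_dir, sorted by path."""
    root = Path(books_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Usage: pass the same directory you set as PAGEPIPER_BOOKS_DIR, for example
# for path in find_documents("/path/to/your/documents"):
#     print(path)
```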
---
## Supported Formats
| Format | Ingest | Page-level citations |
|--------|--------|----------------------|
| PDF | Yes | Yes |
| EPUB | Yes | Yes (chapter/location) |
---
## Stack
| Layer | Technology |
|-------|-----------|
| Backend API | FastAPI + SQLite |
| Full-text search | BM25 (custom index, no external service) |
| Vector search | sqlite-vec + Ollama embeddings (optional) |
| LLM synthesis | Ollama (local, any model) |
| Frontend | Vue 3 SPA served by nginx |
| Deployment | Docker Compose |

Default ports: Web UI `8521`, API `8540`.
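
The stack pairs a BM25 index with optional vector search, but this README does not say how the two result lists are merged. Reciprocal rank fusion (RRF) is one common technique for combining rankings, shown here purely as an illustration and not as Pagepiper's method. `reciprocal_rank_fusion` is a hypothetical name, `k=60` is the constant conventionally used with RRF, and the sample rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings into one.

    rankings: list of lists; each inner list holds hashable items
              (here, (document, page) pairs) ordered best first.
    An item earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by the item's text form.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


# Usage with made-up results from a keyword ranking and a vector ranking.
bm25_hits = [("Rulebook", 12), ("Bestiary", 3), ("Rulebook", 57)]
vector_hits = [("Bestiary", 3), ("Rulebook", 88), ("Rulebook", 12)]
for document, page in reciprocal_rank_fusion([bm25_hits, vector_hits]):
    print(f"{document}, p. {page}")
```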
---
## Management
```bash
./manage.sh start # Build and start
./manage.sh stop # Stop
./manage.sh restart # Restart
./manage.sh status # Show container status
./manage.sh logs [svc] # Tail logs (pass 'api' or 'web' to filter)
./manage.sh open # Open UI in browser
./manage.sh build # Rebuild images
./manage.sh test # Run test suite
```
---
## Tiers
| Feature | Free | Paid (BYOK) |
|---------|------|-------------|
| PDF and EPUB upload | Yes | Yes |
| Directory scan | Yes | Yes |
| BM25 full-text search | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + vector search | — | Yes (local Ollama) |
| LLM synthesis with page citations | — | Yes (local Ollama) |
BYOK (bring your own key) here means you supply your own Ollama instance rather than an API key. No cloud API keys, no usage metering.

---
## Forgejo-primary
Pagepiper is developed and hosted at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper). GitHub mirrors exist for discoverability only. File issues and submit pull requests on Forgejo.

---
## License
Pagepiper uses a split license:
- **MIT:** Document ingest pipeline, BM25 full-text index, library management, and EPUB support. This is the core discovery and retrieval layer.
- **BSL 1.1 (Business Source License):** Hybrid vector search, LLM synthesis, and the RAG (retrieval-augmented generation) chat interface. Free for personal, non-commercial self-hosting; commercial use or SaaS re-hosting requires a license. Converts to MIT after four years.
---
*A [Circuit Forge LLC](https://circuitforge.tech) product. Privacy · Safety · Accessibility — co-equal, non-negotiable.*