docs(readme): landing page rewrite — screenshots, quick start, formats table, tiers, Forgejo-primary, split license

pyr0ball 2026-05-06 08:51:38 -07:00
parent b105a0fc14
commit 895d0b6129

191
README.md

@@ -1,75 +1,66 @@
# Pagepiper
**v0.1.0** | Self-hosted PDF and EPUB search for your personal library
**Search your document library. Get answers with exact page citations.**
Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With [Ollama](https://ollama.com) configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.
[![License: MIT / BSL 1.1](https://img.shields.io/badge/license-MIT%20%2F%20BSL%201.1-blue)](LICENSE)
[![Version](https://img.shields.io/badge/version-v0.1.0-orange)](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper/releases)
Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.
Self-hosted PDF and EPUB search with BM25 (Best Match 25) full-text indexing and LLM (large language model) synthesis. Drop your documents in, ask a question, get an answer that tells you exactly which page to turn to.
Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)
Built for TTRPG (tabletop roleplaying game) players who are tired of ctrl-F'ing through six-hundred-page rulebooks. Works equally well for legal research, technical manuals, academic papers, or any personal document library you want to query in plain language.
No cloud required. Your files stay on your machine.
---
## Features
## Screenshots
| Feature | Free tier | Paid (BYOK) |
|---------|-----------|-------------|
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
| Directory scan for existing files | Yes | Yes |
| BM25 full-text search (no LLM required) | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
| LLM chat with page-level citations | No | Yes (local Ollama) |
| Thumbs up / down feedback on answers | No | Yes |
### Library
BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.
![Library view — documents listed with ingest status and page counts](docs/screenshots/01-library.png)
**BM25** (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. **k-NN** (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
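To make the keyword-ranking idea concrete, here is a minimal sketch of BM25 scoring in Python. It is not Pagepiper's `BM25Index`, whose code this README does not show. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the whitespace tokenization are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; assumes `docs` is non-empty.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Smoothed IDF: rare terms weigh more, and the value stays positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more hits to score as high.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


pages = [
    "a flanking creature takes a penalty to armor class".split(),
    "the wizard prepares spells each morning".split(),
    "flanking requires two allies on opposite sides".split(),
]
scores = bm25_scores("flanking penalty".split(), pages)
best = max(range(len(pages)), key=scores.__getitem__)
print("best page index:", best)
```

The first page matches both query terms, so it ranks highest. The second page matches neither and scores zero, which is why pure keyword search misses passages that use different wording.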
### Chat with citations
![Chat view — answer with source document and page number for every claim](docs/screenshots/02-chat.png)
---
## Tech Stack
## Why Pagepiper?
- **Backend:** FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
- **Frontend:** Vue 3 SPA served by nginx
- **Embedding model:** `nomic-embed-text` via Ollama (1024-dim, optional)
- **Chat LLM:** `mistral:7b` via Ollama (optional, any Ollama model works)
- **Deployment:** Docker Compose
- **Your library, not ours.** Documents are indexed and stored locally. Nothing is sent to a third-party service unless you explicitly configure a cloud LLM.
- **Works without an LLM.** BM25 full-text search runs entirely inside the Docker container. No Ollama, no API key, no GPU required for keyword search.
- **Answers cite their sources.** Every LLM response includes the document name and page number it drew from. You can verify or dispute every answer.
- **Hybrid search when you want it.** Connect a local Ollama instance to unlock semantic (vector) search that finds relevant passages even when your question doesn't use the exact words in the text.
- **Open ingest pipeline.** The indexing and search layer is MIT-licensed. Add support for new formats or improve the PDF parser, and the community benefits directly.
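The hybrid search point above raises the question of how a keyword ranking and a vector ranking can be merged. The README does not say how Pagepiper does it, so the sketch below shows reciprocal rank fusion, one common technique, purely as an assumption. The function name, the constant `k=60`, and the page identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in, so items
    ranked well by both keyword and vector search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, then by ID so that ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
bm25_hits = ["core-p212", "core-p98", "bestiary-p14"]
vector_hits = ["core-p98", "gm-guide-p40", "core-p212"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Rank fusion only needs the order of each list, not the raw scores, which is convenient because BM25 scores and vector distances are not on comparable scales.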
---
## Quick Start (Self-Hosting)
## Quick Start
### Prerequisites
- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
- PDFs or EPUBs you want to search
- Optional: [Ollama](https://ollama.com) for semantic search and RAG (retrieval-augmented generation) chat
### 1. Clone the repo
**Prerequisites:** [Docker](https://docs.docker.com/get-docker/) and Docker Compose. [Ollama](https://ollama.com) is optional and enables LLM-synthesized answers.
```bash
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
```
### 2. Configure
```bash
cp .env.example .env
./manage.sh start
```
Open [http://localhost:8521](http://localhost:8521).
### Configure
Open `.env` and set your paths:
```dotenv
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs
# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data
PAGEPIPER_DATA_DIR=./data
# Directory to scan for existing PDFs/EPUBs (used by the Scan button)
PAGEPIPER_BOOKS_DIR=/path/to/your/documents
```
To unlock hybrid search and LLM chat, uncomment and set the Ollama block:
To unlock LLM synthesis and semantic search, add your Ollama endpoint:
```dotenv
PAGEPIPER_OLLAMA_URL=http://localhost:11434
@@ -77,121 +68,81 @@ PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
```
### 3. Start
### Add documents
```bash
./manage.sh start
```
**Upload via browser** — click **Upload** in the Library view. Files save to `data/uploads/` and index automatically.
Open [http://localhost:8521](http://localhost:8521).
### 4. Add documents
Two ways to add files:
**Upload via browser** (easiest for small collections): Click **Upload** in the Library view and select a PDF or EPUB. The file saves to `data/uploads/` and begins indexing automatically.
**Scan a directory** (best for large collections): Set `PAGEPIPER_BOOKS_DIR` in your `.env` to a folder of PDFs/EPUBs, then click **Scan** in the Library view. Pagepiper finds all files recursively and queues them for indexing.
### 5. Search and chat
Switch to the **Chat** tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
**Scan a directory** — set `PAGEPIPER_BOOKS_DIR` in `.env`, then click **Scan**. Pagepiper finds all files recursively and queues them.
---
## Ollama Setup (optional)
## Supported Formats
Install Ollama from [ollama.com](https://ollama.com), then pull the models:
```bash
ollama pull mistral:7b
ollama pull nomic-embed-text
```
On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```
On Docker Desktop (Linux or Mac), `host.docker.internal` resolves automatically. No extra network config needed.
| Format | Ingest | Page-level citations |
|--------|--------|----------------------|
| PDF | Yes | Yes |
| EPUB | Yes | Yes (chapter/location) |
---
## Environment Variables
## Stack
| Variable | Default | Description |
|----------|---------|-------------|
| `PAGEPIPER_BOOKS_DIR` | `./books` | Host directory to scan for PDFs and EPUBs |
| `PAGEPIPER_DATA_DIR` | `./data` | SQLite index and uploaded files live here |
| `PAGEPIPER_OLLAMA_URL` | *(unset)* | Ollama base URL; leave blank for BM25-only mode |
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model (1024-dim default) |
| `PAGEPIPER_EMBED_DIMS` | `1024` | Must match the embedding model's output dimensions |
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat model; any Ollama model name works |
| `PAGEPIPER_CHAT_FEEDBACK` | *(unset)* | Set to `true` to enable thumbs up/down on chat answers |
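As a rough illustration of how the variables in this table fit together, the sketch below reads them from the environment with the listed defaults. It is not Pagepiper's configuration code. The `Settings` class and `load_settings` function are hypothetical names, and the defaults are copied from the table as written, so check `PAGEPIPER_EMBED_DIMS` against the embedding model you actually pull.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables in the table above."""

    books_dir: str
    data_dir: str
    ollama_url: Optional[str]
    embed_model: str
    embed_dims: int
    chat_model: str
    chat_feedback: bool

    @property
    def bm25_only(self) -> bool:
        # With no Ollama URL there are no embeddings and no chat model.
        return self.ollama_url is None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    # Treat an unset or blank URL the same way: BM25-only mode.
    ollama_url = env.get("PAGEPIPER_OLLAMA_URL", "").strip() or None
    return Settings(
        books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
        data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
        ollama_url=ollama_url,
        embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
        embed_dims=int(env.get("PAGEPIPER_EMBED_DIMS", "1024")),
        chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
        chat_feedback=env.get("PAGEPIPER_CHAT_FEEDBACK", "").strip().lower() == "true",
    )


defaults = load_settings({})
print("BM25-only mode:", defaults.bm25_only)
```

Passing the mapping explicitly, for example `load_settings({"PAGEPIPER_OLLAMA_URL": "http://localhost:11434"})`, makes it easy to check a configuration without touching the real environment.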
| Layer | Technology |
|-------|-----------|
| Backend API | FastAPI + SQLite |
| Full-text search | BM25 (custom index, no external service) |
| Vector search | sqlite-vec + Ollama embeddings (optional) |
| LLM synthesis | Ollama (local, any model) |
| Frontend | Vue 3 SPA served by nginx |
| Deployment | Docker Compose |
Default ports: Web UI `8521`, API `8540`.
---
## Management
```bash
./manage.sh start # Build and start (dev)
./manage.sh start # Build and start
./manage.sh stop # Stop
./manage.sh restart # Restart
./manage.sh status # Show container status
./manage.sh logs [svc] # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open # Open the UI in your browser
./manage.sh build # Rebuild images without cache
./manage.sh cloud:start # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build
./manage.sh logs [svc] # Tail logs (pass 'api' or 'web' to filter)
./manage.sh open # Open UI in browser
./manage.sh build # Rebuild images
./manage.sh test # Run test suite
```
---
## Cloud Managed Instance
## Tiers
The cloud deployment runs at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) and at `menagerie.circuitforge.tech/pagepiper`. It uses `compose.cloud.yml` with LLM inference routed through the cf-orch coordinator.
| Feature | Free | Paid (BYOK) |
|---------|------|-------------|
| PDF and EPUB upload | Yes | Yes |
| Directory scan | Yes | Yes |
| BM25 full-text search | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + vector search | — | Yes (local Ollama) |
| LLM synthesis with page citations | — | Yes (local Ollama) |
To run your own cloud-style deployment:
```bash
cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start
```
Cloud instance listens on port 8533. The API is internal-only; nginx proxies `/api/` to the backend.
BYOK means you supply your own Ollama instance. No cloud API keys, no usage metering.
---
## Data and Backups
## Forgejo-primary
The `data/` directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.
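One way to back up the `data/` directory is to archive it as a whole. The sketch below is an assumption, not a tool shipped with Pagepiper, and `backup_data_dir` and the archive naming are hypothetical. Because the directory holds a live SQLite database, it is safer to run `./manage.sh stop` first so that no write is in progress while the files are copied.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir, dest_dir):
    """Archive `data_dir` into a timestamped .tar.gz inside `dest_dir`.

    Hypothetical helper; returns the path of the archive it created.
    """
    data_dir = Path(data_dir).resolve()
    dest_dir = Path(dest_dir).resolve()
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data directory not found: {data_dir}")
    # Writing the archive into the directory being archived would recurse.
    if dest_dir == data_dir or data_dir in dest_dir.parents:
        raise ValueError("dest_dir must be outside data_dir")

    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = dest_dir / f"pagepiper-data-{stamp}"
    # root_dir/base_dir keep paths inside the archive relative ("data/...").
    archive = shutil.make_archive(
        str(base_name),
        "gztar",
        root_dir=str(data_dir.parent),
        base_dir=data_dir.name,
    )
    return Path(archive)


# Example call (paths are placeholders):
# backup_data_dir("./data", "./backups")
```

Restoring is the reverse: stop the containers, unpack the archive next to the compose file, and start again.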
Pagepiper is developed and hosted at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper). GitHub mirrors exist for discoverability only. File issues and submit pull requests on Forgejo.
---
## Licensing
## License
Pagepiper uses a split license:
- **MIT:** BM25 full-text search, document library management, ingest pipeline, EPUB support
- **BSL 1.1:** Hybrid vector search (embedding + k-NN), RAG chat, LLM integration
BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.
License keys: [circuitforge.tech](https://circuitforge.tech)
- **MIT:** Document ingest pipeline, BM25 full-text index, library management, and EPUB support. This is the core discovery and retrieval layer.
- **BSL 1.1 (Business Source License):** Hybrid vector search, LLM synthesis, and the RAG (retrieval-augmented generation) chat interface. Free for personal non-commercial self-hosting; commercial use or SaaS re-hosting requires a license. Converts to MIT after four years.
---
## Contributing
Issues and PRs welcome at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper).
The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.
*A [Circuit Forge LLC](https://circuitforge.tech) product. Privacy · Safety · Accessibility — co-equal, non-negotiable.*