# Pagepiper

**v0.1.0** | Self-hosted PDF and EPUB search for your personal library

Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With [Ollama](https://ollama.com) configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.

Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.

Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)

---

## Features

| Feature | Free tier | Paid (BYOK) |
|---------|-----------|-------------|
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
| Directory scan for existing files | Yes | Yes |
| BM25 full-text search (no LLM required) | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
| LLM chat with page-level citations | No | Yes (local Ollama) |
| Thumbs up / down feedback on answers | No | Yes |

BYOK (bring your own key) here means you supply your own Ollama instance. There are no cloud API keys and no usage billing.

**BM25** (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. **k-NN** (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
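
To make the keyword ranking concrete, here is a minimal BM25 scorer. It is an illustrative sketch, not Pagepiper's `BM25Index`: the function names, the whitespace tokenizer, the sample passages, and the `k1`/`b` values (common textbook defaults) are all assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms weigh more, and the value stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical passages, standing in for indexed pages.
passages = [
    "a fireball deals fire damage in a burst",
    "grapple checks use your athletics modifier",
    "fire resistance reduces fire damage taken",
]
scores = bm25_scores("fire damage", passages)
ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [2, 0, 1]
```

The third passage ranks first because it repeats "fire" and is shorter than average. The second passage shares no query terms and scores zero, which is the gap vector search is meant to close.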

---

## Tech Stack

- **Backend:** FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
- **Frontend:** Vue 3 SPA served by nginx
- **Embedding model:** `nomic-embed-text` via Ollama (1024-dim, optional)
- **Chat LLM:** `mistral:7b` via Ollama (optional, any Ollama model works)
- **Deployment:** Docker Compose

---

## Quick Start (Self-Hosting)

### Prerequisites

- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
- PDFs or EPUBs you want to search
- Optional: [Ollama](https://ollama.com) for semantic search and RAG (retrieval-augmented generation) chat

### 1. Clone the repo

```bash
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
```

### 2. Configure

```bash
cp .env.example .env
```

Open `.env` and set your paths:

```dotenv
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data
```

To unlock hybrid search and LLM chat, uncomment and set the Ollama block:

```dotenv
PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
```

### 3. Start

```bash
./manage.sh start
```

Open [http://localhost:8521](http://localhost:8521).

### 4. Add documents

Two ways to add files:

**Upload via browser** (easiest for small collections): Click **Upload** in the Library view and select a PDF or EPUB. The file is saved to `data/uploads/` and indexing begins automatically.

**Scan a directory** (best for large collections): Set `PAGEPIPER_BOOKS_DIR` in your `.env` to a folder of PDFs/EPUBs, then click **Scan** in the Library view. Pagepiper finds all files recursively and queues them for indexing.
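
A recursive scan of this kind can be sketched in a few lines. This is a rough illustration rather than Pagepiper's actual scanner: `find_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and case-insensitive extension matching is an assumption.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; the README only names PDF and EPUB.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(books_dir: str) -> list[Path]:
    """Recursively collect PDF and EPUB files under books_dir, sorted by path."""
    root = Path(books_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pathfinder").mkdir()
    (root / "pathfinder" / "core.PDF").touch()
    (root / "fic.epub").touch()
    (root / "notes.txt").touch()  # ignored: unsupported extension
    found = [p.relative_to(root).as_posix() for p in find_documents(tmp)]

print(found)  # ['fic.epub', 'pathfinder/core.PDF']
```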

### 5. Search and chat

Switch to the **Chat** tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
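
This README does not say how Pagepiper merges keyword and vector results. One common approach is reciprocal rank fusion (RRF), sketched below with a brute-force cosine k-NN. Everything here is an assumption for illustration: the function names, the passage ids, the two-dimensional toy embeddings, and the constant `c = 60`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn(query_vec: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k stored vectors most similar to query_vec."""
    by_similarity = sorted(
        vectors, key=lambda pid: cosine(query_vec, vectors[pid]), reverse=True
    )
    return by_similarity[:k]


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked id lists; each list contributes 1 / (c + rank) per id."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical passage ids ("book-pPAGE") with toy 2-D embeddings.
vectors = {
    "core-p301": [0.9, 0.1],
    "core-p112": [0.1, 0.9],
    "gm-p45": [0.7, 0.3],
    "core-p87": [0.8, 0.2],
}
bm25_ranking = ["core-p301", "core-p112", "gm-p45"]  # pretend keyword results
knn_ranking = knn([1.0, 0.0], vectors, k=3)

fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
print(knn_ranking)  # ['core-p301', 'core-p87', 'gm-p45']
print(fused[:2])    # ['core-p301', 'gm-p45']
```

Passages that appear in both lists rise to the top, while a passage found by only one method still makes the merged list. Keeping the page number in each passage id is one simple way to carry citations through to the answer.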

---

## Ollama Setup (optional)

Install Ollama from [ollama.com](https://ollama.com), then pull the models:

```bash
ollama pull mistral:7b
ollama pull nomic-embed-text
```

On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:

```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```

On Docker Desktop (Mac, Windows, or Linux), `host.docker.internal` resolves to the host machine automatically. No extra network config needed.
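
Before starting Pagepiper, it can help to confirm that Ollama is reachable and that both models are installed. The sketch below uses Ollama's `GET /api/tags` endpoint, which lists local models. The helper names are hypothetical, and treating a name without a tag as `:latest` is an assumption about how you pulled the models. The demo runs against a sample payload so that it works without a running server.

```python
import json
import urllib.request


def fetch_tags(base_url: str, timeout: float = 5.0) -> dict:
    """Call Ollama's model-listing endpoint and return the decoded JSON."""
    url = f"{base_url.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def with_default_tag(name: str) -> str:
    """Treat 'nomic-embed-text' as 'nomic-embed-text:latest'."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags: dict, required: list[str]) -> list[str]:
    """Return the required model names that are absent from a tags payload."""
    installed = {with_default_tag(m["name"]) for m in tags.get("models", [])}
    return [name for name in required if with_default_tag(name) not in installed]


# Sample payload in the shape /api/tags returns (trimmed to the name field).
sample_tags = {"models": [{"name": "mistral:7b"}, {"name": "nomic-embed-text:latest"}]}

print(missing_models(sample_tags, ["mistral:7b", "nomic-embed-text"]))  # []
print(missing_models(sample_tags, ["llama3:8b"]))                       # ['llama3:8b']
```

Against a live server, call `missing_models(fetch_tags("http://localhost:11434"), ["mistral:7b", "nomic-embed-text"])`, substituting whatever you set in `PAGEPIPER_OLLAMA_URL`. A connection error from `fetch_tags` usually means Ollama is not listening on the address you used.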

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PAGEPIPER_BOOKS_DIR` | `./books` | Host directory to scan for PDFs and EPUBs |
| `PAGEPIPER_DATA_DIR` | `./data` | SQLite index and uploaded files live here |
| `PAGEPIPER_OLLAMA_URL` | *(unset)* | Ollama base URL; leave blank for BM25-only mode |
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model (1024-dim default) |
| `PAGEPIPER_EMBED_DIMS` | `1024` | Must match the embedding model's output dimensions; update it if you change `PAGEPIPER_EMBED_MODEL` |
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat model; any Ollama model name works |
| `PAGEPIPER_CHAT_FEEDBACK` | *(unset)* | Set to `true` to enable thumbs up/down on chat answers |
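
The table above can be read as a small settings object. The sketch below mirrors its variable names and defaults; the `Settings` class, `load_settings`, and the rule that only the literal string `true` enables feedback are assumptions, not Pagepiper's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    books_dir: str
    data_dir: str
    ollama_url: Optional[str]
    embed_model: str
    embed_dims: int
    chat_model: str
    chat_feedback: bool

    @property
    def bm25_only(self) -> bool:
        """Without an Ollama URL, only keyword search is available."""
        return self.ollama_url is None


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
        data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
        # A blank value is treated the same as an unset one.
        ollama_url=env.get("PAGEPIPER_OLLAMA_URL") or None,
        embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
        embed_dims=int(env.get("PAGEPIPER_EMBED_DIMS", "1024")),
        chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
        chat_feedback=env.get("PAGEPIPER_CHAT_FEEDBACK", "").strip().lower() == "true",
    )


defaults = load_settings({})
with_ollama = load_settings({
    "PAGEPIPER_OLLAMA_URL": "http://localhost:11434",
    "PAGEPIPER_CHAT_FEEDBACK": "true",
})
print(defaults.bm25_only, with_ollama.bm25_only)  # True False
```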

---

## Management

```bash
./manage.sh start          # Build and start (dev)
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open           # Open the UI in your browser
./manage.sh build          # Rebuild images without cache

./manage.sh cloud:start    # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build
```

---

## Cloud Managed Instance

The cloud deployment runs at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) and at `menagerie.circuitforge.tech/pagepiper`. It uses `compose.cloud.yml` with LLM inference routed through the cf-orch coordinator.

To run your own cloud-style deployment:

```bash
cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start
```

The cloud instance listens on port 8533. The API is internal-only; nginx proxies `/api/` to the backend.

---

## Data and Backups

The `data/` directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
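
For the backup itself, copying a SQLite file while the application is writing to it can capture a half-finished state. SQLite's online backup API, exposed in Python as `sqlite3.Connection.backup`, takes a consistent snapshot instead. The sketch below demonstrates it on a throwaway database; the file name `pagepiper.db` and the `docs` table are hypothetical, since this README does not name the index file or its schema.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_sqlite(src: Path, dest: Path) -> None:
    """Copy a SQLite database using the online backup API."""
    source = sqlite3.connect(src)
    target = sqlite3.connect(dest)
    try:
        source.backup(target)  # consistent snapshot, even with concurrent writers
    finally:
        target.close()
        source.close()


with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "pagepiper.db"         # hypothetical index file name
    copy = Path(tmp) / "pagepiper-backup.db"

    # Create a stand-in database with one row.
    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE docs (title TEXT)")
    conn.execute("INSERT INTO docs VALUES ('Core Rulebook')")
    conn.commit()
    conn.close()

    backup_sqlite(live, copy)

    # Read the row back from the copy to confirm the snapshot is usable.
    check = sqlite3.connect(copy)
    restored = check.execute("SELECT title FROM docs").fetchall()
    check.close()

print(restored)  # [('Core Rulebook',)]
```

Uploaded files under `data/uploads/` are ordinary files, so any file-level copy tool handles them. Stopping the containers with `./manage.sh stop` before copying the whole directory is the simplest alternative.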

Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.

---

## Licensing

Pagepiper uses a split license:

- **MIT:** BM25 full-text search, document library management, ingest pipeline, EPUB support
- **BSL 1.1 (Business Source License):** Hybrid vector search (embedding + k-NN), RAG chat, LLM integration

BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.

License keys: [circuitforge.tech](https://circuitforge.tech)

---

## Contributing

Issues and PRs welcome at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper).

The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.