Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
instead of post-filtering an already-small global pool (fixes wrong-book
results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
snippet 200 → 400 chars
Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
allocate() fires and keeps the Ollama model warm between requests
Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
two-phase progress: indeterminate animation during extraction,
determinate "Embedding N/M pages" bar once vectors start landing
Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
# Pagepiper

**v0.1.0** | Self-hosted PDF and EPUB search for your personal library

Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With [Ollama](https://ollama.com) configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.

Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.

Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)

---

## Features

| Feature | Free tier | Paid (BYOK) |
|---------|-----------|-------------|
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
| Directory scan for existing files | Yes | Yes |
| BM25 full-text search (no LLM required) | Yes | Yes |
| Unlimited local ingestion | Yes | Yes |
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
| LLM chat with page-level citations | No | Yes (local Ollama) |
| Thumbs up / down feedback on answers | No | Yes |

BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.

**BM25** (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. **k-NN** (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
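
To make the two ranking ideas concrete, here is a minimal Python sketch of each. It is an illustration only, not Pagepiper's `BM25Index` or its sqlite-vec queries: the function names `bm25_scores` and `knn`, the whitespace tokenizer, the `k1`/`b` values, and the toy 3-dimensional vectors are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each text in `docs` against `query` with Okapi BM25.

    Hypothetical helper: tokenization is a lowercase whitespace split.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def knn(query_vec, vectors, k=2):
    """Return the indices of the k vectors closest to query_vec by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query_vec, vectors[i]),
                   reverse=True)
    return order[:k]


pages = [
    "a flanking creature takes a penalty to armor class",
    "wizards prepare spells from a spellbook each morning",
    "the flanking rules apply to melee attacks only",
]
keyword_scores = bm25_scores("flanking penalty", pages)

# Toy vectors standing in for real embeddings.
nearest = knn([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
```

BM25 gives the second page a score of zero because it shares no words with the query, while the k-NN lookup ranks by vector direction and never looks at the words at all. That difference is why a hybrid of the two can surface passages that either method alone would miss.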

---

## Tech Stack

- **Backend:** FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
- **Frontend:** Vue 3 SPA served by nginx
- **Embedding model:** `nomic-embed-text` via Ollama (1024-dim, optional)
- **Chat LLM:** `mistral:7b` via Ollama (optional, any Ollama model works)
- **Deployment:** Docker Compose

---

## Quick Start (Self-Hosting)

### Prerequisites

- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
- PDFs or EPUBs you want to search
- Optional: [Ollama](https://ollama.com) for semantic search and RAG (retrieval-augmented generation) chat

### 1. Clone the repo

```bash
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
cd pagepiper
```

### 2. Configure

```bash
cp .env.example .env
```

Open `.env` and set your paths:

```dotenv
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs

# Where Pagepiper stores its SQLite index and uploaded files
PAGEPIPER_DATA_DIR=data
```

To unlock hybrid search and LLM chat, uncomment and set the Ollama block:

```dotenv
PAGEPIPER_OLLAMA_URL=http://localhost:11434
PAGEPIPER_CHAT_MODEL=mistral:7b
PAGEPIPER_EMBED_MODEL=nomic-embed-text
```

### 3. Start

```bash
./manage.sh start
```

Open [http://localhost:8521](http://localhost:8521).

### 4. Add documents

Two ways to add files:

**Upload via browser** (easiest for small collections): Click **Upload** in the Library view and select a PDF or EPUB. The file saves to `data/uploads/` and begins indexing automatically.

**Scan a directory** (best for large collections): Set `PAGEPIPER_BOOKS_DIR` in your `.env` to a folder of PDFs/EPUBs, then click **Scan** in the Library view. Pagepiper finds all files recursively and queues them for indexing.
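
If you want to preview what a scan is likely to pick up before clicking **Scan**, a recursive walk like the one below gives a rough idea. This is a sketch, not Pagepiper's scanner: the helper name `find_documents` and the case-insensitive match on `.pdf` and `.epub` are assumptions based on the formats named above.

```python
from pathlib import Path

# Assumed from the formats this README names; the real scanner may differ.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}


def find_documents(books_dir):
    """Return every PDF/EPUB file under `books_dir`, searched recursively."""
    root = Path(books_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

For example, `find_documents("/path/to/your/pdfs")` returns a sorted list of `Path` objects, including files in nested subfolders.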

### 5. Search and chat

Switch to the **Chat** tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.

---

## Ollama Setup (optional)

Install Ollama from [ollama.com](https://ollama.com), then pull the models:

```bash
ollama pull mistral:7b
ollama pull nomic-embed-text
```

On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:

```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```

On Docker Desktop (Linux or Mac), `host.docker.internal` resolves automatically. No extra network config needed.
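
A quick way to confirm that Ollama is reachable from wherever Pagepiper runs is to ask it for its model list. The sketch below uses only the Python standard library and Ollama's `GET /api/tags` endpoint; the helper name `ollama_models` and the three-second timeout are assumptions for illustration, not part of Pagepiper.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base_url, timeout=3.0):
    """Return the model names an Ollama server reports, or None if unreachable.

    Calls GET /api/tags, which lists the models pulled on that server.
    """
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a non-JSON reply.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]
```

Calling `ollama_models("http://host.docker.internal:11434")` from inside the container should return a list that includes the models you pulled. A result of `None` points to a network or bind-address problem rather than a missing model.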

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PAGEPIPER_BOOKS_DIR` | `./books` | Host directory to scan for PDFs and EPUBs |
| `PAGEPIPER_DATA_DIR` | `./data` | SQLite index and uploaded files live here |
| `PAGEPIPER_OLLAMA_URL` | *(unset)* | Ollama base URL; leave blank for BM25-only mode |
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model (1024-dim default) |
| `PAGEPIPER_EMBED_DIMS` | `1024` | Must match the embedding model's output dimensions |
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat model; any Ollama model name works |
| `PAGEPIPER_CHAT_FEEDBACK` | *(unset)* | Set to `true` to enable thumbs up/down on chat answers |
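
The table above can be read as a small settings loader. The sketch below is one hedged way to express it in Python; `Settings`, `load_settings`, and `check_embedding_dims` are hypothetical names, and details such as treating a blank `PAGEPIPER_OLLAMA_URL` as unset or comparing the feedback flag case-insensitively are assumptions, not documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    books_dir: str
    data_dir: str
    ollama_url: Optional[str]  # None means BM25-only mode
    embed_model: str
    embed_dims: int
    chat_model: str
    chat_feedback: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ), using the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        books_dir=env.get("PAGEPIPER_BOOKS_DIR", "./books"),
        data_dir=env.get("PAGEPIPER_DATA_DIR", "./data"),
        ollama_url=env.get("PAGEPIPER_OLLAMA_URL") or None,
        embed_model=env.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
        embed_dims=int(env.get("PAGEPIPER_EMBED_DIMS", "1024")),
        chat_model=env.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
        chat_feedback=env.get("PAGEPIPER_CHAT_FEEDBACK", "").lower() == "true",
    )


def check_embedding_dims(vector, settings):
    """Raise ValueError if an embedding's length disagrees with PAGEPIPER_EMBED_DIMS."""
    if len(vector) != settings.embed_dims:
        raise ValueError(
            f"{settings.embed_model} returned {len(vector)} dimensions, "
            f"but PAGEPIPER_EMBED_DIMS is {settings.embed_dims}"
        )
```

Running `check_embedding_dims` on one test embedding after changing `PAGEPIPER_EMBED_MODEL` is a cheap way to catch a dimension mismatch before any vectors are written.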

---

## Management

```bash
./manage.sh start          # Build and start (dev)
./manage.sh stop           # Stop
./manage.sh restart        # Restart
./manage.sh status         # Show container status
./manage.sh logs [svc]     # Tail logs (default: all services; pass 'api' or 'web' to filter)
./manage.sh open           # Open the UI in your browser
./manage.sh build          # Rebuild images without cache

./manage.sh cloud:start    # Start the cloud managed instance (port 8533)
./manage.sh cloud:stop
./manage.sh cloud:restart
./manage.sh cloud:status
./manage.sh cloud:logs [svc]
./manage.sh cloud:build
```

---

## Cloud Managed Instance

The cloud deployment runs at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) and at `menagerie.circuitforge.tech/pagepiper`. It uses `compose.cloud.yml` with LLM inference routed through the cf-orch coordinator.

To run your own cloud-style deployment:

```bash
cp .env.cloud.example .env
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
./manage.sh cloud:start
```

The cloud instance listens on port 8533. The API is internal-only; nginx proxies `/api/` to the backend.

---

## Data and Backups

The `data/` directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
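
One way to back up the data directory while Pagepiper is running is to snapshot the database through SQLite's online backup API and copy the uploads alongside it. The sketch below assumes a database file named `pagepiper.db`, which is a placeholder (check your `data/` directory for the real name), and the helper name `backup_data_dir` is likewise hypothetical.

```python
import shutil
import sqlite3
from pathlib import Path


def backup_data_dir(data_dir, dest_dir, db_name="pagepiper.db"):
    """Copy the SQLite index and the uploads folder from data_dir into dest_dir.

    `db_name` is a placeholder; pass the actual database file name.
    """
    src, dest = Path(data_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # The online backup API takes a consistent snapshot even if another
    # process is writing to the database at the same time.
    source = sqlite3.connect(src / db_name)
    target = sqlite3.connect(dest / db_name)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()

    # Uploaded files are plain files, so a recursive copy is enough.
    uploads = src / "uploads"
    if uploads.is_dir():
        shutil.copytree(uploads, dest / "uploads", dirs_exist_ok=True)
    return dest
```

A plain file copy of the database also works if you stop the containers first with `./manage.sh stop`; the backup API is mainly useful when you would rather not interrupt indexing.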

Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.

---

## Licensing

Pagepiper uses a split license:

- **MIT:** BM25 full-text search, document library management, ingest pipeline, EPUB support
- **BSL 1.1:** Hybrid vector search (embedding + k-NN), RAG chat, LLM integration

BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.

License keys: [circuitforge.tech](https://circuitforge.tech)

---

## Contributing

Issues and PRs welcome at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper).

The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.