Compare commits
No commits in common. "main" and "v0.1.0" have entirely different histories.
19 changed files with 332 additions and 838 deletions
190
README.md
190
README.md
|
|
@ -1,67 +1,75 @@
|
|||
# Pagepiper
|
||||
|
||||
**Search your document library. Get answers with exact page citations.**
|
||||
**v0.1.0** | Self-hosted PDF and EPUB search for your personal library
|
||||
|
||||
[](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper)
|
||||
[](LICENSE)
|
||||
[](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper/releases)
|
||||
Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With [Ollama](https://ollama.com) configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.
|
||||
|
||||
Self-hosted PDF and EPUB search with BM25 (Best Match 25) full-text indexing and LLM (large language model) synthesis. Drop your documents in, ask a question, get an answer that tells you exactly which page to turn to.
|
||||
Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.
|
||||
|
||||
Built for TTRPG (tabletop roleplaying game) players who are tired of ctrl-F'ing through six-hundred-page rulebooks. Works equally well for legal research, technical manuals, academic papers, or any personal document library you want to query in plain language.
|
||||
|
||||
No cloud required. Your files stay on your machine.
|
||||
Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)
|
||||
|
||||
---
|
||||
|
||||
## Screenshots
|
||||
## Features
|
||||
|
||||
### Library
|
||||
| Feature | Free tier | Paid (BYOK) |
|
||||
|---------|-----------|-------------|
|
||||
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
|
||||
| Directory scan for existing files | Yes | Yes |
|
||||
| BM25 full-text search (no LLM required) | Yes | Yes |
|
||||
| Unlimited local ingestion | Yes | Yes |
|
||||
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
|
||||
| LLM chat with page-level citations | No | Yes (local Ollama) |
|
||||
| Thumbs up / down feedback on answers | No | Yes |
|
||||
|
||||

|
||||
BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.
|
||||
|
||||
### Chat with citations
|
||||
|
||||

|
||||
**BM25** (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. **k-NN** (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
|
||||
|
||||
---
|
||||
|
||||
## Why Pagepiper?
|
||||
## Tech Stack
|
||||
|
||||
- **Your library, not ours.** Documents are indexed and stored locally. Nothing is sent to a third-party service unless you explicitly configure a cloud LLM.
|
||||
- **Works without an LLM.** BM25 full-text search runs entirely inside the Docker container. No Ollama, no API key, no GPU required for keyword search.
|
||||
- **Answers cite their sources.** Every LLM response includes the document name and page number it drew from. You can verify or dispute every answer.
|
||||
- **Hybrid search when you want it.** Connect a local Ollama instance to unlock semantic (vector) search that finds relevant passages even when your question doesn't use the exact words in the text.
|
||||
- **Open ingest pipeline.** The indexing and search layer is MIT-licensed. Add support for new formats, improve the PDF parser, contribute — the community benefits directly.
|
||||
- **Backend:** FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
|
||||
- **Frontend:** Vue 3 SPA served by nginx
|
||||
- **Embedding model:** `nomic-embed-text` via Ollama (1024-dim, optional)
|
||||
- **Chat LLM:** `mistral:7b` via Ollama (optional, any Ollama model works)
|
||||
- **Deployment:** Docker Compose
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
## Quick Start (Self-Hosting)
|
||||
|
||||
**Prerequisites:** [Docker](https://docs.docker.com/get-docker/) and Docker Compose. Optionally [Ollama](https://ollama.com) for LLM-synthesized answers.
|
||||
### Prerequisites
|
||||
|
||||
- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
|
||||
- PDFs or EPUBs you want to search
|
||||
- Optional: [Ollama](https://ollama.com) for semantic search and RAG (retrieval-augmented generation) chat
|
||||
|
||||
### 1. Clone the repo
|
||||
|
||||
```bash
|
||||
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
|
||||
cd pagepiper
|
||||
cp .env.example .env
|
||||
./manage.sh start
|
||||
```
|
||||
|
||||
Open [http://localhost:8521](http://localhost:8521).
|
||||
### 2. Configure
|
||||
|
||||
### Configure
|
||||
```bash
|
||||
cp .env.example .env
|
||||
```
|
||||
|
||||
Open `.env` and set your paths:
|
||||
|
||||
```dotenv
|
||||
# Where Pagepiper stores its SQLite index and uploaded files
|
||||
PAGEPIPER_DATA_DIR=./data
|
||||
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
|
||||
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs
|
||||
|
||||
# Directory to scan for existing PDFs/EPUBs (used by the Scan button)
|
||||
PAGEPIPER_BOOKS_DIR=/path/to/your/documents
|
||||
# Where Pagepiper stores its SQLite index and uploaded files
|
||||
PAGEPIPER_DATA_DIR=data
|
||||
```
|
||||
|
||||
To unlock LLM synthesis and semantic search, add your Ollama endpoint:
|
||||
To unlock hybrid search and LLM chat, uncomment and set the Ollama block:
|
||||
|
||||
```dotenv
|
||||
PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
||||
|
|
@ -69,81 +77,121 @@ PAGEPIPER_CHAT_MODEL=mistral:7b
|
|||
PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
||||
```
|
||||
|
||||
### Add documents
|
||||
### 3. Start
|
||||
|
||||
**Upload via browser** — click **Upload** in the Library view. Files save to `data/uploads/` and index automatically.
|
||||
```bash
|
||||
./manage.sh start
|
||||
```
|
||||
|
||||
**Scan a directory** — set `PAGEPIPER_BOOKS_DIR` in `.env`, then click **Scan**. Pagepiper finds all files recursively and queues them.
|
||||
Open [http://localhost:8521](http://localhost:8521).
|
||||
|
||||
### 4. Add documents
|
||||
|
||||
Two ways to add files:
|
||||
|
||||
**Upload via browser** (easiest for small collections): Click **Upload** in the Library view and select a PDF or EPUB. The file saves to `data/uploads/` and begins indexing automatically.
|
||||
|
||||
**Scan a directory** (best for large collections): Set `PAGEPIPER_BOOKS_DIR` in your `.env` to a folder of PDFs/EPUBs, then click **Scan** in the Library view. Pagepiper finds all files recursively and queues them for indexing.
|
||||
|
||||
### 5. Search and chat
|
||||
|
||||
Switch to the **Chat** tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
|
||||
|
||||
---
|
||||
|
||||
## Supported Formats
|
||||
## Ollama Setup (optional)
|
||||
|
||||
| Format | Ingest | Page-level citations |
|
||||
|--------|--------|----------------------|
|
||||
| PDF | Yes | Yes |
|
||||
| EPUB | Yes | Yes (chapter/location) |
|
||||
Install Ollama from [ollama.com](https://ollama.com), then pull the models:
|
||||
|
||||
```bash
|
||||
ollama pull mistral:7b
|
||||
ollama pull nomic-embed-text
|
||||
```
|
||||
|
||||
On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:
|
||||
|
||||
```bash
|
||||
OLLAMA_HOST=0.0.0.0 ollama serve
|
||||
```
|
||||
|
||||
On Docker Desktop (Linux or Mac), `host.docker.internal` resolves automatically. No extra network config needed.
|
||||
|
||||
---
|
||||
|
||||
## Stack
|
||||
## Environment Variables
|
||||
|
||||
| Layer | Technology |
|
||||
|-------|-----------|
|
||||
| Backend API | FastAPI + SQLite |
|
||||
| Full-text search | BM25 (custom index, no external service) |
|
||||
| Vector search | sqlite-vec + Ollama embeddings (optional) |
|
||||
| LLM synthesis | Ollama (local, any model) |
|
||||
| Frontend | Vue 3 SPA served by nginx |
|
||||
| Deployment | Docker Compose |
|
||||
|
||||
Default ports: Web UI `8521`, API `8540`.
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `PAGEPIPER_BOOKS_DIR` | `./books` | Host directory to scan for PDFs and EPUBs |
|
||||
| `PAGEPIPER_DATA_DIR` | `./data` | SQLite index and uploaded files live here |
|
||||
| `PAGEPIPER_OLLAMA_URL` | *(unset)* | Ollama base URL; leave blank for BM25-only mode |
|
||||
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model (1024-dim default) |
|
||||
| `PAGEPIPER_EMBED_DIMS` | `1024` | Must match the embedding model's output dimensions |
|
||||
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat model; any Ollama model name works |
|
||||
| `PAGEPIPER_CHAT_FEEDBACK` | *(unset)* | Set to `true` to enable thumbs up/down on chat answers |
|
||||
|
||||
---
|
||||
|
||||
## Management
|
||||
|
||||
```bash
|
||||
./manage.sh start # Build and start
|
||||
./manage.sh start # Build and start (dev)
|
||||
./manage.sh stop # Stop
|
||||
./manage.sh restart # Restart
|
||||
./manage.sh status # Show container status
|
||||
./manage.sh logs [svc] # Tail logs (pass 'api' or 'web' to filter)
|
||||
./manage.sh open # Open UI in browser
|
||||
./manage.sh build # Rebuild images
|
||||
./manage.sh test # Run test suite
|
||||
./manage.sh logs [svc] # Tail logs (default: all services; pass 'api' or 'web' to filter)
|
||||
./manage.sh open # Open the UI in your browser
|
||||
./manage.sh build # Rebuild images without cache
|
||||
|
||||
./manage.sh cloud:start # Start the cloud managed instance (port 8533)
|
||||
./manage.sh cloud:stop
|
||||
./manage.sh cloud:restart
|
||||
./manage.sh cloud:status
|
||||
./manage.sh cloud:logs [svc]
|
||||
./manage.sh cloud:build
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tiers
|
||||
## Cloud Managed Instance
|
||||
|
||||
| Feature | Free | Paid (BYOK) |
|
||||
|---------|------|-------------|
|
||||
| PDF and EPUB upload | Yes | Yes |
|
||||
| Directory scan | Yes | Yes |
|
||||
| BM25 full-text search | Yes | Yes |
|
||||
| Unlimited local ingestion | Yes | Yes |
|
||||
| Hybrid BM25 + vector search | — | Yes (local Ollama) |
|
||||
| LLM synthesis with page citations | — | Yes (local Ollama) |
|
||||
The cloud deployment runs at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) and at `menagerie.circuitforge.tech/pagepiper`. It uses `compose.cloud.yml` with LLM inference routed through the cf-orch coordinator.
|
||||
|
||||
BYOK means you supply your own Ollama instance. No cloud API keys, no usage metering.
|
||||
To run your own cloud-style deployment:
|
||||
|
||||
```bash
|
||||
cp .env.cloud.example .env
|
||||
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
|
||||
./manage.sh cloud:start
|
||||
```
|
||||
|
||||
Cloud instance listens on port 8533. The API is internal-only; nginx proxies `/api/` to the backend.
|
||||
|
||||
---
|
||||
|
||||
## Forgejo-primary
|
||||
## Data and Backups
|
||||
|
||||
Pagepiper is developed and hosted at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper). GitHub mirrors exist for discoverability only. File issues and submit pull requests on Forgejo.
|
||||
The `data/` directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
|
||||
|
||||
Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
## Licensing
|
||||
|
||||
Pagepiper uses a split license:
|
||||
|
||||
- **MIT:** Document ingest pipeline, BM25 full-text index, library management, EPUB support — the core discovery and retrieval layer.
|
||||
- **BSL 1.1 (Business Source License):** Hybrid vector search, LLM synthesis, RAG (retrieval-augmented generation) chat interface — free for personal non-commercial self-hosting; commercial use or SaaS re-hosting requires a license. Converts to MIT after four years.
|
||||
- **MIT:** BM25 full-text search, document library management, ingest pipeline, EPUB support
|
||||
- **BSL 1.1:** Hybrid vector search (embedding + k-NN), RAG chat, LLM integration
|
||||
|
||||
BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.
|
||||
|
||||
License keys: [circuitforge.tech](https://circuitforge.tech)
|
||||
|
||||
---
|
||||
|
||||
*A [Circuit Forge LLC](https://circuitforge.tech) product. Privacy · Safety · Accessibility — co-equal, non-negotiable.*
|
||||
## Contributing
|
||||
|
||||
Issues and PRs welcome at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper).
|
||||
|
||||
The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.
|
||||
|
|
|
|||
|
|
@ -10,11 +10,9 @@ from __future__ import annotations
|
|||
import logging
|
||||
import os
|
||||
|
||||
from fastapi import APIRouter, Depends, HTTPException
|
||||
from fastapi import APIRouter, HTTPException
|
||||
from pydantic import BaseModel
|
||||
|
||||
from app.cloud_session import require_paid_tier
|
||||
from app.deps import UserCtx, get_user_ctx
|
||||
from app.services.retriever import Retriever
|
||||
from app.services.synthesizer import Synthesizer
|
||||
|
||||
|
|
@ -58,6 +56,21 @@ def _get_llm_router():
|
|||
return LLMRouter(cfg)
|
||||
|
||||
|
||||
def _get_db_path() -> str:
|
||||
"""Read lazily so test fixtures take effect."""
|
||||
import pathlib
|
||||
|
||||
data_dir = pathlib.Path(os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
||||
return str(data_dir / "pagepiper.db")
|
||||
|
||||
|
||||
def _get_vec_db_path() -> str:
|
||||
import pathlib
|
||||
|
||||
data_dir = pathlib.Path(os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
||||
return str(data_dir / "pagepiper_vecs.db")
|
||||
|
||||
|
||||
def _require_llm():
|
||||
"""Return LLMRouter or raise 402."""
|
||||
llm = _get_llm_router()
|
||||
|
|
@ -76,20 +89,18 @@ def _require_llm():
|
|||
|
||||
|
||||
@router.post("")
|
||||
def chat(
|
||||
req: ChatRequest,
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
_tier: str = Depends(require_paid_tier),
|
||||
) -> ChatResponse:
|
||||
def chat(req: ChatRequest) -> ChatResponse:
|
||||
llm = _require_llm()
|
||||
|
||||
retriever = Retriever(ctx.bm25)
|
||||
from app.main import _bm25
|
||||
|
||||
retriever = Retriever(_bm25)
|
||||
chunks = retriever.hybrid_search(
|
||||
query=req.message,
|
||||
top_k=req.top_k,
|
||||
doc_ids=req.doc_ids,
|
||||
db_path=ctx.db_path,
|
||||
vec_db_path=ctx.vec_db_path,
|
||||
db_path=_get_db_path(),
|
||||
vec_db_path=_get_vec_db_path(),
|
||||
llm=llm,
|
||||
)
|
||||
|
||||
|
|
@ -130,10 +141,7 @@ def chat_feedback_status() -> dict:
|
|||
|
||||
|
||||
@router.post("/feedback")
|
||||
def submit_chat_feedback(
|
||||
req: ChatFeedbackRequest,
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> dict:
|
||||
def submit_chat_feedback(req: ChatFeedbackRequest) -> dict:
|
||||
import json
|
||||
import sqlite3
|
||||
|
||||
|
|
@ -141,7 +149,8 @@ def submit_chat_feedback(
|
|||
from fastapi import HTTPException
|
||||
raise HTTPException(status_code=422, detail="rating must be 1 or -1")
|
||||
|
||||
con = sqlite3.connect(ctx.db_path)
|
||||
db_path = _get_db_path()
|
||||
con = sqlite3.connect(db_path)
|
||||
try:
|
||||
con.execute(
|
||||
"INSERT INTO chat_feedback (rating, question, answer, doc_ids) VALUES (?, ?, ?, ?)",
|
||||
|
|
|
|||
|
|
@ -14,25 +14,26 @@ from typing import Callable
|
|||
|
||||
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, UploadFile
|
||||
|
||||
from app.config import VEC_DIMENSIONS
|
||||
from app.deps import UserCtx, get_db, get_user_ctx
|
||||
from app.config import WATCH_DIR, DB_PATH, VEC_DB_PATH, DATA_DIR
|
||||
from app.deps import get_db
|
||||
|
||||
_MAX_UPLOAD_BYTES = 200 * 1024 * 1024 # 200 MB
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
router = APIRouter(prefix="/api/library", tags=["library"])
|
||||
|
||||
# Injected by main.py after _bm25 is created
|
||||
_mark_bm25_dirty: Callable[[], None] | None = None
|
||||
|
||||
|
||||
_INGEST_TASKS = {
|
||||
".pdf": "pagepiper/ingest_pdf",
|
||||
".epub": "pagepiper/ingest_epub",
|
||||
".docx": "pagepiper/ingest_docx",
|
||||
}
|
||||
|
||||
_INGEST_RUNNERS = {
|
||||
".pdf": "scripts.ingest_pdf",
|
||||
".epub": "scripts.ingest_epub",
|
||||
".docx": "scripts.ingest_docx",
|
||||
}
|
||||
|
||||
|
||||
|
|
@ -40,22 +41,24 @@ def _dispatch_ingest(
|
|||
doc_id: str,
|
||||
file_path: str,
|
||||
background_tasks: BackgroundTasks,
|
||||
data_dir: Path,
|
||||
mark_dirty_fn: Callable[[], None],
|
||||
) -> str:
|
||||
"""Dispatch an ingest task. Tries cf-orch; falls back to BackgroundTasks."""
|
||||
import importlib
|
||||
import os as _os
|
||||
from pathlib import Path as _Path
|
||||
|
||||
suffix = Path(file_path).suffix.lower()
|
||||
suffix = _Path(file_path).suffix.lower()
|
||||
task_name = _INGEST_TASKS.get(suffix, "pagepiper/ingest_pdf")
|
||||
runner_module = _INGEST_RUNNERS.get(suffix, "scripts.ingest_pdf")
|
||||
|
||||
# Read lazily so test fixtures (monkeypatch.setenv) take effect
|
||||
_data_dir = _Path(_os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
||||
task_id = str(uuid.uuid4())
|
||||
args = {
|
||||
"doc_id": doc_id,
|
||||
"file_path": file_path,
|
||||
"db_path": str(data_dir / "pagepiper.db"),
|
||||
"vec_db_path": str(data_dir / "pagepiper_vecs.db"),
|
||||
"db_path": str(_data_dir / "pagepiper.db"),
|
||||
"vec_db_path": str(_data_dir / "pagepiper_vecs.db"),
|
||||
}
|
||||
|
||||
try:
|
||||
|
|
@ -64,7 +67,7 @@ def _dispatch_ingest(
|
|||
logger.info("Dispatched cf-orch ingest task %s for doc %s", task_id, doc_id)
|
||||
except Exception:
|
||||
mod = importlib.import_module(runner_module)
|
||||
background_tasks.add_task(_run_ingest_background, mod.run, args, task_id, mark_dirty_fn)
|
||||
background_tasks.add_task(_run_ingest_background, mod.run, args, task_id)
|
||||
logger.info(
|
||||
"cf-orch unavailable — running ingest in background thread (task %s)", task_id
|
||||
)
|
||||
|
|
@ -72,47 +75,19 @@ def _dispatch_ingest(
|
|||
return task_id
|
||||
|
||||
|
||||
def _run_ingest_background(
|
||||
run_fn: Callable[..., None],
|
||||
args: dict,
|
||||
task_id: str,
|
||||
mark_dirty_fn: Callable[[], None] | None = None,
|
||||
) -> None:
|
||||
def _run_ingest_background(run_fn: Callable[..., None], args: dict, task_id: str) -> None:
|
||||
from app.api.ingest import _task_registry
|
||||
_task_registry[task_id] = {"status": "running", "progress": 0}
|
||||
try:
|
||||
run_fn(**args)
|
||||
_task_registry[task_id] = {"status": "complete", "progress": 100}
|
||||
if mark_dirty_fn:
|
||||
mark_dirty_fn()
|
||||
if _mark_bm25_dirty:
|
||||
_mark_bm25_dirty()
|
||||
except Exception as exc:
|
||||
logger.exception("Ingest task %s failed", task_id)
|
||||
_task_registry[task_id] = {"status": "error", "error": str(exc)}
|
||||
|
||||
|
||||
@router.get("/sample-chunks")
|
||||
def sample_chunks(
|
||||
limit: int = 50,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
) -> list[dict]:
|
||||
"""Return up to `limit` randomly sampled page chunks for corpus benchmarking.
|
||||
|
||||
No tier gate — internal tool, same-host access only. Returns [] if empty.
|
||||
"""
|
||||
if limit < 1:
|
||||
limit = 1
|
||||
elif limit > 200:
|
||||
limit = 200
|
||||
rows = db.execute(
|
||||
"SELECT id, doc_id, page_number, text FROM page_chunks ORDER BY RANDOM() LIMIT ?",
|
||||
[limit],
|
||||
).fetchall()
|
||||
return [
|
||||
{"chunk_id": r["id"], "doc_id": r["doc_id"], "page_number": r["page_number"], "text": r["text"]}
|
||||
for r in rows
|
||||
]
|
||||
|
||||
|
||||
@router.get("")
|
||||
def list_library(db: sqlite3.Connection = Depends(get_db)) -> list[dict]:
|
||||
rows = db.execute(
|
||||
|
|
@ -126,18 +101,13 @@ def list_library(db: sqlite3.Connection = Depends(get_db)) -> list[dict]:
|
|||
def scan_library(
|
||||
background_tasks: BackgroundTasks,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> dict:
|
||||
"""Scan the watched directory and queue ingest for any new PDFs."""
|
||||
watch = ctx.watch_dir
|
||||
watch = WATCH_DIR
|
||||
if not watch.exists():
|
||||
raise HTTPException(status_code=404, detail=f"Watch directory not found: {watch}")
|
||||
|
||||
pdfs = (
|
||||
list(watch.glob("**/*.pdf"))
|
||||
+ list(watch.glob("**/*.epub"))
|
||||
+ list(watch.glob("**/*.docx"))
|
||||
)
|
||||
pdfs = list(watch.glob("**/*.pdf")) + list(watch.glob("**/*.epub"))
|
||||
queued = []
|
||||
|
||||
for pdf_path in pdfs:
|
||||
|
|
@ -147,7 +117,7 @@ def scan_library(
|
|||
).fetchone()
|
||||
|
||||
if existing and existing["status"] == "ready":
|
||||
continue
|
||||
continue # already indexed
|
||||
|
||||
if existing:
|
||||
doc_id = existing["id"]
|
||||
|
|
@ -159,9 +129,7 @@ def scan_library(
|
|||
).fetchone()[0]
|
||||
db.commit()
|
||||
|
||||
task_id = _dispatch_ingest(
|
||||
doc_id, path_str, background_tasks, ctx.data_dir, ctx.bm25.mark_dirty
|
||||
)
|
||||
task_id = _dispatch_ingest(doc_id, path_str, background_tasks)
|
||||
db.execute(
|
||||
"UPDATE documents SET status='processing', task_id=? WHERE id=?",
|
||||
[task_id, doc_id],
|
||||
|
|
@ -177,15 +145,12 @@ def reingest_document(
|
|||
doc_id: str,
|
||||
background_tasks: BackgroundTasks,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> dict:
|
||||
row = db.execute("SELECT file_path FROM documents WHERE id=?", [doc_id]).fetchone()
|
||||
if not row:
|
||||
raise HTTPException(status_code=404, detail="Document not found")
|
||||
|
||||
task_id = _dispatch_ingest(
|
||||
doc_id, row["file_path"], background_tasks, ctx.data_dir, ctx.bm25.mark_dirty
|
||||
)
|
||||
task_id = _dispatch_ingest(doc_id, row["file_path"], background_tasks)
|
||||
db.execute(
|
||||
"UPDATE documents SET status='processing', task_id=?, error_msg=NULL WHERE id=?",
|
||||
[task_id, doc_id],
|
||||
|
|
@ -198,7 +163,6 @@ def reingest_document(
|
|||
def delete_document(
|
||||
doc_id: str,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> None:
|
||||
row = db.execute("SELECT id FROM documents WHERE id=?", [doc_id]).fetchone()
|
||||
if not row:
|
||||
|
|
@ -207,21 +171,23 @@ def delete_document(
|
|||
db.execute("DELETE FROM documents WHERE id=?", [doc_id])
|
||||
db.commit()
|
||||
|
||||
# Remove embeddings from vector store
|
||||
try:
|
||||
from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore # type: ignore[import]
|
||||
store = LocalSQLiteVecStore(
|
||||
db_path=ctx.vec_db_path, table="page_vecs", dimensions=VEC_DIMENSIONS
|
||||
)
|
||||
from app.config import VEC_DIMENSIONS
|
||||
store = LocalSQLiteVecStore(db_path=VEC_DB_PATH, table="page_vecs", dimensions=VEC_DIMENSIONS)
|
||||
store.delete_where({"doc_id": doc_id})
|
||||
except Exception as exc:
|
||||
logger.warning("Could not remove vectors for doc %s: %s", doc_id, exc)
|
||||
|
||||
ctx.bm25.mark_dirty()
|
||||
if _mark_bm25_dirty:
|
||||
_mark_bm25_dirty()
|
||||
|
||||
|
||||
def _get_vec_count(doc_id: str, vec_db_path: str) -> int:
|
||||
def _get_vec_count(doc_id: str) -> int:
|
||||
"""Return how many vectors have been stored for this doc. Returns 0 on any error."""
|
||||
try:
|
||||
conn = sqlite3.connect(vec_db_path)
|
||||
conn = sqlite3.connect(VEC_DB_PATH)
|
||||
count = conn.execute(
|
||||
"SELECT COUNT(*) FROM page_vecs_meta WHERE json_extract(metadata, '$.doc_id') = ?",
|
||||
[doc_id],
|
||||
|
|
@ -236,7 +202,6 @@ def _get_vec_count(doc_id: str, vec_db_path: str) -> int:
|
|||
def document_status(
|
||||
doc_id: str,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> dict:
|
||||
row = db.execute(
|
||||
"SELECT id, status, task_id, page_count, error_msg FROM documents WHERE id=?",
|
||||
|
|
@ -245,7 +210,7 @@ def document_status(
|
|||
if not row:
|
||||
raise HTTPException(status_code=404, detail="Document not found")
|
||||
result = dict(row)
|
||||
result["vec_count"] = _get_vec_count(doc_id, ctx.vec_db_path)
|
||||
result["vec_count"] = _get_vec_count(doc_id)
|
||||
return result
|
||||
|
||||
|
||||
|
|
@ -254,19 +219,18 @@ def upload_document(
|
|||
file: UploadFile,
|
||||
background_tasks: BackgroundTasks,
|
||||
db: sqlite3.Connection = Depends(get_db),
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
) -> dict:
|
||||
"""Accept a PDF/EPUB upload, save to data/uploads/, and queue for indexing."""
|
||||
name = Path(file.filename or "").name
|
||||
suffix = Path(name).suffix.lower()
|
||||
if suffix not in _INGEST_TASKS:
|
||||
raise HTTPException(status_code=400, detail="Supported formats: PDF, EPUB, DOCX")
|
||||
raise HTTPException(status_code=400, detail="Supported formats: PDF, EPUB")
|
||||
|
||||
content = file.file.read()
|
||||
if len(content) > _MAX_UPLOAD_BYTES:
|
||||
raise HTTPException(status_code=413, detail="File exceeds 200 MB limit")
|
||||
|
||||
upload_dir = ctx.data_dir / "uploads"
|
||||
upload_dir = DATA_DIR / "uploads"
|
||||
upload_dir.mkdir(parents=True, exist_ok=True)
|
||||
dest = upload_dir / name
|
||||
dest.write_bytes(content)
|
||||
|
|
@ -289,9 +253,7 @@ def upload_document(
|
|||
).fetchone()[0]
|
||||
db.commit()
|
||||
|
||||
task_id = _dispatch_ingest(
|
||||
doc_id, path_str, background_tasks, ctx.data_dir, ctx.bm25.mark_dirty
|
||||
)
|
||||
task_id = _dispatch_ingest(doc_id, path_str, background_tasks)
|
||||
db.execute(
|
||||
"UPDATE documents SET status='processing', task_id=? WHERE id=?",
|
||||
[task_id, doc_id],
|
||||
|
|
|
|||
|
|
@ -7,11 +7,13 @@ MIT — no tier gate. No Ollama required.
|
|||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
from app.deps import UserCtx, get_user_ctx
|
||||
from app.services.bm25_index import BM25Index
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
router = APIRouter(prefix="/api/search", tags=["search"])
|
||||
|
|
@ -27,17 +29,32 @@ class SearchResult(BaseModel):
|
|||
chunk_id: str
|
||||
doc_id: str
|
||||
page_number: int
|
||||
text_snippet: str
|
||||
text_snippet: str # first 300 chars of the page text
|
||||
bm25_score: float
|
||||
|
||||
|
||||
def _get_bm25() -> BM25Index:
|
||||
import app.main as _main
|
||||
bm25 = getattr(_main, "_bm25", None)
|
||||
if bm25 is None:
|
||||
raise RuntimeError("BM25 index not initialised — app.main not loaded")
|
||||
return bm25
|
||||
|
||||
|
||||
def _get_db_path() -> str:
|
||||
"""Read lazily so test fixtures (monkeypatch.setattr) take effect."""
|
||||
import pathlib
|
||||
data_dir = pathlib.Path(os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
||||
return str(data_dir / "pagepiper.db")
|
||||
|
||||
|
||||
@router.post("")
|
||||
def search(
|
||||
req: SearchRequest,
|
||||
ctx: UserCtx = Depends(get_user_ctx),
|
||||
bm25: Annotated[BM25Index, Depends(_get_bm25)],
|
||||
) -> list[SearchResult]:
|
||||
ctx.bm25.ensure_fresh(ctx.db_path)
|
||||
hits = ctx.bm25.query(req.query, top_k=req.top_k, doc_ids=req.doc_ids)
|
||||
bm25.ensure_fresh(_get_db_path())
|
||||
hits = bm25.query(req.query, top_k=req.top_k, doc_ids=req.doc_ids)
|
||||
return [
|
||||
SearchResult(
|
||||
chunk_id=h.chunk_id,
|
||||
|
|
|
|||
|
|
@ -1,131 +0,0 @@
|
|||
# app/cloud_session.py
|
||||
"""Cloud session auth for Pagepiper — validates cf_session cookie via Directus + Heimdall.
|
||||
|
||||
In local mode (CLOUD_MODE unset or false), require_paid_tier is a no-op.
|
||||
In cloud mode, the Caddy proxy forwards the browser's Cookie header as
|
||||
X-CF-Session. This module extracts cf_session, validates it against
|
||||
Directus /users/me, then checks the user's Pagepiper tier via Heimdall.
|
||||
Auto-provisions a free tier key for new users.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
|
||||
import httpx
|
||||
from fastapi import HTTPException, Request
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
CLOUD_MODE: bool = os.environ.get("CLOUD_MODE", "").lower() in ("1", "true", "yes")
|
||||
DIRECTUS_URL: str = os.environ.get("DIRECTUS_URL", "http://172.31.0.3:8055").rstrip("/")
|
||||
HEIMDALL_URL: str = os.environ.get("HEIMDALL_URL", "https://license.circuitforge.tech").rstrip("/")
|
||||
HEIMDALL_ADMIN_TOKEN: str = os.environ.get("HEIMDALL_ADMIN_TOKEN", "")
|
||||
|
||||
_TIER_ORDER = {"free": 0, "paid": 1, "premium": 2, "ultra": 3}
|
||||
|
||||
|
||||
def _extract_session_token(cookie_header: str) -> str:
|
||||
m = re.search(r"(?:^|;)\s*cf_session=([^;]+)", cookie_header)
|
||||
return m.group(1).strip() if m else ""
|
||||
|
||||
|
||||
def _get_user_id(jwt: str) -> str | None:
|
||||
try:
|
||||
resp = httpx.get(
|
||||
f"{DIRECTUS_URL}/users/me",
|
||||
headers={"Authorization": f"Bearer {jwt}"},
|
||||
timeout=5.0,
|
||||
)
|
||||
if resp.status_code == 200:
|
||||
return resp.json().get("data", {}).get("id")
|
||||
except Exception as exc:
|
||||
log.warning("Directus session check failed: %s", exc)
|
||||
return None
|
||||
|
||||
|
||||
def _ensure_provisioned(user_id: str) -> None:
|
||||
if not HEIMDALL_ADMIN_TOKEN:
|
||||
return
|
||||
try:
|
||||
httpx.post(
|
||||
f"{HEIMDALL_URL}/admin/provision",
|
||||
json={"directus_user_id": user_id, "product": "pagepiper", "tier": "free"},
|
||||
headers={"Authorization": f"Bearer {HEIMDALL_ADMIN_TOKEN}"},
|
||||
timeout=5.0,
|
||||
)
|
||||
except Exception as exc:
|
||||
log.warning("Heimdall provision failed for user %s: %s", user_id, exc)
|
||||
|
||||
|
||||
def _get_tier(user_id: str) -> str:
|
||||
if not HEIMDALL_ADMIN_TOKEN:
|
||||
return "free"
|
||||
try:
|
||||
resp = httpx.get(
|
||||
f"{HEIMDALL_URL}/admin/cloud/resolve",
|
||||
params={"directus_user_id": user_id, "product": "pagepiper"},
|
||||
headers={"Authorization": f"Bearer {HEIMDALL_ADMIN_TOKEN}"},
|
||||
timeout=5.0,
|
||||
)
|
||||
if resp.status_code == 200:
|
||||
return resp.json().get("tier", "free")
|
||||
except Exception as exc:
|
||||
log.warning("Heimdall tier check failed for user %s: %s", user_id, exc)
|
||||
return "free"
|
||||
|
||||
|
||||
def resolve_authenticated_user(request: Request) -> str:
|
||||
"""Validate the session cookie and return the Directus user_id. Raises 401 if invalid."""
|
||||
cookie_header = request.headers.get("x-cf-session", "")
|
||||
jwt = _extract_session_token(cookie_header)
|
||||
|
||||
if not jwt:
|
||||
raise HTTPException(
|
||||
status_code=401,
|
||||
detail={
|
||||
"error": "auth_required",
|
||||
"message": "Sign in at circuitforge.tech to use Pagepiper cloud.",
|
||||
},
|
||||
)
|
||||
|
||||
user_id = _get_user_id(jwt)
|
||||
if not user_id:
|
||||
raise HTTPException(
|
||||
status_code=401,
|
||||
detail={
|
||||
"error": "session_invalid",
|
||||
"message": "Your session has expired. Sign in again at circuitforge.tech.",
|
||||
},
|
||||
)
|
||||
|
||||
_ensure_provisioned(user_id)
|
||||
return user_id
|
||||
|
||||
|
||||
def require_paid_tier(request: Request) -> str:
|
||||
"""FastAPI dependency — 401 if no valid session, 402 if tier < paid. Returns user_id.
|
||||
|
||||
In local mode (CLOUD_MODE not set), returns LOCAL_USER_ID without any auth check.
|
||||
"""
|
||||
if not CLOUD_MODE:
|
||||
from app.config import LOCAL_USER_ID
|
||||
return LOCAL_USER_ID
|
||||
|
||||
user_id = resolve_authenticated_user(request)
|
||||
tier = _get_tier(user_id)
|
||||
|
||||
if _TIER_ORDER.get(tier, 0) < _TIER_ORDER["paid"]:
|
||||
raise HTTPException(
|
||||
status_code=402,
|
||||
detail={
|
||||
"error": "upgrade_required",
|
||||
"message": (
|
||||
"RAG chat requires a Paid tier Pagepiper license. "
|
||||
"Upgrade at circuitforge.tech/software/pagepiper."
|
||||
),
|
||||
},
|
||||
)
|
||||
|
||||
return user_id
|
||||
|
|
@ -12,35 +12,15 @@ VEC_DB_PATH = str(DATA_DIR / "pagepiper_vecs.db")
|
|||
WATCH_DIR = Path(os.environ.get("PAGEPIPER_WATCH_DIR", "books"))
|
||||
VEC_DIMENSIONS = int(os.environ.get("PAGEPIPER_EMBED_DIMS", "1024"))
|
||||
|
||||
LOCAL_USER_ID = "__local__"
|
||||
|
||||
|
||||
def user_data_dir(user_id: str) -> Path:
|
||||
"""Return (and create) the per-user data directory under DATA_DIR/users/."""
|
||||
d = DATA_DIR / "users" / user_id
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
return d
|
||||
|
||||
|
||||
def get_llm_config() -> dict | None:
|
||||
"""Build LLMRouter config from env vars.
|
||||
|
||||
Returns None only when neither PAGEPIPER_OLLAMA_URL nor CF_ORCH_URL is set.
|
||||
CF_ORCH_URL alone is sufficient — the coordinator resolves the service URL at
|
||||
allocation time so PAGEPIPER_OLLAMA_URL becomes optional.
|
||||
"""
|
||||
"""Build LLMRouter config from env vars. Returns None if PAGEPIPER_OLLAMA_URL is unset."""
|
||||
url = os.environ.get("PAGEPIPER_OLLAMA_URL", "").strip()
|
||||
orch_url = os.environ.get("CF_ORCH_URL", "").strip()
|
||||
|
||||
if not url and not orch_url:
|
||||
if not url:
|
||||
return None
|
||||
|
||||
chat_model = os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b")
|
||||
|
||||
_base_url = ""
|
||||
if url:
|
||||
_clean = url.rstrip("/")
|
||||
_base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
|
||||
chat_model = os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b")
|
||||
|
||||
backend: dict = {
|
||||
"type": "openai_compat",
|
||||
|
|
@ -50,9 +30,12 @@ def get_llm_config() -> dict | None:
|
|||
"supports_images": False,
|
||||
}
|
||||
|
||||
# Wire cf-orch allocation when coordinator is configured so the model stays warm
|
||||
# and cold-start latency doesn't cause chat timeouts.
|
||||
orch_url = os.environ.get("CF_ORCH_URL", "").strip()
|
||||
if orch_url:
|
||||
backend["cf_orch"] = {
|
||||
"service": os.environ.get("PAGEPIPER_ORCH_SERVICE", "ollama"),
|
||||
"service": "ollama",
|
||||
"model_candidates": [chat_model],
|
||||
"ttl_s": 3600,
|
||||
}
|
||||
|
|
|
|||
68
app/deps.py
68
app/deps.py
|
|
@ -3,75 +3,13 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import sqlite3
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Generator
|
||||
|
||||
from fastapi import Depends, Request
|
||||
|
||||
from app.config import DATA_DIR, LOCAL_USER_ID
|
||||
from app.services.bm25_index import BM25Index
|
||||
from app.config import DB_PATH
|
||||
|
||||
|
||||
@dataclass
|
||||
class UserCtx:
|
||||
"""Per-request context routing DB paths and BM25 to the right user."""
|
||||
|
||||
user_id: str
|
||||
db_path: str
|
||||
vec_db_path: str
|
||||
data_dir: Path
|
||||
watch_dir: Path
|
||||
bm25: BM25Index
|
||||
|
||||
|
||||
_user_startup_done: set[str] = set()
|
||||
|
||||
|
||||
def _run_user_startup(user_id: str, user_dir: Path) -> None:
|
||||
"""Run migrations and vec schema check once per process lifetime per user."""
|
||||
if user_id in _user_startup_done:
|
||||
return
|
||||
_user_startup_done.add(user_id)
|
||||
from app.config import VEC_DIMENSIONS
|
||||
from app.startup import apply_migrations, check_and_rebuild_vec_schema
|
||||
apply_migrations(str(user_dir / "pagepiper.db"))
|
||||
check_and_rebuild_vec_schema(
|
||||
str(user_dir / "pagepiper_vecs.db"), VEC_DIMENSIONS, str(user_dir / "pagepiper.db")
|
||||
)
|
||||
|
||||
|
||||
def get_user_ctx(request: Request) -> UserCtx:
|
||||
"""Resolve the per-user data directory, DB paths, and BM25 instance for this request."""
|
||||
import app.main as _main
|
||||
from app.cloud_session import CLOUD_MODE
|
||||
|
||||
if CLOUD_MODE:
|
||||
from app.cloud_session import resolve_authenticated_user
|
||||
from app.config import user_data_dir
|
||||
user_id = resolve_authenticated_user(request)
|
||||
user_dir = user_data_dir(user_id)
|
||||
_run_user_startup(user_id, user_dir)
|
||||
watch_dir = user_dir / "books"
|
||||
watch_dir.mkdir(parents=True, exist_ok=True)
|
||||
else:
|
||||
from app.config import WATCH_DIR
|
||||
user_id = LOCAL_USER_ID
|
||||
user_dir = DATA_DIR
|
||||
watch_dir = WATCH_DIR
|
||||
|
||||
return UserCtx(
|
||||
user_id=user_id,
|
||||
db_path=str(user_dir / "pagepiper.db"),
|
||||
vec_db_path=str(user_dir / "pagepiper_vecs.db"),
|
||||
data_dir=user_dir,
|
||||
watch_dir=watch_dir,
|
||||
bm25=_main._get_bm25_for(user_id),
|
||||
)
|
||||
|
||||
|
||||
def get_db(ctx: UserCtx = Depends(get_user_ctx)) -> Generator[sqlite3.Connection, None, None]:
|
||||
conn = sqlite3.connect(ctx.db_path, check_same_thread=False)
|
||||
def get_db() -> Generator[sqlite3.Connection, None, None]:
|
||||
conn = sqlite3.connect(DB_PATH, check_same_thread=False)
|
||||
conn.execute("PRAGMA foreign_keys = ON")
|
||||
conn.execute("PRAGMA journal_mode = WAL")
|
||||
conn.row_factory = sqlite3.Row
|
||||
|
|
|
|||
115
app/main.py
115
app/main.py
|
|
@ -4,6 +4,9 @@ from __future__ import annotations
|
|||
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import sqlite3
|
||||
import threading
|
||||
from contextlib import asynccontextmanager
|
||||
|
||||
from fastapi import FastAPI
|
||||
|
|
@ -13,40 +16,110 @@ from app.services.bm25_index import BM25Index
|
|||
|
||||
logger = logging.getLogger("pagepiper")
|
||||
|
||||
# Per-user BM25 registry — keyed by user_id; "__local__" for single-user mode
|
||||
_bm25_map: dict[str, BM25Index] = {}
|
||||
# Module-level BM25 singleton — shared across all requests
|
||||
_bm25 = BM25Index()
|
||||
|
||||
|
||||
def _get_bm25_for(user_id: str) -> BM25Index:
|
||||
if user_id not in _bm25_map:
|
||||
_bm25_map[user_id] = BM25Index()
|
||||
return _bm25_map[user_id]
|
||||
def _apply_migrations() -> None:
|
||||
from scripts.db_migrate import migrate
|
||||
migrate(DB_PATH)
|
||||
|
||||
|
||||
def _reembed_docs(docs: list[tuple[str, str]], db_path: str, vec_db_path: str) -> None:
|
||||
"""Re-run full ingest for a list of (doc_id, file_path) sequentially."""
|
||||
for doc_id, file_path in docs:
|
||||
suffix = os.path.splitext(file_path)[1].lower()
|
||||
try:
|
||||
if suffix == ".epub":
|
||||
from scripts.ingest_epub import run
|
||||
else:
|
||||
from scripts.ingest_pdf import run
|
||||
logger.info("Auto re-embed: starting %s", os.path.basename(file_path))
|
||||
run(doc_id=doc_id, file_path=file_path, db_path=db_path, vec_db_path=vec_db_path)
|
||||
except Exception as exc:
|
||||
logger.error("Auto re-embed failed for doc %s: %s", doc_id[:8], exc)
|
||||
|
||||
|
||||
def _check_vec_schema(vec_db_path: str, expected_dims: int, db_path: str) -> None:
|
||||
"""Drop the vec DB if its stored dimension doesn't match config, then queue re-embed.
|
||||
|
||||
sqlite-vec bakes the embedding dimension into the virtual table DDL, so changing
|
||||
models requires dropping and recreating the whole file. Catches the mismatch at
|
||||
startup rather than surfacing it as an obscure OperationalError mid-request.
|
||||
"""
|
||||
if not os.path.exists(vec_db_path):
|
||||
return
|
||||
try:
|
||||
conn = sqlite3.connect(vec_db_path)
|
||||
row = conn.execute(
|
||||
"SELECT sql FROM sqlite_master WHERE name='page_vecs_vecs'"
|
||||
).fetchone()
|
||||
conn.close()
|
||||
except Exception as exc:
|
||||
logger.warning("Vec schema check could not read %s (non-fatal): %s", vec_db_path, exc)
|
||||
return
|
||||
|
||||
if not row:
|
||||
return # table not yet created — first embed will build it with the right dims
|
||||
|
||||
m = re.search(r'float\[(\d+)\]', row[0])
|
||||
if not m:
|
||||
return
|
||||
actual_dims = int(m.group(1))
|
||||
if actual_dims == expected_dims:
|
||||
return
|
||||
|
||||
logger.warning(
|
||||
"Vec DB dimension mismatch: stored=%d, configured=%d — dropping %s and queuing re-embed",
|
||||
actual_dims, expected_dims, vec_db_path,
|
||||
)
|
||||
try:
|
||||
os.remove(vec_db_path)
|
||||
except OSError as exc:
|
||||
logger.error(
|
||||
"Could not delete stale vec DB %s: %s — fix permissions and restart", vec_db_path, exc
|
||||
)
|
||||
return
|
||||
|
||||
# Collect all ready docs so we can rebuild their embeddings in the background.
|
||||
try:
|
||||
conn = sqlite3.connect(db_path)
|
||||
docs = conn.execute(
|
||||
"SELECT id, file_path FROM documents WHERE status='ready'"
|
||||
).fetchall()
|
||||
conn.close()
|
||||
except Exception as exc:
|
||||
logger.warning("Could not query documents for re-embed: %s", exc)
|
||||
return
|
||||
|
||||
if not docs:
|
||||
return
|
||||
|
||||
logger.info("Queuing re-embed for %d document(s) in background", len(docs))
|
||||
threading.Thread(
|
||||
target=_reembed_docs,
|
||||
args=(docs, db_path, vec_db_path),
|
||||
daemon=True,
|
||||
name="pagepiper-reembed",
|
||||
).start()
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
from app.cloud_session import CLOUD_MODE
|
||||
from app.config import LOCAL_USER_ID
|
||||
from app.startup import apply_migrations, check_and_rebuild_vec_schema
|
||||
|
||||
_apply_migrations()
|
||||
embed_model = os.environ.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text")
|
||||
logger.info("Pagepiper starting — embed model: %s, dims: %d", embed_model, VEC_DIMENSIONS)
|
||||
|
||||
if CLOUD_MODE:
|
||||
from app.startup import warn_if_unencrypted
|
||||
from app.config import DATA_DIR
|
||||
warn_if_unencrypted(str(DATA_DIR))
|
||||
else:
|
||||
# In cloud mode, per-user migration and vec schema check run on first request (deps.py).
|
||||
apply_migrations(DB_PATH)
|
||||
check_and_rebuild_vec_schema(VEC_DB_PATH, VEC_DIMENSIONS, DB_PATH)
|
||||
_get_bm25_for(LOCAL_USER_ID).mark_dirty()
|
||||
|
||||
_check_vec_schema(VEC_DB_PATH, VEC_DIMENSIONS, DB_PATH)
|
||||
_bm25.mark_dirty() # will rebuild on first search
|
||||
yield
|
||||
|
||||
|
||||
app = FastAPI(title="Pagepiper", lifespan=lifespan)
|
||||
|
||||
# Wire BM25 dirty callback into library router
|
||||
from app.api import library as _lib_module # noqa: E402
|
||||
_lib_module._mark_bm25_dirty = _bm25.mark_dirty
|
||||
|
||||
# Register routers
|
||||
from app.api.library import router as library_router # noqa: E402
|
||||
from app.api.ingest import router as ingest_router # noqa: E402
|
||||
|
|
|
|||
|
|
@ -170,30 +170,7 @@ class Retriever:
|
|||
vec = (1.0 / (1.0 + r.vector_score)) if r.vector_score is not None else 0.0
|
||||
return bm25 * 0.5 + vec * 0.5
|
||||
|
||||
all_ranked = sorted(merged.values(), key=_combined, reverse=True)
|
||||
|
||||
# Discard results where the best match is pure noise (neither BM25 term
|
||||
# overlap nor vector similarity exceeded the minimum signal threshold).
|
||||
# This lets the caller's empty-result guard fire instead of sending
|
||||
# low-confidence chunks to the LLM where it fills gaps with training data.
|
||||
MIN_SIGNAL = 0.01
|
||||
if all_ranked and _combined(all_ranked[0]) < MIN_SIGNAL:
|
||||
return []
|
||||
|
||||
# Cap per-document contribution to max_per_doc of top_k so that one book
|
||||
# does not crowd out all slots when the query matches it heavily by name
|
||||
# alone (e.g. a character name that appears in every chapter).
|
||||
max_per_doc = max(2, top_k // 3)
|
||||
ranked: list[RetrievedChunk] = []
|
||||
doc_counts: dict[str, int] = {}
|
||||
for r in all_ranked:
|
||||
if len(ranked) >= top_k:
|
||||
break
|
||||
count = doc_counts.get(r.doc_id, 0)
|
||||
if count < max_per_doc:
|
||||
ranked.append(r)
|
||||
doc_counts[r.doc_id] = count + 1
|
||||
|
||||
ranked = sorted(merged.values(), key=_combined, reverse=True)[:top_k]
|
||||
adjacent = _fetch_adjacent(ranked, db_path)
|
||||
return ranked + adjacent
|
||||
|
||||
|
|
@ -212,8 +189,5 @@ class Retriever:
|
|||
)
|
||||
for r in self._bm25.query(query, top_k=top_k, doc_ids=doc_ids)
|
||||
]
|
||||
MIN_SIGNAL = 0.01
|
||||
if hits and hits[0].bm25_score < MIN_SIGNAL:
|
||||
return []
|
||||
adjacent = _fetch_adjacent(hits, db_path)
|
||||
return hits + adjacent
|
||||
|
|
|
|||
|
|
@ -11,52 +11,12 @@ from dataclasses import dataclass
|
|||
from app.services.retriever import RetrievedChunk
|
||||
|
||||
_SYSTEM_PROMPT = (
|
||||
"You are a strict document retrieval assistant. "
|
||||
"Your sole job is to extract and present information from the document excerpts given to you. "
|
||||
"You have no memory of books, stories, or authors. "
|
||||
"If the excerpts do not contain the answer, say so and stop. Never guess."
|
||||
"You are a helpful document assistant. "
|
||||
"Answer the user's question using ONLY the provided document excerpts. "
|
||||
"For each claim, cite the source page as [p.N]. "
|
||||
"If the excerpts are insufficient, say so. Do not invent information."
|
||||
)
|
||||
|
||||
_NO_RESULTS_ANSWER = (
|
||||
"I could not find any relevant passages in the indexed documents for that question. "
|
||||
"Try rephrasing, or check that the relevant document has been ingested."
|
||||
)
|
||||
|
||||
# Phrases the model uses when it escapes the provided context and pulls from
|
||||
# training data. Any response containing one of these is replaced with the
|
||||
# canned no-answer message.
|
||||
_ESCAPE_PHRASES = [
|
||||
"in the series",
|
||||
"in the novel",
|
||||
"in the book",
|
||||
"in the context of the series",
|
||||
"it can be assumed",
|
||||
"based on my knowledge",
|
||||
"based on the broader",
|
||||
"the broader story",
|
||||
"by terry goodkind",
|
||||
"sword of truth",
|
||||
"legend of the seeker",
|
||||
"throughout the series",
|
||||
"throughout the novel",
|
||||
"throughout the book",
|
||||
]
|
||||
|
||||
|
||||
def _strip_escape(response: str) -> str:
|
||||
"""Replace responses that leaked outside the provided context with the canned message.
|
||||
|
||||
Detects the 'helpful override' pattern where the model acknowledges the
|
||||
excerpts lack the answer but supplements from training data anyway.
|
||||
"""
|
||||
lower = response.lower()
|
||||
if any(phrase in lower for phrase in _ESCAPE_PHRASES):
|
||||
return (
|
||||
"I could not find an answer to that question in the indexed documents. "
|
||||
"The answer may be in a document that has not been ingested yet."
|
||||
)
|
||||
return response
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Citation:
|
||||
|
|
@ -82,30 +42,13 @@ class Synthesizer:
|
|||
history: list[dict],
|
||||
chunks: list[RetrievedChunk],
|
||||
) -> SynthesisResult:
|
||||
if not chunks:
|
||||
return SynthesisResult(answer=_NO_RESULTS_ANSWER, citations=())
|
||||
|
||||
# 1500 chars (~300 words) per chunk: enough to capture definitions that
|
||||
# appear mid-paragraph without blowing past a 32k-context model's limit.
|
||||
context_parts = [f"[p.{c.page_number}]\n{c.text[:1500]}" for c in chunks]
|
||||
context = "\n\n---\n\n".join(context_parts)
|
||||
# Quote-first structure: the model must commit to a grounding passage
|
||||
# before generating an answer. Forces an explicit "NOT FOUND" admission
|
||||
# when the excerpt doesn't contain the answer, rather than the "the excerpt
|
||||
# doesn't say... however, in the series..." escape pattern.
|
||||
prompt = (
|
||||
f"Excerpts from the indexed documents:\n\n{context}\n\n"
|
||||
f"---\n\n"
|
||||
f"Question: {message}\n\n"
|
||||
f"Step 1 — Find the relevant passage: Quote the exact sentence(s) from "
|
||||
f"the excerpts above that answer the question, or write NOT FOUND.\n\n"
|
||||
f"Step 2 — Answer: Based solely on what you quoted in Step 1, answer "
|
||||
f"the question with page citations [p.N]. If Step 1 is NOT FOUND, "
|
||||
f"write: \"I could not find an answer to that question in the indexed documents.\""
|
||||
)
|
||||
prompt = f"Document excerpts:\n\n{context}\n\nQuestion: {message}"
|
||||
|
||||
answer = self._llm.complete(prompt, system=_SYSTEM_PROMPT)
|
||||
answer = _strip_escape(answer)
|
||||
|
||||
citations = tuple(
|
||||
Citation(
|
||||
|
|
|
|||
137
app/startup.py
137
app/startup.py
|
|
@ -1,137 +0,0 @@
|
|||
# app/startup.py
|
||||
"""DB migration and vec schema check utilities — called at startup and on first user request."""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import sqlite3
|
||||
import subprocess
|
||||
import threading
|
||||
|
||||
logger = logging.getLogger("pagepiper")
|
||||
|
||||
|
||||
def warn_if_unencrypted(data_dir: str) -> None:
|
||||
"""Log a warning if cloud mode is running without fscrypt encryption.
|
||||
|
||||
Checks whether the users/ subdirectory of data_dir is fscrypt-encrypted.
|
||||
Non-fatal: warns but does not block startup.
|
||||
"""
|
||||
users_dir = os.path.join(data_dir, "users")
|
||||
os.makedirs(users_dir, exist_ok=True)
|
||||
|
||||
if not _fscrypt_available():
|
||||
logger.warning(
|
||||
"SECURITY: fscrypt not found on this system. Cloud user data at %s is stored "
|
||||
"unencrypted. Install fscrypt and run scripts/setup_cloud_fscrypt.sh to enable "
|
||||
"encryption at rest.",
|
||||
users_dir,
|
||||
)
|
||||
return
|
||||
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["fscrypt", "status", users_dir],
|
||||
capture_output=True, text=True, timeout=5,
|
||||
)
|
||||
if "Encrypted" not in result.stdout:
|
||||
logger.warning(
|
||||
"SECURITY: user data directory %s is not fscrypt-encrypted. "
|
||||
"Run: sudo scripts/setup_cloud_fscrypt.sh <user_id>",
|
||||
users_dir,
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.debug("fscrypt status check failed (non-fatal): %s", exc)
|
||||
|
||||
|
||||
def _fscrypt_available() -> bool:
|
||||
try:
|
||||
subprocess.run(["fscrypt", "--version"], capture_output=True, timeout=2)
|
||||
return True
|
||||
except (FileNotFoundError, subprocess.TimeoutExpired):
|
||||
return False
|
||||
|
||||
|
||||
def apply_migrations(db_path: str) -> None:
|
||||
from scripts.db_migrate import migrate
|
||||
migrate(db_path)
|
||||
|
||||
|
||||
def reembed_docs(docs: list[tuple[str, str]], db_path: str, vec_db_path: str) -> None:
|
||||
for doc_id, file_path in docs:
|
||||
suffix = os.path.splitext(file_path)[1].lower()
|
||||
try:
|
||||
if suffix == ".epub":
|
||||
from scripts.ingest_epub import run
|
||||
elif suffix == ".docx":
|
||||
from scripts.ingest_docx import run
|
||||
else:
|
||||
from scripts.ingest_pdf import run
|
||||
logger.info("Auto re-embed: starting %s", os.path.basename(file_path))
|
||||
run(doc_id=doc_id, file_path=file_path, db_path=db_path, vec_db_path=vec_db_path)
|
||||
except Exception as exc:
|
||||
logger.error("Auto re-embed failed for doc %s: %s", doc_id[:8], exc)
|
||||
|
||||
|
||||
def check_and_rebuild_vec_schema(vec_db_path: str, expected_dims: int, db_path: str) -> None:
|
||||
"""Drop the vec DB if its stored dimension doesn't match config, then queue re-embed.
|
||||
|
||||
sqlite-vec bakes the embedding dimension into the virtual table DDL, so changing
|
||||
models requires dropping and recreating the whole file. Catches the mismatch at
|
||||
startup rather than surfacing it as an obscure OperationalError mid-request.
|
||||
"""
|
||||
if not os.path.exists(vec_db_path):
|
||||
return
|
||||
try:
|
||||
conn = sqlite3.connect(vec_db_path)
|
||||
row = conn.execute(
|
||||
"SELECT sql FROM sqlite_master WHERE name='page_vecs_vecs'"
|
||||
).fetchone()
|
||||
conn.close()
|
||||
except Exception as exc:
|
||||
logger.warning("Vec schema check could not read %s (non-fatal): %s", vec_db_path, exc)
|
||||
return
|
||||
|
||||
if not row:
|
||||
return
|
||||
|
||||
m = re.search(r'float\[(\d+)\]', row[0])
|
||||
if not m:
|
||||
return
|
||||
actual_dims = int(m.group(1))
|
||||
if actual_dims == expected_dims:
|
||||
return
|
||||
|
||||
logger.warning(
|
||||
"Vec DB dimension mismatch: stored=%d, configured=%d — dropping %s and queuing re-embed",
|
||||
actual_dims, expected_dims, vec_db_path,
|
||||
)
|
||||
try:
|
||||
os.remove(vec_db_path)
|
||||
except OSError as exc:
|
||||
logger.error(
|
||||
"Could not delete stale vec DB %s: %s — fix permissions and restart", vec_db_path, exc
|
||||
)
|
||||
return
|
||||
|
||||
try:
|
||||
conn = sqlite3.connect(db_path)
|
||||
docs = conn.execute(
|
||||
"SELECT id, file_path FROM documents WHERE status='ready'"
|
||||
).fetchall()
|
||||
conn.close()
|
||||
except Exception as exc:
|
||||
logger.warning("Could not query documents for re-embed: %s", exc)
|
||||
return
|
||||
|
||||
if not docs:
|
||||
return
|
||||
|
||||
logger.info("Queuing re-embed for %d document(s) in background", len(docs))
|
||||
threading.Thread(
|
||||
target=reembed_docs,
|
||||
args=(docs, db_path, vec_db_path),
|
||||
daemon=True,
|
||||
name="pagepiper-reembed",
|
||||
).start()
|
||||
Binary file not shown.
|
Before Width: | Height: | Size: 94 KiB After Width: | Height: | Size: 26 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 128 KiB After Width: | Height: | Size: 25 KiB |
28
manage.sh
28
manage.sh
|
|
@ -12,8 +12,8 @@ OVERRIDE_ARGS=()
|
|||
[[ -f "compose.override.yml" ]] && OVERRIDE_ARGS=(-f compose.override.yml)
|
||||
|
||||
usage() {
|
||||
echo "Usage: $0 {start|stop|restart|status|logs [svc]|open|build|test|update"
|
||||
echo " |cloud-start|cloud-stop|cloud-restart|cloud-status|cloud-logs [svc]|cloud-build|cloud-update}"
|
||||
echo "Usage: $0 {start|stop|restart|status|logs [svc]|open|build|test"
|
||||
echo " |cloud:start|cloud:stop|cloud:restart|cloud:status|cloud:logs [svc]|cloud:build}"
|
||||
exit 1
|
||||
}
|
||||
|
||||
|
|
@ -48,39 +48,27 @@ case "$cmd" in
|
|||
test)
|
||||
conda run -n cf pytest tests/ -v
|
||||
;;
|
||||
update)
|
||||
git pull
|
||||
docker compose -f "$COMPOSE_FILE" "${OVERRIDE_ARGS[@]}" down
|
||||
docker compose -f "$COMPOSE_FILE" "${OVERRIDE_ARGS[@]}" up -d --build
|
||||
echo "Pagepiper updated and running → http://localhost:${WEB_PORT}"
|
||||
;;
|
||||
cloud-start)
|
||||
cloud:start)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" up -d --build
|
||||
echo "Pagepiper cloud running → http://localhost:${CLOUD_WEB_PORT}"
|
||||
;;
|
||||
cloud-stop)
|
||||
cloud:stop)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" down
|
||||
;;
|
||||
cloud-restart)
|
||||
cloud:restart)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" down
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" up -d --build
|
||||
echo "Pagepiper cloud running → http://localhost:${CLOUD_WEB_PORT}"
|
||||
;;
|
||||
cloud-status)
|
||||
cloud:status)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" ps
|
||||
;;
|
||||
cloud-logs)
|
||||
cloud:logs)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" logs -f "${1:-}"
|
||||
;;
|
||||
cloud-build)
|
||||
cloud:build)
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" build --no-cache
|
||||
;;
|
||||
cloud-update)
|
||||
git pull
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" down
|
||||
docker compose -f "$COMPOSE_CLOUD_FILE" -p "$CLOUD_PROJECT" up -d --build
|
||||
echo "Pagepiper cloud updated and running → http://localhost:${CLOUD_WEB_PORT}"
|
||||
;;
|
||||
*)
|
||||
usage
|
||||
;;
|
||||
|
|
|
|||
|
|
@ -1,120 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
# setup_cloud_fscrypt.sh — encrypt a cloud user's data directory with fscrypt.
|
||||
#
|
||||
# Run as root on the HOST (not inside the container) before first deployment.
|
||||
# Requires: fscrypt >= 0.3, Linux kernel >= 4.1, ext4/f2fs filesystem.
|
||||
#
|
||||
# Usage:
|
||||
# sudo ./scripts/setup_cloud_fscrypt.sh <user_id>
|
||||
# sudo ./scripts/setup_cloud_fscrypt.sh --list # show all encrypted dirs
|
||||
# sudo ./scripts/setup_cloud_fscrypt.sh --status <user_id>
|
||||
#
|
||||
# Environment:
|
||||
# PAGEPIPER_DATA_DIR — base data directory (default: /devl/pagepiper-cloud-data)
|
||||
#
|
||||
# Key management:
|
||||
# Keys are stored in the system protector backed by a passphrase or root keyring.
|
||||
# For unattended unlock on server boot, use a raw_key protector derived from a
|
||||
# secret in HashiCorp Vault or similar; see docs/encryption.md for details.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
DATA_DIR="${PAGEPIPER_DATA_DIR:-/devl/pagepiper-cloud-data}"
|
||||
USERS_DIR="$DATA_DIR/users"
|
||||
|
||||
_usage() {
|
||||
grep '^# ' "$0" | cut -c3-
|
||||
exit 1
|
||||
}
|
||||
|
||||
_require_root() {
|
||||
if [[ "$EUID" -ne 0 ]]; then
|
||||
echo "ERROR: this script must be run as root" >&2
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
_require_fscrypt() {
|
||||
if ! command -v fscrypt &>/dev/null; then
|
||||
echo "ERROR: fscrypt not found. Install with: apt-get install fscrypt" >&2
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
_check_fscrypt_setup() {
|
||||
local mnt
|
||||
mnt=$(df -P "$DATA_DIR" | tail -1 | awk '{print $6}')
|
||||
if ! fscrypt status "$mnt" &>/dev/null; then
|
||||
echo "Initialising fscrypt on $mnt..."
|
||||
fscrypt setup --quiet "$mnt"
|
||||
echo "fscrypt setup complete on $mnt"
|
||||
fi
|
||||
}
|
||||
|
||||
cmd="${1:-}"
|
||||
case "$cmd" in
|
||||
--list)
|
||||
_require_root
|
||||
_require_fscrypt
|
||||
echo "Encrypted user directories under $USERS_DIR:"
|
||||
find "$USERS_DIR" -maxdepth 1 -mindepth 1 -type d 2>/dev/null | while read -r dir; do
|
||||
if fscrypt status "$dir" 2>/dev/null | grep -q "Encrypted"; then
|
||||
echo " [encrypted] $dir"
|
||||
else
|
||||
echo " [plain] $dir"
|
||||
fi
|
||||
done
|
||||
;;
|
||||
--status)
|
||||
_require_root
|
||||
_require_fscrypt
|
||||
user_id="${2:-}"
|
||||
[[ -z "$user_id" ]] && { echo "Usage: $0 --status <user_id>" >&2; exit 1; }
|
||||
user_dir="$USERS_DIR/$user_id"
|
||||
if [[ ! -d "$user_dir" ]]; then
|
||||
echo "Directory $user_dir does not exist"
|
||||
exit 1
|
||||
fi
|
||||
fscrypt status "$user_dir"
|
||||
;;
|
||||
"")
|
||||
_usage
|
||||
;;
|
||||
-*)
|
||||
_usage
|
||||
;;
|
||||
*)
|
||||
# Encrypt a user's directory
|
||||
user_id="$1"
|
||||
_require_root
|
||||
_require_fscrypt
|
||||
|
||||
user_dir="$USERS_DIR/$user_id"
|
||||
if [[ ! -d "$user_dir" ]]; then
|
||||
echo "Creating user directory: $user_dir"
|
||||
mkdir -p "$user_dir"
|
||||
fi
|
||||
|
||||
if fscrypt status "$user_dir" 2>/dev/null | grep -q "Encrypted"; then
|
||||
echo "Directory $user_dir is already encrypted."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Warn if directory contains existing data — fscrypt encrypt migrates in place
|
||||
if [[ -n "$(ls -A "$user_dir")" ]]; then
|
||||
echo "WARNING: $user_dir is non-empty. fscrypt will encrypt files in place."
|
||||
echo "Ensure the container is stopped and you have a backup before continuing."
|
||||
read -rp "Continue? [y/N] " confirm
|
||||
[[ "$confirm" =~ ^[Yy]$ ]] || exit 1
|
||||
fi
|
||||
|
||||
_check_fscrypt_setup
|
||||
|
||||
echo "Encrypting $user_dir..."
|
||||
fscrypt encrypt "$user_dir" --source=pam_passphrase --quiet
|
||||
echo "Encryption set up for user $user_id. Directory: $user_dir"
|
||||
echo ""
|
||||
echo "IMPORTANT: unlock the directory before starting the container:"
|
||||
echo " fscrypt unlock $user_dir"
|
||||
;;
|
||||
esac
|
||||
|
|
@ -27,32 +27,13 @@ def client(test_db, tmp_path, monkeypatch):
|
|||
(tmp_path / "books").mkdir(exist_ok=True)
|
||||
|
||||
import app.main as _main_module
|
||||
from app.config import LOCAL_USER_ID
|
||||
from app.deps import UserCtx, get_db, get_user_ctx
|
||||
from app.main import app
|
||||
from app.services.bm25_index import BM25Index
|
||||
from app.startup import apply_migrations, check_and_rebuild_vec_schema
|
||||
from app.main import app, _bm25
|
||||
from app.deps import get_db
|
||||
|
||||
monkeypatch.setattr(_main_module, "_apply_migrations", lambda: None, raising=False)
|
||||
monkeypatch.setattr(
|
||||
"app.startup.apply_migrations", lambda *a, **kw: None
|
||||
)
|
||||
monkeypatch.setattr(
|
||||
"app.startup.check_and_rebuild_vec_schema", lambda *a, **kw: None
|
||||
)
|
||||
|
||||
test_bm25 = BM25Index()
|
||||
test_bm25.mark_dirty()
|
||||
|
||||
def override_user_ctx():
|
||||
return UserCtx(
|
||||
user_id=LOCAL_USER_ID,
|
||||
db_path=test_db,
|
||||
vec_db_path=str(tmp_path / "test_vecs.db"),
|
||||
data_dir=Path(tmp_path),
|
||||
watch_dir=Path(tmp_path) / "books",
|
||||
bm25=test_bm25,
|
||||
)
|
||||
# Suppress startup side effects — test_db fixture already applies the schema,
|
||||
# and vec schema validation is tested separately in test_startup.py
|
||||
monkeypatch.setattr(_main_module, "_apply_migrations", lambda: None)
|
||||
monkeypatch.setattr(_main_module, "_check_vec_schema", lambda *a, **kw: None)
|
||||
|
||||
def override_db():
|
||||
conn = sqlite3.connect(test_db)
|
||||
|
|
@ -63,7 +44,7 @@ def client(test_db, tmp_path, monkeypatch):
|
|||
finally:
|
||||
conn.close()
|
||||
|
||||
app.dependency_overrides[get_user_ctx] = override_user_ctx
|
||||
app.dependency_overrides[get_db] = override_db
|
||||
_bm25.mark_dirty() # clear any state from previous tests
|
||||
yield TestClient(app)
|
||||
app.dependency_overrides.clear()
|
||||
|
|
|
|||
|
|
@ -66,40 +66,3 @@ def test_reingest_updates_status_to_processing(client, test_db, tmp_path):
|
|||
# Document should be in processing state (or beyond if stub ingest ran instantly)
|
||||
status_resp = client.get(f"/api/library/{doc_id}/status")
|
||||
assert status_resp.json()["status"] in ("processing", "error", "ready")
|
||||
|
||||
|
||||
def _add_chunks(db_path: str, doc_id: str, count: int) -> None:
|
||||
conn = sqlite3.connect(db_path)
|
||||
for i in range(count):
|
||||
conn.execute(
|
||||
"INSERT INTO page_chunks(doc_id, page_number, text, source, word_count) VALUES (?,?,?,?,?)",
|
||||
[doc_id, i + 1, f"Page {i + 1} text content.", "text_layer", 4],
|
||||
)
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
|
||||
def test_sample_chunks_empty_returns_empty(client):
|
||||
resp = client.get("/api/library/sample-chunks")
|
||||
assert resp.status_code == 200
|
||||
assert resp.json() == []
|
||||
|
||||
|
||||
def test_sample_chunks_returns_fields(client, test_db):
|
||||
doc_id = _add_doc(test_db, "Monster Manual", "/books/mm.pdf")
|
||||
_add_chunks(test_db, doc_id, 5)
|
||||
resp = client.get("/api/library/sample-chunks?limit=3")
|
||||
assert resp.status_code == 200
|
||||
chunks = resp.json()
|
||||
assert len(chunks) == 3
|
||||
for c in chunks:
|
||||
assert {"chunk_id", "doc_id", "page_number", "text"} == set(c.keys())
|
||||
assert c["doc_id"] == doc_id
|
||||
|
||||
|
||||
def test_sample_chunks_limit_capped_at_200(client, test_db):
|
||||
doc_id = _add_doc(test_db, "Big Book", "/books/big.pdf")
|
||||
_add_chunks(test_db, doc_id, 10)
|
||||
resp = client.get("/api/library/sample-chunks?limit=9999")
|
||||
assert resp.status_code == 200
|
||||
assert len(resp.json()) <= 200
|
||||
|
|
|
|||
|
|
@ -20,7 +20,9 @@ def _add_chunks(db_path: str, doc_id: str, chunks: list[dict]) -> None:
|
|||
conn.close()
|
||||
|
||||
|
||||
def test_search_returns_results(client, test_db):
|
||||
def test_search_returns_results(client, test_db, monkeypatch):
|
||||
import app.api.search as _search_mod
|
||||
monkeypatch.setattr(_search_mod, "_get_db_path", lambda: test_db)
|
||||
# BM25Okapi IDF is 0 when df == N/2 (e.g. 2 docs, 1 match → log(1.0) = 0).
|
||||
# Add a 3rd unrelated chunk so relevant terms score above zero.
|
||||
_add_chunks(test_db, "book-a", [
|
||||
|
|
@ -44,7 +46,9 @@ def test_search_empty_index_returns_empty(client):
|
|||
assert resp.json() == []
|
||||
|
||||
|
||||
def test_search_filters_by_doc_ids(client, test_db):
|
||||
def test_search_filters_by_doc_ids(client, test_db, monkeypatch):
|
||||
import app.api.search as _search_mod
|
||||
monkeypatch.setattr(_search_mod, "_get_db_path", lambda: test_db)
|
||||
# Three chunks so BM25Okapi IDF is non-zero for terms appearing in one doc.
|
||||
_add_chunks(test_db, "book-a", [
|
||||
{"page_number": 1, "text": "Grapple rules for melee attacks."},
|
||||
|
|
|
|||
|
|
@ -9,8 +9,7 @@ from unittest.mock import MagicMock, patch
|
|||
|
||||
import pytest
|
||||
|
||||
from app.startup import check_and_rebuild_vec_schema as _check_vec_schema
|
||||
from app.startup import reembed_docs as _reembed_docs
|
||||
from app.main import _check_vec_schema, _reembed_docs
|
||||
|
||||
|
||||
def _make_vec_db(path: str, dims: int) -> None:
|
||||
|
|
|
|||
Loading…
Reference in a new issue