Compare commits
No commits in common. "0795a9286c415a1ee6fea8241e059066a03d392f" and "2e24808d913853398ab83b570246ec6716aab719" have entirely different histories.
0795a9286c
...
2e24808d91
45 changed files with 110 additions and 2788 deletions
|
|
@ -10,9 +10,7 @@ PAGEPIPER_BOOKS_DIR=/devl/pagepiper-cloud-data/books
|
||||||
PAGEPIPER_OLLAMA_URL=
|
PAGEPIPER_OLLAMA_URL=
|
||||||
|
|
||||||
# Embedding and chat model selection (only used when PAGEPIPER_OLLAMA_URL is set)
|
# Embedding and chat model selection (only used when PAGEPIPER_OLLAMA_URL is set)
|
||||||
# mxbai-embed-large (1024-dim) is recommended; nomic-embed-text uses 768-dim
|
PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
||||||
PAGEPIPER_EMBED_MODEL=mxbai-embed-large
|
|
||||||
PAGEPIPER_EMBED_DIMS=1024
|
|
||||||
PAGEPIPER_CHAT_MODEL=mistral:7b
|
PAGEPIPER_CHAT_MODEL=mistral:7b
|
||||||
|
|
||||||
# Heimdall license server (optional — for per-user tier validation)
|
# Heimdall license server (optional — for per-user tier validation)
|
||||||
|
|
@ -22,17 +20,3 @@ HEIMDALL_ADMIN_TOKEN=
|
||||||
# cf-orch streaming proxy — coordinator product key
|
# cf-orch streaming proxy — coordinator product key
|
||||||
# Must match COORDINATOR_PRODUCT_KEYS["pagepiper"] in cf-orch.env on the coordinator
|
# Must match COORDINATOR_PRODUCT_KEYS["pagepiper"] in cf-orch.env on the coordinator
|
||||||
COORDINATOR_PAGEPIPER_KEY=
|
COORDINATOR_PAGEPIPER_KEY=
|
||||||
|
|
||||||
# cf-orch coordinator URL — routes chat/embed calls through managed GPU allocation
|
|
||||||
# CF_LICENSE_KEY is the auth token sent to the coordinator (same value as COORDINATOR_PAGEPIPER_KEY)
|
|
||||||
# Leave CF_ORCH_URL blank to skip allocation and hit PAGEPIPER_OLLAMA_URL directly
|
|
||||||
CF_ORCH_URL=
|
|
||||||
CF_LICENSE_KEY=
|
|
||||||
CF_APP_NAME=pagepiper
|
|
||||||
|
|
||||||
# Forgejo API token — enables in-app feedback button (files issues to Circuit-Forge/pagepiper)
|
|
||||||
FORGEJO_API_TOKEN=
|
|
||||||
|
|
||||||
# Enable thumbs up/down on chat answers (stores retrieval quality signals locally)
|
|
||||||
# Off by default — opt in when you want to collect correction data
|
|
||||||
# PAGEPIPER_CHAT_FEEDBACK=true
|
|
||||||
|
|
|
||||||
|
|
@ -10,11 +10,3 @@ PAGEPIPER_DATA_DIR=data
|
||||||
# PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
# PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
||||||
# PAGEPIPER_CHAT_MODEL=mistral:7b
|
# PAGEPIPER_CHAT_MODEL=mistral:7b
|
||||||
# PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
# PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
||||||
|
|
||||||
# Forgejo API token — enables the in-app feedback button (files Forgejo issues)
|
|
||||||
# Create a token at https://git.opensourcesolarpunk.com/user/settings/applications
|
|
||||||
# FORGEJO_API_TOKEN=
|
|
||||||
|
|
||||||
# Enable thumbs up/down on chat answers (stores retrieval quality signals locally)
|
|
||||||
# Off by default — opt in when you want to collect correction data
|
|
||||||
# PAGEPIPER_CHAT_FEEDBACK=true
|
|
||||||
|
|
|
||||||
|
|
@ -26,7 +26,6 @@ RUN conda run -n pagepiper pip install --no-cache-dir -e "/app/circuitforge-core
|
||||||
WORKDIR /app/pagepiper
|
WORKDIR /app/pagepiper
|
||||||
RUN conda run -n pagepiper pip install --no-cache-dir -e .
|
RUN conda run -n pagepiper pip install --no-cache-dir -e .
|
||||||
|
|
||||||
ENV API_PORT=8522
|
EXPOSE 8522
|
||||||
EXPOSE $API_PORT
|
CMD ["conda", "run", "--no-capture-output", "-n", "pagepiper", \
|
||||||
CMD conda run --no-capture-output -n pagepiper \
|
"uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8522"]
|
||||||
uvicorn app.main:app --host 0.0.0.0 --port ${API_PORT}
|
|
||||||
|
|
|
||||||
197
README.md
|
|
@ -1,197 +0,0 @@
|
||||||
# Pagepiper
|
|
||||||
|
|
||||||
**v0.1.0** | Self-hosted PDF and EPUB search for your personal library
|
|
||||||
|
|
||||||
Pagepiper lets you drop PDFs and EPUBs into a library, index them, and search across the full text. With [Ollama](https://ollama.com) configured, you also get hybrid vector search and an LLM (large language model) chat interface that cites specific page numbers when it answers.
|
|
||||||
|
|
||||||
Built for TTRPG (tabletop roleplaying game) players tired of ctrl-F'ing through Pathfinder core rulebooks. Works equally well for fan fiction EPUB collections, AO3 exports, and any personal document library.
|
|
||||||
|
|
||||||
Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Features
|
|
||||||
|
|
||||||
| Feature | Free tier | Paid (BYOK) |
|
|
||||||
|---------|-----------|-------------|
|
|
||||||
| PDF and EPUB upload via browser drag-and-drop | Yes | Yes |
|
|
||||||
| Directory scan for existing files | Yes | Yes |
|
|
||||||
| BM25 full-text search (no LLM required) | Yes | Yes |
|
|
||||||
| Unlimited local ingestion | Yes | Yes |
|
|
||||||
| Hybrid BM25 + k-NN vector search | No | Yes (local Ollama) |
|
|
||||||
| LLM chat with page-level citations | No | Yes (local Ollama) |
|
|
||||||
| Thumbs up / down feedback on answers | No | Yes |
|
|
||||||
|
|
||||||
BYOK (bring your own key) means you supply your own Ollama instance. No cloud API keys, no usage billing.
|
|
||||||
|
|
||||||
**BM25** (Best Match 25) is a keyword ranking algorithm. It works without any LLM and runs entirely inside the Docker container. **k-NN** (k-nearest neighbor) vector search uses embeddings to find passages that are semantically similar to your question, even when the exact words don't match.
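To make the keyword-ranking idea concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized passages. It is not Pagepiper's `BM25Index`; the function name `bm25_scores` and the parameter defaults `k1=1.5` and `b=0.75` are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str],
    docs: list[list[str]],
    k1: float = 1.5,   # assumed term-frequency saturation parameter
    b: float = 0.75,   # assumed length-normalization parameter
) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


passages = [
    ["fireball", "deals", "fire", "damage"],
    ["grapple", "rules"],
    ["fire", "resistance"],
]
print(bm25_scores(["fireball"], passages))
```

Only the first passage contains the query term, so it is the only one with a non-zero score. A passage that phrases the same rule in different words would score zero here, which is the gap that embedding-based k-NN search is meant to cover.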
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Tech Stack
|
|
||||||
|
|
||||||
- **Backend:** FastAPI + SQLite (BM25 via custom BM25Index, vectors via sqlite-vec)
|
|
||||||
- **Frontend:** Vue 3 SPA served by nginx
|
|
||||||
- **Embedding model:** `nomic-embed-text` via Ollama (768-dim, optional)
|
|
||||||
- **Chat LLM:** `mistral:7b` via Ollama (optional, any Ollama model works)
|
|
||||||
- **Deployment:** Docker Compose
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Start (Self-Hosting)
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
|
|
||||||
- PDFs or EPUBs you want to search
|
|
||||||
- Optional: [Ollama](https://ollama.com) for semantic search and RAG (retrieval-augmented generation) chat
|
|
||||||
|
|
||||||
### 1. Clone the repo
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
|
|
||||||
cd pagepiper
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Configure
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cp .env.example .env
|
|
||||||
```
|
|
||||||
|
|
||||||
Open `.env` and set your paths:
|
|
||||||
|
|
||||||
```dotenv
|
|
||||||
# Directory to scan for PDFs/EPUBs (used by the "Scan" button in the UI)
|
|
||||||
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs
|
|
||||||
|
|
||||||
# Where Pagepiper stores its SQLite index and uploaded files
|
|
||||||
PAGEPIPER_DATA_DIR=data
|
|
||||||
```
|
|
||||||
|
|
||||||
To unlock hybrid search and LLM chat, uncomment and set the Ollama block:
|
|
||||||
|
|
||||||
```dotenv
|
|
||||||
PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
|
||||||
PAGEPIPER_CHAT_MODEL=mistral:7b
|
|
||||||
PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Start
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./manage.sh start
|
|
||||||
```
|
|
||||||
|
|
||||||
Open [http://localhost:8521](http://localhost:8521).
|
|
||||||
|
|
||||||
### 4. Add documents
|
|
||||||
|
|
||||||
Two ways to add files:
|
|
||||||
|
|
||||||
**Upload via browser** (easiest for small collections): Click **Upload** in the Library view and select a PDF or EPUB. The file is saved to `data/uploads/` and indexing begins automatically.
|
|
||||||
|
|
||||||
**Scan a directory** (best for large collections): Set `PAGEPIPER_BOOKS_DIR` in your `.env` to a folder of PDFs/EPUBs, then click **Scan** in the Library view. Pagepiper finds all files recursively and queues them for indexing.
|
|
||||||
|
|
||||||
### 5. Search and chat
|
|
||||||
|
|
||||||
Switch to the **Chat** tab and ask questions. On the free tier, BM25 keyword search returns matching passages. With Ollama configured, you get semantic search and an LLM-generated answer with page-number citations.
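The README does not say how the BM25 and k-NN result lists are merged. One common approach is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name, the page identifiers, and the constant `k=60` are assumptions, not Pagepiper's implementation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of passage IDs into a single ranking.

    Each list contributes 1 / (k + rank) per item, so passages that rank
    well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


bm25_hits = ["p12", "p40", "p7"]   # hypothetical keyword results
knn_hits = ["p40", "p3", "p12"]    # hypothetical vector results
print(reciprocal_rank_fusion([bm25_hits, knn_hits]))
```

RRF only needs ranks, not raw scores, which avoids having to normalize BM25 scores against vector distances.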
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Ollama Setup (optional)
|
|
||||||
|
|
||||||
Install Ollama from [ollama.com](https://ollama.com), then pull the models:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama pull mistral:7b
|
|
||||||
ollama pull nomic-embed-text
|
|
||||||
```
|
|
||||||
|
|
||||||
On a headless Linux server, make Ollama listen on all interfaces so the Docker container can reach it:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
OLLAMA_HOST=0.0.0.0 ollama serve
|
|
||||||
```
|
|
||||||
|
|
||||||
On Docker Desktop (Linux or Mac), `host.docker.internal` resolves automatically. No extra network config needed.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Environment Variables
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `PAGEPIPER_BOOKS_DIR` | `./books` | Host directory to scan for PDFs and EPUBs |
|
|
||||||
| `PAGEPIPER_DATA_DIR` | `./data` | SQLite index and uploaded files live here |
|
|
||||||
| `PAGEPIPER_OLLAMA_URL` | *(unset)* | Ollama base URL; leave blank for BM25-only mode |
|
|
||||||
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model (`nomic-embed-text` is 768-dim; `mxbai-embed-large` is 1024-dim) |
|
|
||||||
| `PAGEPIPER_EMBED_DIMS` | `1024` | Must match the embedding model's output dimensions |
|
|
||||||
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat model; any Ollama model name works |
|
|
||||||
| `PAGEPIPER_CHAT_FEEDBACK` | *(unset)* | Set to `true` to enable thumbs up/down on chat answers |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Management
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./manage.sh start # Build and start (dev)
|
|
||||||
./manage.sh stop # Stop
|
|
||||||
./manage.sh restart # Restart
|
|
||||||
./manage.sh status # Show container status
|
|
||||||
./manage.sh logs [svc] # Tail logs (default: all services; pass 'api' or 'web' to filter)
|
|
||||||
./manage.sh open # Open the UI in your browser
|
|
||||||
./manage.sh build # Rebuild images without cache
|
|
||||||
|
|
||||||
./manage.sh cloud:start # Start the cloud managed instance (port 8533)
|
|
||||||
./manage.sh cloud:stop
|
|
||||||
./manage.sh cloud:restart
|
|
||||||
./manage.sh cloud:status
|
|
||||||
./manage.sh cloud:logs [svc]
|
|
||||||
./manage.sh cloud:build
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Cloud Managed Instance
|
|
||||||
|
|
||||||
The cloud deployment runs at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) and at `menagerie.circuitforge.tech/pagepiper`. It uses `compose.cloud.yml` with LLM inference routed through the cf-orch coordinator.
|
|
||||||
|
|
||||||
To run your own cloud-style deployment:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cp .env.cloud.example .env
|
|
||||||
# Edit .env: set PAGEPIPER_OLLAMA_URL and data paths
|
|
||||||
./manage.sh cloud:start
|
|
||||||
```
|
|
||||||
|
|
||||||
The cloud instance listens on port 8533. The API is internal-only; nginx proxies `/api/` to the backend.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Data and Backups
|
|
||||||
|
|
||||||
The `data/` directory contains the SQLite index database and all uploaded files. Back it up to preserve your index. Pagepiper indexes documents at ingest time. If you modify or replace a source file, use the re-index button on the document card to rebuild its entry.
|
|
||||||
|
|
||||||
Large PDFs (hundreds of pages) can take a few minutes to index. The status badge on the document card updates as indexing progresses.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Licensing
|
|
||||||
|
|
||||||
Pagepiper uses a split license:
|
|
||||||
|
|
||||||
- **MIT:** BM25 full-text search, document library management, ingest pipeline, EPUB support
|
|
||||||
- **BSL 1.1:** Hybrid vector search (embedding + k-NN), RAG chat, LLM integration
|
|
||||||
|
|
||||||
BSL 1.1 is free for personal non-commercial self-hosting. SaaS re-hosting or commercial redistribution requires a license from CircuitForge. BSL 1.1 converts to MIT after four years.
|
|
||||||
|
|
||||||
License keys: [circuitforge.tech](https://circuitforge.tech)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Contributing
|
|
||||||
|
|
||||||
Issues and PRs welcome at [git.opensourcesolarpunk.com/Circuit-Forge/pagepiper](https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper).
|
|
||||||
|
|
||||||
The ingest pipeline and BM25 index are MIT-licensed. If you build a better PDF parser or add support for additional formats (CBZ, MOBI, etc.), the community benefits directly.
|
|
||||||
|
|
@ -29,7 +29,7 @@ class ChatRequest(BaseModel):
|
||||||
message: str
|
message: str
|
||||||
history: list[ChatTurn] = []
|
history: list[ChatTurn] = []
|
||||||
doc_ids: list[str] | None = None
|
doc_ids: list[str] | None = None
|
||||||
top_k: int = 10
|
top_k: int = 5
|
||||||
|
|
||||||
|
|
||||||
class ChatResponse(BaseModel):
|
class ChatResponse(BaseModel):
|
||||||
|
|
@ -37,13 +37,6 @@ class ChatResponse(BaseModel):
|
||||||
citations: list[dict]
|
citations: list[dict]
|
||||||
|
|
||||||
|
|
||||||
class ChatFeedbackRequest(BaseModel):
|
|
||||||
rating: int # 1 = thumbs up, -1 = thumbs down
|
|
||||||
question: str = ""
|
|
||||||
answer: str = ""
|
|
||||||
doc_ids: list[str] = []
|
|
||||||
|
|
||||||
|
|
||||||
def _get_llm_router():
|
def _get_llm_router():
|
||||||
"""Return LLMRouter if Ollama configured, else None."""
|
"""Return LLMRouter if Ollama configured, else None."""
|
||||||
from app.config import get_llm_config
|
from app.config import get_llm_config
|
||||||
|
|
@ -132,31 +125,3 @@ def chat(req: ChatRequest) -> ChatResponse:
|
||||||
for c in result.citations
|
for c in result.citations
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@router.get("/feedback/status")
|
|
||||||
def chat_feedback_status() -> dict:
|
|
||||||
enabled = os.environ.get("PAGEPIPER_CHAT_FEEDBACK", "").lower() in ("1", "true", "yes")
|
|
||||||
return {"enabled": enabled}
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/feedback")
|
|
||||||
def submit_chat_feedback(req: ChatFeedbackRequest) -> dict:
|
|
||||||
import json
|
|
||||||
import sqlite3
|
|
||||||
|
|
||||||
if req.rating not in (1, -1):
|
|
||||||
from fastapi import HTTPException
|
|
||||||
raise HTTPException(status_code=422, detail="rating must be 1 or -1")
|
|
||||||
|
|
||||||
db_path = _get_db_path()
|
|
||||||
con = sqlite3.connect(db_path)
|
|
||||||
try:
|
|
||||||
con.execute(
|
|
||||||
"INSERT INTO chat_feedback (rating, question, answer, doc_ids) VALUES (?, ?, ?, ?)",
|
|
||||||
(req.rating, req.question[:2000], req.answer[:4000], json.dumps(req.doc_ids)),
|
|
||||||
)
|
|
||||||
con.commit()
|
|
||||||
finally:
|
|
||||||
con.close()
|
|
||||||
return {"ok": True}
|
|
||||||
|
|
|
||||||
|
|
@ -1,7 +0,0 @@
|
||||||
"""Feedback router — provided by circuitforge-core."""
|
|
||||||
from circuitforge_core.api import make_feedback_router
|
|
||||||
|
|
||||||
router = make_feedback_router(
|
|
||||||
repo="Circuit-Forge/pagepiper",
|
|
||||||
product="pagepiper",
|
|
||||||
)
|
|
||||||
|
|
@ -1,88 +0,0 @@
|
||||||
"""Screenshot attachment endpoint for in-app feedback.
|
|
||||||
|
|
||||||
After the cf-core feedback router creates a Forgejo issue, the frontend
|
|
||||||
can call POST /feedback/attach to upload a screenshot as a comment on that issue.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import base64
|
|
||||||
import os
|
|
||||||
|
|
||||||
import requests
|
|
||||||
from fastapi import APIRouter, HTTPException
|
|
||||||
from pydantic import BaseModel, Field
|
|
||||||
|
|
||||||
router = APIRouter()
|
|
||||||
|
|
||||||
_FORGEJO_BASE = os.environ.get(
|
|
||||||
"FORGEJO_API_URL", "https://git.opensourcesolarpunk.com/api/v1"
|
|
||||||
)
|
|
||||||
_REPO = "Circuit-Forge/pagepiper"
|
|
||||||
_MAX_BYTES = 5 * 1024 * 1024
|
|
||||||
|
|
||||||
|
|
||||||
class AttachRequest(BaseModel):
|
|
||||||
issue_number: int
|
|
||||||
filename: str = Field(default="screenshot.png", max_length=80)
|
|
||||||
image_b64: str # data URI or raw base64
|
|
||||||
|
|
||||||
|
|
||||||
class AttachResponse(BaseModel):
|
|
||||||
comment_url: str
|
|
||||||
|
|
||||||
|
|
||||||
def _forgejo_headers() -> dict[str, str]:
|
|
||||||
token = os.environ.get("FORGEJO_API_TOKEN", "")
|
|
||||||
return {"Authorization": f"token {token}"}
|
|
||||||
|
|
||||||
|
|
||||||
def _decode_image(image_b64: str) -> tuple[bytes, str]:
|
|
||||||
if image_b64.startswith("data:"):
|
|
||||||
header, _, data = image_b64.partition(",")
|
|
||||||
mime = header.split(";")[0].split(":")[1] if ":" in header else "image/png"
|
|
||||||
else:
|
|
||||||
data = image_b64
|
|
||||||
mime = "image/png"
|
|
||||||
return base64.b64decode(data), mime
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/attach", response_model=AttachResponse)
|
|
||||||
def attach_screenshot(payload: AttachRequest) -> AttachResponse:
|
|
||||||
token = os.environ.get("FORGEJO_API_TOKEN", "")
|
|
||||||
if not token:
|
|
||||||
raise HTTPException(status_code=503, detail="Feedback not configured.")
|
|
||||||
|
|
||||||
raw_bytes, mime = _decode_image(payload.image_b64)
|
|
||||||
if len(raw_bytes) > _MAX_BYTES:
|
|
||||||
raise HTTPException(
|
|
||||||
status_code=413,
|
|
||||||
detail=f"Screenshot exceeds 5 MB limit ({len(raw_bytes) // 1024} KB received).",
|
|
||||||
)
|
|
||||||
|
|
||||||
asset_resp = requests.post(
|
|
||||||
f"{_FORGEJO_BASE}/repos/{_REPO}/issues/{payload.issue_number}/assets",
|
|
||||||
headers=_forgejo_headers(),
|
|
||||||
files={"attachment": (payload.filename, raw_bytes, mime)},
|
|
||||||
timeout=20,
|
|
||||||
)
|
|
||||||
if not asset_resp.ok:
|
|
||||||
raise HTTPException(
|
|
||||||
status_code=502,
|
|
||||||
detail=f"Forgejo asset upload failed: {asset_resp.text[:200]}",
|
|
||||||
)
|
|
||||||
|
|
||||||
asset_url = asset_resp.json().get("browser_download_url", "")
|
|
||||||
comment_body = f"**Screenshot attached by reporter:**\n\n![screenshot]({asset_url})"
|
|
||||||
comment_resp = requests.post(
|
|
||||||
f"{_FORGEJO_BASE}/repos/{_REPO}/issues/{payload.issue_number}/comments",
|
|
||||||
headers={**_forgejo_headers(), "Content-Type": "application/json"},
|
|
||||||
json={"body": comment_body},
|
|
||||||
timeout=15,
|
|
||||||
)
|
|
||||||
if not comment_resp.ok:
|
|
||||||
raise HTTPException(
|
|
||||||
status_code=502,
|
|
||||||
detail=f"Forgejo comment failed: {comment_resp.text[:200]}",
|
|
||||||
)
|
|
||||||
|
|
||||||
return AttachResponse(comment_url=comment_resp.json().get("html_url", ""))
|
|
||||||
|
|
@ -12,13 +12,11 @@ import uuid
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Callable
|
from typing import Callable
|
||||||
|
|
||||||
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, UploadFile
|
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException
|
||||||
|
|
||||||
from app.config import WATCH_DIR, DB_PATH, VEC_DB_PATH, DATA_DIR
|
from app.config import WATCH_DIR, DB_PATH, VEC_DB_PATH
|
||||||
from app.deps import get_db
|
from app.deps import get_db
|
||||||
|
|
||||||
_MAX_UPLOAD_BYTES = 200 * 1024 * 1024 # 200 MB
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
router = APIRouter(prefix="/api/library", tags=["library"])
|
router = APIRouter(prefix="/api/library", tags=["library"])
|
||||||
|
|
||||||
|
|
@ -26,31 +24,15 @@ router = APIRouter(prefix="/api/library", tags=["library"])
|
||||||
_mark_bm25_dirty: Callable[[], None] | None = None
|
_mark_bm25_dirty: Callable[[], None] | None = None
|
||||||
|
|
||||||
|
|
||||||
_INGEST_TASKS = {
|
|
||||||
".pdf": "pagepiper/ingest_pdf",
|
|
||||||
".epub": "pagepiper/ingest_epub",
|
|
||||||
}
|
|
||||||
|
|
||||||
_INGEST_RUNNERS = {
|
|
||||||
".pdf": "scripts.ingest_pdf",
|
|
||||||
".epub": "scripts.ingest_epub",
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _dispatch_ingest(
|
def _dispatch_ingest(
|
||||||
doc_id: str,
|
doc_id: str,
|
||||||
file_path: str,
|
file_path: str,
|
||||||
background_tasks: BackgroundTasks,
|
background_tasks: BackgroundTasks,
|
||||||
) -> str:
|
) -> str:
|
||||||
"""Dispatch an ingest task. Tries cf-orch; falls back to BackgroundTasks."""
|
"""Dispatch an ingest task. Tries cf-orch; falls back to BackgroundTasks."""
|
||||||
import importlib
|
|
||||||
import os as _os
|
import os as _os
|
||||||
from pathlib import Path as _Path
|
from pathlib import Path as _Path
|
||||||
|
|
||||||
suffix = _Path(file_path).suffix.lower()
|
|
||||||
task_name = _INGEST_TASKS.get(suffix, "pagepiper/ingest_pdf")
|
|
||||||
runner_module = _INGEST_RUNNERS.get(suffix, "scripts.ingest_pdf")
|
|
||||||
|
|
||||||
# Read lazily so test fixtures (monkeypatch.setenv) take effect
|
# Read lazily so test fixtures (monkeypatch.setenv) take effect
|
||||||
_data_dir = _Path(_os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
_data_dir = _Path(_os.environ.get("PAGEPIPER_DATA_DIR", "data"))
|
||||||
task_id = str(uuid.uuid4())
|
task_id = str(uuid.uuid4())
|
||||||
|
|
@ -63,11 +45,11 @@ def _dispatch_ingest(
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from circuitforge_core.tasks import dispatch_task # type: ignore[import]
|
from circuitforge_core.tasks import dispatch_task # type: ignore[import]
|
||||||
task_id = dispatch_task(caller=task_name, args=args)
|
task_id = dispatch_task(caller="pagepiper/ingest_pdf", args=args)
|
||||||
logger.info("Dispatched cf-orch ingest task %s for doc %s", task_id, doc_id)
|
logger.info("Dispatched cf-orch ingest task %s for doc %s", task_id, doc_id)
|
||||||
except Exception:
|
except Exception:
|
||||||
mod = importlib.import_module(runner_module)
|
from scripts.ingest_pdf import run as run_ingest
|
||||||
background_tasks.add_task(_run_ingest_background, mod.run, args, task_id)
|
background_tasks.add_task(_run_ingest_background, run_ingest, args, task_id)
|
||||||
logger.info(
|
logger.info(
|
||||||
"cf-orch unavailable — running ingest in background thread (task %s)", task_id
|
"cf-orch unavailable — running ingest in background thread (task %s)", task_id
|
||||||
)
|
)
|
||||||
|
|
@ -107,7 +89,7 @@ def scan_library(
|
||||||
if not watch.exists():
|
if not watch.exists():
|
||||||
raise HTTPException(status_code=404, detail=f"Watch directory not found: {watch}")
|
raise HTTPException(status_code=404, detail=f"Watch directory not found: {watch}")
|
||||||
|
|
||||||
pdfs = list(watch.glob("**/*.pdf")) + list(watch.glob("**/*.epub"))
|
pdfs = list(watch.glob("**/*.pdf"))
|
||||||
queued = []
|
queued = []
|
||||||
|
|
||||||
for pdf_path in pdfs:
|
for pdf_path in pdfs:
|
||||||
|
|
@ -174,8 +156,7 @@ def delete_document(
|
||||||
# Remove embeddings from vector store
|
# Remove embeddings from vector store
|
||||||
try:
|
try:
|
||||||
from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore # type: ignore[import]
|
from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore # type: ignore[import]
|
||||||
from app.config import VEC_DIMENSIONS
|
store = LocalSQLiteVecStore(db_path=VEC_DB_PATH, table="page_vecs", dimensions=768)
|
||||||
store = LocalSQLiteVecStore(db_path=VEC_DB_PATH, table="page_vecs", dimensions=VEC_DIMENSIONS)
|
|
||||||
store.delete_where({"doc_id": doc_id})
|
store.delete_where({"doc_id": doc_id})
|
||||||
except Exception as exc:
|
except Exception as exc:
|
||||||
logger.warning("Could not remove vectors for doc %s: %s", doc_id, exc)
|
logger.warning("Could not remove vectors for doc %s: %s", doc_id, exc)
|
||||||
|
|
@ -184,20 +165,6 @@ def delete_document(
|
||||||
_mark_bm25_dirty()
|
_mark_bm25_dirty()
|
||||||
|
|
||||||
|
|
||||||
def _get_vec_count(doc_id: str) -> int:
|
|
||||||
"""Return how many vectors have been stored for this doc. Returns 0 on any error."""
|
|
||||||
try:
|
|
||||||
conn = sqlite3.connect(VEC_DB_PATH)
|
|
||||||
count = conn.execute(
|
|
||||||
"SELECT COUNT(*) FROM page_vecs_meta WHERE json_extract(metadata, '$.doc_id') = ?",
|
|
||||||
[doc_id],
|
|
||||||
).fetchone()[0]
|
|
||||||
conn.close()
|
|
||||||
return int(count)
|
|
||||||
except Exception:
|
|
||||||
return 0
|
|
||||||
|
|
||||||
|
|
||||||
@router.get("/{doc_id}/status")
|
@router.get("/{doc_id}/status")
|
||||||
def document_status(
|
def document_status(
|
||||||
doc_id: str,
|
doc_id: str,
|
||||||
|
|
@ -209,54 +176,4 @@ def document_status(
|
||||||
).fetchone()
|
).fetchone()
|
||||||
if not row:
|
if not row:
|
||||||
raise HTTPException(status_code=404, detail="Document not found")
|
raise HTTPException(status_code=404, detail="Document not found")
|
||||||
result = dict(row)
|
return dict(row)
|
||||||
result["vec_count"] = _get_vec_count(doc_id)
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/upload", status_code=202)
|
|
||||||
def upload_document(
|
|
||||||
file: UploadFile,
|
|
||||||
background_tasks: BackgroundTasks,
|
|
||||||
db: sqlite3.Connection = Depends(get_db),
|
|
||||||
) -> dict:
|
|
||||||
"""Accept a PDF/EPUB upload, save to data/uploads/, and queue for indexing."""
|
|
||||||
name = Path(file.filename or "").name
|
|
||||||
suffix = Path(name).suffix.lower()
|
|
||||||
if suffix not in _INGEST_TASKS:
|
|
||||||
raise HTTPException(status_code=400, detail="Supported formats: PDF, EPUB")
|
|
||||||
|
|
||||||
content = file.file.read()
|
|
||||||
if len(content) > _MAX_UPLOAD_BYTES:
|
|
||||||
raise HTTPException(status_code=413, detail="File exceeds 200 MB limit")
|
|
||||||
|
|
||||||
upload_dir = DATA_DIR / "uploads"
|
|
||||||
upload_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
dest = upload_dir / name
|
|
||||||
dest.write_bytes(content)
|
|
||||||
|
|
||||||
path_str = str(dest.resolve())
|
|
||||||
existing = db.execute(
|
|
||||||
"SELECT id, status FROM documents WHERE file_path = ?", [path_str]
|
|
||||||
).fetchone()
|
|
||||||
|
|
||||||
if existing and existing["status"] == "ready":
|
|
||||||
return {"doc_id": existing["id"], "task_id": None, "filename": name, "status": "already_indexed"}
|
|
||||||
|
|
||||||
if existing:
|
|
||||||
doc_id = existing["id"]
|
|
||||||
else:
|
|
||||||
title = dest.stem.replace("_", " ").replace("-", " ").title()
|
|
||||||
doc_id = db.execute(
|
|
||||||
"INSERT INTO documents(title, file_path, status) VALUES (?,?,?) RETURNING id",
|
|
||||||
[title, path_str, "pending"],
|
|
||||||
).fetchone()[0]
|
|
||||||
db.commit()
|
|
||||||
|
|
||||||
task_id = _dispatch_ingest(doc_id, path_str, background_tasks)
|
|
||||||
db.execute(
|
|
||||||
"UPDATE documents SET status='processing', task_id=? WHERE id=?",
|
|
||||||
[task_id, doc_id],
|
|
||||||
)
|
|
||||||
db.commit()
|
|
||||||
return {"doc_id": doc_id, "task_id": task_id, "filename": name, "status": "queued"}
|
|
||||||
|
|
|
||||||
|
|
@ -10,7 +10,6 @@ DATA_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
DB_PATH = str(DATA_DIR / "pagepiper.db")
|
DB_PATH = str(DATA_DIR / "pagepiper.db")
|
||||||
VEC_DB_PATH = str(DATA_DIR / "pagepiper_vecs.db")
|
VEC_DB_PATH = str(DATA_DIR / "pagepiper_vecs.db")
|
||||||
WATCH_DIR = Path(os.environ.get("PAGEPIPER_WATCH_DIR", "books"))
|
WATCH_DIR = Path(os.environ.get("PAGEPIPER_WATCH_DIR", "books"))
|
||||||
VEC_DIMENSIONS = int(os.environ.get("PAGEPIPER_EMBED_DIMS", "1024"))
|
|
||||||
|
|
||||||
|
|
||||||
def get_llm_config() -> dict | None:
|
def get_llm_config() -> dict | None:
|
||||||
|
|
@ -20,27 +19,17 @@ def get_llm_config() -> dict | None:
|
||||||
return None
|
return None
|
||||||
_clean = url.rstrip("/")
|
_clean = url.rstrip("/")
|
||||||
_base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
|
_base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
|
||||||
chat_model = os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b")
|
|
||||||
|
|
||||||
backend: dict = {
|
|
||||||
"type": "openai_compat",
|
|
||||||
"base_url": _base_url,
|
|
||||||
"model": chat_model,
|
|
||||||
"embedding_model": os.environ.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text"),
|
|
||||||
"supports_images": False,
|
|
||||||
}
|
|
||||||
|
|
||||||
# Wire cf-orch allocation when coordinator is configured so the model stays warm
|
|
||||||
# and cold-start latency doesn't cause chat timeouts.
|
|
||||||
orch_url = os.environ.get("CF_ORCH_URL", "").strip()
|
|
||||||
if orch_url:
|
|
||||||
backend["cf_orch"] = {
|
|
||||||
"service": "ollama",
|
|
||||||
"model_candidates": [chat_model],
|
|
||||||
"ttl_s": 3600,
|
|
||||||
}
|
|
||||||
|
|
||||||
return {
|
return {
|
||||||
"fallback_order": ["ollama"],
|
"fallback_order": ["ollama"],
|
||||||
"backends": {"ollama": backend},
|
"backends": {
|
||||||
|
"ollama": {
|
||||||
|
"type": "openai_compat",
|
||||||
|
"base_url": _base_url,
|
||||||
|
"model": os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
|
||||||
|
"embedding_model": os.environ.get(
|
||||||
|
"PAGEPIPER_EMBED_MODEL", "nomic-embed-text"
|
||||||
|
),
|
||||||
|
"supports_images": False,
|
||||||
|
}
|
||||||
|
},
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -9,7 +9,7 @@ from app.config import DB_PATH
|
||||||
|
|
||||||
|
|
||||||
def get_db() -> Generator[sqlite3.Connection, None, None]:
|
def get_db() -> Generator[sqlite3.Connection, None, None]:
|
||||||
conn = sqlite3.connect(DB_PATH, check_same_thread=False)
|
conn = sqlite3.connect(DB_PATH)
|
||||||
conn.execute("PRAGMA foreign_keys = ON")
|
conn.execute("PRAGMA foreign_keys = ON")
|
||||||
conn.execute("PRAGMA journal_mode = WAL")
|
conn.execute("PRAGMA journal_mode = WAL")
|
||||||
conn.row_factory = sqlite3.Row
|
conn.row_factory = sqlite3.Row
|
||||||
|
|
|
||||||
92
app/main.py
|
|
@ -3,15 +3,11 @@
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
import os
|
|
||||||
import re
|
|
||||||
import sqlite3
|
|
||||||
import threading
|
|
||||||
from contextlib import asynccontextmanager
|
from contextlib import asynccontextmanager
|
||||||
|
|
||||||
from fastapi import FastAPI
|
from fastapi import FastAPI
|
||||||
|
|
||||||
from app.config import DB_PATH, VEC_DB_PATH, VEC_DIMENSIONS
|
from app.config import DB_PATH
|
||||||
from app.services.bm25_index import BM25Index
|
from app.services.bm25_index import BM25Index
|
||||||
|
|
||||||
logger = logging.getLogger("pagepiper")
|
logger = logging.getLogger("pagepiper")
|
||||||
|
|
@ -25,91 +21,9 @@ def _apply_migrations() -> None:
|
||||||
migrate(DB_PATH)
|
migrate(DB_PATH)
|
||||||
|
|
||||||
|
|
||||||
def _reembed_docs(docs: list[tuple[str, str]], db_path: str, vec_db_path: str) -> None:
|
|
||||||
"""Re-run full ingest for a list of (doc_id, file_path) sequentially."""
|
|
||||||
for doc_id, file_path in docs:
|
|
||||||
suffix = os.path.splitext(file_path)[1].lower()
|
|
||||||
try:
|
|
||||||
if suffix == ".epub":
|
|
||||||
from scripts.ingest_epub import run
|
|
||||||
else:
|
|
||||||
from scripts.ingest_pdf import run
|
|
||||||
logger.info("Auto re-embed: starting %s", os.path.basename(file_path))
|
|
||||||
run(doc_id=doc_id, file_path=file_path, db_path=db_path, vec_db_path=vec_db_path)
|
|
||||||
except Exception as exc:
|
|
||||||
logger.error("Auto re-embed failed for doc %s: %s", doc_id[:8], exc)
|
|
||||||
|
|
||||||
|
|
||||||
def _check_vec_schema(vec_db_path: str, expected_dims: int, db_path: str) -> None:
|
|
||||||
"""Drop the vec DB if its stored dimension doesn't match config, then queue re-embed.
|
|
||||||
|
|
||||||
sqlite-vec bakes the embedding dimension into the virtual table DDL, so changing
|
|
||||||
models requires dropping and recreating the whole file. Catches the mismatch at
|
|
||||||
startup rather than surfacing it as an obscure OperationalError mid-request.
|
|
||||||
"""
|
|
||||||
if not os.path.exists(vec_db_path):
|
|
||||||
return
|
|
||||||
try:
|
|
||||||
conn = sqlite3.connect(vec_db_path)
|
|
||||||
row = conn.execute(
|
|
||||||
"SELECT sql FROM sqlite_master WHERE name='page_vecs_vecs'"
|
|
||||||
).fetchone()
|
|
||||||
conn.close()
|
|
||||||
except Exception as exc:
|
|
||||||
logger.warning("Vec schema check could not read %s (non-fatal): %s", vec_db_path, exc)
|
|
||||||
return
|
|
||||||
|
|
||||||
if not row:
|
|
||||||
return # table not yet created — first embed will build it with the right dims
|
|
||||||
|
|
||||||
m = re.search(r'float\[(\d+)\]', row[0])
|
|
||||||
if not m:
|
|
||||||
return
|
|
||||||
actual_dims = int(m.group(1))
|
|
||||||
if actual_dims == expected_dims:
|
|
||||||
return
|
|
||||||
|
|
||||||
logger.warning(
|
|
||||||
"Vec DB dimension mismatch: stored=%d, configured=%d — dropping %s and queuing re-embed",
|
|
||||||
actual_dims, expected_dims, vec_db_path,
|
|
||||||
)
|
|
||||||
try:
|
|
||||||
os.remove(vec_db_path)
|
|
||||||
except OSError as exc:
|
|
||||||
logger.error(
|
|
||||||
"Could not delete stale vec DB %s: %s — fix permissions and restart", vec_db_path, exc
|
|
||||||
)
|
|
||||||
return
|
|
||||||
|
|
||||||
# Collect all ready docs so we can rebuild their embeddings in the background.
|
|
||||||
try:
|
|
||||||
conn = sqlite3.connect(db_path)
|
|
||||||
docs = conn.execute(
|
|
||||||
"SELECT id, file_path FROM documents WHERE status='ready'"
|
|
||||||
).fetchall()
|
|
||||||
conn.close()
|
|
||||||
except Exception as exc:
|
|
||||||
logger.warning("Could not query documents for re-embed: %s", exc)
|
|
||||||
return
|
|
||||||
|
|
||||||
if not docs:
|
|
||||||
return
|
|
||||||
|
|
||||||
logger.info("Queuing re-embed for %d document(s) in background", len(docs))
|
|
||||||
threading.Thread(
|
|
||||||
target=_reembed_docs,
|
|
||||||
args=(docs, db_path, vec_db_path),
|
|
||||||
daemon=True,
|
|
||||||
name="pagepiper-reembed",
|
|
||||||
).start()
|
|
||||||
|
|
||||||
|
|
||||||
@asynccontextmanager
|
@asynccontextmanager
|
||||||
async def lifespan(app: FastAPI):
|
async def lifespan(app: FastAPI):
|
||||||
_apply_migrations()
|
_apply_migrations()
|
||||||
embed_model = os.environ.get("PAGEPIPER_EMBED_MODEL", "nomic-embed-text")
|
|
||||||
logger.info("Pagepiper starting — embed model: %s, dims: %d", embed_model, VEC_DIMENSIONS)
|
|
||||||
_check_vec_schema(VEC_DB_PATH, VEC_DIMENSIONS, DB_PATH)
|
|
||||||
_bm25.mark_dirty() # will rebuild on first search
|
_bm25.mark_dirty() # will rebuild on first search
|
||||||
yield
|
yield
|
||||||
|
|
||||||
|
|
@ -125,12 +39,8 @@ from app.api.library import router as library_router # noqa: E402
|
||||||
from app.api.ingest import router as ingest_router # noqa: E402
|
from app.api.ingest import router as ingest_router # noqa: E402
|
||||||
from app.api.search import router as search_router # noqa: E402
|
from app.api.search import router as search_router # noqa: E402
|
||||||
from app.api.chat import router as chat_router # noqa: E402
|
from app.api.chat import router as chat_router # noqa: E402
|
||||||
from app.api.feedback import router as feedback_router # noqa: E402
|
|
||||||
from app.api.feedback_attach import router as feedback_attach_router # noqa: E402
|
|
||||||
|
|
||||||
app.include_router(library_router)
|
app.include_router(library_router)
|
||||||
app.include_router(ingest_router)
|
app.include_router(ingest_router)
|
||||||
app.include_router(search_router)
|
app.include_router(search_router)
|
||||||
app.include_router(chat_router)
|
app.include_router(chat_router)
|
||||||
app.include_router(feedback_router, prefix="/api/v1/feedback")
|
|
||||||
app.include_router(feedback_attach_router, prefix="/api/v1/feedback")
|
|
||||||
|
|
|
||||||
|
|
@ -8,7 +8,6 @@ BM25-only path is MIT and has no gate.
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
import sqlite3
|
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
|
|
||||||
from app.services.bm25_index import BM25Index
|
from app.services.bm25_index import BM25Index
|
||||||
|
|
@ -16,62 +15,6 @@ from app.services.bm25_index import BM25Index
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
def _fetch_adjacent(
|
|
||||||
hits: list["RetrievedChunk"],
|
|
||||||
db_path: str,
|
|
||||||
window: int = 1,
|
|
||||||
) -> list["RetrievedChunk"]:
|
|
||||||
"""Return chunks immediately before/after each hit that aren't already in the hit set.
|
|
||||||
|
|
||||||
Definitional passages often start mid-sentence because the EPUB/PDF chunk
|
|
||||||
boundary fell mid-paragraph. Fetching the preceding chunk restores the subject
|
|
||||||
so the LLM can understand 'them' / 'they' references correctly.
|
|
||||||
"""
|
|
||||||
if not hits:
|
|
||||||
return []
|
|
||||||
|
|
||||||
existing_keys = {(c.doc_id, c.page_number) for c in hits}
|
|
||||||
needed: dict[str, set[int]] = {}
|
|
||||||
for c in hits:
|
|
||||||
for delta in range(-window, window + 1):
|
|
||||||
if delta == 0:
|
|
||||||
continue
|
|
||||||
adj_page = c.page_number + delta
|
|
||||||
if adj_page > 0 and (c.doc_id, adj_page) not in existing_keys:
|
|
||||||
needed.setdefault(c.doc_id, set()).add(adj_page)
|
|
||||||
|
|
||||||
if not needed:
|
|
||||||
return []
|
|
||||||
|
|
||||||
extra: list[RetrievedChunk] = []
|
|
||||||
try:
|
|
||||||
conn = sqlite3.connect(db_path)
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
for doc_id, pages in needed.items():
|
|
||||||
placeholders = ",".join("?" * len(pages))
|
|
||||||
rows = conn.execute(
|
|
||||||
f"SELECT id, doc_id, page_number, text FROM page_chunks "
|
|
||||||
f"WHERE doc_id=? AND page_number IN ({placeholders})",
|
|
||||||
[doc_id] + sorted(pages),
|
|
||||||
).fetchall()
|
|
||||||
for row in rows:
|
|
||||||
extra.append(
|
|
||||||
RetrievedChunk(
|
|
||||||
chunk_id=row["id"],
|
|
||||||
doc_id=row["doc_id"],
|
|
||||||
page_number=row["page_number"],
|
|
||||||
text=row["text"],
|
|
||||||
bm25_score=0.0,
|
|
||||||
vector_score=None,
|
|
||||||
)
|
|
||||||
)
|
|
||||||
conn.close()
|
|
||||||
except Exception as exc:
|
|
||||||
logger.warning("Context expansion query failed (non-fatal): %s", exc)
|
|
||||||
|
|
||||||
return extra
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class RetrievedChunk:
|
class RetrievedChunk:
|
||||||
"""A chunk returned by the retriever, with source scores."""
|
"""A chunk returned by the retriever, with source scores."""
|
||||||
|
|
@ -112,23 +55,13 @@ class Retriever:
|
||||||
for r in self._bm25.query(query, top_k=top_k * 2, doc_ids=doc_ids)
|
for r in self._bm25.query(query, top_k=top_k * 2, doc_ids=doc_ids)
|
||||||
}
|
}
|
||||||
|
|
||||||
try:
|
vec = llm.embed([query])[0]
|
||||||
vec = llm.embed([query])[0]
|
store = LocalSQLiteVecStore(db_path=vec_db_path, table="page_vecs", dimensions=768)
|
||||||
except Exception as exc:
|
filter_meta = {"doc_id": doc_ids[0]} if doc_ids and len(doc_ids) == 1 else None
|
||||||
logger.warning("Embed failed, falling back to BM25-only: %s", exc)
|
vec_hits = store.query(vec, top_k=top_k * 2, filter_metadata=filter_meta)
|
||||||
return self._bm25_only(query, top_k, doc_ids, db_path)
|
|
||||||
from app.config import VEC_DIMENSIONS
|
|
||||||
store = LocalSQLiteVecStore(db_path=vec_db_path, table="page_vecs", dimensions=VEC_DIMENSIONS)
|
|
||||||
|
|
||||||
# sqlite-vec applies filter_metadata as a Python post-filter after fetching k
|
if doc_ids and len(doc_ids) > 1:
|
||||||
# nearest globally. When the corpus spans many documents and only a subset is
|
vec_hits = [h for h in vec_hits if h.metadata.get("doc_id") in doc_ids]
|
||||||
# selected, most of those k candidates are from non-target docs and get dropped,
|
|
||||||
# leaving too few vector hits. Oversample heavily and filter in Python instead.
|
|
||||||
if doc_ids:
|
|
||||||
vec_candidates = store.query(vec, top_k=top_k * 20)
|
|
||||||
vec_hits = [h for h in vec_candidates if h.metadata.get("doc_id") in doc_ids]
|
|
||||||
else:
|
|
||||||
vec_hits = store.query(vec, top_k=top_k * 2)
|
|
||||||
|
|
||||||
# Merge: BM25 hits take priority; vector hits fill in additional results
|
# Merge: BM25 hits take priority; vector hits fill in additional results
|
||||||
merged: dict[str, RetrievedChunk] = {}
|
merged: dict[str, RetrievedChunk] = {}
|
||||||
|
|
@ -143,10 +76,10 @@ class Retriever:
|
||||||
)
|
)
|
||||||
for vh in vec_hits:
|
for vh in vec_hits:
|
||||||
# _chunks is the loaded list of dicts from BM25Index; no public accessor exists
|
# _chunks is the loaded list of dicts from BM25Index; no public accessor exists
|
||||||
text = next((c["text"] for c in self._bm25._chunks if c["id"] == vh.entry_id), "")
|
text = next((c["text"] for c in self._bm25._chunks if c["id"] == vh.id), "")
|
||||||
if vh.entry_id in merged:
|
if vh.id in merged:
|
||||||
existing = merged[vh.entry_id]
|
existing = merged[vh.id]
|
||||||
merged[vh.entry_id] = RetrievedChunk(
|
merged[vh.id] = RetrievedChunk(
|
||||||
chunk_id=existing.chunk_id,
|
chunk_id=existing.chunk_id,
|
||||||
doc_id=existing.doc_id,
|
doc_id=existing.doc_id,
|
||||||
page_number=existing.page_number,
|
page_number=existing.page_number,
|
||||||
|
|
@ -155,8 +88,8 @@ class Retriever:
|
||||||
vector_score=vh.score,
|
vector_score=vh.score,
|
||||||
)
|
)
|
||||||
else:
|
else:
|
||||||
merged[vh.entry_id] = RetrievedChunk(
|
merged[vh.id] = RetrievedChunk(
|
||||||
chunk_id=vh.entry_id,
|
chunk_id=vh.id,
|
||||||
doc_id=vh.metadata.get("doc_id", ""),
|
doc_id=vh.metadata.get("doc_id", ""),
|
||||||
page_number=int(vh.metadata.get("page_number", 0)),
|
page_number=int(vh.metadata.get("page_number", 0)),
|
||||||
text=text,
|
text=text,
|
||||||
|
|
@ -170,15 +103,14 @@ class Retriever:
|
||||||
vec = (1.0 / (1.0 + r.vector_score)) if r.vector_score is not None else 0.0
|
vec = (1.0 / (1.0 + r.vector_score)) if r.vector_score is not None else 0.0
|
||||||
return bm25 * 0.5 + vec * 0.5
|
return bm25 * 0.5 + vec * 0.5
|
||||||
|
|
||||||
ranked = sorted(merged.values(), key=_combined, reverse=True)[:top_k]
|
ranked = sorted(merged.values(), key=_combined, reverse=True)
|
||||||
adjacent = _fetch_adjacent(ranked, db_path)
|
return ranked[:top_k]
|
||||||
return ranked + adjacent
|
|
||||||
|
|
||||||
def _bm25_only(
|
def _bm25_only(
|
||||||
self, query: str, top_k: int, doc_ids: list[str] | None, db_path: str
|
self, query: str, top_k: int, doc_ids: list[str] | None, db_path: str
|
||||||
) -> list[RetrievedChunk]:
|
) -> list[RetrievedChunk]:
|
||||||
self._bm25.ensure_fresh(db_path)
|
self._bm25.ensure_fresh(db_path)
|
||||||
hits = [
|
return [
|
||||||
RetrievedChunk(
|
RetrievedChunk(
|
||||||
chunk_id=r.chunk_id,
|
chunk_id=r.chunk_id,
|
||||||
doc_id=r.doc_id,
|
doc_id=r.doc_id,
|
||||||
|
|
@ -189,5 +121,3 @@ class Retriever:
|
||||||
)
|
)
|
||||||
for r in self._bm25.query(query, top_k=top_k, doc_ids=doc_ids)
|
for r in self._bm25.query(query, top_k=top_k, doc_ids=doc_ids)
|
||||||
]
|
]
|
||||||
adjacent = _fetch_adjacent(hits, db_path)
|
|
||||||
return hits + adjacent
|
|
||||||
|
|
|
||||||
|
|
@ -42,9 +42,7 @@ class Synthesizer:
|
||||||
history: list[dict],
|
history: list[dict],
|
||||||
chunks: list[RetrievedChunk],
|
chunks: list[RetrievedChunk],
|
||||||
) -> SynthesisResult:
|
) -> SynthesisResult:
|
||||||
# 1500 chars (~300 words) per chunk: enough to capture definitions that
|
context_parts = [f"[p.{c.page_number}]\n{c.text[:500]}" for c in chunks]
|
||||||
# appear mid-paragraph without blowing past a 32k-context model's limit.
|
|
||||||
context_parts = [f"[p.{c.page_number}]\n{c.text[:1500]}" for c in chunks]
|
|
||||||
context = "\n\n---\n\n".join(context_parts)
|
context = "\n\n---\n\n".join(context_parts)
|
||||||
prompt = f"Document excerpts:\n\n{context}\n\nQuestion: {message}"
|
prompt = f"Document excerpts:\n\n{context}\n\nQuestion: {message}"
|
||||||
|
|
||||||
|
|
@ -54,7 +52,7 @@ class Synthesizer:
|
||||||
Citation(
|
Citation(
|
||||||
doc_id=c.doc_id,
|
doc_id=c.doc_id,
|
||||||
page_number=c.page_number,
|
page_number=c.page_number,
|
||||||
snippet=c.text[:400],
|
snippet=c.text[:200],
|
||||||
bm25_score=c.bm25_score,
|
bm25_score=c.bm25_score,
|
||||||
)
|
)
|
||||||
for c in chunks
|
for c in chunks
|
||||||
|
|
|
||||||
|
|
@ -20,8 +20,6 @@ services:
|
||||||
# cf-orch: route LLM inference through coordinator for managed GPU access
|
# cf-orch: route LLM inference through coordinator for managed GPU access
|
||||||
CF_ORCH_URL: http://host.docker.internal:7700
|
CF_ORCH_URL: http://host.docker.internal:7700
|
||||||
CF_APP_NAME: pagepiper
|
CF_APP_NAME: pagepiper
|
||||||
# CF_LICENSE_KEY is the auth token CFOrchClient sends to the coordinator
|
|
||||||
CF_LICENSE_KEY: ${COORDINATOR_PAGEPIPER_KEY:-}
|
|
||||||
COORDINATOR_URL: http://10.1.10.71:7700
|
COORDINATOR_URL: http://10.1.10.71:7700
|
||||||
COORDINATOR_PAGEPIPER_KEY: ${COORDINATOR_PAGEPIPER_KEY:-}
|
COORDINATOR_PAGEPIPER_KEY: ${COORDINATOR_PAGEPIPER_KEY:-}
|
||||||
extra_hosts:
|
extra_hosts:
|
||||||
|
|
|
||||||
15
compose.yml
|
|
@ -3,28 +3,21 @@ services:
|
||||||
build:
|
build:
|
||||||
context: ..
|
context: ..
|
||||||
dockerfile: pagepiper/Dockerfile
|
dockerfile: pagepiper/Dockerfile
|
||||||
|
network_mode: host
|
||||||
env_file: .env
|
env_file: .env
|
||||||
extra_hosts:
|
|
||||||
- "host.docker.internal:host-gateway"
|
|
||||||
volumes:
|
volumes:
|
||||||
- ./data:/app/pagepiper/data
|
- ./data:/app/pagepiper/data
|
||||||
- ${PAGEPIPER_BOOKS_DIR:-./books}:/books:ro
|
- ${PAGEPIPER_BOOKS_DIR:-./books}:/books:ro
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
networks:
|
|
||||||
- pagepiper-dev-net
|
|
||||||
|
|
||||||
web:
|
web:
|
||||||
build:
|
build:
|
||||||
context: .
|
context: .
|
||||||
dockerfile: docker/web/Dockerfile
|
dockerfile: docker/web/Dockerfile
|
||||||
ports:
|
ports:
|
||||||
- "${WEB_PORT:-8521}:80"
|
- "8521:80"
|
||||||
|
extra_hosts:
|
||||||
|
- "host.docker.internal:host-gateway"
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
depends_on:
|
depends_on:
|
||||||
- api
|
- api
|
||||||
networks:
|
|
||||||
- pagepiper-dev-net
|
|
||||||
|
|
||||||
networks:
|
|
||||||
pagepiper-dev-net:
|
|
||||||
driver: bridge
|
|
||||||
|
|
|
||||||
|
|
@ -8,10 +8,10 @@ server {
|
||||||
try_files $uri $uri/ /index.html;
|
try_files $uri $uri/ /index.html;
|
||||||
}
|
}
|
||||||
|
|
||||||
# Proxy API requests to FastAPI (Docker bridge, service name)
|
# Proxy API requests to FastAPI (host network, port 8522)
|
||||||
location /api/ {
|
location /api/ {
|
||||||
proxy_pass http://api:8522;
|
proxy_pass http://host.docker.internal:8522;
|
||||||
proxy_set_header Host $http_host;
|
proxy_set_header Host $host;
|
||||||
proxy_set_header X-Real-IP $remote_addr;
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -1,51 +0,0 @@
|
||||||
# Installation
|
|
||||||
|
|
||||||
Pagepiper runs as a Docker Compose stack: a FastAPI backend and a Vue 3 frontend served by nginx. No external services are required for the core BM25 search feature set.
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
- Docker and Docker Compose
|
|
||||||
- 1 GB disk for images, plus space for your document library
|
|
||||||
|
|
||||||
## Quick setup
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
|
|
||||||
cd pagepiper
|
|
||||||
cp .env.example .env
|
|
||||||
./manage.sh start
|
|
||||||
```
|
|
||||||
|
|
||||||
The web UI opens at `http://localhost:8521`.
|
|
||||||
|
|
||||||
## manage.sh commands
|
|
||||||
|
|
||||||
| Command | Description |
|
|
||||||
|---------|-------------|
|
|
||||||
| `./manage.sh start` | Start all services (builds on first run) |
|
|
||||||
| `./manage.sh stop` | Stop all services |
|
|
||||||
| `./manage.sh restart` | Rebuild and restart |
|
|
||||||
| `./manage.sh status` | Show running containers |
|
|
||||||
| `./manage.sh logs [api\|web]` | Tail logs |
|
|
||||||
| `./manage.sh build` | Rebuild images without starting |
|
|
||||||
| `./manage.sh test` | Run the test suite |
|
|
||||||
| `./manage.sh open` | Open browser to the web UI |
|
|
||||||
|
|
||||||
## Mounting a document directory
|
|
||||||
|
|
||||||
To scan an entire folder of PDFs and EPUBs at startup, set `PAGEPIPER_WATCH_DIR` in your `.env`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
PAGEPIPER_WATCH_DIR=/home/you/books
|
|
||||||
```
|
|
||||||
|
|
||||||
Then use the **Scan for PDFs** button in the library to index everything in that directory.
|
|
||||||
|
|
||||||
## Updating
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git pull
|
|
||||||
./manage.sh restart
|
|
||||||
```
|
|
||||||
|
|
||||||
The SQLite database persists in `data/` across rebuilds.
|
|
||||||
|
|
@ -1,49 +0,0 @@
|
||||||
# Ollama Setup
|
|
||||||
|
|
||||||
Hybrid vector search and RAG chat are gated behind a local Ollama instance. This is the BYOK (bring your own key) unlock for the Free tier — no paid subscription required.
|
|
||||||
|
|
||||||
## Install Ollama
|
|
||||||
|
|
||||||
```bash
|
|
||||||
curl -fsSL https://ollama.ai/install.sh | sh
|
|
||||||
```
|
|
||||||
|
|
||||||
## Pull the required models
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Embedding model — converts pages into vectors
|
|
||||||
ollama pull nomic-embed-text
|
|
||||||
|
|
||||||
# Chat model — answers questions using retrieved page excerpts
|
|
||||||
ollama pull mistral:7b
|
|
||||||
```
|
|
||||||
|
|
||||||
`nomic-embed-text` produces 768-dimensional vectors and runs comfortably on 8 GB of VRAM.
|
|
||||||
`mistral:7b` requires roughly 5 GB of VRAM. Substitute any compatible model.
|
|
||||||
|
|
||||||
## Configure Pagepiper
|
|
||||||
|
|
||||||
In your `.env`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
|
||||||
PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
|
||||||
PAGEPIPER_CHAT_MODEL=mistral:7b
|
|
||||||
```
|
|
||||||
|
|
||||||
Restart Pagepiper:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./manage.sh restart
|
|
||||||
```
|
|
||||||
|
|
||||||
## Verify
|
|
||||||
|
|
||||||
Upload or re-index a document. The document card should show **Embedding N / M pages** during ingest. Once complete, the Chat tab becomes active.
|
|
||||||
|
|
||||||
## Changing embedding models
|
|
||||||
|
|
||||||
If you switch `PAGEPIPER_EMBED_MODEL`, Pagepiper detects the dimension mismatch at startup, deletes the old vector database, and automatically re-embeds all indexed documents in the background. BM25 search remains available throughout.
|
|
||||||
|
|
||||||
!!! note
|
|
||||||
Re-embedding a large library can take 30-60 minutes depending on hardware.
|
|
||||||
|
|
@ -1,36 +0,0 @@
|
||||||
# Quick Start
|
|
||||||
|
|
||||||
This guide gets you from zero to searching your first document in under five minutes.
|
|
||||||
|
|
||||||
## 1. Start Pagepiper
|
|
||||||
|
|
||||||
```bash
|
|
||||||
./manage.sh start
|
|
||||||
```
|
|
||||||
|
|
||||||
Open `http://localhost:8521` in your browser.
|
|
||||||
|
|
||||||
## 2. Add a document
|
|
||||||
|
|
||||||
You have two options:
|
|
||||||
|
|
||||||
**Upload directly** — click **Upload PDF / EPUB** in the library header and pick a file from your computer.
|
|
||||||
|
|
||||||
**Scan a directory** — set `PAGEPIPER_WATCH_DIR` in your `.env` to a folder of PDFs or EPUBs, then click **Scan for PDFs**. Pagepiper indexes every file it finds.
|
|
||||||
|
|
||||||
## 3. Wait for indexing
|
|
||||||
|
|
||||||
The document card shows progress while text is being extracted and embedded:
|
|
||||||
|
|
||||||
- **Extracting text...** (animated bar) — PDF/EPUB is being parsed into page chunks
|
|
||||||
- **Embedding N / M pages (X%)** (filling bar) — vectors are being written to the vector store (only when Ollama is configured)
|
|
||||||
|
|
||||||
Once the badge shows **READY**, the document is searchable.
|
|
||||||
|
|
||||||
## 4. Search
|
|
||||||
|
|
||||||
Click **Search** in the navigation. Type any phrase and see ranked page excerpts with scores. Results are instant using BM25 full-text search — no Ollama required.
|
|
||||||
|
|
||||||
## 5. Chat (optional, requires Ollama)
|
|
||||||
|
|
||||||
See the [Ollama Setup](ollama-setup.md) guide to enable hybrid vector search and LLM-powered chat. Once configured, the **Chat** tab lets you ask natural-language questions and get answers with page citations.
|
|
||||||
142
docs/index.md
|
|
@ -1,142 +0,0 @@
|
||||||
# Pagepiper
|
|
||||||
|
|
||||||
Self-hosted document search with BM25 full-text indexing and (with local Ollama) hybrid vector search and LLM-powered chat. Supports PDF and EPUB files.
|
|
||||||
|
|
||||||
## Demo
|
|
||||||
|
|
||||||
Try it: [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech)
|
|
||||||
|
|
||||||
## Screenshots
|
|
||||||
|
|
||||||
### Library
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
Scan your PDF directory to index documents, or upload individual PDFs directly. Each document shows page count and ingest status.
|
|
||||||
|
|
||||||
### Chat
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
Ask questions across your indexed documents. Results cite the source document and page number.
|
|
||||||
|
|
||||||
## Tiers
|
|
||||||
|
|
||||||
| Feature | Free | Paid (BYOK) |
|
|
||||||
|---------|------|-------------|
|
|
||||||
| BM25 full-text search | Yes | Yes |
|
|
||||||
| PDF and EPUB upload via browser | Yes | Yes |
|
|
||||||
| Unlimited local ingestion | Yes | Yes |
|
|
||||||
| Hybrid vector search | No | Yes (local Ollama) |
|
|
||||||
| LLM chat over documents | No | Yes (local Ollama) |
|
|
||||||
|
|
||||||
BYOK (Bring Your Own Key) means you supply your own Ollama instance. No cloud API keys required.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Self-Hosting Guide
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
- [Docker](https://docs.docker.com/get-docker/) and Docker Compose
|
|
||||||
- PDFs you want to search
|
|
||||||
- Optional: [Ollama](https://ollama.com) running locally for semantic search and LLM chat
|
|
||||||
|
|
||||||
### Step 1: Get the code
|
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
|
|
||||||
cd pagepiper
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 2: Configure
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cp .env.example .env
|
|
||||||
```
|
|
||||||
|
|
||||||
Open `.env` and set your directories:
|
|
||||||
|
|
||||||
```dotenv
|
|
||||||
# Where pagepiper stores its index database
|
|
||||||
PAGEPIPER_DATA_DIR=./data
|
|
||||||
|
|
||||||
# Directory to scan for PDFs (used by the "Scan for PDFs" button)
|
|
||||||
# You can also upload individual PDFs via the web UI without setting this
|
|
||||||
PAGEPIPER_BOOKS_DIR=/path/to/your/pdfs
|
|
||||||
```
|
|
||||||
|
|
||||||
To unlock hybrid vector search and LLM chat, add your Ollama endpoint:
|
|
||||||
|
|
||||||
```dotenv
|
|
||||||
PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
|
||||||
PAGEPIPER_CHAT_MODEL=mistral:7b
|
|
||||||
PAGEPIPER_EMBED_MODEL=nomic-embed-text
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3: Start
|
|
||||||
|
|
||||||
```bash
|
|
||||||
docker compose up -d --build
|
|
||||||
```
|
|
||||||
|
|
||||||
Open [http://localhost:8521](http://localhost:8521) in your browser.
|
|
||||||
|
|
||||||
### Step 4: Add your PDFs
|
|
||||||
|
|
||||||
Two ways to add documents:
|
|
||||||
|
|
||||||
**Option A — Upload via browser** (easiest for small collections):
|
|
||||||
|
|
||||||
Click the **Upload PDF** button in the Library view and select a file. It saves to `data/uploads/` and begins indexing automatically.
|
|
||||||
|
|
||||||
**Option B — Mount a directory** (best for large collections):
|
|
||||||
|
|
||||||
Set `PAGEPIPER_BOOKS_DIR` in your `.env` to point at a folder of PDFs, then click **Scan for PDFs**. Pagepiper finds all `.pdf` files recursively and queues them for indexing.
|
|
||||||
|
|
||||||
### Step 5: Search
|
|
||||||
|
|
||||||
Switch to the **Chat** tab and ask questions about your documents. The Free tier uses BM25 keyword matching. With Ollama configured, you get semantic (vector) search and LLM-generated answers with page-level citations.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Ollama Setup (optional)
|
|
||||||
|
|
||||||
Install Ollama from [ollama.com](https://ollama.com), then pull the models:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ollama pull mistral:7b
|
|
||||||
ollama pull nomic-embed-text
|
|
||||||
```
|
|
||||||
|
|
||||||
Pagepiper's Docker container reaches Ollama at `host.docker.internal` — no extra network config needed on Linux/Mac with Docker Desktop. On a headless Linux server, make sure Ollama binds to `0.0.0.0`:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
OLLAMA_HOST=0.0.0.0 ollama serve
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Managing the instance
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check status
|
|
||||||
docker compose ps
|
|
||||||
|
|
||||||
# View API logs
|
|
||||||
docker compose logs -f api
|
|
||||||
|
|
||||||
# Stop
|
|
||||||
docker compose down
|
|
||||||
|
|
||||||
# Rebuild after updates
|
|
||||||
docker compose up -d --build
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- Pagepiper indexes PDFs at ingest time. Changes to the source file require a re-index (use the re-index button on the document card).
|
|
||||||
- The `data/` directory contains the SQLite index database and any uploaded files. Back it up to preserve your index.
|
|
||||||
- Large PDFs (hundreds of pages) can take a few minutes to index. Watch the status badge on the document card.
|
|
||||||
|
|
@ -1 +0,0 @@
|
||||||
(function(){var s=document.createElement("script");s.defer=true;s.dataset.domain="docs.circuitforge.tech,circuitforge.tech";s.dataset.api="https://analytics.circuitforge.tech/api/event";s.src="https://analytics.circuitforge.tech/js/script.js";document.head.appendChild(s);})();
|
|
||||||
|
|
@ -1,60 +0,0 @@
|
||||||
# Architecture
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
```
|
|
||||||
Browser (Vue 3 SPA)
|
|
||||||
|
|
|
||||||
nginx (static + /api proxy)
|
|
||||||
|
|
|
||||||
FastAPI backend
|
|
||||||
├── BM25Index (in-process, rank-bm25)
|
|
||||||
├── Retriever (BM25 + optional vector)
|
|
||||||
├── Synthesizer (LLMRouter → Ollama)
|
|
||||||
└── SQLite (page_chunks + metadata)
|
|
||||||
+
|
|
||||||
sqlite-vec (vectors)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Ingest pipeline
|
|
||||||
|
|
||||||
```
|
|
||||||
PDF / EPUB file
|
|
||||||
│
|
|
||||||
├─ PDFExtractor (pdfminer + OCR fallback) ← circuitforge_core
|
|
||||||
│ or
|
|
||||||
└─ EPUBExtractor (BeautifulSoup + heading chunking)
|
|
||||||
│
|
|
||||||
text_clean.py (strip artifacts)
|
|
||||||
│
|
|
||||||
INSERT INTO page_chunks
|
|
||||||
│
|
|
||||||
Ollama embed (batches of 64) ← BYOK gate
|
|
||||||
│
|
|
||||||
sqlite-vec upsert
|
|
||||||
```
|
|
||||||
|
|
||||||
## Retrieval
|
|
||||||
|
|
||||||
Hybrid search merges BM25 and semantic results with a 50/50 score blend:
|
|
||||||
|
|
||||||
1. BM25 queries the in-process index (no round-trip to DB)
|
|
||||||
2. Semantic query embeds the user query via Ollama, fetches `top_k * 20` nearest vectors, filters by `doc_id` in Python
|
|
||||||
3. Hits are merged: BM25 scores and vector scores combined; BM25 hits take priority
|
|
||||||
4. Top `k` results are ranked, then adjacent pages (page ± 1) are fetched to restore context for mid-sentence chunk boundaries
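The 50/50 blend in step 3 can be sketched roughly as follows. This is a minimal illustration rather than the project's code: it assumes the vector score is a distance (smaller means closer) that is mapped into (0, 1] via `1 / (1 + d)`, and that the BM25 score has already been normalized to [0, 1]. The `blend` helper and the sample candidates are hypothetical.

```python
from __future__ import annotations


def blend(bm25_norm: float, vector_distance: float | None) -> float:
    """Combine a normalized BM25 score with a vector distance, weighted 50/50."""
    # A chunk with no vector hit contributes nothing on the vector side;
    # otherwise the distance is mapped into (0, 1], where 1.0 is an exact match.
    vec = 1.0 / (1.0 + vector_distance) if vector_distance is not None else 0.0
    return bm25_norm * 0.5 + vec * 0.5


# Hypothetical candidates: (chunk_id, normalized BM25 score, vector distance).
candidates = [("a", 0.9, None), ("b", 0.4, 0.0), ("c", 0.2, 3.0)]
ranked = sorted(candidates, key=lambda c: blend(c[1], c[2]), reverse=True)
```

Under these assumptions, a chunk with a moderate BM25 score and an exact vector match can outrank a chunk with a strong BM25 score and no vector hit at all.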
|
|
||||||
|
|
||||||
## Storage
|
|
||||||
|
|
||||||
| File | Format | Contents |
|
|
||||||
|------|--------|---------|
|
|
||||||
| `pagepiper.db` | SQLite | `documents`, `page_chunks`, `chat_feedback` |
|
|
||||||
| `pagepiper_vecs.db` | sqlite-vec | `page_vecs` virtual table + `page_vecs_meta` |
|
|
||||||
|
|
||||||
The vector database stores one row per page chunk. If the embedding model changes, Pagepiper detects the dimension mismatch at startup (reads `CREATE VIRTUAL TABLE` DDL from `sqlite_master`), deletes the vec DB, and queues a background re-embed.
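The dimension check described above might look something like the sketch below. It assumes the virtual table's DDL contains a `float[N]` column declaration and that the table is named `page_vecs_vecs`. The `parse_dims` and `stored_dims` helper names are hypothetical, and the real startup check also deletes the file and queues the re-embed, which is omitted here.

```python
from __future__ import annotations

import re
import sqlite3


def parse_dims(ddl: str) -> int | None:
    """Extract N from a 'float[N]' column declaration, or None if absent."""
    m = re.search(r"float\[(\d+)\]", ddl)
    return int(m.group(1)) if m else None


def stored_dims(vec_db_path: str, table: str = "page_vecs_vecs") -> int | None:
    """Read the table's CREATE statement from sqlite_master and parse its dimension."""
    conn = sqlite3.connect(vec_db_path)
    try:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
        ).fetchone()
    finally:
        conn.close()
    # No row means the table has not been created yet, so there is nothing to compare.
    return parse_dims(row[0]) if row and row[0] else None
```

A caller would compare the returned value with the configured dimension and treat `None` as "nothing to migrate".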
|
|
||||||
|
|
||||||
## Licensing boundary
|
|
||||||
|
|
||||||
| Component | License |
|
|
||||||
|-----------|---------|
|
|
||||||
| BM25 search, ingest pipeline, library API | MIT |
|
|
||||||
| Hybrid vector search, RAG chat, embedding | BSL 1.1 (BYOK unlocked on Free tier) |
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
# Environment Variables
|
|
||||||
|
|
||||||
Copy `.env.example` to `.env` and configure as needed.
|
|
||||||
|
|
||||||
## Core
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `PAGEPIPER_DATA_DIR` | `data` | Directory for SQLite databases and uploads |
|
|
||||||
| `PAGEPIPER_WATCH_DIR` | _(unset)_ | Directory scanned for PDFs/EPUBs on demand |
|
|
||||||
| `SECRET_KEY` | _(required)_ | Random secret for internal signing |
|
|
||||||
|
|
||||||
## Ollama / BYOK
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `PAGEPIPER_OLLAMA_URL` | _(unset)_ | Ollama base URL, e.g. `http://localhost:11434`. Enables hybrid search and chat. |
|
|
||||||
| `PAGEPIPER_EMBED_MODEL` | `nomic-embed-text` | Ollama embedding model |
|
|
||||||
| `PAGEPIPER_EMBED_DIMS` | `1024` | Embedding dimensions (must match the model) |
|
|
||||||
| `PAGEPIPER_CHAT_MODEL` | `mistral:7b` | Ollama chat/completion model |
|
|
||||||
|
|
||||||
## cf-orch (managed deployments)
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `CF_ORCH_URL` | _(unset)_ | cf-orch coordinator URL for GPU allocation |
|
|
||||||
| `CF_LICENSE_KEY` | _(unset)_ | License key for cf-orch authentication |
|
|
||||||
| `CF_APP_NAME` | `pagepiper` | Application identifier sent to cf-orch |
|
|
||||||
|
|
||||||
## License (cloud tier)
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `PAGEPIPER_HEIMDALL_URL` | _(unset)_ | Heimdall license server URL |
|
|
||||||
| `PAGEPIPER_HEIMDALL_TOKEN` | _(unset)_ | Admin token for license validation |
|
|
||||||
|
|
||||||
## Feature flags
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|----------|---------|-------------|
|
|
||||||
| `PAGEPIPER_CHAT_FEEDBACK` | `false` | Enable thumbs up/down feedback UI on chat answers |
|
|
||||||
| `CLOUD_MODE` | `false` | Enable cloud-specific middleware (rate limiting, license checks) |
|
|
||||||
|
|
@ -1,23 +0,0 @@
|
||||||
# Tier System
|
|
||||||
|
|
||||||
| Feature | Free | Paid (BYOK) |
|
|
||||||
|---------|------|-------------|
|
|
||||||
| BM25 full-text search | Yes | Yes |
|
|
||||||
| PDF and EPUB upload | Yes | Yes |
|
|
||||||
| Unlimited local ingestion | Yes | Yes |
|
|
||||||
| Directory scan | Yes | Yes |
|
|
||||||
| Hybrid vector search | No | Yes (local Ollama) |
|
|
||||||
| RAG chat with page citations | No | Yes (local Ollama) |
|
|
||||||
| Embedding model choice | No | Yes |
|
|
||||||
|
|
||||||
## BYOK unlock
|
|
||||||
|
|
||||||
Setting `PAGEPIPER_OLLAMA_URL` in your `.env` unlocks all Paid-tier features at no cost. You supply your own compute; Pagepiper supplies the pipeline.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
PAGEPIPER_OLLAMA_URL=http://localhost:11434
|
|
||||||
```

## Cloud managed tier

The hosted instance at [pagepiper.circuitforge.tech](https://pagepiper.circuitforge.tech) runs on Circuit Forge infrastructure and requires a Paid tier license key. A free trial is available without a key.
Binary file not shown.
|
Before Width: | Height: | Size: 26 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 25 KiB |
|
|
@ -1,39 +0,0 @@
# Chat

RAG (retrieval-augmented generation) chat lets you ask natural-language questions and get answers grounded in your document library. Chat requires Ollama; see [Ollama Setup](../getting-started/ollama-setup.md).

## Asking a question

1. Click **Chat** in the navigation bar
2. Optionally select one or more documents to restrict the search scope
3. Type your question and press Enter or click Send

Pagepiper retrieves the most relevant page excerpts using hybrid BM25 + vector search, then passes them to the local LLM with instructions to answer using only the provided text and cite every claim with a page number.

## Citations

Each answer includes a citation panel showing the source pages used. Citations include:

- Document title
- Page number
- A short text excerpt from that page

If the answer says `[p.42]`, you can cross-reference the citation panel to see exactly what text the model read.

## Multi-document chat

Leave the document selector empty to search across your entire library. When you have many books indexed, scoping to a specific document gives more precise results.

## Context window

Pagepiper fetches the top 10 matching pages plus one adjacent page on each side of every hit. This ensures mid-paragraph chunk boundaries don't cut off context that the model needs to understand a passage.
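
The neighbour expansion can be sketched roughly as below. This is an illustration, not Pagepiper's actual retrieval code: the function name, the `(doc_id, page_number)` tuple shape, and the assumption that pages are numbered from 1 are all hypothetical.

```python
def expand_with_neighbours(hits, page_counts, radius=1):
    """Return (doc_id, page_number) pairs covering each hit and its neighbours.

    hits: ranked list of (doc_id, page_number) tuples, best first.
    page_counts: mapping of doc_id -> number of pages in that document.
    """
    seen = set()
    expanded = []
    for doc_id, page in hits:
        last_page = page_counts[doc_id]
        for p in range(page - radius, page + radius + 1):
            # Skip neighbours that fall outside the document.
            if p < 1 or p > last_page:
                continue
            # Overlapping windows must not add the same page twice.
            if (doc_id, p) not in seen:
                seen.add((doc_id, p))
                expanded.append((doc_id, p))
    return expanded


hits = [("book-a", 1), ("book-a", 3), ("book-b", 7)]
print(expand_with_neighbours(hits, {"book-a": 4, "book-b": 7}))
```

With ten hits and `radius=1` this yields at most 30 pages, and fewer when hits are adjacent or sit at the start or end of a document.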

## Limitations

- The model answers using only the retrieved excerpts. If the relevant passage was not retrieved, the model will say it cannot find an answer.
- Chat history is kept in the browser session only. Refreshing the page clears the conversation.
- RAG chat is gated behind a local Ollama instance. Cloud LLM backends are not currently supported on the Free tier.

## Feedback

Use the thumbs up / thumbs down buttons after each answer to flag good and bad responses. Feedback is stored locally in `data/pagepiper.db` for future quality review.
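
For reference, the sketch below shows how one feedback row might be written. The `chat_feedback` schema is copied from the migration in this diff; the `record_feedback` helper and the in-memory database are assumptions made for the example, not Pagepiper's actual endpoint code.

```python
import json
import sqlite3

# Schema copied from the chat_feedback migration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_feedback (
    id          TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    rating      INTEGER NOT NULL CHECK (rating IN (1, -1)),
    question    TEXT NOT NULL DEFAULT '',
    answer      TEXT NOT NULL DEFAULT '',
    doc_ids     TEXT NOT NULL DEFAULT '[]',
    created_at  TEXT NOT NULL DEFAULT (datetime('now'))
)
"""


def record_feedback(conn, rating, question, answer, doc_ids):
    """Hypothetical helper: store one thumbs up (1) or thumbs down (-1)."""
    conn.execute(
        "INSERT INTO chat_feedback(rating, question, answer, doc_ids) VALUES (?,?,?,?)",
        [rating, question, answer, json.dumps(doc_ids)],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for data/pagepiper.db
conn.execute(SCHEMA)
record_feedback(conn, 1, "What are the chimes?", "See [p.42].", ["abc123"])
print(conn.execute("SELECT rating, doc_ids FROM chat_feedback").fetchone())
```

The `CHECK` constraint rejects any rating other than `1` or `-1`, so a malformed request fails with `sqlite3.IntegrityError` instead of being stored.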
@ -1,48 +0,0 @@
# Library

The library is the home screen. It shows all indexed documents and lets you add new ones.

## Adding documents

**Upload**: click **Upload PDF / EPUB** and select a file. Files up to 200 MB are accepted. The document is saved to `data/uploads/` and queued for indexing immediately.

**Scan**: set `PAGEPIPER_WATCH_DIR` to a directory in your `.env`, then click **Scan for PDFs**. Any PDF or EPUB not already in the library is queued. Re-scanning is safe; already-indexed documents are skipped.

## Document states

| Badge | Meaning |
|-------|---------|
| PROCESSING | Text extraction or embedding in progress |
| READY | Fully indexed and searchable |
| ERROR | Indexing failed; see the error message on the card |

## Ingestion progress

While a document is processing, its card shows a live progress bar:

- Animated sliding bar while text is being extracted (before page count is known)
- "Embedding N / M pages (X%)" once vectors are being written

The card refreshes automatically and emits a library reload when indexing completes.

## Re-indexing

Click **Re-index** on any document card to re-run the full ingest pipeline. This is useful after:

- Changing `PAGEPIPER_EMBED_MODEL` (a dimension mismatch is detected automatically at startup, but you can also trigger a re-index manually)
- A failed ingest you want to retry
- Updating to a new version of Pagepiper with an improved extractor
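
The startup tests in this diff suggest that the dimension check reads the table DDL from `sqlite_master` and looks for a `float[N]` column type. The sketch below illustrates that idea only; `stored_vector_dims` is a hypothetical helper, not the real `_check_vec_schema`, and the table name `page_vecs_vecs` is taken from those tests.

```python
import os
import re
import sqlite3
import tempfile

_DIM_RE = re.compile(r"float\[(\d+)\]")


def stored_vector_dims(vec_db_path, table="page_vecs_vecs"):
    """Return the embedding dimension recorded in the table DDL, or None."""
    conn = sqlite3.connect(vec_db_path)
    try:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE name = ?", [table]
        ).fetchone()
    finally:
        conn.close()
    if row is None or row[0] is None:
        return None
    match = _DIM_RE.search(row[0])
    return int(match.group(1)) if match else None


# Simulate a store built with a 768-dimension model, as the tests do.
path = os.path.join(tempfile.mkdtemp(), "vecs.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE page_vecs_vecs (embedding float[768])")
conn.commit()
conn.close()

configured = 1024  # e.g. the value of PAGEPIPER_EMBED_DIMS
if stored_vector_dims(path) != configured:
    print("dimension mismatch: vectors must be rebuilt")
```

A caller would compare the returned value with the configured dimension and rebuild the vector store when they differ.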

## Removing a document

Click **Remove** to delete the document's metadata, page chunks, and vectors. The source file on disk is not deleted.

## Storage

All data lives in the directory set by `PAGEPIPER_DATA_DIR` (default: `data/`):

| File | Contents |
|------|---------|
| `pagepiper.db` | Document metadata, page chunks, chat feedback |
| `pagepiper_vecs.db` | sqlite-vec vector store |
| `uploads/` | Files added via browser upload |
@ -1,24 +0,0 @@
# Search

BM25 full-text search is available on the Free tier with no Ollama required.

## Using search

1. Click **Search** in the navigation bar
2. Type a phrase or keyword and submit it to see results
3. Results show the source document, page number, a text excerpt, and a BM25 relevance score

## Filtering by document

Use the document selector to restrict results to one or more specific books. This is useful when your library spans many documents and you know which one contains the answer.

## BM25 scoring

BM25 (Best Match 25) ranks pages by term frequency weighted against how rare each term is across the whole corpus. A page ranks highest when it uses your query term frequently and that term is rare across all documents.

!!! tip
    For short queries like "chimes" or "protocol", BM25 tends to surface later chapters where the term appears repeatedly in action scenes. If you want the introductory definition, try a longer phrase like "what are the chimes" to give BM25 more signal.
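
To make the scoring concrete, here is a small self-contained sketch of the standard Okapi BM25 formula. It is not Pagepiper's implementation: the whitespace tokenizer and the `k1` and `b` values are common defaults assumed for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, pages, k1=1.5, b=0.75):
    """Score each page (a string) against the query with Okapi BM25."""
    docs = [p.lower().split() for p in pages]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of pages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by page length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


pages = [
    "the chimes rang and the chimes rang again",
    "a protocol for the meeting",
    "the weather was fine",
]
print(bm25_scores("chimes", pages))
```

Only the first page contains "chimes", so it is the only one with a non-zero score. A longer query adds one such term per word, which is why extra words give the ranking more signal.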

## Hybrid search (requires Ollama)

When Ollama is configured, the Chat endpoint uses hybrid search behind the scenes: BM25 results are merged with semantic vector results using a 50/50 score blend. The Search page always uses BM25 only.
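
One way to picture a 50/50 blend is shown below. This is a sketch under stated assumptions: the text does not say how Pagepiper normalises scores before blending, so the min-max rescaling, the function names, and the treatment of vector scores as higher-is-better similarities are all hypothetical.

```python
def min_max(scores):
    """Rescale a {key: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def blend(bm25, vector, weight=0.5):
    """Combine two score maps; a page missing from one side scores 0 there."""
    b, v = min_max(bm25), min_max(vector)
    merged = {
        k: weight * b.get(k, 0.0) + (1 - weight) * v.get(k, 0.0)
        for k in set(b) | set(v)
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


bm25 = {("book", 3): 4.2, ("book", 9): 1.1}
vector = {("book", 3): 0.80, ("book", 5): 0.88, ("book", 7): 0.40}
for key, score in blend(bm25, vector):
    print(key, round(score, 3))
```

Page 3 appears in both result sets and ranks first, ahead of page 5, which only the vector search returned.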
@ -14,8 +14,6 @@ dependencies:
|
||||||
- pdfplumber>=0.11
|
- pdfplumber>=0.11
|
||||||
- pytesseract>=0.3
|
- pytesseract>=0.3
|
||||||
- Pillow>=10.0
|
- Pillow>=10.0
|
||||||
- ebooklib>=0.18
|
|
||||||
- beautifulsoup4>=4.12
|
|
||||||
- sqlite-vec>=0.1
|
- sqlite-vec>=0.1
|
||||||
- pytest>=8.0
|
- pytest>=8.0
|
||||||
- pytest-asyncio>=0.23
|
- pytest-asyncio>=0.23
|
||||||
|
|
|
||||||
|
|
@ -1,9 +0,0 @@
-- chat answer thumbs up/down signals (local SQLite, always available)
CREATE TABLE IF NOT EXISTS chat_feedback (
    id          TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    rating      INTEGER NOT NULL CHECK (rating IN (1, -1)),
    question    TEXT NOT NULL DEFAULT '',
    answer      TEXT NOT NULL DEFAULT '',
    doc_ids     TEXT NOT NULL DEFAULT '[]',
    created_at  TEXT NOT NULL DEFAULT (datetime('now'))
);
64
mkdocs.yml
@ -1,64 +0,0 @@
site_name: Pagepiper
site_description: Self-hosted PDF and EPUB library with BM25 full-text search, hybrid vector retrieval, and LLM-powered RAG chat.
site_author: Circuit Forge LLC
site_url: https://docs.circuitforge.tech/pagepiper
repo_url: https://git.opensourcesolarpunk.com/Circuit-Forge/pagepiper
repo_name: Circuit-Forge/pagepiper

theme:
  name: material
  palette:
    - scheme: default
      primary: deep purple
      accent: purple
      toggle:
        icon: material/brightness-7
        name: Switch to dark mode
    - scheme: slate
      primary: deep purple
      accent: purple
      toggle:
        icon: material/brightness-4
        name: Switch to light mode
  features:
    - navigation.tabs
    - navigation.sections
    - navigation.expand
    - navigation.top
    - search.suggest
    - search.highlight
    - content.code.copy

markdown_extensions:
  - admonition
  - pymdownx.details
  - pymdownx.superfences:
      custom_fences:
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.tabbed:
      alternate_style: true
  - tables
  - toc:
      permalink: true

nav:
  - Home: index.md
  - Getting Started:
    - Installation: getting-started/installation.md
    - Quick Start: getting-started/quick-start.md
    - Ollama Setup: getting-started/ollama-setup.md
  - User Guide:
    - Library: user-guide/library.md
    - Search: user-guide/search.md
    - Chat: user-guide/chat.md
  - Reference:
    - Architecture: reference/architecture.md
    - Tier System: reference/tier-system.md
    - Environment Variables: reference/environment-variables.md

extra_javascript:
  - plausible.js
@ -1,239 +0,0 @@
# scripts/ingest_epub.py
"""
cf-orch task: pagepiper/ingest_epub

Extracts text from an EPUB file, stores chapter chunks in SQLite, and (if Ollama is
configured) generates embeddings and stores them in the sqlite-vec store.

Each EPUB chapter becomes one chunk (equivalent to a PDF page).

Entry point:
    python scripts/ingest_epub.py --doc-id X --file-path Y --db-path Z --vec-db-path W
"""
from __future__ import annotations

import logging
import os
import sqlite3
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger("pagepiper.ingest_epub")

EMBED_BATCH_SIZE = 64
_WORDS_PER_CHUNK = 500  # target chunk size for word-count fallback


@dataclass
class _Chunk:
    page_number: int
    text: str
    source: str
    word_count: int


def _paragraphs_from_soup(soup) -> list[str]:
    """Extract non-trivial, artifact-free text lines from parsed HTML."""
    from scripts.text_clean import filter_paragraphs
    raw = soup.get_text(separator="\n", strip=True)
    return filter_paragraphs(raw.splitlines())


def _chunks_from_paragraphs(paragraphs: list[str], start_num: int) -> list[_Chunk]:
    """Accumulate paragraphs into ~_WORDS_PER_CHUNK-word chunks."""
    chunks: list[_Chunk] = []
    current: list[str] = []
    current_count = 0
    chunk_num = start_num

    for para in paragraphs:
        words = para.split()
        if current_count + len(words) > _WORDS_PER_CHUNK and current:
            text = "\n".join(current)
            chunks.append(_Chunk(chunk_num, text, "text", current_count))
            chunk_num += 1
            current, current_count = [], 0
        current.append(para)
        current_count += len(words)

    if current:
        text = "\n".join(current)
        chunks.append(_Chunk(chunk_num, text, "text", current_count))

    return chunks


def _extract_chunks(file_path: str) -> list[_Chunk]:
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup
    from scripts.text_clean import clean_line, is_artifact_line

    book = epub.read_epub(file_path, options={"ignore_ncx": True})
    all_chunks: list[_Chunk] = []

    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        headings = soup.find_all(["h1", "h2", "h3", "h4"])

        if len(headings) >= 2:
            # Heading-based split: one chunk per section
            current_parts: list[str] = []
            for elem in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "blockquote"]):
                if elem.name in ("h1", "h2", "h3", "h4"):
                    if current_parts:
                        text = "\n".join(current_parts).strip()
                        if text:
                            n = len(all_chunks) + 1
                            all_chunks.append(_Chunk(n, text, "text", len(text.split())))
                    current_parts = [elem.get_text(" ", strip=True)]
                else:
                    t = clean_line(elem.get_text(" ", strip=True))
                    if t and not is_artifact_line(t):
                        current_parts.append(t)
            if current_parts:
                text = "\n".join(current_parts).strip()
                if text:
                    n = len(all_chunks) + 1
                    all_chunks.append(_Chunk(n, text, "text", len(text.split())))
        else:
            # Word-count fallback: accumulate paragraphs into ~500-word chunks
            paragraphs = _paragraphs_from_soup(soup)
            if paragraphs:
                all_chunks.extend(_chunks_from_paragraphs(paragraphs, len(all_chunks) + 1))

    return all_chunks


def _update_status(
    conn: sqlite3.Connection,
    doc_id: str,
    status: str,
    page_count: int | None = None,
    error_msg: str | None = None,
) -> None:
    if page_count is not None:
        conn.execute(
            "UPDATE documents SET status=?, page_count=?, updated_at=datetime('now') WHERE id=?",
            [status, page_count, doc_id],
        )
    elif error_msg is not None:
        conn.execute(
            "UPDATE documents SET status=?, error_msg=?, updated_at=datetime('now') WHERE id=?",
            [status, error_msg, doc_id],
        )
    else:
        conn.execute(
            "UPDATE documents SET status=?, updated_at=datetime('now') WHERE id=?",
            [status, doc_id],
        )
    conn.commit()


def run(doc_id: str, file_path: str, db_path: str, vec_db_path: str) -> None:
    """Run the full ingest pipeline for one EPUB. Called by cf-orch or BackgroundTasks."""
    conn: sqlite3.Connection | None = None
    try:
        conn = sqlite3.connect(db_path, timeout=30)
        conn.execute("PRAGMA journal_mode = WAL")
        conn.execute("PRAGMA foreign_keys = ON")
        _update_status(conn, doc_id, "processing")

        logger.info("Extracting chapters from %s", file_path)
        chunks = _extract_chunks(file_path)
        logger.info("Extracted %d chapters", len(chunks))

        conn.execute("DELETE FROM page_chunks WHERE doc_id=?", [doc_id])
        chunk_rows: list[tuple[str, int, str]] = []
        for chunk in chunks:
            row = conn.execute(
                """INSERT INTO page_chunks(doc_id, page_number, text, source, word_count)
                   VALUES (?,?,?,?,?) RETURNING id""",
                [doc_id, chunk.page_number, chunk.text, chunk.source, chunk.word_count],
            ).fetchone()
            chunk_rows.append((row[0], chunk.page_number, chunk.text))
        conn.commit()

        # Embedding failure is non-fatal: document remains BM25-searchable.
        ollama_url = os.environ.get("PAGEPIPER_OLLAMA_URL", "").strip()
        if ollama_url and chunks:
            try:
                logger.info("Embedding %d chapters via Ollama at %s", len(chunks), ollama_url)
                from circuitforge_core.llm import LLMRouter
                from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore

                _clean = ollama_url.rstrip("/")
                base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
                router = LLMRouter({
                    "fallback_order": ["ollama"],
                    "backends": {
                        "ollama": {
                            "type": "openai_compat",
                            "base_url": base_url,
                            "model": os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
                            "embedding_model": os.environ.get(
                                "PAGEPIPER_EMBED_MODEL", "nomic-embed-text"
                            ),
                            "supports_images": False,
                        }
                    },
                })
                embed_dims = int(os.environ.get("PAGEPIPER_EMBED_DIMS", "1024"))
                vec_store = LocalSQLiteVecStore(
                    db_path=vec_db_path, table="page_vecs", dimensions=embed_dims
                )
                vec_store.delete_where({"doc_id": doc_id})

                texts = [text for _, _, text in chunk_rows]
                vectors: list[list[float]] = []
                for i in range(0, len(texts), EMBED_BATCH_SIZE):
                    vectors.extend(router.embed(texts[i : i + EMBED_BATCH_SIZE]))

                for (chunk_id, page_number, _), vector in zip(chunk_rows, vectors):
                    vec_store.upsert(
                        entry_id=chunk_id,
                        vector=vector,
                        metadata={"doc_id": doc_id, "page_number": page_number},
                    )
                logger.info("Stored %d embeddings", len(vectors))
            except Exception as embed_exc:
                logger.warning(
                    "Embedding skipped for doc %s — BM25 only (reason: %s)",
                    doc_id, embed_exc,
                )

        _update_status(conn, doc_id, "ready", page_count=len(chunks))
        logger.info("Ingest complete for doc %s (%d chapters)", doc_id, len(chunks))

    except Exception as exc:
        logger.error("Ingest failed for doc %s: %s", doc_id, exc, exc_info=True)
        if conn is not None:
            try:
                _update_status(conn, doc_id, "error", error_msg=str(exc))
            except Exception:
                logger.warning("Could not write error status for doc %s", doc_id)
        raise
    finally:
        if conn is not None:
            conn.close()


if __name__ == "__main__":
    import argparse

    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser(
        description="Ingest an EPUB (cf-orch task entry point)"
    )
    parser.add_argument("--doc-id", required=True)
    parser.add_argument("--file-path", required=True)
    parser.add_argument("--db-path", required=True)
    parser.add_argument("--vec-db-path", required=True)
    a = parser.parse_args()
    run(
        doc_id=a.doc_id,
        file_path=a.file_path,
        db_path=a.db_path,
        vec_db_path=a.vec_db_path,
    )
@ -52,8 +52,7 @@ def run(doc_id: str, file_path: str, db_path: str, vec_db_path: str) -> None:
|
||||||
|
|
||||||
conn: sqlite3.Connection | None = None
|
conn: sqlite3.Connection | None = None
|
||||||
try:
|
try:
|
||||||
conn = sqlite3.connect(db_path, timeout=30)
|
conn = sqlite3.connect(db_path)
|
||||||
conn.execute("PRAGMA journal_mode = WAL")
|
|
||||||
conn.execute("PRAGMA foreign_keys = ON")
|
conn.execute("PRAGMA foreign_keys = ON")
|
||||||
_update_status(conn, doc_id, "processing")
|
_update_status(conn, doc_id, "processing")
|
||||||
|
|
||||||
|
|
@ -64,71 +63,59 @@ def run(doc_id: str, file_path: str, db_path: str, vec_db_path: str) -> None:
|
||||||
logger.info("Extracted %d pages", len(chunks))
|
logger.info("Extracted %d pages", len(chunks))
|
||||||
|
|
||||||
# Step 2: Store chunks (replace any existing for this doc)
|
# Step 2: Store chunks (replace any existing for this doc)
|
||||||
from scripts.text_clean import clean_paragraph
|
|
||||||
conn.execute("DELETE FROM page_chunks WHERE doc_id=?", [doc_id])
|
conn.execute("DELETE FROM page_chunks WHERE doc_id=?", [doc_id])
|
||||||
chunk_rows: list[tuple[str, int, str]] = []
|
chunk_rows: list[tuple[str, int, str]] = []
|
||||||
for chunk in chunks:
|
for chunk in chunks:
|
||||||
cleaned_text = clean_paragraph(chunk.text)
|
|
||||||
if not cleaned_text:
|
|
||||||
continue
|
|
||||||
row = conn.execute(
|
row = conn.execute(
|
||||||
"""INSERT INTO page_chunks(doc_id, page_number, text, source, word_count)
|
"""INSERT INTO page_chunks(doc_id, page_number, text, source, word_count)
|
||||||
VALUES (?,?,?,?,?) RETURNING id""",
|
VALUES (?,?,?,?,?) RETURNING id""",
|
||||||
[doc_id, chunk.page_number, cleaned_text, chunk.source, len(cleaned_text.split())],
|
[doc_id, chunk.page_number, chunk.text, chunk.source, chunk.word_count],
|
||||||
).fetchone()
|
).fetchone()
|
||||||
chunk_rows.append((row[0], chunk.page_number, cleaned_text))
|
chunk_rows.append((row[0], chunk.page_number, chunk.text))
|
||||||
conn.commit()
|
conn.commit()
|
||||||
|
|
||||||
# Step 3: Embed and store vectors if Ollama is configured (BYOK gate)
|
# Step 3: Embed and store vectors if Ollama is configured (BYOK gate)
|
||||||
# Embedding failure is non-fatal: document remains BM25-searchable.
|
|
||||||
ollama_url = os.environ.get("PAGEPIPER_OLLAMA_URL", "").strip()
|
ollama_url = os.environ.get("PAGEPIPER_OLLAMA_URL", "").strip()
|
||||||
if ollama_url and chunks:
|
if ollama_url and chunks:
|
||||||
try:
|
logger.info("Embedding %d pages via Ollama at %s", len(chunks), ollama_url)
|
||||||
logger.info("Embedding %d pages via Ollama at %s", len(chunks), ollama_url)
|
from circuitforge_core.llm import LLMRouter
|
||||||
from circuitforge_core.llm import LLMRouter
|
from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore
|
||||||
from circuitforge_core.vector.sqlite_vec import LocalSQLiteVecStore
|
|
||||||
|
|
||||||
_clean = ollama_url.rstrip("/")
|
_clean = ollama_url.rstrip("/")
|
||||||
base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
|
base_url = _clean if _clean.endswith("/v1") else _clean + "/v1"
|
||||||
router = LLMRouter({
|
router = LLMRouter({
|
||||||
"fallback_order": ["ollama"],
|
"fallback_order": ["ollama"],
|
||||||
"backends": {
|
"backends": {
|
||||||
"ollama": {
|
"ollama": {
|
||||||
"type": "openai_compat",
|
"type": "openai_compat",
|
||||||
"base_url": base_url,
|
"base_url": base_url,
|
||||||
"model": os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
|
"model": os.environ.get("PAGEPIPER_CHAT_MODEL", "mistral:7b"),
|
||||||
"embedding_model": os.environ.get(
|
"embedding_model": os.environ.get(
|
||||||
"PAGEPIPER_EMBED_MODEL", "nomic-embed-text"
|
"PAGEPIPER_EMBED_MODEL", "nomic-embed-text"
|
||||||
),
|
),
|
||||||
"supports_images": False,
|
"supports_images": False,
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
})
|
})
|
||||||
embed_dims = int(os.environ.get("PAGEPIPER_EMBED_DIMS", "1024"))
|
vec_store = LocalSQLiteVecStore(
|
||||||
vec_store = LocalSQLiteVecStore(
|
db_path=vec_db_path, table="page_vecs", dimensions=768
|
||||||
db_path=vec_db_path, table="page_vecs", dimensions=embed_dims
|
)
|
||||||
)
|
# Remove old vectors before re-inserting. If embedding fails mid-way,
|
||||||
# Remove old vectors before re-inserting. If embedding fails mid-way,
|
# old vectors are gone but new ones are partial — re-ingest recovers.
|
||||||
# old vectors are gone but new ones are partial — re-ingest recovers.
|
vec_store.delete_where({"doc_id": doc_id})
|
||||||
vec_store.delete_where({"doc_id": doc_id})
|
|
||||||
|
texts = [text for _, _, text in chunk_rows]
|
||||||
texts = [text for _, _, text in chunk_rows]
|
vectors: list[list[float]] = []
|
||||||
vectors: list[list[float]] = []
|
for i in range(0, len(texts), EMBED_BATCH_SIZE):
|
||||||
for i in range(0, len(texts), EMBED_BATCH_SIZE):
|
vectors.extend(router.embed(texts[i : i + EMBED_BATCH_SIZE]))
|
||||||
vectors.extend(router.embed(texts[i : i + EMBED_BATCH_SIZE]))
|
|
||||||
|
for (chunk_id, page_number, _), vector in zip(chunk_rows, vectors):
|
||||||
for (chunk_id, page_number, _), vector in zip(chunk_rows, vectors):
|
vec_store.upsert(
|
||||||
vec_store.upsert(
|
id=chunk_id,
|
||||||
entry_id=chunk_id,
|
vector=vector,
|
||||||
vector=vector,
|
metadata={"doc_id": doc_id, "page_number": page_number},
|
||||||
metadata={"doc_id": doc_id, "page_number": page_number},
|
|
||||||
)
|
|
||||||
logger.info("Stored %d embeddings", len(vectors))
|
|
||||||
except Exception as embed_exc:
|
|
||||||
logger.warning(
|
|
||||||
"Embedding skipped for doc %s — BM25 only (reason: %s)",
|
|
||||||
doc_id, embed_exc,
|
|
||||||
)
|
)
|
||||||
|
logger.info("Stored %d embeddings", len(vectors))
|
||||||
|
|
||||||
_update_status(conn, doc_id, "ready", page_count=len(chunks))
|
_update_status(conn, doc_id, "ready", page_count=len(chunks))
|
||||||
logger.info("Ingest complete for doc %s (%d pages)", doc_id, len(chunks))
|
logger.info("Ingest complete for doc %s (%d pages)", doc_id, len(chunks))
|
||||||
|
|
|
||||||
|
|
@ -1,72 +0,0 @@
# scripts/text_clean.py
"""
Shared text-cleaning utilities for ingest pipelines.

Removes boilerplate lines injected by ebook converters, piracy watermarks,
and other non-content artifacts before chunks are stored or embedded.
"""
from __future__ import annotations

import re

# Lines that match any of these patterns are dropped entirely.
# Each pattern is matched against the stripped line (case-insensitive).
_LINE_DROP_PATTERNS: list[re.Pattern] = [
    # ABC Amber converter family
    re.compile(r'generated by abc amber', re.IGNORECASE),
    re.compile(r'processtext\.com', re.IGNORECASE),
    # Calibre / sigil metadata lines
    re.compile(r'calibre \d+\.\d+', re.IGNORECASE),
    # Standalone URLs (line is just a URL, no surrounding prose)
    re.compile(r'^https?://\S+$'),
    # Common piracy / file-sharing watermarks
    re.compile(r'www\.\w+\.(com|net|org)/\S*book', re.IGNORECASE),
    re.compile(r'downloaded from', re.IGNORECASE),
    re.compile(r'scanned by', re.IGNORECASE),
    re.compile(r'provided by', re.IGNORECASE),
    # Page-number-only lines from PDF extraction (e.g. "- 42 -" or "42")
    re.compile(r'^\s*-?\s*\d{1,4}\s*-?\s*$'),
]

# Inline substrings to strip from within a line before further processing.
_INLINE_STRIP_PATTERNS: list[re.Pattern] = [
    re.compile(r'generated by abc amber \w+ converter,?\s*https?://\S*', re.IGNORECASE),
    re.compile(r'https?://www\.processtext\.com/\S*', re.IGNORECASE),
]


def is_artifact_line(line: str) -> bool:
    """Return True if the line is a known conversion artifact and should be dropped."""
    stripped = line.strip()
    return any(p.search(stripped) for p in _LINE_DROP_PATTERNS)


def clean_line(line: str) -> str:
    """Strip inline converter artifacts from a line, returning the cleaned version."""
    for p in _INLINE_STRIP_PATTERNS:
        line = p.sub("", line)
    return line.strip()


def clean_paragraph(text: str) -> str:
    """Clean a multi-line paragraph: drop artifact lines, strip inline artifacts."""
    lines = []
    for line in text.splitlines():
        if is_artifact_line(line):
            continue
        cleaned = clean_line(line)
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


def filter_paragraphs(paragraphs: list[str]) -> list[str]:
    """Remove artifact lines from a list of paragraph strings."""
    result = []
    for para in paragraphs:
        if is_artifact_line(para):
            continue
        cleaned = clean_line(para)
        if cleaned and len(cleaned.split()) >= 4:
            result.append(cleaned)
    return result
@ -30,10 +30,8 @@ def client(test_db, tmp_path, monkeypatch):
|
||||||
from app.main import app, _bm25
|
from app.main import app, _bm25
|
||||||
from app.deps import get_db
|
from app.deps import get_db
|
||||||
|
|
||||||
# Suppress startup side effects — test_db fixture already applies the schema,
|
# Suppress migrations during tests — test_db fixture already applies the schema
|
||||||
# and vec schema validation is tested separately in test_startup.py
|
|
||||||
monkeypatch.setattr(_main_module, "_apply_migrations", lambda: None)
|
monkeypatch.setattr(_main_module, "_apply_migrations", lambda: None)
|
||||||
monkeypatch.setattr(_main_module, "_check_vec_schema", lambda *a, **kw: None)
|
|
||||||
|
|
||||||
def override_db():
|
def override_db():
|
||||||
conn = sqlite3.connect(test_db)
|
conn = sqlite3.connect(test_db)
|
||||||
|
|
|
||||||
|
|
@ -1,170 +0,0 @@
# tests/test_startup.py
"""Tests for startup vec DB schema validation (_check_vec_schema)."""
from __future__ import annotations

import os
import sqlite3
import threading
from unittest.mock import MagicMock, patch

import pytest

from app.main import _check_vec_schema, _reembed_docs


def _make_vec_db(path: str, dims: int) -> None:
    """Create a minimal sqlite-vec-style DB with the given dimension."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Replicate the virtual table name used by LocalSQLiteVecStore
    conn.execute(f"CREATE TABLE page_vecs_vecs (embedding float[{dims}])")
    conn.execute(
        "INSERT INTO sqlite_master(type, name, tbl_name, sql) VALUES (?,?,?,?)"
        if False else ""
    )
    # Write a real sqlite_master entry via a virtual table workaround:
    # Easiest is to put the dimension marker directly in a metadata table.
    # But _check_vec_schema reads sqlite_master, so we need the real DDL there.
    conn.close()
    # sqlite_master is read-only — recreate using the real CREATE VIRTUAL TABLE path
    # by faking it via a regular table with the matching name pattern.
    conn2 = sqlite3.connect(path)
    conn2.execute("DROP TABLE IF EXISTS page_vecs_vecs")
    # Write a row that _check_vec_schema will parse via its regex
    conn2.execute(
        "CREATE TABLE _schema_hint (sql TEXT)"
    )
    conn2.execute(
        "INSERT INTO _schema_hint VALUES (?)",
        [f"CREATE VIRTUAL TABLE page_vecs_vecs USING vec0(embedding float[{dims}])"],
    )
    conn2.commit()
    conn2.close()


def _make_real_vec_db(path: str, dims: int) -> None:
    """Create a vec DB whose sqlite_master actually contains the dimension DDL."""
    import sqlite3 as _sq
    # We can't load sqlite-vec in tests, so simulate by writing sqlite_master directly
    # via a shadow table that _check_vec_schema reads.
    conn = _sq.connect(path)
    conn.execute(
        f"""CREATE TABLE page_vecs_vecs (
            embedding float[{dims}]
        )"""
    )
    conn.commit()
    conn.close()


class TestCheckVecSchema:
    def test_no_file_is_noop(self, tmp_path):
        """Missing vec DB should not raise."""
        _check_vec_schema(str(tmp_path / "missing.db"), 1024, str(tmp_path / "main.db"))

    def test_matching_dims_keeps_file(self, tmp_path):
        """Correct dimensions: vec DB must not be deleted."""
        vec_path = str(tmp_path / "vecs.db")
        conn = sqlite3.connect(vec_path)
        conn.execute("CREATE TABLE page_vecs_vecs (embedding float[1024])")
        conn.commit()
        conn.close()

        _check_vec_schema(vec_path, 1024, str(tmp_path / "main.db"))

        assert os.path.exists(vec_path), "Vec DB should not be deleted when dims match"

    def test_mismatched_dims_deletes_file(self, tmp_path):
        """Dimension mismatch: vec DB must be deleted."""
        vec_path = str(tmp_path / "vecs.db")
        conn = sqlite3.connect(vec_path)
        conn.execute("CREATE TABLE page_vecs_vecs (embedding float[768])")
        conn.commit()
        conn.close()

        db_path = str(tmp_path / "main.db")
        _check_vec_schema(vec_path, 1024, db_path)

        assert not os.path.exists(vec_path), "Vec DB should be deleted on dimension mismatch"

    def test_mismatched_dims_queues_reembed(self, tmp_path):
        """Dimension mismatch: re-embed thread must be started for ready docs."""
        vec_path = str(tmp_path / "vecs.db")
        conn = sqlite3.connect(vec_path)
        conn.execute("CREATE TABLE page_vecs_vecs (embedding float[768])")
        conn.commit()
        conn.close()

        db_path = str(tmp_path / "main.db")
        schema = (
            "CREATE TABLE documents ("
            "id TEXT PRIMARY KEY, title TEXT, file_path TEXT, "
            "status TEXT, task_id TEXT, page_count INTEGER, "
            "error_msg TEXT, created_at TEXT, updated_at TEXT)"
        )
        main_conn = sqlite3.connect(db_path)
        main_conn.execute(schema)
        main_conn.execute(
            "INSERT INTO documents VALUES ('abc123', 'Book', '/tmp/book.pdf', 'ready', NULL, 10, NULL, '2026-01-01', '2026-01-01')"
|
|
||||||
)
|
|
||||||
main_conn.commit()
|
|
||||||
main_conn.close()
|
|
||||||
|
|
||||||
started = []
|
|
||||||
real_thread_start = threading.Thread.start
|
|
||||||
|
|
||||||
def _capture_start(self):
|
|
||||||
started.append(self)
|
|
||||||
# Don't actually run the re-embed to keep tests fast
|
|
||||||
self.run = lambda: None
|
|
||||||
real_thread_start(self)
|
|
||||||
|
|
||||||
with patch.object(threading.Thread, "start", _capture_start):
|
|
||||||
_check_vec_schema(vec_path, 1024, db_path)
|
|
||||||
|
|
||||||
assert len(started) == 1, "Exactly one re-embed thread should be started"
|
|
||||||
assert started[0].name == "pagepiper-reembed"
|
|
||||||
|
|
||||||
def test_no_ready_docs_skips_thread(self, tmp_path):
|
|
||||||
"""Mismatch with no ready docs: no thread should be started."""
|
|
||||||
vec_path = str(tmp_path / "vecs.db")
|
|
||||||
conn = sqlite3.connect(vec_path)
|
|
||||||
conn.execute("CREATE TABLE page_vecs_vecs (embedding float[768])")
|
|
||||||
conn.commit()
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
db_path = str(tmp_path / "main.db")
|
|
||||||
schema = (
|
|
||||||
"CREATE TABLE documents ("
|
|
||||||
"id TEXT PRIMARY KEY, title TEXT, file_path TEXT, "
|
|
||||||
"status TEXT, task_id TEXT, page_count INTEGER, "
|
|
||||||
"error_msg TEXT, created_at TEXT, updated_at TEXT)"
|
|
||||||
)
|
|
||||||
main_conn = sqlite3.connect(db_path)
|
|
||||||
main_conn.execute(schema)
|
|
||||||
main_conn.commit()
|
|
||||||
main_conn.close()
|
|
||||||
|
|
||||||
started = []
|
|
||||||
with patch.object(threading.Thread, "start", lambda self: started.append(self)):
|
|
||||||
_check_vec_schema(vec_path, 1024, db_path)
|
|
||||||
|
|
||||||
assert len(started) == 0
|
|
||||||
|
|
||||||
def test_empty_db_no_table_is_noop(self, tmp_path):
|
|
||||||
"""Vec DB exists but has no page_vecs_vecs table yet: no deletion."""
|
|
||||||
vec_path = str(tmp_path / "vecs.db")
|
|
||||||
sqlite3.connect(vec_path).close() # create empty file
|
|
||||||
|
|
||||||
_check_vec_schema(vec_path, 1024, str(tmp_path / "main.db"))
|
|
||||||
|
|
||||||
assert os.path.exists(vec_path)
|
|
||||||
|
|
||||||
def test_corrupt_db_does_not_raise(self, tmp_path):
|
|
||||||
"""Corrupt or unreadable vec DB must not propagate exceptions."""
|
|
||||||
vec_path = str(tmp_path / "vecs.db")
|
|
||||||
with open(vec_path, "w") as f:
|
|
||||||
f.write("not a sqlite database")
|
|
||||||
|
|
||||||
_check_vec_schema(vec_path, 1024, str(tmp_path / "main.db"))
|
|
||||||
# No assertion needed — just must not raise
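The tests above pin down the observable behaviour of `_check_vec_schema` without showing its body. The following is a minimal sketch of a function that would satisfy them, not the project's actual implementation. The names `stored_vec_dims`, `check_vec_schema_sketch`, `_ready_doc_ids` and the `reembed` callback are hypothetical. The sketch assumes the dimension is recovered by a regex over the `sql` column of `sqlite_master`, as the helper comments in this test file suggest.

```python
import os
import re
import sqlite3
import threading

# Matches the width in DDL such as "embedding float[1024]".
_DIM_RE = re.compile(r"float\[(\d+)\]")


def stored_vec_dims(vec_path):
    """Return the embedding width recorded in sqlite_master, or None if unknown."""
    if not os.path.exists(vec_path):
        return None
    try:
        conn = sqlite3.connect(vec_path)
        try:
            row = conn.execute(
                "SELECT sql FROM sqlite_master WHERE name = 'page_vecs_vecs'"
            ).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        # Corrupt or unreadable file: treat as "nothing to check".
        return None
    if row is None or row[0] is None:
        return None
    match = _DIM_RE.search(row[0])
    return int(match.group(1)) if match else None


def _ready_doc_ids(db_path):
    """List ids of documents whose status is 'ready' (hypothetical helper)."""
    if not os.path.exists(db_path):
        return []
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT id FROM documents WHERE status = 'ready'"
            ).fetchall()
        finally:
            conn.close()
    except sqlite3.Error:
        return []
    return [r[0] for r in rows]


def check_vec_schema_sketch(vec_path, expected_dims, db_path, reembed=lambda ids: None):
    """Delete a vec DB built with a different width and queue a re-embed.

    Returns the started thread, or None when nothing had to be done.
    `reembed` stands in for whatever the application uses to rebuild vectors.
    """
    found = stored_vec_dims(vec_path)
    if found is None or found == expected_dims:
        return None
    os.remove(vec_path)
    ready = _ready_doc_ids(db_path)
    if not ready:
        return None
    thread = threading.Thread(
        target=reembed, args=(ready,), name="pagepiper-reembed", daemon=True
    )
    thread.start()
    return thread
```

Returning the thread is a choice made for this sketch so a caller can `join()` it. The tests above instead patch `threading.Thread.start`, which suggests the real function returns nothing.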
|
|
||||||
|
|
@ -1,108 +0,0 @@
|
||||||
# tests/test_text_clean.py
|
|
||||||
"""Tests for ebook artifact filtering in scripts/text_clean.py."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from scripts.text_clean import (
|
|
||||||
clean_line,
|
|
||||||
clean_paragraph,
|
|
||||||
filter_paragraphs,
|
|
||||||
is_artifact_line,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
class TestIsArtifactLine:
|
|
||||||
def test_abc_amber_lit(self):
|
|
||||||
assert is_artifact_line(
|
|
||||||
"Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html"
|
|
||||||
)
|
|
||||||
|
|
||||||
def test_abc_amber_rtf(self):
|
|
||||||
assert is_artifact_line("Generated by ABC Amber RTF Converter")
|
|
||||||
|
|
||||||
def test_processtext_url_only(self):
|
|
||||||
assert is_artifact_line("http://www.processtext.com/abclit.html")
|
|
||||||
|
|
||||||
def test_standalone_url(self):
|
|
||||||
assert is_artifact_line("https://www.example.com/book")
|
|
||||||
|
|
||||||
def test_page_number_only(self):
|
|
||||||
assert is_artifact_line("42")
|
|
||||||
assert is_artifact_line("- 42 -")
|
|
||||||
assert is_artifact_line(" 7 ")
|
|
||||||
|
|
||||||
def test_downloaded_from(self):
|
|
||||||
assert is_artifact_line("Downloaded from www.fictionsite.net")
|
|
||||||
|
|
||||||
def test_scanned_by(self):
|
|
||||||
assert is_artifact_line("Scanned by SomeUser")
|
|
||||||
|
|
||||||
def test_normal_prose_not_artifact(self):
|
|
||||||
assert not is_artifact_line(
|
|
||||||
'"And what if food isn\'t the only reason Jagang is going to Anderith?"'
|
|
||||||
)
|
|
||||||
|
|
||||||
def test_url_embedded_in_prose_not_dropped(self):
|
|
||||||
# A URL inside a sentence is not a standalone-URL artifact line
|
|
||||||
assert not is_artifact_line(
|
|
||||||
"You can read more about this at https://example.com and continue."
|
|
||||||
)
|
|
||||||
|
|
||||||
def test_short_page_header_not_dropped(self):
|
|
||||||
# "Chapter 1" is not an artifact — 4-digit number check only drops bare numbers
|
|
||||||
assert not is_artifact_line("Chapter 1")
|
|
||||||
|
|
||||||
|
|
||||||
class TestCleanLine:
|
|
||||||
def test_strips_inline_abc_amber(self):
|
|
||||||
line = "Some prose. Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html"
|
|
||||||
result = clean_line(line)
|
|
||||||
assert "ABC Amber" not in result
|
|
||||||
assert "processtext" not in result
|
|
||||||
assert "Some prose." in result
|
|
||||||
|
|
||||||
def test_passes_clean_line_unchanged(self):
|
|
||||||
line = "He cocked an eyebrow and smiled."
|
|
||||||
assert clean_line(line) == line
|
|
||||||
|
|
||||||
|
|
||||||
class TestCleanParagraph:
|
|
||||||
def test_drops_artifact_lines_from_paragraph(self):
|
|
||||||
text = (
|
|
||||||
"Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html\n"
|
|
||||||
'"And what if food isn\'t the only reason Jagang is going to Anderith?"\n'
|
|
||||||
"He cocked an eyebrow."
|
|
||||||
)
|
|
||||||
result = clean_paragraph(text)
|
|
||||||
assert "ABC Amber" not in result
|
|
||||||
assert "Jagang" in result
|
|
||||||
assert "eyebrow" in result
|
|
||||||
|
|
||||||
def test_all_artifact_paragraph_returns_empty(self):
|
|
||||||
text = "Generated by ABC Amber LIT Converter\nhttp://www.processtext.com/abclit.html"
|
|
||||||
assert clean_paragraph(text) == ""
|
|
||||||
|
|
||||||
def test_clean_paragraph_unchanged(self):
|
|
||||||
text = "Richard raised his sword.\nThe magic surged through him."
|
|
||||||
assert clean_paragraph(text) == text
|
|
||||||
|
|
||||||
|
|
||||||
class TestFilterParagraphs:
|
|
||||||
def test_drops_artifact_paragraphs(self):
|
|
||||||
paras = [
|
|
||||||
"Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html",
|
|
||||||
'"And what if food isn\'t the only reason Jagang is going to Anderith?"',
|
|
||||||
"He cocked an eyebrow at the question.",
|
|
||||||
]
|
|
||||||
result = filter_paragraphs(paras)
|
|
||||||
assert len(result) == 2
|
|
||||||
assert all("ABC Amber" not in p for p in result)
|
|
||||||
|
|
||||||
def test_drops_short_lines_under_4_words(self):
|
|
||||||
paras = ["Hi", "OK sure", "Valid sentence with enough words here."]
|
|
||||||
result = filter_paragraphs(paras)
|
|
||||||
assert result == ["Valid sentence with enough words here."]
|
|
||||||
|
|
||||||
def test_empty_input(self):
|
|
||||||
assert filter_paragraphs([]) == []
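The removed tests describe the filtering rules of `scripts/text_clean.py` only through examples. The sketch below is one way to reproduce the `is_artifact_line` and `filter_paragraphs` cases. It is not the module's real code: the function names carry a `_sketch` suffix, and the regular expressions and the `min_words` parameter are assumptions inferred from the test inputs.

```python
import re

# Each pattern must match the whole line's role, so a URL inside a
# sentence or a heading like "Chapter 1" is left alone.
_ARTIFACT_PATTERNS = [
    re.compile(r"^\s*generated by abc amber \w+ converter", re.IGNORECASE),
    re.compile(r"^\s*(downloaded from|scanned by)\b", re.IGNORECASE),
    re.compile(r"^\s*https?://\S+\s*$", re.IGNORECASE),  # standalone URL
    re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$"),            # bare page number
]


def is_artifact_line_sketch(line):
    """True when the line looks like converter residue rather than prose."""
    return any(p.search(line) for p in _ARTIFACT_PATTERNS)


def filter_paragraphs_sketch(paragraphs, min_words=4):
    """Drop artifact lines, then drop paragraphs shorter than min_words."""
    kept = []
    for para in paragraphs:
        lines = [ln for ln in para.splitlines() if not is_artifact_line_sketch(ln)]
        text = "\n".join(lines).strip()
        if len(text.split()) >= min_words:
            kept.append(text)
    return kept
```

Anchoring the patterns at line start keeps a line such as "Some prose. Generated by ABC Amber ..." out of the artifact set, leaving that case to an inline cleaner like the `clean_line` function the tests exercise.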
|
|
||||||
|
|
@ -6,13 +6,11 @@
|
||||||
<RouterLink to="/chat" class="nav-link">Chat</RouterLink>
|
<RouterLink to="/chat" class="nav-link">Chat</RouterLink>
|
||||||
</nav>
|
</nav>
|
||||||
<RouterView />
|
<RouterView />
|
||||||
<FeedbackButton />
|
|
||||||
</div>
|
</div>
|
||||||
</template>
|
</template>
|
||||||
|
|
||||||
<script setup lang="ts">
|
<script setup lang="ts">
|
||||||
import { RouterLink, RouterView } from "vue-router"
|
import { RouterLink, RouterView } from "vue-router"
|
||||||
import FeedbackButton from "@/components/FeedbackButton.vue"
|
|
||||||
</script>
|
</script>
|
||||||
|
|
||||||
<style>
|
<style>
|
||||||
|
|
|
||||||
|
|
@ -37,15 +37,6 @@ export interface TaskStatus {
|
||||||
error?: string
|
error?: string
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface DocumentStatus {
|
|
||||||
id: string
|
|
||||||
status: "pending" | "processing" | "ready" | "error"
|
|
||||||
task_id: string | null
|
|
||||||
page_count: number | null
|
|
||||||
vec_count: number
|
|
||||||
error_msg: string | null
|
|
||||||
}
|
|
||||||
|
|
||||||
export interface ChatMessage {
|
export interface ChatMessage {
|
||||||
role: string
|
role: string
|
||||||
content: string
|
content: string
|
||||||
|
|
@ -71,23 +62,11 @@ export const api = {
|
||||||
const r = await fetch(`${BASE}/api/library/${docId}`, { method: "DELETE" })
|
const r = await fetch(`${BASE}/api/library/${docId}`, { method: "DELETE" })
|
||||||
if (!r.ok) throw new Error(await r.text())
|
if (!r.ok) throw new Error(await r.text())
|
||||||
},
|
},
|
||||||
async uploadDocument(file: File): Promise<{ doc_id: string; task_id: string | null; filename: string; status: string }> {
|
|
||||||
const form = new FormData()
|
|
||||||
form.append("file", file)
|
|
||||||
const r = await fetch(`${BASE}/api/library/upload`, { method: "POST", body: form })
|
|
||||||
if (!r.ok) throw new Error(await r.text())
|
|
||||||
return r.json()
|
|
||||||
},
|
|
||||||
async getTaskStatus(taskId: string): Promise<TaskStatus> {
|
async getTaskStatus(taskId: string): Promise<TaskStatus> {
|
||||||
const r = await fetch(`${BASE}/api/ingest/${taskId}`)
|
const r = await fetch(`${BASE}/api/ingest/${taskId}`)
|
||||||
if (!r.ok) throw new Error(await r.text())
|
if (!r.ok) throw new Error(await r.text())
|
||||||
return r.json()
|
return r.json()
|
||||||
},
|
},
|
||||||
async getDocumentStatus(docId: string): Promise<DocumentStatus> {
|
|
||||||
const r = await fetch(`${BASE}/api/library/${docId}/status`)
|
|
||||||
if (!r.ok) throw new Error(await r.text())
|
|
||||||
return r.json()
|
|
||||||
},
|
|
||||||
async search(query: string, topK = 10, docIds?: string[]): Promise<SearchResult[]> {
|
async search(query: string, topK = 10, docIds?: string[]): Promise<SearchResult[]> {
|
||||||
const r = await fetch(`${BASE}/api/search`, {
|
const r = await fetch(`${BASE}/api/search`, {
|
||||||
method: "POST",
|
method: "POST",
|
||||||
|
|
@ -119,21 +98,4 @@ export const api = {
|
||||||
}
|
}
|
||||||
return r.json()
|
return r.json()
|
||||||
},
|
},
|
||||||
async chatFeedbackStatus(): Promise<{ enabled: boolean }> {
|
|
||||||
const r = await fetch(`${BASE}/api/chat/feedback/status`)
|
|
||||||
if (!r.ok) return { enabled: false }
|
|
||||||
return r.json()
|
|
||||||
},
|
|
||||||
async submitChatFeedback(
|
|
||||||
rating: 1 | -1,
|
|
||||||
question: string,
|
|
||||||
answer: string,
|
|
||||||
docIds: string[],
|
|
||||||
): Promise<void> {
|
|
||||||
await fetch(`${BASE}/api/chat/feedback`, {
|
|
||||||
method: "POST",
|
|
||||||
headers: { "Content-Type": "application/json" },
|
|
||||||
body: JSON.stringify({ rating, question, answer, doc_ids: docIds }),
|
|
||||||
})
|
|
||||||
},
|
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -1,24 +1,18 @@
|
||||||
<template>
|
<template>
|
||||||
<div class="doc-card" :class="`status-${currentStatus}`">
|
<div class="doc-card" :class="`status-${doc.status}`">
|
||||||
<div class="doc-status-badge" :class="`badge-${currentStatus}`">{{ currentStatus }}</div>
|
<div class="doc-status-badge">{{ doc.status }}</div>
|
||||||
<div class="doc-title">{{ doc.title }}</div>
|
<div class="doc-title">{{ doc.title }}</div>
|
||||||
<div class="doc-meta" v-if="displayPageCount != null">{{ displayPageCount }} pages</div>
|
<div class="doc-meta" v-if="doc.page_count != null">{{ doc.page_count }} pages</div>
|
||||||
<div class="doc-meta path">{{ shortPath }}</div>
|
<div class="doc-meta path">{{ shortPath }}</div>
|
||||||
|
|
||||||
<div class="ingest-progress" v-if="isProcessing">
|
<IngestProgress
|
||||||
<div class="progress-label">
|
v-if="doc.status === 'processing' && doc.task_id"
|
||||||
<span>{{ progressLabel }}</span>
|
:task-id="doc.task_id"
|
||||||
<span class="progress-pct" v-if="progressPct != null">{{ progressPct }}%</span>
|
@done="emit('refresh')"
|
||||||
</div>
|
/>
|
||||||
<div class="progress-bar">
|
|
||||||
<div class="progress-fill" :class="{ indeterminate: progressPct == null }" :style="progressPct != null ? { width: `${progressPct}%` } : {}" />
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<p class="doc-error" v-if="currentStatus === 'error'">{{ errorMsg ?? 'Indexing failed.' }}</p>
|
|
||||||
|
|
||||||
<div class="doc-actions">
|
<div class="doc-actions">
|
||||||
<button class="btn-sm" @click="emit('reingest', doc.id)" :disabled="isProcessing">
|
<button class="btn-sm" @click="emit('reingest', doc.id)" :disabled="doc.status === 'processing'">
|
||||||
Re-index
|
Re-index
|
||||||
</button>
|
</button>
|
||||||
<button class="btn-sm danger" @click="emit('delete', doc.id)">Remove</button>
|
<button class="btn-sm danger" @click="emit('delete', doc.id)">Remove</button>
|
||||||
|
|
@ -27,9 +21,9 @@
|
||||||
</template>
|
</template>
|
||||||
|
|
||||||
<script setup lang="ts">
|
<script setup lang="ts">
|
||||||
import { computed, onMounted, onUnmounted, ref } from "vue"
|
import { computed } from "vue"
|
||||||
import type { Document } from "@/api"
|
import type { Document } from "@/api"
|
||||||
import { api } from "@/api"
|
import IngestProgress from "@/components/IngestProgress.vue"
|
||||||
|
|
||||||
const props = defineProps<{ doc: Document }>()
|
const props = defineProps<{ doc: Document }>()
|
||||||
const emit = defineEmits<{ reingest: [id: string]; delete: [id: string]; refresh: [] }>()
|
const emit = defineEmits<{ reingest: [id: string]; delete: [id: string]; refresh: [] }>()
|
||||||
|
|
@ -38,54 +32,6 @@ const shortPath = computed(() => {
|
||||||
const parts = props.doc.file_path.split("/")
|
const parts = props.doc.file_path.split("/")
|
||||||
return parts.slice(-2).join("/")
|
return parts.slice(-2).join("/")
|
||||||
})
|
})
|
||||||
|
|
||||||
// Live-updating fields polled from /api/library/{id}/status
|
|
||||||
const currentStatus = ref(props.doc.status)
|
|
||||||
const displayPageCount = ref(props.doc.page_count)
|
|
||||||
const vecCount = ref(0)
|
|
||||||
const errorMsg = ref<string | null>(null)
|
|
||||||
|
|
||||||
const isProcessing = computed(() => currentStatus.value === "processing")
|
|
||||||
|
|
||||||
const progressLabel = computed(() => {
|
|
||||||
if (displayPageCount.value == null || vecCount.value === 0) return "Extracting text…"
|
|
||||||
return `Embedding ${vecCount.value} / ${displayPageCount.value} pages`
|
|
||||||
})
|
|
||||||
|
|
||||||
const progressPct = computed((): number | null => {
|
|
||||||
if (displayPageCount.value == null || displayPageCount.value === 0) return null
|
|
||||||
if (vecCount.value === 0) return null
|
|
||||||
return Math.min(Math.round((vecCount.value / displayPageCount.value) * 100), 99)
|
|
||||||
})
|
|
||||||
|
|
||||||
let timer: ReturnType<typeof setInterval> | null = null
|
|
||||||
|
|
||||||
async function pollStatus() {
|
|
||||||
try {
|
|
||||||
const s = await api.getDocumentStatus(props.doc.id)
|
|
||||||
currentStatus.value = s.status
|
|
||||||
displayPageCount.value = s.page_count
|
|
||||||
vecCount.value = s.vec_count
|
|
||||||
errorMsg.value = s.error_msg
|
|
||||||
if (s.status !== "processing") {
|
|
||||||
stopPoll()
|
|
||||||
if (s.status === "ready") emit("refresh")
|
|
||||||
}
|
|
||||||
} catch (_e: unknown) { /* non-fatal — keep polling */ }
|
|
||||||
}
|
|
||||||
|
|
||||||
function stopPoll() {
|
|
||||||
if (timer) { clearInterval(timer); timer = null }
|
|
||||||
}
|
|
||||||
|
|
||||||
onMounted(() => {
|
|
||||||
if (props.doc.status === "processing") {
|
|
||||||
pollStatus()
|
|
||||||
timer = setInterval(pollStatus, 3000)
|
|
||||||
}
|
|
||||||
})
|
|
||||||
|
|
||||||
onUnmounted(stopPoll)
|
|
||||||
</script>
|
</script>
|
||||||
|
|
||||||
<style scoped>
|
<style scoped>
|
||||||
|
|
@ -102,7 +48,6 @@ onUnmounted(stopPoll)
|
||||||
}
|
}
|
||||||
.doc-card.status-error { border-color: var(--color-error); }
|
.doc-card.status-error { border-color: var(--color-error); }
|
||||||
.doc-card.status-ready { border-color: var(--color-success); }
|
.doc-card.status-ready { border-color: var(--color-success); }
|
||||||
.doc-card.status-processing { border-color: var(--color-accent); }
|
|
||||||
.doc-title { font-weight: 600; font-size: 1rem; }
|
.doc-title { font-weight: 600; font-size: 1rem; }
|
||||||
.doc-meta { font-size: 0.8rem; color: var(--color-text-muted); }
|
.doc-meta { font-size: 0.8rem; color: var(--color-text-muted); }
|
||||||
.doc-meta.path { font-family: var(--font-mono); word-break: break-all; }
|
.doc-meta.path { font-family: var(--font-mono); word-break: break-all; }
|
||||||
|
|
@ -112,9 +57,6 @@ onUnmounted(stopPoll)
|
||||||
padding: 2px 6px; border-radius: var(--radius-sm);
|
padding: 2px 6px; border-radius: var(--radius-sm);
|
||||||
background: var(--color-surface-alt);
|
background: var(--color-surface-alt);
|
||||||
}
|
}
|
||||||
.badge-processing { background: var(--color-accent); color: #fff; }
|
|
||||||
.badge-ready { background: var(--color-success); color: #fff; }
|
|
||||||
.badge-error { background: var(--color-error); color: #fff; }
|
|
||||||
.doc-actions { display: flex; gap: 0.5rem; margin-top: 0.5rem; }
|
.doc-actions { display: flex; gap: 0.5rem; margin-top: 0.5rem; }
|
||||||
.btn-sm {
|
.btn-sm {
|
||||||
padding: 4px 10px; border: 1px solid var(--color-border); border-radius: var(--radius-sm);
|
padding: 4px 10px; border: 1px solid var(--color-border); border-radius: var(--radius-sm);
|
||||||
|
|
@ -123,23 +65,4 @@ onUnmounted(stopPoll)
|
||||||
.btn-sm:hover { border-color: var(--color-accent); }
|
.btn-sm:hover { border-color: var(--color-accent); }
|
||||||
.btn-sm.danger:hover { border-color: var(--color-error); color: var(--color-error); }
|
.btn-sm.danger:hover { border-color: var(--color-error); color: var(--color-error); }
|
||||||
.btn-sm:disabled { opacity: 0.4; cursor: default; }
|
.btn-sm:disabled { opacity: 0.4; cursor: default; }
|
||||||
.doc-error { color: var(--color-error); font-size: 0.8rem; }
|
|
||||||
|
|
||||||
/* Progress bar */
|
|
||||||
.ingest-progress { margin-top: 0.25rem; }
|
|
||||||
.progress-label {
|
|
||||||
display: flex; justify-content: space-between;
|
|
||||||
font-size: 0.78rem; color: var(--color-text-muted); margin-bottom: 4px;
|
|
||||||
}
|
|
||||||
.progress-pct { font-variant-numeric: tabular-nums; }
|
|
||||||
.progress-bar { height: 4px; background: var(--color-border); border-radius: 2px; overflow: hidden; }
|
|
||||||
.progress-fill { height: 100%; background: var(--color-accent); transition: width 0.4s ease; }
|
|
||||||
.progress-fill.indeterminate {
|
|
||||||
width: 40%;
|
|
||||||
animation: slide 1.4s ease-in-out infinite;
|
|
||||||
}
|
|
||||||
@keyframes slide {
|
|
||||||
0% { transform: translateX(-100%); }
|
|
||||||
100% { transform: translateX(300%); }
|
|
||||||
}
|
|
||||||
</style>
|
</style>
|
||||||
|
|
|
||||||
|
|
@ -1,631 +0,0 @@
|
||||||
<template>
|
|
||||||
<!-- Floating trigger button -->
|
|
||||||
<button
|
|
||||||
v-if="enabled"
|
|
||||||
class="feedback-fab"
|
|
||||||
@click="open = true"
|
|
||||||
aria-label="Send feedback or report a bug"
|
|
||||||
title="Send feedback or report a bug"
|
|
||||||
>
|
|
||||||
<svg class="feedback-fab-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round">
|
|
||||||
<path d="M21 15a2 2 0 01-2 2H7l-4 4V5a2 2 0 012-2h14a2 2 0 012 2z"/>
|
|
||||||
</svg>
|
|
||||||
<span class="feedback-fab-label">Feedback</span>
|
|
||||||
</button>
|
|
||||||
|
|
||||||
<!-- Modal — teleported to body to avoid z-index / overflow clipping -->
|
|
||||||
<Teleport to="body">
|
|
||||||
<Transition name="modal-fade">
|
|
||||||
<div v-if="open" class="feedback-overlay" @click.self="close">
|
|
||||||
<div class="feedback-modal" role="dialog" aria-modal="true" aria-label="Send Feedback">
|
|
||||||
|
|
||||||
<!-- Header -->
|
|
||||||
<div class="feedback-header">
|
|
||||||
<h2 class="feedback-title">{{ step === 1 ? "What's on your mind?" : "Review & submit" }}</h2>
|
|
||||||
<button class="feedback-close" @click="close" aria-label="Close">
|
|
||||||
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" width="18" height="18">
|
|
||||||
<line x1="18" y1="6" x2="6" y2="18"/><line x1="6" y1="6" x2="18" y2="18"/>
|
|
||||||
</svg>
|
|
||||||
</button>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- ── Step 1: Form ─────────────────────────────────────────── -->
|
|
||||||
<div v-if="step === 1" class="feedback-body">
|
|
||||||
<div class="form-group">
|
|
||||||
<label class="form-label">Type</label>
|
|
||||||
<div class="filter-chip-row">
|
|
||||||
<button
|
|
||||||
v-for="t in types"
|
|
||||||
:key="t.value"
|
|
||||||
:class="['btn-chip', { active: form.type === t.value }]"
|
|
||||||
@click="form.type = t.value"
|
|
||||||
type="button"
|
|
||||||
>{{ t.label }}</button>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div class="form-group">
|
|
||||||
<label class="form-label">Title <span class="form-required">*</span></label>
|
|
||||||
<input
|
|
||||||
v-model="form.title"
|
|
||||||
class="form-input"
|
|
||||||
type="text"
|
|
||||||
placeholder="Short summary of the issue or idea"
|
|
||||||
maxlength="120"
|
|
||||||
/>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div class="form-group">
|
|
||||||
<label class="form-label">Description <span class="form-required">*</span></label>
|
|
||||||
<textarea
|
|
||||||
v-model="form.description"
|
|
||||||
class="form-input feedback-textarea"
|
|
||||||
placeholder="Describe what happened or what you'd like to see…"
|
|
||||||
rows="4"
|
|
||||||
/>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div v-if="form.type === 'bug'" class="form-group">
|
|
||||||
<label class="form-label">Reproduction steps</label>
|
|
||||||
<textarea
|
|
||||||
v-model="form.repro"
|
|
||||||
class="form-input feedback-textarea"
|
|
||||||
placeholder="1. Go to… 2. Tap… 3. See error"
|
|
||||||
rows="3"
|
|
||||||
/>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div class="form-group">
|
|
||||||
<label class="form-label">Screenshot <span class="text-muted text-xs">(optional, max 5 MB)</span></label>
|
|
||||||
<input
|
|
||||||
type="file"
|
|
||||||
accept="image/*"
|
|
||||||
class="form-input-file"
|
|
||||||
@change="onScreenshotChange"
|
|
||||||
ref="fileInput"
|
|
||||||
/>
|
|
||||||
<div v-if="screenshotPreview" class="screenshot-preview">
|
|
||||||
<img :src="screenshotPreview" alt="Screenshot preview" />
|
|
||||||
<button class="screenshot-remove btn-link" type="button" @click="clearScreenshot" aria-label="Remove screenshot">Remove</button>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<p v-if="stepError" class="feedback-error">{{ stepError }}</p>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- ── Step 2: Attribution + confirm ──────────────────────────── -->
|
|
||||||
<div v-if="step === 2" class="feedback-body">
|
|
||||||
<div class="feedback-summary card">
|
|
||||||
<div class="feedback-summary-row">
|
|
||||||
<span class="text-muted text-sm">Type</span>
|
|
||||||
<span class="text-sm font-semibold">{{ typeLabel }}</span>
|
|
||||||
</div>
|
|
||||||
<div class="feedback-summary-row">
|
|
||||||
<span class="text-muted text-sm">Title</span>
|
|
||||||
<span class="text-sm">{{ form.title }}</span>
|
|
||||||
</div>
|
|
||||||
<div class="feedback-summary-row">
|
|
||||||
<span class="text-muted text-sm">Description</span>
|
|
||||||
<span class="text-sm feedback-summary-desc">{{ form.description }}</span>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div class="form-group mt-md">
|
|
||||||
<label class="form-label">Attribution (optional)</label>
|
|
||||||
<input
|
|
||||||
v-model="form.submitter"
|
|
||||||
class="form-input"
|
|
||||||
type="text"
|
|
||||||
placeholder="Your name <email@example.com>"
|
|
||||||
/>
|
|
||||||
<p class="text-muted text-xs mt-xs">Include your name and email in the issue if you'd like a response. Never required.</p>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<p v-if="submitError" class="feedback-error">{{ submitError }}</p>
|
|
||||||
<div v-if="submitted" class="feedback-success">
|
|
||||||
Issue filed! <a :href="issueUrl" target="_blank" rel="noopener" class="feedback-link">View on Forgejo →</a>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- Footer nav -->
|
|
||||||
<div class="feedback-footer">
|
|
||||||
<button v-if="step === 2 && !submitted" class="btn btn-ghost" @click="step = 1" :disabled="loading">← Back</button>
|
|
||||||
<button v-if="!submitted" class="btn btn-ghost" @click="close" :disabled="loading">Cancel</button>
|
|
||||||
<button
|
|
||||||
v-if="step === 1"
|
|
||||||
class="btn btn-primary"
|
|
||||||
@click="nextStep"
|
|
||||||
>Next →</button>
|
|
||||||
<button
|
|
||||||
v-if="step === 2 && !submitted"
|
|
||||||
class="btn btn-primary"
|
|
||||||
@click="submit"
|
|
||||||
:disabled="loading"
|
|
||||||
>{{ loading ? 'Filing…' : 'Submit' }}</button>
|
|
||||||
<button v-if="submitted" class="btn btn-primary" @click="close">Done</button>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</Transition>
|
|
||||||
</Teleport>
|
|
||||||
</template>
|
|
||||||
|
|
||||||
<script setup lang="ts">
|
|
||||||
import { ref, computed, onMounted } from 'vue'
|
|
||||||
|
|
||||||
const props = defineProps<{ currentTab?: string }>()
|
|
||||||
|
|
||||||
const fileInput = ref<HTMLInputElement | null>(null)
|
|
||||||
const screenshotB64 = ref<string | null>(null)
|
|
||||||
const screenshotPreview = ref<string | null>(null)
|
|
||||||
const screenshotFilename = ref('screenshot.png')
|
|
||||||
|
|
||||||
function onScreenshotChange(event: Event) {
|
|
||||||
const file = (event.target as HTMLInputElement).files?.[0]
|
|
||||||
if (!file) return
|
|
||||||
screenshotFilename.value = file.name
|
|
||||||
const reader = new FileReader()
|
|
||||||
reader.onload = (e) => {
|
|
||||||
const result = e.target?.result as string
|
|
||||||
screenshotB64.value = result
|
|
||||||
screenshotPreview.value = result
|
|
||||||
}
|
|
||||||
reader.readAsDataURL(file)
|
|
||||||
}
|
|
||||||
|
|
||||||
function clearScreenshot() {
|
|
||||||
screenshotB64.value = null
|
|
||||||
screenshotPreview.value = null
|
|
||||||
if (fileInput.value) fileInput.value.value = ''
|
|
||||||
}
|
|
||||||
|
|
||||||
const apiBase = (import.meta.env.VITE_API_BASE as string) ?? ''
|
|
||||||
|
|
||||||
// Probe once on mount — hidden until confirmed enabled so button never flashes
|
|
||||||
const enabled = ref(false)
|
|
||||||
onMounted(async () => {
|
|
||||||
try {
|
|
||||||
const res = await fetch(`${apiBase}/api/v1/feedback/status`)
|
|
||||||
if (res.ok) {
|
|
||||||
const data = await res.json()
|
|
||||||
enabled.value = data.enabled === true
|
|
||||||
}
|
|
||||||
} catch { /* network error — stay hidden */ }
|
|
||||||
})
|
|
||||||
|
|
||||||
const open = ref(false)
|
|
||||||
const step = ref(1)
|
|
||||||
const loading = ref(false)
|
|
||||||
const stepError = ref('')
|
|
||||||
const submitError = ref('')
|
|
||||||
const submitted = ref(false)
|
|
||||||
const issueUrl = ref('')
|
|
||||||
|
|
||||||
const types: { value: 'bug' | 'feature' | 'other'; label: string }[] = [
|
|
||||||
{ value: 'bug', label: '🐛 Bug' },
|
|
||||||
{ value: 'feature', label: '✨ Feature request' },
|
|
||||||
{ value: 'other', label: '💬 Other' },
|
|
||||||
]
|
|
||||||
|
|
||||||
const form = ref({
|
|
||||||
type: 'bug' as 'bug' | 'feature' | 'other',
|
|
||||||
title: '',
|
|
||||||
description: '',
|
|
||||||
repro: '',
|
|
||||||
submitter: '',
|
|
||||||
})
|
|
||||||
|
|
||||||
const typeLabel = computed(() => types.find(t => t.value === form.value.type)?.label ?? '')
|
|
||||||
|
|
||||||
function close() {
|
|
||||||
open.value = false
|
|
||||||
// reset after transition
|
|
||||||
setTimeout(reset, 300)
|
|
||||||
}
|
|
||||||
|
|
||||||
function reset() {
|
|
||||||
step.value = 1
|
|
||||||
loading.value = false
|
|
||||||
stepError.value = ''
|
|
||||||
submitError.value = ''
|
|
||||||
submitted.value = false
|
|
||||||
issueUrl.value = ''
|
|
||||||
form.value = { type: 'bug', title: '', description: '', repro: '', submitter: '' }
|
|
||||||
clearScreenshot()
|
|
||||||
}
|
|
||||||
|
|
||||||
function nextStep() {
|
|
||||||
stepError.value = ''
|
|
||||||
if (!form.value.title.trim() || !form.value.description.trim()) {
|
|
||||||
stepError.value = 'Please fill in both Title and Description.'
|
|
||||||
return
|
|
||||||
}
|
|
||||||
step.value = 2
|
|
||||||
}
|
|
||||||
|
|
||||||
async function submit() {
|
|
||||||
loading.value = true
|
|
||||||
submitError.value = ''
|
|
||||||
try {
|
|
||||||
const res = await fetch(`${apiBase}/api/v1/feedback`, {
|
|
||||||
method: 'POST',
|
|
||||||
headers: { 'Content-Type': 'application/json' },
|
|
||||||
body: JSON.stringify({
|
|
||||||
title: form.value.title.trim(),
|
|
||||||
description: form.value.description.trim(),
|
|
||||||
type: form.value.type,
|
|
||||||
repro: form.value.repro.trim(),
|
|
||||||
tab: props.currentTab ?? 'unknown',
|
|
||||||
submitter: form.value.submitter.trim(),
|
|
||||||
}),
|
|
||||||
})
|
|
||||||
if (!res.ok) {
|
|
||||||
const err = await res.json().catch(() => ({ detail: res.statusText }))
|
|
||||||
submitError.value = err.detail ?? 'Submission failed.'
|
|
||||||
return
|
|
||||||
}
|
|
||||||
const data = await res.json()
|
|
||||||
issueUrl.value = data.issue_url
|
|
||||||
|
|
||||||
// Upload screenshot if provided
|
|
||||||
if (screenshotB64.value) {
|
|
||||||
try {
|
|
||||||
await fetch(`${apiBase}/api/v1/feedback/attach`, {
|
|
||||||
method: 'POST',
|
|
||||||
headers: { 'Content-Type': 'application/json' },
|
|
||||||
body: JSON.stringify({
|
|
||||||
issue_number: data.issue_number,
|
|
||||||
filename: screenshotFilename.value,
|
|
||||||
image_b64: screenshotB64.value,
|
|
||||||
}),
|
|
||||||
})
|
|
||||||
// Non-fatal: if attach fails, the issue was still filed
|
|
||||||
} catch { /* ignore attach errors */ }
|
|
||||||
}
|
|
||||||
|
|
||||||
submitted.value = true
|
|
||||||
} catch (e) {
|
|
||||||
submitError.value = 'Network error — please try again.'
|
|
||||||
} finally {
|
|
||||||
loading.value = false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
</script>
|
|
||||||
|
|
||||||
<style scoped>
|
|
||||||
/* ── Floating action button ─────────────────────────────────────────── */
|
|
||||||
.feedback-fab {
|
|
||||||
position: fixed;
|
|
||||||
right: var(--spacing-md);
|
|
||||||
bottom: calc(68px + var(--spacing-md)); /* above mobile bottom nav */
|
|
||||||
z-index: 190;
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
gap: var(--spacing-xs);
|
|
||||||
padding: 9px var(--spacing-md);
|
|
||||||
background: var(--color-bg-elevated);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: 999px;
|
|
||||||
color: var(--color-text-secondary);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-weight: 500;
|
|
||||||
cursor: pointer;
|
|
||||||
box-shadow: var(--shadow-md);
|
|
||||||
transition: background 0.15s, color 0.15s, box-shadow 0.15s, border-color 0.15s;
|
|
||||||
}
|
|
||||||
.feedback-fab:hover {
|
|
||||||
background: var(--color-bg-card);
|
|
||||||
color: var(--color-text-primary);
|
|
||||||
border-color: var(--color-border-focus);
|
|
||||||
box-shadow: var(--shadow-lg);
|
|
||||||
}
|
|
||||||
.feedback-fab-icon { width: 15px; height: 15px; flex-shrink: 0; }
|
|
||||||
.feedback-fab-label { white-space: nowrap; }
|
|
||||||
|
|
||||||
/* On desktop, bottom nav is gone — drop to standard corner */
|
|
||||||
@media (min-width: 769px) {
|
|
||||||
.feedback-fab {
|
|
||||||
bottom: var(--spacing-lg);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/* ── Overlay ──────────────────────────────────────────────────────────── */
|
|
||||||
.feedback-overlay {
|
|
||||||
position: fixed;
|
|
||||||
inset: 0;
|
|
||||||
background: rgba(0, 0, 0, 0.55);
|
|
||||||
z-index: 1000;
|
|
||||||
display: flex;
|
|
||||||
align-items: flex-end;
|
|
||||||
justify-content: center;
|
|
||||||
padding: 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
@media (min-width: 500px) {
|
|
||||||
.feedback-overlay {
|
|
||||||
align-items: center;
|
|
||||||
padding: var(--spacing-md);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/* ── Modal ────────────────────────────────────────────────────────────── */
|
|
||||||
.feedback-modal {
|
|
||||||
background: var(--color-bg-elevated);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: var(--radius-lg) var(--radius-lg) 0 0;
|
|
||||||
width: 100%;
|
|
||||||
max-height: 90vh;
|
|
||||||
overflow-y: auto;
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
box-shadow: var(--shadow-xl);
|
|
||||||
}
|
|
||||||
|
|
||||||
@media (min-width: 500px) {
|
|
||||||
.feedback-modal {
|
|
||||||
border-radius: var(--radius-lg);
|
|
||||||
width: 100%;
|
|
||||||
max-width: 520px;
|
|
||||||
max-height: 85vh;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
.feedback-header {
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: space-between;
|
|
||||||
padding: var(--spacing-md) var(--spacing-md) var(--spacing-sm);
|
|
||||||
border-bottom: 1px solid var(--color-border);
|
|
||||||
flex-shrink: 0;
|
|
||||||
}
|
|
||||||
.feedback-title {
|
|
||||||
font-family: var(--font-display);
|
|
||||||
font-size: var(--font-size-lg);
|
|
||||||
font-weight: 600;
|
|
||||||
margin: 0;
|
|
||||||
}
|
|
||||||
.feedback-close {
|
|
||||||
background: transparent;
|
|
||||||
border: none;
|
|
||||||
color: var(--color-text-muted);
|
|
||||||
cursor: pointer;
|
|
||||||
padding: 4px;
|
|
||||||
border-radius: var(--radius-sm);
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: center;
|
|
||||||
}
|
|
||||||
.feedback-close:hover { color: var(--color-text-primary); }
|
|
||||||
|
|
||||||
.feedback-body {
|
|
||||||
padding: var(--spacing-md);
|
|
||||||
flex: 1;
|
|
||||||
overflow-y: auto;
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
gap: var(--spacing-md);
|
|
||||||
}
|
|
||||||
|
|
||||||
.feedback-footer {
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: flex-end;
|
|
||||||
gap: var(--spacing-sm);
|
|
||||||
padding: var(--spacing-sm) var(--spacing-md);
|
|
||||||
border-top: 1px solid var(--color-border);
|
|
||||||
flex-shrink: 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
.feedback-textarea {
|
|
||||||
resize: vertical;
|
|
||||||
min-height: 80px;
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
}
|
|
||||||
|
|
||||||
.form-required { color: var(--color-error); margin-left: 2px; }
|
|
||||||
|
|
||||||
.feedback-error {
|
|
||||||
color: var(--color-error);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
margin: 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
.feedback-success {
|
|
||||||
color: var(--color-success);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
padding: var(--spacing-sm) var(--spacing-md);
|
|
||||||
background: var(--color-success-bg);
|
|
||||||
border: 1px solid var(--color-success-border);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
}
|
|
||||||
.feedback-link { color: var(--color-success); font-weight: 600; text-decoration: underline; }
|
|
||||||
|
|
||||||
/* Summary card (step 2) */
|
|
||||||
.feedback-summary {
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
gap: var(--spacing-xs);
|
|
||||||
padding: var(--spacing-sm) var(--spacing-md);
|
|
||||||
background: var(--color-bg-secondary);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
}
|
|
||||||
.feedback-summary-row {
|
|
||||||
display: flex;
|
|
||||||
gap: var(--spacing-md);
|
|
||||||
align-items: flex-start;
|
|
||||||
}
|
|
||||||
.feedback-summary-row > :first-child { min-width: 72px; flex-shrink: 0; }
|
|
||||||
.feedback-summary-desc {
|
|
||||||
white-space: pre-wrap;
|
|
||||||
word-break: break-word;
|
|
||||||
}
|
|
||||||
|
|
||||||
.mt-md { margin-top: var(--spacing-md); }
|
|
||||||
.mt-xs { margin-top: var(--spacing-xs); }
|
|
||||||
|
|
||||||
/* ── Form elements ────────────────────────────────────────────────────── */
|
|
||||||
.form-group {
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
gap: var(--spacing-xs);
|
|
||||||
}
|
|
||||||
|
|
||||||
.form-label {
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
font-weight: 600;
|
|
||||||
color: var(--color-text-muted);
|
|
||||||
text-transform: uppercase;
|
|
||||||
letter-spacing: 0.06em;
|
|
||||||
}
|
|
||||||
|
|
||||||
.form-input {
|
|
||||||
width: 100%;
|
|
||||||
padding: var(--spacing-xs) var(--spacing-sm);
|
|
||||||
background: var(--color-bg-secondary);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
color: var(--color-text-primary);
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
line-height: 1.5;
|
|
||||||
transition: border-color 0.15s;
|
|
||||||
box-sizing: border-box;
|
|
||||||
}
|
|
||||||
.form-input:focus {
|
|
||||||
outline: none;
|
|
||||||
border-color: var(--color-border-focus);
|
|
||||||
}
|
|
||||||
.form-input::placeholder { color: var(--color-text-muted); opacity: 0.7; }
|
|
||||||
|
|
||||||
/* ── Buttons ──────────────────────────────────────────────────────────── */
|
|
||||||
.btn {
|
|
||||||
display: inline-flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: center;
|
|
||||||
gap: var(--spacing-xs);
|
|
||||||
padding: var(--spacing-xs) var(--spacing-md);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
font-weight: 500;
|
|
||||||
cursor: pointer;
|
|
||||||
transition: background 0.15s, color 0.15s, border-color 0.15s;
|
|
||||||
white-space: nowrap;
|
|
||||||
}
|
|
||||||
.btn:disabled { opacity: 0.5; cursor: not-allowed; }
|
|
||||||
|
|
||||||
.btn-primary {
|
|
||||||
background: var(--color-primary);
|
|
||||||
color: #fff;
|
|
||||||
border: 1px solid var(--color-primary);
|
|
||||||
}
|
|
||||||
.btn-primary:hover:not(:disabled) { filter: brightness(1.1); }
|
|
||||||
|
|
||||||
.btn-ghost {
|
|
||||||
background: transparent;
|
|
||||||
color: var(--color-text-secondary);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
}
|
|
||||||
.btn-ghost:hover:not(:disabled) {
|
|
||||||
background: var(--color-bg-secondary);
|
|
||||||
color: var(--color-text-primary);
|
|
||||||
border-color: var(--color-border-focus);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* ── Filter chips ─────────────────────────────────────────────────────── */
|
|
||||||
.filter-chip-row {
|
|
||||||
display: flex;
|
|
||||||
flex-wrap: wrap;
|
|
||||||
gap: var(--spacing-xs);
|
|
||||||
}
|
|
||||||
|
|
||||||
.btn-chip {
|
|
||||||
padding: 5px var(--spacing-sm);
|
|
||||||
background: var(--color-bg-secondary);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: 999px;
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
font-weight: 500;
|
|
||||||
color: var(--color-text-secondary);
|
|
||||||
cursor: pointer;
|
|
||||||
transition: background 0.15s, color 0.15s, border-color 0.15s;
|
|
||||||
}
|
|
||||||
.btn-chip.active,
|
|
||||||
.btn-chip:hover {
|
|
||||||
background: color-mix(in srgb, var(--color-primary) 15%, transparent);
|
|
||||||
border-color: var(--color-primary);
|
|
||||||
color: var(--color-primary);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* ── Card ─────────────────────────────────────────────────────────────── */
|
|
||||||
.card {
|
|
||||||
background: var(--color-bg-card);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
}
|
|
||||||
|
|
||||||
/* ── Text utilities ───────────────────────────────────────────────────── */
|
|
||||||
.text-muted { color: var(--color-text-muted); }
|
|
||||||
.text-sm { font-size: var(--font-size-sm); line-height: 1.5; }
|
|
||||||
.text-xs { font-size: 0.75rem; line-height: 1.5; }
|
|
||||||
.font-semibold { font-weight: 600; }
|
|
||||||
|
|
||||||
/* ── Screenshot attachment ────────────────────────────────────────────── */
|
|
||||||
.form-input-file {
|
|
||||||
display: block;
|
|
||||||
width: 100%;
|
|
||||||
padding: var(--spacing-xs) var(--spacing-sm);
|
|
||||||
background: var(--color-bg-secondary);
|
|
||||||
border: 1px dashed var(--color-border);
|
|
||||||
border-radius: var(--radius-md);
|
|
||||||
color: var(--color-text-secondary);
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: var(--font-size-sm);
|
|
||||||
cursor: pointer;
|
|
||||||
box-sizing: border-box;
|
|
||||||
}
|
|
||||||
.form-input-file:focus { outline: 2px solid var(--color-border-focus); outline-offset: 2px; }
|
|
||||||
|
|
||||||
.screenshot-preview {
|
|
||||||
margin-top: var(--spacing-xs);
|
|
||||||
display: flex;
|
|
||||||
align-items: flex-start;
|
|
||||||
gap: var(--spacing-sm);
|
|
||||||
}
|
|
||||||
.screenshot-preview img {
|
|
||||||
max-width: 160px;
|
|
||||||
max-height: 100px;
|
|
||||||
border-radius: var(--radius-sm);
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
object-fit: cover;
|
|
||||||
}
|
|
||||||
.screenshot-remove {
|
|
||||||
font-size: var(--font-size-xs);
|
|
||||||
color: var(--color-text-muted);
|
|
||||||
background: none;
|
|
||||||
border: none;
|
|
||||||
cursor: pointer;
|
|
||||||
padding: 2px 4px;
|
|
||||||
min-height: 24px;
|
|
||||||
}
|
|
||||||
.screenshot-remove:hover { color: var(--color-error); }
|
|
||||||
|
|
||||||
.btn-link {
|
|
||||||
background: none;
|
|
||||||
border: none;
|
|
||||||
color: var(--color-primary);
|
|
||||||
cursor: pointer;
|
|
||||||
padding: 0;
|
|
||||||
font-family: var(--font-body);
|
|
||||||
font-size: inherit;
|
|
||||||
text-decoration: underline;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Transition */
|
|
||||||
.modal-fade-enter-active, .modal-fade-leave-active { transition: opacity 0.2s ease; }
|
|
||||||
.modal-fade-enter-from, .modal-fade-leave-to { opacity: 0; }
|
|
||||||
</style>
|
|
||||||
|
|
@ -20,35 +20,6 @@
|
||||||
--radius-lg: 16px;
|
--radius-lg: 16px;
|
||||||
--shadow-card: 0 2px 8px rgba(0,0,0,0.4);
|
--shadow-card: 0 2px 8px rgba(0,0,0,0.4);
|
||||||
--transition-fast: 150ms ease;
|
--transition-fast: 150ms ease;
|
||||||
|
|
||||||
/* Spacing scale */
|
|
||||||
--spacing-xs: 0.25rem;
|
|
||||||
--spacing-sm: 0.5rem;
|
|
||||||
--spacing-md: 1rem;
|
|
||||||
--spacing-lg: 1.5rem;
|
|
||||||
|
|
||||||
/* Font scale */
|
|
||||||
--font-body: var(--font-base);
|
|
||||||
--font-display: var(--font-base);
|
|
||||||
--font-size-xs: 0.75rem;
|
|
||||||
--font-size-sm: 0.875rem;
|
|
||||||
--font-size-lg: 1.125rem;
|
|
||||||
|
|
||||||
/* Shadow aliases */
|
|
||||||
--shadow-md: var(--shadow-card);
|
|
||||||
--shadow-lg: var(--shadow-card);
|
|
||||||
--shadow-xl: 0 4px 20px rgba(0,0,0,0.5);
|
|
||||||
|
|
||||||
/* Color aliases for shared component compat */
|
|
||||||
--color-primary: var(--color-accent);
|
|
||||||
--color-text-primary: var(--color-text);
|
|
||||||
--color-text-secondary: var(--color-text-muted);
|
|
||||||
--color-bg-elevated: var(--color-surface);
|
|
||||||
--color-bg-card: var(--color-surface);
|
|
||||||
--color-bg-secondary: var(--color-bg);
|
|
||||||
--color-border-focus: var(--color-accent);
|
|
||||||
--color-success-bg: color-mix(in srgb, var(--color-success) 15%, transparent);
|
|
||||||
--color-success-border: color-mix(in srgb, var(--color-success) 35%, transparent);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
@media (prefers-color-scheme: light) {
|
@media (prefers-color-scheme: light) {
|
||||||
|
|
|
||||||
|
|
@ -4,8 +4,8 @@
|
||||||
<div class="chat-pane">
|
<div class="chat-pane">
|
||||||
<div class="chat-messages" ref="messagesEl">
|
<div class="chat-messages" ref="messagesEl">
|
||||||
<p class="empty-chat" v-if="history.length === 0">
|
<p class="empty-chat" v-if="history.length === 0">
|
||||||
Ask a question across your indexed documents.
|
Ask a question across your indexed rulebooks.
|
||||||
No documents indexed? Go to <RouterLink to="/">Library</RouterLink> first.
|
No rulebooks indexed? Go to <RouterLink to="/">Library</RouterLink> first.
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
<div
|
<div
|
||||||
|
|
@ -25,25 +25,6 @@
|
||||||
:bm25-score="cite.bm25_score ?? undefined"
|
:bm25-score="cite.bm25_score ?? undefined"
|
||||||
/>
|
/>
|
||||||
</div>
|
</div>
|
||||||
<div v-if="msg.role === 'assistant' && chatFeedbackEnabled" class="message-thumbs" :aria-label="`Rate this answer`">
|
|
||||||
<button
|
|
||||||
class="thumb-btn"
|
|
||||||
:class="{ active: msg.rating === 1 }"
|
|
||||||
@click="rate(i, 1)"
|
|
||||||
:disabled="msg.rating != null"
|
|
||||||
title="Helpful"
|
|
||||||
aria-label="Mark as helpful"
|
|
||||||
>👍</button>
|
|
||||||
<button
|
|
||||||
class="thumb-btn"
|
|
||||||
:class="{ active: msg.rating === -1 }"
|
|
||||||
@click="rate(i, -1)"
|
|
||||||
:disabled="msg.rating != null"
|
|
||||||
title="Not helpful"
|
|
||||||
aria-label="Mark as not helpful"
|
|
||||||
>👎</button>
|
|
||||||
<span v-if="msg.rating != null" class="thumb-thanks">Thanks!</span>
|
|
||||||
</div>
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div class="message assistant loading" v-if="thinking">
|
<div class="message assistant loading" v-if="thinking">
|
||||||
|
|
@ -61,7 +42,7 @@
|
||||||
ref="inputEl"
|
ref="inputEl"
|
||||||
v-model="draft"
|
v-model="draft"
|
||||||
class="chat-input"
|
class="chat-input"
|
||||||
placeholder="Ask about your documents…"
|
placeholder="Ask about your rulebooks…"
|
||||||
:disabled="thinking"
|
:disabled="thinking"
|
||||||
aria-label="Chat message"
|
aria-label="Chat message"
|
||||||
autofocus
|
autofocus
|
||||||
|
|
@ -96,7 +77,6 @@ interface ChatMessage {
|
||||||
role: "user" | "assistant"
|
role: "user" | "assistant"
|
||||||
content: string
|
content: string
|
||||||
citations?: Citation[]
|
citations?: Citation[]
|
||||||
rating?: 1 | -1
|
|
||||||
}
|
}
|
||||||
|
|
||||||
const history = ref<ChatMessage[]>([])
|
const history = ref<ChatMessage[]>([])
|
||||||
|
|
@ -108,7 +88,6 @@ const messagesEl = ref<HTMLElement | null>(null)
|
||||||
const inputEl = ref<HTMLInputElement | null>(null)
|
const inputEl = ref<HTMLInputElement | null>(null)
|
||||||
const allDocs = ref<Document[]>([])
|
const allDocs = ref<Document[]>([])
|
||||||
const selectedDocs = ref<string[]>([])
|
const selectedDocs = ref<string[]>([])
|
||||||
const chatFeedbackEnabled = ref(false)
|
|
||||||
|
|
||||||
const readyDocs = computed(() => allDocs.value.filter(d => d.status === "ready"))
|
const readyDocs = computed(() => allDocs.value.filter(d => d.status === "ready"))
|
||||||
const docTitles = computed(() =>
|
const docTitles = computed(() =>
|
||||||
|
|
@ -117,7 +96,6 @@ const docTitles = computed(() =>
|
||||||
|
|
||||||
onMounted(async () => {
|
onMounted(async () => {
|
||||||
allDocs.value = await api.getLibrary().catch(() => [])
|
allDocs.value = await api.getLibrary().catch(() => [])
|
||||||
api.chatFeedbackStatus().then(s => { chatFeedbackEnabled.value = s.enabled }).catch(() => {})
|
|
||||||
inputEl.value?.focus()
|
inputEl.value?.focus()
|
||||||
})
|
})
|
||||||
|
|
||||||
|
|
@ -159,17 +137,6 @@ function scrollBottom() {
|
||||||
messagesEl.value.scrollTop = messagesEl.value.scrollHeight
|
messagesEl.value.scrollTop = messagesEl.value.scrollHeight
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
async function rate(index: number, rating: 1 | -1) {
|
|
||||||
const msg = history.value[index]
|
|
||||||
if (!msg || msg.role !== "assistant" || msg.rating != null) return
|
|
||||||
// Update UI immediately (optimistic)
|
|
||||||
history.value[index] = { ...msg, rating }
|
|
||||||
const question = index > 0 ? (history.value[index - 1]?.content ?? "") : ""
|
|
||||||
await api.submitChatFeedback(rating, question, msg.content, selectedDocs.value).catch(() => {
|
|
||||||
// Non-fatal — rating is cosmetic, ignore network errors
|
|
||||||
})
|
|
||||||
}
|
|
||||||
</script>
|
</script>
|
||||||
|
|
||||||
<style scoped>
|
<style scoped>
|
||||||
|
|
@ -265,27 +232,6 @@ async function rate(index: number, rating: 1 | -1) {
|
||||||
font-size: 0.85rem; margin-bottom: 0.5rem; cursor: pointer; line-height: 1.4;
|
font-size: 0.85rem; margin-bottom: 0.5rem; cursor: pointer; line-height: 1.4;
|
||||||
}
|
}
|
||||||
|
|
||||||
.message-thumbs {
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
gap: 0.35rem;
|
|
||||||
margin-top: 0.4rem;
|
|
||||||
}
|
|
||||||
.thumb-btn {
|
|
||||||
background: transparent;
|
|
||||||
border: 1px solid var(--color-border);
|
|
||||||
border-radius: var(--radius-sm);
|
|
||||||
cursor: pointer;
|
|
||||||
font-size: 0.9rem;
|
|
||||||
padding: 2px 6px;
|
|
||||||
line-height: 1;
|
|
||||||
transition: background var(--transition-fast), border-color var(--transition-fast);
|
|
||||||
}
|
|
||||||
.thumb-btn:hover:not(:disabled) { background: var(--color-surface-alt); border-color: var(--color-accent); }
|
|
||||||
.thumb-btn.active { background: var(--color-surface-alt); border-color: var(--color-accent); }
|
|
||||||
.thumb-btn:disabled { opacity: 0.4; cursor: default; }
|
|
||||||
.thumb-thanks { font-size: 0.75rem; color: var(--color-text-muted); }
|
|
||||||
|
|
||||||
@media (max-width: 640px) {
|
@media (max-width: 640px) {
|
||||||
.chat-layout { flex-direction: column-reverse; }
|
.chat-layout { flex-direction: column-reverse; }
|
||||||
.sidebar { width: 100%; height: auto; max-height: 30vh; border-left: none; border-top: 1px solid var(--color-border); }
|
.sidebar { width: 100%; height: auto; max-height: 30vh; border-left: none; border-top: 1px solid var(--color-border); }
|
||||||
|
|
|
||||||
|
|
@ -2,23 +2,16 @@
|
||||||
<main class="library">
|
<main class="library">
|
||||||
<header class="library-header">
|
<header class="library-header">
|
||||||
<h1>Library</h1>
|
<h1>Library</h1>
|
||||||
<div class="header-actions">
|
<button class="btn-primary" @click="scan" :disabled="scanning">
|
||||||
<button class="btn-secondary" @click="triggerUpload" :disabled="uploading">
|
{{ scanning ? "Scanning..." : "Scan for PDFs" }}
|
||||||
{{ uploading ? "Uploading..." : "Upload PDF / EPUB" }}
|
</button>
|
||||||
</button>
|
|
||||||
<input ref="fileInput" type="file" accept=".pdf,.epub" style="display:none" @change="handleUpload">
|
|
||||||
<button class="btn-primary" @click="scan" :disabled="scanning">
|
|
||||||
{{ scanning ? "Scanning..." : "Scan for PDFs" }}
|
|
||||||
</button>
|
|
||||||
</div>
|
|
||||||
</header>
|
</header>
|
||||||
|
|
||||||
<p class="error-msg" v-if="error">{{ error }}</p>
|
<p class="error-msg" v-if="error">{{ error }}</p>
|
||||||
|
|
||||||
<p class="empty-state" v-if="!loading && docs.length === 0">
|
<p class="empty-state" v-if="!loading && docs.length === 0">
|
||||||
No documents indexed yet.<br>
|
No books indexed yet. Click "Scan for PDFs" to discover PDFs in your books directory.<br>
|
||||||
<strong>Upload a PDF</strong> using the button above, or mount a directory and click
|
Make sure your PDF directory is mounted at <code>/books</code> inside the container.
|
||||||
<strong>Scan for PDFs</strong> to index an entire collection.
|
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
<div class="doc-grid" v-else>
|
<div class="doc-grid" v-else>
|
||||||
|
|
@ -46,10 +39,8 @@ import DocumentCard from "@/components/DocumentCard.vue"
|
||||||
const docs = ref<Document[]>([])
|
const docs = ref<Document[]>([])
|
||||||
const loading = ref(true)
|
const loading = ref(true)
|
||||||
const scanning = ref(false)
|
const scanning = ref(false)
|
||||||
const uploading = ref(false)
|
|
||||||
const error = ref<string | null>(null)
|
const error = ref<string | null>(null)
|
||||||
const scanResult = ref<{ discovered: number; queued: number } | null>(null)
|
const scanResult = ref<{ discovered: number; queued: number } | null>(null)
|
||||||
const fileInput = ref<HTMLInputElement | null>(null)
|
|
||||||
|
|
||||||
async function load() {
|
async function load() {
|
||||||
loading.value = true
|
loading.value = true
|
||||||
|
|
@ -97,45 +88,18 @@ async function remove(id: string) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
function triggerUpload() {
|
|
||||||
fileInput.value?.click()
|
|
||||||
}
|
|
||||||
|
|
||||||
async function handleUpload(event: Event) {
|
|
||||||
const input = event.target as HTMLInputElement
|
|
||||||
const file = input.files?.[0]
|
|
||||||
if (!file) return
|
|
||||||
uploading.value = true
|
|
||||||
error.value = null
|
|
||||||
try {
|
|
||||||
await api.uploadDocument(file)
|
|
||||||
await load()
|
|
||||||
} catch (e) {
|
|
||||||
error.value = e instanceof Error ? e.message : "Upload failed"
|
|
||||||
} finally {
|
|
||||||
uploading.value = false
|
|
||||||
input.value = ""
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
onMounted(load)
|
onMounted(load)
|
||||||
</script>
|
</script>
|
||||||
|
|
||||||
<style scoped>
|
<style scoped>
|
||||||
.library { padding: 1.5rem; max-width: 1200px; margin: 0 auto; }
|
.library { padding: 1.5rem; max-width: 1200px; margin: 0 auto; }
|
||||||
.library-header { display: flex; align-items: center; justify-content: space-between; margin-bottom: 1.5rem; flex-wrap: wrap; gap: 1rem; }
|
.library-header { display: flex; align-items: center; justify-content: space-between; margin-bottom: 1.5rem; flex-wrap: wrap; gap: 1rem; }
|
||||||
.header-actions { display: flex; gap: 0.5rem; flex-wrap: wrap; }
|
|
||||||
h1 { font-size: 1.5rem; }
|
h1 { font-size: 1.5rem; }
|
||||||
.btn-primary {
|
.btn-primary {
|
||||||
background: var(--color-accent); color: #fff; border: none; padding: 0.6rem 1.2rem;
|
background: var(--color-accent); color: #fff; border: none; padding: 0.6rem 1.2rem;
|
||||||
border-radius: var(--radius-sm); cursor: pointer; font-size: 0.95rem;
|
border-radius: var(--radius-sm); cursor: pointer; font-size: 0.95rem;
|
||||||
}
|
}
|
||||||
.btn-primary:disabled { opacity: 0.5; cursor: default; }
|
.btn-primary:disabled { opacity: 0.5; cursor: default; }
|
||||||
.btn-secondary {
|
|
||||||
background: transparent; color: var(--color-accent); border: 1px solid var(--color-accent);
|
|
||||||
padding: 0.6rem 1.2rem; border-radius: var(--radius-sm); cursor: pointer; font-size: 0.95rem;
|
|
||||||
}
|
|
||||||
.btn-secondary:disabled { opacity: 0.5; cursor: default; }
|
|
||||||
.doc-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(280px, 1fr)); gap: 1rem; }
|
.doc-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(280px, 1fr)); gap: 1rem; }
|
||||||
.empty-state { color: var(--color-text-muted); line-height: 1.8; }
|
.empty-state { color: var(--color-text-muted); line-height: 1.8; }
|
||||||
.empty-state code { font-family: var(--font-mono); background: var(--color-surface-alt); padding: 2px 6px; border-radius: 3px; }
|
.empty-state code { font-family: var(--font-mono); background: var(--color-surface-alt); padding: 2px 6px; border-radius: 3px; }
|
||||||
|
|
|
||||||