Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
  after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
  instead of post-filtering an already-small global pool (fixes wrong-book
  results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
  snippet 200 → 400 chars
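The two retrieval changes above could look roughly like the sketch below. It is not the retriever's actual code: the `pages` table, its columns, the `search_fn` callable and the hit dictionaries are all assumptions made for illustration. Only the oversample factor of 20, the new `top_k` default of 10 and the page ± 1 window come from the notes above.

```python
import sqlite3


def search_within_doc(search_fn, query_vec, doc_id, top_k=10, oversample=20):
    """Oversample the global vector search, then filter to one document.

    search_fn(query_vec, k) is a hypothetical callable returning ranked
    hits as dicts with a "doc_id" key.
    """
    # Ask for a much larger global pool so hits from the target book
    # are still present after filtering.
    pool = search_fn(query_vec, top_k * oversample)
    # Filter in Python, preserving rank order, and cap at top_k.
    return [hit for hit in pool if hit["doc_id"] == doc_id][:top_k]


def fetch_adjacent(conn, doc_id, page):
    """Return the ranked page plus its neighbours (page - 1 .. page + 1).

    Assumes a hypothetical table pages(doc_id, page, text).
    """
    return conn.execute(
        "SELECT page, text FROM pages "
        "WHERE doc_id = ? AND page BETWEEN ? AND ? ORDER BY page",
        (doc_id, page - 1, page + 1),
    ).fetchall()
```

Filtering a pool of `top_k` global hits can leave zero results for the selected book; drawing the pool at `top_k * 20` first makes that far less likely, at the cost of a larger nearest-neighbour query.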
Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
  processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
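A minimal sketch of what a cleaner like `clean_paragraph()` might do is shown here. The regular expressions are guesses, not the rules in scripts/text_clean.py, and dropping a whole line whenever it contains a watermark marker is an assumption of this sketch. Piracy stamps are not modelled because their wording is not given.

```python
import re

# Hypothetical patterns; the real rules in scripts/text_clean.py are not shown.
_WATERMARK_LINE = re.compile(r"ABC Amber LIT Converter|processtext\.com", re.IGNORECASE)
_BARE_PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def clean_paragraph(text: str) -> str:
    """Drop watermark lines and bare page numbers, then normalise whitespace."""
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines inside the paragraph
        if _WATERMARK_LINE.search(line) or _BARE_PAGE_NUMBER.match(line):
            continue  # converter watermark, its URL, or a lone page number
        kept.append(" ".join(line.split()))
    return "\n".join(kept)
```

A line such as "In 1984 he left." is kept, because the page-number pattern only matches lines made up entirely of digits.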
Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
  deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
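The startup check described above can be sketched as follows. The function signature, the `reembed_one` callback and the way the stored dimension is obtained are assumptions; the real `_check_vec_schema()` and `_reembed_docs()` may be organised differently.

```python
import threading
from pathlib import Path


def check_vec_schema(stored_dim, model_dim, vec_db_path, doc_ids, reembed_one):
    """Queue a rebuild if the stored embedding dimension is stale.

    Returns the worker thread when a re-embed was queued, otherwise None.
    reembed_one(doc_id) is a hypothetical callback that embeds one document.
    """
    if stored_dim is None or stored_dim == model_dim:
        return None  # nothing stored yet, or schema already matches

    # Vectors of the old dimension are unusable, so drop the whole file.
    Path(vec_db_path).unlink(missing_ok=True)

    def reembed_docs():
        # One document at a time: a single writer avoids SQLite lock races.
        for doc_id in doc_ids:
            reembed_one(doc_id)

    worker = threading.Thread(target=reembed_docs, daemon=True)
    worker.start()
    return worker
```

Running the re-embed in one background thread keeps boot fast while still serialising writes, which is the property the sequential loop is meant to preserve.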
cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
  allocate() fires and keeps the Ollama model warm between requests
Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
  two-phase progress: indeterminate animation during extraction,
  determinate "Embedding N/M pages" bar once vectors start landing
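The two-phase display logic can be illustrated with a small sketch. It is written in Python for consistency with the other examples, although the component itself is Vue. Apart from `vec_count`, every field and value here (`status`, `page_count`, the phase labels) is an assumption about the status payload.

```python
def ingest_progress(status: str, vec_count: int, page_count: int) -> dict:
    """Map a polled status payload to what the progress bar should show."""
    if status != "processing":
        # Not ingesting, so the card would stop polling.
        return {"phase": "idle", "percent": None, "label": ""}
    if vec_count <= 0 or page_count <= 0:
        # Extraction is still running: no vectors yet, so show an
        # indeterminate animation rather than a percentage.
        return {"phase": "extracting", "percent": None, "label": "Extracting"}
    done = min(vec_count, page_count)
    return {
        "phase": "embedding",
        "percent": round(100 * done / page_count),
        "label": f"Embedding {done}/{page_count} pages",
    }
```

For example, `ingest_progress("processing", 5, 20)` gives a determinate bar at 25 percent labelled "Embedding 5/20 pages".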
Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
53 lines
1.8 KiB
YAML
# Pagepiper — cloud managed instance
# Project: pagepiper-cloud (docker compose -f compose.cloud.yml -p pagepiper-cloud ...)
# Web: http://127.0.0.1:8533 → pagepiper.circuitforge.tech (primary)
#                            → menagerie.circuitforge.tech/pagepiper (secondary)
# API: internal only on pagepiper-cloud-net (nginx proxies /api/ → api:8522)

services:
  api:
    build:
      context: ..
      dockerfile: pagepiper/Dockerfile
    restart: unless-stopped
    env_file: .env
    environment:
      CLOUD_MODE: "true"
      PAGEPIPER_DATA_DIR: /devl/pagepiper-cloud-data
      PAGEPIPER_BOOKS_DIR: /devl/pagepiper-cloud-data/books
      # PAGEPIPER_OLLAMA_URL — set in .env (BYOK gate for hybrid search + RAG)
      # HEIMDALL_URL, HEIMDALL_ADMIN_TOKEN — set in .env for license validation
      # cf-orch: route LLM inference through coordinator for managed GPU access
      CF_ORCH_URL: http://host.docker.internal:7700
      CF_APP_NAME: pagepiper
      # CF_LICENSE_KEY is the auth token CFOrchClient sends to the coordinator
      CF_LICENSE_KEY: ${COORDINATOR_PAGEPIPER_KEY:-}
      COORDINATOR_URL: http://10.1.10.71:7700
      COORDINATOR_PAGEPIPER_KEY: ${COORDINATOR_PAGEPIPER_KEY:-}
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - /devl/pagepiper-cloud-data:/devl/pagepiper-cloud-data
      - ${HOME}/.config/circuitforge:/root/.config/circuitforge:ro
    networks:
      - pagepiper-cloud-net

  web:
    build:
      context: .
      dockerfile: docker/web/Dockerfile
      args:
        VITE_BASE_URL: /pagepiper
        VITE_API_BASE: /pagepiper
        NGINX_CONF: docker/web/nginx.cloud.conf
    restart: unless-stopped
    ports:
      - "8533:80"
    networks:
      - pagepiper-cloud-net
    depends_on:
      - api

networks:
  pagepiper-cloud-net:
    driver: bridge