Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
instead of post-filtering an already-small global pool (fixes wrong-book
results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
snippet 200 → 400 chars
Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
allocate() fires and keeps the Ollama model warm between requests
Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
two-phase progress: indeterminate animation during extraction,
determinate "Embedding N/M pages" bar once vectors start landing
Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
- CF_ORCH_URL, CF_APP_NAME, COORDINATOR_URL env vars in api service
- COORDINATOR_PAGEPIPER_KEY wired from .env
- extra_hosts: host.docker.internal:host-gateway for container → host Ollama
- .env.cloud.example updated with COORDINATOR_PAGEPIPER_KEY placeholder
- compose.cloud.yml: pagepiper-cloud project on port 8533 (avoids
conflict with Linnet dev on 8521/Magpie on 8531)
- docker/web/nginx.cloud.conf: handles both /pagepiper/* path (primary
domain, no Caddy strip) and / path (menagerie, Caddy strips prefix)
- docker/web/Dockerfile: NGINX_CONF build arg to select dev vs cloud conf
- .env.cloud.example: cloud env template with BYOK gate vars
- manage.sh: cloud:start|stop|restart|status|logs|build commands
Caddy config updated separately (not in this repo).
DNS record needed: pagepiper.circuitforge.tech → Heimdall edge IP.