pagepiper/tests
pyr0ball e52bdb5128 feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI
Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
  after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
  instead of post-filtering an already-small global pool (fixes wrong-book
  results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
  snippet 200 → 400 chars
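
A minimal sketch of the oversample-then-filter idea and the page ± 1 lookup described above. The `Hit` tuple, `filter_by_doc()`, and `adjacent_pages()` are hypothetical names for illustration, not the retriever's actual code; pages are assumed to be numbered from 1.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit; the real retriever's row shape may differ."""
    doc_id: str
    page: int
    score: float


def filter_by_doc(ranked_hits: List[Hit], doc_id: str,
                  top_k: int = 10, oversample: int = 20) -> List[Hit]:
    """Keep the best top_k hits that belong to one document.

    ranked_hits is the global result list, best first. Taking
    top_k * oversample candidates before filtering leaves room for
    hits from the requested document even when other books rank higher.
    """
    pool = ranked_hits[: top_k * oversample]
    return [h for h in pool if h.doc_id == doc_id][:top_k]


def adjacent_pages(page: int) -> List[int]:
    """Neighbouring page numbers (page - 1 and page + 1), assuming 1-based pages."""
    return [p for p in (page - 1, page + 1) if p >= 1]
```

With `oversample=1` the pool is only `top_k` hits wide, which would reproduce the wrong-book symptom whenever another document dominates the global ranking.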

Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
  processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
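
A rough sketch of the kind of cleaning described above. The regular expressions and the `clean_paragraph_sketch()` name are assumptions for illustration; the actual patterns in scripts/text_clean.py may differ, and piracy stamps are not covered here.

```python
import re

# Assumed patterns, not copied from scripts/text_clean.py.
_WATERMARK = re.compile(r"ABC Amber LIT Converter[,\s]*", re.IGNORECASE)
_URL = re.compile(r"(?:https?://)?(?:www\.)?processtext\.com\S*", re.IGNORECASE)
_BARE_PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
_EXTRA_SPACES = re.compile(r"\s{2,}")


def clean_paragraph_sketch(text: str) -> str:
    """Drop converter watermarks, processtext.com URLs, and bare page numbers."""
    kept = []
    for line in text.splitlines():
        # A line holding only a short number is treated as a page number.
        if _BARE_PAGE_NUMBER.match(line):
            continue
        line = _WATERMARK.sub("", line)
        line = _URL.sub("", line)
        line = _EXTRA_SPACES.sub(" ", line).strip()
        if line:  # skip lines that were nothing but artifacts
            kept.append(line)
    return "\n".join(kept)
```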

Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
  deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
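
A simplified sketch of the two ideas above: detecting a dimension mismatch and re-embedding documents one at a time on a single background thread. `needs_rebuild()`, `reembed_sequentially()`, and the `embed_one` callback are hypothetical stand-ins; how the stored dimension is read and how the stale vec DB is deleted is left out.

```python
import threading
from typing import Callable, Iterable, Optional


def needs_rebuild(stored_dim: Optional[int], model_dim: int) -> bool:
    """True when an existing vec DB was built with a different embedding size."""
    return stored_dim is not None and stored_dim != model_dim


def reembed_sequentially(doc_ids: Iterable[str],
                         embed_one: Callable[[str], None]) -> threading.Thread:
    """Re-embed documents one after another on a single daemon thread.

    Only one writer runs at a time, so concurrent re-embed jobs cannot
    contend for the SQLite write lock.
    """
    ids = list(doc_ids)

    def worker() -> None:
        for doc_id in ids:
            embed_one(doc_id)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```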

cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
  allocate() fires and keeps the Ollama model warm between requests

Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
  two-phase progress: indeterminate animation during extraction,
  determinate "Embedding N/M pages" bar once vectors start landing
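
The two-phase display can be sketched as a small pure function, written here in Python for consistency with the backend rather than as the Vue component's code. `progress_view()`, its field names, and the "Extracting text" label are hypothetical; only `vec_count` and the "Embedding N/M pages" wording come from the notes above.

```python
from typing import Any, Dict


def progress_view(vec_count: int, page_count: int) -> Dict[str, Any]:
    """Map a status poll result to what the progress bar might show."""
    # No vectors yet (or page total unknown): still extracting text.
    if vec_count <= 0 or page_count <= 0:
        return {"mode": "indeterminate", "label": "Extracting text", "fraction": None}
    done = min(vec_count, page_count)  # clamp in case counts briefly disagree
    return {
        "mode": "determinate",
        "label": f"Embedding {done}/{page_count} pages",
        "fraction": done / page_count,
    }
```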

Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
2026-05-06 08:25:58 -07:00
__init__.py feat: add database schema and migration runner 2026-05-04 17:10:38 -07:00
conftest.py feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
test_bm25_index.py fix(services): add SQLite error handling and strengthen top_k test 2026-05-04 17:20:26 -07:00
test_chat_api.py feat(api): add retriever, synthesizer, and chat endpoint (BSL — BYOK gate) 2026-05-04 17:47:10 -07:00
test_db_migrate.py fix(config): handle /v1 suffix in PAGEPIPER_OLLAMA_URL; add DATA_DIR mkdir guard 2026-05-04 17:13:50 -07:00
test_ingest.py fix(ingest): batch embedding, connection guard, correct upsert id param, module-level imports in tests 2026-05-04 17:36:18 -07:00
test_library_api.py feat(ingest): add full PDF ingest pipeline (cf-orch task, BYOK embed) 2026-05-04 17:33:02 -07:00
test_search_api.py feat(api): add BM25 search endpoint (MIT, no tier gate) 2026-05-04 17:41:49 -07:00
test_startup.py feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00
test_synthesizer.py fix: T7 quality — SynthesisResult.citations tuple, retriever comments, test assertion 2026-05-04 17:51:22 -07:00
test_text_clean.py feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI 2026-05-06 08:25:58 -07:00