pagepiper

Author	SHA1	Message	Date
pyr0ball	3765fbc0f9	fix: quote-first prompt structure + escape phrase post-processing to kill hallucinations. Three-layer approach to stop 7B model from supplementing retrieved context with training-data knowledge: (1) system prompt redesigned: 'no memory of books/stories/authors' eliminates the model's self-permission to draw on parametric knowledge; (2) quote-first prompt structure: model must commit to a specific quoted passage before generating an answer, with an explicit NOT FOUND required when excerpts lack the answer, preventing the 'excerpt doesn't say X... however in the series...' escape pattern; (3) _strip_escape() post-processor: catches any residual leakage by scanning for known escape phrases ('in the series', 'by terry goodkind', 'it can be assumed', etc.) and replacing the response with the canned no-answer message	2026-05-06 10:30:11 -07:00
pyr0ball	32cb21e2cd	fix: reinforce no-hallucination constraint in user-turn prompt; cap per-doc retrieval. synthesizer: repeat the no-outside-knowledge rule inside the user message turn, since small models (7B) follow user-turn instructions more reliably than system-prompt alone when parametric memory competes with the retrieved context. retriever: cap each document to max(2, top_k//3) slots in the ranked list so one book cannot flood all result slots on character-name BM25 matches; this forces coverage across more documents when the answer may be in any of them	2026-05-06 10:26:51 -07:00
pyr0ball	347b391c6e	fix: prevent LLM hallucination when retrieval returns low-signal results - Strengthen synthesizer system prompt: hard 'respond with exactly' constraint instead of soft 'say so'; removes any wiggle room for the model to supplement from training data - Add early return in synthesize() when chunks is empty (belt-and-suspenders alongside the existing guard in chat.py) - Add MIN_SIGNAL threshold (0.01) in retriever: if the top combined score is below the threshold, return empty so the caller's no-results path fires instead of sending noise chunks to the LLM	2026-05-06 10:17:51 -07:00
pyr0ball	e52bdb5128	feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI Retrieval: - Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB after ranking so mid-sentence EPUB chunk boundaries don't lose context - Fix vec DB doc-filter: oversample to top_k*20 before Python filter instead of post-filtering an already-small global pool (fixes wrong-book results when searching within a single document) - top_k default 5 → 10; context per chunk 500 → 1500 chars; citation snippet 200 → 400 chars Artifact cleaning: - Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks, processtext.com URLs, bare page numbers, piracy stamps from extracted text - Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py Startup validation: - _check_vec_schema() at boot: detects embedding dimension mismatch, deletes stale vec DB, and queues sequential re-embed in background thread - Sequential _reembed_docs() prevents SQLite lock races on startup re-embed cf-orch integration: - Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so allocate() fires and keeps the Ollama model warm between requests Ingestion progress UI: - GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta - DocumentCard.vue polls status every 3 s while processing and shows two-phase progress: indeterminate animation during extraction, determinate "Embedding N/M pages" bar once vectors start landing Other: - Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue) - EPUB ingest script (ingest_epub.py) with heading-based chunking - migration 002: chat_feedback table - README.md with setup and feature overview	2026-05-06 08:25:58 -07:00
pyr0ball	6fc8e7faa6	fix: wire bm25_score through Citation so Natural 20 easter egg fires. Citation dataclass gains bm25_score field populated from the retrieved chunk. chat.py serializes it. api.ts interface updated to include it. ChatView passes :bm25-score to CitationPanel so the Nat20 threshold check in onMounted actually has data to evaluate.	2026-05-04 20:01:20 -07:00
pyr0ball	17cdb552a3	fix: T7 quality — SynthesisResult.citations tuple, retriever comments, test assertion - SynthesisResult.citations changed from list[Citation] to tuple[Citation, ...] so frozen=True dataclass is genuinely immutable end-to-end - synthesize() now builds tuple via generator expression - retriever._combined: add comment explaining L2 distance inversion - retriever.hybrid_search: comment on _bm25._chunks private access - test_synthesizer_builds_context_from_chunks: drop vacuous str(call_args) fallback; assert directly on call_args.args[0]	2026-05-04 17:51:22 -07:00
pyr0ball	0e493ab560	feat(api): add retriever, synthesizer, and chat endpoint (BSL — BYOK gate) - app/services/retriever.py: hybrid BM25 + semantic Retriever with BM25-only fallback when llm=None - app/services/synthesizer.py: LLM answer synthesis with citation assembly over retrieved chunks - app/api/chat.py: POST /api/chat endpoint with 402 gate when PAGEPIPER_OLLAMA_URL is unset - tests/test_synthesizer.py: 3 TDD unit tests (mocked LLM, context building, system prompt) - tests/test_chat_api.py: 2 integration tests (402 without Ollama, 200 with mocked retriever+LLM)	2026-05-04 17:47:10 -07:00
pyr0ball	47914cebeb	fix(services): add SQLite error handling and strengthen top_k test	2026-05-04 17:20:26 -07:00
pyr0ball	2253cd7da3	feat(services): add BM25 index service (MIT)	2026-05-04 17:17:50 -07:00
pyr0ball	9797e76931	feat: add database schema and migration runner	2026-05-04 17:10:38 -07:00
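
The two retriever guards described in commits 347b391c6e and 32cb21e2cd (the MIN_SIGNAL gate and the per-document cap of max(2, top_k//3)) might look roughly like the sketch below. Only the 0.01 threshold and the cap formula come from the commit messages; the `Chunk` shape, the function name `cap_and_gate`, and the assumption that a higher combined score is better are hypothetical and not taken from the repository.

```python
from collections import defaultdict
from typing import NamedTuple

MIN_SIGNAL = 0.01  # threshold value quoted in commit 347b391c6e


class Chunk(NamedTuple):
    # Hypothetical chunk shape; the real retriever's type is not shown in the log.
    doc_id: str
    page: int
    score: float  # combined score, assumed "higher is better"


def cap_and_gate(ranked: list[Chunk], top_k: int = 10) -> list[Chunk]:
    """Apply the low-signal gate, then the per-document cap, to scored chunks."""
    ranked = sorted(ranked, key=lambda c: c.score, reverse=True)

    # Low-signal gate: return nothing so the caller's no-results path fires
    # instead of sending noise chunks to the LLM.
    if not ranked or ranked[0].score < MIN_SIGNAL:
        return []

    # Per-document cap: no single document may take more than this many slots.
    per_doc_cap = max(2, top_k // 3)
    taken: dict[str, int] = defaultdict(int)
    kept: list[Chunk] = []
    for chunk in ranked:
        if taken[chunk.doc_id] >= per_doc_cap:
            continue  # this document already holds its share of slots
        taken[chunk.doc_id] += 1
        kept.append(chunk)
        if len(kept) == top_k:
            break
    return kept
```

With top_k=10 the cap works out to 3, so a book that matches a character name on every page can occupy at most three of the ten result slots.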

10 commits
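
The escape-phrase post-processor from commit 3765fbc0f9 can be illustrated with a minimal sketch. The three phrases are the ones quoted in the commit message (the message ends its list with "etc.", so the real list is presumably longer); the function name `strip_escape_sketch`, the `NO_ANSWER` wording, and the case-insensitive substring match are assumptions for illustration, not the repository's actual `_strip_escape()`.

```python
# Phrases quoted in the commit message; the real list is likely longer.
ESCAPE_PHRASES = (
    "in the series",
    "by terry goodkind",
    "it can be assumed",
)

# Hypothetical wording; the commit only calls this "the canned no-answer message".
NO_ANSWER = "I could not find that in the provided excerpts."


def strip_escape_sketch(response: str) -> str:
    """Replace the whole response if it contains a known escape phrase."""
    lowered = response.lower()
    if any(phrase in lowered for phrase in ESCAPE_PHRASES):
        # The model drifted from the excerpts into training-data knowledge,
        # so discard the answer rather than try to trim it.
        return NO_ANSWER
    return response
```

Replacing the entire response, rather than cutting out only the offending sentence, matches the commit's description of "replacing the response with the canned no-answer message". The cost is that a legitimate answer which happens to contain one of the phrases is also discarded.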