three-layer approach to stop 7B model from supplementing retrieved context
with training-data knowledge:
1. system prompt redesigned: 'no memory of books/stories/authors' eliminates
the model's self-permission to draw on parametric knowledge
2. quote-first prompt structure: model must commit to a specific quoted passage
before generating an answer — explicit NOT FOUND required when excerpts lack
the answer, preventing the 'excerpt doesn't say X... however in the series...'
escape pattern
3. _strip_escape() post-processor: catches any residual leakage by scanning for
known escape phrases ('in the series', 'by terry goodkind', 'it can be assumed',
etc.) and replacing the response with the canned no-answer message
synthesizer: repeat the no-outside-knowledge rule inside the user message turn —
small models (7B) follow user-turn instructions more reliably than system-prompt
alone when parametric memory competes with the retrieved context
retriever: cap each document to max(2, top_k//3) slots in the ranked list so
one book cannot flood all result slots on character-name BM25 matches — forces
coverage across more documents when the answer may be in any of them
- Strengthen synthesizer system prompt: hard 'respond with exactly' constraint
instead of soft 'say so'; removes any wiggle room for the model to supplement
from training data
- Add early return in synthesize() when chunks is empty (belt-and-suspenders
alongside the existing guard in chat.py)
- Add MIN_SIGNAL threshold (0.01) in retriever: if the top combined score is
below the threshold, return empty so the caller's no-results path fires instead
of sending noise chunks to the LLM
Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
instead of post-filtering an already-small global pool (fixes wrong-book
results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
snippet 200 → 400 chars
Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
allocate() fires and keeps the Ollama model warm between requests
Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
two-phase progress: indeterminate animation during extraction,
determinate "Embedding N/M pages" bar once vectors start landing
Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
Citation dataclass gains bm25_score field populated from the retrieved
chunk. chat.py serializes it. api.ts interface updated to include it.
ChatView passes :bm25-score to CitationPanel so the Nat20 threshold
check in onMounted actually has data to evaluate.
- app/services/retriever.py: hybrid BM25 + semantic Retriever with BM25-only fallback when llm=None
- app/services/synthesizer.py: LLM answer synthesis with citation assembly over retrieved chunks
- app/api/chat.py: POST /api/chat endpoint with 402 gate when PAGEPIPER_OLLAMA_URL is unset
- tests/test_synthesizer.py: 3 TDD unit tests (mocked LLM, context building, system prompt)
- tests/test_chat_api.py: 2 integration tests (402 without Ollama, 200 with mocked retriever+LLM)