25 commits

SHA1 Message Date
e52bdb5128 feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI
Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
  after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
  instead of post-filtering an already-small global pool (fixes wrong-book
  results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
  snippet 200 → 400 chars
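
A minimal sketch of the oversample-then-filter fix and the page ± 1 lookup described above. The `Hit` tuple shape, the `knn` callable, and the names `search_in_doc` and `adjacent_pages` are assumptions for illustration, not the repository's actual code:

```python
from typing import Callable, Sequence

# Hypothetical hit shape: (doc_id, page, distance), lower distance is better.
Hit = tuple[str, int, float]


def search_in_doc(
    knn: Callable[[int], Sequence[Hit]],
    doc_id: str,
    top_k: int = 10,
    oversample: int = 20,
) -> list[Hit]:
    """Fetch top_k * oversample global hits, then keep only one document's."""
    candidates = knn(top_k * oversample)
    matching = [hit for hit in candidates if hit[0] == doc_id]
    return matching[:top_k]


def adjacent_pages(page: int, first_page: int = 1) -> list[int]:
    """Pages whose chunks would be fetched alongside a ranked hit."""
    return [p for p in (page - 1, page + 1) if p >= first_page]
```

Filtering a pool of only `top_k` global hits can leave few or no results for the selected document; drawing a larger pool first makes that much less likely.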

Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
  processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
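
The kind of cleaning listed above might look roughly like the sketch below. Only the name `clean_paragraph()` comes from the commit message; its signature and the regular expressions are assumptions, and piracy-stamp handling is omitted:

```python
import re

# Assumed patterns; the real scripts/text_clean.py may match differently.
_WATERMARK = re.compile(r"(?:Generated by\s+)?ABC Amber LIT Converter,?", re.IGNORECASE)
_URL = re.compile(r"(?:https?://)?(?:www\.)?processtext\.com\S*", re.IGNORECASE)
_BARE_PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def clean_paragraph(text: str) -> str:
    """Strip converter watermarks, processtext.com URLs, and bare page numbers."""
    if _BARE_PAGE_NUMBER.match(text):
        return ""  # the whole paragraph is just a page number
    text = _WATERMARK.sub("", text)
    text = _URL.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()
```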

Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
  deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
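
One way the boot-time check and sequential re-embed could fit together, as a sketch only. The `vec_info` key/value table, the function names, and the `embed_one` callback are hypothetical; the actual `_check_vec_schema()` and `_reembed_docs()` may work differently:

```python
import sqlite3
import threading
from pathlib import Path
from typing import Callable, Iterable


def vec_db_is_stale(vec_db: Path, expected_dim: int) -> bool:
    """True when the stored embedding dimension differs from expected_dim."""
    if not vec_db.exists():
        return False
    conn = sqlite3.connect(vec_db)
    try:
        # Hypothetical metadata table holding the embedding dimension.
        row = conn.execute("SELECT value FROM vec_info WHERE key = 'dim'").fetchone()
    except sqlite3.OperationalError:
        return True  # metadata table missing: treat the DB as stale
    finally:
        conn.close()
    return row is None or int(row[0]) != expected_dim


def check_vec_schema(vec_db: Path, expected_dim: int) -> bool:
    """Delete a stale vec DB and report whether a re-embed is needed."""
    if vec_db_is_stale(vec_db, expected_dim):
        vec_db.unlink()
        return True
    return False


def reembed_sequentially(
    doc_ids: Iterable[str], embed_one: Callable[[str], None]
) -> threading.Thread:
    """Re-embed documents one at a time on a single background thread."""
    def worker() -> None:
        for doc_id in doc_ids:
            embed_one(doc_id)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Using one worker thread that walks the documents in order means only one writer touches SQLite at a time, which is the lock race the commit describes avoiding.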

cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
  allocate() fires and keeps the Ollama model warm between requests

Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
  two-phase progress: indeterminate animation during extraction,
  determinate "Embedding N/M pages" bar once vectors start landing

Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview
2026-05-06 08:25:58 -07:00
be7a076f34 fix: use http_host for proxy Host header to preserve port in redirects 2026-05-05 12:04:56 -07:00
4fb3b7d143 fix: switch dev compose to bridge network, configurable API_PORT and WEB_PORT 2026-05-05 11:46:45 -07:00
88e18d9dac docs: add docs/index.md and docs/screenshots for cloud launch 2026-05-05 11:25:23 -07:00
42ae3bc39b fix: use generic 'documents' copy in ChatView instead of 'rulebooks' 2026-05-05 11:21:11 -07:00
2e24808d91 feat(deploy): add cf-orch routing to cloud compose
- CF_ORCH_URL, CF_APP_NAME, COORDINATOR_URL env vars in api service
- COORDINATOR_PAGEPIPER_KEY wired from .env
- extra_hosts: host.docker.internal:host-gateway for container → host Ollama
- .env.cloud.example updated with COORDINATOR_PAGEPIPER_KEY placeholder
2026-05-05 08:08:12 -07:00
c24bd33478 feat(deploy): add cloud deploy config for pagepiper.circuitforge.tech
- compose.cloud.yml: pagepiper-cloud project on port 8533 (avoids
  conflict with Linnet dev on 8521/Magpie on 8531)
- docker/web/nginx.cloud.conf: handles both /pagepiper/* path (primary
  domain, no Caddy strip) and / path (menagerie, Caddy strips prefix)
- docker/web/Dockerfile: NGINX_CONF build arg to select dev vs cloud conf
- .env.cloud.example: cloud env template with BYOK gate vars
- manage.sh: cloud:start|stop|restart|status|logs|build commands

Caddy config updated separately (not in this repo).
DNS record needed: pagepiper.circuitforge.tech → Heimdall edge IP.
2026-05-05 07:12:48 -07:00
6fc8e7faa6 fix: wire bm25_score through Citation so Natural 20 easter egg fires
Citation dataclass gains bm25_score field populated from the retrieved
chunk. chat.py serializes it. api.ts interface updated to include it.
ChatView passes :bm25-score to CitationPanel so the Nat20 threshold
check in onMounted actually has data to evaluate.
2026-05-04 20:01:20 -07:00
6bda1143cc feat(web): add ChatView, CitationPanel, and Natural 20 easter egg 2026-05-04 18:32:20 -07:00
e401cb5f48 fix(web): error handling in LibraryView, taskId watch in IngestProgress, type fixes 2026-05-04 18:02:36 -07:00
b4837163d5 feat(web): add Vue 3 frontend scaffold -- LibraryView, DocumentCard, IngestProgress
Vue 3 + Vite + TypeScript scaffold with theme-aware CSS variables, router,
LibraryView (PDF library grid), DocumentCard (per-doc status + actions),
IngestProgress (polling progress bar), and ChatView stub for Task 9.
2026-05-04 17:57:48 -07:00
17cdb552a3 fix: T7 quality — SynthesisResult.citations tuple, retriever comments, test assertion
- SynthesisResult.citations changed from list[Citation] to tuple[Citation, ...]
  so frozen=True dataclass is genuinely immutable end-to-end
- synthesize() now builds tuple via generator expression
- retriever._combined: add comment explaining L2 distance inversion
- retriever.hybrid_search: comment on _bm25._chunks private access
- test_synthesizer_builds_context_from_chunks: drop vacuous str(call_args)
  fallback; assert directly on call_args.args[0]
2026-05-04 17:51:22 -07:00
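
The tuple change in 17cdb552a3 can be illustrated with a short sketch. The field names and the `build_result` helper are assumptions; only `SynthesisResult.citations`, `Citation`, and `frozen=True` come from the commit message:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Citation:
    doc_id: str  # assumed field
    page: int    # assumed field


@dataclass(frozen=True)
class SynthesisResult:
    answer: str
    # A tuple rather than a list: frozen=True only blocks attribute
    # reassignment, so a list field could still be mutated in place.
    citations: tuple[Citation, ...]


def build_result(answer: str, chunks: list[tuple[str, int]]) -> SynthesisResult:
    """Build the citations tuple from a generator expression."""
    return SynthesisResult(
        answer=answer,
        citations=tuple(Citation(doc_id, page) for doc_id, page in chunks),
    )
```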
0e493ab560 feat(api): add retriever, synthesizer, and chat endpoint (BSL — BYOK gate)
- app/services/retriever.py: hybrid BM25 + semantic Retriever with BM25-only fallback when llm=None
- app/services/synthesizer.py: LLM answer synthesis with citation assembly over retrieved chunks
- app/api/chat.py: POST /api/chat endpoint with 402 gate when PAGEPIPER_OLLAMA_URL is unset
- tests/test_synthesizer.py: 3 TDD unit tests (mocked LLM, context building, system prompt)
- tests/test_chat_api.py: 2 integration tests (402 without Ollama, 200 with mocked retriever+LLM)
2026-05-04 17:47:10 -07:00
eb5c7383ed fix(search): defensive _get_bm25 guard, null-safe text_snippet 2026-05-04 17:43:16 -07:00
6869f32392 feat(api): add BM25 search endpoint (MIT, no tier gate) 2026-05-04 17:41:49 -07:00
c6fa9baf2c fix(ingest): batch embedding, connection guard, correct upsert id param, module-level imports in tests 2026-05-04 17:36:18 -07:00
f4574dd05e feat(ingest): add full PDF ingest pipeline (cf-orch task, BYOK embed) 2026-05-04 17:33:02 -07:00
751faf1679 fix(api): lazy config reads, log ingest exceptions, suppress migrations in tests 2026-05-04 17:28:23 -07:00
4c2370f1de feat(api): add library CRUD endpoints and FastAPI factory
Implements GET/DELETE /api/library, POST /api/library/{id}/reingest,
POST /api/library/scan, and GET /api/library/{id}/status. Adds FastAPI
app factory with lifespan migrations, BM25 singleton wiring, get_db
dependency, ingest task registry with cf-orch/BackgroundTasks fallback,
and placeholder search/chat routers. All 5 new tests pass (14 total).
2026-05-04 17:24:50 -07:00
47914cebeb fix(services): add SQLite error handling and strengthen top_k test 2026-05-04 17:20:26 -07:00
2253cd7da3 feat(services): add BM25 index service (MIT) 2026-05-04 17:17:50 -07:00
abeb6089e5 fix(config): handle /v1 suffix in PAGEPIPER_OLLAMA_URL; add DATA_DIR mkdir guard 2026-05-04 17:13:50 -07:00
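
A possible shape for the `/v1` handling and the mkdir guard in abeb6089e5. The commit does not say whether the suffix is stripped or kept, so stripping is an assumption here, as are both function names:

```python
from pathlib import Path


def ollama_base_url(raw: str) -> str:
    """Normalize an Ollama URL by dropping a trailing slash and /v1 suffix."""
    url = raw.strip().rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url


def ensure_data_dir(path: Path) -> Path:
    """Create the data directory if it does not exist yet."""
    path.mkdir(parents=True, exist_ok=True)
    return path
```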
9797e76931 feat: add database schema and migration runner 2026-05-04 17:10:38 -07:00
3c9598c443 fix(scaffold): split api:8522/web:8521, fix nginx proxy to host.docker.internal 2026-05-04 17:02:41 -07:00
3a0608ff98 chore: initial pagepiper repo scaffold
Adds pyproject.toml, environment.yml, Dockerfile, docker/web (Vue+nginx),
compose.yml, compose.override.yml.example, manage.sh, .env.example,
.gitignore, and config stubs for the pagepiper self-hosted PDF library tool.
Port 8521. No secrets committed.
2026-05-04 16:54:08 -07:00