pagepiper

Circuit-Forge/pagepiper

Commit graph

Author

SHA1

Message

Date

pyr0ball

8eef52a054

feat: per-user database isolation for cloud instances (closes #4)

Implements Option A from the issue design: each cloud user gets their own
data directory (DATA_DIR/users/{user_id}/) with separate pagepiper.db,
pagepiper_vecs.db, uploads/, and books/. Local mode is unchanged.

Key changes:
- app/startup.py: extract apply_migrations, reembed_docs,
  check_and_rebuild_vec_schema out of main.py (no circular imports)
- app/config.py: add LOCAL_USER_ID constant and user_data_dir() helper
- app/cloud_session.py: extract resolve_authenticated_user(); require_paid_tier
  now returns user_id (str) instead of None
- app/deps.py: add UserCtx dataclass (db_path, vec_db_path, data_dir,
  watch_dir, bm25) + get_user_ctx dependency; per-user startup guard runs
  migrations + vec schema check once per process per user
- app/main.py: _bm25 singleton -> _bm25_map dict keyed by user_id;
  add _get_bm25_for(); lifespan only runs startup checks in local mode
- app/api/library.py, search.py, chat.py: thread UserCtx through all
  endpoints; remove module-level _mark_bm25_dirty injection pattern
- tests/conftest.py: override get_user_ctx in addition to get_db so all
  endpoints get a consistent test UserCtx

2026-05-13 16:31:51 -07:00

pyr0ball

e52bdb5128

feat: RAG retrieval quality, artifact cleaning, and ingestion progress UI

Retrieval:
- Add _fetch_adjacent() to retriever: fetches page ± 1 chunks from DB
  after ranking so mid-sentence EPUB chunk boundaries don't lose context
- Fix vec DB doc-filter: oversample to top_k*20 before Python filter
  instead of post-filtering an already-small global pool (fixes wrong-book
  results when searching within a single document)
- top_k default 5 → 10; context per chunk 500 → 1500 chars; citation
  snippet 200 → 400 chars
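
The doc-filter fix above can be illustrated with a minimal sketch. It assumes the vector search returns a globally ranked list of (doc_id, page, score) tuples; the function name and tuple shape are hypothetical, and only the top_k*20 oversampling factor comes from the commit message.

```python
def filter_by_doc(ranked, doc_id, top_k=10, oversample=20):
    """Keep the best top_k hits for one document.

    ranked: list of (doc_id, page, score) tuples, best first, across
    all documents. Taking top_k * oversample candidates before the
    filter leaves enough hits from the requested document to fill
    top_k, where filtering only the global top_k could return none.
    """
    pool = ranked[: top_k * oversample]
    hits = [row for row in pool if row[0] == doc_id]
    return hits[:top_k]


# Book "a" dominates the global ranking; book "b" only appears later.
ranked = [("a", page, 1.0 - page * 0.01) for page in range(30)]
ranked += [("b", page, 0.5) for page in range(5)]

naive = [row for row in ranked[:10] if row[0] == "b"]  # empty
fixed = filter_by_doc(ranked, "b")                     # five hits from "b"
```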

Artifact cleaning:
- Add scripts/text_clean.py: strips ABC Amber LIT Converter watermarks,
  processtext.com URLs, bare page numbers, piracy stamps from extracted text
- Wire clean_paragraph() into ingest_pdf.py and new ingest_epub.py
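
A minimal sketch of this kind of artifact cleaning follows. The regular expressions are assumptions written for illustration and are not taken from scripts/text_clean.py; only the clean_paragraph name and the categories of artifact come from the commit message.

```python
import re

# Hypothetical patterns; the real script may match differently.
_WATERMARK_RE = re.compile(r"ABC Amber LIT Converter", re.IGNORECASE)
_URL_RE = re.compile(r"(?:https?://)?(?:www\.)?processtext\.com\S*", re.IGNORECASE)
_PAGE_NUM_RE = re.compile(r"^\s*\d{1,4}\s*$")


def clean_paragraph(text: str) -> str:
    """Drop watermark lines and bare page numbers, strip converter URLs."""
    kept = []
    for line in text.splitlines():
        if _PAGE_NUM_RE.match(line) or _WATERMARK_RE.search(line):
            continue  # whole line is an artifact
        line = _URL_RE.sub("", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "Chapter One\n"
    "42\n"
    "ABC Amber LIT Converter http://www.processtext.com/abclit.html\n"
    "It was a dark night. http://www.processtext.com/x\n"
)
cleaned = clean_paragraph(sample)
```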

Startup validation:
- _check_vec_schema() at boot: detects embedding dimension mismatch,
  deletes stale vec DB, and queues sequential re-embed in background thread
- Sequential _reembed_docs() prevents SQLite lock races on startup re-embed
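
The startup validation above could be sketched as follows. This assumes the embedding dimension is recorded in a vec_meta key/value table, which is a hypothetical schema; the commit does not say how the dimension is stored. The function names are illustrative and are not the project's _check_vec_schema() or _reembed_docs().

```python
import sqlite3
import threading
from pathlib import Path


def stored_dim(vec_db: Path):
    """Read the recorded embedding dimension, or None if unavailable."""
    if not vec_db.exists():
        return None
    conn = sqlite3.connect(vec_db)
    try:
        row = conn.execute(
            "SELECT value FROM vec_meta WHERE key = 'dim'"  # assumed schema
        ).fetchone()
    except sqlite3.OperationalError:
        row = None  # table missing: treat as unknown
    finally:
        conn.close()
    return int(row[0]) if row else None


def check_vec_schema(vec_db: Path, expected_dim: int) -> bool:
    """Delete a stale vec DB; return True if a re-embed is needed."""
    dim = stored_dim(vec_db)
    if dim is None or dim == expected_dim:
        return False
    vec_db.unlink()
    return True


def queue_reembed(doc_ids, embed_one) -> threading.Thread:
    """Re-embed documents one at a time in a single background thread."""
    def worker():
        for doc_id in doc_ids:
            embed_one(doc_id)  # sequential: one writer at a time

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Running all re-embeds from one worker thread means only one writer touches the SQLite file at a time, which is the property the commit relies on to avoid lock races.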

cf-orch integration:
- Wire CF_ORCH_URL / CF_LICENSE_KEY into LLMRouter backend config so
  allocate() fires and keeps the Ollama model warm between requests

Ingestion progress UI:
- GET /api/library/{doc_id}/status now returns vec_count from page_vecs_meta
- DocumentCard.vue polls status every 3 s while processing and shows
  two-phase progress: indeterminate animation during extraction,
  determinate "Embedding N/M pages" bar once vectors start landing
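
The two-phase progress logic can be expressed as a small function, shown here in Python for consistency with the backend even though the actual component is Vue. The function name, the returned keys, and the page_count parameter are hypothetical; only vec_count and the "Embedding N/M pages" label come from the commit message.

```python
def ingest_progress(vec_count: int, page_count: int) -> dict:
    """Map status counters to a progress-bar state.

    No vectors yet means text extraction is still running, so the bar
    is indeterminate. Once vectors land, progress is vec_count out of
    page_count.
    """
    if vec_count <= 0 or page_count <= 0:
        return {"phase": "extracting", "determinate": False}
    done = min(vec_count, page_count)  # clamp in case counts drift
    return {
        "phase": "embedding",
        "determinate": True,
        "fraction": done / page_count,
        "label": f"Embedding {done}/{page_count} pages",
    }
```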

Other:
- Chat feedback endpoint + thumbs up/down UI (FeedbackButton.vue)
- EPUB ingest script (ingest_epub.py) with heading-based chunking
- migration 002: chat_feedback table
- README.md with setup and feature overview

2026-05-06 08:25:58 -07:00

pyr0ball

4c2370f1de

feat(api): add library CRUD endpoints and FastAPI factory

Implements GET/DELETE /api/library, POST /api/library/{id}/reingest,
POST /api/library/scan, and GET /api/library/{id}/status. Adds FastAPI
app factory with lifespan migrations, BM25 singleton wiring, get_db
dependency, ingest task registry with cf-orch/BackgroundTasks fallback,
and placeholder search/chat routers. All 5 new tests pass (14 total).

2026-05-04 17:24:50 -07:00

3 commits