408ab64c55
test(documents): add OCR and ImportError coverage for PDFExtractor
- Add module-level guards for pytesseract and PIL.Image (enables patching in tests)
- Move `import io` from inside _ocr_page to module-level stdlib imports
- Extract _ensure_pil_image() helper with a TypeError guard so the isinstance check does not raise when Image is patched to a MagicMock in tests
- Add 3 new tests: ImportError when pdfplumber is None, sparse-page OCR fallback, and OCR render failure returning an empty chunk
- Coverage: 96% (up from 64%)
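
A minimal sketch of the two ideas in this commit: binding optional imports at module level so tests can patch them, and guarding `isinstance` with `TypeError`. The body of `_ensure_pil_image` is a guess at what such a helper could look like, not the code from pdf.py, and the `image_module` parameter is hypothetical (added only so the demo can inject a mock without `patch`).

```python
import io
from unittest.mock import MagicMock

# Optional dependencies: bind the names at module level so tests can patch them.
try:
    import pytesseract
except ImportError:  # OCR extra not installed
    pytesseract = None

try:
    from PIL import Image
except ImportError:  # Pillow not installed
    Image = None


def _ensure_pil_image(obj, image_module=None):
    """Return obj if it is already a PIL image, otherwise open it from bytes.

    Assumed behavior; image_module is a hypothetical injection point.
    """
    module = Image if image_module is None else image_module
    if module is None:
        raise ImportError("Pillow is required for OCR")
    try:
        if isinstance(obj, module.Image):
            return obj
    except TypeError:
        # module.Image is not a real class (for example a MagicMock attribute),
        # so isinstance() cannot be used; fall through and open the bytes.
        pass
    return module.open(io.BytesIO(obj))


# Demo: a MagicMock stands in for PIL.Image, as it would under unittest.mock.patch.
fake_image_module = MagicMock()
result = _ensure_pil_image(b"raw-bytes", image_module=fake_image_module)
```

Without the `try`/`except TypeError`, the `isinstance` call would raise as soon as `Image` is replaced by a mock, because a `MagicMock` attribute is not a type.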
2026-05-04 08:39:31 -07:00
bbb146b361
feat(documents): add PDFExtractor text-layer extraction and PageChunk
Adds circuitforge_core/documents/pdf.py with:
- PageChunk frozen dataclass (page_number, text, source, word_count)
- PDFExtractor.chunk_pages(): pdfplumber text-layer extraction per page, with OCR fallback via pytesseract for sparse pages
- Module-level graceful ImportError guard on pdfplumber (patchable, follows cf-core optional-extra pattern)
- pdf and pdf-ocr optional extras declared in pyproject.toml
3 tests, all passing.
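
The per-page flow described above can be sketched roughly as follows. The field names come from the commit message; everything else is an assumption made for illustration: the sparsity threshold (`MIN_WORDS`), the `"text"`/`"ocr"` source labels, 1-based page numbers, and a standalone `chunk_pages` function that takes already-extracted page texts instead of a PDF.

```python
from dataclasses import dataclass, FrozenInstanceError

MIN_WORDS = 5  # assumed sparsity threshold; the commit does not state one


@dataclass(frozen=True)
class PageChunk:
    page_number: int
    text: str
    source: str  # "text" or "ocr" in this sketch (assumed labels)
    word_count: int


def chunk_pages(page_texts, ocr_page=None, min_words=MIN_WORDS):
    """Build one PageChunk per page, using OCR when the text layer is sparse.

    page_texts: text-layer strings (or None), one per page.
    ocr_page:   optional callable taking a 1-based page number, returning text.
    """
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        text = (text or "").strip()
        source = "text"
        if len(text.split()) < min_words and ocr_page is not None:
            # Sparse page: replace the text layer with the OCR result.
            text = (ocr_page(number) or "").strip()
            source = "ocr"
        chunks.append(PageChunk(number, text, source, len(text.split())))
    return chunks


chunks = chunk_pages(
    ["one two three four five six", ""],
    ocr_page=lambda page_number: "scanned words here",
)
```

Here the first page keeps its text layer and the second, which is empty, falls back to the OCR callable.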
2026-05-04 08:33:10 -07:00
cd9864b5e8
feat: hardware detection, cf-docuvision service, documents ingestion pipeline
Closes #5, #7, #8, #13
## hardware module (closes #5)
- HardwareSpec, LLMBackendConfig, LLMConfig dataclasses
- VramTier ladder (CPU / 2 / 4 / 6 / 8 / 16 / 24 GB) with select_tier()
- generate_profile() maps HardwareSpec → LLMConfig for llm.yaml generation
- detect_hardware() with nvidia-smi / rocm-smi / system_profiler / cpu fallback
- 31 tests across tiers, generator, and detect
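
As an illustration of the tier ladder, one plausible selection rule picks the highest tier that does not exceed the detected VRAM. This is an assumption: the commit does not describe the rule, and the signature below is not taken from the module.

```python
from bisect import bisect_right

# Tier ladder from the commit message, in GB; 0 stands for the CPU tier.
TIERS_GB = (0, 2, 4, 6, 8, 16, 24)


def select_tier(vram_gb: float) -> int:
    """Return the largest tier (in GB) that fits within vram_gb."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    # bisect_right counts the tiers <= vram_gb; the last of those is the match.
    index = bisect_right(TIERS_GB, vram_gb) - 1
    return TIERS_GB[index]
```

Under this rule a 12 GB card lands on the 8 GB tier, and anything above 24 GB stays on the 24 GB tier.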
## cf-docuvision service (closes #8)
- FastAPI service wrapping ByteDance/Dolphin-v2 (Qwen2.5-VL backbone)
- POST /extract: image_b64 or image_path + hint → ExtractResponse
- Lazy model loading; JSON-structured output with plain-text fallback
- ProcessSpec managed blocks added to all four GPU profiles (6/8/16/24 GB)
- 14 tests
## documents module (closes #7)
- StructuredDocument, Element, ParsedTable dataclasses (frozen, composable)
- DocuvisionClient: thin HTTP client for cf-docuvision POST /extract
- ingest(): primary cf-docuvision path → LLMRouter vision fallback → empty doc
- CF_DOCUVISION_URL env var for URL override
- 22 tests
## coordinator probe loop (closes #13)
- _run_instance_probe_loop: starting → running on 200; starting → stopped on timeout
- 4 async tests with CancelledError-based tick control
2026-04-02 18:53:25 -07:00