circuitforge-core

Circuit-Forge/circuitforge-core


Commit graph


Author

SHA1

Message

Date

pyr0ball

bbb146b361

feat(documents): add PDFExtractor text-layer extraction and PageChunk

Adds circuitforge_core/documents/pdf.py with:
- PageChunk frozen dataclass (page_number, text, source, word_count)
- PDFExtractor.chunk_pages() — pdfplumber text-layer per page, OCR fallback via pytesseract for sparse pages
- Module-level graceful ImportError guard on pdfplumber (patchable, follows cf-core optional-extra pattern)
- pdf and pdf-ocr optional extras declared in pyproject.toml

3 tests, all passing.
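
As a rough illustration of the pieces listed above, here is a minimal sketch of a frozen `PageChunk`, an optional-import guard, and a text-layer pass. Only the field names, `chunk_pages`, and the use of pdfplumber come from the commit message. The `"text-layer"` value for `source`, the `MIN_WORDS` threshold, and the helpers `make_chunk` and `needs_ocr` are assumptions made for illustration, and the pytesseract fallback is left out.

```python
from dataclasses import dataclass, FrozenInstanceError

# Optional dependency guard: the module still imports when pdfplumber is absent.
try:
    import pdfplumber  # type: ignore
except ImportError:
    pdfplumber = None


@dataclass(frozen=True)
class PageChunk:
    page_number: int
    text: str
    source: str       # assumed meaning: how the text was obtained
    word_count: int


MIN_WORDS = 20  # hypothetical threshold below which a page counts as sparse


def make_chunk(page_number: int, text: str, source: str) -> PageChunk:
    """Build a chunk and derive word_count from whitespace-split text."""
    return PageChunk(page_number, text, source, len(text.split()))


def needs_ocr(chunk: PageChunk, min_words: int = MIN_WORDS) -> bool:
    """Flag a page whose text layer is too sparse to trust (assumed rule)."""
    return chunk.word_count < min_words


def chunk_pages(path: str) -> list[PageChunk]:
    """Extract one chunk per page from the PDF text layer (no OCR here)."""
    if pdfplumber is None:
        raise ImportError("pdfplumber is required; install the pdf extra")
    chunks = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            chunks.append(make_chunk(number, text, "text-layer"))
    return chunks
```

Binding `pdfplumber` to a module-level name, even when it is `None`, is what makes the guard patchable in tests: a test can replace that name with a stub without the real package installed.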

2026-05-04 08:33:10 -07:00

pyr0ball

cd9864b5e8

feat: hardware detection, cf-docuvision service, documents ingestion pipeline

Closes #5, #7, #8, #13

## hardware module (closes #5)
- HardwareSpec, LLMBackendConfig, LLMConfig dataclasses
- VramTier ladder (CPU / 2 / 4 / 6 / 8 / 16 / 24 GB) with select_tier()
- generate_profile() maps HardwareSpec → LLMConfig for llm.yaml generation
- detect_hardware() with nvidia-smi / rocm-smi / system_profiler / cpu fallback
- 31 tests across tiers, generator, and detect
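
The tier ladder above can be sketched as a lookup over sorted floors. The tier values are taken from the commit message; the selection rule (pick the largest tier that does not exceed the detected VRAM) and the integer return type are assumptions, since the actual `select_tier()` signature is not shown here.

```python
from bisect import bisect_right

# Tier floors in GB from the ladder in the commit message; 0 stands for CPU.
TIER_FLOORS_GB = (0, 2, 4, 6, 8, 16, 24)


def select_tier(vram_gb: float) -> int:
    """Return the largest tier floor that does not exceed vram_gb (assumed rule)."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    # bisect_right counts the floors <= vram_gb; the last of those is the tier.
    index = bisect_right(TIER_FLOORS_GB, vram_gb) - 1
    return TIER_FLOORS_GB[index]
```

Under this rule a 12 GB card lands on the 8 GB tier and anything above 24 GB stays on the top tier.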

## cf-docuvision service (closes #8)
- FastAPI service wrapping ByteDance/Dolphin-v2 (Qwen2.5-VL backbone)
- POST /extract: image_b64 or image_path + hint → ExtractResponse
- Lazy model loading; JSON-structured output with plain-text fallback
- ProcessSpec managed blocks added to all four GPU profiles (6/8/16/24 GB)
- 14 tests
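
Two of the behaviours above, lazy model loading and JSON output with a plain-text fallback, can be sketched with the standard library alone. The names `get_model`, `parse_model_output`, and the returned dictionary keys are hypothetical; the real service's `ExtractResponse` shape is not given in the commit message.

```python
import json
from typing import Any, Callable

_model: Any = None  # loaded on first use, not at import time


def get_model(loader: Callable[[], Any]) -> Any:
    """Call loader() once and reuse the result on later calls."""
    global _model
    if _model is None:
        _model = loader()
    return _model


def parse_model_output(raw: str) -> dict[str, Any]:
    """Prefer a JSON object; otherwise keep the raw text as the fallback."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        return {"structured": True, "data": parsed, "raw_text": raw}
    # Not JSON, or JSON that is not an object: fall back to plain text.
    return {"structured": False, "data": {}, "raw_text": raw}
```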

## documents module (closes #7)
- StructuredDocument, Element, ParsedTable dataclasses (frozen, composable)
- DocuvisionClient: thin HTTP client for cf-docuvision POST /extract
- ingest(): primary cf-docuvision path → LLMRouter vision fallback → empty doc
- CF_DOCUVISION_URL env var for URL override
- 22 tests
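
The fallback chain described for `ingest()` might look roughly like the sketch below. The extractors are injected as plain callables so the example stays self-contained. The `StructuredDocument` fields, the `DEFAULT_URL` value, and the `service_url` helper are hypothetical; only the ordering (primary, then fallback, then an empty document) and the `CF_DOCUVISION_URL` variable come from the commit message.

```python
import os
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_URL = "http://localhost:8000"  # hypothetical default, not from the source


@dataclass(frozen=True)
class StructuredDocument:
    elements: tuple = ()     # hypothetical field
    source: str = "empty"    # hypothetical field naming the path that produced it


Extractor = Callable[[bytes], Optional[StructuredDocument]]


def service_url() -> str:
    """Resolve the service URL, letting CF_DOCUVISION_URL override the default."""
    return os.environ.get("CF_DOCUVISION_URL", DEFAULT_URL)


def ingest(image: bytes, primary: Extractor, fallback: Extractor) -> StructuredDocument:
    """Try primary, then fallback; return an empty document if both fail."""
    for extractor in (primary, fallback):
        try:
            doc = extractor(image)
        except Exception:
            continue  # treat any failure as "try the next path"
        if doc is not None:
            return doc
    return StructuredDocument()
```

Returning an empty document rather than raising keeps callers on a single code path, at the cost of hiding which stage failed unless it is logged separately.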

## coordinator probe loop (closes #13)
- _run_instance_probe_loop: starting → running on 200; starting → stopped on timeout
- 4 async tests with CancelledError-based tick control
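
The state transitions above can be sketched for a single instance as follows. This is a simplified stand-in, not the coordinator's `_run_instance_probe_loop`: the function name, the injected `probe` coroutine, and the `interval_s` and `timeout_s` parameters are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def run_probe_loop(
    probe: Callable[[], Awaitable[int]],
    interval_s: float,
    timeout_s: float,
) -> str:
    """Poll probe() until it returns 200 or the timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    state = "starting"
    while state == "starting":
        try:
            status = await probe()
        except Exception:
            status = None  # a failed probe counts as "not ready yet"
        if status == 200:
            state = "running"
        elif loop.time() >= deadline:
            state = "stopped"
        else:
            await asyncio.sleep(interval_s)
    return state
```

Because `asyncio.CancelledError` derives from `BaseException`, the `except Exception` clause does not swallow cancellation, so a test can still stop the loop by cancelling the task.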

2026-04-02 18:53:25 -07:00

2 commits