circuitforge-core/tests/test_documents
pyr0ball bbb146b361 feat(documents): add PDFExtractor text-layer extraction and PageChunk
Adds circuitforge_core/documents/pdf.py with:
- PageChunk frozen dataclass (page_number, text, source, word_count)
- PDFExtractor.chunk_pages() — pdfplumber text-layer per page, OCR fallback via pytesseract for sparse pages
- Module-level graceful ImportError guard on pdfplumber (patchable, follows cf-core optional-extra pattern)
- pdf and pdf-ocr optional extras declared in pyproject.toml

3 tests, all passing.
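
For readers without the diff open, the pieces listed above might be sketched roughly as follows. This is an illustrative reconstruction from the commit message only, not the contents of circuitforge_core/documents/pdf.py: the field types, the `min_words` threshold, the `"text_layer"`/`"ocr"` source labels, the `is_sparse` and `make_chunk` helpers, and the injected `ocr` callable are all assumptions. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Module-level guard: the import failure is deferred until extraction is
# actually requested, and tests can patch the module-level `pdfplumber` name.
try:
    import pdfplumber
except ImportError:  # optional extra not installed
    pdfplumber = None


@dataclass(frozen=True)
class PageChunk:
    page_number: int  # 1-based page index (assumed convention)
    text: str
    source: str       # assumed labels: "text_layer" or "ocr"
    word_count: int


def is_sparse(text: str, min_words: int = 10) -> bool:
    """Hypothetical sparseness check: too few whitespace-separated words."""
    return len(text.split()) < min_words


def make_chunk(page_number: int, text: str, source: str) -> PageChunk:
    """Build a PageChunk, deriving word_count from the text."""
    return PageChunk(page_number, text, source, len(text.split()))


class PDFExtractor:
    def __init__(self, min_words: int = 10,
                 ocr: Optional[Callable[[object], str]] = None):
        # `ocr` is a hypothetical hook taking a pdfplumber page and returning
        # text; a pytesseract-backed function could be passed in here.
        self.min_words = min_words
        self.ocr = ocr

    def chunk_pages(self, path: str) -> List[PageChunk]:
        if pdfplumber is None:
            raise ImportError(
                "pdfplumber is not installed; install the 'pdf' optional extra"
            )
        chunks = []
        with pdfplumber.open(path) as pdf:
            for number, page in enumerate(pdf.pages, start=1):
                text = page.extract_text() or ""
                source = "text_layer"
                # Fall back to OCR only when the text layer looks sparse.
                if is_sparse(text, self.min_words) and self.ocr is not None:
                    text = self.ocr(page)
                    source = "ocr"
                chunks.append(make_chunk(number, text, source))
        return chunks
```

Passing the OCR step in as a callable keeps this sketch importable without pytesseract; the actual module may call pytesseract directly instead.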
2026-05-04 08:33:10 -07:00
__init__.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_client.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_ingest.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_models.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_pdf.py feat(documents): add PDFExtractor text-layer extraction and PageChunk 2026-05-04 08:33:10 -07:00