# documents

Document ingestion pipeline. Converts PDF, DOCX, ODT, and images into a normalized `StructuredDocument` for downstream processing.

```python
from circuitforge_core.documents import ingest, StructuredDocument
```

## Supported formats

| Format | Method | Notes |
|--------|--------|-------|
| PDF | `pdfplumber` | Two-column detection via gutter analysis |
| DOCX | `python-docx` | Paragraph and table extraction |
| ODT | stdlib `zipfile` + `ElementTree` | No external deps required |
| PNG/JPG | cf-docuvision fast-path, local fallback | OCR via vision router |

## `ingest(path: str | Path) -> StructuredDocument`

Main entry point. Detects the format from the file extension and routes to the appropriate parser.

```python
doc = ingest("/tmp/invoice.pdf")
print(doc.text)      # full extracted text
print(doc.pages)     # list of per-page content
print(doc.metadata)  # title, author, creation date if available
```

## StructuredDocument

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StructuredDocument:
    text: str                 # full plain text
    pages: list[str]          # per-page text (PDFs)
    sections: dict[str, str]  # named sections if detected
    metadata: dict[str, Any]  # format-specific metadata
    source_path: str
    format: str               # "pdf" | "docx" | "odt" | "image"
```

## PDF specifics

Two-column PDFs (common in resumes and academic papers) are handled by `_find_column_split()`, which detects the gutter from word x-positions and extracts the left and right columns separately before merging them.

CID glyph references (`(cid:NNN)`) from ATS-reembedded fonts are stripped automatically. Common bullet CIDs (127, 149, 183) are mapped to `•`.

## OCR path

Image inputs go through the vision router (see the [vision module](vision.md)). In practice this means:

1. cf-docuvision fast-path (if available on the cf-orch coordinator)
2. Local moondream2 fallback

OCR results are treated as unstructured text; no section detection is attempted.

## ATS gotcha

Some ATS-exported PDFs embed fonts in ways that cause `pdfplumber` to extract garbled text.
If `doc.text` looks corrupted (common with Oracle Taleo exports), try the image fallback:

```python
doc = ingest(path, force_ocr=True)
```
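
If you want to automate that retry, one option is a rough readability check on `doc.text` before deciding to re-ingest. The sketch below is an illustration only: `looks_garbled`, `ingest_with_ocr_fallback`, and the 0.6 threshold are hypothetical names and values, not part of `circuitforge_core`. It assumes only what is shown above, namely that `ingest` accepts `force_ocr=True` and returns an object with a `text` attribute.

```python
from typing import Any, Callable


def looks_garbled(text: str, min_readable_ratio: float = 0.6) -> bool:
    """Rough heuristic (hypothetical): flag text that is mostly symbols or control bytes.

    Counts characters that are letters, digits, or whitespace and compares
    their share of the stripped text against `min_readable_ratio`.
    """
    stripped = text.strip()
    if not stripped:
        return True  # nothing usable was extracted, so OCR is worth a try
    readable = sum(1 for ch in stripped if ch.isalnum() or ch.isspace())
    return readable / len(stripped) < min_readable_ratio


def ingest_with_ocr_fallback(path: str, ingest_fn: Callable[..., Any]) -> Any:
    """Call `ingest_fn(path)` and retry with force_ocr=True if the text looks garbled.

    `ingest_fn` is expected to behave like `circuitforge_core.documents.ingest`;
    it is passed in so the wrapper makes no assumptions about import paths.
    """
    doc = ingest_fn(path)
    if looks_garbled(doc.text):
        doc = ingest_fn(path, force_ocr=True)
    return doc
```

With the real entry point this would be called as `ingest_with_ocr_fallback(path, ingest)`. The threshold is a guess and will misfire on legitimately symbol-heavy documents (tables of figures, code listings), so treat it as a starting point to tune against your own exports.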