Circuit-Forge/circuitforge-core

Fork 0

pyr0ball 383897f990

CI / test (push) Waiting to run

Details

Mirror / mirror (push) Waiting to run

Details

Release — PyPI / release (push) Waiting to run

Details

feat: platforms module + docs + scripts

- platforms/: eBay platform adapter (snipe integration layer)
- docs/: developer guide, module reference, getting-started docs
- scripts/: utility scripts for development and deployment

2026-04-24 15:23:16 -07:00

2.2 KiB

Raw Blame History

documents

Document ingestion pipeline. Converts PDF, DOCX, ODT, and images into a normalized StructuredDocument for downstream processing.

from circuitforge_core.documents import ingest, StructuredDocument

Supported formats

Format	Method	Notes
PDF	`pdfplumber`	Two-column detection via gutter analysis
DOCX	`python-docx`	Paragraph and table extraction
ODT	stdlib `zipfile` + `ElementTree`	No external deps required
PNG/JPG	cf-docuvision fast-path, local fallback	OCR via vision router

`ingest(path: str | Path) -> StructuredDocument`

Main entry point. Detects format by file extension and routes to the appropriate parser.

doc = ingest("/tmp/invoice.pdf")
print(doc.text)       # full extracted text
print(doc.pages)      # list of per-page content
print(doc.metadata)   # title, author, creation date if available

StructuredDocument

@dataclass
class StructuredDocument:
    text: str                        # full plain text
    pages: list[str]                 # per-page text (PDFs)
    sections: dict[str, str]         # named sections if detected
    metadata: dict[str, Any]         # format-specific metadata
    source_path: str
    format: str                      # "pdf" | "docx" | "odt" | "image"

PDF specifics

Two-column PDFs (common in resumes and academic papers) are handled by _find_column_split(), which detects the gutter via word x-positions and extracts left and right columns separately before merging.

CID glyph references ((cid:NNN)) from ATS-reembedded fonts are stripped automatically. Common bullet CIDs (127, 149, 183) are mapped to •.

OCR path

Image inputs go through the vision router (see the vision module). In practice this means:

cf-docuvision fast-path (if available on the cf-orch coordinator)
Local moondream2 fallback

OCR results are treated as unstructured text — no section detection is attempted.

ATS gotcha

Some ATS-exported PDFs embed fonts in ways that cause pdfplumber to extract garbled text. If doc.text looks corrupted (common with Oracle Taleo exports), try the image fallback:

doc = ingest(path, force_ocr=True)

2.2 KiB Raw Blame History

documents

Supported formats

ingest(path: str | Path) -> StructuredDocument

StructuredDocument

PDF specifics

OCR path

ATS gotcha

2.2 KiB

Raw Blame History

`ingest(path: str | Path) -> StructuredDocument`