documents module: StructuredDocument interface + ingest() client (Dolphin-v2 runtime in cf-orch) #7

Closed
opened 2026-04-01 22:01:33 -07:00 by pyr0ball · 0 comments
Owner

Scope (interface layer only)

This issue covers the client-side abstraction in circuitforge-core. The Dolphin-v2 inference service runtime lives in circuitforge-orch (see Circuit-Forge/circuitforge-orch#TBD).

Products import from circuitforge_core.documents — they never talk to cf-orch directly.

Module

circuitforge_core/documents/
    __init__.py
    models.py      # StructuredDocument, Element, ParsedTable dataclasses
    ingest.py      # ingest(image_bytes, hint) -> StructuredDocument
    fallback.py    # LLM-only parsing when Dolphin-v2 service is unavailable

API

from circuitforge_core.documents import ingest

doc = ingest(image_bytes: bytes, hint: str = "auto") -> StructuredDocument

StructuredDocument contains:

  • elements: list[Element] — typed, ordered (heading, paragraph, list, table, figure, formula, code…)
  • raw_text: str — full extracted text
  • tables: list[ParsedTable] — HTML tables
  • metadata: dict — page dimensions, source type, confidence

Routing

ingest() delegates to LLMRouter.complete() with images=[...]. The router handles backend selection:

  • If a vision_service backend is configured and reachable (cf-orch Dolphin-v2 agent) → use it
  • Fallback: openai_compat or anthropic backend with supports_images=true
  • Final fallback: fallback.py LLM-only parsing (lower fidelity, no layout awareness)

No direct cf-orch dependency — routing goes through LLMRouter as normal.

Consumers

  • kiwi — recipe card / receipt scanning
  • falcon — government form field identification
  • peregrine — resume image parsing
  • godwit — identity document bundle parsing

Acceptance criteria

  • StructuredDocument + Element dataclasses covering all 21 Dolphin-v2 element types
  • ingest() routes through LLMRouter transparently
  • fallback.py handles graceful degradation when no vision backend available
  • 80%+ test coverage (mock vision_service backend in tests)
  • No direct cf-orch import
## Scope (interface layer only) This issue covers the **client-side abstraction** in `circuitforge-core`. The Dolphin-v2 inference service runtime lives in `circuitforge-orch` (see Circuit-Forge/circuitforge-orch#TBD). Products import from `circuitforge_core.documents` — they never talk to cf-orch directly. ## Module ``` circuitforge_core/documents/ __init__.py models.py # StructuredDocument, Element, ParsedTable dataclasses ingest.py # ingest(image_bytes, hint) -> StructuredDocument fallback.py # LLM-only parsing when Dolphin-v2 service is unavailable ``` ## API ```python from circuitforge_core.documents import ingest doc = ingest(image_bytes: bytes, hint: str = "auto") -> StructuredDocument ``` `StructuredDocument` contains: - `elements: list[Element]` — typed, ordered (heading, paragraph, list, table, figure, formula, code…) - `raw_text: str` — full extracted text - `tables: list[ParsedTable]` — HTML tables - `metadata: dict` — page dimensions, source type, confidence ## Routing `ingest()` delegates to `LLMRouter.complete()` with `images=[...]`. The router handles backend selection: - If a `vision_service` backend is configured and reachable (cf-orch Dolphin-v2 agent) → use it - Fallback: `openai_compat` or `anthropic` backend with `supports_images=true` - Final fallback: `fallback.py` LLM-only parsing (lower fidelity, no layout awareness) No direct cf-orch dependency — routing goes through `LLMRouter` as normal. ## Consumers - `kiwi` — recipe card / receipt scanning - `falcon` — government form field identification - `peregrine` — resume image parsing - `godwit` — identity document bundle parsing ## Acceptance criteria - [ ] `StructuredDocument` + `Element` dataclasses covering all 21 Dolphin-v2 element types - [ ] `ingest()` routes through `LLMRouter` transparently - [ ] `fallback.py` handles graceful degradation when no vision backend available - [ ] 80%+ test coverage (mock vision_service backend in tests) - [ ] No direct cf-orch import
pyr0ball added this to the v0.8.0 — Pipeline + Hardware + Documents modules milestone 2026-04-06 08:25:23 -07:00
pyr0ball changed title from documents module: shared Dolphin-v2 ingestion pipeline (images → StructuredDocument) to documents module: StructuredDocument interface + ingest() client (Dolphin-v2 runtime in cf-orch) 2026-04-06 08:30:59 -07:00
Sign in to join this conversation.
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Circuit-Forge/circuitforge-core#7
No description provided.