feat: recipe scan labeling task type for Kiwi training pipeline #65

Closed
opened 2026-05-17 08:58:57 -07:00 by pyr0ball · 0 comments
Owner

Background

Kiwi is building a recipe scan training dataset using Purple Carrot recipes as ground truth. The dataset will have multiple input modalities:

  • Flatbed/Aura scanner captures
  • Phone captures
  • Handwritten recipe cards (future)

Each capture is paired with a ground truth structured recipe JSON sourced from the Purple Carrot web corpus.

What Avocet Needs

A new task domain: recipe_scan labeling.

Input per item

  • image_path — path to scan in /Library/Assets/
  • modality — scanner | phone | handwritten
  • source — e.g. purple_carrot
  • extracted — the JSON produced by docuvision + LLM structuring (the model output to review)
  • ground_truth — the canonical structured recipe JSON from the web corpus
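
A minimal sketch of how one input item might be represented and validated, assuming Python. The class name `RecipeScanItem` and the function `parse_item` are hypothetical and not existing Avocet code; only the field names and the three modality values come from the list above.

```python
from dataclasses import dataclass
from typing import Any

# Modality values listed in this issue.
MODALITIES = {"scanner", "phone", "handwritten"}


@dataclass(frozen=True)
class RecipeScanItem:
    """One item to label (hypothetical container, not an existing Avocet type)."""

    image_path: str
    modality: str
    source: str
    extracted: dict[str, Any]      # model output to review
    ground_truth: dict[str, Any]   # canonical recipe JSON from the web corpus


def parse_item(raw: dict[str, Any]) -> RecipeScanItem:
    """Build an item from a raw dict, rejecting unknown modalities."""
    if raw["modality"] not in MODALITIES:
        raise ValueError(f"unknown modality: {raw['modality']!r}")
    return RecipeScanItem(
        image_path=raw["image_path"],
        modality=raw["modality"],
        source=raw["source"],
        extracted=raw["extracted"],
        ground_truth=raw["ground_truth"],
    )
```

Checking the modality at load time would keep a mistyped value from reaching the labeling queue; whether Avocet should reject or merely flag such items is left open here.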

Label action

  • Reviewer sees the image + extracted JSON side-by-side with the ground truth
  • The reviewer can approve (the extraction matches the ground truth well enough), edit (correct specific fields), or reject (the extraction is too broken to salvage)
  • Approved/edited output is saved as the training target
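
The three review actions could be resolved into a training target roughly as follows. This is a sketch under stated assumptions: the function name `resolve_target` is hypothetical, and it assumes an edit replaces top-level fields of the extracted JSON, which this issue does not specify.

```python
from typing import Any, Optional


def resolve_target(
    action: str,
    extracted: dict[str, Any],
    edits: Optional[dict[str, Any]] = None,
) -> Optional[dict[str, Any]]:
    """Return the JSON to save as the training target, or None if rejected."""
    if action == "approve":
        # Approved: the extraction is kept unchanged (copied to avoid aliasing).
        return dict(extracted)
    if action == "edit":
        # Assumption: reviewer corrections overwrite top-level fields only.
        return {**extracted, **(edits or {})}
    if action == "reject":
        # Rejected items produce no training target.
        return None
    raise ValueError(f"unknown action: {action!r}")
```

For example, `resolve_target("edit", {"title": "Tofu Scrambel"}, {"title": "Tofu Scramble"})` would return the extraction with only the title corrected.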

Output format (training pair)

{
  "id": "<uuid>",
  "modality": "phone",
  "source": "purple_carrot",
  "image_path": "/Library/Assets/kiwi/scans/...",
  "messages": [
    {"role": "user", "content": "<ocr_extraction_prompt + OCR text>"},
    {"role": "assistant", "content": "<ground_truth_json>"}
  ]
}

This reuses the existing messages chat format so the fine-tune harness works without changes.
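
As an illustration, a training pair in this shape might be assembled like so. The function `build_training_pair` and its parameters are hypothetical, and joining the prompt and OCR text with a blank line is an assumption; the issue only says the user message contains both.

```python
import json
import uuid
from typing import Any


def build_training_pair(
    image_path: str,
    modality: str,
    source: str,
    prompt: str,
    ocr_text: str,
    target: dict[str, Any],
) -> dict[str, Any]:
    """Assemble one training pair in the messages chat format."""
    return {
        "id": str(uuid.uuid4()),
        "modality": modality,
        "source": source,
        "image_path": image_path,
        "messages": [
            # Assumption: prompt and OCR text are separated by a blank line.
            {"role": "user", "content": f"{prompt}\n\n{ocr_text}"},
            # The target recipe is serialized to a JSON string.
            {"role": "assistant", "content": json.dumps(target, ensure_ascii=False)},
        ],
    }
```

If pairs are stored one per line, as the `.jsonl` extension of the reference file suggests, each result can be written with `json.dumps(pair)` followed by a newline.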

Blocking

Kiwi recipe scan corpus build (Purple Carrot scraper + scan pipeline) can proceed independently. Avocet labeling UI is needed before the fine-tuning phase.

References

  • kiwi/app/services/recipe/recipe_scanner.py — extraction pipeline
  • kiwi/scripts/pipeline/ — corpus build scripts
  • /Library/Assets/kiwi/pipeline/ — existing recipe parquets
  • Avocet data/plan_pairs.jsonl — reference format for training pairs
Reference: Circuit-Forge/avocet#65