Import benchmark SFT candidates for labeling #14
Summary
The cf-orch benchmark harness (`scripts/benchmark.py`) flags model/task failures as fine-tuning candidates and writes them to `sft_candidates.jsonl` alongside each benchmark run. Avocet needs a new import source that reads this file and surfaces records as cards in the existing labeling UI so a human can write corrected responses and build a supervised fine-tuning (SFT) dataset.
Background
When a model scores below the threshold on a task it is fast enough to handle, the harness writes a record to `sft_candidates.jsonl` with:
Required Work
1. Import source
- Add a `benchmark` source to the Avocet import pipeline
- Read `sft_candidates.jsonl` (or a directory containing one)
- `id`)
2. Card type
Each benchmark candidate becomes a card in the labeling UI showing:
- `failure_reason` field (explains what went wrong)
3. Actions per card
Use the existing ASMR bucket-expansion pattern where possible:
- Set `corrected_response` and `status: approved`
- Set `status: discarded` (model output was not a useful failure signal)
- Set `status: model_rejected` when the model is completely wrong for this task type (routes to a separate analytics bucket, not SFT)
4. SFT export
- `{"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", "content": "<corrected_response>"}]}`
- Export only records with `status: approved` and a non-null `corrected_response`
Notes
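The export rule in section 4 can be sketched in Python. This is only an illustration, not Avocet's actual exporter: the function names `to_sft_example` and `export_sft_lines` and the record keys `system_prompt` and `prompt` are assumptions, since this issue does not say where the system and user message contents come from. Only `status` and `corrected_response` are taken from the issue itself.

```python
import json
from typing import Iterable, Iterator, Optional


def to_sft_example(record: dict) -> Optional[dict]:
    """Return a chat-format SFT example, or None if the record is not exportable."""
    # Export rule from this issue: status must be "approved" and
    # corrected_response must be non-null.
    if record.get("status") != "approved":
        return None
    corrected = record.get("corrected_response")
    if corrected is None:
        return None
    return {
        "messages": [
            # "system_prompt" and "prompt" are hypothetical key names.
            {"role": "system", "content": record.get("system_prompt", "")},
            {"role": "user", "content": record.get("prompt", "")},
            {"role": "assistant", "content": corrected},
        ]
    }


def export_sft_lines(records: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON line per exportable record, skipping all others."""
    for record in records:
        example = to_sft_example(record)
        if example is not None:
            yield json.dumps(example, ensure_ascii=False)
```

Keeping the filter in one small function means the approval check and the null check live in a single place, so a record that is approved but still has a null `corrected_response` cannot reach the dataset.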
- The `corrected_response: null` field is an explicit contract: the presence of null means "needs human input", not missing
- The `status` field drives routing: `needs_review` → labeling queue; `approved` → SFT export; `discarded`/`model_rejected` → analytics only
- `benchmark_run_id`: consider a run-level summary view showing how many candidates each run produced
Related
- `circuitforge-orch/scripts/benchmark.py`
- `sft_candidates.jsonl` written per-run to `scripts/bench_results/<timestamp>/`
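For illustration, the import source and the status routing described in this issue might be sketched as below. Everything except the file name `sft_candidates.jsonl`, the four status values, and the null contract on `corrected_response` is an assumption: the function names, the destination labels `labeling_queue`, `sft_export`, and `analytics`, and the choice to look for the file directly inside a given directory are all hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator

CANDIDATES_FILENAME = "sft_candidates.jsonl"

# Routing taken from the Notes section; the destination labels are hypothetical.
STATUS_ROUTES = {
    "needs_review": "labeling_queue",
    "approved": "sft_export",
    "discarded": "analytics",
    "model_rejected": "analytics",
}


def resolve_candidates_path(source: Path) -> Path:
    """Accept either the JSONL file itself or a directory containing one."""
    return source / CANDIDATES_FILENAME if source.is_dir() else source


def read_candidates(source: Path) -> Iterator[dict]:
    """Yield one dict per non-blank line of the candidates file."""
    path = resolve_candidates_path(source)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def route(record: dict) -> str:
    """Map a record's status to its destination; unknown statuses raise KeyError."""
    return STATUS_ROUTES[record["status"]]


def needs_human_input(record: dict) -> bool:
    """True only when corrected_response is present and explicitly null."""
    return "corrected_response" in record and record["corrected_response"] is None
```

Checking for the key's presence separately from its value keeps the "explicit null" contract distinct from a record where the field is absent altogether.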