Import benchmark SFT candidates for labeling #14

Closed
opened 2026-04-07 21:31:31 -07:00 by pyr0ball · 0 comments

Summary

The cf-orch benchmark harness (scripts/benchmark.py) flags model/task failures as fine-tuning candidates and writes them to sft_candidates.jsonl alongside each benchmark run. Avocet needs a new import source that reads this file and surfaces records as cards in the existing labeling UI so a human can write corrected responses and build a supervised fine-tuning (SFT) dataset.

Background

When a model scores below the quality threshold on a task it is fast enough to handle, the harness writes a record to sft_candidates.jsonl with:

{
  "id": "<uuid>",
  "source": "cf-orch-benchmark",
  "benchmark_run_id": "<uuid>",
  "timestamp": "<ISO>",
  "status": "needs_review",
  "prompt_messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "model_response": "<bad output>",
  "corrected_response": null,
  "quality_score": 0.3,
  "failure_reason": "pattern_match: 1/3 required patterns matched; missing: [...]",
  "task_id": "code-function",
  "task_type": "code",
  "task_name": "Code: Write a Python function",
  "model_id": "Qwen/Qwen2.5-3B-Instruct",
  "model_name": "Qwen2.5-3B",
  "node_id": "heimdall",
  "gpu_id": 0,
  "tokens_per_sec": 38.4
}
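For illustration, the record could be modeled on the import side as a small dataclass. This is a sketch: the class name `SftCandidate` is made up, and only the field names and types come from the schema above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SftCandidate:
    """One line of sft_candidates.jsonl; fields mirror the schema above."""
    id: str
    source: str
    benchmark_run_id: str
    timestamp: str
    status: str                       # needs_review | approved | discarded | model_rejected
    prompt_messages: List[dict]       # [{"role": ..., "content": ...}, ...]
    model_response: str
    corrected_response: Optional[str]  # null means "needs human input", never absent
    quality_score: float
    failure_reason: str
    task_id: str
    task_type: str
    task_name: str
    model_id: str
    model_name: str
    node_id: str
    gpu_id: int
    tokens_per_sec: float
```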

Required Work

1. Import source

  • Add a new data source type benchmark to the Avocet import pipeline
  • Accept a path to sft_candidates.jsonl (or a directory containing one)
  • Skip records already imported (deduplicate on id)

2. Card type

Each benchmark candidate becomes a card in the labeling UI showing:

  • Task prompt (system + user messages)
  • Model response (the bad output)
  • Failure reason (from failure_reason field — explains what went wrong)
  • Quality score (colour-coded: red < 0.4, orange < 0.7, yellow otherwise)
  • Model name + task type as metadata chips
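The score banding follows directly from the thresholds above; as a sketch (the function name is made up):

```python
def quality_score_colour(score: float) -> str:
    """Map a quality_score to a chip colour: red < 0.4, orange < 0.7, yellow otherwise."""
    if score < 0.4:
        return "red"
    if score < 0.7:
        return "orange"
    return "yellow"
```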

3. Actions per card

Use the existing ASMR bucket-expansion pattern where possible:

  • Write correction: text area to enter the correct response; on submit → sets corrected_response and status: approved
  • Discard: marks status: discarded (model output was not a useful failure signal)
  • Flag model: marks status: model_rejected — used when the model is completely wrong for this task type (routes to a separate analytics bucket, not SFT)
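Each action reduces to a small state transition on the record. A sketch with hypothetical helper names; only the field values come from the list above:

```python
def write_correction(record: dict, corrected: str) -> dict:
    """Approve a candidate: attach the human-written response."""
    record["corrected_response"] = corrected
    record["status"] = "approved"
    return record


def discard(record: dict) -> dict:
    """Model output was not a useful failure signal."""
    record["status"] = "discarded"
    return record


def flag_model(record: dict) -> dict:
    """Model is wrong for this task type; route to analytics, not SFT."""
    record["status"] = "model_rejected"
    return record
```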

4. SFT export

  • Export approved records as a standard JSONL for SFT: each line is {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", "content": "<corrected_response>"}]}
  • Filter: only status: approved + non-null corrected_response
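Putting the filter and the message layout together, the export could look like this (a sketch; the function name is hypothetical, and the output shape is the standard messages format described above):

```python
import json


def export_sft_jsonl(records):
    """Yield one JSONL line per approved candidate with a non-null correction."""
    for rec in records:
        if rec.get("status") != "approved" or rec.get("corrected_response") is None:
            continue
        # prompt_messages already holds the system + user turns in order
        messages = list(rec["prompt_messages"]) + [
            {"role": "assistant", "content": rec["corrected_response"]}
        ]
        yield json.dumps({"messages": messages})
```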

Notes

  • The corrected_response: null field is an explicit contract — presence of null means "needs human input", not missing
  • status field drives routing: needs_review → labeling queue; approved → SFT export; discarded / model_rejected → analytics only
  • Benchmark runs are identified by benchmark_run_id — consider a run-level summary view showing how many candidates each run produced
Related

  • cf-orch benchmark harness: circuitforge-orch/scripts/benchmark.py
  • Source format: sft_candidates.jsonl written per-run to scripts/bench_results/<timestamp>/
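The run-level summary suggested in the Notes is essentially a group-by on benchmark_run_id. A sketch:

```python
from collections import Counter


def candidates_per_run(records):
    """Count how many SFT candidates each benchmark run produced."""
    return Counter(rec["benchmark_run_id"] for rec in records)
```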
Reference: Circuit-Forge/avocet#14