feat: cf-orch LLM benchmark integration (Phase 1) #21

Open
pyr0ball wants to merge 0 commits from feat/cforch-benchmark into main

Summary

  • New /api/cforch router: task catalog, model catalog, SSE benchmark runner, results reader, cancel
  • All paths configurable via label_tool.yaml cforch: section — no hardcoded paths
  • BenchmarkView: Classifier / LLM Eval mode toggle; LLM panel has task picker (by type) + model picker (by service: ollama/vllm), SSE run log, results table with best-per-column highlight
  • StatsView: LLM Benchmark table showing quality_by_task_type across models; hidden until first run
  • SFT candidates from cf-orch runs auto-flow into Corrections tab via existing bench_results_dir config
  • 13 new tests; 79 total passing

Config (add to label_tool.yaml)

cforch:
  bench_script: /Library/Development/CircuitForge/circuitforge-orch/scripts/benchmark.py
  bench_models: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_models.yaml
  bench_tasks: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_tasks.yaml
  results_dir: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_results
  coordinator_url: http://localhost:7700
  ollama_url: http://localhost:11434
  python_bin: /devl/miniconda3/envs/cf/bin/python
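A minimal sketch of how the `cforch:` section could be read from `label_tool.yaml`. The key names match the config above; the loader function name and the empty-dict fallback are assumptions, not necessarily what `app/cforch.py` does.

```python
# Hypothetical loader for the cforch: section of label_tool.yaml.
# Requires PyYAML; falls back to an empty dict when the section is absent.
import yaml

def load_cforch_config(path="label_tool.yaml"):
    """Return the cforch: section as a dict ({} if missing)."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    return cfg.get("cforch", {})
```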

Test plan

  • pytest tests/test_cforch.py — 13/13
  • pytest tests/ — 79/79 (excl. GPU integration)
  • Switch to LLM Eval mode on Benchmark tab — task/model pickers render
  • Run with ollama models selected — SSE log streams, results appear
  • Stats tab shows LLM Benchmark table after a run
  • cf-orch sft_candidates.jsonl visible in Corrections tab import
pyr0ball added 1 commit 2026-04-09 10:46:30 -07:00
Backend (app/cforch.py — new APIRouter at /api/cforch):
- GET /tasks — reads bench_tasks.yaml, returns tasks + deduplicated types
- GET /models — reads bench_models.yaml, returns model list with service/tags
- GET /run — SSE endpoint; spawns cf-orch benchmark.py subprocess with
  --filter-tasks, --filter-tags, --coordinator, --ollama-url; strips ANSI
  codes; emits progress/result/complete/error events; 409 guard on concurrency
- GET /results — returns latest bench_results/*/summary.json; 404 if none
- POST /cancel — terminates running benchmark subprocess
- All paths configurable via label_tool.yaml cforch: section
- 13 tests; follows sft.py/models.py testability seam pattern
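Two illustrative helpers for the `/run` SSE endpoint described above: one strips ANSI escape codes from the `benchmark.py` subprocess output before streaming, the other formats a single server-sent event (`progress`/`result`/`complete`/`error`). Function names and the exact regex are assumptions, not necessarily those in `app/cforch.py`.

```python
# Sketch of the ANSI-stripping and SSE-framing steps of GET /run.
import re

# Matches CSI escape sequences such as \x1b[32m (color) and \x1b[2K (erase).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(line: str) -> str:
    """Remove ANSI color/cursor codes before streaming a log line."""
    return ANSI_RE.sub("", line)

def sse_event(event: str, data: str) -> str:
    """Frame one server-sent event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {data}\n\n"
```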

Frontend:
- BenchmarkView: mode toggle (Classifier / LLM Eval); LLM Eval panel with
  task picker (by type, select-all + indeterminate), model picker (by service),
  SSE run log, results table with best-per-column highlighting
- StatsView: LLM Benchmark section showing quality_by_task_type table across
  models; hidden when no results; fetches /api/cforch/results on mount

SFT candidate pipeline: cf-orch runs that produce sft_candidates.jsonl are
auto-discovered by the existing bench_results_dir config in sft.py — no
additional wiring needed.
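The auto-discovery described above could look roughly like the following: scan `bench_results_dir` for run subdirectories containing `sft_candidates.jsonl` and take the newest by mtime. This is a hedged sketch; the function name and selection policy are illustrative, not the actual `sft.py` code.

```python
# Hypothetical discovery of the newest cf-orch SFT candidate file under
# bench_results_dir (e.g. .../scripts/bench_results/<run>/sft_candidates.jsonl).
from pathlib import Path

def latest_sft_candidates(results_dir: str):
    """Return the most recently modified */sft_candidates.jsonl, or None."""
    runs = sorted(
        Path(results_dir).glob("*/sft_candidates.jsonl"),
        key=lambda p: p.stat().st_mtime,
    )
    return runs[-1] if runs else None
```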
This branch is already included in the target branch. There is nothing to merge.
Command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin feat/cforch-benchmark:feat/cforch-benchmark
git checkout feat/cforch-benchmark

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository; you will have to mark this pull request as manually merged afterwards.

Choose one merge style, then push:

# Merge commit
git checkout main
git merge --no-ff feat/cforch-benchmark

# Rebase then fast-forward
git checkout feat/cforch-benchmark
git rebase main
git checkout main
git merge --ff-only feat/cforch-benchmark

# Rebase then merge commit
git checkout feat/cforch-benchmark
git rebase main
git checkout main
git merge --no-ff feat/cforch-benchmark

# Squash
git checkout main
git merge --squash feat/cforch-benchmark

# Fast-forward only
git checkout main
git merge --ff-only feat/cforch-benchmark

# Plain merge
git checkout main
git merge feat/cforch-benchmark

git push origin main
Reference: Circuit-Forge/avocet#21