feat: cf-orch LLM benchmark integration (Phase 1) #21

Open
pyr0ball wants to merge 0 commits from feat/cforch-benchmark into main

Summary

  • New /api/cforch router: task catalog, model catalog, SSE benchmark runner, results reader, cancel
  • All paths configurable via label_tool.yaml cforch: section — no hardcoded paths
  • BenchmarkView: Classifier / LLM Eval mode toggle; LLM panel has task picker (by type) + model picker (by service: ollama/vllm), SSE run log, results table with best-per-column highlight
  • StatsView: LLM Benchmark table showing quality_by_task_type across models; hidden until first run
  • SFT candidates from cf-orch runs auto-flow into Corrections tab via existing bench_results_dir config
  • 13 new tests; 79 total passing

Config (add to label_tool.yaml)

cforch:
  bench_script: /Library/Development/CircuitForge/circuitforge-orch/scripts/benchmark.py
  bench_models: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_models.yaml
  bench_tasks: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_tasks.yaml
  results_dir: /Library/Development/CircuitForge/circuitforge-orch/scripts/bench_results
  coordinator_url: http://localhost:7700
  ollama_url: http://localhost:11434
  python_bin: /devl/miniconda3/envs/cf/bin/python
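A minimal sketch of how the `cforch:` section could be read from `label_tool.yaml`. The key names match the config above; the loader function name and the empty-dict fallback are assumptions, not necessarily what `app/cforch.py` does.

```python
# Hypothetical loader for the cforch: section of label_tool.yaml.
# Requires PyYAML; falls back to an empty dict when the section is absent.
import yaml

def load_cforch_config(path="label_tool.yaml"):
    """Return the cforch: section as a dict ({} if missing)."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    return cfg.get("cforch", {})
```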

Test plan

  • pytest tests/test_cforch.py — 13/13
  • pytest tests/ — 79/79 (excl. GPU integration)
  • Switch to LLM Eval mode on Benchmark tab — task/model pickers render
  • Run with ollama models selected — SSE log streams, results appear
  • Stats tab shows LLM Benchmark table after a run
  • cf-orch sft_candidates.jsonl visible in Corrections tab import
pyr0ball added 1 commit 2026-04-09 10:46:30 -07:00
Backend (app/cforch.py — new APIRouter at /api/cforch):
- GET /tasks — reads bench_tasks.yaml, returns tasks + deduplicated types
- GET /models — reads bench_models.yaml, returns model list with service/tags
- GET /run — SSE endpoint; spawns cf-orch benchmark.py subprocess with
  --filter-tasks, --filter-tags, --coordinator, --ollama-url; strips ANSI
  codes; emits progress/result/complete/error events; 409 guard on concurrency
- GET /results — returns latest bench_results/*/summary.json; 404 if none
- POST /cancel — terminates running benchmark subprocess
- All paths configurable via label_tool.yaml cforch: section
- 13 tests; follows sft.py/models.py testability seam pattern
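Two illustrative helpers for the `/run` SSE endpoint described above: one strips ANSI escape codes from the `benchmark.py` subprocess output before streaming, the other formats a single server-sent event (`progress`/`result`/`complete`/`error`). Function names and the exact regex are assumptions, not necessarily those in `app/cforch.py`.

```python
# Sketch of the ANSI-stripping and SSE-framing steps of GET /run.
import re

# Matches CSI escape sequences such as \x1b[32m (color) and \x1b[2K (erase).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(line: str) -> str:
    """Remove ANSI color/cursor codes before streaming a log line."""
    return ANSI_RE.sub("", line)

def sse_event(event: str, data: str) -> str:
    """Frame one server-sent event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {data}\n\n"
```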

Frontend:
- BenchmarkView: mode toggle (Classifier / LLM Eval); LLM Eval panel with
  task picker (by type, select-all + indeterminate), model picker (by service),
  SSE run log, results table with best-per-column highlighting
- StatsView: LLM Benchmark section showing quality_by_task_type table across
  models; hidden when no results; fetches /api/cforch/results on mount

SFT candidate pipeline: cf-orch runs that produce sft_candidates.jsonl are
auto-discovered by the existing bench_results_dir config in sft.py — no
additional wiring needed.
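The auto-discovery described above could look roughly like the following: scan `bench_results_dir` for run subdirectories containing `sft_candidates.jsonl` and take the newest by mtime. This is a hedged sketch; the function name and selection policy are illustrative, not the actual `sft.py` code.

```python
# Hypothetical discovery of the newest cf-orch SFT candidate file under
# bench_results_dir (e.g. .../scripts/bench_results/<run>/sft_candidates.jsonl).
from pathlib import Path

def latest_sft_candidates(results_dir: str):
    """Return the most recently modified */sft_candidates.jsonl, or None."""
    runs = sorted(
        Path(results_dir).glob("*/sft_candidates.jsonl"),
        key=lambda p: p.stat().st_mtime,
    )
    return runs[-1] if runs else None
```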
This branch is already included in the target branch. There is nothing to merge.
Command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin feat/cforch-benchmark:feat/cforch-benchmark
git checkout feat/cforch-benchmark

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository; you will have to mark this pull request as manually merged afterwards.

Choose one merge style, then push:

# Merge commit
git checkout main
git merge --no-ff feat/cforch-benchmark

# Rebase then fast-forward
git checkout feat/cforch-benchmark
git rebase main
git checkout main
git merge --ff-only feat/cforch-benchmark

# Rebase then merge commit
git checkout feat/cforch-benchmark
git rebase main
git checkout main
git merge --no-ff feat/cforch-benchmark

# Squash
git checkout main
git merge --squash feat/cforch-benchmark

# Fast-forward only
git checkout main
git merge --ff-only feat/cforch-benchmark

# Plain merge
git checkout main
git merge feat/cforch-benchmark

git push origin main
Reference: Circuit-Forge/avocet#21