refactor: Eval — consolidate 6 benchmark tabs into Benchmark + Compare views #51

Closed

opened 2026-05-01 12:13:13 -07:00 by pyr0ball · 1 comment

pyr0ball commented

2026-05-01 12:13:13 -07:00

Owner

Context

The current BenchmarkView has six sub-tabs (Classifier, Compare, LlmEval, Style, Voice, PlansBench) with duplicated UI patterns. Consolidate them into two views that route all benchmarks through cf-orch.

Work

- `BenchmarkView.vue` (at `/eval/benchmark`): unified benchmark runner
  - Model selector (from Fleet inventory)
  - Task suite selector (from `bench_tasks.yaml`)
  - Run button → SSE progress stream
  - Results table: per-task score, latency, token counts
  - Works for all model types (classifier, LLM, voice, vision); the model type determines the available tasks
- `CompareView.vue` (at `/eval/compare`): side-by-side output comparison
  - Select 2+ models
  - Select task(s)
  - See raw outputs side by side
  - Replaces old Compare + LlmEval tabs
- Remove: ClassifierTab, CompareTab, LlmEvalTab, StyleTab, VoiceTab, PlansBenchTab components
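
A rough sketch of the two BenchmarkView pieces that carry logic: filtering tasks by model type and reading the SSE progress stream. Nothing here is taken from cf-orch or from `bench_tasks.yaml`; the type names, field names (`modelTypes`, `taskId`, `latencyMs`, `tokens`) and both helper functions are hypothetical, and the stream is assumed to send one JSON object per event.

```typescript
// Hypothetical shapes: the issue does not define the bench_tasks.yaml schema
// or the SSE payload, so every field name below is an assumption.
type ModelType = "classifier" | "llm" | "voice" | "vision";

interface BenchTask {
  id: string;
  modelTypes: ModelType[]; // model types this task can run against
}

interface BenchProgress {
  taskId: string;
  score: number;
  latencyMs: number;
  tokens: number;
}

// Keep only the tasks that declare support for the selected model's type.
function tasksForModel(tasks: BenchTask[], modelType: ModelType): BenchTask[] {
  return tasks.filter((t) => t.modelTypes.indexOf(modelType) !== -1);
}

// Parse a chunk of SSE text into progress payloads.
// Events are separated by a blank line and only `data:` lines are read.
// Assumes the chunk holds complete events; buffering of partial events
// across network chunks is left to the caller.
function parseSseChunk(chunk: string): BenchProgress[] {
  const events: BenchProgress[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    const data = block
      .split(/\r?\n/)
      .filter((line) => line.indexOf("data:") === 0)
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
      .join("\n");
    if (data !== "") {
      events.push(JSON.parse(data) as BenchProgress);
    }
  }
  return events;
}
```

In the view, `tasksForModel` would feed the task suite selector whenever the model selection changes, and each parsed `BenchProgress` would become one row in the results table. If the endpoint is a plain GET, the browser's `EventSource` could replace the manual parser.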

Acceptance

- Any cf-orch-supported model type can be benchmarked from one view
- Results match what the cf-orch harness produces
- Old tab functionality is covered by the new views
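
One possible way to check the "results match" criterion in a test, sketched under assumptions: the `TaskResult` shape, the `mismatchedTasks` helper and the tolerance are all hypothetical, and how the harness output is obtained is not covered here.

```typescript
// Hypothetical result row; the real cf-orch harness output format is not
// specified in this issue.
interface TaskResult {
  taskId: string;
  score: number;
}

// Return the ids of harness tasks whose score is missing from the view's
// results or differs by more than `tolerance`. Tasks that appear only in
// the view's results are not reported.
function mismatchedTasks(
  view: TaskResult[],
  harness: TaskResult[],
  tolerance: number = 1e-6
): string[] {
  const viewScores: Record<string, number | undefined> = {};
  for (const r of view) {
    viewScores[r.taskId] = r.score;
  }
  const bad: string[] = [];
  for (const h of harness) {
    const s = viewScores[h.taskId];
    if (s === undefined || Math.abs(s - h.score) > tolerance) {
      bad.push(h.taskId);
    }
  }
  return bad;
}
```

An empty return value would mean the view agrees with the harness for every task the harness reported.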
