refactor: Eval — consolidate 6 benchmark tabs into Benchmark + Compare views #51

Closed
opened 2026-05-01 12:13:13 -07:00 by pyr0ball · 1 comment
Owner

Context

The current BenchmarkView has six sub-tabs (Classifier, Compare, LlmEval, Style, Voice, PlansBench) with duplicated UI patterns. Consolidate them into two views that route all benchmarks through cf-orch.

Work

  • BenchmarkView.vue (at /eval/benchmark) — unified benchmark runner:
    • Model selector (from Fleet inventory)
    • Task suite selector (from bench_tasks.yaml)
    • Run button → SSE progress stream
    • Results table: per-task score, latency, token counts
    • Works for all model types (classifier, LLM, voice, vision) — model type determines available tasks
  • CompareView.vue (at /eval/compare) — side-by-side output comparison:
    • Select 2+ models
    • Select task(s)
    • See raw outputs side by side
    • Replaces old Compare + LlmEval tabs
  • Remove: ClassifierTab, CompareTab, LlmEvalTab, StyleTab, VoiceTab, PlansBenchTab components
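A minimal sketch of two pieces of the runner described above: filtering the task suite by model type, and decoding one frame of the SSE progress stream into a results-table row. The `bench_tasks.yaml` schema, the Fleet inventory shape, and the event payload are not shown in this issue, so every type and field name here (`BenchTask`, `modelTypes`, `FleetModel`, `TaskProgress`) is an assumption for illustration only.

```typescript
// Hypothetical shapes: the real bench_tasks.yaml schema and Fleet inventory
// response are not specified in this issue.
type ModelType = "classifier" | "llm" | "voice" | "vision";

interface BenchTask {
  id: string;
  modelTypes: ModelType[]; // assumed field: which model types a task applies to
}

interface FleetModel {
  name: string;
  type: ModelType;
}

// Assumed payload of one progress event: the columns of the results table.
interface TaskProgress {
  taskId: string;
  score: number;
  latencyMs: number;
  tokens: number;
}

// "Model type determines available tasks": keep only tasks that list the
// selected model's type.
function tasksForModel(model: FleetModel, tasks: BenchTask[]): BenchTask[] {
  return tasks.filter((task) => task.modelTypes.includes(model.type));
}

// Decode a single SSE frame (the text between blank lines). Per the SSE
// format, each "data:" line contributes one line of payload, with one
// optional leading space removed. Returns null for frames with no data.
function parseSseFrame(frame: string): TaskProgress | null {
  const dataLines = frame
    .split("\n")
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice(5).replace(/^ /, ""));
  if (dataLines.length === 0) {
    return null;
  }
  return JSON.parse(dataLines.join("\n")) as TaskProgress;
}
```

In a browser the frames would more likely arrive through `EventSource`, which does this decoding itself; the manual parser is only useful if the stream is read with `fetch`.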

Acceptance

  • Any cf-orch-supported model type can be benchmarked from one view
  • Results match what the cf-orch harness produces
  • Old tab functionality is covered by the new views
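One way the "results match" criterion could be checked is to diff the per-task scores shown in the view against those reported by the harness. This is a sketch only: the `TaskScore` shape, the `diffResults` helper, and the tolerance are hypothetical, and the actual cf-orch output format is not described in this issue.

```typescript
// Hypothetical row shape shared by the view and the harness output.
interface TaskScore {
  taskId: string;
  score: number;
}

// Returns the ids of tasks whose scores differ by more than `tol`, followed
// by tasks that appear in the harness output but not in the view.
// An empty result means the two sets agree.
function diffResults(
  view: TaskScore[],
  harness: TaskScore[],
  tol: number = 1e-9,
): string[] {
  const expected = new Map<string, number>(
    harness.map((row): [string, number] => [row.taskId, row.score]),
  );
  const mismatches: string[] = [];
  for (const row of view) {
    const want = expected.get(row.taskId);
    if (want === undefined || Math.abs(want - row.score) > tol) {
      mismatches.push(row.taskId);
    }
    expected.delete(row.taskId);
  }
  // Whatever is left in the map was never shown in the view.
  return mismatches.concat(Array.from(expected.keys()));
}
```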
pyr0ball added this to the v2 — Pipeline Architecture milestone 2026-05-01 12:13:13 -07:00
pyr0ball added the frontend and reorg labels 2026-05-01 12:13:13 -07:00
Author
Owner

Shipped in the Apr 19–May 4 sprint. CompareView extracted from BenchmarkView to /eval/compare.

Reference: Circuit-Forge/avocet#51