feat: domain-stratified metrics in benchmark reports #26

Open
opened 2026-04-10 21:35:32 -07:00 by pyr0ball · 0 comments
Owner

Context: Aggregate benchmark metrics over a mixed-domain dataset mask domain-specific failure modes. A model that scores 0.82 overall might score 0.91 on acted speech and 0.43 on naturalistic British speech — the aggregate hides the real problem. This surfaced during SER evaluation against British comedy panel show audio.

Scope:

  • When audio_domain tags are present on samples, break out per-domain precision/recall/F1 alongside the aggregate in benchmark reports
  • Domains with fewer than N samples (configurable, default 20) are reported separately with a low-sample warning
  • Report output: JSON (primary) + optional markdown table
  • Additive change — no breaking changes to existing benchmark report format
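A minimal sketch of the intended grouping and flagging logic. Everything here is illustrative — the function and field names (`domain_breakdown`, `macro_prf`, `low_sample`, the sample dict shape) are assumptions, not the existing benchmark code:

```python
from collections import defaultdict

def macro_prf(pairs):
    """Macro-averaged precision/recall/F1 over (predicted, gold) label pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gold in pairs:
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1
    labels = set(tp) | set(fp) | set(fn)
    precs, recs, f1s = [], [], []
    for lab in labels:
        p = tp[lab] / (tp[lab] + fp[lab]) if (tp[lab] + fp[lab]) else 0.0
        r = tp[lab] / (tp[lab] + fn[lab]) if (tp[lab] + fn[lab]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    n = len(labels) or 1
    return {"precision": sum(precs) / n, "recall": sum(recs) / n, "f1": sum(f1s) / n}

def domain_breakdown(samples, threshold=20):
    """Group samples by audio_domain tag, compute per-domain metrics,
    and flag domains below the configurable low-sample threshold (default 20).
    Assumes each sample is a dict with audio_domain / predicted / gold keys."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s.get("audio_domain", "untagged")].append((s["predicted"], s["gold"]))
    out = {}
    for domain, pairs in by_domain.items():
        entry = macro_prf(pairs)
        entry["n_samples"] = len(pairs)
        entry["low_sample"] = len(pairs) < threshold
        out[domain] = entry
    return out
```

Samples without an `audio_domain` tag fall into an `untagged` bucket here; whether they should instead be excluded from the breakdown is an open design choice.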

Out of scope: Domain tagging itself (see audio domain tagging issue). UI visualization of per-domain metrics (can follow in a separate issue).

Acceptance criteria:

  • benchmark run produces per-domain metric breakdown when domain tags are present
  • Low-sample domains (< configurable threshold) are flagged clearly in output
  • Existing benchmark reports without domain tags are unaffected
  • JSON report schema is additive (new domain_breakdown key, not a replacement)
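For the additive-schema criterion, the report might look like the sketch below: existing top-level keys stay untouched and the new metrics hang off a single `domain_breakdown` key. The surrounding key names (`aggregate`, `n_samples`, `low_sample`) are assumptions for illustration; the metric values echo the example figures from the issue context:

```python
import json

report = {
    # existing report content, unchanged
    "aggregate": {"precision": 0.82, "recall": 0.81, "f1": 0.82},
    # new, additive key — absent when samples carry no domain tags
    "domain_breakdown": {
        "acted": {"precision": 0.91, "recall": 0.90, "f1": 0.91,
                  "n_samples": 140, "low_sample": False},
        "naturalistic_british": {"precision": 0.45, "recall": 0.42, "f1": 0.43,
                                 "n_samples": 12, "low_sample": True},
    },
}
print(json.dumps(report, indent=2))
```

Consumers that do not know about `domain_breakdown` simply ignore it, which is what keeps the change non-breaking.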

Related: Depends on audio domain tagging issue. circuitforge-plans/avocet/ — audio model evaluation extension.

pyr0ball added the enhancement label 2026-04-10 21:35:32 -07:00
Reference: Circuit-Forge/avocet#26