feat: domain-stratified metrics in benchmark reports #26

Open
opened 2026-04-10 21:35:32 -07:00 by pyr0ball · 0 comments
Owner

Context: Aggregate benchmark metrics over a mixed-domain dataset mask domain-specific failure modes. A model that scores 0.82 overall might score 0.91 on acted speech and 0.43 on naturalistic British speech — the aggregate hides the real problem. This surfaced during SER evaluation against British comedy panel show audio.

Scope:

  • When audio_domain tags are present on samples, break out per-domain precision/recall/F1 alongside the aggregate in benchmark reports
  • Domains with fewer than N samples (configurable, default 20) are reported separately with a low-sample warning
  • Report output: JSON (primary) + optional markdown table
  • Additive change — no breaking changes to existing benchmark report format
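A minimal sketch of the intended grouping and flagging logic. Everything here is illustrative — the function and field names (`domain_breakdown`, `macro_prf`, `low_sample`, the sample dict shape) are assumptions, not the existing benchmark code:

```python
from collections import defaultdict

def macro_prf(pairs):
    """Macro-averaged precision/recall/F1 over (predicted, gold) label pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gold in pairs:
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1
    labels = set(tp) | set(fp) | set(fn)
    precs, recs, f1s = [], [], []
    for lab in labels:
        p = tp[lab] / (tp[lab] + fp[lab]) if (tp[lab] + fp[lab]) else 0.0
        r = tp[lab] / (tp[lab] + fn[lab]) if (tp[lab] + fn[lab]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    n = len(labels) or 1
    return {"precision": sum(precs) / n, "recall": sum(recs) / n, "f1": sum(f1s) / n}

def domain_breakdown(samples, threshold=20):
    """Group samples by audio_domain tag, compute per-domain metrics,
    and flag domains below the configurable low-sample threshold (default 20).
    Assumes each sample is a dict with audio_domain / predicted / gold keys."""
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s.get("audio_domain", "untagged")].append((s["predicted"], s["gold"]))
    out = {}
    for domain, pairs in by_domain.items():
        entry = macro_prf(pairs)
        entry["n_samples"] = len(pairs)
        entry["low_sample"] = len(pairs) < threshold
        out[domain] = entry
    return out
```

Samples without an `audio_domain` tag fall into an `untagged` bucket here; whether they should instead be excluded from the breakdown is an open design choice.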

Out of scope: Domain tagging itself (see audio domain tagging issue). UI visualization of per-domain metrics (can follow in a separate issue).

Acceptance criteria:

  • benchmark run produces per-domain metric breakdown when domain tags are present
  • Low-sample domains (< configurable threshold) are flagged clearly in output
  • Existing benchmark reports without domain tags are unaffected
  • JSON report schema is additive (new domain_breakdown key, not a replacement)
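For the additive-schema criterion, the report might look like the sketch below: existing top-level keys stay untouched and the new metrics hang off a single `domain_breakdown` key. The surrounding key names (`aggregate`, `n_samples`, `low_sample`) are assumptions for illustration; the metric values echo the example figures from the issue context:

```python
import json

report = {
    # existing report content, unchanged
    "aggregate": {"precision": 0.82, "recall": 0.81, "f1": 0.82},
    # new, additive key — absent when samples carry no domain tags
    "domain_breakdown": {
        "acted": {"precision": 0.91, "recall": 0.90, "f1": 0.91,
                  "n_samples": 140, "low_sample": False},
        "naturalistic_british": {"precision": 0.45, "recall": 0.42, "f1": 0.43,
                                 "n_samples": 12, "low_sample": True},
    },
}
print(json.dumps(report, indent=2))
```

Consumers that do not know about `domain_breakdown` simply ignore it, which is what keeps the change non-breaking.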

Related: Depends on audio domain tagging issue. circuitforge-plans/avocet/ — audio model evaluation extension.

pyr0ball added the enhancement label 2026-04-10 21:35:32 -07:00
Reference: Circuit-Forge/avocet#26