feat: audio domain tagging for benchmark datasets #25


Open

opened 2026-04-10 21:35:23 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-04-10 21:35:23 -07:00

Owner

Context: Audio benchmark datasets mix very different recording conditions: acted studio speech, naturalistic conversation, broadcast panel shows, and call centre audio. Pooling them into a single set hides per-domain failure modes. For example, in testing against British comedy panel show audio ("As Yet Untitled"), speech emotion recognition (SER) models predicted "neutral" across the board on naturalistic speech with non-North-American (non-NA) accents.

Scope:

- [ ] Extend dataset schema with optional `audio_domain` string field
- [ ] Implement suggested taxonomy: `acted_na`, `acted_eu`, `naturalistic_en_gb`, `naturalistic_en_us`, `broadcast`, `call_centre`, `phone_degraded`
- [ ] Labeling UI shows domain badge alongside sample and allows editing
- [ ] Export format (JSON + CSV) includes `audio_domain` field
- [ ] Schema change is backward-compatible (field is optional, existing datasets unaffected)
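
One possible shape for the schema change, as a minimal sketch rather than the project's actual code. The `Sample` record, its `sample_id` and `label` fields, and the `sample_from_dict` helper are hypothetical names used only for illustration; only `audio_domain` and the taxonomy values come from this issue. Rejecting values outside the taxonomy is also an assumption, since the issue only asks for an optional string field.

```python
from dataclasses import dataclass
from typing import Optional

# Taxonomy values copied from this issue; the constant name is hypothetical.
AUDIO_DOMAINS = frozenset({
    "acted_na",
    "acted_eu",
    "naturalistic_en_gb",
    "naturalistic_en_us",
    "broadcast",
    "call_centre",
    "phone_degraded",
})


@dataclass
class Sample:
    # Hypothetical record shape; the real dataset schema may differ.
    sample_id: str
    label: str
    # Optional, so records written before this change remain valid.
    audio_domain: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: unknown tags are rejected rather than stored as-is.
        if self.audio_domain is not None and self.audio_domain not in AUDIO_DOMAINS:
            raise ValueError(f"unknown audio_domain: {self.audio_domain!r}")


def sample_from_dict(record: dict) -> Sample:
    """Load a record, treating a missing audio_domain key as untagged."""
    return Sample(
        sample_id=record["sample_id"],
        label=record["label"],
        audio_domain=record.get("audio_domain"),
    )
```

Using `record.get("audio_domain")` means an existing record without the key loads as untagged (`None`) instead of raising, which is the backward-compatibility behaviour the last scope item asks for.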

Out of scope: Automatic domain prediction (see separate issue for lightweight domain classifier).

Acceptance criteria:

- [ ] Dataset schema accepts optional `audio_domain` string field
- [ ] Labeling UI shows domain badge and allows editing
- [ ] Export includes domain tag
- [ ] Existing datasets with no domain tag load and export without errors
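
The export criteria could be checked with something like the sketch below. It assumes records are plain dicts and that the exported columns are `sample_id`, `label`, and `audio_domain`; apart from `audio_domain`, these names and both function names are hypothetical. Writing `null` in JSON and an empty cell in CSV for untagged samples is one reasonable convention, not a decided one.

```python
import csv
import io
import json

# Hypothetical column set; only audio_domain is named in this issue.
FIELDS = ["sample_id", "label", "audio_domain"]


def export_json(records: list[dict]) -> str:
    # Always emit audio_domain; untagged records get null.
    rows = [{field: rec.get(field) for field in FIELDS} for rec in records]
    return json.dumps(rows, indent=2)


def export_csv(records: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Untagged records get an empty cell instead of raising KeyError.
        writer.writerow({field: rec.get(field) or "" for field in FIELDS})
    return buf.getvalue()


if __name__ == "__main__":
    legacy = {"sample_id": "a1", "label": "neutral"}  # no domain tag
    tagged = {"sample_id": "a2", "label": "happy", "audio_domain": "broadcast"}
    print(export_json([legacy, tagged]))
    print(export_csv([legacy, tagged]))
```

A regression test for the last criterion can then load a dataset saved before the schema change, run both exporters, and assert that neither raises.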

Related: circuitforge-plans/avocet/ — audio model evaluation extension; see also cf-voice/Linnet SER evaluation work
