Training corpus: use local media library (audio adventures) for correction labeling and STT evaluation #23

Open
opened 2026-04-12 17:54:08 -07:00 by pyr0ball · 1 comment

Goal

Use Alan's local audio adventure / audiobook library as a labeling source for:

  1. Correction training data — play known audio through the linnet mic, label where the classifier got tone wrong via CorrectionWidget, export for fine-tuning
  2. STT accuracy evaluation — compare Whisper output against known transcriptions (WER measurement per model/setting)
  3. Hallucination profiling — identify which audio conditions (music beds, ambience, low-energy speech, silence) trigger specific hallucination tokens

Implementation

  • Batch ingest script: push audio segments via /audio API endpoint
  • Label events in the existing correction UI
  • Export via /export for dataset assembly
  • Audio adventures are a good source: varied affect, multiple voice actors, readily available transcriptions, and no PII concerns
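A minimal sketch of the batch ingest step above. The `/audio` endpoint exists per this issue, but the URL, payload shape, and field names here are assumptions, not the actual linnet API:

```python
"""Batch ingest sketch: walk a media directory, wrap each audio file as a
JSON payload, and POST it to the server's /audio endpoint. Endpoint URL and
payload fields are illustrative assumptions."""
import json
from pathlib import Path
from urllib import request

API_URL = "http://localhost:8000/audio"  # assumed server address

def build_payloads(media_dir: str, exts=(".wav", ".flac")) -> list[dict]:
    """Collect audio files under media_dir and build one payload per file."""
    payloads = []
    for path in sorted(Path(media_dir).rglob("*")):
        if path.suffix.lower() in exts:
            payloads.append({
                "source": str(path),      # provenance for later labeling
                "label_hint": path.stem,  # e.g. episode name
            })
    return payloads

def push(payloads: list[dict]) -> None:
    """POST each payload to /audio; add retries/backoff in a real script."""
    for p in payloads:
        req = request.Request(
            API_URL,
            data=json.dumps(p).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)
```

Splitting payload construction from the network call keeps the directory walk testable without a running server.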

Acceptance

  • Batch ingest script exists and works against local audio files
  • Export produces a labeled dataset in a format suitable for fine-tuning cf-voice classifiers
  • At least one evaluation run completed comparing Whisper accuracy across window sizes (500ms, 1s, 2s)
pyr0ball (Author, Owner) commented:

Implementation approach: subtitle-aligned slicing

Use SRT/VTT subtitle files from the media library as free ground-truth transcriptions.

Pipeline

SRT/VTT file + audio file
    ↓
parse_subtitles() → [(t_start, t_end, text), ...]
    ↓
slice_audio(t_start, t_end) → PCM clip
    ↓
cf-voice /classify(clip) → tone label, speaker_id
WhisperSTT(clip) → transcript
    ↓
compare(transcript, subtitle_text) → WER, hallucination flags
    ↓
corpus row: {audio_clip, subtitle_text, whisper_text, tone_label, speaker_id, wer}
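The first pipeline stage can be sketched as below. The issue suggests `pysrt` or `webvtt-py`; this dependency-free parser only illustrates the `(t_start, t_end, text)` output shape the rest of the pipeline consumes:

```python
"""Minimal parse_subtitles() sketch for SRT input. A real implementation
would use pysrt/webvtt-py; this shows the tuple shape downstream steps expect."""
import re

TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")  # matches 00:00:01,000 or .000

def _seconds(ts: str) -> float:
    """Convert an SRT/VTT timestamp to seconds."""
    h, m, s, ms = TS.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_subtitles(srt_text: str) -> list[tuple[float, float, str]]:
    """Return [(t_start, t_end, text), ...] from raw SRT text."""
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = (p.strip() for p in lines[1].split("-->"))
        text = " ".join(lines[2:]).strip()
        entries.append((_seconds(start), _seconds(end), text))
    return entries
```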

What this gives us

  • STT evaluation dataset: Whisper vs. subtitle text = WER per window size / model
  • Tone corpus: audio clip + known text → labeled pairs for parallel classifier (linnet#22)
  • Hallucination profiling: large WER divergence = failure case worth labeling in CorrectionWidget
  • Diarization validation: audio adventures with multiple voice actors = known-speaker attribution to validate speaker_id

Implementation notes

  • pysrt or webvtt-py for subtitle parsing
  • Slice with numpy or soundfile (no ffmpeg dependency needed for PCM)
  • Skip clips shorter than 500ms (below Whisper minimum)
  • Skip clips longer than 10s (too much context drift)
  • Store corpus as JSONL: one row per subtitle entry
  • Run in batch mode against a media library directory, not through the web UI
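The duration bounds and JSONL row from the notes above can be sketched as follows. Sample-index arithmetic stands in for the actual `slice_audio()`; the row field names follow the pipeline diagram:

```python
"""Clip filter and JSONL row assembly sketch. The 0.5 s / 10 s bounds come
from the notes; clip_bounds() is a stand-in for real soundfile slicing."""
import json
from typing import Optional

MIN_S, MAX_S = 0.5, 10.0  # skip too-short and too-long clips

def clip_bounds(t_start: float, t_end: float, sr: int) -> Optional[tuple[int, int]]:
    """Return (start_sample, end_sample), or None if duration is out of range."""
    dur = t_end - t_start
    if dur < MIN_S or dur > MAX_S:
        return None
    return int(t_start * sr), int(t_end * sr)

def corpus_row(clip_path, subtitle_text, whisper_text, tone_label,
               speaker_id, wer) -> str:
    """Serialize one JSONL line per subtitle entry, per the notes."""
    return json.dumps({
        "audio_clip": clip_path, "subtitle_text": subtitle_text,
        "whisper_text": whisper_text, "tone_label": tone_label,
        "speaker_id": speaker_id, "wer": wer,
    })
```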

Script location

scripts/build_corpus.py — standalone, no server required, reads directly from cf-voice library

pyr0ball added this to the Navigation — v0.2.x milestone 2026-04-17 11:56:15 -07:00
Reference: Circuit-Forge/linnet#23