Training corpus: use local media library (audio adventures) for correction labeling and STT evaluation #23

Open
opened 2026-04-12 17:54:08 -07:00 by pyr0ball · 1 comment

Goal

Use Alan's local audio adventure / audiobook library as a labeling source for:

  1. Correction training data — play known audio through the linnet mic, label where the classifier got tone wrong via CorrectionWidget, export for fine-tuning
  2. STT accuracy evaluation — compare Whisper output against known transcriptions (WER measurement per model/setting)
  3. Hallucination profiling — identify which audio conditions (music beds, ambience, low-energy speech, silence) trigger specific hallucination tokens

Implementation

  • Batch ingest script: push audio segments via /audio API endpoint
  • Label events in the existing correction UI
  • Export via /export for dataset assembly
  • Audio adventures are a good source: varied affect, multiple voice actors, readily available transcriptions, and no PII concerns
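A minimal sketch of the batch ingest step above. The `/audio` endpoint exists per this issue, but the URL, payload shape, and field names here are assumptions, not the actual linnet API:

```python
"""Batch ingest sketch: walk a media directory, wrap each audio file as a
JSON payload, and POST it to the server's /audio endpoint. Endpoint URL and
payload fields are illustrative assumptions."""
import json
from pathlib import Path
from urllib import request

API_URL = "http://localhost:8000/audio"  # assumed server address

def build_payloads(media_dir: str, exts=(".wav", ".flac")) -> list[dict]:
    """Collect audio files under media_dir and build one payload per file."""
    payloads = []
    for path in sorted(Path(media_dir).rglob("*")):
        if path.suffix.lower() in exts:
            payloads.append({
                "source": str(path),      # provenance for later labeling
                "label_hint": path.stem,  # e.g. episode name
            })
    return payloads

def push(payloads: list[dict]) -> None:
    """POST each payload to /audio; add retries/backoff in a real script."""
    for p in payloads:
        req = request.Request(
            API_URL,
            data=json.dumps(p).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)
```

Splitting payload construction from the network call keeps the directory walk testable without a running server.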

Acceptance

  • Batch ingest script exists and works against local audio files
  • Export produces a labeled dataset in a format suitable for fine-tuning cf-voice classifiers
  • At least one evaluation run completed comparing Whisper accuracy across window sizes (500ms, 1s, 2s)
pyr0ball (Author, Owner) commented:

Implementation approach: subtitle-aligned slicing

Use SRT/VTT subtitle files from the media library as free ground-truth transcriptions.

Pipeline

SRT/VTT file + audio file
    ↓
parse_subtitles() → [(t_start, t_end, text), ...]
    ↓
slice_audio(t_start, t_end) → PCM clip
    ↓
cf-voice /classify(clip) → tone label, speaker_id
WhisperSTT(clip) → transcript
    ↓
compare(transcript, subtitle_text) → WER, hallucination flags
    ↓
corpus row: {audio_clip, subtitle_text, whisper_text, tone_label, speaker_id, wer}
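The first pipeline stage can be sketched as below. The issue suggests `pysrt` or `webvtt-py`; this dependency-free parser only illustrates the `(t_start, t_end, text)` output shape the rest of the pipeline consumes:

```python
"""Minimal parse_subtitles() sketch for SRT input. A real implementation
would use pysrt/webvtt-py; this shows the tuple shape downstream steps expect."""
import re

TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")  # matches 00:00:01,000 or .000

def _seconds(ts: str) -> float:
    """Convert an SRT/VTT timestamp to seconds."""
    h, m, s, ms = TS.match(ts).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_subtitles(srt_text: str) -> list[tuple[float, float, str]]:
    """Return [(t_start, t_end, text), ...] from raw SRT text."""
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = (p.strip() for p in lines[1].split("-->"))
        text = " ".join(lines[2:]).strip()
        entries.append((_seconds(start), _seconds(end), text))
    return entries
```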

What this gives us

  • STT evaluation dataset: Whisper vs. subtitle text = WER per window size / model
  • Tone corpus: audio clip + known text → labeled pairs for parallel classifier (linnet#22)
  • Hallucination profiling: large WER divergence = failure case worth labeling in CorrectionWidget
  • Diarization validation: audio adventures with multiple voice actors = known-speaker attribution to validate speaker_id

Implementation notes

  • pysrt or webvtt-py for subtitle parsing
  • Slice with numpy or soundfile (no ffmpeg dependency needed for PCM)
  • Skip clips shorter than 500ms (below Whisper minimum)
  • Skip clips longer than 10s (too much context drift)
  • Store corpus as JSONL: one row per subtitle entry
  • Run in batch mode against a media library directory, not through the web UI
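The duration bounds and JSONL row from the notes above can be sketched as follows. Sample-index arithmetic stands in for the actual `slice_audio()`; the row field names follow the pipeline diagram:

```python
"""Clip filter and JSONL row assembly sketch. The 0.5 s / 10 s bounds come
from the notes; clip_bounds() is a stand-in for real soundfile slicing."""
import json
from typing import Optional

MIN_S, MAX_S = 0.5, 10.0  # skip too-short and too-long clips

def clip_bounds(t_start: float, t_end: float, sr: int) -> Optional[tuple[int, int]]:
    """Return (start_sample, end_sample), or None if duration is out of range."""
    dur = t_end - t_start
    if dur < MIN_S or dur > MAX_S:
        return None
    return int(t_start * sr), int(t_end * sr)

def corpus_row(clip_path, subtitle_text, whisper_text, tone_label,
               speaker_id, wer) -> str:
    """Serialize one JSONL line per subtitle entry, per the notes."""
    return json.dumps({
        "audio_clip": clip_path, "subtitle_text": subtitle_text,
        "whisper_text": whisper_text, "tone_label": tone_label,
        "speaker_id": speaker_id, "wer": wer,
    })
```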

Script location

scripts/build_corpus.py — standalone, no server required, reads directly from cf-voice library

pyr0ball added this to the Navigation — v0.2.x milestone 2026-04-17 11:56:15 -07:00
Reference: Circuit-Forge/linnet#23