Training corpus: use local media library (audio adventures) for correction labeling and STT evaluation #23
Reference: Circuit-Forge/linnet#23
Goal
Use Alan's local audio adventure / audiobook library as a labeling source for:
CorrectionWidget, export for fine-tuning
Implementation
/audio API endpoint
/export for dataset assembly
Acceptance
Implementation approach: subtitle-aligned slicing
Use SRT/VTT subtitle files from the media library as free ground-truth transcriptions.
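As a rough illustration of the subtitle side of this approach, the sketch below parses SRT text into timed cues that could later drive audio slicing. It uses only the standard library so it runs without dependencies; the names `Cue` and `parse_srt` are hypothetical, and a library parser such as pysrt may handle more edge cases.

```python
import re
from dataclasses import dataclass

# SRT timing line, e.g. "00:00:01,500 --> 00:00:04,000"
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str     # subtitle text, used as the ground-truth transcription


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues (hypothetical helper, not part of any library)."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cue blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                parts = match.groups()
                # Everything after the timing line is the cue text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    cues.append(Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), text))
                break
    return cues
```

Each cue then carries the two things the corpus needs: a time range to cut from the audio and the text to use as its label.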
Pipeline
What this gives us
Implementation notes
pysrt or webvtt-py for subtitle parsing
numpy or soundfile (no ffmpeg dependency needed for PCM)
Script location
scripts/build_corpus.py: standalone, no server required, reads directly from the cf-voice library
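A minimal sketch of the slicing step such a script might perform, assuming the source audio is already uncompressed PCM WAV. It uses the standard-library `wave` module rather than numpy or soundfile purely to stay dependency-free; the function name `slice_wav` and its parameters are hypothetical.

```python
import wave


def slice_wav(src_path: str, dst_path: str, start: float, end: float) -> int:
    """Copy the PCM frames between start and end (in seconds) into dst_path.

    Returns the number of frames written. Times outside the file are clamped.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp to the file's bounds.
        first = max(0, min(total, int(round(start * rate))))
        last = max(first, min(total, int(round(end * rate))))
        src.setpos(first)
        frames = src.readframes(last - first)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and rate; the frame count in the
        # header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return last - first
```

Calling this once per subtitle cue, with the cue's start and end times, would yield one clip per transcription line.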