feat: SDH subtitle generation pipeline (Marlin + cf-stt + tone annotation → SRT/VTT) #31

Open

opened 2026-05-22 19:34:29 -07:00 by pyr0ball · 0 comments

Summary

Add SDH (Subtitles for the Deaf and Hard of Hearing) subtitle generation to Linnet, combining three inference services into a merged, broadcast-quality subtitle output.

SDH is distinct from standard closed captions: it includes non-speech audio events ([DOOR SLAMS], [tense music], [crowd cheering]), speaker identification, and tone/manner descriptors ([whispering], [angrily]) — exactly the bracketed annotation grammar Linnet already produces for real-time tone annotation.

Pipeline

| Stage | Service | Output |
|-------|---------|--------|
| 1. Video events | `cf-video` (Marlin-2B) | Non-speech events + timestamps |
| 2. Speech | `cf-stt` (Whisper) | Transcript + word timestamps + speaker diarization |
| 3. Tone annotation | Linnet | Per-utterance manner descriptors |
| 4. Merge + format | Linnet backend | SRT / VTT / ASS with merged tracks |
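
As a rough sketch of how the four stages in the table could be chained, the snippet below runs them in order and emits one progress message per completed stage in Server-Sent Events wire format (`event:` / `data:` lines ended by a blank line), which is what the progress stream in Scope would carry. The stage keys, the `run_pipeline` name, the shared `context` dict, and the payload fields are all assumptions made for illustration; none of them are existing Linnet or cf-orch APIs.

```python
import json
from typing import Callable, Iterator

# Hypothetical stage keys, in the order given by the pipeline table.
STAGES = ["video_events", "speech", "tone_annotation", "merge_format"]


def run_pipeline(
    stage_fns: dict[str, Callable[[dict], None]],
    context: dict,
) -> Iterator[str]:
    """Run each stage in order and yield one SSE-formatted message per stage.

    Each stage function mutates the shared `context` dict in place
    (for example, adding an "events" or "speech" key).
    """
    for done, name in enumerate(STAGES, start=1):
        stage_fns[name](context)
        payload = {"stage": name, "completed": done, "total": len(STAGES)}
        # SSE framing: an event name line, a data line, then a blank line.
        yield f"event: progress\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    # Stub stages that only record that they ran.
    stubs = {name: (lambda ctx, n=name: ctx.setdefault("ran", []).append(n))
             for name in STAGES}
    for message in run_pipeline(stubs, {}):
        print(message, end="")
```

Running the stages strictly in sequence matches the allocation order described in Scope; whether the video-event and speech stages could run concurrently is left open here.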

Example Output (SRT)

1
00:00:04,200 --> 00:00:05,800
[DOOR SLAMS]

2
00:00:05,800 --> 00:00:08,100
[tense music swells]

3
00:00:08,200 --> 00:00:11,400
SARAH: [whispering] I did not know
you would be home early.
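
For reference, a minimal sketch of how merged cues could be serialized into the SRT layout shown above: a 1-based index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the cue text, and a blank line between cues. The `Cue` class and both function names are hypothetical and exist only for this example; line wrapping is assumed to have been applied to the text already.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # may contain "\n" for a two-line cue


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as SRT blocks separated by a blank line."""
    blocks = [
        f"{index}\n{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}\n{cue.text}\n"
        for index, cue in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        Cue(4.2, 5.8, "[DOOR SLAMS]"),
        Cue(5.8, 8.1, "[tense music swells]"),
        Cue(8.2, 11.4, "SARAH: [whispering] I did not know\nyou would be home early."),
    ]))
```

WebVTT output would need a `WEBVTT` header and a period instead of a comma in the timestamps, so a separate formatter per output format is probably the simplest design.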

Scope

New endpoint: POST /api/subtitles/generate — accepts video file path or URL, returns job ID
SSE progress stream: GET /api/subtitles/{job_id}/progress
Result download: GET /api/subtitles/{job_id}/output?format=srt|vtt|ass
Backend orchestrates: cf-video allocation → cf-stt allocation → tone annotation → merge
Merge logic: interleave Marlin event track + Whisper speech track by timestamp; apply Linnet tone annotations to speech segments
Speaker label format follows broadcast SDH convention: SPEAKER NAME: prefix in caps
Tone descriptors: reuse Linnet Elcor annotation vocabulary; bracketed, lowercase (as in the `[whispering]` and `[angrily]` examples above)
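
The merge step described above could look roughly like the following: non-speech events and speech segments are rendered to cue text and then interleaved by start time. The `Event` and `Speech` shapes are guesses at what `cf-video` and `cf-stt` might return, not their real schemas, and `merge_tracks` and `render_speech` are hypothetical names. The tone string is emitted exactly as supplied, so casing is left to the annotation stage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:  # assumed shape of one cf-video result
    start: float
    end: float
    label: str  # e.g. "DOOR SLAMS"


@dataclass
class Speech:  # assumed shape of one diarized cf-stt segment
    start: float
    end: float
    speaker: str
    text: str
    tone: Optional[str] = None  # filled in by the tone annotation stage


def render_speech(seg: Speech) -> str:
    """Build 'SPEAKER NAME: [tone] text', omitting the tone when absent."""
    tone = f"[{seg.tone}] " if seg.tone else ""
    return f"{seg.speaker.upper()}: {tone}{seg.text}"


def merge_tracks(events: list[Event], speech: list[Speech]) -> list[tuple[float, float, str]]:
    """Interleave both tracks into (start, end, text) cues ordered by time."""
    cues = [(e.start, e.end, f"[{e.label}]") for e in events]
    cues += [(s.start, s.end, render_speech(s)) for s in speech]
    # sorted() is stable, so an event wins a tie with a speech segment
    # that has the same start and end time.
    return sorted(cues, key=lambda cue: (cue[0], cue[1]))


if __name__ == "__main__":
    merged = merge_tracks(
        [Event(4.2, 5.8, "DOOR SLAMS"), Event(5.8, 8.1, "tense music swells")],
        [Speech(8.2, 11.4, "Sarah", "I did not know you would be home early.", "whispering")],
    )
    for start, end, text in merged:
        print(f"{start:6.1f} {end:6.1f}  {text}")
```

This sketch does not resolve overlapping cues (for example, music continuing under dialogue); how those should be stacked or trimmed is a separate decision.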

Dependencies

cf-video service type in cf-orch (see cf-orch#71) — not yet built
cf-stt already available
Linnet tone annotation: already available for real-time text; needs batch/offline mode for subtitle segments
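
One possible shape for the batch/offline mode mentioned above is a thin wrapper that feeds subtitle segment texts to the annotator in fixed-size chunks. This is only a sketch: `annotate_in_batches`, the `annotate` callable, and its list-in/list-out contract are assumptions, since the real Linnet annotation interface is not described in this issue.

```python
from typing import Callable, Optional, Sequence


def annotate_in_batches(
    texts: Sequence[str],
    annotate: Callable[[list[str]], list[Optional[str]]],
    batch_size: int = 16,
) -> list[Optional[str]]:
    """Return one tone descriptor (or None) per input text, preserving order.

    `annotate` is a hypothetical batch annotator: it takes a list of
    utterances and returns a list of the same length.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    tones: list[Optional[str]] = []
    for offset in range(0, len(texts), batch_size):
        batch = list(texts[offset:offset + batch_size])
        result = annotate(batch)
        if len(result) != len(batch):
            raise ValueError("annotator returned a different number of results")
        tones.extend(result)
    return tones


if __name__ == "__main__":
    # Stand-in annotator: flags questions, leaves everything else unannotated.
    def fake_annotate(batch: list[str]) -> list[Optional[str]]:
        return ["questioning" if t.endswith("?") else None for t in batch]

    print(annotate_in_batches(["Who is there?", "It is me.", "Why?"], fake_annotate, batch_size=2))
```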

Accessibility rationale

SDH production is expensive, often outsourced, and routinely skipped on independent, community, and self-hosted content. A local pipeline producing broadcast-quality SDH from a video file is a direct accessibility win for deaf/HoH users — a primary CF audience. This is a strong product differentiator for Linnet beyond real-time chat annotation.

Model

Video events: NemoStation/Marlin-2B (candidate; see cf-orch#71)
Speech: Whisper large-v3 or distil-large-v3
Tone: Linnet local inference (no external service)
