feat: SDH subtitle generation pipeline (Marlin + cf-stt + tone annotation → SRT/VTT) #31

Open
opened 2026-05-22 19:34:29 -07:00 by pyr0ball · 0 comments
Owner

Summary

Add SDH (Subtitles for the Deaf and Hard of Hearing) subtitle generation to Linnet, combining three inference services into a merged, broadcast-quality subtitle output.

SDH is distinct from standard closed captions: it includes non-speech audio events ([DOOR SLAMS], [tense music], [crowd cheering]), speaker identification, and tone/manner descriptors ([whispering], [angrily]) — exactly the bracketed annotation grammar Linnet already produces for real-time tone annotation.

Pipeline

| Stage | Service | Output |
|-------|---------|--------|
| 1. Video events | cf-video (Marlin-2B) | Non-speech events + timestamps |
| 2. Speech | cf-stt (Whisper) | Transcript + word timestamps + speaker diarization |
| 3. Tone annotation | Linnet | Per-utterance manner descriptors |
| 4. Merge + format | Linnet backend | SRT / VTT / ASS with merged tracks |
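
One way the four stages could be sequenced is sketched below. This is only an illustration under assumptions: `run_stages`, the stage names, and the stub callables are hypothetical and do not reflect any existing Linnet or cf-orch API. Each stage reads earlier results from a shared context, and the generator yields one progress record per completed stage, which a progress stream could forward to clients.

```python
from typing import Any, Callable, Iterator

# Hypothetical type: a stage is a name plus a callable that reads the shared context.
Stage = tuple[str, Callable[[dict[str, Any]], Any]]


def run_stages(context: dict[str, Any], stages: list[Stage]) -> Iterator[dict[str, Any]]:
    """Run stages in order, store each result under its name, yield progress."""
    total = len(stages)
    for position, (name, stage) in enumerate(stages, start=1):
        context[name] = stage(context)
        yield {"stage": name, "completed": position, "total": total}


# Stub stages standing in for cf-video, cf-stt, tone annotation, and the merge step.
stages: list[Stage] = [
    ("events", lambda ctx: ["DOOR SLAMS"]),
    ("speech", lambda ctx: ["I did not know you would be home early."]),
    ("tone", lambda ctx: ["whispering" for _ in ctx["speech"]]),
    ("merge", lambda ctx: len(ctx["events"]) + len(ctx["speech"])),
]

context: dict[str, Any] = {"video": "clip.mp4"}
progress = list(run_stages(context, stages))
```

Running the stages strictly in sequence keeps the sketch simple; whether the video and speech stages can run concurrently depends on how the services are allocated, which this issue does not specify.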

Example Output (SRT)

1
00:00:04,200 --> 00:00:05,800
[DOOR SLAMS]

2
00:00:05,800 --> 00:00:08,100
[tense music swells]

3
00:00:08,200 --> 00:00:11,400
SARAH: [whispering] I did not know
you would be home early.
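
The SRT layout above (a 1-based index, an `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the cue text, with a blank line between cues) could be produced by a formatter along these lines. This is a minimal sketch: the `Cue` type and the function names are hypothetical, and cue times are assumed to be floating-point seconds.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video (assumed unit)
    end: float
    text: str     # may contain a newline for two-line cues


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(cues: list[Cue]) -> str:
    """Render cues as SRT blocks, numbered from 1 and separated by blank lines."""
    blocks = [
        f"{index}\n{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}\n{cue.text}"
        for index, cue in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


print(render_srt([Cue(4.2, 5.8, "[DOOR SLAMS]")]))
```

Working in integer milliseconds before splitting into fields avoids rounding a fractional part up to `1000` in the milliseconds column.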

Scope

  • New endpoint: POST /api/subtitles/generate — accepts video file path or URL, returns job ID
  • SSE progress stream: GET /api/subtitles/{job_id}/progress
  • Result download: GET /api/subtitles/{job_id}/output?format=srt|vtt|ass
  • Backend orchestrates: cf-video allocation → cf-stt allocation → tone annotation → merge
  • Merge logic: interleave Marlin event track + Whisper speech track by timestamp; apply Linnet tone annotations to speech segments
  • Speaker label format follows broadcast SDH convention: SPEAKER NAME: prefix in caps
  • Tone descriptors: reuse Linnet Elcor annotation vocabulary; bracketed, title-case
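
The merge step described in this list could look roughly like the sketch below. It is an illustration only: `Event`, `Utterance`, `Cue`, and `merge_tracks` are hypothetical names, the tone descriptor is assumed to arrive already attached to each utterance, and event labels and tone descriptors are passed through with whatever casing the upstream stage supplies. Overlapping cues are not resolved here.

```python
from dataclasses import dataclass
from heapq import merge
from typing import Optional


@dataclass
class Event:
    """Non-speech event from the video stage (hypothetical shape)."""
    start: float
    end: float
    label: str


@dataclass
class Utterance:
    """Speech segment from the STT stage, with an optional tone descriptor."""
    start: float
    end: float
    speaker: Optional[str]
    text: str
    tone: Optional[str] = None


@dataclass
class Cue:
    start: float
    end: float
    text: str


def event_cue(event: Event) -> Cue:
    return Cue(event.start, event.end, f"[{event.label}]")


def speech_cue(utterance: Utterance) -> Cue:
    parts = []
    if utterance.speaker:
        parts.append(f"{utterance.speaker.upper()}:")  # SPEAKER NAME: prefix in caps
    if utterance.tone:
        parts.append(f"[{utterance.tone}]")
    parts.append(utterance.text)
    return Cue(utterance.start, utterance.end, " ".join(parts))


def merge_tracks(events: list[Event], utterances: list[Utterance]) -> list[Cue]:
    """Interleave the event and speech tracks by (start, end) time."""
    def by_time(cue: Cue) -> tuple[float, float]:
        return (cue.start, cue.end)

    event_cues = sorted(map(event_cue, events), key=by_time)
    speech_cues = sorted(map(speech_cue, utterances), key=by_time)
    # heapq.merge is stable: on equal times, event cues precede speech cues.
    return list(merge(event_cues, speech_cues, key=by_time))
```

For the example output in this issue, two events followed by one whispered utterance from a speaker named Sarah would yield three cues in timestamp order, the last reading `SARAH: [whispering] I did not know\nyou would be home early.`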

Dependencies

  • cf-video service type in cf-orch (see cf-orch#71) — not yet built
  • cf-stt already available
  • Linnet tone annotation: already available for real-time text; needs batch/offline mode for subtitle segments

Accessibility rationale

SDH production is expensive, often outsourced, and routinely skipped on independent, community, and self-hosted content. A local pipeline producing broadcast-quality SDH from a video file is a direct accessibility win for deaf/HoH users — a primary CF audience. This is a strong product differentiator for Linnet beyond real-time chat annotation.

Model

  • Video events: NemoStation/Marlin-2B (candidate; see cf-orch#71)
  • Speech: Whisper large-v3 or distil-large-v3
  • Tone: Linnet local inference (no external service)
Reference: Circuit-Forge/linnet#31