eval: cohere-transcribe-diarize as pyannote replacement backend #5

Open

opened 2026-06-06 01:32:07 -07:00 by pyr0ball · 0 comments

Source: https://huggingface.co/syvai/cohere-transcribe-diarize

What it is

syvai/cohere-transcribe-diarize is a conformer encoder-decoder that performs transcription + diarization in a single forward pass. It extends the vocabulary with 8 speaker tokens and 300 timestamp tokens at 100 ms resolution (300 × 100 ms = 30 s, matching the 30 s window limit), emitting an interleaved stream like:

<|spltoken0|><|t:0.0|> Welcome back.<|t:2.4|>
<|spltoken1|><|t:2.5|> Thanks for having me.<|t:5.1|>
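
To make the output format concrete, here is a minimal parsing sketch. The regex is inferred only from the two sample lines above, so the real token grammar may differ (for example around overlapping speech or missing end timestamps). `Segment` and `parse_segments` are hypothetical names, not part of the model's helper scripts.

```python
import re
from dataclasses import dataclass
from typing import List

# Pattern assumed from the sample in this issue:
#   <|spltokenN|><|t:START|> text<|t:END|>
SEGMENT_RE = re.compile(
    r"<\|spltoken(\d+)\|>"       # speaker index
    r"<\|t:(\d+(?:\.\d+)?)\|>"   # start time in seconds
    r"(.*?)"                     # transcript text (non-greedy)
    r"<\|t:(\d+(?:\.\d+)?)\|>",  # end time in seconds
    re.DOTALL,
)


@dataclass(frozen=True)
class Segment:
    speaker: int
    start: float
    end: float
    text: str


def parse_segments(stream: str) -> List[Segment]:
    """Turn the interleaved token stream into a list of speaker segments."""
    return [
        Segment(
            speaker=int(m.group(1)),
            start=float(m.group(2)),
            end=float(m.group(4)),
            text=m.group(3).strip(),
        )
        for m in SEGMENT_RE.finditer(stream)
    ]


sample = (
    "<|spltoken0|><|t:0.0|> Welcome back.<|t:2.4|>\n"
    "<|spltoken1|><|t:2.5|> Thanks for having me.<|t:5.1|>"
)
for seg in parse_segments(sample):
    print(seg.speaker, seg.start, seg.end, seg.text)
```

Anything that does not match the assumed pattern is silently skipped here; a production parser would probably want to log or raise on unparsed spans.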

License: Apache 2.0 (no gating, no HF_TOKEN required)
Speed: 44x real-time on RTX 3090; 249x throughput via vLLM batching
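
For a rough sense of scale, the speed figures convert to wall-clock time as below. This assumes "249x throughput" means an aggregate real-time factor across a batch, which the model card would need to confirm; `processing_seconds` is a hypothetical helper.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to process audio at a given real-time factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 3600.0
# Single stream on an RTX 3090 at 44x real-time.
print(f"{processing_seconds(one_hour, 44):.1f} s")   # 81.8 s
# Batched via vLLM, assuming 249x is an aggregate real-time factor.
print(f"{processing_seconds(one_hour, 249):.1f} s")  # 14.5 s
```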

Why this matters for cf-voice

Current cf_voice/diarize.py uses pyannote/speaker-diarization-3.1, which:

- Requires accepting gated model terms + HF_TOKEN
- Is diarization-only (no transcription)
- Requires a separate ASR model + alignment step

cohere-transcribe-diarize eliminates all three pain points.

Integration approach

Add as an optional backend in cf_voice/diarize.py alongside the existing pyannote backend:

class CohereTranscribeDiarizeBackend:
    """Single-pass ASR + diarization via syvai/cohere-transcribe-diarize."""
    # 30s window max — use sliding-window helper for longer audio
    # vLLM-compatible for cf-orch GPU scheduling
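
One way the stub could sit alongside the pyannote backend is a small registry keyed by backend name. This is only a sketch: `Turn`, `DiarizeBackend`, `register_backend`, and `get_backend` are hypothetical names, and the actual structure of cf_voice/diarize.py may look quite different. Inference is intentionally left unimplemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol


@dataclass(frozen=True)
class Turn:
    """One speaker turn; text stays None for diarization-only backends."""
    speaker: str
    start: float
    end: float
    text: Optional[str] = None


class DiarizeBackend(Protocol):
    """Hypothetical interface; the real cf_voice API may differ."""
    name: str

    def diarize(self, audio_path: str) -> List[Turn]: ...


_BACKENDS: Dict[str, Callable[[], DiarizeBackend]] = {}


def register_backend(name: str, factory: Callable[[], DiarizeBackend]) -> None:
    """Make a backend selectable by name."""
    _BACKENDS[name] = factory


def get_backend(name: str) -> DiarizeBackend:
    """Instantiate a registered backend, or fail with the list of known names."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend {name!r}; available: {sorted(_BACKENDS)}")
    return _BACKENDS[name]()


class CohereTranscribeDiarizeBackend:
    """Single-pass ASR + diarization via syvai/cohere-transcribe-diarize."""
    name = "cohere-transcribe-diarize"

    def diarize(self, audio_path: str) -> List[Turn]:
        # Model loading and inference are deliberately left out of this sketch.
        raise NotImplementedError("inference not wired up yet")


register_backend(CohereTranscribeDiarizeBackend.name, CohereTranscribeDiarizeBackend)
backend = get_backend("cohere-transcribe-diarize")
print(backend.name)
```

Keeping `text` optional lets a diarization-only backend and a combined ASR + diarization backend return the same type, so callers can switch backends without changing how they consume the result.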

Each forward pass covers at most 30 s of audio, so longer recordings have to go through the sliding-window + speaker-embedding clustering path that the model's helper scripts already provide.
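
As a sketch of the windowing arithmetic only (not the clustering step, and not the helper scripts' actual logic), long audio could be split into overlapping windows like this. The 5 s overlap is an arbitrary assumption and `plan_windows` is a hypothetical name.

```python
from typing import List, Tuple


def plan_windows(
    duration_s: float,
    window_s: float = 30.0,   # model limit stated in this issue
    overlap_s: float = 5.0,   # assumed value, tune against real data
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be in [0, window_s)")
    if duration_s <= 0:
        return []

    step = window_s - overlap_s
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # last window reaches the end of the audio
            break
        start += step
    return windows


print(plan_windows(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Speaker labels from different windows would still need to be reconciled afterwards, since nothing here ties speaker 0 in one window to speaker 0 in the next.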

Before adopting

- [ ] Benchmark DER (Diarization Error Rate) vs pyannote 3.1 on a representative dataset
- [ ] Measure WER (Word Error Rate) vs current ASR backend
- [ ] Confirm model weight size and VRAM requirements
- [ ] Test sliding-window speaker consistency on >30s clips
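
For reference, the two metrics in the checklist can be sketched as below. Both are simplified: the WER does no text normalization, and the DER works on per-frame labels with one speaker per frame, no collar, and no overlapped speech. Real benchmarking would probably use an established scoring tool instead. `wer` and `frame_der` are hypothetical names.

```python
from itertools import permutations
from typing import Optional, Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def frame_der(ref: Sequence[Optional[int]], hyp: Sequence[Optional[int]]) -> float:
    """Simplified DER over per-frame speaker labels (None means silence).

    Tries every mapping of hypothesis speakers onto reference speakers, so it
    is only practical for a handful of speakers.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_speech = sum(r is not None for r in ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    ref_ids = sorted({r for r in ref if r is not None})
    hyp_ids = sorted({h for h in hyp if h is not None})
    # Padding with None lets a hypothesis speaker map to no reference speaker.
    targets = ref_ids + [None] * len(hyp_ids)

    def errors(mapping: dict) -> int:
        count = 0
        for r, h in zip(ref, hyp):
            if r is None and h is None:
                continue                # both silent: correct
            if r is None or h is None:
                count += 1              # false alarm or missed speech
            elif mapping[h] != r:
                count += 1              # speaker confusion
        return count

    best = min(errors(dict(zip(hyp_ids, perm)))
               for perm in permutations(targets, len(hyp_ids)))
    return best / ref_speech


print(wer("thanks for having me", "thanks for heaving me"))      # 0.25
print(frame_der([0, 0, 0, 1, 1, None], [1, 1, 0, 0, 0, None]))   # 0.2
```

The speaker mapping matters because the model's speaker indices are arbitrary: a hypothesis that swaps the labels of two speakers consistently should not be penalized for it.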

Products that benefit

- cf-voice: direct replacement
- Osprey: IVR call transcription with caller/agent separation
- Linnet: speaker-aware tone annotation

vLLM deployment note

vLLM 0.19.0 with continuous batching is the recommended deployment. This maps cleanly onto cf-orch's existing GPU worker pattern.
