eval: NVIDIA Sortformer + Parakeet streaming pair for Osprey phone-call use case #8


Open

opened 2026-06-06 01:48:05 -07:00 by pyr0ball · 0 comments


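**Sources:**

- https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
- https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1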

## What they are

NVIDIA designed these as a complementary streaming pair:

1. **Sortformer** (117M parameters): streaming diarization. Tracks *who* is speaking in real time via Arrival-Order Speaker Cache (AOSC). Outputs speaker probability matrices.
2. **Parakeet** (600M parameters): per-speaker streaming ASR. Injects speaker-specific kernels at the encoder layer to produce clean per-speaker transcripts from mixed audio.

Neither model covers cf-voice's full pipeline on its own, but together they provide streaming diarization plus transcription.
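To make "speaker probability matrices" concrete, here is a minimal sketch of turning such a matrix into speaker segments by thresholding. It assumes one row per frame and one column per speaker; the function name `probs_to_segments`, the 0.08 s frame duration, and the 0.5 threshold are illustrative assumptions, not part of NeMo's API or the model card.

```python
from typing import List, Sequence, Tuple

# (speaker index, start seconds, end seconds)
Segment = Tuple[int, float, float]


def probs_to_segments(
    probs: Sequence[Sequence[float]],
    frame_sec: float = 0.08,   # assumed frame duration, not taken from the model card
    threshold: float = 0.5,    # assumed activity threshold
) -> List[Segment]:
    """Turn a frames x speakers probability matrix into per-speaker segments."""
    if not probs:
        return []
    segments: List[Segment] = []
    num_speakers = len(probs[0])
    for spk in range(num_speakers):
        start = None  # index of the first frame of the current active run
        for i, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append((spk, start * frame_sec, i * frame_sec))
                start = None
        if start is not None:
            # The speaker was still active when the matrix ended.
            segments.append((spk, start * frame_sec, len(probs) * frame_sec))
    # Order by start time so overlapping speech stays interleaved.
    segments.sort(key=lambda s: (s[1], s[0]))
    return segments
```

Each speaker column is thresholded independently, so overlapping speech produces overlapping segments rather than forcing one speaker per frame.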

## Performance

| Model | Metric | Score |
|---|---|---|
| Sortformer | DER CALLHOME 2-speaker | 6.65% |
| Sortformer | DER AMI IHM | 16.67% |
| Sortformer | Streaming latency (low) | 1.04s |
| Parakeet | Single-speaker WER avg | 7.44% |
| Parakeet | cpWER AMI IHM | 21.26% |
| Parakeet | Streaming latency | 80ms–1.12s |
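For reading the DER rows: diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The sketch below shows that arithmetic; the function name is hypothetical and the durations in the usage note are made-up illustrative values, not measurements from CALLHOME or AMI.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """DER = (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in the same unit (for example seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech
```

As an illustration, `diarization_error_rate(15.0, 12.0, 12.9, 600.0)` comes out at roughly 0.0665: about 40 seconds of error in a 10-minute call is what a 6.65% DER looks like in practice.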

## Why Osprey specifically

Osprey handles government hold-line phone calls:

- Always 2 speakers (user + IVR/agent)
- English only
- Real-time transcription required for response timing
- CALLHOME is the phone-call benchmark, so Sortformer's 6.65% DER there is the relevant number

## Blockers before adopting

- [ ] **License check:** NVIDIA Open Model License is not Apache/MIT. Confirm commercial inference is permitted. (The NVIDIA OML typically permits it, but the license needs one read-through.)
- [ ] **NeMo dependency:** Both models require NVIDIA NeMo Framework + PyTorch + Cython. This is a heavy dependency, so evaluate it in an isolated Docker container before adding it to cf-voice.
- [ ] **English only:** Acceptable for Osprey; not suitable for ARK-ASR-0.6B's multilingual use case.
- [ ] **4-speaker max:** Fine for phone calls; not suitable for meeting/lecture diarization.
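One way to keep the heavy dependency optional during evaluation is to check whether the modules are installed without importing them. This is a minimal sketch, assuming the usual import names `nemo`, `torch`, and `Cython`; the helper `missing_dependencies` is hypothetical and not part of cf-voice.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed top-level import names for the NeMo stack.
NEMO_STACK = ("nemo", "torch", "Cython")


def missing_dependencies(modules: Iterable[str] = NEMO_STACK) -> List[str]:
    """Return the top-level modules that cannot be found on this interpreter.

    find_spec only locates a top-level module; it does not import it, so the
    check stays cheap even when the packages are large.
    """
    return [name for name in modules if find_spec(name) is None]
```

A backend could call `missing_dependencies()` at startup and disable itself when the list is non-empty, which keeps the base cf-voice install free of NeMo until the container evaluation is done.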

## cf-voice backend comparison context

See cf-voice#5 (cohere-transcribe-diarize) and cf-voice#6 (ARK-ASR-0.6B) for the other candidates. This pair targets a different niche: streaming real-time phone calls rather than offline lecture/meeting transcription.
