eval: NVIDIA Sortformer + Parakeet streaming pair for Osprey phone-call use case #8

Open
opened 2026-06-06 01:48:05 -07:00 by pyr0ball · 0 comments
Owner

What they are

NVIDIA designed these as a complementary streaming pair:

  1. Sortformer (117M) — streaming diarization. Tracks who is speaking in real-time via Arrival-Order Speaker Cache (AOSC). Outputs speaker probability matrices.
  2. Parakeet (600M) — per-speaker streaming ASR. Injects speaker-specific kernels at the encoder layer to produce clean per-speaker transcripts from mixed audio.

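Neither model covers cf-voice's full pipeline on its own: Sortformer emits speaker activity but no text, and Parakeet relies on per-speaker information to separate transcripts from mixed audio. Together they cover streaming diarization and transcription.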
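To make the hand-off between the two models concrete, here is a minimal sketch of how a speaker probability matrix could be turned into per-speaker time segments. It does not call NeMo. The function name `probs_to_segments`, the 0.08 s frame hop, and the 0.5 threshold are all assumptions for illustration and should be checked against the model cards before relying on them.

```python
from typing import List, Tuple

FRAME_SEC = 0.08   # ASSUMED frame hop in seconds; verify against the model card
THRESHOLD = 0.5    # ASSUMED activity threshold

# (speaker index, start in seconds, end in seconds)
Segment = Tuple[int, float, float]


def probs_to_segments(
    probs: List[List[float]],
    frame_sec: float = FRAME_SEC,
    threshold: float = THRESHOLD,
) -> List[Segment]:
    """Convert a frames x speakers probability matrix into active segments.

    Each speaker column is thresholded independently, so overlapping
    speech yields overlapping segments.
    """
    if not probs:
        return []
    segments: List[Segment] = []
    for spk in range(len(probs[0])):
        start = None  # frame index where the current run of activity began
        for t, row in enumerate(probs):
            active = row[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((spk, start * frame_sec, t * frame_sec))
                start = None
        if start is not None:  # speaker still active at the end of the chunk
            segments.append((spk, start * frame_sec, len(probs) * frame_sec))
    segments.sort(key=lambda s: (s[1], s[0]))
    return segments
```

In a real integration the segments (or the raw matrix) would be what the ASR side consumes; this sketch only shows the shape of that intermediate data.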

Performance

| Model | Metric | Score |
|---|---|---|
| Sortformer | DER CALLHOME 2-speaker | 6.65% |
| Sortformer | DER AMI IHM | 16.67% |
| Sortformer | Streaming latency (low) | 1.04s |
| Parakeet | Single-speaker WER avg | 7.44% |
| Parakeet | cpWER AMI IHM | 21.26% |
| Parakeet | Streaming latency | 80ms–1.12s |
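For anyone re-running these numbers on Osprey recordings, the sketch below shows how the three metrics in the table are conventionally defined: DER as error time over reference speech time, WER as word-level edit distance over reference length, and cpWER as the WER of concatenated per-speaker transcripts under the best speaker permutation. The function names are hypothetical, and this is a simplified illustration, not the scoring tooling NVIDIA used (no forgiveness collar, no text normalization).

```python
from itertools import permutations
from typing import Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance divided by reference word count."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def der(missed: float, false_alarm: float, confusion: float, total_speech: float) -> float:
    """Diarization error rate from durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total reference speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def cp_wer(ref: Dict[str, str], hyp: Dict[str, str]) -> float:
    """cpWER: minimum over speaker permutations of the summed edit distance."""
    ref_streams = [text.split() for text in ref.values()]
    hyp_streams = [text.split() for text in hyp.values()]
    # Pad the shorter side with empty streams so every speaker gets a partner.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]
    total = sum(len(r) for r in ref_streams)
    if total == 0:
        raise ValueError("reference transcripts are empty")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best / total
```

With at most four speakers the brute-force permutation search is cheap; a larger speaker count would call for an assignment solver instead.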

Why Osprey specifically

Osprey handles government hold-line phone calls:

  • Always 2 speakers (user + IVR/agent)
  • English only
  • Real-time transcription required for response timing
  • CALLHOME is the phone-call benchmark — Sortformer's 6.65% DER there is the relevant number
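Because Osprey calls always have two speakers while the diarizer exposes up to four speaker slots, one option is to collapse the output to the two most active slots and label each frame. The sketch below assumes a frames x speakers probability matrix and a 0.5 threshold; `top_two_speakers` and `label_frames` are hypothetical helper names, and mapping "first"/"second" to user vs. IVR/agent is left to the caller.

```python
from typing import List, Tuple


def top_two_speakers(probs: List[List[float]]) -> Tuple[int, int]:
    """Return the indices of the two most active speaker slots, in index order."""
    n_speakers = len(probs[0])
    totals = [sum(row[s] for row in probs) for s in range(n_speakers)]
    ranked = sorted(range(n_speakers), key=lambda s: totals[s], reverse=True)
    first, second = sorted(ranked[:2])
    return first, second


def label_frames(probs: List[List[float]], threshold: float = 0.5) -> List[str]:
    """Label each frame as 'first', 'second', 'overlap', or 'silence'.

    'first' is the lower slot index, which under an arrival-order cache
    is ASSUMED to be the speaker heard earlier in the call.
    """
    a, b = top_two_speakers(probs)
    labels = []
    for row in probs:
        a_on, b_on = row[a] >= threshold, row[b] >= threshold
        if a_on and b_on:
            labels.append("overlap")
        elif a_on:
            labels.append("first")
        elif b_on:
            labels.append("second")
        else:
            labels.append("silence")
    return labels
```

Discarding the remaining slots is only safe if the two-speaker assumption actually holds; hold music or a transfer to a second agent could show up as a third slot.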

Blockers before adopting

  • License check: NVIDIA Open Model License is not Apache/MIT. Confirm commercial inference is permitted. (It typically is under NVIDIA OML, but requires one read-through.)
  • NeMo dependency: Both models require NVIDIA NeMo Framework + PyTorch + Cython. Heavy dependency — evaluate in an isolated Docker container before adding to cf-voice.
  • English only: Acceptable for Osprey; not a substitute for ARK-ASR-0.6B (cf-voice#6) where multilingual transcription is needed.
  • 4-speaker max: Fine for phone calls; not suitable for meeting/lecture diarization.
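One way to keep the NeMo stack optional while it is being evaluated is to probe for the heavy packages at startup and only enable this backend when all of them are importable. The sketch below uses only the standard library; `HEAVY_DEPS`, `probe`, and `nemo_backend_available` are hypothetical names, not existing cf-voice code.

```python
import importlib.util
from typing import Dict, Iterable

# Top-level import names for the dependencies listed in the blockers above.
HEAVY_DEPS = ("torch", "nemo", "Cython")


def probe(modules: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level modules can be found, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def nemo_backend_available(modules: Iterable[str] = HEAVY_DEPS) -> bool:
    """True only if every required module is installed in this environment."""
    return all(probe(modules).values())
```

Running `probe(HEAVY_DEPS)` inside the evaluation container and on a stock cf-voice install gives a quick view of what the backend would add.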

cf-voice backend comparison context

See cf-voice#5 (cohere-transcribe-diarize) and cf-voice#6 (ARK-ASR-0.6B) for the other candidates. This pair targets a different niche: streaming real-time phone calls rather than offline lecture/meeting transcription.

Sources:

  • https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
  • https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1
Reference: Circuit-Forge/cf-voice#8