watch: SoulX-Transcriber — Chinese diarization leader, not practical yet for EN/low-resource #7

Open
opened 2026-06-06 01:41:00 -07:00 by pyr0ball · 0 comments

Source: https://huggingface.co/Soul-AILab/SoulX-Transcriber

What it is

SoulX-Transcriber is a unified end-to-end diarization + transcription model based on Qwen3-Omni-30B-A3B (MoE, Apache 2.0). It handles speaker attribution and timestamped segmentation in a single pass.

Why not yet

| Blocker | Detail |
|---|---|
| Size | 35B total weights (~70GB BF16); all weights load into VRAM even though only 3B activate per pass |
| WER | 14-25% on benchmark datasets; ARK-ASR-0.6B (0.6B params) beats it at 6.55% |
| Languages | Chinese + English only |
| CPU | No fallback; NVIDIA GPU required |
| Stack | vLLM-omni + ms-swift + Python 3.12 (non-standard, not compatible with existing cf-orch GPU worker setup) |
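
The ~70GB figure in the Size row is parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 2 bytes per BF16 weight and counting weights only (KV cache, activations, and runtime overhead are ignored; the function name is hypothetical and not part of any cf-voice code):

```python
def weight_memory_gb(total_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Assumption: every parameter is stored at the same precision.
    """
    return total_params_billions * bytes_per_param


# All 35B weights must be resident, even though only ~3B are active per pass.
bf16_gb = weight_memory_gb(35)          # BF16: 2 bytes per parameter
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```

Real VRAM needs would be higher than this weight-only estimate, which is why the issue points at A100/H100-class hardware.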

What's genuinely good

DER 2.89% on AISHELL-4 (Chinese meeting transcription) is state-of-the-art. If CF ever targets Chinese-language institutional or enterprise users — education, corporate meetings — this is worth revisiting on high-VRAM hardware (A100/H100 class).

cf-voice backend comparison

| Model | WER (EN) | DER | Size | CPU? | Languages |
|---|---|---|---|---|---|
| ARK-ASR-0.6B (cf-voice#6) | 6.55% | n/a | 0.6B | Yes | 19 |
| cohere-transcribe-diarize (cf-voice#5) | TBD | TBD | ~? | No | TBD |
| SoulX-Transcriber | 14-25% | 2.89-11.67% | 35B total | No | ZH + EN |

Watch trigger

Re-evaluate if: (a) a distilled/quantized version releases under 8B active params, or (b) CF adds a Chinese-language product tier.
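
The two trigger conditions can be written as a simple predicate. This is a sketch that follows the wording above literally; the `ModelRelease` type, its field names, and `should_reevaluate` are hypothetical and exist only for illustration:

```python
from dataclasses import dataclass


@dataclass
class ModelRelease:
    """Hypothetical description of a SoulX-Transcriber release."""
    active_params_billions: float
    is_distilled_or_quantized: bool


def should_reevaluate(release: ModelRelease, cf_has_chinese_tier: bool) -> bool:
    """Return True if either watch trigger from this issue fires."""
    # (a) a distilled/quantized version under 8B active params
    size_trigger = (
        release.is_distilled_or_quantized
        and release.active_params_billions < 8
    )
    # (b) CF adds a Chinese-language product tier
    return size_trigger or cf_has_chinese_tier


# The current release is neither distilled nor quantized, so (a) does not fire.
current = ModelRelease(active_params_billions=3.0, is_distilled_or_quantized=False)
print(should_reevaluate(current, cf_has_chinese_tier=False))
```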

Reference: Circuit-Forge/cf-voice#7