watch: SoulX-Transcriber — Chinese diarization leader, not practical yet for EN/low-resource #7

Open

opened 2026-06-06 01:41:00 -07:00 by pyr0ball · 0 comments

Source: https://huggingface.co/Soul-AILab/SoulX-Transcriber

What it is

SoulX-Transcriber is a unified end-to-end diarization + transcription model based on Qwen3-Omni-30B-A3B (MoE, Apache 2.0). It handles speaker attribution and timestamped segmentation in a single pass.

Why not yet

| Blocker | Detail |
|---|---|
| Size | 35B total weights (~70 GB in BF16); all weights load into VRAM even though only 3B activate per pass |
| WER | 14-25% on benchmark datasets; ARK-ASR-0.6B (0.6B params) beats it at 6.55% |
| Languages | Chinese + English only |
| CPU | No fallback; NVIDIA GPU required |
| Stack | vLLM-omni + ms-swift + Python 3.12 (non-standard, not compatible with existing cf-orch GPU worker setup) |
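
The ~70 GB figure in the Size row is just total parameter count times bytes per parameter. A minimal sketch of that arithmetic follows. It covers weights only (no KV cache, activations, or runtime overhead), and the 4-bit row is a hypothetical quantization used for illustration, not a release that exists as far as this issue knows.

```python
def weight_memory_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    1e9 params * bytes_per_param / 1e9 bytes-per-GB, so the factors of 1e9 cancel.
    """
    return total_params_billions * bytes_per_param


TOTAL_PARAMS_B = 35.0   # total weights, from the Size row
ACTIVE_PARAMS_B = 3.0   # active per pass, from the Size row

# BF16 stores each parameter in 2 bytes.
bf16_gb = weight_memory_gb(TOTAL_PARAMS_B, 2.0)

# Hypothetical 4-bit quantization (0.5 bytes per parameter); an assumption, not a known release.
int4_gb = weight_memory_gb(TOTAL_PARAMS_B, 0.5)

# What the footprint would be if only the active parameters had to be resident.
# With MoE routing they do not: every expert must be loaded, so this is not achievable as-is.
active_only_bf16_gb = weight_memory_gb(ACTIVE_PARAMS_B, 2.0)

print(f"BF16, all weights:    {bf16_gb:.1f} GB")
print(f"4-bit, all weights:   {int4_gb:.1f} GB (hypothetical)")
print(f"BF16, active only:    {active_only_bf16_gb:.1f} GB (not achievable with MoE routing)")
```

The gap between the first and last numbers is the core of the blocker: compute cost scales with the 3B active parameters, but VRAM cost scales with the full 35B.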

What's genuinely good

DER 2.89% on AISHELL-4 (Chinese meeting transcription) is state-of-the-art. If CF ever targets Chinese-language institutional or enterprise users — education, corporate meetings — this is worth revisiting on high-VRAM hardware (A100/H100 class).

cf-voice backend comparison

| Model | WER (EN) | DER | Size | CPU? | Languages |
|---|---|---|---|---|---|
| ARK-ASR-0.6B (cf-voice#6) | 6.55% | n/a | 0.6B | Yes | 19 |
| cohere-transcribe-diarize (cf-voice#5) | TBD | TBD | ~? | No | TBD |
| SoulX-Transcriber | 14-25% | 2.89-11.67% | 35B total | No | ZH + EN |
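
The comparison above can also be read as a filter over candidate backends. The sketch below encodes the three rows and screens them against a WER budget and a CPU requirement. The `Backend` class, the `candidates` helper, and the threshold values are hypothetical illustrations, not part of cf-voice; TBD cells are stored as `None` and treated as "not yet comparable".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Backend:
    name: str
    wer_en_pct: Optional[Tuple[float, float]]  # (best, worst) EN WER; None means TBD
    runs_on_cpu: bool
    total_params_b: Optional[float]            # None means unknown


# Values copied from the comparison table; nothing here is measured by this sketch.
BACKENDS = [
    Backend("ARK-ASR-0.6B", (6.55, 6.55), True, 0.6),
    Backend("cohere-transcribe-diarize", None, False, None),
    Backend("SoulX-Transcriber", (14.0, 25.0), False, 35.0),
]


def candidates(backends: List[Backend], max_wer_pct: float, require_cpu: bool) -> List[str]:
    """Names of backends whose worst-case EN WER is known and within budget."""
    selected = []
    for backend in backends:
        if backend.wer_en_pct is None:
            continue  # TBD rows cannot be compared yet
        if backend.wer_en_pct[1] > max_wer_pct:
            continue  # worst-case WER exceeds the budget
        if require_cpu and not backend.runs_on_cpu:
            continue  # caller needs a CPU fallback
        selected.append(backend.name)
    return selected


# Hypothetical budget: 10% WER with a CPU fallback required.
print(candidates(BACKENDS, max_wer_pct=10.0, require_cpu=True))
```

Under any budget tighter than SoulX-Transcriber's worst-case 25% WER, or with a CPU requirement, it drops out, which matches the "not yet" conclusion of this issue.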

Watch trigger

Re-evaluate if: (a) a distilled or quantized release brings the total weights under 8B params (the model already activates only 3B per pass, so total size is the constraint), or (b) CF adds a Chinese-language product tier.
