watch: inline speaker-tag approach (Qwen3-ASR fine-tune) — wait for credible benchmark #9


Open

opened 2026-06-06 01:48:05 -07:00 by pyr0ball (Owner) · 0 comments

Source: https://huggingface.co/mrfakename/qwen3-asr-1.7b-ami-diarization-fft-r6-20260422

What it is

A community fine-tune of Qwen3-ASR-1.7B that emits inline speaker tags ([S0], [S1], etc.) directly in the transcript output, with no separate diarization pipeline. Licensed under Apache 2.0.

Example of the output format:

[S0] Welcome to the meeting. [S1] Thanks for having me. [S0] Let's get started.
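
For downstream use, tagged output like this would need to be split back into speaker turns. A minimal sketch, assuming tags always take the `[S<number>]` form shown above; `parse_tagged_transcript` is a hypothetical helper for illustration, not part of the checkpoint or of cf-voice:

```python
import re
from typing import List, Tuple

# Assumption: speaker tags look exactly like [S0], [S1], ... as in the example above.
TAG_RE = re.compile(r"\[S(\d+)\]")


def parse_tagged_transcript(text: str) -> List[Tuple[str, str]]:
    """Split a tagged transcript into (speaker, utterance) turns.

    Any text before the first tag is dropped, since it has no speaker.
    """
    turns = []
    matches = list(TAG_RE.finditer(text))
    for i, match in enumerate(matches):
        # The utterance runs from the end of this tag to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        utterance = text[match.end():end].strip()
        turns.append((f"S{match.group(1)}", utterance))
    return turns


example = "[S0] Welcome to the meeting. [S1] Thanks for having me. [S0] Let's get started."
for speaker, utterance in parse_tagged_transcript(example):
    print(speaker, "|", utterance)
```

Note that this yields speaker-labelled text only. The tag format shown carries no timestamps, so anything time-based would have to come from elsewhere.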

Why not yet

The evaluation published with this checkpoint is too thin to justify production adoption:

- Held-out test set of only 6 clips
- Predicted speaker count matched the reference on only 3 of 6 clips
- No WER (word error rate) or DER (diarization error rate) reported; the only metric is turn-count delta (-1.67 avg vs. reference)
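
To show why these two numbers say little on their own, here is a rough sketch of how they could be computed from tagged transcripts. The exact definitions used in the model card are not stated, so this assumes turn-count delta means predicted turns minus reference turns, averaged over clips; `summarize` and its field names are hypothetical:

```python
import re
from statistics import mean

# Assumption: speaker tags look like [S0], [S1], ...
TAG_RE = re.compile(r"\[S(\d+)\]")


def turn_count(text: str) -> int:
    # Every tag is counted as one turn (consecutive same-speaker tags are not merged).
    return len(TAG_RE.findall(text))


def speaker_count(text: str) -> int:
    # Number of distinct speaker ids that appear in the transcript.
    return len(set(TAG_RE.findall(text)))


def summarize(pairs):
    """pairs: list of (reference, hypothesis) tagged transcripts, one per clip."""
    deltas = [turn_count(hyp) - turn_count(ref) for ref, hyp in pairs]
    matches = sum(speaker_count(hyp) == speaker_count(ref) for ref, hyp in pairs)
    return {
        "clips": len(pairs),
        "avg_turn_delta": mean(deltas),
        "speaker_count_matches": matches,
    }


pairs = [
    # Hypothesis misses one turn boundary.
    ("[S0] hi [S1] hello [S0] bye", "[S0] hi [S1] hello bye"),
    # Hypothesis gets every word wrong, yet scores perfectly on both metrics.
    ("[S0] a [S1] b", "[S0] x [S1] y"),
]
print(summarize(pairs))
```

The second pair is the point: neither metric looks at the words or at who actually said them, so a good score here does not imply a usable transcript.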

Why the approach is worth watching

Fine-tuning Qwen3-ASR to emit inline speaker tags is architecturally clean: it collapses ASR and diarization into a single decode pass, without the vocabulary-extension complexity of cohere-transcribe-diarize. When someone publishes a fine-tune using this approach with:

- A proper held-out benchmark (AMI full test set, CALLHOME, DIHARD)
- Published WER + DER numbers
- A checkpoint shown to generalize beyond 4 speakers and the current 6-clip test set

...it becomes a serious candidate for the cf-voice default backend.
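
Of the metrics asked for above, WER is the one that can be checked directly against tagged output by stripping the speaker tags first. A minimal sketch, assuming lowercase whitespace tokenization with punctuation removed (published benchmarks typically use a fuller text normalizer); `wer` here is an illustrative helper, not a library call:

```python
import re

# Assumption: speaker tags look like [S0], [S1], ...
TAG_RE = re.compile(r"\[S\d+\]")


def normalize(text: str) -> list:
    """Strip speaker tags, lowercase, drop punctuation, split on whitespace."""
    text = TAG_RE.sub(" ", text)
    text = re.sub(r"[^\w\s']", "", text.lower())
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference contains no words")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "[S0] Welcome to the meeting. [S1] Thanks for having me."
hypothesis = "[S0] Welcome to the meeting. [S0] Thanks for having"
print(wer(reference, hypothesis))  # one deleted word out of eight
```

Because tags are stripped, this score ignores speaker attribution entirely, which is why WER and DER need to be reported together.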

Watch trigger

Re-evaluate when mrfakename or another contributor publishes a version with AMI/CALLHOME benchmark numbers and an eval set of more than 50 clips.
