watch: inline speaker-tag approach (Qwen3-ASR fine-tune) — wait for credible benchmark #9
Reference: Circuit-Forge/cf-voice#9
Source: https://huggingface.co/mrfakename/qwen3-asr-1.7b-ami-diarization-fft-r6-20260422
What it is
A community fine-tune of Qwen3-ASR-1.7B that emits inline speaker tags ([S0], [S1], etc.) directly in the transcript output, with no separate diarization pipeline. Apache 2.0.
Approach example:
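A minimal sketch of how a downstream consumer might split tagged output into speaker turns. The sample string is hypothetical (it is not taken from the model card), and `split_speaker_turns` is an illustrative helper, not part of cf-voice or the checkpoint's tooling. It assumes tags follow the `[S<n>]` form described above.

```python
import re
from typing import List, Tuple

# Matches inline speaker tags such as [S0], [S1], [S12].
SPEAKER_TAG = re.compile(r"\[S(\d+)\]")


def split_speaker_turns(transcript: str) -> List[Tuple[int, str]]:
    """Split a tagged transcript into (speaker_index, text) turns.

    Any text before the first tag is dropped, since it has no speaker.
    """
    turns: List[Tuple[int, str]] = []
    matches = list(SPEAKER_TAG.finditer(transcript))
    for i, match in enumerate(matches):
        # A turn runs from the end of this tag to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = transcript[match.end():end].strip()
        if text:
            turns.append((int(match.group(1)), text))
    return turns


# Hypothetical model output, for illustration only.
sample = "[S0] Shall we start? [S1] Yes, go ahead. [S0] Great."
turns = split_speaker_turns(sample)
```

Because the speaker labels arrive in the same decoded string as the words, this parsing step is all that is needed to recover turns; no alignment against a separate diarizer's segments is involved.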
Why not yet
The evaluation published with this checkpoint is not credible for production adoption:
Why the approach is worth watching
Fine-tuning Qwen3-ASR to emit inline speaker tags is architecturally clean: it collapses ASR and diarization into a single decode, without the vocabulary-extension complexity of cohere-transcribe-diarize. When someone fine-tunes this approach with:
...it becomes a serious candidate as the cf-voice default backend.
Watch trigger
Re-evaluate when mrfakename or another contributor publishes a version with AMI/CALLHOME benchmark numbers and an eval set of more than 50 clips.