eval: ARK-ASR-0.6B as lightweight CPU-capable ASR backend (Linnet students use case) #6

Open

opened 2026-06-06 01:37:49 -07:00 by pyr0ball · 0 comments

Source: https://huggingface.co/AutoArk-AI/ARK-ASR-0.6B

What it is

ARK-ASR-0.6B is a 0.6B-parameter ASR model that combines a Whisper-style encoder with RoPE, an MLP adapter, and a Qwen2 decoder. It was trained with teacher-data adaptation and online policy distillation (OPD).

License: Apache 2.0 (no gating, no HF_TOKEN required)
WER: 6.55% avg across 7 English benchmarks (AMI, Earnings22, GigaSpeech, LibriSpeech, SPGISpeech, VoxPopuli; only six names are listed here, so confirm the seventh against the model card)
CER: 4.30% avg across 3 Chinese benchmarks
Languages: 19 (EN, ZH, DE, JA, FR, KO, ES, PL, IT, RO, HU, CS, NL, FI, HR, SK, SL, ET, LT)
CPU fallback: Yes (float32 — no CUDA required)
Input: 16 kHz mono audio
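
Since the model expects 16 kHz mono input, a backend would likely want to reject or convert anything else before inference. A minimal sketch of the check, using only the standard-library `wave` module; the function name `check_wav_format` and the two constants are hypothetical and not part of cf-voice:

```python
import wave

EXPECTED_RATE_HZ = 16_000  # input requirement listed above
EXPECTED_CHANNELS = 1      # mono


def check_wav_format(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file is 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

This only inspects PCM WAV headers. Resampling and downmixing of other formats would still need a real audio library.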

Why this matters for cf-voice / Linnet

Linnet delegates its full ASR pipeline to cf-voice (requirements.txt pulls cf-voice from Forgejo). The primary driver here is students using Linnet as a tone/context aid in lectures and group discussions.

Student context:

No GPU is the default on student laptops — CPU fallback is essential
School audio cannot go to the cloud in most institutional settings — local inference required
19 languages cover international cohorts
Compact enough for a cf-voice "lite" install profile

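At 0.6B params (about 40% of Whisper Large v3 at 1.5B), the weights come to roughly 2.4 GB in float32, which fits comfortably in RAM on a typical laptop. CPU speed is expected to be usable, but that is one of the things this eval needs to confirm.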
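
The RAM claim can be sanity-checked with back-of-envelope arithmetic: parameter count times bytes per parameter. The sketch below uses the rounded parameter counts quoted in this issue and counts weights only. Activations, decoder cache, and runtime overhead come on top, so treat the result as a lower bound.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_footprint_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


ark_gb = weight_footprint_gb(0.6e9)      # float32 is the CPU path
whisper_gb = weight_footprint_gb(1.5e9)  # same dtype, for comparison
print(f"ARK-ASR-0.6B: ~{ark_gb:.1f} GB, Whisper Large v3: ~{whisper_gb:.1f} GB")
```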

Integration approach

Add as a named backend in cf-voice alongside the existing pyannote pipeline:

class ARKASRBackend:
    """Lightweight ASR via AutoArk-AI/ARK-ASR-0.6B.
    CPU-capable (float32 fallback). 19 languages.
    Requires trust_remote_code=True — pin to audited commit before shipping.
    """

Security flag — action required before shipping

The model requires trust_remote_code=True, which means custom modeling code from the Hugging Face repo (including the code that wires in the Qwen2 decoder) is downloaded and executed at load time. Before this backend ships to end users:

Audit AutoArk-AI/ARK-ASR-0.6B custom modeling code
Pin to a specific commit hash in the install, not @main
Document the audit result in the backend module docstring
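
The pinning item in the checklist above could be enforced in code, not only by convention. A minimal sketch, assuming the pin is a full 40-character git commit hash as used by Hugging Face Hub repos; the function name `require_pinned_revision` is hypothetical:

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def require_pinned_revision(revision: str) -> str:
    """Return the revision unchanged if it is a full commit hash; raise otherwise."""
    if not _FULL_SHA.fullmatch(revision):
        raise ValueError(
            f"revision {revision!r} is not a full commit hash; "
            "branch names such as 'main' can move to unaudited code"
        )
    return revision
```

The validated value would then be passed as `revision=` to `from_pretrained`, so that a later push to the model repo cannot change what runs on a student's machine.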

Comparison to alternatives

| Model | Params | WER (EN) | CPU? | License | Gated? |
|---|---|---|---|---|---|
| ARK-ASR-0.6B | 0.6B | 6.55% | Yes | Apache 2.0 | No |
| Qwen3-ASR-0.6B | 0.6B | 6.93% | Yes | Apache 2.0 | No |
| Whisper Large v3 | 1.5B | ~3-4% | Slow | MIT | No |
| cohere-transcribe-diarize | ~? | TBD | No (CUDA only) | Apache 2.0 | No |

ARK-ASR-0.6B is the best fit for the free-tier / no-GPU / student path. cohere-transcribe-diarize is better for the paid/GPU path (adds diarization too — see cf-voice#5).
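
To compare the candidates on Linnet's own lecture and discussion audio, and not only on published benchmarks, the eval needs a WER measurement. A minimal sketch of word-level WER as edit distance divided by reference length follows. It applies no text normalization (casing, punctuation, number formatting), so its numbers will not be directly comparable to the benchmark figures in the table, which typically are normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words.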
