eval: ARK-ASR-0.6B as lightweight CPU-capable ASR backend (Linnet students use case) #6

Open
opened 2026-06-06 01:37:49 -07:00 by pyr0ball · 0 comments
Owner

Source: https://huggingface.co/AutoArk-AI/ARK-ASR-0.6B

What it is

ARK-ASR-0.6B is a 0.6B parameter ASR model using a Whisper-style encoder + RoPE + MLP adapter + Qwen2 decoder. Trained with teacher-data adaptation and online policy distillation (OPD).

  • License: Apache 2.0 (no gating, no HF_TOKEN required)
  • WER: 6.55% avg across 7 English benchmarks (including AMI, Earnings22, GigaSpeech, LibriSpeech, SPGISpeech, VoxPopuli)
  • CER: 4.30% avg across 3 Chinese benchmarks
  • Languages: 19 (EN, ZH, DE, JA, FR, KO, ES, PL, IT, RO, HU, CS, NL, FI, HR, SK, SL, ET, LT)
  • CPU fallback: Yes (float32 — no CUDA required)
  • Input: 16 kHz mono audio
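
For anyone reproducing the WER figure locally, here is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the reference length. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions for illustration; published benchmarks typically apply a fuller text normalizer, so numbers from this sketch will not match the model card exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for local spot checks. Normalization here is only
    lowercasing and whitespace splitting (an assumption, not the benchmark's).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

CER is the same calculation over characters instead of words.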

Why this matters for cf-voice / Linnet

Linnet delegates its full ASR pipeline to cf-voice (requirements.txt pulls cf-voice from Forgejo). The primary driver here is students using Linnet as a tone/context aid in lectures and group discussions.

Student context:

  • No GPU is the default on student laptops — CPU fallback is essential
  • School audio cannot go to the cloud in most institutional settings — local inference required
  • 19 languages covers international cohorts
  • Compact enough for a cf-voice "lite" install profile

At 0.6B params (about 40% of Whisper Large v3 at 1.5B), this fits comfortably in RAM and runs on CPU at usable speeds.
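
As a rough check on the RAM claim, weight memory can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch: the helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only, not activations, decoder cache, or framework overhead.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "float32") -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


ark_fp32 = weight_memory_gb(0.6e9)      # float32 CPU path from this issue
whisper_fp32 = weight_memory_gb(1.5e9)  # Whisper Large v3, same dtype
print(f"ARK-ASR-0.6B: ~{ark_fp32:.1f} GB, Whisper Large v3: ~{whisper_fp32:.1f} GB")
```

Under these assumptions the float32 weights come to about 2.4 GB, versus about 6 GB for a 1.5B-parameter model in the same dtype.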

Integration approach

Add as a named backend in cf-voice alongside the existing pyannote pipeline:

class ARKASRBackend:
    """Lightweight ASR via AutoArk-AI/ARK-ASR-0.6B.
    CPU-capable (float32 fallback). 19 languages.
    Requires trust_remote_code=True — pin to audited commit before shipping.
    """

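One possible shape for the "named backend" idea is a small registry keyed by backend name. Everything below is a sketch under assumptions: the `ASRBackend` protocol, `register_backend`, `get_backend`, the `"ark-asr-0.6b"` key, and the `transcribe` signature are all hypothetical and not taken from the cf-voice codebase. Model loading and inference are deliberately left unimplemented.

```python
from typing import Callable, Dict, Protocol


class ASRBackend(Protocol):
    """Hypothetical interface; cf-voice's real backend API may differ."""

    name: str

    def transcribe(self, audio_path: str, language: str = "en") -> str: ...


# Maps a backend name to a zero-argument factory that builds the backend.
_BACKENDS: Dict[str, Callable[[], ASRBackend]] = {}


def register_backend(name: str):
    """Class decorator that records a backend under the given name."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


def get_backend(name: str) -> ASRBackend:
    """Instantiate a registered backend, or fail with the available names."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        available = sorted(_BACKENDS)
        raise ValueError(f"unknown ASR backend {name!r}; available: {available}") from None
    return factory()


@register_backend("ark-asr-0.6b")
class ARKASRBackend:
    name = "ark-asr-0.6b"

    def __init__(self) -> None:
        self._model = None  # would be loaded lazily on first transcribe()

    def transcribe(self, audio_path: str, language: str = "en") -> str:
        # Placeholder: real code would load the pinned model and run inference.
        raise NotImplementedError("inference is out of scope for this sketch")
```

A "lite" install profile could then select the backend by name, for example `get_backend("ark-asr-0.6b")`, without importing GPU-only dependencies.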
Security flag — action required before shipping

The model requires trust_remote_code=True, meaning the custom modeling code in the Hugging Face repo (the code that wires the Qwen2 decoder into the model) is downloaded and executed at load time. Before this backend ships to end users:

  • Audit AutoArk-AI/ARK-ASR-0.6B custom modeling code
  • Pin to a specific commit hash in the install, not @main
  • Document the audit result in the backend module docstring
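
The pinning step could be enforced in code along these lines. This is a minimal sketch with assumptions: the helper names `pinned_load_kwargs` and `load_model` are hypothetical, and `AutoModel` is assumed to be the right entry point (the model card may call for a different class). `revision` and `trust_remote_code` are existing `from_pretrained` keyword arguments in `transformers`. The check accepts only a full 40-character commit hash, so a moving ref such as `main` is rejected before any remote code is fetched.

```python
import re

MODEL_ID = "AutoArk-AI/ARK-ASR-0.6B"
_FULL_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained kwargs, refusing branch names and short hashes."""
    if not _FULL_COMMIT_SHA.fullmatch(revision):
        raise ValueError(
            f"revision must be a full 40-character commit hash, got {revision!r}"
        )
    return {"revision": revision, "trust_remote_code": True}


def load_model(audited_revision: str):
    """Load the model at the audited commit only.

    Assumption: AutoModel is the correct loader for this repo.
    The import is deferred so this module imports without transformers.
    """
    from transformers import AutoModel

    return AutoModel.from_pretrained(MODEL_ID, **pinned_load_kwargs(audited_revision))
```

The audited hash itself would live next to the audit note in the backend module docstring, so that the two are updated together.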

Comparison to alternatives

| Model | Params | WER (EN) | CPU? | License | Gated? |
|---|---|---|---|---|---|
| ARK-ASR-0.6B | 0.6B | 6.55% | Yes | Apache 2.0 | No |
| Qwen3-ASR-0.6B | 0.6B | 6.93% | Yes | Apache 2.0 | No |
| Whisper Large v3 | 1.5B | ~3-4% | Slow | MIT | No |
| cohere-transcribe-diarize | ~? | TBD | No (CUDA only) | Apache 2.0 | No |

ARK-ASR-0.6B is the best fit for the free-tier / no-GPU / student path. cohere-transcribe-diarize is better for the paid/GPU path (adds diarization too — see cf-voice#5).

Reference: Circuit-Forge/cf-voice#6