feat: add VLM vision model to NAS and bench_models.yaml (moondream2 or SmolVLM) #44

Status: Open · opened 2026-04-09 19:38:53 -07:00 by pyr0ball (Owner) · 0 comments

## What

cf-vision currently only supports SigLIP (classify + embed). To enable caption-quality benchmarking and real VQA routing, we need a generative VLM on the NAS.

## Candidates

- `vikhyatk/moondream2` — ~2 GB fp16, fast, good for documents; VLMBackend already supports it
- `HuggingFaceTB/SmolVLM-Instruct` (256M) — ~500 MB, extremely fast, good routing baseline

## Tasks

- [ ] Download the chosen model to `/Library/Assets/LLM/vision/` on the NAS
- [ ] Add a `moondream2` (or SmolVLM) entry to `bench_models.yaml` with `service: cf-vision`
- [ ] Add a `vision-caption` task to `bench_tasks.yaml` with `quality: pattern_match`, checking that the response contains image-description vocabulary
- [ ] Update the `navi.yaml` profile `cf-vision` managed block to support a `--backend vlm` variant (likely a second service entry on port 8007)
- [ ] Add a VLM VRAM estimate to `_VRAM_TABLE` in `vlm.py` if not already present
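A rough sketch of the two benchmark entries above — field names other than `service: cf-vision` and `quality: pattern_match` are guesses at this repo's schema, and the patterns are only illustrative vocabulary:

```yaml
# bench_models.yaml — hypothetical entry shape
moondream2:
  service: cf-vision
  backend: vlm                     # assumed key mirroring the --backend vlm flag
  path: /Library/Assets/LLM/vision/moondream2
  vram_gb: 2.0                     # ~2 GB fp16, per the candidate notes above

# bench_tasks.yaml — hypothetical task shape
vision-caption:
  quality: pattern_match
  patterns:                        # image-description vocabulary to look for
    - "\\b(image|photo|picture|scene)\\b"
    - "\\b(shows|depicts|contains)\\b"
```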

## Notes

SigLIP (port 8006) handles classify+embed; the VLM backend (port 8007) would handle caption+VQA. Run them as separate services so they don't compete for VRAM.
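One possible shape for the split-service `navi.yaml` managed block — every key here is an assumption about the profile schema except the ports and the `--backend` flag named in the tasks:

```yaml
# navi.yaml — hypothetical cf-vision managed block with both backends
cf-vision:
  command: cf-vision serve --backend siglip --port 8006   # classify + embed
cf-vision-vlm:
  command: cf-vision serve --backend vlm --port 8007      # caption + VQA
```

Keeping the two as distinct entries lets the supervisor start and stop each backend independently, which is what allows them to avoid competing for VRAM.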

pyr0ball added the `enhancement` label 2026-04-09 19:38:53 -07:00
Reference: Circuit-Forge/circuitforge-core#44