feat: add VLM vision model to NAS and bench_models.yaml (moondream2 or SmolVLM) #44

Status: Open · opened 2026-04-09 19:38:53 -07:00 by pyr0ball (Owner) · 0 comments

## What

cf-vision currently only supports SigLIP (classify + embed). To enable caption-quality benchmarking and real VQA routing, we need a generative VLM on the NAS.

## Candidates

- `vikhyatk/moondream2` — ~2 GB fp16, fast, good for documents; VLMBackend already supports it
- `HuggingFaceTB/SmolVLM-Instruct` (256M) — ~500 MB, extremely fast, good routing baseline

## Tasks

- [ ] Download the chosen model to `/Library/Assets/LLM/vision/` on the NAS
- [ ] Add a `moondream2` (or SmolVLM) entry to `bench_models.yaml` with `service: cf-vision`
- [ ] Add a `vision-caption` task to `bench_tasks.yaml` with `quality: pattern_match`, checking that the response contains image-description vocabulary
- [ ] Update the `navi.yaml` profile `cf-vision` managed block to support a `--backend vlm` variant (likely a second service entry on port 8007)
- [ ] Add a VLM VRAM estimate to `_VRAM_TABLE` in `vlm.py` if not already present
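A rough sketch of the two benchmark entries above — field names other than `service: cf-vision` and `quality: pattern_match` are guesses at this repo's schema, and the patterns are only illustrative vocabulary:

```yaml
# bench_models.yaml — hypothetical entry shape
moondream2:
  service: cf-vision
  backend: vlm                     # assumed key mirroring the --backend vlm flag
  path: /Library/Assets/LLM/vision/moondream2
  vram_gb: 2.0                     # ~2 GB fp16, per the candidate notes above

# bench_tasks.yaml — hypothetical task shape
vision-caption:
  quality: pattern_match
  patterns:                        # image-description vocabulary to look for
    - "\\b(image|photo|picture|scene)\\b"
    - "\\b(shows|depicts|contains)\\b"
```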

## Notes

SigLIP (port 8006) handles classify+embed; the VLM backend (port 8007) would handle caption+VQA. Run them as separate services so they don't compete for VRAM.
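One possible shape for the split-service `navi.yaml` managed block — every key here is an assumption about the profile schema except the ports and the `--backend` flag named in the tasks:

```yaml
# navi.yaml — hypothetical cf-vision managed block with both backends
cf-vision:
  command: cf-vision serve --backend siglip --port 8006   # classify + embed
cf-vision-vlm:
  command: cf-vision serve --backend vlm --port 8007      # caption + VQA
```

Keeping the two as distinct entries lets the supervisor start and stop each backend independently, which is what allows them to avoid competing for VRAM.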

pyr0ball added the `enhancement` label 2026-04-09 19:38:53 -07:00
Reference: Circuit-Forge/circuitforge-core#44