Replace stale llama/mistral/phi model references with the models
currently active on the cluster: deepseek-r1 (1.5b, 7b-4bit,
0528-qwen3-8b-gguf), granite-4.1-8b, qwen2.5 (3b, 7b),
capybarahermes-2.5-mistral-7b, and darwin-9b-opus. Update the doc
examples in benchmark_plans.py to match.