Dual-GPU / Dual-Inference Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add DUAL_GPU_MODE=ollama|vllm|mixed env var that gates which inference service occupies GPU 1 on dual-GPU systems, plus a first-run download size warning in preflight.
Architecture: Sub-profiles (dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed) are injected alongside --profile dual-gpu by the Makefile based on DUAL_GPU_MODE. The LLM router requires zero changes — _is_reachable() naturally skips backends that aren't running. Preflight gains ollama_research as a tracked service and emits a size warning block.
Tech Stack: Docker Compose profiles, Python (preflight.py), YAML (llm.yaml, compose files), bash (Makefile, manage.sh)
Design doc: docs/plans/2026-02-26-dual-gpu-design.md
Test runner: conda run -n job-seeker python -m pytest tests/ -v
Task 1: Update config/llm.yaml
Files:
- Modify:
config/llm.yaml
Step 1: Add vllm_research backend and update research_fallback_order
Open config/llm.yaml. After the vllm: block, add:
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
Replace research_fallback_order: section with:
research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic
Step 2: Verify YAML parses cleanly
conda run -n job-seeker python -c "import yaml; yaml.safe_load(open('config/llm.yaml'))"
Expected: no output (no error).
Step 3: Run existing llm config test
conda run -n job-seeker python -m pytest tests/test_llm_router.py::test_config_loads -v
Expected: PASS
Step 4: Commit
git add config/llm.yaml
git commit -m "feat: add vllm_research backend and update research_fallback_order"
Task 2: Write failing tests for preflight changes
Files:
- Create:
tests/test_preflight.py
No existing test file for preflight. Write all tests upfront — they fail until Task 3–5 implement the code.
Step 1: Create tests/test_preflight.py
"""Tests for scripts/preflight.py additions: dual-GPU service table, size warning, VRAM check."""
import pytest
from pathlib import Path
from unittest.mock import patch
import yaml
import tempfile
import os
# ── Service table ──────────────────────────────────────────────────────────────
def test_ollama_research_in_services():
"""ollama_research must be in _SERVICES at port 11435."""
from scripts.preflight import _SERVICES
assert "ollama_research" in _SERVICES
_, default_port, env_var, docker_owned, adoptable = _SERVICES["ollama_research"]
assert default_port == 11435
assert env_var == "OLLAMA_RESEARCH_PORT"
assert docker_owned is True
assert adoptable is True
def test_ollama_research_in_llm_backends():
"""ollama_research must be a standalone key in _LLM_BACKENDS (not nested under ollama)."""
from scripts.preflight import _LLM_BACKENDS
assert "ollama_research" in _LLM_BACKENDS
# Should map to the ollama_research llm backend
backend_names = [name for name, _ in _LLM_BACKENDS["ollama_research"]]
assert "ollama_research" in backend_names
def test_vllm_research_in_llm_backends():
"""vllm_research must be registered under vllm in _LLM_BACKENDS."""
from scripts.preflight import _LLM_BACKENDS
assert "vllm" in _LLM_BACKENDS
backend_names = [name for name, _ in _LLM_BACKENDS["vllm"]]
assert "vllm_research" in backend_names
def test_ollama_research_in_docker_internal():
"""ollama_research must map to internal port 11434 (Ollama's container port)."""
from scripts.preflight import _DOCKER_INTERNAL
assert "ollama_research" in _DOCKER_INTERNAL
hostname, port = _DOCKER_INTERNAL["ollama_research"]
assert hostname == "ollama_research"
assert port == 11434 # container-internal port is always 11434
def test_ollama_not_mapped_to_ollama_research_backend():
"""ollama service key must only update the ollama llm backend, not ollama_research."""
from scripts.preflight import _LLM_BACKENDS
ollama_backend_names = [name for name, _ in _LLM_BACKENDS.get("ollama", [])]
assert "ollama_research" not in ollama_backend_names
# ── Download size warning ──────────────────────────────────────────────────────
def test_download_size_remote_profile():
"""Remote profile: only searxng + app, no ollama, no vision, no vllm."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("remote", "ollama")
assert "searxng" in sizes
assert "app" in sizes
assert "ollama" not in sizes
assert "vision_image" not in sizes
assert "vllm_image" not in sizes
def test_download_size_cpu_profile():
"""CPU profile: adds ollama image + llama3.2:3b weights."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("cpu", "ollama")
assert "ollama" in sizes
assert "llama3_2_3b" in sizes
assert "vision_image" not in sizes
def test_download_size_single_gpu_profile():
"""Single-GPU: adds vision image + moondream2 weights."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("single-gpu", "ollama")
assert "vision_image" in sizes
assert "moondream2" in sizes
assert "vllm_image" not in sizes
def test_download_size_dual_gpu_ollama_mode():
"""dual-gpu + ollama mode: no vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "ollama")
assert "vllm_image" not in sizes
def test_download_size_dual_gpu_vllm_mode():
"""dual-gpu + vllm mode: adds ~10 GB vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "vllm")
assert "vllm_image" in sizes
assert sizes["vllm_image"] >= 9000 # at least 9 GB
def test_download_size_dual_gpu_mixed_mode():
"""dual-gpu + mixed mode: also includes vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "mixed")
assert "vllm_image" in sizes
# ── Mixed-mode VRAM warning ────────────────────────────────────────────────────
def test_mixed_mode_vram_warning_triggered():
"""Should return a warning string when GPU 1 has < 12 GB free in mixed mode."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 8.0}, # tight
]
warning = _mixed_mode_vram_warning(gpus, "mixed")
assert warning is not None
assert "8.0" in warning or "GPU 1" in warning
def test_mixed_mode_vram_warning_not_triggered_with_headroom():
"""Should return None when GPU 1 has >= 12 GB free."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 18.0}, # plenty
]
warning = _mixed_mode_vram_warning(gpus, "mixed")
assert warning is None
def test_mixed_mode_vram_warning_not_triggered_for_other_modes():
"""Warning only applies in mixed mode."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 6.0},
]
assert _mixed_mode_vram_warning(gpus, "ollama") is None
assert _mixed_mode_vram_warning(gpus, "vllm") is None
# ── update_llm_yaml with ollama_research ──────────────────────────────────────
def test_update_llm_yaml_sets_ollama_research_url_docker_internal():
"""ollama_research backend URL must be set to ollama_research:11434 when Docker-owned."""
from scripts.preflight import update_llm_yaml
llm_cfg = {
"backends": {
"ollama": {"base_url": "http://old", "type": "openai_compat"},
"ollama_research": {"base_url": "http://old", "type": "openai_compat"},
"vllm": {"base_url": "http://old", "type": "openai_compat"},
"vllm_research": {"base_url": "http://old", "type": "openai_compat"},
"vision_service": {"base_url": "http://old", "type": "vision_service"},
}
}
with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
yaml.dump(llm_cfg, f)
tmp_path = Path(f.name)
ports = {
"ollama": {
"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"
},
"ollama_research": {
"resolved": 11435, "external": False, "env_var": "OLLAMA_RESEARCH_PORT"
},
"vllm": {
"resolved": 8000, "external": False, "env_var": "VLLM_PORT"
},
"vision": {
"resolved": 8002, "external": False, "env_var": "VISION_PORT"
},
}
try:
# Patch LLM_YAML to point at our temp file
with patch("scripts.preflight.LLM_YAML", tmp_path):
update_llm_yaml(ports)
result = yaml.safe_load(tmp_path.read_text())
# Docker-internal: use service name + container port
assert result["backends"]["ollama_research"]["base_url"] == "http://ollama_research:11434/v1"
# vllm_research must match vllm's URL
assert result["backends"]["vllm_research"]["base_url"] == result["backends"]["vllm"]["base_url"]
finally:
tmp_path.unlink()
def test_update_llm_yaml_sets_ollama_research_url_external():
"""When ollama_research is external (adopted), URL uses host.docker.internal:11435."""
from scripts.preflight import update_llm_yaml
llm_cfg = {
"backends": {
"ollama": {"base_url": "http://old", "type": "openai_compat"},
"ollama_research": {"base_url": "http://old", "type": "openai_compat"},
}
}
with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
yaml.dump(llm_cfg, f)
tmp_path = Path(f.name)
ports = {
"ollama": {"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"},
"ollama_research": {"resolved": 11435, "external": True, "env_var": "OLLAMA_RESEARCH_PORT"},
}
try:
with patch("scripts.preflight.LLM_YAML", tmp_path):
update_llm_yaml(ports)
result = yaml.safe_load(tmp_path.read_text())
assert result["backends"]["ollama_research"]["base_url"] == "http://host.docker.internal:11435/v1"
finally:
tmp_path.unlink()
Step 2: Run tests to confirm they all fail
conda run -n job-seeker python -m pytest tests/test_preflight.py -v 2>&1 | head -50
Expected: all FAIL with ImportError or AssertionError — that's correct.
Step 3: Commit failing tests
git add tests/test_preflight.py
git commit -m "test: add failing tests for dual-gpu preflight additions"
Task 3: preflight.py — service table additions
Files:
- Modify:
scripts/preflight.py:46-67 (_SERVICES, _LLM_BACKENDS, _DOCKER_INTERNAL)
Step 1: Update _SERVICES
Find the _SERVICES dict (currently ends at the "ollama" entry). Add ollama_research as a new entry:
_SERVICES: dict[str, tuple[str, int, str, bool, bool]] = {
    # (config_key, default_port, env_var, docker_owned, adoptable)
    "streamlit": ("streamlit_port", 8501, "STREAMLIT_PORT", True, False),
    "searxng": ("searxng_port", 8888, "SEARXNG_PORT", True, True),
    "vllm": ("vllm_port", 8000, "VLLM_PORT", True, True),
    "vision": ("vision_port", 8002, "VISION_PORT", True, True),
    "ollama": ("ollama_port", 11434, "OLLAMA_PORT", True, True),
    "ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
}
Step 2: Update _LLM_BACKENDS
Replace the existing dict:
_LLM_BACKENDS: dict[str, list[tuple[str, str]]] = {
    "ollama": [("ollama", "/v1")],
    "ollama_research": [("ollama_research", "/v1")],
    "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
    "vision": [("vision_service", "")],
}
Step 3: Update _DOCKER_INTERNAL
Add ollama_research entry:
_DOCKER_INTERNAL: dict[str, tuple[str, int]] = {
    "ollama": ("ollama", 11434),
    "ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
    "vllm": ("vllm", 8000),
    "vision": ("vision", 8002),
    "searxng": ("searxng", 8080),
}
Step 4: Run service table tests
conda run -n job-seeker python -m pytest tests/test_preflight.py::test_ollama_research_in_services tests/test_preflight.py::test_ollama_research_in_llm_backends tests/test_preflight.py::test_vllm_research_in_llm_backends tests/test_preflight.py::test_ollama_research_in_docker_internal tests/test_preflight.py::test_ollama_not_mapped_to_ollama_research_backend tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_docker_internal tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_external -v
Expected: all PASS
Step 5: Commit
git add scripts/preflight.py
git commit -m "feat: add ollama_research to preflight service table and LLM backend map"
Task 4: preflight.py — _download_size_mb() pure function
Files:
- Modify:
scripts/preflight.py (add new function after calc_cpu_offload_gb)
Step 1: Add the function
After calc_cpu_offload_gb(), add:
def _download_size_mb(profile: str, dual_gpu_mode: str = "ollama") -> dict[str, int]:
    """
    Return estimated first-run download sizes in MB, keyed by component name.

    Profile-aware: only includes components that will actually be pulled.
    """
    sizes: dict[str, int] = {
        "searxng": 300,
        "app": 1500,
    }
    if profile in ("cpu", "single-gpu", "dual-gpu"):
        sizes["ollama"] = 800
        sizes["llama3_2_3b"] = 2000
    if profile in ("single-gpu", "dual-gpu"):
        sizes["vision_image"] = 3000
        sizes["moondream2"] = 1800
    if profile == "dual-gpu" and dual_gpu_mode in ("vllm", "mixed"):
        sizes["vllm_image"] = 10000
    return sizes
Step 2: Run download size tests
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "download_size" -v
Expected: all PASS
Step 3: Commit
git add scripts/preflight.py
git commit -m "feat: add _download_size_mb() pure function for preflight size warning"
Task 5: preflight.py — VRAM warning, size report block, DUAL_GPU_MODE default
Files:
- Modify:
scripts/preflight.py (three additions to main() and a new helper)
Step 1: Add _mixed_mode_vram_warning() after _download_size_mb()
def _mixed_mode_vram_warning(gpus: list[dict], dual_gpu_mode: str) -> str | None:
    """
    Return a warning string if GPU 1 likely lacks VRAM for mixed mode, else None.

    Only relevant when dual_gpu_mode == 'mixed' and at least 2 GPUs are present.
    """
    if dual_gpu_mode != "mixed" or len(gpus) < 2:
        return None
    free = gpus[1]["vram_free_gb"]
    if free < 12:
        return (
            f"⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {free:.1f} GB free — "
            f"running ollama_research + vllm together may cause OOM. "
            f"Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm."
        )
    return None
Step 2: Run VRAM warning tests
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "vram" -v
Expected: all PASS
Step 3: Wire size warning into main() report block
In main(), find the closing print("╚═...═╝") line. Add the size warning block just before it:
# ── Download size warning ──────────────────────────────────────────────
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
sizes = _download_size_mb(profile, dual_gpu_mode)
total_mb = sum(sizes.values())
print("║")
print("║ Download sizes (first-run estimates)")
print("║ Docker images")
print(f"║ app (Python build) ~{sizes.get('app', 0):,} MB")
if "searxng" in sizes:
    print(f"║ searxng/searxng ~{sizes['searxng']:,} MB")
if "ollama" in sizes:
    shared_note = " (shared by ollama + ollama_research)" if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed") else ""
    print(f"║ ollama/ollama ~{sizes['ollama']:,} MB{shared_note}")
if "vision_image" in sizes:
    print(f"║ vision service ~{sizes['vision_image']:,} MB (torch + moondream)")
if "vllm_image" in sizes:
    print(f"║ vllm/vllm-openai ~{sizes['vllm_image']:,} MB")
print("║ Model weights (lazy-loaded on first use)")
if "llama3_2_3b" in sizes:
    print(f"║ llama3.2:3b ~{sizes['llama3_2_3b']:,} MB → OLLAMA_MODELS_DIR")
if "moondream2" in sizes:
    print(f"║ moondream2 ~{sizes['moondream2']:,} MB → vision container cache")
if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed"):
    print("║ Note: ollama + ollama_research share model dir — no double download")
print(f"║ ⚠ Total first-run: ~{total_mb / 1024:.1f} GB (models persist between restarts)")

# ── Mixed-mode VRAM warning ────────────────────────────────────────────
vram_warn = _mixed_mode_vram_warning(gpus, dual_gpu_mode)
if vram_warn:
    print("║")
    print(f"║ {vram_warn}")
Step 4: Wire DUAL_GPU_MODE default into write_env() block in main()
In main(), find the if not args.check_only: block. After env_updates["PEREGRINE_GPU_NAMES"], add:
# Write DUAL_GPU_MODE default for new 2-GPU setups (don't override user's choice)
if len(gpus) >= 2:
    existing_env: dict[str, str] = {}
    if ENV_FILE.exists():
        for line in ENV_FILE.read_text().splitlines():
            if "=" in line and not line.startswith("#"):
                k, _, v = line.partition("=")
                existing_env[k.strip()] = v.strip()
    if "DUAL_GPU_MODE" not in existing_env:
        env_updates["DUAL_GPU_MODE"] = "ollama"
Step 5: Add import os if not already present at top of file
Check line 1–30 of scripts/preflight.py. import os is already present inside get_cpu_cores() as a local import — move it to the top-level imports block:
import os # add alongside existing stdlib imports
And remove the local import os inside get_cpu_cores().
Step 6: Run all preflight tests
conda run -n job-seeker python -m pytest tests/test_preflight.py -v
Expected: all PASS
Step 7: Smoke-check the preflight report output
conda run -n job-seeker python scripts/preflight.py --check-only
Expected: report includes the Download sizes block near the bottom.
Step 8: Commit
git add scripts/preflight.py
git commit -m "feat: add DUAL_GPU_MODE default, VRAM warning, and download size report to preflight"
Task 6: compose.yml — ollama_research service + profile updates
Files:
- Modify:
compose.yml
Step 1: Update ollama profiles line
Find:
profiles: [cpu, single-gpu, dual-gpu]
Replace with:
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
Step 2: Update vision profiles line
Find:
profiles: [single-gpu, dual-gpu]
Replace with:
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
Step 3: Update vllm profiles line
Find:
profiles: [dual-gpu]
Replace with:
profiles: [dual-gpu-vllm, dual-gpu-mixed]
Step 4: Add ollama_research service
After the closing lines of the ollama service block, add:
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
Step 5: Validate compose YAML
docker compose -f compose.yml config --quiet
Expected: no errors.
Step 6: Commit
git add compose.yml
git commit -m "feat: add ollama_research service and update profiles for dual-gpu sub-profiles"
Task 7: GPU overlay files — compose.gpu.yml and compose.podman-gpu.yml
Files:
- Modify:
compose.gpu.yml
- Modify:
compose.podman-gpu.yml
Step 1: Add ollama_research to compose.gpu.yml
After the ollama: block, add:
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
Step 2: Add ollama_research to compose.podman-gpu.yml
After the ollama: block, add:
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
Step 3: Validate both files
docker compose -f compose.yml -f compose.gpu.yml config --quiet
Expected: no errors.
Step 4: Commit
git add compose.gpu.yml compose.podman-gpu.yml
git commit -m "feat: assign ollama_research to GPU 1 in Docker and Podman GPU overlays"
Task 8: Makefile + manage.sh — DUAL_GPU_MODE injection and help text
Files:
- Modify:
Makefile
- Modify:
manage.sh
Step 1: Update Makefile
After the COMPOSE_OVERRIDE variable, add DUAL_GPU_MODE reading:
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 | grep . || echo ollama)
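One subtlety with the `|| echo ollama` fallback: a pipeline's exit status is that of its last command, so the guard that fails on empty input (such as `grep .`) must sit at the end of the pipeline. If `cut` is last, it exits 0 even when .env has no DUAL_GPU_MODE line and the fallback never fires. A quick demonstration:

```shell
# cut exits 0 on empty input, so the fallback after || is never reached:
printf '' | cut -d= -f2 || echo fallback           # prints nothing
# grep . exits nonzero on empty input, so the fallback fires:
printf '' | cut -d= -f2 | grep . || echo fallback  # prints: fallback
```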
In the GPU overlay block, find:
else
ifneq (,$(findstring gpu,$(PROFILE)))
COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
endif
endif
Replace it with (the else branch itself is unchanged; a new ifeq block is appended after the final endif):
else
ifneq (,$(findstring gpu,$(PROFILE)))
COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
endif
endif
ifeq ($(PROFILE),dual-gpu)
COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
Step 2: Update manage.sh — profiles help block
Find the profiles section in usage():
echo " dual-gpu Ollama + Vision + vLLM on GPU 0+1"
Replace with:
echo " dual-gpu Ollama + Vision on GPU 0; GPU 1 set by DUAL_GPU_MODE"
echo " DUAL_GPU_MODE=ollama (default) ollama_research on GPU 1"
echo " DUAL_GPU_MODE=vllm vllm on GPU 1"
echo " DUAL_GPU_MODE=mixed both on GPU 1 (VRAM-split)"
Step 3: Verify Makefile parses
make help
Expected: help table prints cleanly, no make errors.
Step 4: Verify manage.sh help
./manage.sh help
Expected: new dual-gpu description appears in profiles section.
Step 5: Commit
git add Makefile manage.sh
git commit -m "feat: inject DUAL_GPU_MODE sub-profile in Makefile; update manage.sh help"
Task 9: Integration smoke test
Goal: Verify the full chain works for DUAL_GPU_MODE=ollama without actually starting Docker (dry-run compose config check).
Step 1: Write DUAL_GPU_MODE=ollama to .env temporarily
echo "DUAL_GPU_MODE=ollama" >> .env
Step 2: Dry-run compose config for dual-gpu + dual-gpu-ollama
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-ollama config 2>&1 | grep -E "^ [a-z]|image:|ports:"
Expected output includes:
- ollama service with port 11434
- ollama_research service with port 11435
- vision service
- searxng service
- No vllm service
Step 3: Dry-run for DUAL_GPU_MODE=vllm
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-vllm config 2>&1 | grep -E "^ [a-z]|image:|ports:"
Expected:
- ollama service (port 11434)
- vllm service (port 8000)
- No ollama_research service
Step 4: Run full test suite
conda run -n job-seeker python -m pytest tests/ -v
Expected: all existing tests PASS, all new preflight tests PASS.
Step 5: Clean up .env test entry
# Remove the test DUAL_GPU_MODE line (preflight will re-write it correctly on next run)
sed -i '/^DUAL_GPU_MODE=/d' .env
Step 6: Final commit
git add .env # in case preflight rewrote it during testing
git commit -m "feat: dual-gpu DUAL_GPU_MODE complete — ollama/vllm/mixed GPU 1 selection"