peregrine/docs/plans/2026-02-26-dual-gpu-plan.md

811 lines
26 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Dual-GPU / Dual-Inference Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add `DUAL_GPU_MODE=ollama|vllm|mixed` env var that gates which inference service occupies GPU 1 on dual-GPU systems, plus a first-run download size warning in preflight.
**Architecture:** Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected alongside `--profile dual-gpu` by the Makefile based on `DUAL_GPU_MODE`. The LLM router requires zero changes — `_is_reachable()` naturally skips backends that aren't running. Preflight gains `ollama_research` as a tracked service and emits a size warning block.
**Tech Stack:** Docker Compose profiles, Python (preflight.py), YAML (llm.yaml, compose files), bash (Makefile, manage.sh)
**Design doc:** `docs/plans/2026-02-26-dual-gpu-design.md`
**Test runner:** `conda run -n job-seeker python -m pytest tests/ -v`
---
### Task 1: Update `config/llm.yaml`
**Files:**
- Modify: `config/llm.yaml`
**Step 1: Add `vllm_research` backend and update `research_fallback_order`**
Open `config/llm.yaml`. After the `vllm:` block, add:
```yaml
vllm_research:
api_key: ''
base_url: http://host.docker.internal:8000/v1
enabled: true
model: __auto__
supports_images: false
type: openai_compat
```
Replace `research_fallback_order:` section with:
```yaml
research_fallback_order:
- claude_code
- vllm_research
- ollama_research
- github_copilot
- anthropic
```
**Step 2: Verify YAML parses cleanly**
```bash
conda run -n job-seeker python -c "import yaml; yaml.safe_load(open('config/llm.yaml'))"
```
Expected: no output (no error).
**Step 3: Run existing llm config test**
```bash
conda run -n job-seeker python -m pytest tests/test_llm_router.py::test_config_loads -v
```
Expected: PASS
**Step 4: Commit**
```bash
git add config/llm.yaml
git commit -m "feat: add vllm_research backend and update research_fallback_order"
```
---
### Task 2: Write failing tests for preflight changes
**Files:**
- Create: `tests/test_preflight.py`
No existing test file for preflight. Write all tests upfront — they fail until Task 35 implement the code.
**Step 1: Create `tests/test_preflight.py`**
```python
"""Tests for scripts/preflight.py additions: dual-GPU service table, size warning, VRAM check."""
import pytest
from pathlib import Path
from unittest.mock import patch
import yaml
import tempfile
import os
# ── Service table ──────────────────────────────────────────────────────────────
def test_ollama_research_in_services():
"""ollama_research must be in _SERVICES at port 11435."""
from scripts.preflight import _SERVICES
assert "ollama_research" in _SERVICES
_, default_port, env_var, docker_owned, adoptable = _SERVICES["ollama_research"]
assert default_port == 11435
assert env_var == "OLLAMA_RESEARCH_PORT"
assert docker_owned is True
assert adoptable is True
def test_ollama_research_in_llm_backends():
"""ollama_research must be a standalone key in _LLM_BACKENDS (not nested under ollama)."""
from scripts.preflight import _LLM_BACKENDS
assert "ollama_research" in _LLM_BACKENDS
# Should map to the ollama_research llm backend
backend_names = [name for name, _ in _LLM_BACKENDS["ollama_research"]]
assert "ollama_research" in backend_names
def test_vllm_research_in_llm_backends():
"""vllm_research must be registered under vllm in _LLM_BACKENDS."""
from scripts.preflight import _LLM_BACKENDS
assert "vllm" in _LLM_BACKENDS
backend_names = [name for name, _ in _LLM_BACKENDS["vllm"]]
assert "vllm_research" in backend_names
def test_ollama_research_in_docker_internal():
"""ollama_research must map to internal port 11434 (Ollama's container port)."""
from scripts.preflight import _DOCKER_INTERNAL
assert "ollama_research" in _DOCKER_INTERNAL
hostname, port = _DOCKER_INTERNAL["ollama_research"]
assert hostname == "ollama_research"
assert port == 11434 # container-internal port is always 11434
def test_ollama_not_mapped_to_ollama_research_backend():
"""ollama service key must only update the ollama llm backend, not ollama_research."""
from scripts.preflight import _LLM_BACKENDS
ollama_backend_names = [name for name, _ in _LLM_BACKENDS.get("ollama", [])]
assert "ollama_research" not in ollama_backend_names
# ── Download size warning ──────────────────────────────────────────────────────
def test_download_size_remote_profile():
"""Remote profile: only searxng + app, no ollama, no vision, no vllm."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("remote", "ollama")
assert "searxng" in sizes
assert "app" in sizes
assert "ollama" not in sizes
assert "vision_image" not in sizes
assert "vllm_image" not in sizes
def test_download_size_cpu_profile():
"""CPU profile: adds ollama image + llama3.2:3b weights."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("cpu", "ollama")
assert "ollama" in sizes
assert "llama3_2_3b" in sizes
assert "vision_image" not in sizes
def test_download_size_single_gpu_profile():
"""Single-GPU: adds vision image + moondream2 weights."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("single-gpu", "ollama")
assert "vision_image" in sizes
assert "moondream2" in sizes
assert "vllm_image" not in sizes
def test_download_size_dual_gpu_ollama_mode():
"""dual-gpu + ollama mode: no vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "ollama")
assert "vllm_image" not in sizes
def test_download_size_dual_gpu_vllm_mode():
"""dual-gpu + vllm mode: adds ~10 GB vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "vllm")
assert "vllm_image" in sizes
assert sizes["vllm_image"] >= 9000 # at least 9 GB
def test_download_size_dual_gpu_mixed_mode():
"""dual-gpu + mixed mode: also includes vllm image."""
from scripts.preflight import _download_size_mb
sizes = _download_size_mb("dual-gpu", "mixed")
assert "vllm_image" in sizes
# ── Mixed-mode VRAM warning ────────────────────────────────────────────────────
def test_mixed_mode_vram_warning_triggered():
"""Should return a warning string when GPU 1 has < 12 GB free in mixed mode."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 8.0}, # tight
]
warning = _mixed_mode_vram_warning(gpus, "mixed")
assert warning is not None
assert "8.0" in warning or "GPU 1" in warning
def test_mixed_mode_vram_warning_not_triggered_with_headroom():
"""Should return None when GPU 1 has >= 12 GB free."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 18.0}, # plenty
]
warning = _mixed_mode_vram_warning(gpus, "mixed")
assert warning is None
def test_mixed_mode_vram_warning_not_triggered_for_other_modes():
"""Warning only applies in mixed mode."""
from scripts.preflight import _mixed_mode_vram_warning
gpus = [
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
{"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 6.0},
]
assert _mixed_mode_vram_warning(gpus, "ollama") is None
assert _mixed_mode_vram_warning(gpus, "vllm") is None
# ── update_llm_yaml with ollama_research ──────────────────────────────────────
def test_update_llm_yaml_sets_ollama_research_url_docker_internal():
"""ollama_research backend URL must be set to ollama_research:11434 when Docker-owned."""
from scripts.preflight import update_llm_yaml
llm_cfg = {
"backends": {
"ollama": {"base_url": "http://old", "type": "openai_compat"},
"ollama_research": {"base_url": "http://old", "type": "openai_compat"},
"vllm": {"base_url": "http://old", "type": "openai_compat"},
"vllm_research": {"base_url": "http://old", "type": "openai_compat"},
"vision_service": {"base_url": "http://old", "type": "vision_service"},
}
}
with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
yaml.dump(llm_cfg, f)
tmp_path = Path(f.name)
ports = {
"ollama": {
"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"
},
"ollama_research": {
"resolved": 11435, "external": False, "env_var": "OLLAMA_RESEARCH_PORT"
},
"vllm": {
"resolved": 8000, "external": False, "env_var": "VLLM_PORT"
},
"vision": {
"resolved": 8002, "external": False, "env_var": "VISION_PORT"
},
}
try:
# Patch LLM_YAML to point at our temp file
with patch("scripts.preflight.LLM_YAML", tmp_path):
update_llm_yaml(ports)
result = yaml.safe_load(tmp_path.read_text())
# Docker-internal: use service name + container port
assert result["backends"]["ollama_research"]["base_url"] == "http://ollama_research:11434/v1"
# vllm_research must match vllm's URL
assert result["backends"]["vllm_research"]["base_url"] == result["backends"]["vllm"]["base_url"]
finally:
tmp_path.unlink()
def test_update_llm_yaml_sets_ollama_research_url_external():
"""When ollama_research is external (adopted), URL uses host.docker.internal:11435."""
from scripts.preflight import update_llm_yaml
llm_cfg = {
"backends": {
"ollama": {"base_url": "http://old", "type": "openai_compat"},
"ollama_research": {"base_url": "http://old", "type": "openai_compat"},
}
}
with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
yaml.dump(llm_cfg, f)
tmp_path = Path(f.name)
ports = {
"ollama": {"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"},
"ollama_research": {"resolved": 11435, "external": True, "env_var": "OLLAMA_RESEARCH_PORT"},
}
try:
with patch("scripts.preflight.LLM_YAML", tmp_path):
update_llm_yaml(ports)
result = yaml.safe_load(tmp_path.read_text())
assert result["backends"]["ollama_research"]["base_url"] == "http://host.docker.internal:11435/v1"
finally:
tmp_path.unlink()
```
**Step 2: Run tests to confirm they all fail**
```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -v 2>&1 | head -50
```
Expected: all FAIL with `ImportError` or `AssertionError` — that's correct.
**Step 3: Commit failing tests**
```bash
git add tests/test_preflight.py
git commit -m "test: add failing tests for dual-gpu preflight additions"
```
---
### Task 3: `preflight.py` — service table additions
**Files:**
- Modify: `scripts/preflight.py:46-67` (`_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`)
**Step 1: Update `_SERVICES`**
Find the `_SERVICES` dict (currently ends at the `"ollama"` entry). Add `ollama_research` as a new entry:
```python
_SERVICES: dict[str, tuple[str, int, str, bool, bool]] = {
"streamlit": ("streamlit_port", 8501, "STREAMLIT_PORT", True, False),
"searxng": ("searxng_port", 8888, "SEARXNG_PORT", True, True),
"vllm": ("vllm_port", 8000, "VLLM_PORT", True, True),
"vision": ("vision_port", 8002, "VISION_PORT", True, True),
"ollama": ("ollama_port", 11434, "OLLAMA_PORT", True, True),
"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
}
```
**Step 2: Update `_LLM_BACKENDS`**
Replace the existing dict:
```python
_LLM_BACKENDS: dict[str, list[tuple[str, str]]] = {
"ollama": [("ollama", "/v1")],
"ollama_research": [("ollama_research", "/v1")],
"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
"vision": [("vision_service", "")],
}
```
**Step 3: Update `_DOCKER_INTERNAL`**
Add `ollama_research` entry:
```python
_DOCKER_INTERNAL: dict[str, tuple[str, int]] = {
"ollama": ("ollama", 11434),
"ollama_research": ("ollama_research", 11434), # container-internal port is always 11434
"vllm": ("vllm", 8000),
"vision": ("vision", 8002),
"searxng": ("searxng", 8080),
}
```
**Step 4: Run service table tests**
```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py::test_ollama_research_in_services tests/test_preflight.py::test_ollama_research_in_llm_backends tests/test_preflight.py::test_vllm_research_in_llm_backends tests/test_preflight.py::test_ollama_research_in_docker_internal tests/test_preflight.py::test_ollama_not_mapped_to_ollama_research_backend tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_docker_internal tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_external -v
```
Expected: all PASS
**Step 5: Commit**
```bash
git add scripts/preflight.py
git commit -m "feat: add ollama_research to preflight service table and LLM backend map"
```
---
### Task 4: `preflight.py` — `_download_size_mb()` pure function
**Files:**
- Modify: `scripts/preflight.py` (add new function after `calc_cpu_offload_gb`)
**Step 1: Add the function**
After `calc_cpu_offload_gb()`, add:
```python
def _download_size_mb(profile: str, dual_gpu_mode: str = "ollama") -> dict[str, int]:
"""
Return estimated first-run download sizes in MB, keyed by component name.
Profile-aware: only includes components that will actually be pulled.
"""
sizes: dict[str, int] = {
"searxng": 300,
"app": 1500,
}
if profile in ("cpu", "single-gpu", "dual-gpu"):
sizes["ollama"] = 800
sizes["llama3_2_3b"] = 2000
if profile in ("single-gpu", "dual-gpu"):
sizes["vision_image"] = 3000
sizes["moondream2"] = 1800
if profile == "dual-gpu" and dual_gpu_mode in ("vllm", "mixed"):
sizes["vllm_image"] = 10000
return sizes
```
**Step 2: Run download size tests**
```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "download_size" -v
```
Expected: all PASS
**Step 3: Commit**
```bash
git add scripts/preflight.py
git commit -m "feat: add _download_size_mb() pure function for preflight size warning"
```
---
### Task 5: `preflight.py` — VRAM warning, size report block, DUAL_GPU_MODE default
**Files:**
- Modify: `scripts/preflight.py` (three additions to `main()` and a new helper)
**Step 1: Add `_mixed_mode_vram_warning()` after `_download_size_mb()`**
```python
def _mixed_mode_vram_warning(gpus: list[dict], dual_gpu_mode: str) -> str | None:
"""
Return a warning string if GPU 1 likely lacks VRAM for mixed mode, else None.
Only relevant when dual_gpu_mode == 'mixed' and at least 2 GPUs are present.
"""
if dual_gpu_mode != "mixed" or len(gpus) < 2:
return None
free = gpus[1]["vram_free_gb"]
if free < 12:
return (
f"⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {free:.1f} GB free — "
f"running ollama_research + vllm together may cause OOM. "
f"Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm."
)
return None
```
**Step 2: Run VRAM warning tests**
```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "vram" -v
```
Expected: all PASS
**Step 3: Wire size warning into `main()` report block**
In `main()`, find the closing `print("╚═...═╝")` line. Add the size warning block just before it:
```python
# ── Download size warning ──────────────────────────────────────────────
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
sizes = _download_size_mb(profile, dual_gpu_mode)
total_mb = sum(sizes.values())
print("║")
print("║ Download sizes (first-run estimates)")
print("║ Docker images")
print(f"║ app (Python build) ~{sizes.get('app', 0):,} MB")
if "searxng" in sizes:
print(f"║ searxng/searxng ~{sizes['searxng']:,} MB")
if "ollama" in sizes:
shared_note = " (shared by ollama + ollama_research)" if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed") else ""
print(f"║ ollama/ollama ~{sizes['ollama']:,} MB{shared_note}")
if "vision_image" in sizes:
print(f"║ vision service ~{sizes['vision_image']:,} MB (torch + moondream)")
if "vllm_image" in sizes:
print(f"║ vllm/vllm-openai ~{sizes['vllm_image']:,} MB")
print("║ Model weights (lazy-loaded on first use)")
if "llama3_2_3b" in sizes:
print(f"║ llama3.2:3b ~{sizes['llama3_2_3b']:,} MB → OLLAMA_MODELS_DIR")
if "moondream2" in sizes:
print(f"║ moondream2 ~{sizes['moondream2']:,} MB → vision container cache")
if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed"):
print("║ Note: ollama + ollama_research share model dir — no double download")
print(f"║ ⚠ Total first-run: ~{total_mb / 1024:.1f} GB (models persist between restarts)")
# ── Mixed-mode VRAM warning ────────────────────────────────────────────
vram_warn = _mixed_mode_vram_warning(gpus, dual_gpu_mode)
if vram_warn:
print("║")
print(f"║ {vram_warn}")
```
**Step 4: Wire `DUAL_GPU_MODE` default into `write_env()` block in `main()`**
In `main()`, find the `if not args.check_only:` block. After `env_updates["PEREGRINE_GPU_NAMES"]`, add:
```python
# Write DUAL_GPU_MODE default for new 2-GPU setups (don't override user's choice)
if len(gpus) >= 2:
existing_env: dict[str, str] = {}
if ENV_FILE.exists():
for line in ENV_FILE.read_text().splitlines():
if "=" in line and not line.startswith("#"):
k, _, v = line.partition("=")
existing_env[k.strip()] = v.strip()
if "DUAL_GPU_MODE" not in existing_env:
env_updates["DUAL_GPU_MODE"] = "ollama"
```
**Step 5: Add `import os` if not already present at top of file**
Check line 130 of `scripts/preflight.py`. `import os` is already present inside `get_cpu_cores()` as a local import — move it to the top-level imports block:
```python
import os # add alongside existing stdlib imports
```
And remove the local `import os` inside `get_cpu_cores()`.
**Step 6: Run all preflight tests**
```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -v
```
Expected: all PASS
**Step 7: Smoke-check the preflight report output**
```bash
conda run -n job-seeker python scripts/preflight.py --check-only
```
Expected: report includes the `Download sizes` block near the bottom.
**Step 8: Commit**
```bash
git add scripts/preflight.py
git commit -m "feat: add DUAL_GPU_MODE default, VRAM warning, and download size report to preflight"
```
---
### Task 6: `compose.yml` — `ollama_research` service + profile updates
**Files:**
- Modify: `compose.yml`
**Step 1: Update `ollama` profiles line**
Find:
```yaml
profiles: [cpu, single-gpu, dual-gpu]
```
Replace with:
```yaml
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```
**Step 2: Update `vision` profiles line**
Find:
```yaml
profiles: [single-gpu, dual-gpu]
```
Replace with:
```yaml
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```
**Step 3: Update `vllm` profiles line**
Find:
```yaml
profiles: [dual-gpu]
```
Replace with:
```yaml
profiles: [dual-gpu-vllm, dual-gpu-mixed]
```
**Step 4: Add `ollama_research` service**
After the closing lines of the `ollama` service block, add:
```yaml
ollama_research:
image: ollama/ollama:latest
ports:
- "${OLLAMA_RESEARCH_PORT:-11435}:11434"
volumes:
- ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama
- ./docker/ollama/entrypoint.sh:/entrypoint.sh
environment:
- OLLAMA_MODELS=/root/.ollama
- DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
entrypoint: ["/bin/bash", "/entrypoint.sh"]
profiles: [dual-gpu-ollama, dual-gpu-mixed]
restart: unless-stopped
```
**Step 5: Validate compose YAML**
```bash
docker compose -f compose.yml config --quiet
```
Expected: no errors.
**Step 6: Commit**
```bash
git add compose.yml
git commit -m "feat: add ollama_research service and update profiles for dual-gpu sub-profiles"
```
---
### Task 7: GPU overlay files — `compose.gpu.yml` and `compose.podman-gpu.yml`
**Files:**
- Modify: `compose.gpu.yml`
- Modify: `compose.podman-gpu.yml`
**Step 1: Add `ollama_research` to `compose.gpu.yml`**
After the `ollama:` block, add:
```yaml
ollama_research:
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["1"]
capabilities: [gpu]
```
**Step 2: Add `ollama_research` to `compose.podman-gpu.yml`**
After the `ollama:` block, add:
```yaml
ollama_research:
devices:
- nvidia.com/gpu=1
deploy:
resources:
reservations:
devices: []
```
**Step 3: Validate both files**
```bash
docker compose -f compose.yml -f compose.gpu.yml config --quiet
```
Expected: no errors.
**Step 4: Commit**
```bash
git add compose.gpu.yml compose.podman-gpu.yml
git commit -m "feat: assign ollama_research to GPU 1 in Docker and Podman GPU overlays"
```
---
### Task 8: `Makefile` + `manage.sh` — `DUAL_GPU_MODE` injection and help text
**Files:**
- Modify: `Makefile`
- Modify: `manage.sh`
**Step 1: Update `Makefile`**
After the `COMPOSE_OVERRIDE` variable, add `DUAL_GPU_MODE` reading:
```makefile
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 || echo ollama)
```
In the GPU overlay block, find:
```makefile
else
ifneq (,$(findstring gpu,$(PROFILE)))
COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
endif
endif
```
Replace the `else` branch with:
```makefile
else
ifneq (,$(findstring gpu,$(PROFILE)))
COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
endif
endif
ifeq ($(PROFILE),dual-gpu)
COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```
**Step 2: Update `manage.sh` — profiles help block**
Find the profiles section in `usage()`:
```bash
echo " dual-gpu Ollama + Vision + vLLM on GPU 0+1"
```
Replace with:
```bash
echo " dual-gpu Ollama + Vision on GPU 0; GPU 1 set by DUAL_GPU_MODE"
echo " DUAL_GPU_MODE=ollama (default) ollama_research on GPU 1"
echo " DUAL_GPU_MODE=vllm vllm on GPU 1"
echo " DUAL_GPU_MODE=mixed both on GPU 1 (VRAM-split)"
```
**Step 3: Verify Makefile parses**
```bash
make help
```
Expected: help table prints cleanly, no make errors.
**Step 4: Verify manage.sh help**
```bash
./manage.sh help
```
Expected: new dual-gpu description appears in profiles section.
**Step 5: Commit**
```bash
git add Makefile manage.sh
git commit -m "feat: inject DUAL_GPU_MODE sub-profile in Makefile; update manage.sh help"
```
---
### Task 9: Integration smoke test
**Goal:** Verify the full chain works for `DUAL_GPU_MODE=ollama` without actually starting Docker (dry-run compose config check).
**Step 1: Write `DUAL_GPU_MODE=ollama` to `.env` temporarily**
```bash
echo "DUAL_GPU_MODE=ollama" >> .env
```
**Step 2: Dry-run compose config for dual-gpu + dual-gpu-ollama**
```bash
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-ollama config 2>&1 | grep -E "^ [a-z]|image:|ports:"
```
Expected output includes:
- `ollama:` service with port 11434
- `ollama_research:` service with port 11435
- `vision:` service
- `searxng:` service
- **No** `vllm:` service
**Step 3: Dry-run for `DUAL_GPU_MODE=vllm`**
```bash
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-vllm config 2>&1 | grep -E "^ [a-z]|image:|ports:"
```
Expected:
- `ollama:` service (port 11434)
- `vllm:` service (port 8000)
- **No** `ollama_research:` service
**Step 4: Run full test suite**
```bash
conda run -n job-seeker python -m pytest tests/ -v
```
Expected: all existing tests PASS, all new preflight tests PASS.
**Step 5: Clean up `.env` test entry**
```bash
# Remove the test DUAL_GPU_MODE line (preflight will re-write it correctly on next run)
sed -i '/^DUAL_GPU_MODE=/d' .env
```
**Step 6: Final commit**
```bash
git add .env # in case preflight rewrote it during testing
git commit -m "feat: dual-gpu DUAL_GPU_MODE complete — ollama/vllm/mixed GPU 1 selection"
```