diff --git a/docs/plans/2026-02-26-dual-gpu-design.md b/docs/plans/2026-02-26-dual-gpu-design.md new file mode 100644 index 0000000..860a17a --- /dev/null +++ b/docs/plans/2026-02-26-dual-gpu-design.md @@ -0,0 +1,257 @@ +# Peregrine — Dual-GPU / Dual-Inference Design + +**Date:** 2026-02-26 +**Status:** Approved — ready for implementation +**Scope:** Peregrine (reference impl; patterns propagate to future products) + +--- + +## Goal + +Replace the fixed `dual-gpu` profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a +`DUAL_GPU_MODE` env var that selects which inference stack occupies GPU 1. Simultaneously +add a first-run download size warning to preflight so users know what they're in for before +Docker starts pulling images and models. + +--- + +## Modes + +| `DUAL_GPU_MODE` | GPU 0 | GPU 1 | Research backend | +|-----------------|-------|-------|-----------------| +| `ollama` (default) | ollama + vision | ollama_research | `ollama_research` | +| `vllm` | ollama + vision | vllm | `vllm_research` | +| `mixed` | ollama + vision | ollama_research + vllm (VRAM-split) | `vllm_research` → `ollama_research` fallback | + +`mixed` requires sufficient VRAM on GPU 1. Preflight warns (not blocks) when GPU 1 has +< 12 GB free before starting in mixed mode. + +Cover letters always use `ollama` on GPU 0. Research uses whichever GPU 1 backend is +reachable. The LLM router's `_is_reachable()` check handles this transparently — the +fallback chain simply skips services that aren't running. + +--- + +## Compose Profile Architecture + +Docker Compose profiles used to gate which services start per mode. +`DUAL_GPU_MODE` is read by the Makefile and passed as a second `--profile` flag. 
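The Makefile injection just described is a small pure mapping; as a sketch (the helper name `compose_profiles` is illustrative, not part of the codebase):

```python
def compose_profiles(profile: str, dual_gpu_mode: str = "ollama") -> list[str]:
    """Profiles the Makefile hands to docker compose: the user-facing
    profile, plus a dual-gpu sub-profile selected by DUAL_GPU_MODE."""
    flags = [profile]
    if profile == "dual-gpu":
        # Sub-profiles are injected, never typed by the user.
        flags.append(f"dual-gpu-{dual_gpu_mode}")
    return flags

# PROFILE=dual-gpu DUAL_GPU_MODE=vllm
#   -> --profile dual-gpu --profile dual-gpu-vllm
```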
+ +### Service → profile mapping + +| Service | Profiles | +|---------|---------| +| `ollama` | `cpu`, `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` | +| `vision` | `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` | +| `ollama_research` | `dual-gpu-ollama`, `dual-gpu-mixed` | +| `vllm` | `dual-gpu-vllm`, `dual-gpu-mixed` | +| `finetune` | `finetune` | + +User-facing profiles remain: `remote`, `cpu`, `single-gpu`, `dual-gpu`. +Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected by the +Makefile and never typed by the user. + +--- + +## File Changes + +### `compose.yml` + +**`ollama`** — add all dual-gpu sub-profiles to `profiles`: +```yaml +profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed] +``` + +**`vision`** — same pattern: +```yaml +profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed] +``` + +**`vllm`** — change from `[dual-gpu]` to: +```yaml +profiles: [dual-gpu-vllm, dual-gpu-mixed] +``` + +**`ollama_research`** — new service: +```yaml +ollama_research: + image: ollama/ollama:latest + ports: + - "${OLLAMA_RESEARCH_PORT:-11435}:11434" + volumes: + - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama # shared — no double download + - ./docker/ollama/entrypoint.sh:/entrypoint.sh + environment: + - OLLAMA_MODELS=/root/.ollama + - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b} + entrypoint: ["/bin/bash", "/entrypoint.sh"] + profiles: [dual-gpu-ollama, dual-gpu-mixed] + restart: unless-stopped +``` + +### `compose.gpu.yml` + +Add `ollama_research` block (GPU 1). 
`vllm` stays on GPU 1 as-is:
+```yaml
+ollama_research:
+  deploy:
+    resources:
+      reservations:
+        devices:
+          - driver: nvidia
+            device_ids: ["1"]
+            capabilities: [gpu]
+```
+
+### `compose.podman-gpu.yml`
+
+Same addition for Podman CDI:
+```yaml
+ollama_research:
+  devices:
+    - nvidia.com/gpu=1
+  deploy:
+    resources:
+      reservations:
+        devices: []
+```
+
+### `Makefile`
+
+Two additions after existing `COMPOSE` detection:
+
+```makefile
+# Grouping makes the ollama default fire when .env has no DUAL_GPU_MODE line
+# (a bare `grep | cut || echo` would swallow the default, since cut exits 0):
+DUAL_GPU_MODE ?= $(shell (grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null || echo DUAL_GPU_MODE=ollama) | cut -d= -f2)
+
+# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
+# Sub-profile injection for dual-gpu modes:
+ifeq ($(PROFILE),dual-gpu)
+  COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
+endif
+```
+
+Update `manage.sh` usage block to document `dual-gpu` profile with `DUAL_GPU_MODE` note:
+```
+dual-gpu    Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
+            DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1
+            DUAL_GPU_MODE=vllm              vllm on GPU 1
+            DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split; see preflight warning)
+```
+
+### `scripts/preflight.py`
+
+**1. `_SERVICES` — add `ollama_research`:**
+```python
+"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
+```
+
+**2. `_LLM_BACKENDS` — add entries for both new backends:**
+```python
+"ollama_research": [("ollama_research", "/v1")],
+# vllm_research is an alias for vllm's port — preflight updates base_url for both:
+"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
+```
+
+**3. `_DOCKER_INTERNAL` — add `ollama_research`:**
+```python
+"ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
+```
+
+**4. `recommend_profile()` — unchanged** (still returns `"dual-gpu"` for 2 GPUs).
+Write `DUAL_GPU_MODE=ollama` to `.env` when first setting up a 2-GPU system.
+
+**5. Mixed-mode VRAM warning** — after the GPU resource section, before the closing line:
+```python
+dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
+if dual_gpu_mode == "mixed" and len(gpus) >= 2:
+    if gpus[1]["vram_free_gb"] < 12:
+        print(f"║ ⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
+        print("║   Running ollama_research + vllm together may cause OOM.")
+        print("║   Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")
+```
+
+**6. Download size warning** — profile-aware block added just before the closing `╚` line:
+
+```
+║ Download sizes (first-run estimates)
+║   Docker images
+║     ollama/ollama       ~800 MB   (shared by ollama + ollama_research)
+║     searxng/searxng     ~300 MB
+║     app (Python build)  ~1.5 GB
+║     vision service      ~3.0 GB   [single-gpu and above]
+║     vllm/vllm-openai    ~10.0 GB  [vllm / mixed mode only]
+║
+║   Model weights (lazy-loaded on first use)
+║     llama3.2:3b         ~2.0 GB   → OLLAMA_MODELS_DIR
+║     moondream2          ~1.8 GB   → vision container cache  [single-gpu+]
+║   Note: ollama + ollama_research share the same model dir — no double download
+║
+║ ⚠ Total first-run: ~X GB  (models persist between restarts)
+```
+
+Total is summed at runtime based on active profile + `DUAL_GPU_MODE`.
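The runtime summation amounts to a one-liner; a sketch (the helper name is illustrative, and the component keys and MB figures come from the size table in this doc):

```python
def total_first_run_gb(sizes_mb: dict[str, int]) -> float:
    """Sum per-component MB estimates for the active profile/mode, in GB."""
    return sum(sizes_mb.values()) / 1024

# The dual-gpu + DUAL_GPU_MODE=vllm case, per the size table:
sizes = {
    "searxng": 300, "app": 1500, "ollama": 800, "llama3_2_3b": 2000,
    "vision_image": 3000, "moondream2": 1800, "vllm_image": 10000,
}
print(f"~{total_first_run_gb(sizes):.1f} GB")  # prints "~18.9 GB"
```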
+ +Size table (used by the warning calculator): +| Component | Size | Condition | +|-----------|------|-----------| +| `ollama/ollama` image | 800 MB | cpu, single-gpu, dual-gpu | +| `searxng/searxng` image | 300 MB | always | +| app image | 1,500 MB | always | +| vision service image | 3,000 MB | single-gpu, dual-gpu | +| `vllm/vllm-openai` image | 10,000 MB | vllm or mixed mode | +| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu | +| moondream2 weights | 1,800 MB | single-gpu, dual-gpu | + +### `config/llm.yaml` + +**Add `vllm_research` backend:** +```yaml +vllm_research: + api_key: '' + base_url: http://host.docker.internal:8000/v1 # same port as vllm; preflight keeps in sync + enabled: true + model: __auto__ + supports_images: false + type: openai_compat +``` + +**Update `research_fallback_order`:** +```yaml +research_fallback_order: + - claude_code + - vllm_research + - ollama_research + - github_copilot + - anthropic +``` + +`vllm` stays in the main `fallback_order` (cover letters). `vllm_research` is the explicit +research alias for the same service — different config key, same port, makes routing intent +readable in the YAML. + +--- + +## Downstream Compatibility + +The LLM router requires no changes. `_is_reachable()` already skips backends that aren't +responding. When `DUAL_GPU_MODE=ollama`, `vllm_research` is unreachable and skipped; +`ollama_research` is up and used. When `DUAL_GPU_MODE=vllm`, the reverse. `mixed` mode +makes both reachable; `vllm_research` wins as the higher-priority entry. + +Preflight's `update_llm_yaml()` keeps `base_url` values correct for both adopted (external) +and Docker-internal routing automatically, since `vllm_research` is registered under the +`"vllm"` key in `_LLM_BACKENDS`. + +--- + +## Future Considerations + +- **Triple-GPU / 3+ service configs:** When a third product is active, extract this pattern + into `circuitforge-core` as a reusable inference topology manager. 
+- **Dual vLLM:** Two vLLM instances (e.g., different model sizes per task) follow the same
+  pattern — add `vllm_research` as a separate compose service on its own port.
+- **VRAM-aware model selection:** Preflight could suggest smaller models when VRAM is tight
+  in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
+- **Queue optimizer (1-GPU / CPU):** When only one inference backend is available and a batch
+  of tasks is queued, group by task type (all cover letters first, then all research briefs)
+  to avoid repeated model context switches. Tracked separately.
diff --git a/docs/plans/2026-02-26-dual-gpu-plan.md b/docs/plans/2026-02-26-dual-gpu-plan.md
new file mode 100644
index 0000000..08f84b0
--- /dev/null
+++ b/docs/plans/2026-02-26-dual-gpu-plan.md
@@ -0,0 +1,811 @@
+# Dual-GPU / Dual-Inference Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Add `DUAL_GPU_MODE=ollama|vllm|mixed` env var that gates which inference service occupies GPU 1 on dual-GPU systems, plus a first-run download size warning in preflight.
+
+**Architecture:** Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected alongside `--profile dual-gpu` by the Makefile based on `DUAL_GPU_MODE`. The LLM router requires zero changes — `_is_reachable()` naturally skips backends that aren't running. Preflight gains `ollama_research` as a tracked service and emits a size warning block.
+
+**Tech Stack:** Docker Compose profiles, Python (preflight.py), YAML (llm.yaml, compose files), bash (Makefile, manage.sh)
+
+**Design doc:** `docs/plans/2026-02-26-dual-gpu-design.md`
+
+**Test runner:** `conda run -n job-seeker python -m pytest tests/ -v`
+
+---
+
+### Task 1: Update `config/llm.yaml`
+
+**Files:**
+- Modify: `config/llm.yaml`
+
+**Step 1: Add `vllm_research` backend and update `research_fallback_order`**
+
+Open `config/llm.yaml`.
After the `vllm:` block, add: + +```yaml + vllm_research: + api_key: '' + base_url: http://host.docker.internal:8000/v1 + enabled: true + model: __auto__ + supports_images: false + type: openai_compat +``` + +Replace `research_fallback_order:` section with: + +```yaml +research_fallback_order: +- claude_code +- vllm_research +- ollama_research +- github_copilot +- anthropic +``` + +**Step 2: Verify YAML parses cleanly** + +```bash +conda run -n job-seeker python -c "import yaml; yaml.safe_load(open('config/llm.yaml'))" +``` + +Expected: no output (no error). + +**Step 3: Run existing llm config test** + +```bash +conda run -n job-seeker python -m pytest tests/test_llm_router.py::test_config_loads -v +``` + +Expected: PASS + +**Step 4: Commit** + +```bash +git add config/llm.yaml +git commit -m "feat: add vllm_research backend and update research_fallback_order" +``` + +--- + +### Task 2: Write failing tests for preflight changes + +**Files:** +- Create: `tests/test_preflight.py` + +No existing test file for preflight. Write all tests upfront — they fail until Task 3–5 implement the code. 
+ +**Step 1: Create `tests/test_preflight.py`** + +```python +"""Tests for scripts/preflight.py additions: dual-GPU service table, size warning, VRAM check.""" +import pytest +from pathlib import Path +from unittest.mock import patch +import yaml +import tempfile +import os + + +# ── Service table ────────────────────────────────────────────────────────────── + +def test_ollama_research_in_services(): + """ollama_research must be in _SERVICES at port 11435.""" + from scripts.preflight import _SERVICES + assert "ollama_research" in _SERVICES + _, default_port, env_var, docker_owned, adoptable = _SERVICES["ollama_research"] + assert default_port == 11435 + assert env_var == "OLLAMA_RESEARCH_PORT" + assert docker_owned is True + assert adoptable is True + + +def test_ollama_research_in_llm_backends(): + """ollama_research must be a standalone key in _LLM_BACKENDS (not nested under ollama).""" + from scripts.preflight import _LLM_BACKENDS + assert "ollama_research" in _LLM_BACKENDS + # Should map to the ollama_research llm backend + backend_names = [name for name, _ in _LLM_BACKENDS["ollama_research"]] + assert "ollama_research" in backend_names + + +def test_vllm_research_in_llm_backends(): + """vllm_research must be registered under vllm in _LLM_BACKENDS.""" + from scripts.preflight import _LLM_BACKENDS + assert "vllm" in _LLM_BACKENDS + backend_names = [name for name, _ in _LLM_BACKENDS["vllm"]] + assert "vllm_research" in backend_names + + +def test_ollama_research_in_docker_internal(): + """ollama_research must map to internal port 11434 (Ollama's container port).""" + from scripts.preflight import _DOCKER_INTERNAL + assert "ollama_research" in _DOCKER_INTERNAL + hostname, port = _DOCKER_INTERNAL["ollama_research"] + assert hostname == "ollama_research" + assert port == 11434 # container-internal port is always 11434 + + +def test_ollama_not_mapped_to_ollama_research_backend(): + """ollama service key must only update the ollama llm backend, not 
ollama_research.""" + from scripts.preflight import _LLM_BACKENDS + ollama_backend_names = [name for name, _ in _LLM_BACKENDS.get("ollama", [])] + assert "ollama_research" not in ollama_backend_names + + +# ── Download size warning ────────────────────────────────────────────────────── + +def test_download_size_remote_profile(): + """Remote profile: only searxng + app, no ollama, no vision, no vllm.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("remote", "ollama") + assert "searxng" in sizes + assert "app" in sizes + assert "ollama" not in sizes + assert "vision_image" not in sizes + assert "vllm_image" not in sizes + + +def test_download_size_cpu_profile(): + """CPU profile: adds ollama image + llama3.2:3b weights.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("cpu", "ollama") + assert "ollama" in sizes + assert "llama3_2_3b" in sizes + assert "vision_image" not in sizes + + +def test_download_size_single_gpu_profile(): + """Single-GPU: adds vision image + moondream2 weights.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("single-gpu", "ollama") + assert "vision_image" in sizes + assert "moondream2" in sizes + assert "vllm_image" not in sizes + + +def test_download_size_dual_gpu_ollama_mode(): + """dual-gpu + ollama mode: no vllm image.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("dual-gpu", "ollama") + assert "vllm_image" not in sizes + + +def test_download_size_dual_gpu_vllm_mode(): + """dual-gpu + vllm mode: adds ~10 GB vllm image.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("dual-gpu", "vllm") + assert "vllm_image" in sizes + assert sizes["vllm_image"] >= 9000 # at least 9 GB + + +def test_download_size_dual_gpu_mixed_mode(): + """dual-gpu + mixed mode: also includes vllm image.""" + from scripts.preflight import _download_size_mb + sizes = _download_size_mb("dual-gpu", "mixed") + 
assert "vllm_image" in sizes + + +# ── Mixed-mode VRAM warning ──────────────────────────────────────────────────── + +def test_mixed_mode_vram_warning_triggered(): + """Should return a warning string when GPU 1 has < 12 GB free in mixed mode.""" + from scripts.preflight import _mixed_mode_vram_warning + gpus = [ + {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0}, + {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 8.0}, # tight + ] + warning = _mixed_mode_vram_warning(gpus, "mixed") + assert warning is not None + assert "8.0" in warning or "GPU 1" in warning + + +def test_mixed_mode_vram_warning_not_triggered_with_headroom(): + """Should return None when GPU 1 has >= 12 GB free.""" + from scripts.preflight import _mixed_mode_vram_warning + gpus = [ + {"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 20.0}, + {"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 18.0}, # plenty + ] + warning = _mixed_mode_vram_warning(gpus, "mixed") + assert warning is None + + +def test_mixed_mode_vram_warning_not_triggered_for_other_modes(): + """Warning only applies in mixed mode.""" + from scripts.preflight import _mixed_mode_vram_warning + gpus = [ + {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0}, + {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 6.0}, + ] + assert _mixed_mode_vram_warning(gpus, "ollama") is None + assert _mixed_mode_vram_warning(gpus, "vllm") is None + + +# ── update_llm_yaml with ollama_research ────────────────────────────────────── + +def test_update_llm_yaml_sets_ollama_research_url_docker_internal(): + """ollama_research backend URL must be set to ollama_research:11434 when Docker-owned.""" + from scripts.preflight import update_llm_yaml + + llm_cfg = { + "backends": { + "ollama": {"base_url": "http://old", "type": "openai_compat"}, + "ollama_research": {"base_url": "http://old", "type": "openai_compat"}, + "vllm": {"base_url": "http://old", "type": "openai_compat"}, + 
"vllm_research": {"base_url": "http://old", "type": "openai_compat"}, + "vision_service": {"base_url": "http://old", "type": "vision_service"}, + } + } + + with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f: + yaml.dump(llm_cfg, f) + tmp_path = Path(f.name) + + ports = { + "ollama": { + "resolved": 11434, "external": False, "env_var": "OLLAMA_PORT" + }, + "ollama_research": { + "resolved": 11435, "external": False, "env_var": "OLLAMA_RESEARCH_PORT" + }, + "vllm": { + "resolved": 8000, "external": False, "env_var": "VLLM_PORT" + }, + "vision": { + "resolved": 8002, "external": False, "env_var": "VISION_PORT" + }, + } + + try: + # Patch LLM_YAML to point at our temp file + with patch("scripts.preflight.LLM_YAML", tmp_path): + update_llm_yaml(ports) + + result = yaml.safe_load(tmp_path.read_text()) + # Docker-internal: use service name + container port + assert result["backends"]["ollama_research"]["base_url"] == "http://ollama_research:11434/v1" + # vllm_research must match vllm's URL + assert result["backends"]["vllm_research"]["base_url"] == result["backends"]["vllm"]["base_url"] + finally: + tmp_path.unlink() + + +def test_update_llm_yaml_sets_ollama_research_url_external(): + """When ollama_research is external (adopted), URL uses host.docker.internal:11435.""" + from scripts.preflight import update_llm_yaml + + llm_cfg = { + "backends": { + "ollama": {"base_url": "http://old", "type": "openai_compat"}, + "ollama_research": {"base_url": "http://old", "type": "openai_compat"}, + } + } + + with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f: + yaml.dump(llm_cfg, f) + tmp_path = Path(f.name) + + ports = { + "ollama": {"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"}, + "ollama_research": {"resolved": 11435, "external": True, "env_var": "OLLAMA_RESEARCH_PORT"}, + } + + try: + with patch("scripts.preflight.LLM_YAML", tmp_path): + update_llm_yaml(ports) + result = yaml.safe_load(tmp_path.read_text()) + 
assert result["backends"]["ollama_research"]["base_url"] == "http://host.docker.internal:11435/v1" + finally: + tmp_path.unlink() +``` + +**Step 2: Run tests to confirm they all fail** + +```bash +conda run -n job-seeker python -m pytest tests/test_preflight.py -v 2>&1 | head -50 +``` + +Expected: all FAIL with `ImportError` or `AssertionError` — that's correct. + +**Step 3: Commit failing tests** + +```bash +git add tests/test_preflight.py +git commit -m "test: add failing tests for dual-gpu preflight additions" +``` + +--- + +### Task 3: `preflight.py` — service table additions + +**Files:** +- Modify: `scripts/preflight.py:46-67` (`_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`) + +**Step 1: Update `_SERVICES`** + +Find the `_SERVICES` dict (currently ends at the `"ollama"` entry). Add `ollama_research` as a new entry: + +```python +_SERVICES: dict[str, tuple[str, int, str, bool, bool]] = { + "streamlit": ("streamlit_port", 8501, "STREAMLIT_PORT", True, False), + "searxng": ("searxng_port", 8888, "SEARXNG_PORT", True, True), + "vllm": ("vllm_port", 8000, "VLLM_PORT", True, True), + "vision": ("vision_port", 8002, "VISION_PORT", True, True), + "ollama": ("ollama_port", 11434, "OLLAMA_PORT", True, True), + "ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True), +} +``` + +**Step 2: Update `_LLM_BACKENDS`** + +Replace the existing dict: + +```python +_LLM_BACKENDS: dict[str, list[tuple[str, str]]] = { + "ollama": [("ollama", "/v1")], + "ollama_research": [("ollama_research", "/v1")], + "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")], + "vision": [("vision_service", "")], +} +``` + +**Step 3: Update `_DOCKER_INTERNAL`** + +Add `ollama_research` entry: + +```python +_DOCKER_INTERNAL: dict[str, tuple[str, int]] = { + "ollama": ("ollama", 11434), + "ollama_research": ("ollama_research", 11434), # container-internal port is always 11434 + "vllm": ("vllm", 8000), + "vision": ("vision", 8002), + "searxng": ("searxng", 8080), +} +``` + 
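To make the routing concrete, here is a simplified sketch of how these tables are expected to combine when `update_llm_yaml()` rewrites a backend's `base_url` — an illustration, not the actual preflight code:

```python
# Mirrors the _DOCKER_INTERNAL table from this task.
_DOCKER_INTERNAL = {
    "ollama": ("ollama", 11434),
    "ollama_research": ("ollama_research", 11434),
    "vllm": ("vllm", 8000),
    "vision": ("vision", 8002),
}

def base_url(service: str, host_port: int, external: bool, suffix: str = "/v1") -> str:
    """Adopted (external) services are reached via host.docker.internal on
    the host port; Docker-owned ones use service name + container port."""
    if external:
        return f"http://host.docker.internal:{host_port}{suffix}"
    name, container_port = _DOCKER_INTERNAL[service]
    return f"http://{name}:{container_port}{suffix}"
```

This is why the tests above expect `http://ollama_research:11434/v1` for the Docker-owned case but `http://host.docker.internal:11435/v1` for an adopted instance.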
+**Step 4: Run service table tests** + +```bash +conda run -n job-seeker python -m pytest tests/test_preflight.py::test_ollama_research_in_services tests/test_preflight.py::test_ollama_research_in_llm_backends tests/test_preflight.py::test_vllm_research_in_llm_backends tests/test_preflight.py::test_ollama_research_in_docker_internal tests/test_preflight.py::test_ollama_not_mapped_to_ollama_research_backend tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_docker_internal tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_external -v +``` + +Expected: all PASS + +**Step 5: Commit** + +```bash +git add scripts/preflight.py +git commit -m "feat: add ollama_research to preflight service table and LLM backend map" +``` + +--- + +### Task 4: `preflight.py` — `_download_size_mb()` pure function + +**Files:** +- Modify: `scripts/preflight.py` (add new function after `calc_cpu_offload_gb`) + +**Step 1: Add the function** + +After `calc_cpu_offload_gb()`, add: + +```python +def _download_size_mb(profile: str, dual_gpu_mode: str = "ollama") -> dict[str, int]: + """ + Return estimated first-run download sizes in MB, keyed by component name. + Profile-aware: only includes components that will actually be pulled. 
+ """ + sizes: dict[str, int] = { + "searxng": 300, + "app": 1500, + } + if profile in ("cpu", "single-gpu", "dual-gpu"): + sizes["ollama"] = 800 + sizes["llama3_2_3b"] = 2000 + if profile in ("single-gpu", "dual-gpu"): + sizes["vision_image"] = 3000 + sizes["moondream2"] = 1800 + if profile == "dual-gpu" and dual_gpu_mode in ("vllm", "mixed"): + sizes["vllm_image"] = 10000 + return sizes +``` + +**Step 2: Run download size tests** + +```bash +conda run -n job-seeker python -m pytest tests/test_preflight.py -k "download_size" -v +``` + +Expected: all PASS + +**Step 3: Commit** + +```bash +git add scripts/preflight.py +git commit -m "feat: add _download_size_mb() pure function for preflight size warning" +``` + +--- + +### Task 5: `preflight.py` — VRAM warning, size report block, DUAL_GPU_MODE default + +**Files:** +- Modify: `scripts/preflight.py` (three additions to `main()` and a new helper) + +**Step 1: Add `_mixed_mode_vram_warning()` after `_download_size_mb()`** + +```python +def _mixed_mode_vram_warning(gpus: list[dict], dual_gpu_mode: str) -> str | None: + """ + Return a warning string if GPU 1 likely lacks VRAM for mixed mode, else None. + Only relevant when dual_gpu_mode == 'mixed' and at least 2 GPUs are present. + """ + if dual_gpu_mode != "mixed" or len(gpus) < 2: + return None + free = gpus[1]["vram_free_gb"] + if free < 12: + return ( + f"⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {free:.1f} GB free — " + f"running ollama_research + vllm together may cause OOM. " + f"Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm." + ) + return None +``` + +**Step 2: Run VRAM warning tests** + +```bash +conda run -n job-seeker python -m pytest tests/test_preflight.py -k "vram" -v +``` + +Expected: all PASS + +**Step 3: Wire size warning into `main()` report block** + +In `main()`, find the closing `print("╚═...═╝")` line. 
Add the size warning block just before it: + +```python + # ── Download size warning ────────────────────────────────────────────── + dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama") + sizes = _download_size_mb(profile, dual_gpu_mode) + total_mb = sum(sizes.values()) + print("║") + print("║ Download sizes (first-run estimates)") + print("║ Docker images") + print(f"║ app (Python build) ~{sizes.get('app', 0):,} MB") + if "searxng" in sizes: + print(f"║ searxng/searxng ~{sizes['searxng']:,} MB") + if "ollama" in sizes: + shared_note = " (shared by ollama + ollama_research)" if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed") else "" + print(f"║ ollama/ollama ~{sizes['ollama']:,} MB{shared_note}") + if "vision_image" in sizes: + print(f"║ vision service ~{sizes['vision_image']:,} MB (torch + moondream)") + if "vllm_image" in sizes: + print(f"║ vllm/vllm-openai ~{sizes['vllm_image']:,} MB") + print("║ Model weights (lazy-loaded on first use)") + if "llama3_2_3b" in sizes: + print(f"║ llama3.2:3b ~{sizes['llama3_2_3b']:,} MB → OLLAMA_MODELS_DIR") + if "moondream2" in sizes: + print(f"║ moondream2 ~{sizes['moondream2']:,} MB → vision container cache") + if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed"): + print("║ Note: ollama + ollama_research share model dir — no double download") + print(f"║ ⚠ Total first-run: ~{total_mb / 1024:.1f} GB (models persist between restarts)") + + # ── Mixed-mode VRAM warning ──────────────────────────────────────────── + vram_warn = _mixed_mode_vram_warning(gpus, dual_gpu_mode) + if vram_warn: + print("║") + print(f"║ {vram_warn}") +``` + +**Step 4: Wire `DUAL_GPU_MODE` default into `write_env()` block in `main()`** + +In `main()`, find the `if not args.check_only:` block. 
After `env_updates["PEREGRINE_GPU_NAMES"]`, add: + +```python + # Write DUAL_GPU_MODE default for new 2-GPU setups (don't override user's choice) + if len(gpus) >= 2: + existing_env: dict[str, str] = {} + if ENV_FILE.exists(): + for line in ENV_FILE.read_text().splitlines(): + if "=" in line and not line.startswith("#"): + k, _, v = line.partition("=") + existing_env[k.strip()] = v.strip() + if "DUAL_GPU_MODE" not in existing_env: + env_updates["DUAL_GPU_MODE"] = "ollama" +``` + +**Step 5: Add `import os` if not already present at top of file** + +Check line 1–30 of `scripts/preflight.py`. `import os` is already present inside `get_cpu_cores()` as a local import — move it to the top-level imports block: + +```python +import os # add alongside existing stdlib imports +``` + +And remove the local `import os` inside `get_cpu_cores()`. + +**Step 6: Run all preflight tests** + +```bash +conda run -n job-seeker python -m pytest tests/test_preflight.py -v +``` + +Expected: all PASS + +**Step 7: Smoke-check the preflight report output** + +```bash +conda run -n job-seeker python scripts/preflight.py --check-only +``` + +Expected: report includes the `Download sizes` block near the bottom. 
+ +**Step 8: Commit** + +```bash +git add scripts/preflight.py +git commit -m "feat: add DUAL_GPU_MODE default, VRAM warning, and download size report to preflight" +``` + +--- + +### Task 6: `compose.yml` — `ollama_research` service + profile updates + +**Files:** +- Modify: `compose.yml` + +**Step 1: Update `ollama` profiles line** + +Find: +```yaml + profiles: [cpu, single-gpu, dual-gpu] +``` +Replace with: +```yaml + profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed] +``` + +**Step 2: Update `vision` profiles line** + +Find: +```yaml + profiles: [single-gpu, dual-gpu] +``` +Replace with: +```yaml + profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed] +``` + +**Step 3: Update `vllm` profiles line** + +Find: +```yaml + profiles: [dual-gpu] +``` +Replace with: +```yaml + profiles: [dual-gpu-vllm, dual-gpu-mixed] +``` + +**Step 4: Add `ollama_research` service** + +After the closing lines of the `ollama` service block, add: + +```yaml + ollama_research: + image: ollama/ollama:latest + ports: + - "${OLLAMA_RESEARCH_PORT:-11435}:11434" + volumes: + - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama + - ./docker/ollama/entrypoint.sh:/entrypoint.sh + environment: + - OLLAMA_MODELS=/root/.ollama + - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b} + entrypoint: ["/bin/bash", "/entrypoint.sh"] + profiles: [dual-gpu-ollama, dual-gpu-mixed] + restart: unless-stopped +``` + +**Step 5: Validate compose YAML** + +```bash +docker compose -f compose.yml config --quiet +``` + +Expected: no errors. 
+
+**Step 6: Commit**
+
+```bash
+git add compose.yml
+git commit -m "feat: add ollama_research service and update profiles for dual-gpu sub-profiles"
+```
+
+---
+
+### Task 7: GPU overlay files — `compose.gpu.yml` and `compose.podman-gpu.yml`
+
+**Files:**
+- Modify: `compose.gpu.yml`
+- Modify: `compose.podman-gpu.yml`
+
+**Step 1: Add `ollama_research` to `compose.gpu.yml`**
+
+After the `ollama:` block, add:
+
+```yaml
+  ollama_research:
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              device_ids: ["1"]
+              capabilities: [gpu]
+```
+
+**Step 2: Add `ollama_research` to `compose.podman-gpu.yml`**
+
+After the `ollama:` block, add:
+
+```yaml
+  ollama_research:
+    devices:
+      - nvidia.com/gpu=1
+    deploy:
+      resources:
+        reservations:
+          devices: []
+```
+
+**Step 3: Validate both files**
+
+```bash
+docker compose -f compose.yml -f compose.gpu.yml config --quiet
+```
+
+Expected: no errors.
+
+**Step 4: Commit**
+
+```bash
+git add compose.gpu.yml compose.podman-gpu.yml
+git commit -m "feat: assign ollama_research to GPU 1 in Docker and Podman GPU overlays"
+```
+
+---
+
+### Task 8: `Makefile` + `manage.sh` — `DUAL_GPU_MODE` injection and help text
+
+**Files:**
+- Modify: `Makefile`
+- Modify: `manage.sh`
+
+**Step 1: Update `Makefile`**
+
+After the `COMPOSE_OVERRIDE` variable, add `DUAL_GPU_MODE` reading (the grouping makes the ollama default fire when `.env` has no entry — a bare `grep | cut || echo` never falls back, since `cut` exits 0 on empty input):
+
+```makefile
+DUAL_GPU_MODE ?= $(shell (grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null || echo DUAL_GPU_MODE=ollama) | cut -d= -f2)
+```
+
+In the GPU overlay block, find:
+```makefile
+else
+  ifneq (,$(findstring gpu,$(PROFILE)))
+    COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
+  endif
+endif
+```
+
+Replace that block with (the `else` branch is unchanged; the new `ifeq` is appended after the final `endif`):
+```makefile
+else
+  ifneq (,$(findstring gpu,$(PROFILE)))
+    COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
+  endif
+endif
+ifeq ($(PROFILE),dual-gpu)
+  COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
+endif
+```
+
+**Step 2: Update `manage.sh` — profiles help block**
+
+Find the profiles section in `usage()`:
+```bash
+  echo "  dual-gpu     Ollama + Vision + vLLM on GPU 0+1"
+```
+
+Replace with:
+```bash
+  echo "  dual-gpu     Ollama + Vision on GPU 0; GPU 1 set by DUAL_GPU_MODE"
+  echo "               DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1"
+  echo "               DUAL_GPU_MODE=vllm              vllm on GPU 1"
+  echo "               DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split)"
+```
+
+**Step 3: Verify Makefile parses**
+
+```bash
+make help
+```
+
+Expected: help table prints cleanly, no make errors.
+
+**Step 4: Verify manage.sh help**
+
+```bash
+./manage.sh help
+```
+
+Expected: new dual-gpu description appears in profiles section.
+
+**Step 5: Commit**
+
+```bash
+git add Makefile manage.sh
+git commit -m "feat: inject DUAL_GPU_MODE sub-profile in Makefile; update manage.sh help"
+```
+
+---
+
+### Task 9: Integration smoke test
+
+**Goal:** Verify the full chain works for `DUAL_GPU_MODE=ollama` without actually starting Docker (dry-run compose config check).
+
+**Step 1: Write `DUAL_GPU_MODE=ollama` to `.env` temporarily**
+
+```bash
+echo "DUAL_GPU_MODE=ollama" >> .env
+```
+
+**Step 2: Dry-run compose config for dual-gpu + dual-gpu-ollama**
+
+(`docker compose config` indents service names by two spaces, hence `"^  [a-z]"`.)
+
+```bash
+docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-ollama config 2>&1 | grep -E "^  [a-z]|image:|ports:"
+```
+
+Expected output includes:
+- `ollama:` service with port 11434
+- `ollama_research:` service with port 11435
+- `vision:` service
+- `searxng:` service
+- **No** `vllm:` service
+
+**Step 3: Dry-run for `DUAL_GPU_MODE=vllm`**
+
+```bash
+docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-vllm config 2>&1 | grep -E "^  [a-z]|image:|ports:"
+```
+
+Expected:
+- `ollama:` service (port 11434)
+- `vllm:` service (port 8000)
+- **No** `ollama_research:` service
+
+**Step 4: Run full test suite**
+
+```bash
+conda run -n job-seeker python -m pytest tests/ -v
+```
+
+Expected: all existing tests PASS, all new preflight tests PASS.
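The dry-run expectations in Steps 2–3 can be codified as a small helper (illustrative only — the service names are the ones these profiles activate per the design doc):

```python
def expected_services(dual_gpu_mode: str) -> set[str]:
    """Services the dual-gpu profile should start for each DUAL_GPU_MODE."""
    base = {"ollama", "vision", "searxng"}   # always on for dual-gpu
    extra = {
        "ollama": {"ollama_research"},
        "vllm": {"vllm"},
        "mixed": {"ollama_research", "vllm"},
    }[dual_gpu_mode]
    return base | extra
```

Comparing this set against `docker compose ... config --services` output would turn the manual eyeballing above into an assertion.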
+ +**Step 5: Clean up `.env` test entry** + +```bash +# Remove the test DUAL_GPU_MODE line (preflight will re-write it correctly on next run) +sed -i '/^DUAL_GPU_MODE=/d' .env +``` + +**Step 6: Final commit** + +```bash +git add .env # in case preflight rewrote it during testing +git commit -m "feat: dual-gpu DUAL_GPU_MODE complete — ollama/vllm/mixed GPU 1 selection" +```