peregrine/docs/plans/2026-02-26-dual-gpu-design.md

# Peregrine — Dual-GPU / Dual-Inference Design

**Date:** 2026-02-26
**Status:** Approved — ready for implementation
**Scope:** Peregrine (reference impl; patterns propagate to future products)

---

## Goal

Replace the fixed `dual-gpu` profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a
`DUAL_GPU_MODE` env var that selects which inference stack occupies GPU 1. Simultaneously
add a first-run download size warning to preflight so users know what they're in for before
Docker starts pulling images and models.

---

## Modes

| `DUAL_GPU_MODE` | GPU 0 | GPU 1 | Research backend |
|-----------------|-------|-------|-----------------|
| `ollama` (default) | ollama + vision | ollama_research | `ollama_research` |
| `vllm` | ollama + vision | vllm | `vllm_research` |
| `mixed` | ollama + vision | ollama_research + vllm (VRAM-split) | `vllm_research` → `ollama_research` fallback |

`mixed` requires sufficient VRAM on GPU 1. Preflight warns (not blocks) when GPU 1 has
< 12 GB free before starting in mixed mode.

Cover letters always use `ollama` on GPU 0. Research uses whichever GPU 1 backend is
reachable. The LLM router's `_is_reachable()` check handles this transparently — the
fallback chain simply skips services that aren't running.

---

## Compose Profile Architecture

Docker Compose profiles used to gate which services start per mode.
`DUAL_GPU_MODE` is read by the Makefile and passed as a second `--profile` flag.

### Service → profile mapping

| Service | Profiles |
|---------|---------|
| `ollama` | `cpu`, `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `vision` | `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `ollama_research` | `dual-gpu-ollama`, `dual-gpu-mixed` |
| `vllm` | `dual-gpu-vllm`, `dual-gpu-mixed` |
| `finetune` | `finetune` |

User-facing profiles remain: `remote`, `cpu`, `single-gpu`, `dual-gpu`.
Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected by the
Makefile and never typed by the user.

---

## File Changes

### `compose.yml`

**`ollama`** — add all dual-gpu sub-profiles to `profiles`:
```yaml
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**`vision`** — same pattern:
```yaml
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**`vllm`** — change from `[dual-gpu]` to:
```yaml
profiles: [dual-gpu-vllm, dual-gpu-mixed]
```

**`ollama_research`** — new service:
```yaml
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama  # shared — no double download
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
```

### `compose.gpu.yml`

Add `ollama_research` block (GPU 1). `vllm` stays on GPU 1 as-is:
```yaml
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
```

### `compose.podman-gpu.yml`

Same addition for Podman CDI:
```yaml
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
```

### `Makefile`

Two additions after existing `COMPOSE` detection:

```makefile
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 || echo ollama)

# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
# Sub-profile injection for dual-gpu modes:
ifeq ($(PROFILE),dual-gpu)
  COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```

Update `manage.sh` usage block to document `dual-gpu` profile with `DUAL_GPU_MODE` note:
```
dual-gpu     Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
             DUAL_GPU_MODE=ollama  (default) ollama_research on GPU 1
             DUAL_GPU_MODE=vllm             vllm on GPU 1
             DUAL_GPU_MODE=mixed            both on GPU 1 (VRAM-split; see preflight warning)
```

### `scripts/preflight.py`

**1. `_SERVICES` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
```

**2. `_LLM_BACKENDS` — add entries for both new backends:**
```python
"ollama_research": [("ollama_research", "/v1")],
# vllm_research is an alias for vllm's port — preflight updates base_url for both:
"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
```

**3. `_DOCKER_INTERNAL` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
```

**4. `recommend_profile()` — unchanged** (still returns `"dual-gpu"` for 2 GPUs).
Write `DUAL_GPU_MODE=ollama` to `.env` when first setting up a 2-GPU system.

**5. Mixed-mode VRAM warning** — after GPU resource section, before closing line:
```python
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
if dual_gpu_mode == "mixed" and len(gpus) >= 2:
    if gpus[1]["vram_free_gb"] < 12:
        print(f"║  ⚠  DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
        print(f"║     Running ollama_research + vllm together may cause OOM.")
        print(f"║     Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")
```

**6. Download size warning** — profile-aware block added just before the closing `╚` line:

```
║  Download sizes (first-run estimates)
║    Docker images
║      ollama/ollama        ~800 MB  (shared by ollama + ollama_research)
║      searxng/searxng      ~300 MB
║      app (Python build)  ~1.5 GB
║      vision service      ~3.0 GB  [single-gpu and above]
║      vllm/vllm-openai   ~10.0 GB  [vllm / mixed mode only]
║
║    Model weights  (lazy-loaded on first use)
║      llama3.2:3b          ~2.0 GB  → OLLAMA_MODELS_DIR
║      moondream2            ~1.8 GB  → vision container cache  [single-gpu+]
║    Note: ollama + ollama_research share the same model dir — no double download
║
║  ⚠  Total first-run: ~X GB  (models persist between restarts)
```

Total is summed at runtime based on active profile + `DUAL_GPU_MODE`.

Size table (used by the warning calculator):
| Component | Size | Condition |
|-----------|------|-----------|
| `ollama/ollama` image | 800 MB | cpu, single-gpu, dual-gpu |
| `searxng/searxng` image | 300 MB | always |
| app image | 1,500 MB | always |
| vision service image | 3,000 MB | single-gpu, dual-gpu |
| `vllm/vllm-openai` image | 10,000 MB | vllm or mixed mode |
| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu |
| moondream2 weights | 1,800 MB | single-gpu, dual-gpu |

### `config/llm.yaml`

**Add `vllm_research` backend:**
```yaml
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1  # same port as vllm; preflight keeps in sync
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
```

**Update `research_fallback_order`:**
```yaml
research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic
```

`vllm` stays in the main `fallback_order` (cover letters). `vllm_research` is the explicit
research alias for the same service — different config key, same port, makes routing intent
readable in the YAML.

---

## Downstream Compatibility

The LLM router requires no changes. `_is_reachable()` already skips backends that aren't
responding. When `DUAL_GPU_MODE=ollama`, `vllm_research` is unreachable and skipped;
`ollama_research` is up and used. When `DUAL_GPU_MODE=vllm`, the reverse. `mixed` mode
makes both reachable; `vllm_research` wins as the higher-priority entry.

Preflight's `update_llm_yaml()` keeps `base_url` values correct for both adopted (external)
and Docker-internal routing automatically, since `vllm_research` is registered under the
`"vllm"` key in `_LLM_BACKENDS`.

---

## Future Considerations

- **Triple-GPU / 3+ service configs:** When a third product is active, extract this pattern
  into `circuitforge-core` as a reusable inference topology manager.
- **Dual vLLM:** Two vLLM instances (e.g., different model sizes per task) follows the same
  pattern — add `vllm_research` as a separate compose service on its own port.
- **VRAM-aware model selection:** Preflight could suggest smaller models when VRAM is tight
  in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
- **Queue optimizer (1-GPU / CPU):** When only one inference backend is available and a batch
  of tasks is queued, group by task type (all cover letters first, then all research briefs)
  to avoid repeated model context switches. Tracked separately.