# Peregrine — Dual-GPU / Dual-Inference Design
**Date:** 2026-02-26
**Status:** Approved — ready for implementation
**Scope:** Peregrine (reference impl; patterns propagate to future products)

---
## Goal
Replace the fixed `dual-gpu` profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a
`DUAL_GPU_MODE` env var that selects which inference stack occupies GPU 1. In the same change,
add a first-run download size warning to preflight so users know how much will be downloaded
before Docker starts pulling images and models.

---
## Modes
| `DUAL_GPU_MODE` | GPU 0 | GPU 1 | Research backend |
|-----------------|-------|-------|-----------------|
| `ollama` (default) | ollama + vision | ollama_research | `ollama_research` |
| `vllm` | ollama + vision | vllm | `vllm_research` |
| `mixed` | ollama + vision | ollama_research + vllm (VRAM-split) | `vllm_research` → `ollama_research` fallback |

`mixed` requires sufficient VRAM on GPU 1. Preflight warns (but does not block) when GPU 1 has
less than 12 GB free before starting in mixed mode.

Cover letters always use `ollama` on GPU 0. Research uses whichever GPU 1 backend is
reachable. The LLM router's `_is_reachable()` check handles this transparently: the
fallback chain simply skips services that aren't running.

---
## Compose Profile Architecture
Docker Compose profiles gate which services start in each mode.
`DUAL_GPU_MODE` is read by the Makefile and passed as a second `--profile` flag.
### Service → profile mapping
| Service | Profiles |
|---------|---------|
| `ollama` | `cpu`, `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `vision` | `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `ollama_research` | `dual-gpu-ollama`, `dual-gpu-mixed` |
| `vllm` | `dual-gpu-vllm`, `dual-gpu-mixed` |
| `finetune` | `finetune` |

User-facing profiles remain: `remote`, `cpu`, `single-gpu`, `dual-gpu`.
Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected by the
Makefile and never typed by the user.
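As a rough illustration of how the user-facing profile and the injected sub-profile combine, the sketch below resolves which profiled services would start. The mapping is copied from the table above; `active_profiles` and `services_for` are hypothetical helper names, not part of Peregrine, and services that carry no profile (which Compose always starts) are not modelled.

```python
# Service -> Compose profiles, copied from the mapping table above.
SERVICE_PROFILES = {
    "ollama": {"cpu", "single-gpu", "dual-gpu-ollama", "dual-gpu-vllm", "dual-gpu-mixed"},
    "vision": {"single-gpu", "dual-gpu-ollama", "dual-gpu-vllm", "dual-gpu-mixed"},
    "ollama_research": {"dual-gpu-ollama", "dual-gpu-mixed"},
    "vllm": {"dual-gpu-vllm", "dual-gpu-mixed"},
    "finetune": {"finetune"},
}


def active_profiles(profile: str, dual_gpu_mode: str = "ollama") -> set[str]:
    """Hypothetical helper: the profiles Compose sees after Makefile injection."""
    profiles = {profile}
    if profile == "dual-gpu":
        # Mirrors the Makefile rule: only dual-gpu gets a second --profile flag.
        profiles.add(f"dual-gpu-{dual_gpu_mode}")
    return profiles


def services_for(profile: str, dual_gpu_mode: str = "ollama") -> list[str]:
    """Profiled services whose profile list intersects the active profiles."""
    active = active_profiles(profile, dual_gpu_mode)
    return sorted(name for name, profs in SERVICE_PROFILES.items() if profs & active)


print(services_for("dual-gpu", "mixed"))
```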

---
## File Changes
### `compose.yml`
For **`ollama`**, add all dual-gpu sub-profiles to `profiles`:
```yaml
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```
**`vision`** follows the same pattern:
```yaml
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```
For **`vllm`**, change `profiles` from `[dual-gpu]` to:
```yaml
profiles: [dual-gpu-vllm, dual-gpu-mixed]
```
**`ollama_research`** is a new service:
```yaml
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama # shared: no double download
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
```
### `compose.gpu.yml`
Add `ollama_research` block (GPU 1). `vllm` stays on GPU 1 as-is:
```yaml
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
```
### `compose.podman-gpu.yml`
Same addition for Podman CDI:
```yaml
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
```
### `Makefile`
Two additions after existing `COMPOSE` detection:
```makefile
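# Read DUAL_GPU_MODE from .env; default to ollama when it is unset or empty.
# (`grep | cut || echo` would never reach the echo: cut exits 0 on empty input.)
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2)
ifeq ($(strip $(DUAL_GPU_MODE)),)
DUAL_GPU_MODE := ollama
endif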
# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
# Sub-profile injection for dual-gpu modes:
ifeq ($(PROFILE),dual-gpu)
COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```
Update the `manage.sh` usage block to document the `dual-gpu` profile with a `DUAL_GPU_MODE` note:
```
dual-gpu   Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
           DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1
           DUAL_GPU_MODE=vllm              vllm on GPU 1
           DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split; see preflight warning)
```
### `scripts/preflight.py`
**1. `_SERVICES` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
```
**2. `_LLM_BACKENDS` — add entries for both new backends:**
```python
"ollama_research": [("ollama_research", "/v1")],
# vllm_research is an alias for vllm's port — preflight updates base_url for both:
"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
```
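To show why listing `vllm_research` under the `"vllm"` service keeps both `base_url` values in step, here is a minimal sketch. It assumes each `_LLM_BACKENDS` entry is a list of `(config key, URL suffix)` pairs, as the snippet above suggests; `sync_base_urls` is a hypothetical stand-in and not the actual body of `update_llm_yaml()`.

```python
# Shape copied from the entries above: service -> [(llm.yaml key, URL suffix)].
_LLM_BACKENDS = {
    "ollama_research": [("ollama_research", "/v1")],
    "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
}


def sync_base_urls(backends: dict, service: str, host: str, port: int) -> dict:
    """Hypothetical helper: point every config key registered for `service` at host:port."""
    for key, suffix in _LLM_BACKENDS.get(service, []):
        if key in backends:  # only touch backends that exist in the config
            backends[key]["base_url"] = f"http://{host}:{port}{suffix}"
    return backends


config = {"vllm": {"enabled": True}, "vllm_research": {"enabled": True}}
sync_base_urls(config, "vllm", "host.docker.internal", 8000)
print(config["vllm_research"]["base_url"])
```

Because both keys hang off the same service entry, a single port change for `vllm` rewrites both URLs in one pass.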
**3. `_DOCKER_INTERNAL` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research", 11434), # container-internal port is always 11434
```
**4. `recommend_profile()` — unchanged** (still returns `"dual-gpu"` for 2 GPUs).
Write `DUAL_GPU_MODE=ollama` to `.env` when first setting up a 2-GPU system.
**5. Mixed-mode VRAM warning**, added after the GPU resource section and before the closing line:
```python
# Relies on `import os` and the `gpus` list built in the GPU resource section.
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
if dual_gpu_mode == "mixed" and len(gpus) >= 2:
    if gpus[1]["vram_free_gb"] < 12:
        print(f"║ ⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
        print("║ Running ollama_research + vllm together may cause OOM.")
        print("║ Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")
```
**6. Download size warning**: a profile-aware block added just before the closing `╚` line:
```
║ Download sizes (first-run estimates)
║   Docker images
║     ollama/ollama       ~800 MB   (shared by ollama + ollama_research)
║     searxng/searxng     ~300 MB
║     app (Python build)  ~1.5 GB
║     vision service      ~3.0 GB   [single-gpu and above]
║     vllm/vllm-openai    ~10.0 GB  [vllm / mixed mode only]
║   Model weights (lazy-loaded on first use)
║     llama3.2:3b         ~2.0 GB   → OLLAMA_MODELS_DIR
║     moondream2          ~1.8 GB   → vision container cache [single-gpu+]
║ Note: ollama + ollama_research share the same model dir — no double download
║ ⚠ Total first-run: ~X GB (models persist between restarts)
```
The total is summed at runtime from the active profile and `DUAL_GPU_MODE`.

Size table (used by the warning calculator):

| Component | Size | Condition |
|-----------|------|-----------|
| `ollama/ollama` image | 800 MB | cpu, single-gpu, dual-gpu |
| `searxng/searxng` image | 300 MB | always |
| app image | 1,500 MB | always |
| vision service image | 3,000 MB | single-gpu, dual-gpu |
| `vllm/vllm-openai` image | 10,000 MB | vllm or mixed mode |
| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu |
| moondream2 weights | 1,800 MB | single-gpu, dual-gpu |
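The calculator that consumes this table could look roughly like the sketch below. The sizes and conditions are copied from the table; the tuple layout and the name `first_run_total_mb` are assumptions, as is the reading that "vllm or mixed mode" only applies under the `dual-gpu` profile.

```python
# (component, size in MB, profiles that need it or None for "always",
#  DUAL_GPU_MODE values that need it or None for "any mode")
SIZES_MB = [
    ("ollama/ollama image", 800, {"cpu", "single-gpu", "dual-gpu"}, None),
    ("searxng/searxng image", 300, None, None),
    ("app image", 1500, None, None),
    ("vision service image", 3000, {"single-gpu", "dual-gpu"}, None),
    ("vllm/vllm-openai image", 10000, {"dual-gpu"}, {"vllm", "mixed"}),
    ("llama3.2:3b weights", 2000, {"cpu", "single-gpu", "dual-gpu"}, None),
    ("moondream2 weights", 1800, {"single-gpu", "dual-gpu"}, None),
]


def first_run_total_mb(profile: str, dual_gpu_mode: str = "ollama") -> int:
    """Sum the components whose profile and mode conditions both hold."""
    total = 0
    for _name, size_mb, profiles, modes in SIZES_MB:
        if profiles is not None and profile not in profiles:
            continue
        if modes is not None and dual_gpu_mode not in modes:
            continue
        total += size_mb
    return total


print(f"~{first_run_total_mb('dual-gpu', 'mixed') / 1000:.1f} GB")  # ~19.4 GB
```

With these figures, `cpu` comes to 4,600 MB, `single-gpu` and `dual-gpu` in `ollama` mode to 9,400 MB, and `dual-gpu` in `vllm` or `mixed` mode to 19,400 MB.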
### `config/llm.yaml`
**Add `vllm_research` backend:**
```yaml
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1 # same port as vllm; preflight keeps in sync
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
```
**Update `research_fallback_order`:**
```yaml
research_fallback_order:
- claude_code
- vllm_research
- ollama_research
- github_copilot
- anthropic
```
`vllm` stays in the main `fallback_order` (cover letters). `vllm_research` is the explicit
research alias for the same service: a different config key on the same port, which makes
routing intent readable in the YAML.

---
## Downstream Compatibility
The LLM router requires no changes. `_is_reachable()` already skips backends that aren't
responding. When `DUAL_GPU_MODE=ollama`, `vllm_research` is unreachable and skipped;
`ollama_research` is up and used. When `DUAL_GPU_MODE=vllm`, the reverse. `mixed` mode
makes both reachable; `vllm_research` wins as the higher-priority entry.
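The selection behaviour described here can be sketched as a first-reachable scan over `research_fallback_order`. This is a simplified model rather than the router's code: `pick_backend` is a hypothetical name, the reachability callable stands in for `_is_reachable()` (whose real signature is not shown in this document), and the example assumes none of the non-GPU backends are reachable.

```python
from typing import Callable, Iterable, Optional

RESEARCH_FALLBACK_ORDER = [
    "claude_code",
    "vllm_research",
    "ollama_research",
    "github_copilot",
    "anthropic",
]


def pick_backend(order: Iterable[str], is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first backend in `order` that responds, or None if none do."""
    for name in order:
        if is_reachable(name):
            return name
    return None


# Which GPU 1 backends are up in each mode (other backends assumed down).
UP_BY_MODE = {
    "ollama": {"ollama_research"},
    "vllm": {"vllm_research"},
    "mixed": {"vllm_research", "ollama_research"},
}

for mode, up in UP_BY_MODE.items():
    print(mode, "->", pick_backend(RESEARCH_FALLBACK_ORDER, up.__contains__))
```

This prints `ollama -> ollama_research`, `vllm -> vllm_research`, and `mixed -> vllm_research`, matching the three cases above.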
Preflight's `update_llm_yaml()` keeps `base_url` values correct for both adopted (external)
and Docker-internal routing automatically, since `vllm_research` is registered under the
`"vllm"` key in `_LLM_BACKENDS`.

---
## Future Considerations
- **Triple-GPU / 3+ service configs:** When a third product is active, extract this pattern
into `circuitforge-core` as a reusable inference topology manager.
- **Dual vLLM:** Two vLLM instances (e.g., different model sizes per task) follow the same
pattern: add `vllm_research` as a separate compose service on its own port.
- **VRAM-aware model selection:** Preflight could suggest smaller models when VRAM is tight
in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
- **Queue optimizer (1-GPU / CPU):** When only one inference backend is available and a batch
of tasks is queued, group by task type (all cover letters first, then all research briefs)
to avoid repeated model context switches. Tracked separately.
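For the queue optimizer idea, the grouping step itself is small. The sketch below is only an illustration under assumed names: tasks are modelled as dicts with a `type` field, and `group_by_task_type` is hypothetical, since the real task representation is tracked elsewhere.

```python
from collections import defaultdict


def group_by_task_type(tasks: list[dict]) -> list[dict]:
    """Reorder tasks so each type runs as one contiguous batch.

    Types keep the order in which they first appear, and tasks keep their
    relative order within a type.
    """
    buckets: dict[str, list[dict]] = defaultdict(list)
    for task in tasks:
        buckets[task["type"]].append(task)
    # dicts preserve insertion order, so batches follow first appearance.
    return [task for batch in buckets.values() for task in batch]


queue = [
    {"id": 1, "type": "cover_letter"},
    {"id": 2, "type": "research"},
    {"id": 3, "type": "cover_letter"},
    {"id": 4, "type": "research"},
]
print([t["id"] for t in group_by_task_type(queue)])  # [1, 3, 2, 4]
```

An interleaved queue would switch models three times here; the grouped order switches once.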