# Peregrine — Dual-GPU / Dual-Inference Design

**Date:** 2026-02-26 · **Status:** Approved — ready for implementation · **Scope:** Peregrine (reference impl; patterns propagate to future products)
## Goal

Replace the fixed dual-gpu profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a `DUAL_GPU_MODE` env var that selects which inference stack occupies GPU 1. Simultaneously, add a first-run download size warning to preflight so users know what they're in for before Docker starts pulling images and models.
## Modes

| DUAL_GPU_MODE | GPU 0 | GPU 1 | Research backend |
|---|---|---|---|
| ollama (default) | ollama + vision | ollama_research | ollama_research |
| vllm | ollama + vision | vllm | vllm_research |
| mixed | ollama + vision | ollama_research + vllm (VRAM-split) | vllm_research → ollama_research fallback |
`mixed` requires sufficient VRAM on GPU 1. Preflight warns (but does not block) when GPU 1 has less than 12 GB free before starting in mixed mode.
Cover letters always use ollama on GPU 0. Research uses whichever GPU 1 backend is
reachable. The LLM router's _is_reachable() check handles this transparently — the
fallback chain simply skips services that aren't running.
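The skip-unreachable behavior can be sketched as follows (a minimal illustration, not the router's actual code; the helper names and registry shape here are assumptions):

```python
import socket

def _is_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend(fallback_order, registry, reachable=_is_reachable):
    """Return the first backend in fallback_order whose endpoint answers;
    unreachable services are simply skipped."""
    for name in fallback_order:
        host, port = registry[name]
        if reachable(host, port):
            return name
    return None

# Hypothetical registry for the two GPU-1 research backends:
registry = {
    "vllm_research": ("localhost", 8000),
    "ollama_research": ("localhost", 11435),
}
```

With `DUAL_GPU_MODE=ollama`, port 8000 refuses connections, so `pick_backend` falls through to `ollama_research`; in mixed mode both answer and `vllm_research` wins as the earlier entry.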
## Compose Profile Architecture

Docker Compose profiles gate which services start in each mode. `DUAL_GPU_MODE` is read by the Makefile and passed as a second `--profile` flag.
### Service → profile mapping

| Service | Profiles |
|---|---|
| ollama | cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed |
| vision | single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed |
| ollama_research | dual-gpu-ollama, dual-gpu-mixed |
| vllm | dual-gpu-vllm, dual-gpu-mixed |
| finetune | finetune |
User-facing profiles remain: remote, cpu, single-gpu, dual-gpu.
Sub-profiles (dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed) are injected by the
Makefile and never typed by the user.
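The sub-profile injection can be sketched as a small function (`build_profile_flags` is a hypothetical helper mirroring the Makefile logic, for illustration only):

```python
def build_profile_flags(profile: str, dual_gpu_mode: str = "ollama") -> list[str]:
    """Mirror the Makefile's injection: the dual-gpu profile gains a
    second --profile flag derived from DUAL_GPU_MODE."""
    flags = ["--profile", profile]
    if profile == "dual-gpu":
        flags += ["--profile", f"dual-gpu-{dual_gpu_mode}"]
    return flags
```

For example, `build_profile_flags("dual-gpu", "mixed")` yields `["--profile", "dual-gpu", "--profile", "dual-gpu-mixed"]`, i.e. the effective invocation is `docker compose --profile dual-gpu --profile dual-gpu-mixed up`.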
## File Changes

### compose.yml

- `ollama` — add all dual-gpu sub-profiles:
  `profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]`
- `vision` — same pattern:
  `profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]`
- `vllm` — change from `[dual-gpu]` to:
  `profiles: [dual-gpu-vllm, dual-gpu-mixed]`
`ollama_research` — new service:

```yaml
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama  # shared — no double download
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
```
### compose.gpu.yml

Add an `ollama_research` block (GPU 1). `vllm` stays on GPU 1 as-is:

```yaml
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
```
### compose.podman-gpu.yml

Same addition for Podman CDI:

```yaml
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
```
### Makefile

Two additions after the existing COMPOSE detection:

```makefile
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 || echo ollama)

# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
# Sub-profile injection for dual-gpu modes:
ifeq ($(PROFILE),dual-gpu)
COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```
Update the manage.sh usage block to document the dual-gpu profile with a DUAL_GPU_MODE note:

```
dual-gpu    Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
            DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1
            DUAL_GPU_MODE=vllm              vllm on GPU 1
            DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split; see preflight warning)
```
### scripts/preflight.py

1. `_SERVICES` — add `ollama_research`:
   ```python
   "ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
   ```
2. `_LLM_BACKENDS` — add entries for both new backends:
   ```python
   "ollama_research": [("ollama_research", "/v1")],
   # vllm_research is an alias for vllm's port — preflight updates base_url for both:
   "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
   ```
3. `_DOCKER_INTERNAL` — add `ollama_research`:
   ```python
   "ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
   ```
4. `recommend_profile()` — unchanged (still returns `"dual-gpu"` for 2 GPUs). Preflight writes `DUAL_GPU_MODE=ollama` to `.env` when first setting up a 2-GPU system.
5. Mixed-mode VRAM warning — after the GPU resource section, before the closing line:
   ```python
   dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
   if dual_gpu_mode == "mixed" and len(gpus) >= 2:
       if gpus[1]["vram_free_gb"] < 12:
           print(f"║ ⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
           print("║   Running ollama_research + vllm together may cause OOM.")
           print("║   Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")
   ```
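The `gpus` list the warning iterates over could be populated from `nvidia-smi`; a minimal sketch (the query flags are real `nvidia-smi` options, but the surrounding structure is an assumption about preflight's internals):

```python
import subprocess

def query_gpus() -> list[dict]:
    """Read free VRAM per GPU from nvidia-smi into the shape the
    mixed-mode warning expects: [{"vram_free_gb": float}, ...]."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)

def parse_free_mib(text: str) -> list[dict]:
    # nvidia-smi reports MiB with these flags; convert to the GB figure
    # used by the threshold check.
    return [{"vram_free_gb": int(line) / 1024}
            for line in text.strip().splitlines() if line.strip()]
```

For example, a machine reporting `24576` and `10240` MiB free would trip the 12 GB warning for GPU 1.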
6. Download size warning — profile-aware block added just before the closing `╚` line:
   ```
   ║ Download sizes (first-run estimates)
   ║   Docker images
   ║     ollama/ollama       ~800 MB    (shared by ollama + ollama_research)
   ║     searxng/searxng     ~300 MB
   ║     app (Python build)  ~1.5 GB
   ║     vision service      ~3.0 GB    [single-gpu and above]
   ║     vllm/vllm-openai    ~10.0 GB   [vllm / mixed mode only]
   ║
   ║   Model weights (lazy-loaded on first use)
   ║     llama3.2:3b         ~2.0 GB    → OLLAMA_MODELS_DIR
   ║     moondream2          ~1.8 GB    → vision container cache [single-gpu+]
   ║     Note: ollama + ollama_research share the same model dir — no double download
   ║
   ║ ⚠ Total first-run: ~X GB (models persist between restarts)
   ```
   The total is summed at runtime based on the active profile + `DUAL_GPU_MODE`.
Size table (used by the warning calculator):

| Component | Size | Condition |
|---|---|---|
| ollama/ollama image | 800 MB | cpu, single-gpu, dual-gpu |
| searxng/searxng image | 300 MB | always |
| app image | 1,500 MB | always |
| vision service image | 3,000 MB | single-gpu, dual-gpu |
| vllm/vllm-openai image | 10,000 MB | vllm or mixed mode |
| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu |
| moondream2 weights | 1,800 MB | single-gpu, dual-gpu |
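A sketch of the warning calculator implied by the table (component names and sizes come from the table; the function itself is illustrative, not the shipped implementation):

```python
# (component, size_mb, condition) — condition is a predicate over (profile, mode)
_SIZES = [
    ("ollama/ollama image",    800,    lambda p, m: p in ("cpu", "single-gpu", "dual-gpu")),
    ("searxng/searxng image",  300,    lambda p, m: True),
    ("app image",              1_500,  lambda p, m: True),
    ("vision service image",   3_000,  lambda p, m: p in ("single-gpu", "dual-gpu")),
    ("vllm/vllm-openai image", 10_000, lambda p, m: p == "dual-gpu" and m in ("vllm", "mixed")),
    ("llama3.2:3b weights",    2_000,  lambda p, m: p in ("cpu", "single-gpu", "dual-gpu")),
    ("moondream2 weights",     1_800,  lambda p, m: p in ("single-gpu", "dual-gpu")),
]

def first_run_total_gb(profile: str, dual_gpu_mode: str = "ollama") -> float:
    """Sum the applicable download sizes for the active profile + DUAL_GPU_MODE."""
    total_mb = sum(size for _, size, cond in _SIZES if cond(profile, dual_gpu_mode))
    return round(total_mb / 1000, 1)
```

For example, `cpu` comes to ~4.6 GB, while `dual-gpu` with `DUAL_GPU_MODE=vllm` comes to ~19.4 GB (the vLLM image dominates).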
### config/llm.yaml

Add a `vllm_research` backend:

```yaml
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1  # same port as vllm; preflight keeps in sync
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
```
Update `research_fallback_order`:

```yaml
research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic
```
`vllm` stays in the main `fallback_order` (cover letters). `vllm_research` is the explicit research alias for the same service: a different config key on the same port, which makes the routing intent readable in the YAML.
## Downstream Compatibility
The LLM router requires no changes. _is_reachable() already skips backends that aren't
responding. When DUAL_GPU_MODE=ollama, vllm_research is unreachable and skipped;
ollama_research is up and used. When DUAL_GPU_MODE=vllm, the reverse. mixed mode
makes both reachable; vllm_research wins as the higher-priority entry.
Preflight's update_llm_yaml() keeps base_url values correct for both adopted (external)
and Docker-internal routing automatically, since vllm_research is registered under the
"vllm" key in _LLM_BACKENDS.
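How registering `vllm_research` under the `"vllm"` key keeps both entries in sync can be sketched like this (a hypothetical simplification of `update_llm_yaml()`; the real function also chooses between adopted and Docker-internal hosts):

```python
_LLM_BACKENDS = {
    "ollama_research": [("ollama_research", "/v1")],
    "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
}

def update_llm_yaml(config: dict, service: str, host: str, port: int) -> dict:
    """Rewrite base_url for every config key registered under `service`,
    so vllm and vllm_research always point at the same endpoint."""
    for key, path in _LLM_BACKENDS.get(service, []):
        if key in config:
            config[key]["base_url"] = f"http://{host}:{port}{path}"
    return config
```

One probe of the vllm service thus updates both YAML entries, which is why no separate health check is needed for the alias.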
## Future Considerations

- Triple-GPU / 3+ service configs: when a third product is active, extract this pattern into `circuitforge-core` as a reusable inference topology manager.
- Dual vLLM: two vLLM instances (e.g., different model sizes per task) follow the same pattern — add `vllm_research` as a separate compose service on its own port.
- VRAM-aware model selection: preflight could suggest smaller models when VRAM is tight in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
- Queue optimizer (1-GPU / CPU): when only one inference backend is available and a batch of tasks is queued, group by task type (all cover letters first, then all research briefs) to avoid repeated model context switches. Tracked separately.
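The queue-optimizer idea amounts to a stable sort by task type (an illustrative sketch only; the feature is tracked separately and the task shape here is assumed):

```python
def group_by_task_type(queue: list[dict], order=("cover_letter", "research")) -> list[dict]:
    """Stable-sort queued tasks so same-type tasks run back-to-back,
    minimizing model context switches on a single backend."""
    rank = {t: i for i, t in enumerate(order)}
    return sorted(queue, key=lambda task: rank.get(task["type"], len(order)))

queue = [
    {"id": 1, "type": "research"},
    {"id": 2, "type": "cover_letter"},
    {"id": 3, "type": "research"},
    {"id": 4, "type": "cover_letter"},
]
```

Because `sorted` is stable, relative order within each type is preserved: the queue above runs as 2, 4, 1, 3.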