[New Feature] Dual-GPU support with DUAL_GPU_MODE env var #1

Closed
opened 2026-02-27 00:01:59 -08:00 by pyr0ball · 1 comment

## Summary

Add a `DUAL_GPU_MODE` environment variable to control which inference service runs on GPU 1 in dual-GPU setups. Enables running cover letter generation and research simultaneously without GPU contention.

## Motivation

Currently, all LLM inference (cover letter generation + research) shares a single GPU. On dual-GPU systems, GPU 1 sits idle. This blocks the common pattern of generating a cover letter while company research runs in parallel.

## Design

Full design doc: `docs/plans/2026-02-26-dual-gpu-design.md`
Full TDD implementation plan: `docs/plans/2026-02-26-dual-gpu-plan.md`

### Three modes

| `DUAL_GPU_MODE` | GPU 0 | GPU 1 |
|-----------------|-------|-------|
| `ollama` | ollama (cover letters) | ollama_research (port 11435) |
| `vllm` | ollama (cover letters) | vllm (research) |
| `mixed` | ollama (cover letters) | both vllm + ollama_research (VRAM warning if < 12 GB free) |
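The mode-to-GPU-1 mapping in the table above can be sketched in Python. This is a hypothetical illustration, not code from the repo; the function name and structure are invented for clarity:

```python
# Which services occupy GPU 1 for each DUAL_GPU_MODE value.
# GPU 0 always runs the cover-letter ollama instance, so only GPU 1 varies.
GPU1_SERVICES = {
    "ollama": ["ollama_research"],         # second ollama instance on port 11435
    "vllm": ["vllm"],                      # vllm handles research
    "mixed": ["vllm", "ollama_research"],  # both share GPU 1 (VRAM warning below 12 GB free)
}


def gpu1_services(mode: str) -> list[str]:
    """Return the services assigned to GPU 1 for a given DUAL_GPU_MODE."""
    try:
        return GPU1_SERVICES[mode]
    except KeyError:
        raise ValueError(
            f"DUAL_GPU_MODE must be one of {sorted(GPU1_SERVICES)}, got {mode!r}"
        )
```

Rejecting unknown modes up front (rather than silently falling back) matches the preflight-validation spirit of the design.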

### New Docker Compose sub-profiles

- `dual-gpu-ollama` — dual ollama instances on separate GPUs
- `dual-gpu-vllm` — ollama + vllm on separate GPUs
- `dual-gpu-mixed` — ollama + vllm + ollama_research

### Changes required

- `config/llm.yaml` — add `vllm_research` alias + `research_fallback_order`
- `scripts/preflight.py` — detect `DUAL_GPU_MODE`, calculate and warn on model download size, write default to `.env`
- `compose.yml` — add `ollama_research` service + profile updates
- `compose.gpu.yml` / `compose.podman-gpu.yml` — GPU device assignments
- `Makefile` — inject `DUAL_GPU_MODE` sub-profile selection
- `manage.sh` — update help text

## Tier

Available to all tiers (hardware-gated, not license-gated).

## Tasks

9-task TDD plan in `docs/plans/2026-02-26-dual-gpu-plan.md`. Ready to implement.

---

**pyr0ball** (Author) commented:

## Implementation complete

All 9 tasks from the [implementation plan](docs/plans/2026-02-26-dual-gpu-plan.md) are merged to `main`.

### What was shipped

- `DUAL_GPU_MODE=ollama|vllm|mixed` env var selects which service occupies GPU 1
- `ollama_research` service added to `compose.yml` (port 11435, shared model dir — no double download)
- `compose.gpu.yml` + `compose.podman-gpu.yml` assign `ollama_research` to `device_ids: ["1"]`
- `Makefile` injects `--profile dual-gpu-$(DUAL_GPU_MODE)` alongside `--profile dual-gpu`
- `manage.sh` help updated with mode descriptions
- `preflight.py` gains:
  - `ollama_research` in `_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`
  - `_download_size_mb()` — profile-aware first-run download size estimate
  - `_mixed_mode_vram_warning()` — warns when GPU 1 has < 12 GB free in mixed mode
  - writes `DUAL_GPU_MODE=ollama` default to `.env` on first 2-GPU setup
- `config/llm.yaml` gains `vllm_research` backend; `research_fallback_order` updated
- 16 new unit tests in `tests/test_preflight.py`, all green

### Verified

- `docker compose --profile dual-gpu --profile dual-gpu-ollama config` → `ollama`, `ollama_research`, `vision`, `searxng` (no vllm) ✓
- `docker compose --profile dual-gpu --profile dual-gpu-vllm config` → `ollama`, `vllm`, `vision`, `searxng` (no ollama_research) ✓
- Full test suite: **401 passed, 0 failed**
- gitleaks pre-push: clean (165 commits scanned)
Reference: Circuit-Forge/peregrine#1