peregrine/docs/plans/2026-02-26-dual-gpu-design.md


Peregrine — Dual-GPU / Dual-Inference Design

Date: 2026-02-26
Status: Approved — ready for implementation
Scope: Peregrine (reference impl; patterns propagate to future products)


Goal

Replace the fixed dual-gpu profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a DUAL_GPU_MODE env var that selects which inference stack occupies GPU 1. Simultaneously add a first-run download size warning to preflight so users know what they're in for before Docker starts pulling images and models.


Modes

| DUAL_GPU_MODE | GPU 0 | GPU 1 | Research backend |
|---|---|---|---|
| ollama (default) | ollama + vision | ollama_research | ollama_research |
| vllm | ollama + vision | vllm | vllm_research |
| mixed | ollama + vision | ollama_research + vllm (VRAM-split) | vllm_research, with ollama_research fallback |

mixed requires sufficient VRAM on GPU 1. Preflight warns (but does not block) when starting in mixed mode with less than 12 GB free on GPU 1.

Cover letters always use ollama on GPU 0. Research uses whichever GPU 1 backend is reachable. The LLM router's _is_reachable() check handles this transparently — the fallback chain simply skips services that aren't running.
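The skip-if-unreachable behaviour can be sketched in a few lines. `pick_backend` and the stubbed reachable set are illustrative, not the router's actual code; the chain matches the research_fallback_order defined in config/llm.yaml below.

```python
# Minimal sketch of the fallback behaviour described above. The real router's
# _is_reachable() does an HTTP probe; here reachability is stubbed with a set
# lookup so the selection logic can be shown in isolation.

def pick_backend(fallback_order, reachable):
    """Return the first backend in the chain that is currently reachable."""
    for name in fallback_order:
        if name in reachable:
            return name
    return None

research_chain = ["claude_code", "vllm_research", "ollama_research",
                  "github_copilot", "anthropic"]

# DUAL_GPU_MODE=ollama: only ollama_research is up on GPU 1.
print(pick_backend(research_chain, {"ollama_research"}))   # ollama_research

# DUAL_GPU_MODE=mixed: both are up; vllm_research wins as the higher entry.
print(pick_backend(research_chain, {"ollama_research", "vllm_research"}))
```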


Compose Profile Architecture

Docker Compose profiles are used to gate which services start per mode. DUAL_GPU_MODE is read by the Makefile and passed as a second --profile flag.

Service → profile mapping

| Service | Profiles |
|---|---|
| ollama | cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed |
| vision | single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed |
| ollama_research | dual-gpu-ollama, dual-gpu-mixed |
| vllm | dual-gpu-vllm, dual-gpu-mixed |
| finetune | finetune |

User-facing profiles remain: remote, cpu, single-gpu, dual-gpu. Sub-profiles (dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed) are injected by the Makefile and never typed by the user.
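The injection rule can be sketched as a small resolver. `compose_profiles` is a hypothetical helper mirroring the Makefile logic; it does not exist in the repo.

```python
# Sketch of sub-profile injection: the user picks one of the public profiles,
# and only dual-gpu expands into a hidden sub-profile driven by DUAL_GPU_MODE.

VALID_MODES = {"ollama", "vllm", "mixed"}

def compose_profiles(profile: str, dual_gpu_mode: str = "ollama") -> list[str]:
    if profile != "dual-gpu":
        return [profile]  # remote, cpu, single-gpu pass through unchanged
    if dual_gpu_mode not in VALID_MODES:
        raise ValueError(f"unknown DUAL_GPU_MODE: {dual_gpu_mode!r}")
    return [profile, f"dual-gpu-{dual_gpu_mode}"]

print(compose_profiles("single-gpu"))         # ['single-gpu']
print(compose_profiles("dual-gpu", "mixed"))  # ['dual-gpu', 'dual-gpu-mixed']
```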


File Changes

compose.yml

ollama — add all dual-gpu sub-profiles to profiles:

profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]

vision — same pattern:

profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]

vllm — change from [dual-gpu] to:

profiles: [dual-gpu-vllm, dual-gpu-mixed]

ollama_research — new service:

ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama  # shared — no double download
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped

compose.gpu.yml

Add ollama_research block (GPU 1). vllm stays on GPU 1 as-is:

ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]

compose.podman-gpu.yml

Same addition for Podman CDI:

ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []

Makefile

Two additions after existing COMPOSE detection:

DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 || echo ollama)

# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
# Sub-profile injection for dual-gpu modes:
ifeq ($(PROFILE),dual-gpu)
  COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif

Update manage.sh usage block to document dual-gpu profile with DUAL_GPU_MODE note:

dual-gpu     Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
             DUAL_GPU_MODE=ollama  (default) ollama_research on GPU 1
             DUAL_GPU_MODE=vllm             vllm on GPU 1
             DUAL_GPU_MODE=mixed            both on GPU 1 (VRAM-split; see preflight warning)

scripts/preflight.py

1. _SERVICES — add ollama_research:

"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),

2. _LLM_BACKENDS — add entries for both new backends:

"ollama_research": [("ollama_research", "/v1")],
# vllm_research is an alias for vllm's port — preflight updates base_url for both:
"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],

3. _DOCKER_INTERNAL — add ollama_research:

"ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434

4. recommend_profile() — unchanged (still returns "dual-gpu" for 2 GPUs). Write DUAL_GPU_MODE=ollama to .env when first setting up a 2-GPU system.

5. Mixed-mode VRAM warning — after GPU resource section, before closing line:

dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
if dual_gpu_mode == "mixed" and len(gpus) >= 2:
    if gpus[1]["vram_free_gb"] < 12:
        print(f"║  ⚠  DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
        print("║     Running ollama_research + vllm together may cause OOM.")
        print("║     Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")

6. Download size warning — profile-aware block added just before the closing line:

║  Download sizes (first-run estimates)
║    Docker images
║      ollama/ollama        ~800 MB  (shared by ollama + ollama_research)
║      searxng/searxng      ~300 MB
║      app (Python build)  ~1.5 GB
║      vision service      ~3.0 GB  [single-gpu and above]
║      vllm/vllm-openai   ~10.0 GB  [vllm / mixed mode only]
║
║    Model weights  (lazy-loaded on first use)
║      llama3.2:3b          ~2.0 GB  → OLLAMA_MODELS_DIR
║      moondream2            ~1.8 GB  → vision container cache  [single-gpu+]
║    Note: ollama + ollama_research share the same model dir — no double download
║
║  ⚠  Total first-run: ~X GB  (models persist between restarts)

Total is summed at runtime based on active profile + DUAL_GPU_MODE.

Size table (used by the warning calculator):

| Component | Size | Condition |
|---|---|---|
| ollama/ollama image | 800 MB | cpu, single-gpu, dual-gpu |
| searxng/searxng image | 300 MB | always |
| app image | 1,500 MB | always |
| vision service image | 3,000 MB | single-gpu, dual-gpu |
| vllm/vllm-openai image | 10,000 MB | vllm or mixed mode |
| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu |
| moondream2 weights | 1,800 MB | single-gpu, dual-gpu |
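The calculator this table feeds can be sketched as follows. The sizes mirror the table; the predicate encoding and function name are illustrative assumptions, not the actual preflight code.

```python
# Sum first-run download sizes for the active profile + DUAL_GPU_MODE.
# Each entry: (label, size in MB, predicate over (profile, mode)).
SIZES_MB = [
    ("ollama/ollama image",    800,   lambda p, m: p in {"cpu", "single-gpu", "dual-gpu"}),
    ("searxng/searxng image",  300,   lambda p, m: True),
    ("app image",              1500,  lambda p, m: True),
    ("vision service image",   3000,  lambda p, m: p in {"single-gpu", "dual-gpu"}),
    ("vllm/vllm-openai image", 10000, lambda p, m: p == "dual-gpu" and m in {"vllm", "mixed"}),
    ("llama3.2:3b weights",    2000,  lambda p, m: p in {"cpu", "single-gpu", "dual-gpu"}),
    ("moondream2 weights",     1800,  lambda p, m: p in {"single-gpu", "dual-gpu"}),
]

def first_run_total_gb(profile: str, dual_gpu_mode: str = "ollama") -> float:
    total_mb = sum(mb for _, mb, cond in SIZES_MB if cond(profile, dual_gpu_mode))
    return total_mb / 1000

print(f"{first_run_total_gb('dual-gpu', 'vllm'):.1f} GB")  # 19.4 GB
```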

config/llm.yaml

Add vllm_research backend:

vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1  # same port as vllm; preflight keeps in sync
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat

Update research_fallback_order:

research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic

vllm stays in the main fallback_order (cover letters). vllm_research is an explicit research alias for the same service — a different config key pointing at the same port, which makes the routing intent readable in the YAML.


Downstream Compatibility

The LLM router requires no changes. _is_reachable() already skips backends that aren't responding. When DUAL_GPU_MODE=ollama, vllm_research is unreachable and skipped; ollama_research is up and used. When DUAL_GPU_MODE=vllm, the reverse. mixed mode makes both reachable; vllm_research wins as the higher-priority entry.

Preflight's update_llm_yaml() keeps base_url values correct for both adopted (external) and Docker-internal routing automatically, since vllm_research is registered under the "vllm" key in _LLM_BACKENDS.
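That fan-out can be sketched using the _LLM_BACKENDS shape from the preflight section; the update function body here is an assumption, not the actual update_llm_yaml() implementation.

```python
# One detected service ("vllm") maps to multiple llm.yaml backend keys, so
# vllm_research's base_url is rewritten whenever vllm's is.
_LLM_BACKENDS = {
    "ollama_research": [("ollama_research", "/v1")],
    "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
}

def update_llm_yaml(config: dict, service: str, host: str, port: int) -> None:
    """Rewrite base_url for every backend key registered under a service."""
    for backend_key, path in _LLM_BACKENDS.get(service, []):
        config.setdefault(backend_key, {})["base_url"] = f"http://{host}:{port}{path}"

cfg = {}
update_llm_yaml(cfg, "vllm", "host.docker.internal", 8000)
# cfg["vllm"] and cfg["vllm_research"] now share one endpoint
```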


Future Considerations

  • Triple-GPU / 3+ service configs: When a third product is active, extract this pattern into circuitforge-core as a reusable inference topology manager.
  • Dual vLLM: Two vLLM instances (e.g., different model sizes per task) follow the same pattern — add vllm_research as a separate compose service on its own port.
  • VRAM-aware model selection: Preflight could suggest smaller models when VRAM is tight in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
  • Queue optimizer (1-GPU / CPU): When only one inference backend is available and a batch of tasks is queued, group by task type (all cover letters first, then all research briefs) to avoid repeated model context switches. Tracked separately.