[New Feature] Dual-GPU support with DUAL_GPU_MODE env var #1
Reference: Circuit-Forge/peregrine#1
Summary

Add a `DUAL_GPU_MODE` environment variable to control which inference service runs on GPU 1 in dual-GPU setups. This enables running simultaneous cover letter generation and research without GPU contention.

Motivation
Currently all LLM inference (cover letter gen + research) shares a single GPU. On dual-GPU systems, GPU 1 sits idle. This blocks the common pattern of generating a cover letter while company research runs in parallel.
Design
Full design doc: `docs/plans/2026-02-26-dual-gpu-design.md`
Full TDD implementation plan: `docs/plans/2026-02-26-dual-gpu-plan.md`

Three modes

`DUAL_GPU_MODE` takes one of three values: `ollama`, `vllm`, or `mixed`.

New Docker Compose sub-profiles

- `dual-gpu-ollama` — dual ollama instances on separate GPUs
- `dual-gpu-vllm` — ollama + vllm on separate GPUs
- `dual-gpu-mixed` — ollama + vllm + ollama_research

Changes required

- `config/llm.yaml` — add `vllm_research` alias + `research_fallback_order`
- `scripts/preflight.py` — detect `DUAL_GPU_MODE`, calculate and warn on model download size, write default to `.env`
- `compose.yml` — add `ollama_research` service + profile updates
- `compose.gpu.yml` / `compose.podman-gpu.yml` — GPU device assignments
- `Makefile` — inject `DUAL_GPU_MODE` sub-profile selection
- `manage.sh` — update help text

Tier
Available to all tiers (hardware-gated, not license-gated).
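As an illustration, the `ollama_research` service and GPU pinning described above might look roughly like the following. This is a sketch, not the merged config: the image name, volume name, and profile membership shown here are assumptions; only the port (11435), the shared model dir, and `device_ids: ["1"]` come from this issue.

```yaml
# compose.yml (sketch): second ollama instance for research workloads
services:
  ollama_research:
    image: ollama/ollama              # assumed: same image as the primary instance
    profiles: ["dual-gpu-ollama", "dual-gpu-mixed"]
    ports:
      - "11435:11434"                 # host port 11435, per the issue
    volumes:
      - ollama-models:/root/.ollama   # shared model dir, so models download once
---
# compose.gpu.yml (sketch): pin the research instance to GPU 1
services:
  ollama_research:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```

Because the service carries `profiles`, it only starts when one of the matching `--profile` flags is passed, which is what keeps it out of single-GPU setups.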
Tasks
9-task TDD plan in `docs/plans/2026-02-26-dual-gpu-plan.md`. Ready to implement.

Implementation complete
All 9 tasks from the implementation plan are merged to `main`.

What was shipped
- `DUAL_GPU_MODE=ollama|vllm|mixed` env var selects which service occupies GPU 1
- `ollama_research` service added to `compose.yml` (port 11435, shared model dir — no double download)
- `compose.gpu.yml` + `compose.podman-gpu.yml` assign `ollama_research` to `device_ids: ["1"]`
- Makefile injects `--profile dual-gpu-$(DUAL_GPU_MODE)` alongside `--profile dual-gpu`
- `manage.sh` help updated with mode descriptions
- `preflight.py` gains:
  - `ollama_research` in `_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`
  - `_download_size_mb()` — profile-aware first-run download size estimate
  - `_mixed_mode_vram_warning()` — warns when GPU 1 has < 12 GB free in mixed mode
  - writes `DUAL_GPU_MODE=ollama` default to `.env` on first 2-GPU setup
- `config/llm.yaml` gains `vllm_research` backend; `research_fallback_order` updated
- tests in `tests/test_preflight.py`, all green

Verified
- `docker compose --profile dual-gpu --profile dual-gpu-ollama config` → `ollama`, `ollama_research`, `vision`, `searxng` (no vllm) ✓
- `docker compose --profile dual-gpu --profile dual-gpu-vllm config` → `ollama`, `vllm`, `vision`, `searxng` (no ollama_research) ✓