[New Feature] Dual-GPU support with DUAL_GPU_MODE env var #1

Closed
opened 2026-02-27 00:01:59 -08:00 by pyr0ball · 1 comment

## Summary

Add a `DUAL_GPU_MODE` environment variable to control which inference service runs on GPU 1 in dual-GPU setups. Enables running cover letter generation and research simultaneously without GPU contention.

## Motivation

Currently, all LLM inference (cover letter generation + research) shares a single GPU. On dual-GPU systems, GPU 1 sits idle. This blocks the common pattern of generating a cover letter while company research runs in parallel.

## Design

Full design doc: `docs/plans/2026-02-26-dual-gpu-design.md`
Full TDD implementation plan: `docs/plans/2026-02-26-dual-gpu-plan.md`

### Three modes

| `DUAL_GPU_MODE` | GPU 0 | GPU 1 |
|-----------------|-------|-------|
| `ollama` | ollama (cover letters) | ollama_research (port 11435) |
| `vllm` | ollama (cover letters) | vllm (research) |
| `mixed` | ollama (cover letters) | both vllm + ollama_research (VRAM warning if < 12 GB free) |
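The mode-to-GPU-1 mapping in the table above can be sketched in Python. This is a hypothetical illustration, not code from the repo; the function name and structure are invented for clarity:

```python
# Which services occupy GPU 1 for each DUAL_GPU_MODE value.
# GPU 0 always runs the cover-letter ollama instance, so only GPU 1 varies.
GPU1_SERVICES = {
    "ollama": ["ollama_research"],         # second ollama instance on port 11435
    "vllm": ["vllm"],                      # vllm handles research
    "mixed": ["vllm", "ollama_research"],  # both share GPU 1 (VRAM warning below 12 GB free)
}


def gpu1_services(mode: str) -> list[str]:
    """Return the services assigned to GPU 1 for a given DUAL_GPU_MODE."""
    try:
        return GPU1_SERVICES[mode]
    except KeyError:
        raise ValueError(
            f"DUAL_GPU_MODE must be one of {sorted(GPU1_SERVICES)}, got {mode!r}"
        )
```

Rejecting unknown modes up front (rather than silently falling back) matches the preflight-validation spirit of the design.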

### New Docker Compose sub-profiles

- `dual-gpu-ollama` — dual ollama instances on separate GPUs
- `dual-gpu-vllm` — ollama + vllm on separate GPUs
- `dual-gpu-mixed` — ollama + vllm + ollama_research

### Changes required

- `config/llm.yaml` — add `vllm_research` alias + `research_fallback_order`
- `scripts/preflight.py` — detect `DUAL_GPU_MODE`, calculate and warn on model download size, write default to `.env`
- `compose.yml` — add `ollama_research` service + profile updates
- `compose.gpu.yml` / `compose.podman-gpu.yml` — GPU device assignments
- `Makefile` — inject `DUAL_GPU_MODE` sub-profile selection
- `manage.sh` — update help text

## Tier

Available to all tiers (hardware-gated, not license-gated).

## Tasks

9-task TDD plan in `docs/plans/2026-02-26-dual-gpu-plan.md`. Ready to implement.

---

**pyr0ball** (Author) commented:

## Implementation complete

All 9 tasks from the [implementation plan](docs/plans/2026-02-26-dual-gpu-plan.md) are merged to `main`.

### What was shipped

- `DUAL_GPU_MODE=ollama|vllm|mixed` env var selects which service occupies GPU 1
- `ollama_research` service added to `compose.yml` (port 11435, shared model dir — no double download)
- `compose.gpu.yml` + `compose.podman-gpu.yml` assign `ollama_research` to `device_ids: ["1"]`
- `Makefile` injects `--profile dual-gpu-$(DUAL_GPU_MODE)` alongside `--profile dual-gpu`
- `manage.sh` help updated with mode descriptions
- `preflight.py` gains:
  - `ollama_research` in `_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`
  - `_download_size_mb()` — profile-aware first-run download size estimate
  - `_mixed_mode_vram_warning()` — warns when GPU 1 has < 12 GB free in mixed mode
  - writes `DUAL_GPU_MODE=ollama` default to `.env` on first 2-GPU setup
- `config/llm.yaml` gains `vllm_research` backend; `research_fallback_order` updated
- 16 new unit tests in `tests/test_preflight.py`, all green

### Verified

- `docker compose --profile dual-gpu --profile dual-gpu-ollama config` → `ollama`, `ollama_research`, `vision`, `searxng` (no vllm) ✓
- `docker compose --profile dual-gpu --profile dual-gpu-vllm config` → `ollama`, `vllm`, `vision`, `searxng` (no ollama_research) ✓
- Full test suite: **401 passed, 0 failed**
- gitleaks pre-push: clean (165 commits scanned)
Reference: Circuit-Forge/peregrine#1