feat(resources): cf-orch GPU VRAM orchestration — Plan A core #1

Merged
pyr0ball merged 21 commits from feature/cforch-core-orchestration into main 2026-03-31 10:43:53 -07:00

Summary

  • Adds circuitforge_core.resources — the cf-orch (CircuitForge Orchestration) VRAM lease management system
  • Coordinator service tracks VRAM leases per (node_id, gpu_id) pair with asyncio lock; evicts lower-priority holders when VRAM is exhausted
  • Agent service polls nvidia-smi, accepts /evict calls (SIGTERM → grace → SIGKILL)
  • 8 public GPU profiles (2 GB → 24 GB + CPU-only 16/32 GB) loaded from YAML with Pydantic v2
  • OrchestratorClient with vram_lease() async context manager for guaranteed release
  • cf-orch CLI (typer): start, agent, status, install-service
  • Docker Compose (compose.yml) for coordinator + agent with NVIDIA GPU device mapping
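The lease-tracking and context-manager bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in `circuitforge_core.resources`: `LeaseTable`, `Lease`, `GpuState`, the `register`/`acquire`/`release` methods, and the "larger number means higher priority" convention are all assumptions made for the example. Only the `(node_id, gpu_id)` keying, the asyncio lock, lower-priority eviction, and the `vram_lease()` name come from the description.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class Lease:
    holder: str
    mb: int
    priority: int  # assumption: larger number = more important


@dataclass
class GpuState:
    total_mb: int
    leases: list = field(default_factory=list)

    def free_mb(self) -> int:
        return self.total_mb - sum(lease.mb for lease in self.leases)


class LeaseTable:
    """Hypothetical in-memory lease table keyed by (node_id, gpu_id)."""

    def __init__(self) -> None:
        self._gpus = {}
        self._lock = asyncio.Lock()  # serializes every read-modify-write

    def register(self, node_id: str, gpu_id: int, total_mb: int) -> None:
        self._gpus[(node_id, gpu_id)] = GpuState(total_mb)

    async def acquire(self, node_id, gpu_id, holder, mb, priority):
        """Grant a lease; return the holders evicted to make room."""
        async with self._lock:
            gpu = self._gpus[(node_id, gpu_id)]
            # Only strictly lower-priority leases are eviction candidates,
            # lowest priority first.
            victims = sorted(
                (l for l in gpu.leases if l.priority < priority),
                key=lambda l: l.priority,
            )
            free, chosen = gpu.free_mb(), []
            for victim in victims:
                if free >= mb:
                    break
                chosen.append(victim)
                free += victim.mb
            if free < mb:
                # A coordinator would map this to an HTTP error response.
                raise RuntimeError("VRAM exhausted")
            for victim in chosen:
                gpu.leases.remove(victim)
            gpu.leases.append(Lease(holder, mb, priority))
            return [v.holder for v in chosen]

    async def release(self, node_id, gpu_id, holder) -> None:
        async with self._lock:
            gpu = self._gpus[(node_id, gpu_id)]
            gpu.leases = [l for l in gpu.leases if l.holder != holder]

    @asynccontextmanager
    async def vram_lease(self, node_id, gpu_id, holder, mb, priority=0):
        evicted = await self.acquire(node_id, gpu_id, holder, mb, priority)
        try:
            yield evicted
        finally:
            # Runs on normal exit and on exceptions, so the lease is
            # always returned; a no-op if the holder was already evicted.
            await self.release(node_id, gpu_id, holder)
```

Usage would look like `async with table.vram_lease("node-1", 0, "my-service", 4096, priority=5): ...`. Putting the release in a `finally` block is what makes the release "guaranteed" in the sense the description uses.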

Test Plan

  • [x] 79/79 tests passing
  • [x] Integration tests: full lease cycle, VRAM exhaustion (503), auto-detect profile
  • [x] Final code review: all HIGH/MEDIUM findings addressed
  • [ ] Smoke test: pip install -e .[orch] → cf-orch status
  • [ ] Smoke test: docker compose -f circuitforge_core/resources/compose.yml up on NVIDIA host

Notes

  • v1: _call_agent_evict is a stub; real process eviction deferred to Plan B
  • Plan B: cf-orch Dashboard, real process eviction, HMAC join tokens
pyr0ball added 21 commits 2026-03-30 22:49:50 -07:00
- eviction_engine: replace deprecated asyncio.get_event_loop() with
  get_running_loop() (Python 3.12 compatibility)
- eviction_engine: remove unused httpx import
- coordinator app: return 422 for unknown node_id instead of silently
  falling back to hardcoded localhost URL
- eviction_executor: guard against pid <= 0 to prevent accidental
  SIGTERM to process group
- pyproject.toml: move pytest-asyncio to [dev] extras, not [orch]
- profile_registry: document CPU profile exclusion from list_public()
pyr0ball merged commit 99f4e95018 into main 2026-03-31 10:43:53 -07:00
Reference: Circuit-Forge/circuitforge-core#1