feat: auto service lifecycle — /allocate, NodeSelector, idle sweep, CFOrchClient #9

Merged
pyr0ball merged 15 commits from feature/orch-auto-lifecycle into main 2026-04-02 14:11:36 -07:00

Summary

  • NodeSelector: warm-first GPU scoring — prefers nodes where the service model is already loaded (+1000 MB virtual bonus), then falls back to most free VRAM
  • /api/services/{service}/allocate: new coordinator endpoint for hands-free node selection; returns allocation_id, url, warm flag
  • CFOrchClient: sync + async context manager in circuitforge_core.resources; acquires a vllm/vision allocation, yields it, and releases on exit
  • ServiceRegistry: in-memory allocation tracker + instance state machine (starting → running → idle → stopped); TTL sweep reaps expired allocations
  • Idle sweep: AgentSupervisor calls _run_idle_sweep() every 3 heartbeat ticks; POSTs /services/{service}/stop to the agent when a slot has been idle past idle_stop_after_s
  • idle_stop_after_s: new ServiceProfile field (0 = never stop); set to 600 s for vllm in all four public GPU profiles
  • LLMRouter cf_orch integration: _try_cf_orch_alloc() wires allocate into the inference path when CF_ORCH_URL is set
  • Dashboard: Services table with state color-coding (running/idle/stopped/starting), safe DOM construction throughout
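The warm-first scoring in the first bullet can be sketched as below. The candidate fields and the `WARM_BONUS_MB` name are illustrative assumptions, not the actual NodeSelector schema; only the +1000 MB bonus and the free-VRAM fallback are from this PR:

```python
from dataclasses import dataclass
from typing import Optional

WARM_BONUS_MB = 1000  # virtual VRAM bonus for GPUs with the model already loaded

@dataclass
class GpuCandidate:
    node: str
    gpu_id: int
    free_vram_mb: int
    model_loaded: bool  # service model already resident on this GPU

def score(c: GpuCandidate) -> int:
    # Warm GPUs get a +1000 MB virtual bonus so they outrank
    # slightly-freer cold GPUs; otherwise raw free VRAM decides.
    return c.free_vram_mb + (WARM_BONUS_MB if c.model_loaded else 0)

def select_gpu(candidates: list[GpuCandidate]) -> Optional[GpuCandidate]:
    # Highest score wins; None when no candidate survives filtering.
    return max(candidates, key=score, default=None)
```

A warm GPU with 7.5 GB free therefore beats a cold GPU with 8 GB free, which is the intended bias toward avoiding a cold model load.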

Test Plan

  • 126 tests passing (conda run -n cf pytest tests/ -v)
  • Start coordinator + agent on Heimdall; verify /api/services/vllm/allocate selects the right GPU
  • Leave vllm idle 10+ minutes; verify agent receives /services/vllm/stop
  • Set CF_ORCH_URL in Peregrine .env; verify LLMRouter routes through cf-orch
  • Dashboard / shows Services table with live state
pyr0ball added 15 commits 2026-04-02 14:05:45 -07:00
Implements CFOrchClient with allocate() (sync contextmanager) and
allocate_async() (async contextmanager) for cf-orch GPU resource
allocation. Releases allocation on exit; ignores 404 on release;
raises RuntimeError on non-2xx allocation response. Exports
CFOrchClient and Allocation from circuitforge_core.resources.

Note: async test uses unittest.mock rather than httpretty — httpretty
only patches stdlib sockets and does not intercept httpx async (anyio)
transport.
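Given those constraints (release on exit, 404 on release ignored, RuntimeError on a non-2xx allocation), the sync path might look roughly like this. The endpoint paths and the injected `post` callable are assumptions made to keep the sketch self-contained; the real client uses httpx:

```python
import contextlib

class CFOrchClient:
    """Sketch of the sync allocation context manager.

    `post` stands in for the HTTP layer (httpx in the real client) and
    endpoint paths are assumptions, not the confirmed cf-orch API.
    """

    def __init__(self, base_url, post):
        self.base_url = base_url.rstrip("/")
        self._post = post

    @contextlib.contextmanager
    def allocate(self, service):
        resp = self._post(f"{self.base_url}/api/services/{service}/allocate")
        if not 200 <= resp.status_code < 300:
            # Non-2xx allocation response is a hard failure.
            raise RuntimeError(f"allocation failed: HTTP {resp.status_code}")
        alloc = resp.json()
        try:
            yield alloc  # caller hits alloc["url"] while the slot is held
        finally:
            # Release on exit; a 404 (allocation already reaped by the
            # TTL sweep) is ignored rather than raised.
            rel = self._post(
                f"{self.base_url}/api/allocations/{alloc['allocation_id']}/release"
            )
            if rel.status_code not in (200, 204, 404):
                raise RuntimeError(f"release failed: HTTP {rel.status_code}")
```

Because release runs in `finally`, the slot is returned even when the caller's body raises, which is what makes the TTL sweep a backstop rather than the primary cleanup path.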

Add idle_stop_after_s to ServiceProfile (default 0 = never stop).
Set 600s (10 min) timeout on vllm slot in all single-GPU profiles.
Backward compatible; non-vllm services inherit default 0 (no auto-stop).
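As a sketch, the new field behaves like an ordinary dataclass default; the class shape below is an assumption (other ServiceProfile fields omitted), only `idle_stop_after_s` and its semantics are from this PR:

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    # Hypothetical subset of fields; only idle_stop_after_s is new here.
    name: str
    idle_stop_after_s: int = 0  # 0 = never auto-stop

vllm = ServiceProfile(name="vllm", idle_stop_after_s=600)  # 10 min idle timeout
whisper = ServiceProfile(name="whisper")  # inherits default 0: no auto-stop
```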

- CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop,
  preventing repeated stop POSTs on every 3rd tick for the same instance
- CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong
  instance idle on multi-GPU nodes when an allocation is released
- CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the
  actual HTTPException(503) before the candidate loop
- IMPORTANT: register() now updates agent_url on re-registration if it changed,
  so relocated agents are tracked correctly
- IMPORTANT: updated test_service_registry.py callers of active_allocations()
  to pass the now-required gpu_id argument
- ServiceRegistry: add sweep_expired_allocations() to remove stale TTL
  allocations and transition instances to idle; add get_allocation() helper
- AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before
  idle-timeout check so crashed-caller leaks are cleaned up each sweep tick
- schema._parse_managed: copy raw dict before extracting 'type' key instead
  of mutating caller's dict with pop()
- app.release_allocation: validate allocation belongs to the given service
  path param before releasing; return 404 if mismatch
- router._try_cf_orch_alloc: replace print() with logger.warning(); add
  module-level logger = logging.getLogger(__name__)
- tests: add test_sweep_expired_allocations covering TTL expiry and idle
  state transition
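Putting the sweep fixes together, one sweep pass might look like the sketch below. `sweep_expired_allocations()`, `mark_stopped()`, and the sweep-before-timeout ordering are from the commits above; the registry iteration API, `idle_since` field, and `post_stop` callback are assumptions:

```python
import time

def run_idle_sweep(registry, post_stop, now=None):
    """Sketch of one idle-sweep pass (every 3rd heartbeat tick).

    1. Reap TTL-expired allocations first, so crashed-caller leaks are
       cleaned up and their instances transition to idle this tick.
    2. Stop instances idle longer than their profile's idle_stop_after_s.
    """
    now = now if now is not None else time.time()
    registry.sweep_expired_allocations(now)
    for inst in registry.instances():
        timeout = inst.profile.idle_stop_after_s
        if timeout <= 0 or inst.state != "idle":
            continue  # 0 means never auto-stop
        if now - inst.idle_since >= timeout:
            post_stop(inst)  # POST /services/{service}/stop to the agent
            # Mark stopped immediately so the next sweep tick does not
            # POST the same stop again for this instance.
            registry.mark_stopped(inst)
```

The `mark_stopped()` call inside the loop is the fix for the repeated-stop bug flagged as CRITICAL above: without it, the instance stays `idle` and every third tick fires another stop POST.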
pyr0ball merged commit c5e12b74f2 into main 2026-04-02 14:11:36 -07:00
Reference: Circuit-Forge/circuitforge-core#9