feat: auto service lifecycle — /allocate, NodeSelector, idle sweep, CFOrchClient #9

Merged
pyr0ball merged 15 commits from feature/orch-auto-lifecycle into main 2026-04-02 14:11:36 -07:00

Summary

  • NodeSelector: warm-first GPU scoring — prefers nodes where the service model is already loaded (+1000 MB virtual bonus), then falls back to most free VRAM
  • /api/services/{service}/allocate: new coordinator endpoint for hands-free node selection; returns allocation_id, url, warm flag
  • CFOrchClient: sync + async context manager in circuitforge_core.resources; acquires a vllm/vision allocation, yields it, and releases on exit
  • ServiceRegistry: in-memory allocation tracker + instance state machine (starting → running → idle → stopped); TTL sweep reaps expired allocations
  • Idle sweep: AgentSupervisor calls _run_idle_sweep() every 3 heartbeat ticks; POSTs /services/{service}/stop to the agent when a slot has been idle past idle_stop_after_s
  • idle_stop_after_s: new ServiceProfile field (0 = never stop); set to 600 s for vllm in all four public GPU profiles
  • LLMRouter cf_orch integration: _try_cf_orch_alloc() wires allocate into the inference path when CF_ORCH_URL is set
  • Dashboard: Services table with state color-coding (running/idle/stopped/starting), safe DOM construction throughout
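The warm-first scoring in the first bullet can be sketched as below. The candidate fields and the `WARM_BONUS_MB` name are illustrative assumptions, not the actual NodeSelector schema; only the +1000 MB bonus and the free-VRAM fallback are from this PR:

```python
from dataclasses import dataclass
from typing import Optional

WARM_BONUS_MB = 1000  # virtual VRAM bonus for GPUs with the model already loaded

@dataclass
class GpuCandidate:
    node: str
    gpu_id: int
    free_vram_mb: int
    model_loaded: bool  # service model already resident on this GPU

def score(c: GpuCandidate) -> int:
    # Warm GPUs get a +1000 MB virtual bonus so they outrank
    # slightly-freer cold GPUs; otherwise raw free VRAM decides.
    return c.free_vram_mb + (WARM_BONUS_MB if c.model_loaded else 0)

def select_gpu(candidates: list[GpuCandidate]) -> Optional[GpuCandidate]:
    # Highest score wins; None when no candidate survives filtering.
    return max(candidates, key=score, default=None)
```

A warm GPU with 7.5 GB free therefore beats a cold GPU with 8 GB free, which is the intended bias toward avoiding a cold model load.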

Test Plan

  • 126 tests passing (conda run -n cf pytest tests/ -v)
  • Start coordinator + agent on Heimdall; verify /api/services/vllm/allocate selects the right GPU
  • Leave vllm idle 10+ minutes; verify agent receives /services/vllm/stop
  • Set CF_ORCH_URL in Peregrine .env; verify LLMRouter routes through cf-orch
  • Dashboard / shows Services table with live state
pyr0ball added 15 commits 2026-04-02 14:05:45 -07:00
Implements CFOrchClient with allocate() (sync contextmanager) and
allocate_async() (async contextmanager) for cf-orch GPU resource
allocation. Releases allocation on exit; ignores 404 on release;
raises RuntimeError on non-2xx allocation response. Exports
CFOrchClient and Allocation from circuitforge_core.resources.

Note: async test uses unittest.mock rather than httpretty — httpretty
only patches stdlib sockets and does not intercept httpx async (anyio)
transport.
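Given those constraints (release on exit, 404 on release ignored, RuntimeError on a non-2xx allocation), the sync path might look roughly like this. The endpoint paths and the injected `post` callable are assumptions made to keep the sketch self-contained; the real client uses httpx:

```python
import contextlib

class CFOrchClient:
    """Sketch of the sync allocation context manager.

    `post` stands in for the HTTP layer (httpx in the real client) and
    endpoint paths are assumptions, not the confirmed cf-orch API.
    """

    def __init__(self, base_url, post):
        self.base_url = base_url.rstrip("/")
        self._post = post

    @contextlib.contextmanager
    def allocate(self, service):
        resp = self._post(f"{self.base_url}/api/services/{service}/allocate")
        if not 200 <= resp.status_code < 300:
            # Non-2xx allocation response is a hard failure.
            raise RuntimeError(f"allocation failed: HTTP {resp.status_code}")
        alloc = resp.json()
        try:
            yield alloc  # caller hits alloc["url"] while the slot is held
        finally:
            # Release on exit; a 404 (allocation already reaped by the
            # TTL sweep) is ignored rather than raised.
            rel = self._post(
                f"{self.base_url}/api/allocations/{alloc['allocation_id']}/release"
            )
            if rel.status_code not in (200, 204, 404):
                raise RuntimeError(f"release failed: HTTP {rel.status_code}")
```

Because release runs in `finally`, the slot is returned even when the caller's body raises, which is what makes the TTL sweep a backstop rather than the primary cleanup path.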

Add idle_stop_after_s to ServiceProfile (default 0 = never stop).
Set 600s (10 min) timeout on vllm slot in all single-GPU profiles.
Backward compatible; non-vllm services inherit default 0 (no auto-stop).
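As a sketch, the new field behaves like an ordinary dataclass default; the class shape below is an assumption (other ServiceProfile fields omitted), only `idle_stop_after_s` and its semantics are from this PR:

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    # Hypothetical subset of fields; only idle_stop_after_s is new here.
    name: str
    idle_stop_after_s: int = 0  # 0 = never auto-stop

vllm = ServiceProfile(name="vllm", idle_stop_after_s=600)  # 10 min idle timeout
whisper = ServiceProfile(name="whisper")  # inherits default 0: no auto-stop
```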

- CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop,
  preventing repeated stop POSTs on every 3rd tick for the same instance
- CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong
  instance idle on multi-GPU nodes when an allocation is released
- CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the
  actual HTTPException(503) before the candidate loop
- IMPORTANT: register() now updates agent_url on re-registration if it changed,
  so relocated agents are tracked correctly
- IMPORTANT: updated test_service_registry.py callers of active_allocations()
  to pass the now-required gpu_id argument
- ServiceRegistry: add sweep_expired_allocations() to remove stale TTL
  allocations and transition instances to idle; add get_allocation() helper
- AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before
  idle-timeout check so crashed-caller leaks are cleaned up each sweep tick
- schema._parse_managed: copy raw dict before extracting 'type' key instead
  of mutating caller's dict with pop()
- app.release_allocation: validate allocation belongs to the given service
  path param before releasing; return 404 if mismatch
- router._try_cf_orch_alloc: replace print() with logger.warning(); add
  module-level logger = logging.getLogger(__name__)
- tests: add test_sweep_expired_allocations covering TTL expiry and idle
  state transition
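Putting the sweep fixes together, one sweep pass might look like the sketch below. `sweep_expired_allocations()`, `mark_stopped()`, and the sweep-before-timeout ordering are from the commits above; the registry iteration API, `idle_since` field, and `post_stop` callback are assumptions:

```python
import time

def run_idle_sweep(registry, post_stop, now=None):
    """Sketch of one idle-sweep pass (every 3rd heartbeat tick).

    1. Reap TTL-expired allocations first, so crashed-caller leaks are
       cleaned up and their instances transition to idle this tick.
    2. Stop instances idle longer than their profile's idle_stop_after_s.
    """
    now = now if now is not None else time.time()
    registry.sweep_expired_allocations(now)
    for inst in registry.instances():
        timeout = inst.profile.idle_stop_after_s
        if timeout <= 0 or inst.state != "idle":
            continue  # 0 means never auto-stop
        if now - inst.idle_since >= timeout:
            post_stop(inst)  # POST /services/{service}/stop to the agent
            # Mark stopped immediately so the next sweep tick does not
            # POST the same stop again for this instance.
            registry.mark_stopped(inst)
```

The `mark_stopped()` call inside the loop is the fix for the repeated-stop bug flagged as CRITICAL above: without it, the instance stays `idle` and every third tick fires another stop POST.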
pyr0ball merged commit c5e12b74f2 into main 2026-04-02 14:11:36 -07:00
Reference: Circuit-Forge/circuitforge-core#9