fix(orch): transition vllm instance state from starting to running after port probe #10

Closed
opened 2026-04-02 17:00:42 -07:00 by pyr0ball · 1 comment
Owner

Problem

After a successful /allocate, the coordinator seeds the service instance with state: starting. There is no background probe to flip it to running once llm_server is accepting connections.

Symptom: /api/services always shows state: starting even when the server is healthy at port 8000.

Suggested approach

Add a lifespan background task in the coordinator that polls starting instances every few seconds and probes their /health URL. On success it calls service_registry.upsert_instance(state='running'); if an instance is still not healthy after a configurable timeout, it marks the instance failed.

Alternative: agent emits a /services/{service}/ready callback to the coordinator once is_running() returns True.
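
The polling approach could look roughly like the sketch below. It is a minimal illustration, not the project's code: `Instance`, `next_state`, and `probe_once` are hypothetical names, the registry is modelled as a plain dict rather than the real `service_registry`, and the health probe is injected as a callable so no HTTP client is assumed.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict


@dataclass
class Instance:
    """Hypothetical stand-in for a service registry entry."""
    health_url: str
    state: str = "starting"
    started_at: float = field(default_factory=time.monotonic)


def next_state(healthy: bool, age_s: float, timeout_s: float) -> str:
    """Decide the state of a `starting` instance after one probe."""
    if healthy:
        return "running"
    if age_s >= timeout_s:
        return "failed"
    return "starting"


async def probe_once(
    instances: Dict[str, Instance],
    probe: Callable[[str], Awaitable[bool]],
    timeout_s: float,
    now: Callable[[], float] = time.monotonic,
) -> None:
    """Probe every `starting` instance once and update its state in place."""
    for inst in instances.values():
        if inst.state != "starting":
            continue  # only instances that are still starting get probed
        try:
            healthy = await probe(inst.health_url)
        except Exception:
            healthy = False  # treat any probe error as "not ready yet"
        inst.state = next_state(healthy, now() - inst.started_at, timeout_s)
```

Keeping the decision in a pure function (`next_state`) makes the timeout behaviour easy to unit-test without a live server; the real task would call something like `probe_once` in a loop with a sleep between rounds.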

Relevant files

  • circuitforge_core/resources/coordinator/app.py — upsert_instance seeds state
  • circuitforge_core/resources/agent/service_manager.py — is_running() socket probe
  • circuitforge_core/resources/agent/service_probe.py — probe utilities
Author
Owner

Fixed in feature/orch-llm-server @ a7290c1.

_run_instance_probe_loop runs as a background asyncio task in the coordinator lifespan. It polls all starting instances every 5 s via GET /health and transitions each one to running on a 200 response, or to stopped after a 300 s timeout.
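
For readers unfamiliar with the pattern, here is a framework-agnostic sketch of running a periodic probe as a background asyncio task tied to an application lifespan. It is not the code in a7290c1: `run_probe_loop`, `lifespan`, and `tick` are hypothetical names, and only the 5 s interval comes from the comment above. A web framework's lifespan hook would typically take the app object as its argument instead of `tick`.

```python
import asyncio
import contextlib
from typing import AsyncIterator, Awaitable, Callable

PROBE_INTERVAL_S = 5.0  # interval quoted in the comment above


async def run_probe_loop(
    tick: Callable[[], Awaitable[None]],
    interval_s: float = PROBE_INTERVAL_S,
) -> None:
    """Call `tick` repeatedly, sleeping `interval_s` between rounds."""
    while True:
        try:
            await tick()
        except Exception:
            # A failing round must not kill the loop; cancellation is a
            # BaseException and still propagates.
            pass
        await asyncio.sleep(interval_s)


@contextlib.asynccontextmanager
async def lifespan(
    tick: Callable[[], Awaitable[None]],
    interval_s: float = PROBE_INTERVAL_S,
) -> AsyncIterator[None]:
    """Start the probe loop on entry and cancel it cleanly on exit."""
    task = asyncio.create_task(run_probe_loop(tick, interval_s))
    try:
        yield
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Cancelling and then awaiting the task on shutdown avoids "task was destroyed but it is pending" warnings and ensures no probe runs after the coordinator has stopped.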

Reference: Circuit-Forge/circuitforge-core#10