fix(orch): transition vllm instance state from starting to running after port probe #10


Closed

opened 2026-04-02 17:00:42 -07:00 by pyr0ball · 1 comment

pyr0ball commented

2026-04-02 17:00:42 -07:00

Owner

Problem

After a successful /allocate, the coordinator seeds the service instance with state: starting. There is no background probe to flip it to running once llm_server is accepting connections.

Symptom: /api/services always shows state: starting, even when the server is healthy on port 8000.

Suggested approach

Add a lifespan background task in the coordinator that polls starting instances every few seconds and probes their /health URL. On success it calls service_registry.upsert_instance(state='running'); if the instance is still not healthy after a configurable timeout, it marks the instance failed.

Alternative: agent emits a /services/{service}/ready callback to the coordinator once is_running() returns True.
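
A rough sketch of what the agent-side alternative could look like, using only the standard library. The names `port_is_open`, `wait_until_ready` and `notify_ready` are hypothetical, as are the request shape of the `/ready` callback and the default timings; the real `is_running()` in service_manager.py may work differently.

```python
import socket
import time
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical stand-in for is_running())."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(host: str, port: int, deadline_s: float = 300.0,
                     interval_s: float = 1.0) -> bool:
    """Poll the port until it accepts connections or the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if port_is_open(host, port):
            return True
        time.sleep(interval_s)
    return False


def notify_ready(coordinator_url: str, service: str) -> int:
    """POST an empty body to the (assumed) ready callback and return the HTTP status."""
    req = urllib.request.Request(
        f"{coordinator_url}/services/{service}/ready", data=b"", method="POST"
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The agent would call `wait_until_ready(...)` after launching the service and then `notify_ready(...)` once it returns True. The trade-off versus the coordinator-side loop is that a crashed agent never reports back, so the coordinator would still want its own timeout.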

Relevant files

circuitforge_core/resources/coordinator/app.py — upsert_instance seeds state
circuitforge_core/resources/agent/service_manager.py — is_running() socket probe
circuitforge_core/resources/agent/service_probe.py — probe utilities


pyr0ball referenced this issue from a commit

2026-04-02 17:18:22 -07:00

feat(orch): background health probe loop — starting → running transition

pyr0ball commented

2026-04-02 17:22:38 -07:00

Author

Owner

Fixed in feature/orch-llm-server @ a7290c1.

_run_instance_probe_loop runs as a background asyncio task in coordinator lifespan. Polls all starting instances every 5 s via GET /health; transitions to running on 200, or stopped after 300 s timeout.
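
For readers without access to the commit, the behaviour described above can be sketched roughly as follows. This is not the code from a7290c1: the `Instance` dataclass, `next_state`, `probe_once` and `run_probe_loop` are hypothetical names, and the health check is injected as a coroutine so the sketch stays independent of any HTTP client. Only the 5 s interval, the 300 s timeout, and the starting → running / stopped transitions come from the comment.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Iterable, Optional

PROBE_INTERVAL_S = 5.0    # poll interval stated in the comment
PROBE_TIMEOUT_S = 300.0   # give up on a starting instance after this long


@dataclass
class Instance:
    # Hypothetical record; the real registry entry may carry more fields.
    name: str
    health_url: str
    state: str = "starting"
    started_at: float = field(default_factory=time.monotonic)


def next_state(inst: Instance, healthy: bool, now: float,
               timeout_s: float = PROBE_TIMEOUT_S) -> str:
    """Decide the state of an instance after one probe."""
    if inst.state != "starting":
        return inst.state          # only starting instances are transitioned
    if healthy:
        return "running"           # GET /health answered 200
    if now - inst.started_at >= timeout_s:
        return "stopped"           # never became healthy within the timeout
    return "starting"


async def probe_once(instances: Iterable[Instance],
                     probe: Callable[[str], Awaitable[bool]],
                     now: Optional[float] = None) -> None:
    """Probe every starting instance once and update its state in place."""
    for inst in instances:
        if inst.state != "starting":
            continue
        healthy = await probe(inst.health_url)
        current = time.monotonic() if now is None else now
        inst.state = next_state(inst, healthy, current)


async def run_probe_loop(instances: Iterable[Instance],
                         probe: Callable[[str], Awaitable[bool]],
                         interval_s: float = PROBE_INTERVAL_S) -> None:
    """Run until cancelled; intended to be started as a background task."""
    while True:
        await probe_once(instances, probe)
        await asyncio.sleep(interval_s)
```

In a coordinator with a lifespan hook, `run_probe_loop` would typically be started with `asyncio.create_task(...)` on startup and cancelled on shutdown, with `probe` wrapping whatever HTTP client the project already uses.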


pyr0ball closed this issue

2026-04-02 17:22:38 -07:00

pyr0ball referenced this issue from a pull request that will close it

2026-04-02 17:24:01 -07:00

feat(orch): health probe loop + VRAM pre-flight fix #12