feat(orch): agent self-registration + coordinator heartbeat loop #4

Closed
pyr0ball wants to merge 0 commits from feature/orch-agent-registration into main
Owner

Summary

  • POST /api/nodes — agents self-register by posting {node_id, agent_url}; coordinator immediately polls for GPU info
  • Coordinator lifespan now starts/stops the AgentSupervisor heartbeat loop (was never started before — root cause of empty dashboard)
  • cf-orch start pre-registers the local agent so its GPUs appear immediately without a separate cf-orch agent call
  • cf-orch agent fires a registration POST to the coordinator in a daemon thread after a 2s delay, then blocks on uvicorn.run()
  • --advertise-host flag on cf-orch agent for NATted/multi-homed nodes (e.g. Navi behind a VPN)
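
The registration and heartbeat behaviour described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the project's code: `HeartbeatSupervisor`, its methods, the injected `poll` callable, and the `lifespan(supervisor)` signature are all hypothetical stand-ins (a real FastAPI lifespan receives the app object instead).

```python
import asyncio
from contextlib import asynccontextmanager


class HeartbeatSupervisor:
    """Hypothetical stand-in for AgentSupervisor; not the project's real class."""

    def __init__(self, poll, interval_s=10.0):
        self._poll = poll              # async callable: agent_url -> GPU info
        self._interval_s = interval_s  # assumed interval; the PR does not state one
        self.agents = {}               # node_id -> agent_url
        self.gpu_info = {}             # node_id -> most recent poll result
        self._task = None

    async def register(self, node_id, agent_url):
        # Store the agent, then poll it right away rather than waiting
        # for the next heartbeat tick.
        self.agents[node_id] = agent_url
        await self._poll_one(node_id)

    async def _poll_one(self, node_id):
        try:
            self.gpu_info[node_id] = await self._poll(self.agents[node_id])
        except Exception:
            # An unreachable agent should not stop the heartbeat loop.
            self.gpu_info.pop(node_id, None)

    async def _run(self):
        while True:
            for node_id in list(self.agents):
                await self._poll_one(node_id)
            await asyncio.sleep(self._interval_s)

    def start(self):
        self._task = asyncio.create_task(self._run())

    async def stop(self):
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None


@asynccontextmanager
async def lifespan(supervisor):
    # Start the loop on startup and always stop it on shutdown.
    supervisor.start()
    try:
        yield supervisor
    finally:
        await supervisor.stop()
```

If `start()` is never called, nothing polls the registered agents after the initial registration, which would match the empty-dashboard symptom described in the second bullet.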

Test plan

  • [x] conda run -n cf pytest tests/ -q (94/94 pass)
  • [x] cf-orch start --node-id heimdall → GET /api/nodes returns heimdall with both RTX 4000s
  • [x] VRAM readings live on dashboard at http://10.1.10.71:7700/
  • [ ] Remote node (Navi) registers via cf-orch agent --coordinator http://10.1.10.71:7700 --node-id navi --advertise-host 10.1.10.10
pyr0ball added 1 commit 2026-03-31 19:21:16 -07:00
coordinator/app.py:
- Add POST /api/nodes — agents POST {node_id, agent_url} to self-register;
  coordinator immediately polls the new agent for GPU info
- Add lifespan context manager that starts/stops AgentSupervisor heartbeat
  loop (previously the loop was never started)

cli.py start:
- Add --node-id flag (default 'local')
- Pre-register the local agent URL (http://127.0.0.1:{agent_port}) so the
  heartbeat loop can poll it immediately on startup
- Drop redundant lease_manager.register_gpu() call — supervisor.poll_agent()
  now does this via the heartbeat after the agent responds

cli.py agent:
- Add --advertise-host flag for NATted/multi-homed nodes
- Fire registration POST to coordinator in a daemon thread (2s delay) so
  uvicorn.run() can start binding immediately; no double uvicorn.run()
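
A rough sketch of the delayed registration described for `cf-orch agent`, using only the standard library. The helper names (`build_payload`, `http_post_json`, `register_in_background`) and the injectable `post` parameter are assumptions made for illustration; only the `POST /api/nodes` endpoint, the `{node_id, agent_url}` payload, the daemon thread, and the 2s delay come from the commit message.

```python
import json
import threading
import time
import urllib.request


def build_payload(node_id, agent_port, advertise_host="127.0.0.1"):
    # advertise_host lets a NATted or multi-homed node publish a reachable address.
    return {"node_id": node_id, "agent_url": f"http://{advertise_host}:{agent_port}"}


def http_post_json(url, payload, timeout=5.0):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def register_in_background(coordinator_url, payload, delay_s=2.0, post=http_post_json):
    """Start a daemon thread that registers after delay_s, then return it."""

    def _worker():
        # Give the agent's HTTP server time to start before the coordinator
        # polls the advertised URL.
        time.sleep(delay_s)
        try:
            post(f"{coordinator_url.rstrip('/')}/api/nodes", payload)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print(f"registration failed: {exc}")

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread
```

In this sketch the caller would invoke `register_in_background(...)` and then make the single blocking `uvicorn.run()` call. Because the thread is a daemon, it does not keep the process alive once the server exits.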
pyr0ball added 1 commit 2026-04-01 11:18:15 -07:00
Before running a batch of tasks, the scheduler now requests a VRAM lease
from the cf-orch coordinator (POST /api/leases). The lease is held for the
full batch and released in the finally block so it's always cleaned up even
on error. Falls back gracefully if the coordinator is unreachable.

Adds coordinator_url and service_name params to TaskScheduler.__init__
and get_scheduler() so callers can override the default localhost:7700.
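
The lease handling described in this commit might look something like the sketch below. It is not the actual `TaskScheduler` code: `run_batch_with_lease` and its `acquire` and `release` callables are hypothetical, and the commit message does not say how a lease is released, so that step is left to an injected function.

```python
def run_batch_with_lease(tasks, acquire, release):
    """Run zero-argument callables while holding a lease, if one can be had.

    acquire() returns a lease id (for example after POST /api/leases);
    release(lease_id) gives it back. Both are assumed to raise OSError
    when the coordinator is unreachable.
    """
    lease_id = None
    try:
        lease_id = acquire()
    except OSError:
        # Fallback: coordinator unreachable, so run the batch without a lease.
        lease_id = None

    results = []
    try:
        for task in tasks:
            results.append(task())
    finally:
        # Runs on success and on error, so a held lease is always returned.
        if lease_id is not None:
            try:
                release(lease_id)
            except OSError:
                pass
    return results
```

Holding one lease for the whole batch, rather than one per task, keeps the number of coordinator round-trips constant regardless of batch size.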
pyr0ball closed this pull request 2026-04-02 16:49:12 -07:00

Reference: Circuit-Forge/circuitforge-core#4