agent watchdog: persist known nodes + auto-reconnect after coordinator restart #15
Reference: Circuit-Forge/circuitforge-core#15
Problem
When the coordinator restarts, `_agents` (the in-memory dict in `AgentSupervisor`) is wiped. External agent nodes (e.g. Navi, Strahl) are not re-registered until someone manually restarts their agent process. In practice, after any coordinator restart or `manage.sh restart`, remote GPU nodes silently disappear from the allocation pool.
Root cause
`AgentSupervisor.__init__` initialises `self._agents = {}`: no persistence, no recovery.
Proposed fix
1. Persist known nodes to SQLite
On every `register()` call, upsert a row in a `known_nodes` table.
2. Reload on startup
In `AgentSupervisor.__init__` (or a new `restore_from_db()` method called at lifespan startup), read all rows from `known_nodes` into `_agents`. Mark all as `online=False` until the first successful poll.
3. Probe loop on startup
Immediately after loading known nodes, fire `poll_all()`: nodes that respond come online instantly; stale/dead entries stay offline until they re-register.
4. Agent-side reconnect loop (agent app)
In the agent process, run a background task that sends `POST /api/nodes/register` to the coordinator every 30 s.
Acceptance criteria
`known_nodes` table migration included in cf-core DB migrations
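A minimal sketch of the four steps in the proposed fix, not the actual cf-core code. Assumptions: the `known_nodes` columns (`node_id`, `agent_url`, `last_seen`), the `AgentRecord` dataclass, the injected `probe` callable and the `reconnect_loop` helper are all hypothetical. The real `AgentSupervisor` presumably polls agents over HTTP and may be async. The upsert uses `ON CONFLICT ... DO UPDATE`, which needs SQLite 3.24 or newer.

```python
import asyncio
import sqlite3
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical schema: the issue names the table but not its columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS known_nodes (
    node_id   TEXT PRIMARY KEY,
    agent_url TEXT NOT NULL,
    last_seen REAL NOT NULL
)
"""


@dataclass
class AgentRecord:
    node_id: str
    agent_url: str
    online: bool = False


class AgentSupervisor:
    def __init__(self, db: sqlite3.Connection, probe: Callable[[str], bool]):
        self._db = db
        self._probe = probe  # hypothetical health check: agent_url -> reachable?
        self._agents: Dict[str, AgentRecord] = {}
        self._db.execute(SCHEMA)

    def register(self, node_id: str, agent_url: str) -> None:
        # Step 1: upsert so the node survives a coordinator restart.
        self._db.execute(
            "INSERT INTO known_nodes (node_id, agent_url, last_seen) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(node_id) DO UPDATE SET "
            "agent_url = excluded.agent_url, last_seen = excluded.last_seen",
            (node_id, agent_url, time.time()),
        )
        self._db.commit()
        self._agents[node_id] = AgentRecord(node_id, agent_url, online=True)

    def restore_from_db(self) -> None:
        # Step 2: reload every known node, offline until a poll succeeds.
        rows = self._db.execute("SELECT node_id, agent_url FROM known_nodes")
        for node_id, agent_url in rows:
            self._agents[node_id] = AgentRecord(node_id, agent_url, online=False)

    def poll_all(self) -> None:
        # Step 3: nodes that answer the probe come online; the rest stay offline.
        for agent in self._agents.values():
            agent.online = self._probe(agent.agent_url)


async def reconnect_loop(
    post_register: Callable[[], Awaitable[None]],
    interval_s: float = 30.0,
    max_rounds: Optional[int] = None,  # None means run forever
) -> None:
    # Step 4 (agent side): re-send the registration on a fixed interval.
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            await post_register()  # e.g. POST /api/nodes/register
        except Exception:
            pass  # coordinator may be restarting; try again next round
        rounds += 1
        await asyncio.sleep(interval_s)
```

At lifespan startup the coordinator would call `restore_from_db()` and then `poll_all()`. Because registration is an upsert, the agent's periodic re-registration is idempotent: it refreshes `last_seen` instead of creating duplicate rows.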