agent watchdog: persist known nodes + auto-reconnect after coordinator restart #15

Closed

opened 2026-04-02 21:48:27 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-04-02 21:48:27 -07:00

Owner

Problem

When the coordinator restarts, _agents (in-memory dict in AgentSupervisor) is wiped. External agent nodes (e.g. Navi, Strahl) are not re-registered until someone manually restarts their agent process. In practice this means after any coordinator restart or manage.sh restart, remote GPU nodes silently disappear from the allocation pool.

Root cause

AgentSupervisor.__init__ initialises self._agents = {} — no persistence, no recovery.

Proposed fix

1. Persist known nodes to SQLite

On every register() call, upsert a row in a known_nodes table:

CREATE TABLE IF NOT EXISTS known_nodes (
    node_id   TEXT PRIMARY KEY,
    agent_url TEXT NOT NULL,
    last_seen REAL
);
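
The upsert itself could look roughly like the sketch below. It assumes the coordinator uses Python's built-in `sqlite3` module and a SQLite build with `ON CONFLICT ... DO UPDATE` support (3.24 or later); the helper name `upsert_known_node` and its `now` parameter are hypothetical and only there to make the example testable.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS known_nodes (
    node_id   TEXT PRIMARY KEY,
    agent_url TEXT NOT NULL,
    last_seen REAL
);
"""


def upsert_known_node(conn, node_id, agent_url, now=None):
    """Insert the node, or refresh agent_url and last_seen if it already exists.

    Hypothetical helper: register() would call this once per registration.
    """
    seen = time.time() if now is None else now
    conn.execute(
        """
        INSERT INTO known_nodes (node_id, agent_url, last_seen)
        VALUES (?, ?, ?)
        ON CONFLICT(node_id) DO UPDATE SET
            agent_url = excluded.agent_url,
            last_seen = excluded.last_seen
        """,
        (node_id, agent_url, seen),
    )
    conn.commit()
```

Keying the upsert on `node_id` means a node that comes back with a new URL overwrites its old row instead of creating a second one.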

2. Reload on startup

In AgentSupervisor.__init__ (or a new restore_from_db() method called at lifespan startup), read all rows from known_nodes into _agents. Mark every restored node as online=False until its first successful poll.
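
A minimal sketch of the reload step, assuming `_agents` maps `node_id` to a small record object. The `AgentRecord` dataclass is a stand-in invented for this example; the real record type in `AgentSupervisor` may carry more fields.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AgentRecord:
    """Hypothetical stand-in for whatever AgentSupervisor stores per node."""
    node_id: str
    agent_url: str
    last_seen: Optional[float] = None
    online: bool = False  # restored nodes stay offline until a poll succeeds


def restore_from_db(conn: sqlite3.Connection) -> Dict[str, AgentRecord]:
    """Build the initial _agents mapping from the known_nodes table."""
    rows = conn.execute(
        "SELECT node_id, agent_url, last_seen FROM known_nodes"
    ).fetchall()
    return {
        node_id: AgentRecord(node_id, agent_url, last_seen)
        for node_id, agent_url, last_seen in rows
    }
```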

3. Probe loop on startup

Immediately after loading known nodes, fire poll_all() — nodes that respond come online instantly; stale/dead entries stay offline until they re-register.
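
The startup probe might be sketched as below. This is not the existing `poll_all()` implementation: the dict-based records, the injected `probe` coroutine, and the `timeout` default are all assumptions chosen so the example runs without a network.

```python
import asyncio


async def poll_all(agents, probe, timeout=2.0):
    """Probe every known node concurrently and update its online flag.

    agents: mapping of node_id -> {"agent_url": str, "online": bool}
    probe:  hypothetical coroutine taking a URL, returning truthy if healthy
    """
    async def check(record):
        try:
            ok = await asyncio.wait_for(probe(record["agent_url"]), timeout)
            record["online"] = bool(ok)
        except Exception:
            # Timeouts and connection errors both leave the node offline.
            record["online"] = False

    await asyncio.gather(*(check(r) for r in agents.values()))
    return agents
```

Running the probes concurrently keeps startup time bounded by the slowest single node (at most `timeout`) rather than by the sum over all nodes.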

4. Agent-side reconnect loop (agent app)

In the agent process, run a background task that:

- Sends POST /api/nodes/register to the coordinator every 30 s.
- Keeps retrying while the coordinator is unreachable, so the agent re-registers automatically once the coordinator comes back.

As a result, after a coordinator restart, agents reappear within one heartbeat cycle (~30 s) with no manual intervention.
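
The agent-side loop could be as small as the sketch below. The `register` argument is a hypothetical coroutine that performs the `POST /api/nodes/register` call with whatever HTTP client the agent already uses; the `stop` event is also an assumption, added so the task can be shut down cleanly.

```python
import asyncio


async def register_forever(register, interval=30.0, stop=None):
    """Re-register with the coordinator every `interval` seconds.

    Failures are swallowed on purpose: if the coordinator is down, the
    next cycle simply tries again, which is what makes recovery automatic.
    """
    stop = stop or asyncio.Event()
    while not stop.is_set():
        try:
            await register()
        except Exception:
            # Coordinator unreachable or returned an error: retry next cycle.
            pass
        try:
            # Sleep for one interval, but wake up early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Because every heartbeat is a full registration, the agent needs no separate "coordinator came back" detection: the first successful call after an outage restores the node.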

Acceptance criteria

- [ ] Restart coordinator → all previously-registered nodes reappear within 30 s without touching agent processes
- [ ] New node registers → survives coordinator restart
- [ ] Dead/removed node eventually ages out (no stale-forever entries)
- [ ] known_nodes table migration included in cf-core DB migrations
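
The proposal does not say how the age-out criterion should be met. One possible approach, shown here purely as a sketch, is to delete rows whose `last_seen` is older than a cutoff. The helper name `prune_stale_nodes`, the `max_age_s` parameter, and the choice to treat a NULL `last_seen` as stale are all assumptions.

```python
import sqlite3
import time


def prune_stale_nodes(conn, max_age_s, now=None):
    """Delete known_nodes rows not seen within max_age_s seconds.

    Rows with a NULL last_seen are treated as stale (an assumption).
    Returns the number of rows removed.
    """
    current = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM known_nodes WHERE last_seen IS NULL OR last_seen < ?",
        (current - max_age_s,),
    )
    conn.commit()
    return cur.rowcount
```

A call such as `prune_stale_nodes(conn, max_age_s=7 * 24 * 3600)` could run at startup or on a timer; the right retention window is a policy decision the issue leaves open.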
