feat: agent watchdog + Ollama adopt-if-running #17

Merged
pyr0ball merged 2 commits from feature/agent-watchdog into main 2026-04-02 22:12:33 -07:00

2 commits

Author SHA1 Message Date
7bb6b76bd5 feat: ollama adopt-if-running + health_path in ProcessSpec (#16)
- ProcessSpec: adopt (bool) and health_path (str, default /health) fields
- ServiceManager: adopt=True probes health_path before spawning; is_running()
  uses health probe for adopt services rather than proc table + socket check
- _probe_health() helper: urllib GET on localhost:port+path, returns bool
- Agent /services/{service}/start: returns adopted=True when service was
  already running; coordinator sets state=running immediately (no probe wait)
- ServiceInstance: health_path field (default /health)
- service_registry.upsert_instance(): health_path kwarg
- Probe loop uses inst.health_path instead of hardcoded /health
- coordinator allocate_service: looks up health_path from profile spec via
  _get_health_path() and stores on ServiceInstance
- All GPU profiles (2/4/6/8/16/24 GB + cpu-16/32): ollama managed block
  with adopt=true, health_path=/api/tags, port 11434
- 11 new tests
2026-04-02 22:09:42 -07:00
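
For reviewers, the adopt-if-running behaviour described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the merged code: `probe_health` stands in for the `_probe_health()` helper, and `start_or_adopt`, its `spawn` callback, the timeout value, the 2xx success rule, and the proxy bypass are all hypothetical choices made for the example.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: a localhost probe should ignore any system HTTP proxy settings.
_NO_PROXY_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_health(port: int, path: str = "/health", timeout: float = 2.0) -> bool:
    """Return True if GET http://localhost:<port><path> answers with a 2xx status."""
    url = f"http://localhost:{port}{path}"
    try:
        with _NO_PROXY_OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status (HTTPError), or a bad URL.
        return False


def start_or_adopt(port: int, health_path: str, spawn, adopt: bool = True) -> dict:
    """Hypothetical start handler: adopt a healthy service, otherwise spawn one.

    `spawn` is a caller-supplied callable that launches the process.
    """
    if adopt and probe_health(port, health_path):
        # Service is already up (for example a system-wide Ollama on 11434):
        # report adopted=True so the caller can mark it running immediately.
        return {"adopted": True}
    spawn()
    return {"adopted": False}
```

With the profile values from this commit, a call would look like `start_or_adopt(11434, "/api/tags", spawn=launch_ollama)`, where `launch_ollama` is a placeholder for whatever actually starts the process.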
a54a530493 feat: agent watchdog — persist known nodes + auto-reconnect after coordinator restart
closes #15

- NodeStore: SQLite persistence for known agent nodes
  (~/.local/share/circuitforge/cf-orch-nodes.db)
  - upsert on every register(); prune_stale() for 30-day cleanup
  - survives coordinator restarts — data readable by next process

- AgentSupervisor.restore_from_store(): reload known nodes on startup,
  mark all offline; heartbeat loop brings back any that respond

- AgentSupervisor.register(): persists to NodeStore on every call

- cli.py coordinator: NodeStore wired in; restore_from_store() called
  before uvicorn starts

- cli.py agent: one-shot registration replaced with persistent reconnect
  loop (daemon thread, 30 s interval) — coordinator restart → nodes
  reappear within one cycle with no manual intervention on agent hosts

- 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)
2026-04-02 22:01:55 -07:00