feat: agent watchdog + Ollama adopt-if-running #17

Merged
pyr0ball merged 2 commits from feature/agent-watchdog into main 2026-04-02 22:12:33 -07:00

Summary

#15 — Agent watchdog / coordinator-restart reconnect

  • NodeStore: SQLite persistence at ~/.local/share/circuitforge/cf-orch-nodes.db
  • AgentSupervisor.restore_from_store(): reloads all known nodes on coordinator startup, marks offline until first successful poll
  • register() now persists every call to NodeStore
  • Agent CLI: one-shot registration replaced with a 30 s reconnect loop; after a coordinator restart, nodes reappear within one cycle with no manual intervention
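
For readers who want to see the shape of the persistence path, here is a minimal sketch, not the code in this PR. The table schema, column names, and the `all_nodes()` helper are assumptions; only `NodeStore`, `AgentSupervisor.register()`, and `restore_from_store()` are names taken from the summary above.

```python
import sqlite3
import time


class NodeStore:
    """Hypothetical stand-in for the PR's NodeStore; the schema is assumed."""

    def __init__(self, path):
        # The PR stores the file under ~/.local/share/circuitforge/;
        # ":memory:" is enough for experimenting.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS nodes ("
            "node_id TEXT PRIMARY KEY, url TEXT NOT NULL, last_seen REAL NOT NULL)"
        )
        self._db.commit()

    def upsert(self, node_id, url):
        # Insert a new node, or overwrite the row of a node already known.
        self._db.execute(
            "INSERT OR REPLACE INTO nodes (node_id, url, last_seen) VALUES (?, ?, ?)",
            (node_id, url, time.time()),
        )
        self._db.commit()

    def all_nodes(self):
        rows = self._db.execute("SELECT node_id, url FROM nodes ORDER BY node_id")
        return [(node_id, url) for node_id, url in rows]


class AgentSupervisor:
    def __init__(self, store):
        self.store = store
        self.nodes = {}  # node_id -> {"url": str, "online": bool}

    def register(self, node_id, url):
        self.nodes[node_id] = {"url": url, "online": True}
        self.store.upsert(node_id, url)  # persisted on every call

    def restore_from_store(self):
        # Restored nodes start offline; a later successful poll would flip them.
        for node_id, url in self.store.all_nodes():
            self.nodes.setdefault(node_id, {"url": url, "online": False})
```

A supervisor created after a restart and pointed at the same store would see every previously registered node, each marked offline until the heartbeat loop reaches it.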

#16 — Ollama adopt-if-running + health_path

  • ProcessSpec: adopt: bool + health_path: str fields
  • ServiceManager.start(): with adopt=True, probes health_path before spawning; if the probe succeeds, claims the running service instead of starting a new process
  • ServiceInstance.health_path + upsert_instance(health_path=) — probe loop uses per-instance path
  • All GPU profiles (2/4/6/8/16/24 GB + cpu-16/32): ollama managed block with adopt: true, health_path: /api/tags
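
The adopt-if-running flow above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: the `ProcessSpec` dataclass here is a stand-in with an assumed `port` field, and `probe_health`, `start`, and the `spawn` callable are hypothetical names.

```python
import urllib.request
from dataclasses import dataclass

# Bypass any configured HTTP proxy: the probe targets the local machine only.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


@dataclass
class ProcessSpec:
    """Hypothetical stand-in; only adopt and health_path are named in the PR."""
    port: int
    adopt: bool = False
    health_path: str = "/health"


def probe_health(port, path="/health", host="localhost", timeout=2.0):
    """Return True if a GET on http://<host>:<port><path> gets a 2xx reply."""
    try:
        with _OPENER.open(f"http://{host}:{port}{path}", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError (4xx/5xx) subclass OSError, as do socket timeouts.
        return False


def start(spec, spawn, probe=probe_health):
    """Return True if an already-running service was adopted, False if spawned."""
    if spec.adopt and probe(spec.port, spec.health_path):
        return True  # healthy service found: claim it, spawn nothing
    spawn()
    return False
```

With an Ollama daemon already listening on port 11434, `start(ProcessSpec(11434, adopt=True, health_path="/api/tags"), spawn)` would return `True` without ever calling `spawn`.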

Tests

  • 243 total (+27 new): NodeStore (8), AgentSupervisor watchdog (8), Ollama adopt (11)

Closes #15, #16

pyr0ball added 2 commits 2026-04-02 22:10:05 -07:00
closes #15

- NodeStore: SQLite persistence for known agent nodes
  (~/.local/share/circuitforge/cf-orch-nodes.db)
  - upsert on every register(); prune_stale() for 30-day cleanup
  - survives coordinator restarts — data readable by next process

- AgentSupervisor.restore_from_store(): reload known nodes on startup,
  mark all offline; heartbeat loop brings back any that respond

- AgentSupervisor.register(): persists to NodeStore on every call

- cli.py coordinator: NodeStore wired in; restore_from_store() called
  before uvicorn starts

- cli.py agent: one-shot registration replaced with persistent reconnect
  loop (daemon thread, 30 s interval) — coordinator restart → nodes
  reappear within one cycle with no manual intervention on agent hosts

- 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)
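
As an illustration of the agent-side reconnect loop described in this commit, a minimal sketch could look like the following. The function name `start_reconnect_loop`, the `register_once` callable, and the stop event are assumptions made for the example; the commit only states that a daemon thread re-registers every 30 s.

```python
import threading


def start_reconnect_loop(register_once, interval_s=30.0, stop_event=None):
    """Call register_once() now and then every interval_s seconds on a daemon thread."""
    stop_event = stop_event or threading.Event()

    def _loop():
        while not stop_event.is_set():
            try:
                register_once()
            except Exception:
                # Coordinator may be down or restarting; retry on the next cycle.
                pass
            # Sleeps for interval_s, but wakes early if stop_event is set.
            stop_event.wait(interval_s)

    thread = threading.Thread(target=_loop, name="agent-reconnect", daemon=True)
    thread.start()
    return thread, stop_event
```

Because registration is repeated rather than one-shot, a failed attempt costs nothing beyond waiting for the next cycle, which is what lets nodes reappear after a coordinator restart without touching the agent hosts.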
- ProcessSpec: adopt (bool) and health_path (str, default /health) fields
- ServiceManager: adopt=True probes health_path before spawning; is_running()
  uses health probe for adopt services rather than proc table + socket check
- _probe_health() helper: urllib GET on localhost:port+path, returns bool
- Agent /services/{service}/start: returns adopted=True when service was
  already running; coordinator sets state=running immediately (no probe wait)
- ServiceInstance: health_path field (default /health)
- service_registry.upsert_instance(): health_path kwarg
- Probe loop uses inst.health_path instead of hardcoded /health
- coordinator allocate_service: looks up health_path from profile spec via
  _get_health_path() and stores on ServiceInstance
- All GPU profiles (2/4/6/8/16/24 GB + cpu-16/32): ollama managed block
  with adopt=true, health_path=/api/tags, port 11434
- 11 new tests
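
To make the per-instance health path concrete, here is a small sketch of how a coordinator might resolve it from a profile and build the probe URL. The profile dictionary layout, the `host` and `port` fields, and the helper names `get_health_path` and `probe_url` are assumptions; the commit only names `ServiceInstance.health_path`, `_get_health_path()`, and the `/health` default.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    """Hypothetical stand-in; health_path and its default come from the commit."""
    service: str
    host: str
    port: int
    health_path: str = "/health"


def get_health_path(profile, service):
    """Look up health_path in an assumed profile layout; fall back to /health."""
    managed = profile.get("services", {}).get(service, {}).get("managed", {})
    return managed.get("health_path", "/health")


def probe_url(inst):
    # The probe loop uses the instance's own path instead of a hardcoded /health.
    return f"http://{inst.host}:{inst.port}{inst.health_path}"


# Example profile fragment mirroring the ollama block described above.
profile = {
    "services": {
        "ollama": {"managed": {"adopt": True, "health_path": "/api/tags", "port": 11434}}
    }
}
inst = ServiceInstance("ollama", "127.0.0.1", 11434, get_health_path(profile, "ollama"))
print(probe_url(inst))
```

Services whose profile omits `health_path` keep the old behaviour, since the lookup falls back to `/health`.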
pyr0ball merged commit d45d4e1de6 into main 2026-04-02 22:12:33 -07:00
Reference: Circuit-Forge/circuitforge-core#17