circuitforge-core/tests/test_resources
pyr0ball a54a530493 feat: agent watchdog — persist known nodes + auto-reconnect after coordinator restart
closes #15

- NodeStore: SQLite persistence for known agent nodes
  (~/.local/share/circuitforge/cf-orch-nodes.db)
  - upsert on every register(); prune_stale() for 30-day cleanup
  - survives coordinator restarts — data readable by next process

- AgentSupervisor.restore_from_store(): reload known nodes on startup,
  mark all offline; heartbeat loop brings back any that respond

- AgentSupervisor.register(): persists to NodeStore on every call

- cli.py coordinator: NodeStore wired in; restore_from_store() called
  before uvicorn starts

- cli.py agent: one-shot registration replaced with persistent reconnect
  loop (daemon thread, 30 s interval) — coordinator restart → nodes
  reappear within one cycle with no manual intervention on agent hosts

- 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)
2026-04-02 22:01:55 -07:00
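
The NodeStore behaviour described in the commit message above (upsert on every register(), 30-day prune_stale(), data readable by the next process) can be sketched roughly as follows. This is a minimal illustration and not the repository's implementation: the class name `NodeStoreSketch`, the table layout, the column names, and the `all()` helper are assumptions made for the example, and an in-memory database stands in for cf-orch-nodes.db.

```python
import sqlite3
import time

# 30-day cutoff taken from the commit message; the constant name is hypothetical.
STALE_AFTER_S = 30 * 24 * 3600


class NodeStoreSketch:
    """Hypothetical stand-in for NodeStore; the schema here is assumed."""

    def __init__(self, path=":memory:"):
        # A real store would open a file on disk so rows outlive the process.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS nodes ("
            "node_id TEXT PRIMARY KEY, "
            "agent_url TEXT NOT NULL, "
            "last_seen REAL NOT NULL)"
        )
        self.conn.commit()

    def upsert(self, node_id, agent_url, now=None):
        """Insert the node, or overwrite its row if node_id already exists."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO nodes (node_id, agent_url, last_seen) "
            "VALUES (?, ?, ?)",
            (node_id, agent_url, now),
        )
        self.conn.commit()

    def prune_stale(self, now=None):
        """Delete nodes not seen within the cutoff; return how many were removed."""
        now = time.time() if now is None else now
        cur = self.conn.execute(
            "DELETE FROM nodes WHERE last_seen < ?", (now - STALE_AFTER_S,)
        )
        self.conn.commit()
        return cur.rowcount

    def all(self):
        """Return (node_id, agent_url) pairs, e.g. for a restore step at startup."""
        return self.conn.execute(
            "SELECT node_id, agent_url FROM nodes ORDER BY node_id"
        ).fetchall()
```

In this sketch a supervisor would call `upsert()` from its register path and read `all()` once at startup, treating every returned node as offline until a heartbeat succeeds.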
__init__.py feat(resources): add shared VRAMLease, GpuInfo, NodeInfo models 2026-03-30 20:21:37 -07:00
test_agent_app.py feat(resources): add cforch-agent FastAPI app with /health /gpu-info /evict 2026-03-30 20:51:08 -07:00
test_agent_supervisor.py feat: add idle sweep to AgentSupervisor 2026-04-02 12:30:28 -07:00
test_agent_watchdog.py feat: agent watchdog — persist known nodes + auto-reconnect after coordinator restart 2026-04-02 22:01:55 -07:00
test_cli.py feat(resources): add cf-orch CLI with start, agent, status, install-service commands 2026-03-30 22:27:11 -07:00
test_client.py feat(core): add CFOrchClient sync+async context manager 2026-04-02 11:44:35 -07:00
test_coordinator_allocate.py feat: wire ServiceRegistry into coordinator allocate endpoints 2026-04-02 12:30:58 -07:00
test_coordinator_app.py fix(orch): tighten VRAM pre-flight to require full max_mb free (not half) 2026-04-02 16:44:36 -07:00
test_coordinator_probe.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_docuvision.py feat: hardware detection, cf-docuvision service, documents ingestion pipeline 2026-04-02 18:53:25 -07:00
test_eviction_engine.py feat(resources): add AgentSupervisor and EvictionEngine 2026-03-30 21:44:42 -07:00
test_eviction_executor.py feat(resources): add EvictionExecutor with SIGTERM/grace/SIGKILL sequence 2026-03-30 20:46:45 -07:00
test_gpu_monitor.py fix(resources): patch subprocess at import site in gpu_monitor tests 2026-03-30 20:45:01 -07:00
test_integration.py feat: wire ServiceRegistry into coordinator allocate endpoints 2026-04-02 12:30:58 -07:00
test_lease_manager.py fix(resources): rename lambda var; convert asyncio.run test to async 2026-03-30 20:41:03 -07:00
test_models.py fix(resources): add expires_at sentinel comment; move pytest import to module level 2026-03-30 20:25:58 -07:00
test_node_selector.py fix(orch): tighten VRAM pre-flight to require full max_mb free (not half) 2026-04-02 16:44:36 -07:00
test_node_store.py feat: agent watchdog — persist known nodes + auto-reconnect after coordinator restart 2026-04-02 22:01:55 -07:00
test_profile_registry.py fix(resources): move MagicMock import to module level in profile registry tests 2026-03-30 20:36:40 -07:00
test_service_manager.py feat(orch): replace Ouro/vllm-Docker with generic HF inference server; add ProcessSpec 2026-04-02 15:33:08 -07:00
test_service_registry.py fix: TTL sweep, immutability, service-scoped release, logger in orch alloc 2026-04-02 12:55:38 -07:00