circuitforge-core

Author	SHA1	Message	Date
pyr0ball	a54a530493	feat: agent watchdog — persist known nodes + auto-reconnect after coordinator restart closes #15 - NodeStore: SQLite persistence for known agent nodes (~/.local/share/circuitforge/cf-orch-nodes.db) - upsert on every register(); prune_stale() for 30-day cleanup - survives coordinator restarts — data readable by next process - AgentSupervisor.restore_from_store(): reload known nodes on startup, mark all offline; heartbeat loop brings back any that respond - AgentSupervisor.register(): persists to NodeStore on every call - cli.py coordinator: NodeStore wired in; restore_from_store() called before uvicorn starts - cli.py agent: one-shot registration replaced with persistent reconnect loop (daemon thread, 30 s interval) — coordinator restart → nodes reappear within one cycle with no manual intervention on agent hosts - 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)	2026-04-02 22:01:55 -07:00
pyr0ball	cd9864b5e8	feat: hardware detection, cf-docuvision service, documents ingestion pipeline Closes #5, #7, #8, #13 ## hardware module (closes #5) - HardwareSpec, LLMBackendConfig, LLMConfig dataclasses - VramTier ladder (CPU / 2 / 4 / 6 / 8 / 16 / 24 GB) with select_tier() - generate_profile() maps HardwareSpec → LLMConfig for llm.yaml generation - detect_hardware() with nvidia-smi / rocm-smi / system_profiler / cpu fallback - 31 tests across tiers, generator, and detect ## cf-docuvision service (closes #8) - FastAPI service wrapping ByteDance/Dolphin-v2 (Qwen2.5-VL backbone) - POST /extract: image_b64 or image_path + hint → ExtractResponse - Lazy model loading; JSON-structured output with plain-text fallback - ProcessSpec managed blocks added to all four GPU profiles (6/8/16/24 GB) - 14 tests ## documents module (closes #7) - StructuredDocument, Element, ParsedTable dataclasses (frozen, composable) - DocuvisionClient: thin HTTP client for cf-docuvision POST /extract - ingest(): primary cf-docuvision path → LLMRouter vision fallback → empty doc - CF_DOCUVISION_URL env var for URL override - 22 tests ## coordinator probe loop (closes #13) - _run_instance_probe_loop: starting → running on 200; starting → stopped on timeout - 4 async tests with CancelledError-based tick control	2026-04-02 18:53:25 -07:00
pyr0ball	bd132851ec	fix(orch): tighten VRAM pre-flight to require full max_mb free (not half) max_mb // 2 was too loose — Qwen2.5-3B needs ~5.9 GB on an 8 GB card but the threshold only required 3.25 GB free, allowing Ollama to hold 4.5 GB while a load attempt was still dispatched (causing OOM crash). - node_selector: can_fit = free_mb >= service_max_mb (was // 2) - coordinator /start: same threshold fix + updated error message - tests: two new node_selector tests pin the full-ceiling semantics; updated stale docstring in coordinator app test	2026-04-02 16:44:36 -07:00
pyr0ball	c78341fc6f	feat(orch): replace Ouro/vllm-Docker with generic HF inference server; add ProcessSpec - Add circuitforge_core/resources/inference/llm_server.py: generic OpenAI-compatible FastAPI server for any HuggingFace causal LM (Phi-4-mini-instruct, Qwen2.5-3B-Instruct) - Add service_manager.py + service_probe.py: ProcessSpec start/stop/is_running support (Popen-based; socket probe confirms readiness before marking running) - Update all 4 public GPU profiles to use ProcessSpec→llm_server instead of Docker vllm: 6gb (max_mb 5500), 8gb (max_mb 6500), 16gb/24gb (max_mb 9000) - Model candidates: Phi-4-mini-instruct first (7.2GB), Qwen2.5-3B-Instruct fallback (5.8GB) - Remove ouro_server.py (Ouro incompatible with transformers 5.x; vllm Docker also incompatible) - Add 17 tests for ServiceManager ProcessSpec (start/stop/is_running/list/get_url)	2026-04-02 15:33:08 -07:00
pyr0ball	e58c3aea23	fix: TTL sweep, immutability, service-scoped release, logger in orch alloc - ServiceRegistry: add sweep_expired_allocations() to remove stale TTL allocations and transition instances to idle; add get_allocation() helper - AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before idle-timeout check so crashed-caller leaks are cleaned up each sweep tick - schema._parse_managed: copy raw dict before extracting 'type' key instead of mutating caller's dict with pop() - app.release_allocation: validate allocation belongs to the given service path param before releasing; return 404 if mismatch - router._try_cf_orch_alloc: replace print() with logger.warning(); add module-level logger = logging.getLogger(__name__) - tests: add test_sweep_expired_allocations covering TTL expiry and idle state transition	2026-04-02 12:55:38 -07:00
pyr0ball	1a20b80a50	test: add VRAM pre-flight 503 test for ensure_service	2026-04-02 12:49:50 -07:00
pyr0ball	a4ccaaf3e2	fix: address coordinator/idle-sweep quality issues from review - CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop, preventing repeated stop POSTs on every 3rd tick for the same instance - CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong instance idle on multi-GPU nodes when an allocation is released - CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the actual HTTPException(503) before the candidate loop - IMPORTANT: register() now updates agent_url on re-registration if it changed, so relocated agents are tracked correctly - IMPORTANT: updated test_service_registry.py callers of active_allocations() to pass the now-required gpu_id argument	2026-04-02 12:45:31 -07:00
pyr0ball	49ab9e4e88	feat: wire ServiceRegistry into coordinator allocate endpoints	2026-04-02 12:30:58 -07:00
pyr0ball	c299482e0d	feat: add idle sweep to AgentSupervisor	2026-04-02 12:30:28 -07:00
pyr0ball	1e168ac636	feat(profiles): add idle_stop_after_s field; set 600s for vllm slot Add idle_stop_after_s to ServiceProfile (default 0 = never stop). Set 600s (10 min) timeout on vllm slot in all single-GPU profiles. Backward compatible; non-vllm services inherit default 0 (no auto-stop).	2026-04-02 12:24:19 -07:00
pyr0ball	9754f522d9	feat(orch): add ServiceRegistry — allocation tracking + idle state machine	2026-04-02 12:22:46 -07:00
pyr0ball	defaf39883	feat(core): add CFOrchClient sync+async context manager Implements CFOrchClient with allocate() (sync contextmanager) and allocate_async() (async contextmanager) for cf-orch GPU resource allocation. Releases allocation on exit; ignores 404 on release; raises RuntimeError on non-2xx allocation response. Exports CFOrchClient and Allocation from circuitforge_core.resources. Note: async test uses unittest.mock rather than httpretty — httpretty only patches stdlib sockets and does not intercept httpx async (anyio) transport.	2026-04-02 11:44:35 -07:00
pyr0ball	8201f6b3e9	feat(orch): add /api/services/{service}/allocate with auto node selection	2026-04-02 11:25:38 -07:00
pyr0ball	52d2c5cf38	feat(orch): expose online_agents() and resident_keys() helpers	2026-04-02 11:22:29 -07:00
pyr0ball	13eb0c85f1	feat(orch): add NodeSelector — warm-first GPU scoring	2026-04-02 11:18:44 -07:00
pyr0ball	7aa0ad7a51	feat(dashboard): add self-hosted coordinator dashboard at GET / - dashboard.html: node-centric layout — GPU cards with VRAM bars and sparklines, active leases table with TTL progress bars, service health pill, auto-refreshes every 5s via fetch() against the local JSON API - All dynamic content set via DOM textContent / createElementNS — no innerHTML with user-sourced strings - coordinator/app.py: serves dashboard.html at GET / (HTMLResponse, excluded from OpenAPI schema); HTML read at import time from package dir - test_dashboard_serves_html: verifies 200, content-type text/html, and key route markers present	2026-03-31 18:57:25 -07:00
pyr0ball	d755e9ea2c	test(resources): add integration tests for full lease/eviction cycle	2026-03-30 22:37:06 -07:00
pyr0ball	70017abd35	feat(resources): add cf-orch CLI with start, agent, status, install-service commands	2026-03-30 22:27:11 -07:00
pyr0ball	4bcd297b18	feat(resources): add cforch-coordinator FastAPI app with lease/node/profile endpoints	2026-03-30 22:01:46 -07:00
pyr0ball	cede761d82	feat(resources): add AgentSupervisor and EvictionEngine	2026-03-30 21:44:42 -07:00
pyr0ball	7718911652	feat(resources): add cforch-agent FastAPI app with /health /gpu-info /evict	2026-03-30 20:51:08 -07:00
pyr0ball	4a857d5339	feat(resources): add EvictionExecutor with SIGTERM/grace/SIGKILL sequence	2026-03-30 20:46:45 -07:00
pyr0ball	a79fd10f45	fix(resources): patch subprocess at import site in gpu_monitor tests	2026-03-30 20:45:01 -07:00
pyr0ball	3dcbe801f1	feat(resources): add GpuMonitor for nvidia-smi polling	2026-03-30 20:42:57 -07:00
pyr0ball	6b239b76e3	fix(resources): rename lambda var; convert asyncio.run test to async	2026-03-30 20:41:03 -07:00
pyr0ball	d60503f059	feat(resources): add LeaseManager with VRAM tracking and eviction candidate selection	2026-03-30 20:38:51 -07:00
pyr0ball	cdd8072b32	fix(resources): move MagicMock import to module level in profile registry tests	2026-03-30 20:36:40 -07:00
pyr0ball	0389f4f167	feat(resources): add ProfileRegistry with auto-detect and public profile loading	2026-03-30 20:34:16 -07:00
pyr0ball	bfc1f7b7b9	fix(resources): guard non-dict YAML in load_profile; remove unused FIXTURES constant	2026-03-30 20:30:30 -07:00
pyr0ball	c6a58b6a37	feat(resources): add GPU profile schema and public 8GB/6GB/2GB profiles	2026-03-30 20:28:06 -07:00
pyr0ball	b774afb6b0	fix(resources): add expires_at sentinel comment; move pytest import to module level	2026-03-30 20:25:58 -07:00
pyr0ball	0888f0f16b	feat(resources): add shared VRAMLease, GpuInfo, NodeInfo models	2026-03-30 20:21:37 -07:00

32 commits