LLMRouter env-var auto-config:
- No llm.yaml required — auto-configures from ANTHROPIC_API_KEY,
OPENAI_API_KEY, or OLLAMA_HOST on first use
- Bare-metal self-hosters can run any CF product with just env vars
- Raises FileNotFoundError with an actionable message only when no env
vars are set either
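A minimal sketch of that fallback order (helper name and config shape are
illustrative, not the real LLMRouter internals):

    import os

    def autoconfig_from_env() -> dict:
        """Pick a provider from env vars when no llm.yaml exists; first match wins."""
        if os.environ.get("ANTHROPIC_API_KEY"):
            return {"provider": "anthropic", "api_key": os.environ["ANTHROPIC_API_KEY"]}
        if os.environ.get("OPENAI_API_KEY"):
            return {"provider": "openai", "api_key": os.environ["OPENAI_API_KEY"]}
        if os.environ.get("OLLAMA_HOST"):
            return {"provider": "ollama", "host": os.environ["OLLAMA_HOST"]}
        raise FileNotFoundError(
            "No llm.yaml found and none of ANTHROPIC_API_KEY, OPENAI_API_KEY, "
            "OLLAMA_HOST are set; create llm.yaml or export one of them."
        )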
CFOrchClient auth:
- Reads CF_LICENSE_KEY env var (or explicit api_key param)
- Sends Authorization: Bearer <key> on all allocation/release requests
- Required for the hosted public coordinator; no-op for local deployments
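Roughly, the auth wiring (a sketch; the real constructor may differ):

    import os
    import httpx

    class CFOrchClient:
        def __init__(self, base_url: str, api_key: str | None = None):
            # Explicit api_key wins; otherwise fall back to CF_LICENSE_KEY.
            key = api_key or os.environ.get("CF_LICENSE_KEY")
            # Local deployments without a key just send no Authorization header.
            headers = {"Authorization": f"Bearer {key}"} if key else {}
            self._http = httpx.Client(base_url=base_url, headers=headers)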
HeimdallAuthMiddleware (new):
- FastAPI middleware for cf-orch coordinator
- Enabled by HEIMDALL_URL env var; self-hosted deployments skip it
- 5-min TTL cache (matching Kiwi cloud session) keeps Heimdall off the
per-allocation hot path
- /api/health exempt; free-tier keys rejected with 403 + reason
- 13 tests covering cache TTL, tier ranking, and middleware gating
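A sketch of the middleware shape under those constraints; verify_key() stands
in for the real Heimdall round trip, and the cache layout is assumed:

    import time
    from starlette.middleware.base import BaseHTTPMiddleware
    from starlette.responses import JSONResponse

    TTL_S = 300  # 5 min, matching the Kiwi cloud session

    async def verify_key(key: str) -> str:
        ...  # placeholder: ask Heimdall (HEIMDALL_URL) for the key's tier

    class HeimdallAuthMiddleware(BaseHTTPMiddleware):
        def __init__(self, app):
            super().__init__(app)
            self._cache: dict[str, tuple[float, str]] = {}  # key -> (expiry, tier)

        async def dispatch(self, request, call_next):
            if request.url.path == "/api/health":    # health stays unauthenticated
                return await call_next(request)
            key = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
            now = time.monotonic()
            hit = self._cache.get(key)
            if hit is None or hit[0] < now:          # miss or expired: one Heimdall call
                tier = await verify_key(key)
                self._cache[key] = (now + TTL_S, tier)
            else:
                tier = hit[1]
            if tier == "free":
                return JSONResponse({"detail": "free tier: GPU allocation not included"},
                                    status_code=403)
            return await call_next(request)

Wiring stays conditional: add_middleware() only runs when HEIMDALL_URL is set,
so self-hosted deployments never touch it.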
- ProcessSpec: adopt (bool) and health_path (str, default /health) fields
- ServiceManager: adopt=True probes health_path before spawning; is_running()
uses health probe for adopt services rather than proc table + socket check
- _probe_health() helper: urllib GET on localhost:port+path, returns bool (sketch below)
- Agent /services/{service}/start: returns adopted=True when service was
already running; coordinator sets state=running immediately (no probe wait)
- ServiceInstance: health_path field (default /health)
- service_registry.upsert_instance(): health_path kwarg
- Probe loop uses inst.health_path instead of hardcoded /health
- coordinator allocate_service: looks up health_path from profile spec via
_get_health_path() and stores on ServiceInstance
- All GPU profiles (2/4/6/8/16/24 GB + cpu-16/32): ollama managed block
with adopt=true, health_path=/api/tags, port 11434
- 11 new tests
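A sketch of the probe helper (the exact signature is assumed):

    import urllib.request

    def _probe_health(port: int, path: str = "/health", timeout: float = 2.0) -> bool:
        """True if a local service answers its health endpoint with a 2xx."""
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}",
                                        timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except OSError:      # connection refused, timeout, HTTP error, ...
            return False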
Closes #15
- NodeStore: SQLite persistence for known agent nodes
(~/.local/share/circuitforge/cf-orch-nodes.db)
- upsert on every register(); prune_stale() for 30-day cleanup
- survives coordinator restarts — data readable by next process
- AgentSupervisor.restore_from_store(): reload known nodes on startup,
mark all offline; heartbeat loop brings back any that respond
- AgentSupervisor.register(): persists to NodeStore on every call
- cli.py coordinator: NodeStore wired in; restore_from_store() called
before uvicorn starts
- cli.py agent: one-shot registration replaced with persistent reconnect
loop (daemon thread, 30 s interval) — coordinator restart → nodes
reappear within one cycle with no manual intervention on agent hosts
- 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)
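Sketch of the store's core (method names per the notes above; column set assumed):

    import sqlite3
    import time
    from pathlib import Path

    DB = Path.home() / ".local/share/circuitforge/cf-orch-nodes.db"

    class NodeStore:
        def __init__(self, path: Path = DB):
            path.parent.mkdir(parents=True, exist_ok=True)
            self._db = sqlite3.connect(str(path))
            self._db.execute("CREATE TABLE IF NOT EXISTS nodes ("
                             "node_id TEXT PRIMARY KEY, agent_url TEXT, last_seen REAL)")

        def upsert(self, node_id: str, agent_url: str) -> None:
            # Called from register(): one row per node, refreshed every time.
            self._db.execute(
                "INSERT INTO nodes VALUES (?, ?, ?) ON CONFLICT(node_id) DO UPDATE "
                "SET agent_url = excluded.agent_url, last_seen = excluded.last_seen",
                (node_id, agent_url, time.time()))
            self._db.commit()

        def prune_stale(self, max_age_s: float = 30 * 86400) -> None:
            # 30-day cleanup of nodes that never came back.
            self._db.execute("DELETE FROM nodes WHERE last_seen < ?",
                             (time.time() - max_age_s,))
            self._db.commit()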
Coordinator now polls all 'starting' instances every 5 s via GET /health.
On 200: state → running. After 300 s without a healthy response: state →
stopped. Closes #10.
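Shape of that loop (registry accessors and the started_at stamp are assumptions):

    import asyncio
    import time
    import httpx

    POLL_S, TIMEOUT_S = 5, 300

    async def poll_starting(registry) -> None:
        async with httpx.AsyncClient() as http:
            while True:
                now = time.monotonic()
                for inst in registry.instances(state="starting"):
                    try:
                        r = await http.get(f"http://{inst.host}:{inst.port}/health",
                                           timeout=3)
                        if r.status_code == 200:
                            inst.state = "running"      # healthy: promote
                            continue
                    except httpx.HTTPError:
                        pass                            # not up yet
                    if now - inst.started_at > TIMEOUT_S:
                        inst.state = "stopped"          # gave up after 300 s
                await asyncio.sleep(POLL_S)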
max_mb // 2 was too loose — Qwen2.5-3B needs ~5.9 GB on an 8 GB card
but the threshold only required 3.25 GB free, allowing Ollama to hold
4.5 GB while a load attempt was still dispatched (causing an OOM crash).
- node_selector: can_fit = free_mb >= service_max_mb (was // 2)
- coordinator /start: same threshold fix + updated error message
- tests: two new node_selector tests pin the full-ceiling semantics;
updated stale docstring in coordinator app test
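Concretely, with this incident's numbers (the 6.5 GB ceiling is inferred from
the 3.25 GB figure):

    service_max_mb = 6_500   # profile ceiling; Qwen2.5-3B peaks at ~5.9 GB
    free_mb = 3_500          # 8 GB card with Ollama holding 4.5 GB

    free_mb >= service_max_mb // 2   # old check: 3500 >= 3250 -> True, OOM
    free_mb >= service_max_mb        # new check: 3500 >= 6500 -> False, rejected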
- apply_chat_template() returns BatchEncoding in transformers 5.x (not bare tensor);
extract .input_ids explicitly with fallback for 4.x compat
- Switch from deprecated torch_dtype= to dtype= in from_pretrained()
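The shim, roughly (checkpoint name is illustrative):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    messages = [{"role": "user", "content": "hello"}]

    # 5.x hands back a BatchEncoding; 4.x may hand back a bare tensor.
    encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                            return_tensors="pt")
    input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded

    # dtype= replaces the deprecated torch_dtype= kwarg.
    model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)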
- ServiceRegistry: add sweep_expired_allocations() to remove stale TTL
allocations and transition instances to idle; add get_allocation() helper
- AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before
idle-timeout check so crashed-caller leaks are cleaned up each sweep tick
- schema._parse_managed: copy raw dict before extracting 'type' key instead
of mutating caller's dict with pop()
- app.release_allocation: validate allocation belongs to the given service
path param before releasing; return 404 if mismatch
- router._try_cf_orch_alloc: replace print() with logger.warning(); add
module-level logger = logging.getLogger(__name__)
- tests: add test_sweep_expired_allocations covering TTL expiry and idle
state transition
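Sketch of the sweep's semantics (the allocation/instance shapes are stand-ins
for the registry's real models):

    import time
    from dataclasses import dataclass

    @dataclass
    class Alloc:
        id: str
        instance_id: str
        expires_at: float    # monotonic deadline set from the TTL

    def sweep_expired_allocations(allocations: dict, instances: dict) -> list[str]:
        now = time.monotonic()
        expired = [a for a in allocations.values() if a.expires_at <= now]
        for a in expired:
            del allocations[a.id]
        live = {a.instance_id for a in allocations.values()}
        for a in expired:
            inst = instances.get(a.instance_id)
            if inst is not None and a.instance_id not in live:
                inst.state = "idle"   # nothing pins it anymore
        return [a.id for a in expired]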
- CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop,
preventing repeated stop POSTs on every 3rd tick for the same instance
- CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong
instance idle on multi-GPU nodes when an allocation is released
- CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the
actual HTTPException(503) before the candidate loop
- IMPORTANT: register() now updates agent_url on re-registration if it changed,
so relocated agents are tracked correctly
- IMPORTANT: updated test_service_registry.py callers of active_allocations()
to pass the now-required gpu_id argument
Add idle_stop_after_s to ServiceProfile (default 0 = never stop).
Set 600 s (10 min) timeout on vllm slot in all single-GPU profiles.
Backward compatible; non-vllm services inherit default 0 (no auto-stop).
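In dataclass terms (the real ServiceProfile carries more fields):

    from dataclasses import dataclass

    @dataclass
    class ServiceProfile:
        name: str
        idle_stop_after_s: int = 0    # 0 = never auto-stop (the compatible default)

    vllm = ServiceProfile("vllm", idle_stop_after_s=600)   # single-GPU profiles
    ollama = ServiceProfile("ollama")                      # inherits 0: left running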
Implements CFOrchClient with allocate() (sync contextmanager) and
allocate_async() (async contextmanager) for cf-orch GPU resource
allocation. Releases allocation on exit; ignores 404 on release;
raises RuntimeError on non-2xx allocation response. Exports
CFOrchClient and Allocation from circuitforge_core.resources.
Note: the async test uses unittest.mock rather than httpretty, since
httpretty only patches stdlib sockets and does not intercept httpx's
async (anyio) transport.
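Typical call site (the constructor argument and service name are assumptions;
the import path is per this change, the port per the scheduler default
elsewhere in this log):

    from circuitforge_core.resources import CFOrchClient

    client = CFOrchClient("http://127.0.0.1:7700")

    # Released on exit even if the body raises; a 404 on release (already
    # swept) is ignored; a non-2xx allocation response raises RuntimeError.
    with client.allocate("vllm") as alloc:
        print(alloc)    # stand-in for the caller's real work

    # Async twin:
    #   async with client.allocate_async("vllm") as alloc: ...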
Previously shutdown() only joined the scheduler loop thread. Batch
worker threads (which decrement _reserved_vram in their finally block)
could still be running when shutdown returned, leaving stale VRAM
accounting. Now snapshots active workers under lock and joins them all.
Snapshot-then-join pattern avoids holding the lock across blocking join
calls (which would deadlock since workers acquire the same lock on exit).
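The pattern, reduced to a skeleton (attribute names assumed):

    import threading

    class TaskScheduler:            # only the shutdown path is sketched
        def __init__(self):
            self._lock = threading.Lock()
            self._stop = threading.Event()
            self._active_workers: list[threading.Thread] = []
            self._loop = threading.Thread(target=self._stop.wait, daemon=True)
            self._loop.start()

        def shutdown(self) -> None:
            self._stop.set()
            self._loop.join()       # 1. scheduler loop, as before
            # 2. snapshot under the lock, join outside it: workers take this
            # same lock in their finally block to decrement reserved VRAM,
            # so holding it across join() would deadlock.
            with self._lock:
                workers = list(self._active_workers)
            for t in workers:
                t.join()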
Before running a batch of tasks, the scheduler now requests a VRAM lease
from the cf-orch coordinator (POST /api/leases). The lease is held for the
full batch and released in the finally block so it's always cleaned up even
on error. Falls back gracefully if the coordinator is unreachable.
Adds coordinator_url and service_name params to TaskScheduler.__init__
and get_scheduler() so callers can override the default localhost:7700.
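Sketch of the lease bracketing (payload shape, lease_id field, and
DELETE-to-release are assumptions):

    import httpx

    def run_batch(coordinator_url: str, service_name: str, tasks, run_task) -> None:
        lease_id = None
        try:
            try:
                r = httpx.post(f"{coordinator_url}/api/leases",
                               json={"service": service_name}, timeout=5)
                r.raise_for_status()
                lease_id = r.json()["lease_id"]
            except httpx.HTTPError:
                pass                    # coordinator unreachable: run unleased
            for task in tasks:
                run_task(task)
        finally:
            if lease_id is not None:    # always released, even mid-batch failure
                httpx.delete(f"{coordinator_url}/api/leases/{lease_id}", timeout=5)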
coordinator/app.py:
- Add POST /api/nodes — agents POST {node_id, agent_url} to self-register;
coordinator immediately polls the new agent for GPU info
- Add lifespan context manager that starts/stops AgentSupervisor heartbeat
loop (previously the loop was never started)
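The lifespan hookup, sketched with a stub supervisor (the real start/stop API
may differ):

    from contextlib import asynccontextmanager
    from fastapi import FastAPI

    class AgentSupervisor:          # stub standing in for the cf-orch class
        def start(self): ...
        def stop(self): ...

    supervisor = AgentSupervisor()

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        supervisor.start()          # heartbeat loop now starts with the server
        yield
        supervisor.stop()           # and stops cleanly on shutdown

    app = FastAPI(lifespan=lifespan)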
cli.py start:
- Add --node-id flag (default 'local')
- Pre-register the local agent URL (http://127.0.0.1:{agent_port}) so the
heartbeat loop can poll it immediately on startup
- Drop redundant lease_manager.register_gpu() call — supervisor.poll_agent()
now does this via the heartbeat after the agent responds
cli.py agent:
- Add --advertise-host flag for NATted/multi-homed nodes
- Fire registration POST to coordinator in a daemon thread (2 s delay) so
uvicorn.run() can start binding immediately; no double uvicorn.run()
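Sketch of the delayed registration (the /api/nodes payload matches the
coordinator change above; everything else is illustrative):

    import threading
    import httpx

    def register_in_background(coordinator_url: str, node_id: str,
                               agent_url: str) -> None:
        def post():
            try:
                httpx.post(f"{coordinator_url}/api/nodes",
                           json={"node_id": node_id, "agent_url": agent_url},
                           timeout=5)
            except httpx.HTTPError:
                pass                       # best-effort; coordinator can poll later
        t = threading.Timer(2.0, post)     # 2 s delay lets uvicorn bind first
        t.daemon = True
        t.start()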