Commit graph

65 commits

SHA1 Message Date
a36f469d60 chore: CHANGELOG for v0.3.0 2026-04-02 18:56:49 -07:00
1de5ec767c Merge pull request 'feat: hardware detection, cf-docuvision service, documents ingestion pipeline' (#14) from feature/hardware-docuvision into main 2026-04-02 18:55:50 -07:00
cd9864b5e8 feat: hardware detection, cf-docuvision service, documents ingestion pipeline
Closes #5, #7, #8, #13

## hardware module (closes #5)
- HardwareSpec, LLMBackendConfig, LLMConfig dataclasses
- VramTier ladder (CPU / 2 / 4 / 6 / 8 / 16 / 24 GB) with select_tier()
- generate_profile() maps HardwareSpec → LLMConfig for llm.yaml generation
- detect_hardware() with nvidia-smi / rocm-smi / system_profiler / cpu fallback
- 31 tests across tiers, generator, and detect

## cf-docuvision service (closes #8)
- FastAPI service wrapping ByteDance/Dolphin-v2 (Qwen2.5-VL backbone)
- POST /extract: image_b64 or image_path + hint → ExtractResponse
- Lazy model loading; JSON-structured output with plain-text fallback
- ProcessSpec managed blocks added to all four GPU profiles (6/8/16/24 GB)
- 14 tests

## documents module (closes #7)
- StructuredDocument, Element, ParsedTable dataclasses (frozen, composable)
- DocuvisionClient: thin HTTP client for cf-docuvision POST /extract
- ingest(): primary cf-docuvision path → LLMRouter vision fallback → empty doc
- CF_DOCUVISION_URL env var for URL override
- 22 tests

## coordinator probe loop (closes #13)
- _run_instance_probe_loop: starting → running on 200; starting → stopped on timeout
- 4 async tests with CancelledError-based tick control
2026-04-02 18:53:25 -07:00
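
Note on cd9864b5e8: the VramTier ladder can be pictured as a sorted list of thresholds. The sketch below is only an illustration. It assumes that select_tier() picks the largest tier that does not exceed the detected VRAM and that the CPU tier can be represented as 0 GB; the log shows neither the real signature nor the return type, so the names and semantics here are hypothetical.

```python
from bisect import bisect_right

# Tier thresholds in GB, taken from the commit message.
# Representing the CPU tier as 0 is an assumption made for this sketch.
VRAM_TIERS_GB = (0, 2, 4, 6, 8, 16, 24)


def select_tier(vram_gb: float) -> int:
    """Return the largest tier threshold that is <= vram_gb (hypothetical semantics)."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    # bisect_right counts the thresholds that are <= vram_gb;
    # the last of those is the selected tier.
    index = bisect_right(VRAM_TIERS_GB, vram_gb) - 1
    return VRAM_TIERS_GB[index]
```

Under these assumptions a 12 GB card would land on the 8 GB tier, and anything below 2 GB would fall back to the CPU tier.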
482c430cdb docs: add CHANGELOG for v0.1.0 and v0.2.0 2026-04-02 17:25:06 -07:00
749e51ccca Merge pull request 'feat(orch): health probe loop + VRAM pre-flight fix' (#12) from feature/orch-llm-server into main 2026-04-02 17:24:09 -07:00
a7290c1240 feat(orch): background health probe loop — starting → running transition
Coordinator now polls all 'starting' instances every 5 s via GET /health.
On 200: state → running. After 300 s without a healthy response: state →
stopped. Closes #10.
2026-04-02 17:18:16 -07:00
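
Note on a7290c1240: the transition rule described above can be written as a small pure function. This is a minimal sketch, not the coordinator's code. The 5 s interval, the 300 s timeout, and the state names come from the commit message; the function name, its parameters, and the choice to keep non-200 responses in 'starting' until the timeout are assumptions.

```python
from typing import Optional

PROBE_INTERVAL_S = 5.0     # polling period, from the commit message
STARTUP_TIMEOUT_S = 300.0  # give-up threshold, from the commit message


def next_state(state: str, health_status: Optional[int], seconds_starting: float) -> str:
    """Decide an instance's state after one probe tick (hypothetical helper).

    health_status is the HTTP status of GET /health, or None if the request failed.
    seconds_starting is how long the instance has been in 'starting'.
    """
    if state != "starting":
        return state  # only 'starting' instances are probed
    if health_status == 200:
        return "running"
    if seconds_starting >= STARTUP_TIMEOUT_S:
        return "stopped"
    return "starting"  # keep waiting until the next tick
```

Keeping the decision separate from the I/O makes the rule easy to test without a running service.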
bd132851ec fix(orch): tighten VRAM pre-flight to require full max_mb free (not half)
max_mb // 2 was too loose — Qwen2.5-3B needs ~5.9 GB on an 8 GB card
but the threshold only required 3.25 GB free, allowing Ollama to hold
4.5 GB while a load attempt was still dispatched (causing OOM crash).

- node_selector: can_fit = free_mb >= service_max_mb (was // 2)
- coordinator /start: same threshold fix + updated error message
- tests: two new node_selector tests pin the full-ceiling semantics;
  updated stale docstring in coordinator app test
2026-04-02 16:44:36 -07:00
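
Note on bd132851ec: the arithmetic behind the fix, as a small sketch. The two comparisons mirror the before and after expressions quoted in the commit message; the function names other than can_fit, and the assumption that an 8 GB card exposes 8192 MB with no driver overhead, are illustrative only.

```python
def can_fit_old(free_mb: int, service_max_mb: int) -> bool:
    """Pre-fix check: only half of max_mb had to be free."""
    return free_mb >= service_max_mb // 2


def can_fit(free_mb: int, service_max_mb: int) -> bool:
    """Post-fix check: the full max_mb ceiling must be free."""
    return free_mb >= service_max_mb


# Scenario from the commit message: 8 GB profile (max_mb 6500), Ollama holding 4.5 GB.
TOTAL_MB = 8192              # assumption: 8 GB card, overhead ignored
FREE_MB = TOTAL_MB - 4500    # 3692 MB left for the new service
SERVICE_MAX_MB = 6500
```

With these numbers the old check passes (3692 >= 3250) and dispatches a load that cannot succeed, while the new check rejects it (3692 < 6500).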
2d095f0090 fix(llm-server): handle transformers 5.x BatchEncoding; use dtype kwarg
- apply_chat_template() returns BatchEncoding in transformers 5.x (not bare tensor);
  extract .input_ids explicitly with fallback for 4.x compat
- Switch from deprecated torch_dtype= to dtype= in from_pretrained()
2026-04-02 16:36:07 -07:00
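
Note on 2d095f0090: one way to express "extract .input_ids with a fallback" without depending on a specific transformers version is duck typing. The helper below is a hypothetical sketch, not the code from llm_server.py; FakeBatchEncoding is a stand-in used only so the example runs without transformers installed.

```python
from typing import Any

_MISSING = object()


def extract_input_ids(encoded: Any) -> Any:
    """Return encoded.input_ids when the attribute exists, else the value unchanged.

    Covers both cases named in the commit: a BatchEncoding-like object (5.x)
    and a bare tensor (4.x), which has no input_ids attribute.
    """
    input_ids = getattr(encoded, "input_ids", _MISSING)
    return encoded if input_ids is _MISSING else input_ids


class FakeBatchEncoding:
    """Stand-in for the 5.x return type; exists only to exercise the helper."""

    def __init__(self, input_ids: Any) -> None:
        self.input_ids = input_ids
```

A call site would then pass the result of apply_chat_template() through extract_input_ids() before handing it to generation.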
c78341fc6f feat(orch): replace Ouro/vllm-Docker with generic HF inference server; add ProcessSpec
- Add circuitforge_core/resources/inference/llm_server.py: generic OpenAI-compatible
  FastAPI server for any HuggingFace causal LM (Phi-4-mini-instruct, Qwen2.5-3B-Instruct)
- Add service_manager.py + service_probe.py: ProcessSpec start/stop/is_running support
  (Popen-based; socket probe confirms readiness before marking running)
- Update all 4 public GPU profiles to use ProcessSpec→llm_server instead of Docker vllm:
  6gb (max_mb 5500), 8gb (max_mb 6500), 16gb/24gb (max_mb 9000)
- Model candidates: Phi-4-mini-instruct first (7.2GB), Qwen2.5-3B-Instruct fallback (5.8GB)
- Remove ouro_server.py (Ouro incompatible with transformers 5.x; vllm Docker also incompatible)
- Add 17 tests for ServiceManager ProcessSpec (start/stop/is_running/list/get_url)
2026-04-02 15:33:08 -07:00
27999925cf fix(orch): seed ServiceInstance on first allocate start 2026-04-02 14:22:55 -07:00
c5e12b74f2 Merge pull request 'feat: auto service lifecycle — /allocate, NodeSelector, idle sweep, CFOrchClient' (#9) from feature/orch-auto-lifecycle into main 2026-04-02 14:11:36 -07:00
e58c3aea23 fix: TTL sweep, immutability, service-scoped release, logger in orch alloc
- ServiceRegistry: add sweep_expired_allocations() to remove stale TTL
  allocations and transition instances to idle; add get_allocation() helper
- AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before
  idle-timeout check so crashed-caller leaks are cleaned up each sweep tick
- schema._parse_managed: copy raw dict before extracting 'type' key instead
  of mutating caller's dict with pop()
- app.release_allocation: validate allocation belongs to the given service
  path param before releasing; return 404 if mismatch
- router._try_cf_orch_alloc: replace print() with logger.warning(); add
  module-level logger = logging.getLogger(__name__)
- tests: add test_sweep_expired_allocations covering TTL expiry and idle
  state transition
2026-04-02 12:55:38 -07:00
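
Note on e58c3aea23: the TTL sweep can be sketched with a toy registry. Only the method name sweep_expired_allocations() and the "expired allocation removed, instance goes idle" behavior come from the commit message. The data model, the field names, the convention that expires_at == 0 means "no TTL", and the rule that an instance goes idle only when no live allocation remains are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Allocation:
    allocation_id: str
    instance_key: str
    expires_at: float  # epoch seconds; 0 means no TTL (assumption)


@dataclass
class Registry:
    allocations: dict[str, Allocation] = field(default_factory=dict)
    instance_state: dict[str, str] = field(default_factory=dict)

    def sweep_expired_allocations(self, now: float) -> list[str]:
        """Drop allocations whose TTL has passed; idle instances left without any."""
        expired = [
            a for a in self.allocations.values()
            if a.expires_at and a.expires_at <= now
        ]
        for alloc in expired:
            del self.allocations[alloc.allocation_id]
        for alloc in expired:
            still_used = any(
                a.instance_key == alloc.instance_key
                for a in self.allocations.values()
            )
            if not still_used and self.instance_state.get(alloc.instance_key) == "running":
                self.instance_state[alloc.instance_key] = "idle"
        return [a.allocation_id for a in expired]
```

Running the sweep before the idle-timeout check, as the commit describes, means a caller that crashed without releasing its allocation stops pinning the instance once the TTL passes.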
1a20b80a50 test: add VRAM pre-flight 503 test for ensure_service 2026-04-02 12:49:50 -07:00
02806359af feat: add Services table to coordinator dashboard 2026-04-02 12:47:27 -07:00
a4ccaaf3e2 fix: address coordinator/idle-sweep quality issues from review
- CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop,
  preventing repeated stop POSTs on every 3rd tick for the same instance
- CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong
  instance idle on multi-GPU nodes when an allocation is released
- CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the
  actual HTTPException(503) before the candidate loop
- IMPORTANT: register() now updates agent_url on re-registration if it changed,
  so relocated agents are tracked correctly
- IMPORTANT: updated test_service_registry.py callers of active_allocations()
  to pass the now-required gpu_id argument
2026-04-02 12:45:31 -07:00
49ab9e4e88 feat: wire ServiceRegistry into coordinator allocate endpoints 2026-04-02 12:30:58 -07:00
c299482e0d feat: add idle sweep to AgentSupervisor 2026-04-02 12:30:28 -07:00
1e168ac636 feat(profiles): add idle_stop_after_s field; set 600s for vllm slot
Add idle_stop_after_s to ServiceProfile (default 0 = never stop).
Set 600s (10 min) timeout on vllm slot in all single-GPU profiles.
Backward compatible; non-vllm services inherit default 0 (no auto-stop).
2026-04-02 12:24:19 -07:00
9754f522d9 feat(orch): add ServiceRegistry — allocation tracking + idle state machine 2026-04-02 12:22:46 -07:00
17a24173f7 feat(llm): add cf_orch allocation support to LLMRouter backends 2026-04-02 12:19:17 -07:00
f741e6a80b fix(orch): hoist service-known check; capture resident_keys once in allocate 2026-04-02 11:45:48 -07:00
defaf39883 feat(core): add CFOrchClient sync+async context manager
Implements CFOrchClient with allocate() (sync contextmanager) and
allocate_async() (async contextmanager) for cf-orch GPU resource
allocation. Releases allocation on exit; ignores 404 on release;
raises RuntimeError on non-2xx allocation response. Exports
CFOrchClient and Allocation from circuitforge_core.resources.

Note: async test uses unittest.mock rather than httpretty — httpretty
only patches stdlib sockets and does not intercept httpx async (anyio)
transport.
2026-04-02 11:44:35 -07:00
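
Note on defaf39883: the allocate-then-release contract can be sketched with contextlib. This is not the CFOrchClient implementation. The transport callable stands in for an HTTP client, the release path and the allocation_id field are hypothetical, and raising on a non-404 release failure is an assumption (the commit only says that 404 is ignored).

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Tuple

# (method, path) -> (status_code, json_body); a stand-in for a real HTTP client.
Transport = Callable[[str, str], Tuple[int, Dict]]


@contextmanager
def allocate(transport: Transport, service: str) -> Iterator[Dict]:
    """Allocate on entry, release on exit (sketch of the described contract)."""
    status, body = transport("POST", f"/api/services/{service}/allocate")
    if not 200 <= status < 300:
        raise RuntimeError(f"allocation failed with HTTP {status}")
    try:
        yield body
    finally:
        # Hypothetical release route; the log does not show the real one.
        release_path = f"/api/services/{service}/allocations/{body['allocation_id']}"
        release_status, _ = transport("DELETE", release_path)
        # 404 means the allocation is already gone (for example, swept by TTL).
        if release_status != 404 and not 200 <= release_status < 300:
            raise RuntimeError(f"release failed with HTTP {release_status}")
```

Putting the release in a finally block is what guarantees cleanup when the caller's body raises.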
8201f6b3e9 feat(orch): add /api/services/{service}/allocate with auto node selection 2026-04-02 11:25:38 -07:00
52d2c5cf38 feat(orch): expose online_agents() and resident_keys() helpers 2026-04-02 11:22:29 -07:00
d600fb6651 refactor(orch): hoist service_max_mb lookup; clarify warm-fallback comments 2026-04-02 11:21:20 -07:00
13eb0c85f1 feat(orch): add NodeSelector — warm-first GPU scoring 2026-04-02 11:18:44 -07:00
427182aae7 feat: cf-orch agent registration + VRAM lease wiring
Merges feature/orch-agent-registration into main.

- Agent self-registration and coordinator heartbeat loop
- TaskScheduler acquires/releases cf-orch VRAM lease per batch worker
- shutdown() now joins batch worker threads for clean teardown
- 94 tests passing
2026-04-01 11:21:38 -07:00
aa51794f45 fix(scheduler): join batch worker threads in shutdown()
Previously shutdown() only joined the scheduler loop thread. Batch
worker threads (which decrement _reserved_vram in their finally block)
could still be running when shutdown returned, leaving stale VRAM
accounting. Now snapshots active workers under lock and joins them all.

Snapshot-then-join pattern avoids holding the lock across blocking join
calls (which would deadlock since workers acquire the same lock on exit).
2026-04-01 11:21:30 -07:00
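
Note on aa51794f45: a minimal sketch of the snapshot-then-join pattern. The class below is a toy, not the TaskScheduler; the attribute and method names other than shutdown() are illustrative, and the sleep stands in for real batch work.

```python
import threading
import time


class ToyScheduler:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._workers = set()
        self.reserved_vram = 0

    def _batch_worker(self, vram: int) -> None:
        try:
            time.sleep(0.01)  # placeholder for running the batch
        finally:
            # Workers take the lock on exit, which is why shutdown()
            # must not hold it while joining.
            with self._lock:
                self.reserved_vram -= vram
                self._workers.discard(threading.current_thread())

    def start_batch(self, vram: int) -> None:
        worker = threading.Thread(target=self._batch_worker, args=(vram,))
        with self._lock:
            self.reserved_vram += vram
            self._workers.add(worker)
        worker.start()

    def shutdown(self) -> None:
        with self._lock:
            workers = list(self._workers)  # snapshot under the lock
        for worker in workers:
            worker.join()  # join with the lock released
```

If shutdown() joined while still holding the lock, each worker would block in its finally clause waiting for that lock and the join would never return.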
6b8e421eb2 feat(scheduler): acquire/release cf-orch VRAM lease per batch worker
Before running a batch of tasks, the scheduler now requests a VRAM lease
from the cf-orch coordinator (POST /api/leases). The lease is held for the
full batch and released in the finally block so it's always cleaned up even
on error. Falls back gracefully if the coordinator is unreachable.

Adds coordinator_url and service_name params to TaskScheduler.__init__
and get_scheduler() so callers can override the default localhost:7700.
2026-04-01 11:06:16 -07:00
67701f0d29 feat(orch): agent self-registration and coordinator heartbeat loop
coordinator/app.py:
- Add POST /api/nodes — agents POST {node_id, agent_url} to self-register;
  coordinator immediately polls the new agent for GPU info
- Add lifespan context manager that starts/stops AgentSupervisor heartbeat
  loop (previously the loop was never started)

cli.py start:
- Add --node-id flag (default 'local')
- Pre-register the local agent URL (http://127.0.0.1:{agent_port}) so the
  heartbeat loop can poll it immediately on startup
- Drop redundant lease_manager.register_gpu() call — supervisor.poll_agent()
  now does this via the heartbeat after the agent responds

cli.py agent:
- Add --advertise-host flag for NATted/multi-homed nodes
- Fire registration POST to coordinator in a daemon thread (2s delay) so
  uvicorn.run() can start binding immediately; no double uvicorn.run()
2026-03-31 19:20:35 -07:00
4596aad290 Merge pull request 'feat(dashboard): self-hosted coordinator dashboard at GET /' (#3) from feature/orch-dashboard into main 2026-03-31 18:59:46 -07:00
7aa0ad7a51 feat(dashboard): add self-hosted coordinator dashboard at GET /
- dashboard.html: node-centric layout — GPU cards with VRAM bars and
  sparklines, active leases table with TTL progress bars, service health
  pill, auto-refreshes every 5s via fetch() against the local JSON API
- All dynamic content set via DOM textContent / createElementNS — no
  innerHTML with user-sourced strings
- coordinator/app.py: serves dashboard.html at GET / (HTMLResponse,
  excluded from OpenAPI schema); HTML read at import time from package dir
- test_dashboard_serves_html: verifies 200, content-type text/html,
  and key route markers present
2026-03-31 18:57:25 -07:00
563b73ce85 Merge pull request 'feat(tasks): shared VRAM-aware LLM task scheduler' (#2) from feature/shared-task-scheduler into main 2026-03-31 10:45:21 -07:00
99f4e95018 Merge pull request 'feat(resources): cf-orch GPU VRAM orchestration — Plan A core' (#1) from feature/cforch-core-orchestration into main 2026-03-31 10:43:52 -07:00
c027fe6137 fix(core): SQLite timeout=30, INSERT OR IGNORE migrations, parameterize tier unlockables
- get_connection(): add timeout=30 to both sqlite3 and pysqlcipher3 paths so
  concurrent writers retry instead of immediately raising OperationalError
- run_migrations(): INSERT OR IGNORE so two Store() calls racing on first boot
  don't hit a UNIQUE constraint on the migrations table
- can_use() / tier_label(): accept _byok_unlockable and _local_vision_unlockable
  overrides so products pass their own frozensets rather than sharing module-level
  constants (required for circuitforge-core to serve multiple products cleanly)
2026-03-31 10:37:51 -07:00
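
Note on c027fe6137: both SQLite changes can be shown with the standard library. The timeout=30 argument and the INSERT OR IGNORE statement come from the commit message; the table schema and the record_migration() helper are hypothetical.

```python
import sqlite3


def get_connection(path: str) -> sqlite3.Connection:
    # timeout is how long sqlite3 waits on a locked database before raising
    # OperationalError.
    return sqlite3.connect(path, timeout=30)


def record_migration(conn: sqlite3.Connection, name: str) -> bool:
    """Record a migration once; return True only if this call inserted the row."""
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (name TEXT PRIMARY KEY)")
    # OR IGNORE turns a uniqueness conflict into a no-op instead of an error,
    # so two racing callers cannot fail on the same migration name.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO migrations (name) VALUES (?)", (name,)
    )
    conn.commit()
    return cursor.rowcount == 1
```

The boolean return value lets the caller that lost the race skip work the winner has already recorded.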
22bad8590a fix(tasks): fix VRAM accounting race, lock scope, type annotations
- C1: Remove _reserved_vram decrement from _scheduler_loop reaper; sole
  responsibility now belongs to _batch_worker's finally block, eliminating
  the double-decrement race that could drive _reserved_vram negative.
- C2: Move TaskScheduler construction (including VRAM detection httpx call)
  outside _scheduler_lock in get_scheduler(); lock is now only held for the
  final singleton assignment, preventing 2s lock contention on first call.
- I1: Add RunTaskFn type alias (Callable[...]) and use it in __init__ and
  get_scheduler() instead of bare Callable.
- I2: Replace namedtuple TaskSpec with typed NamedTuple class.
- I3: Parameterize _queues annotation as dict[str, deque[TaskSpec]].
- I4: Wrap _queues read in start() with self._lock.
- I5: Replace time.sleep() ordering assertion in test_vram_budget_blocks_second_type
  with event-based synchronization using type_a_started/type_b_started events.
- M2: Use sqlite3.connect() as context manager in _load_queued_tasks.
- M3: Strengthen weak assertion in test_enqueue_returns_false_when_queue_full.
- M4: Add test_reserved_vram_zero_after_task_completes to catch C1 regression.
2026-03-31 09:15:09 -07:00
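
Note on 22bad8590a, item C2: building the object outside the lock and holding the lock only for the assignment might look like the sketch below. The factory parameter and module-level names are illustrative; the real get_scheduler() signature is not shown in the log.

```python
import threading
from typing import Callable, Optional

_scheduler_lock = threading.Lock()
_scheduler: Optional[object] = None


def get_scheduler(factory: Callable[[], object]) -> object:
    """Return the singleton, constructing it outside the lock on first use."""
    global _scheduler
    if _scheduler is not None:
        return _scheduler
    # Slow work (for example a VRAM-detection HTTP call) happens here,
    # with no lock held.
    candidate = factory()
    with _scheduler_lock:
        # Another thread may have won the race; keep its instance if so.
        if _scheduler is None:
            _scheduler = candidate
        return _scheduler
```

The trade-off in this sketch is that two racing first callers may each build a candidate and one is discarded, in exchange for never blocking other threads during construction.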
09a5087c72 test(tasks): add preflight fallback coverage to scheduler tests
Adds test_detect_vram_preflight_fallback to cover the spec path where
cf-orch is unreachable but scripts.preflight.get_gpus() succeeds,
verifying detect_available_vram_gb() returns the summed total VRAM.
Uses sys.modules injection to simulate the preflight module being present.
2026-03-30 23:15:19 -07:00
5801928f8e feat(tasks): add shared VRAM-aware LLM task scheduler
Extract generic batch scheduler into circuitforge_core.tasks.scheduler
so any CircuitForge product can use it. Includes VRAM detection via
cf-orch coordinator (cooperative free-VRAM), preflight fallback, and
unlimited fallback; singleton API; full test coverage (12 tests).
2026-03-30 23:12:23 -07:00
db4e3047fd fix(resources): address code review findings from final review
- eviction_engine: replace deprecated asyncio.get_event_loop() with
  get_running_loop() (Python 3.12 compatibility)
- eviction_engine: remove unused httpx import
- coordinator app: return 422 for unknown node_id instead of silently
  falling back to hardcoded localhost URL
- eviction_executor: guard against pid <= 0 to prevent accidental
  SIGTERM to process group
- pyproject.toml: move pytest-asyncio to [dev] extras, not [orch]
- profile_registry: document CPU profile exclusion from list_public()
2026-03-30 22:46:07 -07:00
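
Note on db4e3047fd: the pid guard matters because POSIX kill() treats non-positive pids as group or broadcast targets (0 means the caller's own process group). A minimal sketch of such a guard follows; the function name and the boolean return convention are hypothetical, not the eviction_executor API.

```python
import os
import signal


def safe_terminate(pid: int) -> bool:
    """Send SIGTERM to a single process; refuse pids that would target a group."""
    if pid <= 0:
        # pid 0 would signal the caller's own process group, and negative
        # values address process groups or every process the caller may signal.
        return False
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # the process has already exited
    return True
```

Rejecting the value up front keeps a stale or zeroed pid field from taking down the coordinator along with its children.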
d755e9ea2c test(resources): add integration tests for full lease/eviction cycle 2026-03-30 22:37:06 -07:00
1f296c0cdb feat(resources): add [orch] package extras, cf-orch entry point, Docker compose 2026-03-30 22:34:40 -07:00
5fb3a2b41e style(resources): apply Black formatting to cli.py 2026-03-30 22:33:38 -07:00
70017abd35 feat(resources): add cf-orch CLI with start, agent, status, install-service commands 2026-03-30 22:27:11 -07:00
dba49a47fe refactor(resources): rename cforch → cf-orch in FastAPI titles 2026-03-30 22:22:48 -07:00
4bcd297b18 feat(resources): add cforch-coordinator FastAPI app with lease/node/profile endpoints 2026-03-30 22:01:46 -07:00
cede761d82 feat(resources): add AgentSupervisor and EvictionEngine 2026-03-30 21:44:42 -07:00
7718911652 feat(resources): add cforch-agent FastAPI app with /health /gpu-info /evict 2026-03-30 20:51:08 -07:00
4a857d5339 feat(resources): add EvictionExecutor with SIGTERM/grace/SIGKILL sequence 2026-03-30 20:46:45 -07:00
a79fd10f45 fix(resources): patch subprocess at import site in gpu_monitor tests 2026-03-30 20:45:01 -07:00
3dcbe801f1 feat(resources): add GpuMonitor for nvidia-smi polling 2026-03-30 20:42:57 -07:00