feat(orch): health probe loop + VRAM pre-flight fix #12

Merged
pyr0ball merged 4 commits from feature/orch-llm-server into main 2026-04-02 17:24:10 -07:00

Summary

  • #10 — Background health probe loop: coordinator now polls starting instances every 5 s via GET /health, transitions to running on 200 or stopped after 300 s timeout
  • #11 / VRAM — Tighten VRAM pre-flight to require full service_max_mb free (was // 2); fixes instances starting when VRAM is insufficient

Commits

  • a7290c1 feat(orch): background health probe loop — starting → running transition
  • bd13285 fix(orch): tighten VRAM pre-flight to require full max_mb free (not half)

Closes #10

pyr0ball added 4 commits 2026-04-02 17:24:02 -07:00
- Add circuitforge_core/resources/inference/llm_server.py: generic OpenAI-compatible
  FastAPI server for any HuggingFace causal LM (Phi-4-mini-instruct, Qwen2.5-3B-Instruct)
- Add service_manager.py + service_probe.py: ProcessSpec start/stop/is_running support
  (Popen-based; socket probe confirms readiness before marking running)
- Update all 4 public GPU profiles to use ProcessSpec→llm_server instead of Docker vllm:
  6gb (max_mb 5500), 8gb (max_mb 6500), 16gb/24gb (max_mb 9000)
- Model candidates: Phi-4-mini-instruct first (7.2GB), Qwen2.5-3B-Instruct fallback (5.8GB)
- Remove ouro_server.py (Ouro incompatible with transformers 5.x; vllm Docker also incompatible)
- Add 17 tests for ServiceManager ProcessSpec (start/stop/is_running/list/get_url)
- apply_chat_template() returns BatchEncoding in transformers 5.x (not bare tensor);
  extract .input_ids explicitly with fallback for 4.x compat
- Switch from deprecated torch_dtype= to dtype= in from_pretrained()
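The BatchEncoding fix above can be sketched as a small compat shim. This is illustrative, not the project's actual code; the helper name and the stand-in class are assumptions, and no transformers dependency is needed to show the two shapes:

```python
# Hedged sketch: in transformers 5.x, apply_chat_template(..., return_tensors="pt")
# returns a BatchEncoding, while 4.x returned a bare tensor. The helper name
# `extract_input_ids` and the stand-in class below are illustrative only.

def extract_input_ids(encoded):
    """Return token ids whether `encoded` is a BatchEncoding or a bare tensor."""
    return encoded.input_ids if hasattr(encoded, "input_ids") else encoded

class FakeBatchEncoding:
    """Stand-in for the 5.x return shape (has an .input_ids attribute)."""
    def __init__(self, ids):
        self.input_ids = ids

assert extract_input_ids(FakeBatchEncoding([1, 2, 3])) == [1, 2, 3]  # 5.x shape
assert extract_input_ids([1, 2, 3]) == [1, 2, 3]                     # 4.x shape
```

The same call site then works against both transformers major versions without branching on the library version string.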
The max_mb // 2 threshold was too loose: Qwen2.5-3B needs ~5.9 GB on an
8 GB card, but the check only required 3.25 GB free, so a load attempt
was still dispatched while Ollama held 4.5 GB, causing an OOM crash.

- node_selector: can_fit = free_mb >= service_max_mb (was // 2)
- coordinator /start: same threshold fix + updated error message
- tests: two new node_selector tests pin the full-ceiling semantics;
  updated stale docstring in coordinator app test
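The full-ceiling semantics can be sketched as follows. The function and parameter names are illustrative, not necessarily node_selector's real API; the numbers mirror the 8 GB-profile scenario above:

```python
# Hedged sketch of the tightened VRAM pre-flight check; names are assumptions.

def can_fit(free_mb: int, service_max_mb: int) -> bool:
    # Old (too loose): free_mb >= service_max_mb // 2
    # New: require the full service ceiling free before dispatching a load.
    return free_mb >= service_max_mb

# 8 GB profile (max_mb 6500) with Ollama holding 4.5 GB, i.e. ~3500 MB free:
assert not can_fit(3500, 6500)   # new check refuses the load
assert 3500 >= 6500 // 2         # old half-ceiling check would have passed
```

Pinning both assertions in the node_selector tests is what prevents the threshold from silently regressing to the half-ceiling behavior.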
Coordinator now polls all 'starting' instances every 5 s via GET /health.
On 200: state → running. After 300 s without a healthy response: state →
stopped. Closes #10.
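The probe loop's transition rule can be sketched as below. The coordinator's real instance model and HTTP client are assumptions here; only the 5 s cadence, the GET /health check, the 200 → running transition, and the 300 s → stopped timeout come from the change itself:

```python
# Hedged sketch of the starting -> running/stopped probe loop; field names
# (inst.state, inst.url, inst.started_at) are illustrative assumptions.
import time
import urllib.request

PROBE_INTERVAL_S = 5      # poll cadence for `starting` instances
STARTUP_TIMEOUT_S = 300   # give up after 300 s without a healthy response

def next_state(current: str, healthy: bool, elapsed_s: float) -> str:
    """Pure transition rule: only `starting` instances change state."""
    if current != "starting":
        return current
    if healthy:
        return "running"
    if elapsed_s > STARTUP_TIMEOUT_S:
        return "stopped"
    return "starting"

def is_healthy(base_url: str) -> bool:
    """GET /health; connection errors or non-200 count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def probe_loop(instances):
    while True:
        now = time.monotonic()
        for inst in instances:
            inst.state = next_state(
                inst.state, is_healthy(inst.url), now - inst.started_at
            )
        time.sleep(PROBE_INTERVAL_S)
```

Keeping the transition rule pure makes the 200-vs-timeout semantics trivially unit-testable without spinning up an HTTP server.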
pyr0ball merged commit 749e51ccca into main 2026-04-02 17:24:10 -07:00
Reference: Circuit-Forge/circuitforge-core#12