feat(orch): health probe loop + VRAM pre-flight fix #12

Merged
pyr0ball merged 4 commits from feature/orch-llm-server into main 2026-04-02 17:24:10 -07:00

Summary

  • #10 — Background health probe loop: coordinator now polls starting instances every 5 s via GET /health, transitions to running on 200 or stopped after 300 s timeout
  • #11 / VRAM — Tighten VRAM pre-flight to require full service_max_mb free (was // 2); fixes instances starting when VRAM is insufficient

Commits

  • a7290c1 feat(orch): background health probe loop — starting → running transition
  • bd13285 fix(orch): tighten VRAM pre-flight to require full max_mb free (not half)

Closes #10

pyr0ball added 4 commits 2026-04-02 17:24:02 -07:00
- Add circuitforge_core/resources/inference/llm_server.py: generic OpenAI-compatible
  FastAPI server for any HuggingFace causal LM (Phi-4-mini-instruct, Qwen2.5-3B-Instruct)
- Add service_manager.py + service_probe.py: ProcessSpec start/stop/is_running support
  (Popen-based; socket probe confirms readiness before marking running)
- Update all 4 public GPU profiles to use ProcessSpec→llm_server instead of Docker vllm:
  6gb (max_mb 5500), 8gb (max_mb 6500), 16gb/24gb (max_mb 9000)
- Model candidates: Phi-4-mini-instruct first (7.2GB), Qwen2.5-3B-Instruct fallback (5.8GB)
- Remove ouro_server.py (Ouro incompatible with transformers 5.x; vllm Docker also incompatible)
- Add 17 tests for ServiceManager ProcessSpec (start/stop/is_running/list/get_url)
- apply_chat_template() returns BatchEncoding in transformers 5.x (not bare tensor);
  extract .input_ids explicitly with fallback for 4.x compat
- Switch from deprecated torch_dtype= to dtype= in from_pretrained()
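The BatchEncoding fix above can be sketched as a small compat shim. This is illustrative, not the project's actual code; the helper name and the stand-in class are assumptions, and no transformers dependency is needed to show the two shapes:

```python
# Hedged sketch: in transformers 5.x, apply_chat_template(..., return_tensors="pt")
# returns a BatchEncoding, while 4.x returned a bare tensor. The helper name
# `extract_input_ids` and the stand-in class below are illustrative only.

def extract_input_ids(encoded):
    """Return token ids whether `encoded` is a BatchEncoding or a bare tensor."""
    return encoded.input_ids if hasattr(encoded, "input_ids") else encoded

class FakeBatchEncoding:
    """Stand-in for the 5.x return shape (has an .input_ids attribute)."""
    def __init__(self, ids):
        self.input_ids = ids

assert extract_input_ids(FakeBatchEncoding([1, 2, 3])) == [1, 2, 3]  # 5.x shape
assert extract_input_ids([1, 2, 3]) == [1, 2, 3]                     # 4.x shape
```

The same call site then works against both transformers major versions without branching on the library version string.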
The max_mb // 2 threshold was too loose: Qwen2.5-3B needs ~5.9 GB on an
8 GB card, but the check only required 3.25 GB free, so a load attempt
was still dispatched while Ollama held 4.5 GB, causing an OOM crash.

- node_selector: can_fit = free_mb >= service_max_mb (was // 2)
- coordinator /start: same threshold fix + updated error message
- tests: two new node_selector tests pin the full-ceiling semantics;
  updated stale docstring in coordinator app test
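The full-ceiling semantics can be sketched as follows. The function and parameter names are illustrative, not necessarily node_selector's real API; the numbers mirror the 8 GB-profile scenario above:

```python
# Hedged sketch of the tightened VRAM pre-flight check; names are assumptions.

def can_fit(free_mb: int, service_max_mb: int) -> bool:
    # Old (too loose): free_mb >= service_max_mb // 2
    # New: require the full service ceiling free before dispatching a load.
    return free_mb >= service_max_mb

# 8 GB profile (max_mb 6500) with Ollama holding 4.5 GB, i.e. ~3500 MB free:
assert not can_fit(3500, 6500)   # new check refuses the load
assert 3500 >= 6500 // 2         # old half-ceiling check would have passed
```

Pinning both assertions in the node_selector tests is what prevents the threshold from silently regressing to the half-ceiling behavior.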
Coordinator now polls all 'starting' instances every 5 s via GET /health.
On 200: state → running. After 300 s without a healthy response: state →
stopped. Closes #10.
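The probe loop's transition rule can be sketched as below. The coordinator's real instance model and HTTP client are assumptions here; only the 5 s cadence, the GET /health check, the 200 → running transition, and the 300 s → stopped timeout come from the change itself:

```python
# Hedged sketch of the starting -> running/stopped probe loop; field names
# (inst.state, inst.url, inst.started_at) are illustrative assumptions.
import time
import urllib.request

PROBE_INTERVAL_S = 5      # poll cadence for `starting` instances
STARTUP_TIMEOUT_S = 300   # give up after 300 s without a healthy response

def next_state(current: str, healthy: bool, elapsed_s: float) -> str:
    """Pure transition rule: only `starting` instances change state."""
    if current != "starting":
        return current
    if healthy:
        return "running"
    if elapsed_s > STARTUP_TIMEOUT_S:
        return "stopped"
    return "starting"

def is_healthy(base_url: str) -> bool:
    """GET /health; connection errors or non-200 count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def probe_loop(instances):
    while True:
        now = time.monotonic()
        for inst in instances:
            inst.state = next_state(
                inst.state, is_healthy(inst.url), now - inst.started_at
            )
        time.sleep(PROBE_INTERVAL_S)
```

Keeping the transition rule pure makes the 200-vs-timeout semantics trivially unit-testable without spinning up an HTTP server.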
pyr0ball merged commit 749e51ccca into main 2026-04-02 17:24:10 -07:00
Reference: Circuit-Forge/circuitforge-core#12