LLMRouter env-var auto-config:
- No llm.yaml required — auto-configures from ANTHROPIC_API_KEY,
OPENAI_API_KEY, or OLLAMA_HOST on first use
- Bare-metal self-hosters can run any CF product with just env vars
- Raises FileNotFoundError with an actionable message only when no env
vars are set either
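A minimal sketch of that fallback order (helper name and config shape are
illustrative, not the real LLMRouter internals):

    import os

    def autoconfig_from_env() -> dict:
        """Pick a provider from env vars when no llm.yaml exists; first match wins."""
        if os.environ.get("ANTHROPIC_API_KEY"):
            return {"provider": "anthropic", "api_key": os.environ["ANTHROPIC_API_KEY"]}
        if os.environ.get("OPENAI_API_KEY"):
            return {"provider": "openai", "api_key": os.environ["OPENAI_API_KEY"]}
        if os.environ.get("OLLAMA_HOST"):
            return {"provider": "ollama", "host": os.environ["OLLAMA_HOST"]}
        raise FileNotFoundError(
            "No llm.yaml found and none of ANTHROPIC_API_KEY, OPENAI_API_KEY, "
            "OLLAMA_HOST are set; create llm.yaml or export one of them."
        )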
CFOrchClient auth:
- Reads CF_LICENSE_KEY env var (or explicit api_key param)
- Sends Authorization: Bearer <key> on all allocation/release requests
- Required for the hosted public coordinator; no-op for local deployments
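Roughly, the auth wiring (a sketch; the real constructor may differ):

    import os
    import httpx

    class CFOrchClient:
        def __init__(self, base_url: str, api_key: str | None = None):
            # Explicit api_key wins; otherwise fall back to CF_LICENSE_KEY.
            key = api_key or os.environ.get("CF_LICENSE_KEY")
            # Local deployments without a key just send no Authorization header.
            headers = {"Authorization": f"Bearer {key}"} if key else {}
            self._http = httpx.Client(base_url=base_url, headers=headers)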
HeimdallAuthMiddleware (new):
- FastAPI middleware for cf-orch coordinator
- Enabled by HEIMDALL_URL env var; self-hosted deployments skip it
- 5-min TTL cache (matching Kiwi cloud session) keeps Heimdall off the
per-allocation hot path
- /api/health exempt; free-tier keys rejected with 403 + reason
- 13 tests covering cache TTL, tier ranking, and middleware gating
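A sketch of the middleware shape under those constraints; verify_key() stands
in for the real Heimdall round trip, and the cache layout is assumed:

    import time
    from starlette.middleware.base import BaseHTTPMiddleware
    from starlette.responses import JSONResponse

    TTL_S = 300  # 5 min, matching the Kiwi cloud session

    async def verify_key(key: str) -> str:
        ...  # placeholder: ask Heimdall (HEIMDALL_URL) for the key's tier

    class HeimdallAuthMiddleware(BaseHTTPMiddleware):
        def __init__(self, app):
            super().__init__(app)
            self._cache: dict[str, tuple[float, str]] = {}  # key -> (expiry, tier)

        async def dispatch(self, request, call_next):
            if request.url.path == "/api/health":    # health stays unauthenticated
                return await call_next(request)
            key = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
            now = time.monotonic()
            hit = self._cache.get(key)
            if hit is None or hit[0] < now:          # miss or expired: one Heimdall call
                tier = await verify_key(key)
                self._cache[key] = (now + TTL_S, tier)
            else:
                tier = hit[1]
            if tier == "free":
                return JSONResponse({"detail": "free tier: GPU allocation not included"},
                                    status_code=403)
            return await call_next(request)

Wiring stays conditional: add_middleware() only runs when HEIMDALL_URL is set,
so self-hosted deployments never touch it.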
- ProcessSpec: adopt (bool) and health_path (str, default /health) fields
- ServiceManager: adopt=True probes health_path before spawning; is_running()
uses health probe for adopt services rather than proc table + socket check
- _probe_health() helper: urllib GET on localhost:port+path, returns bool (sketch below)
- Agent /services/{service}/start: returns adopted=True when service was
already running; coordinator sets state=running immediately (no probe wait)
- ServiceInstance: health_path field (default /health)
- service_registry.upsert_instance(): health_path kwarg
- Probe loop uses inst.health_path instead of hardcoded /health
- coordinator allocate_service: looks up health_path from profile spec via
_get_health_path() and stores on ServiceInstance
- All GPU profiles (2/4/6/8/16/24 GB + cpu-16/32): ollama managed block
with adopt=true, health_path=/api/tags, port 11434
- 11 new tests
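A sketch of the probe helper (the exact signature is assumed):

    import urllib.request

    def _probe_health(port: int, path: str = "/health", timeout: float = 2.0) -> bool:
        """True if a local service answers its health endpoint with a 2xx."""
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}",
                                        timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except OSError:      # connection refused, timeout, HTTP error, ...
            return False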
Closes #15
- NodeStore: SQLite persistence for known agent nodes
(~/.local/share/circuitforge/cf-orch-nodes.db)
- upsert on every register(); prune_stale() for 30-day cleanup
- survives coordinator restarts — data readable by next process
- AgentSupervisor.restore_from_store(): reload known nodes on startup,
mark all offline; heartbeat loop brings back any that respond
- AgentSupervisor.register(): persists to NodeStore on every call
- cli.py coordinator: NodeStore wired in; restore_from_store() called
before uvicorn starts
- cli.py agent: one-shot registration replaced with persistent reconnect
loop (daemon thread, 30 s interval) — coordinator restart → nodes
reappear within one cycle with no manual intervention on agent hosts
- 16 new tests: NodeStore (8) + AgentSupervisor watchdog (8)
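Sketch of the store's core (method names per the notes above; column set assumed):

    import sqlite3
    import time
    from pathlib import Path

    DB = Path.home() / ".local/share/circuitforge/cf-orch-nodes.db"

    class NodeStore:
        def __init__(self, path: Path = DB):
            path.parent.mkdir(parents=True, exist_ok=True)
            self._db = sqlite3.connect(str(path))
            self._db.execute("CREATE TABLE IF NOT EXISTS nodes ("
                             "node_id TEXT PRIMARY KEY, agent_url TEXT, last_seen REAL)")

        def upsert(self, node_id: str, agent_url: str) -> None:
            # Called from register(): one row per node, refreshed every time.
            self._db.execute(
                "INSERT INTO nodes VALUES (?, ?, ?) ON CONFLICT(node_id) DO UPDATE "
                "SET agent_url = excluded.agent_url, last_seen = excluded.last_seen",
                (node_id, agent_url, time.time()))
            self._db.commit()

        def prune_stale(self, max_age_s: float = 30 * 86400) -> None:
            # 30-day cleanup of nodes that never came back.
            self._db.execute("DELETE FROM nodes WHERE last_seen < ?",
                             (time.time() - max_age_s,))
            self._db.commit()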
Coordinator now polls all 'starting' instances every 5 s via GET /health.
On 200: state → running. After 300 s without a healthy response: state →
stopped. Closes #10.
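Shape of that loop (registry accessors and the started_at stamp are assumptions):

    import asyncio
    import time
    import httpx

    POLL_S, TIMEOUT_S = 5, 300

    async def poll_starting(registry) -> None:
        async with httpx.AsyncClient() as http:
            while True:
                now = time.monotonic()
                for inst in registry.instances(state="starting"):
                    try:
                        r = await http.get(f"http://{inst.host}:{inst.port}/health",
                                           timeout=3)
                        if r.status_code == 200:
                            inst.state = "running"      # healthy: promote
                            continue
                    except httpx.HTTPError:
                        pass                            # not up yet
                    if now - inst.started_at > TIMEOUT_S:
                        inst.state = "stopped"          # gave up after 300 s
                await asyncio.sleep(POLL_S)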
max_mb // 2 was too loose — Qwen2.5-3B needs ~5.9 GB on an 8 GB card
but the threshold only required 3.25 GB free, allowing Ollama to hold
4.5 GB while a load attempt was still dispatched (causing an OOM crash).
- node_selector: can_fit = free_mb >= service_max_mb (was // 2)
- coordinator /start: same threshold fix + updated error message
- tests: two new node_selector tests pin the full-ceiling semantics;
updated stale docstring in coordinator app test
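Concretely, with this incident's numbers (the 6.5 GB ceiling is inferred from
the 3.25 GB figure):

    service_max_mb = 6_500   # profile ceiling; Qwen2.5-3B peaks at ~5.9 GB
    free_mb = 3_500          # 8 GB card with Ollama holding 4.5 GB

    free_mb >= service_max_mb // 2   # old check: 3500 >= 3250 -> True, OOM
    free_mb >= service_max_mb        # new check: 3500 >= 6500 -> False, rejected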
- apply_chat_template() returns BatchEncoding in transformers 5.x (not bare tensor);
extract .input_ids explicitly with fallback for 4.x compat
- Switch from deprecated torch_dtype= to dtype= in from_pretrained()
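The shim, roughly (checkpoint name is illustrative):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    messages = [{"role": "user", "content": "hello"}]

    # 5.x hands back a BatchEncoding; 4.x may hand back a bare tensor.
    encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                            return_tensors="pt")
    input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded

    # dtype= replaces the deprecated torch_dtype= kwarg.
    model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)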
- ServiceRegistry: add sweep_expired_allocations() to remove stale TTL
allocations and transition instances to idle; add get_allocation() helper
- AgentSupervisor._run_idle_sweep: call sweep_expired_allocations() before
idle-timeout check so crashed-caller leaks are cleaned up each sweep tick
- schema._parse_managed: copy raw dict before extracting 'type' key instead
of mutating caller's dict with pop()
- app.release_allocation: validate allocation belongs to the given service
path param before releasing; return 404 if mismatch
- router._try_cf_orch_alloc: replace print() with logger.warning(); add
module-level logger = logging.getLogger(__name__)
- tests: add test_sweep_expired_allocations covering TTL expiry and idle
state transition
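Sketch of the sweep's semantics (the allocation/instance shapes are stand-ins
for the registry's real models):

    import time
    from dataclasses import dataclass

    @dataclass
    class Alloc:
        id: str
        instance_id: str
        expires_at: float    # monotonic deadline set from the TTL

    def sweep_expired_allocations(allocations: dict, instances: dict) -> list[str]:
        now = time.monotonic()
        expired = [a for a in allocations.values() if a.expires_at <= now]
        for a in expired:
            del allocations[a.id]
        live = {a.instance_id for a in allocations.values()}
        for a in expired:
            inst = instances.get(a.instance_id)
            if inst is not None and a.instance_id not in live:
                inst.state = "idle"   # nothing pins it anymore
        return [a.id for a in expired]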
- CRITICAL: idle sweep now calls mark_stopped() after successful HTTP stop,
preventing repeated stop POSTs on every 3rd tick for the same instance
- CRITICAL: active_allocations() now filters by gpu_id to avoid marking wrong
instance idle on multi-GPU nodes when an allocation is released
- CRITICAL: VRAM pre-flight guard in ensure_service was dead code — added the
actual HTTPException(503) before the candidate loop
- IMPORTANT: register() now updates agent_url on re-registration if it changed,
so relocated agents are tracked correctly
- IMPORTANT: updated test_service_registry.py callers of active_allocations()
to pass the now-required gpu_id argument
Add idle_stop_after_s to ServiceProfile (default 0 = never stop).
Set 600 s (10 min) timeout on vllm slot in all single-GPU profiles.
Backward compatible; non-vllm services inherit default 0 (no auto-stop).
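In dataclass terms (the real ServiceProfile carries more fields):

    from dataclasses import dataclass

    @dataclass
    class ServiceProfile:
        name: str
        idle_stop_after_s: int = 0    # 0 = never auto-stop (the compatible default)

    vllm = ServiceProfile("vllm", idle_stop_after_s=600)   # single-GPU profiles
    ollama = ServiceProfile("ollama")                      # inherits 0: left running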
Implements CFOrchClient with allocate() (sync contextmanager) and
allocate_async() (async contextmanager) for cf-orch GPU resource
allocation. Releases allocation on exit; ignores 404 on release;
raises RuntimeError on non-2xx allocation response. Exports
CFOrchClient and Allocation from circuitforge_core.resources.
Note: the async test uses unittest.mock rather than httpretty, since
httpretty only patches stdlib sockets and does not intercept httpx's
async (anyio) transport.
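Typical call site (the constructor argument and service name are assumptions;
the import path is per this change, the port per the scheduler default
elsewhere in this log):

    from circuitforge_core.resources import CFOrchClient

    client = CFOrchClient("http://127.0.0.1:7700")

    # Released on exit even if the body raises; a 404 on release (already
    # swept) is ignored; a non-2xx allocation response raises RuntimeError.
    with client.allocate("vllm") as alloc:
        print(alloc)    # stand-in for the caller's real work

    # Async twin:
    #   async with client.allocate_async("vllm") as alloc: ...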
Previously shutdown() only joined the scheduler loop thread. Batch
worker threads (which decrement _reserved_vram in their finally block)
could still be running when shutdown returned, leaving stale VRAM
accounting. Now snapshots active workers under lock and joins them all.
Snapshot-then-join pattern avoids holding the lock across blocking join
calls (which would deadlock since workers acquire the same lock on exit).
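The pattern, reduced to a skeleton (attribute names assumed):

    import threading

    class TaskScheduler:            # only the shutdown path is sketched
        def __init__(self):
            self._lock = threading.Lock()
            self._stop = threading.Event()
            self._active_workers: list[threading.Thread] = []
            self._loop = threading.Thread(target=self._stop.wait, daemon=True)
            self._loop.start()

        def shutdown(self) -> None:
            self._stop.set()
            self._loop.join()       # 1. scheduler loop, as before
            # 2. snapshot under the lock, join outside it: workers take this
            # same lock in their finally block to decrement reserved VRAM,
            # so holding it across join() would deadlock.
            with self._lock:
                workers = list(self._active_workers)
            for t in workers:
                t.join()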
Before running a batch of tasks, the scheduler now requests a VRAM lease
from the cf-orch coordinator (POST /api/leases). The lease is held for the
full batch and released in the finally block so it's always cleaned up even
on error. Falls back gracefully if the coordinator is unreachable.
Adds coordinator_url and service_name params to TaskScheduler.__init__
and get_scheduler() so callers can override the default localhost:7700.
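Sketch of the lease bracketing (payload shape, lease_id field, and
DELETE-to-release are assumptions):

    import httpx

    def run_batch(coordinator_url: str, service_name: str, tasks, run_task) -> None:
        lease_id = None
        try:
            try:
                r = httpx.post(f"{coordinator_url}/api/leases",
                               json={"service": service_name}, timeout=5)
                r.raise_for_status()
                lease_id = r.json()["lease_id"]
            except httpx.HTTPError:
                pass                    # coordinator unreachable: run unleased
            for task in tasks:
                run_task(task)
        finally:
            if lease_id is not None:    # always released, even mid-batch failure
                httpx.delete(f"{coordinator_url}/api/leases/{lease_id}", timeout=5)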
coordinator/app.py:
- Add POST /api/nodes — agents POST {node_id, agent_url} to self-register;
coordinator immediately polls the new agent for GPU info
- Add lifespan context manager that starts/stops AgentSupervisor heartbeat
loop (previously the loop was never started)
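The lifespan hookup, sketched with a stub supervisor (the real start/stop API
may differ):

    from contextlib import asynccontextmanager
    from fastapi import FastAPI

    class AgentSupervisor:          # stub standing in for the cf-orch class
        def start(self): ...
        def stop(self): ...

    supervisor = AgentSupervisor()

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        supervisor.start()          # heartbeat loop now starts with the server
        yield
        supervisor.stop()           # and stops cleanly on shutdown

    app = FastAPI(lifespan=lifespan)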
cli.py start:
- Add --node-id flag (default 'local')
- Pre-register the local agent URL (http://127.0.0.1:{agent_port}) so the
heartbeat loop can poll it immediately on startup
- Drop redundant lease_manager.register_gpu() call — supervisor.poll_agent()
now does this via the heartbeat after the agent responds
cli.py agent:
- Add --advertise-host flag for NATted/multi-homed nodes
- Fire registration POST to coordinator in a daemon thread (2 s delay) so
uvicorn.run() can start binding immediately; no double uvicorn.run()
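Sketch of the delayed registration (the /api/nodes payload matches the
coordinator change above; everything else is illustrative):

    import threading
    import httpx

    def register_in_background(coordinator_url: str, node_id: str,
                               agent_url: str) -> None:
        def post():
            try:
                httpx.post(f"{coordinator_url}/api/nodes",
                           json={"node_id": node_id, "agent_url": agent_url},
                           timeout=5)
            except httpx.HTTPError:
                pass                       # best-effort; coordinator can poll later
        t = threading.Timer(2.0, post)     # 2 s delay lets uvicorn bind first
        t.daemon = True
        t.start()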