Adds the benchmark_plans.py script, plans_bench API router, and
PlansBenchTab Vue component, and registers /api/plans-bench in api.py.
Also extends the models registry (cf-text catalog integration), cforch
client, LlmEvalTab, and ModelsView with cf-orch fleet support. Wires
Planning mode into BenchmarkView.
- GET /api/train/jobs now returns {"jobs":[...]} instead of bare array
- GET /api/train/results now returns {"results":[...]} instead of bare array
- POST /api/train/jobs body key renamed config -> config_json to match Pydantic model
- SSE log handler now handles 'progress' event type (backend never emits 'log')
- Dashboard _get_active_jobs() adds model_key to SELECT and return dict
- corrections.py docstring updated: both /api/corrections and /api/sft prefixes noted
- test_train.py assertions updated to unwrap new envelope shapes
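Callers that used to iterate over the bare array now have to unwrap the
envelope first. A minimal sketch, assuming the response body is already
decoded from JSON; `unwrap` is a hypothetical helper and the sample
payloads are illustrative, not captured from the real API:

```python
from typing import Any


def unwrap(payload: Any, key: str) -> list:
    """Return the list stored under `key` in an envelope like {"jobs": [...]}."""
    if not isinstance(payload, dict) or key not in payload:
        raise ValueError(f"expected an object with a {key!r} key")
    items = payload[key]
    if not isinstance(items, list):
        raise ValueError(f"{key!r} must hold a list")
    return items


# Illustrative payloads only; real job and result records carry more fields.
jobs = unwrap({"jobs": [{"id": 1}]}, "jobs")
results = unwrap({"results": []}, "results")
```

Failing loudly on a bare array makes a stale client surface the shape
change instead of silently iterating over dict keys.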
Replace 149-line api.py (with inline helpers, JSONL utilities, and ad-hoc
router registrations) with a 57-line pure factory. All business logic was
already extracted to domain modules in B1-B7; this removes the dead code
and adds the /api/corrections/* prefix alongside the /api/sft/* backward-
compat alias. Smoke tests updated to cover the new /api/corrections/ingest
and /api/dashboard routes.
Adds IngestRequest model and POST /api/sft/ingest route to
app/data/corrections.py. Sibling CF products (Peregrine, Kiwi, etc.)
can push pre-approved corrections via Bearer token auth
(AVOCET_INGESTION_SECRET). Records land as status=approved in both
sft_candidates.jsonl and sft_approved.jsonl immediately.
7 tests in tests/test_data_corrections.py cover 503 (secret unset),
401 (missing/malformed header), 403 (wrong secret), happy-path writes
to both files, and optional label field.
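The status ladder those tests exercise can be sketched without FastAPI.
This is an assumption-laden illustration, not the code in
app/data/corrections.py: `ingest_auth_status` is a hypothetical name,
the 200 success code and the constant-time comparison are assumptions,
and in practice the secret would come from the AVOCET_INGESTION_SECRET
environment variable:

```python
import hmac
from typing import Optional


def ingest_auth_status(authorization: Optional[str], secret: Optional[str]) -> int:
    """Map an Authorization header to the HTTP status the tests describe."""
    if not secret:
        return 503  # ingestion disabled: secret is not configured
    if not authorization:
        return 401  # header missing
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return 401  # malformed header
    # Constant-time comparison (an assumption; the source does not say).
    if not hmac.compare_digest(token.encode(), secret.encode()):
        return 403  # well-formed header, wrong secret
    return 200  # assumed success status
```

Checking the unset secret first means a misconfigured deployment never
reveals whether a presented token would have been valid.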
Replaces the ad-hoc _running_procs dict in api.py with a persistent,
inspectable SQLite job queue. Removes old /api/finetune/* routes and
_best_cuda_device from api.py. Adds /api/train/* routes (list, create,
get, cancel, run SSE, results). 16 new tests all passing.
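For readers unfamiliar with the pattern, a SQLite-backed queue of this
kind might look roughly like the sketch below. The table layout, status
values, and function names are assumptions for illustration; only the
config_json column name is borrowed from the request model mentioned
elsewhere in this log:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    config_json TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the queue database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows become dict-convertible
    conn.execute(SCHEMA)
    return conn


def create_job(conn: sqlite3.Connection, config_json: str) -> int:
    cur = conn.execute("INSERT INTO jobs (config_json) VALUES (?)", (config_json,))
    conn.commit()
    return cur.lastrowid


def list_jobs(conn: sqlite3.Connection) -> list[dict]:
    return [dict(row) for row in conn.execute("SELECT * FROM jobs ORDER BY id")]


def cancel_job(conn: sqlite3.Connection, job_id: int) -> bool:
    """Cancel a job that has not finished; return False if nothing changed."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'cancelled' "
        "WHERE id = ? AND status IN ('queued', 'running')",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Unlike an in-process dict, the table survives a restart and can be
inspected with any SQLite client.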
app/cloud_session.py:
- Thin wrapper around cf_core.cloud_session.CloudSessionFactory
- BYOK detection reads ~/.config/circuitforge/llm.yaml (same path as other products)
- get_session: FastAPI dependency, returns CloudUser (user_id, tier, has_byok)
- require_tier: dependency factory for tier-gated routes
app/imitate.py:
- _run_cftext gains user_id: str | None param; non-None values included in
the cf-orch ServiceAllocateRequest so premium users get their custom models
- run_imitate injects session via Depends(_get_imitate_session); extracts user_id,
filters out local/anon sessions (they get the shared catalog), passes real
cloud user_id to the ThreadPoolExecutor fanout
- _get_imitate_session wraps get_session with a try/except so imitate keeps
working in envs where cloud_session deps aren't installed
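The session handling described above can be sketched as follows. This
is a stand-alone illustration under stated assumptions: the CloudUser
fields come from the notes above, but the "local"/"anon" sentinel ids,
the helper names, and the choice to catch ImportError are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CloudUser:
    user_id: str
    tier: str
    has_byok: bool


# Hypothetical sentinel ids for sessions that are not real cloud accounts.
_SHARED_CATALOG_IDS = {"local", "anon"}


def load_session(resolver: Callable[[], CloudUser]) -> Optional[CloudUser]:
    """Resolve a session, degrading to None when cloud deps are missing."""
    try:
        return resolver()
    except ImportError:  # assumed failure mode for uninstalled dependencies
        return None


def effective_user_id(session: Optional[CloudUser]) -> Optional[str]:
    """Return a user_id only for real cloud users; others share the catalog."""
    if session is None or session.user_id in _SHARED_CATALOG_IDS:
        return None
    return session.user_id
```

Returning None for local and anonymous sessions keeps the allocation
request identical to the pre-change behaviour for those users.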
- BenchmarkView.vue: convert from monolithic view to tabbed shell; each tab is
now its own component (ClassifierTab, CompareTab, LlmEvalTab, StyleTab, VoiceTab)
- StyleTab + VoiceTab: new benchmark modes for style and voice model evaluation
- app/style.py: FastAPI router for style imitation benchmarks
- app/voice.py: FastAPI router for voice benchmark endpoints
- scripts/benchmark_style.py + benchmark_voice.py: headless runner scripts
Backend:
- Run all cf-text model allocations concurrently via ThreadPoolExecutor + as_completed
- Announce model_start events upfront so the UI can show loading states immediately
- Replace timer-based startup polling with coordinator state signals: waits for
state=="running" (success) or state=="stopped" (fail-fast) on the matching
node/gpu instance; falls back to health poll after 6 consecutive probe misses
- Add /api/cforch/catalog endpoint: fetches live cf-text model list from cf-orch,
filtering out proxy entries (ollama://, vllm://, http://) so only loadable models
are returned
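The concurrent fan-out with upfront model_start events can be sketched
like this. Only the model_start event name comes from the notes above;
`fan_out`, `run_one`, and the model_done / model_error event names are
hypothetical, and the coordinator state polling is left out:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterator


def fan_out(models: list[str], run_one: Callable[[str], str]) -> Iterator[dict]:
    """Yield SSE-style events: all starts first, then results as they finish."""
    # Announce every model before any work completes so the UI can
    # render loading states immediately.
    for name in models:
        yield {"event": "model_start", "model": name}
    if not models:
        return
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {pool.submit(run_one, name): name for name in models}
        # as_completed yields in finish order, not submission order.
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                yield {"event": "model_done", "model": name, "response": fut.result()}
            except Exception as exc:  # one failing model must not stop the rest
                yield {"event": "model_error", "model": name, "error": str(exc)}
```

Because results are emitted in completion order, a slow cold start on
one model does not delay the responses of models that are already warm.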
Frontend (ImitateView.vue):
- Show per-model loading spinners as results arrive via SSE stream
- Display cold-start badge when coordinator signals the model was freshly loaded
Backend (app/imitate.py):
- GET /api/imitate/products — reads imitate: config, checks online status
- GET /api/imitate/products/{id}/sample — fetches real item from product API
- GET /api/imitate/run (SSE) — streams ollama responses for selected models
- POST /api/imitate/push-corrections — queues results in SFT corrections JSONL
Frontend (ImitateView.vue):
- Step 1: product picker grid (online/offline status, icon from config)
- Step 2: raw sample preview + editable prompt textarea
- Step 3: ollama model multi-select, temperature slider, SSE run with live log
- Step 4: response cards side by side, push to Corrections button
Wiring:
- app/api.py: include imitate_router at /api/imitate
- web/src/router: /imitate route + lazy import
- AppSidebar: Imitate nav entry (mirror icon)
- config/label_tool.yaml.example: imitate: section with peregrine example
- 16 unit tests (100% passing)
Also: BenchmarkView.vue Compare panel — side-by-side run diff for bench results
TaskEntry now includes prompt/system fields (default ""). Switch from
exact dict comparison to field-by-field assertions so the test is
forward-compatible with optional schema additions.
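A minimal sketch of the difference, using a stand-in dataclass rather
than the real TaskEntry (the id and name fields are assumptions; only
prompt and system with their "" defaults come from the note above):

```python
from dataclasses import asdict, dataclass


@dataclass
class TaskEntry:
    # Stand-in model: id and name are illustrative field names.
    id: str
    name: str
    prompt: str = ""
    system: str = ""


def test_task_entry_fields() -> None:
    entry = asdict(TaskEntry(id="t1", name="Summarize"))
    # Field-by-field: adding another optional field later will not
    # break these assertions, unlike `entry == {...}`.
    assert entry["id"] == "t1"
    assert entry["name"] == "Summarize"
    assert entry["prompt"] == ""
    assert entry["system"] == ""
```

An exact comparison against {"id": "t1", "name": "Summarize"} would fail
here as soon as the two new fields appeared in the serialized entry.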
- sft.py: _DEFAULT_BENCH_RESULTS_DIR set to circuitforge-orch bench
results path; set_default_bench_results_dir() seam for test isolation
- test fixture resets default to tmp_path to avoid real-fs interference
- 136 tests passing
Closes #14
- _load_cforch_config() falls back to CF_ORCH_URL / CF_LICENSE_KEY /
OLLAMA_HOST / OLLAMA_MODEL env vars when label_tool.yaml cforch: key
is absent or empty (yaml wins when both present)
- CF_LICENSE_KEY forwarded to benchmark subprocess env so cf-orch agent
can authenticate without it appearing in command args
- GET /api/cforch/config endpoint — returns resolved connection state;
redacts license key (returns license_key_set bool only)
- SettingsView: connection status pill (cf-orch / Ollama / unconfigured)
loaded from /api/cforch/config on mount; shows env vs yaml source
- .env.example documenting all relevant vars
- config/label_tool.yaml.example: full cforch: section with all keys
- environment.yml: add circuitforge-core>=0.9.0 dependency
- .gitignore: add .env
- 4 new tests (17 total in test_cforch.py); 136 passing overall
Closes #10
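The precedence and redaction rules in this change can be sketched as
below. The environment variable names and the license_key_set field are
taken from the notes above; the YAML key names, the per-key merge, and
both function names are assumptions rather than the actual
_load_cforch_config() implementation:

```python
import os
from typing import Mapping, Optional

# YAML key -> environment variable. The YAML key names are assumptions.
_ENV_FALLBACKS = {
    "url": "CF_ORCH_URL",
    "license_key": "CF_LICENSE_KEY",
    "ollama_host": "OLLAMA_HOST",
    "ollama_model": "OLLAMA_MODEL",
}


def resolve_cforch_config(
    yaml_section: Optional[Mapping[str, str]],
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Prefer YAML values; fall back to env vars for missing or empty keys."""
    section = yaml_section or {}  # tolerate an absent or null cforch: key
    return {
        key: section.get(key) or env.get(var) or None
        for key, var in _ENV_FALLBACKS.items()
    }


def public_view(resolved: dict) -> dict:
    """Shape for a config endpoint: never echo the license key itself."""
    view = {k: v for k, v in resolved.items() if k != "license_key"}
    view["license_key_set"] = bool(resolved.get("license_key"))
    return view
```

Exposing only a boolean lets the settings page show that a key is
configured without the secret ever leaving the backend.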
- GET /api/models/lookup now returns compatible: bool and warning: str|null
- compatible=false + warning when pipeline_tag is absent (no task tag on HF)
or present but not in the supported adapter map
- Warning message names the unsupported pipeline_tag and lists supported types
- ModelsView: yellow compat-warning banner below preview description;
Add button relabels to "Add anyway" with muted styling when incompatible
- test_models: accept 405 for path-traversal DELETE tests (StaticFiles mount
returns 405 for non-GET methods when web/dist exists)
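The compatibility check might be sketched as follows. The adapter map
here is a hypothetical two-entry stand-in (the real supported tags live
in the codebase), and `check_compat` and the message wording are
illustrative only:

```python
from typing import Optional, Tuple

# Hypothetical adapter map for illustration; not the project's real list.
SUPPORTED_PIPELINE_TAGS = {
    "zero-shot-classification": "ZeroShotAdapter",
    "text-generation": "GenerationAdapter",
}


def check_compat(pipeline_tag: Optional[str]) -> Tuple[bool, Optional[str]]:
    """Return (compatible, warning) for a model's pipeline_tag."""
    supported = ", ".join(sorted(SUPPORTED_PIPELINE_TAGS))
    if not pipeline_tag:
        return False, f"Model has no pipeline_tag. Supported types: {supported}."
    if pipeline_tag not in SUPPORTED_PIPELINE_TAGS:
        return False, (
            f"pipeline_tag {pipeline_tag!r} is not supported. "
            f"Supported types: {supported}."
        )
    return True, None
```

The lookup stays advisory: the caller still receives the model preview
and decides whether to add it anyway.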
Backend (app/api.py):
- GET /api/benchmark/models — returns installed models grouped by adapter
type (ZeroShotAdapter, RerankerAdapter, GenerationAdapter, Unknown);
reads _MODELS_DIR via app.models so test overrides are respected
- GET /api/benchmark/run — add model_names query param (comma-separated);
when set, passes --models <names...> to benchmark_classifier.py
- GET /api/stats — add benchmark_results field from benchmark_results.json
Frontend:
- BenchmarkView: collapsible Model Selection panel with per-category
checkboxes, select-all per category (supports indeterminate state),
collapsed summary badge ("All models (N)" or "N of M selected");
model_names only sent when a strict subset is selected
- StatsView: Benchmark Results table (accuracy, macro_f1, weighted_f1)
with best-model highlighting per metric; hidden when no results exist
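How the model_names query param could translate into the subprocess
command line, as a rough sketch. The --models flag and script name come
from the notes above; the scripts/ path prefix and the
`build_benchmark_cmd` helper are assumptions:

```python
import sys
from typing import Optional


def build_benchmark_cmd(model_names: Optional[str]) -> list[str]:
    """Build the argv for the benchmark script from a comma-separated param."""
    # Script location is assumed; adjust to the real project layout.
    cmd = [sys.executable, "scripts/benchmark_classifier.py"]
    names = [n.strip() for n in (model_names or "").split(",") if n.strip()]
    if names:
        # One argv entry per model, so names are never shell-interpreted.
        cmd += ["--models", *names]
    return cmd
```

Omitting the flag entirely when no names are given matches the frontend
behaviour of sending model_names only for a strict subset.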
Adds optional failure_category to SubmitRequest and candidate records so
reviewers can classify why a model response was wrong, not just what to do
with it. Enables the fine-tune harness to filter training data by failure
type (e.g. exclude scoring artifacts, train only on genuine wrong answers).
Taxonomy: scoring_artifact | style_violation | partial_answer |
wrong_answer | format_error | hallucination
- app/sft.py: FailureCategory Literal type; SubmitRequest.failure_category;
stored on candidate record in POST /submit correct branch
- tests/test_sft.py: 3 new tests (stores value, null round-trip, 422 on invalid)
- stores/sft.ts: SftFailureCategory type exported; SftQueueItem + SftLastAction
updated; setLastAction accepts optional category param
- SftCard.vue: chip-group selector shown during correct/discard/flag flow;
two-step confirm for discard/flag reveals chips before emitting; category
forwarded in all emit payloads
- CorrectionsView.vue: handleCorrect/Discard/Flag accept and forward category
to POST /api/sft/submit body and store.setLastAction
- SftCard.test.ts: 11 new tests covering chip visibility, selection,
single-active enforcement, pending-action flow, emit payloads, cancel
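The taxonomy and the training-data filter it enables can be sketched in
plain Python. The six category values are from the notes above; the
validator and `training_subset` are hypothetical helpers (the real
route relies on Pydantic to produce the 422):

```python
from typing import Literal, Optional, get_args

FailureCategory = Literal[
    "scoring_artifact", "style_violation", "partial_answer",
    "wrong_answer", "format_error", "hallucination",
]


def validate_failure_category(value: Optional[str]) -> Optional[str]:
    """Accept None (field is optional) or one of the taxonomy values."""
    if value is None:
        return None
    if value not in get_args(FailureCategory):
        raise ValueError(f"invalid failure_category: {value!r}")
    return value


def training_subset(records: list[dict], exclude: set[str]) -> list[dict]:
    """Drop candidates whose failure_category is in `exclude`."""
    return [r for r in records if r.get("failure_category") not in exclude]
```

For example, passing exclude={"scoring_artifact"} keeps uncategorized
records and genuine wrong answers while dropping scoring artifacts.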
- manage.sh: dev command starts uvicorn --reload on :8503 and Vite dev
server (auto-port from 5173); kills API on EXIT/INT/TERM trap
- manage.sh: ENV_UI defaults to 'cf' env (overridable via AVOCET_ENV)
- vite.config.ts: add server.proxy to forward /api to :8503 so Vite
dev server can reach the backend without CORS issues
- sft.py GET /config: use `or {}` guard so `sft: ~` (null YAML) doesn't
return None instead of the default empty config
- CorrectionsView: convert handleCorrect/Discard/Flag and handleUndo from
optimistic to pessimistic — queue mutation only happens after server
confirms; failures leave item in queue so user can retry cleanly
- SettingsView: call loadSftConfig() on mount so saved bench_results_dir
is populated instead of always starting empty
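The `or {}` guard in the GET /config fix above matters because a
default passed to dict.get() only applies when the key is missing, not
when it is present with a null value. A minimal sketch, with
`sft_section` as a hypothetical helper name:

```python
from typing import Optional


def sft_section(config: Optional[dict]) -> dict:
    """Return the sft: section, treating missing and null alike."""
    return (config or {}).get("sft") or {}


# `sft: ~` in YAML parses to {"sft": None}.
parsed = {"sft": None}
naive = parsed.get("sft", {})    # key exists, so the default is ignored
guarded = sft_section(parsed)
```

Here `naive` is None while `guarded` is an empty dict, which is the
difference the fix relies on.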
- Add /corrections route to Vue router (lazy-loaded CorrectionsView)
- Add Corrections nav item (✍️) to AppSidebar after Benchmark
- Add cf-orch Integration section to SettingsView with bench_results_dir
field, run scanner, and per-run import table
- Add GET /api/sft/config and POST /api/sft/config endpoints to app/sft.py