Compare commits

..

17 commits

Author SHA1 Message Date
e6b64d6efe fix: imitate extractor + health_path — support CF cloud API shapes
- _extract_sample: add saved_searches, entries, calls, records as
  recognized list-wrapper keys (snipe/osprey response shapes)
- _is_online: accept health_path param (default /api/health) so
  products using /api/v1/health/ (kiwi) report correctly
- products endpoint: pass health_path from config into _is_online
2026-04-09 20:24:26 -07:00
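The configurable health path this commit describes can be sketched as below; `health_url` and `is_online` are hypothetical names for illustration, not the module's actual helpers:

```python
from urllib.request import urlopen

def health_url(base_url: str, health_path: str = "/api/health") -> str:
    # Join base URL and health path without doubling slashes; products like
    # kiwi that serve /api/v1/health/ just pass a different health_path.
    return f"{base_url.rstrip('/')}{health_path}"

def is_online(base_url: str, health_path: str = "/api/health") -> bool:
    # Any 2xx response from the health endpoint counts as online.
    try:
        with urlopen(health_url(base_url, health_path), timeout=2) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False
```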
fee0cdb4a8 Merge pull request 'feat: Imitate tab — pull CF product samples, compare LLM responses' (#23) from feat/imitate into main 2026-04-09 20:13:20 -07:00
3299c0e23a feat: Imitate tab — pull CF product samples, compare LLM responses
Backend (app/imitate.py):
- GET /api/imitate/products — reads imitate: config, checks online status
- GET /api/imitate/products/{id}/sample — fetches real item from product API
- GET /api/imitate/run (SSE) — streams ollama responses for selected models
- POST /api/imitate/push-corrections — queues results in SFT corrections JSONL

Frontend (ImitateView.vue):
- Step 1: product picker grid (online/offline status, icon from config)
- Step 2: raw sample preview + editable prompt textarea
- Step 3: ollama model multi-select, temperature slider, SSE run with live log
- Step 4: response cards side by side, push to Corrections button

Wiring:
- app/api.py: include imitate_router at /api/imitate
- web/src/router: /imitate route + lazy import
- AppSidebar: Imitate nav entry (mirror icon)
- config/label_tool.yaml.example: imitate: section with peregrine example
- 16 unit tests (100% passing)

Also: BenchmarkView.vue Compare panel — side-by-side run diff for bench results
2026-04-09 20:12:57 -07:00
dc246df42d test: fix test_tasks_parses_yaml for TaskEntry schema
TaskEntry now includes prompt/system fields (default ""). Switch from
exact dict comparison to field-by-field assertions so the test is
forward-compatible with optional schema additions.
2026-04-09 20:11:01 -07:00
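The field-by-field pattern from this commit can be sketched as follows; `TaskEntry` here is a hypothetical dataclass stand-in for the real schema:

```python
from dataclasses import dataclass

@dataclass
class TaskEntry:
    # Hypothetical stand-in for the real TaskEntry schema; prompt/system
    # are the optional fields that were added with default "".
    id: str
    name: str
    prompt: str = ""
    system: str = ""

def assert_task_fields(entry: TaskEntry, expected: dict) -> None:
    # Field-by-field assertions: future optional fields on TaskEntry will
    # not break this check, unlike an exact-dict comparison would.
    for key, want in expected.items():
        assert getattr(entry, key) == want, key
```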
7a392df492 Merge pull request 'feat: env var LLM config, cf-orch coordinator auth, SFT default bench path (#10, #14)' (#22) from feat/env-config-sft-import into main 2026-04-09 12:30:56 -07:00
891142570b feat(#14): default bench_results_dir + testability seam
- sft.py: _DEFAULT_BENCH_RESULTS_DIR set to circuitforge-orch bench
  results path; set_default_bench_results_dir() seam for test isolation
- test fixture resets default to tmp_path to avoid real-fs interference
- 136 tests passing

Closes #14
2026-04-09 12:28:38 -07:00
a271278dc9 feat(#10): env var LLM config + cf-orch coordinator auth
- _load_cforch_config() falls back to CF_ORCH_URL / CF_LICENSE_KEY /
  OLLAMA_HOST / OLLAMA_MODEL env vars when label_tool.yaml cforch: key
  is absent or empty (yaml wins when both present)
- CF_LICENSE_KEY forwarded to benchmark subprocess env so cf-orch agent
  can authenticate without it appearing in command args
- GET /api/cforch/config endpoint — returns resolved connection state;
  redacts license key (returns license_key_set bool only)
- SettingsView: connection status pill (cf-orch / Ollama / unconfigured)
  loaded from /api/cforch/config on mount; shows env vs yaml source
- .env.example documenting all relevant vars
- config/label_tool.yaml.example: full cforch: section with all keys
- environment.yml: add circuitforge-core>=0.9.0 dependency
- .gitignore: add .env
- 4 new tests (17 total in test_cforch.py); 136 passing overall

Closes #10
2026-04-09 12:26:44 -07:00
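The "yaml wins when both present" fallback rule from this commit reduces to a small coalescing helper, sketched here with a demo env-var name to avoid touching the real keys:

```python
import os

def coalesce(file_val: str, env_key: str) -> str:
    # yaml value wins when present; the env var only fills an
    # absent or empty key.
    return file_val if file_val else os.environ.get(env_key, "")
```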
dffb1d0d7a feat: cf-orch LLM benchmark integration (Phase 1)
Backend (app/cforch.py — new APIRouter at /api/cforch):
- GET /tasks — reads bench_tasks.yaml, returns tasks + deduplicated types
- GET /models — reads bench_models.yaml, returns model list with service/tags
- GET /run — SSE endpoint; spawns cf-orch benchmark.py subprocess with
  --filter-tasks, --filter-tags, --coordinator, --ollama-url; strips ANSI
  codes; emits progress/result/complete/error events; 409 guard on concurrency
- GET /results — returns latest bench_results/*/summary.json; 404 if none
- POST /cancel — terminates running benchmark subprocess
- All paths configurable via label_tool.yaml cforch: section
- 13 tests; follows sft.py/models.py testability seam pattern

Frontend:
- BenchmarkView: mode toggle (Classifier / LLM Eval); LLM Eval panel with
  task picker (by type, select-all + indeterminate), model picker (by service),
  SSE run log, results table with best-per-column highlighting
- StatsView: LLM Benchmark section showing quality_by_task_type table across
  models; hidden when no results; fetches /api/cforch/results on mount

SFT candidate pipeline: cf-orch runs that produce sft_candidates.jsonl are
auto-discovered by the existing bench_results_dir config in sft.py — no
additional wiring needed.
2026-04-09 10:46:06 -07:00
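The ANSI-stripping plus SSE framing this commit mentions can be sketched as a single formatter; the event shape (`type`/`message`) follows the commit message, the function name is made up for illustration:

```python
import json
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def sse_progress(line: str) -> str:
    # Strip ANSI color codes from subprocess output, then frame the
    # cleaned line as one SSE "progress" event.
    clean = ANSI_RE.sub("", line.rstrip())
    return f"data: {json.dumps({'type': 'progress', 'message': clean})}\n\n"
```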
ce12b29c94 feat: model compatibility warning on HF lookup
- GET /api/models/lookup now returns compatible: bool and warning: str|null
- compatible=false + warning when pipeline_tag is absent (no task tag on HF)
  or present but not in the supported adapter map
- Warning message names the unsupported pipeline_tag and lists supported types
- ModelsView: yellow compat-warning banner below preview description;
  Add button relabels to "Add anyway" with muted styling when incompatible
- test_models: accept 405 for path-traversal DELETE tests (StaticFiles mount
  returns 405 for non-GET methods when web/dist exists)
2026-04-09 09:48:55 -07:00
49ec85706c Merge pull request 'feat: benchmark model picker, category grouping, stats benchmark results' (#20) from feat/benchmark-model-picker into main 2026-04-08 23:07:10 -07:00
478a47f6e0 Merge pull request 'feat: HuggingFace model management tab' (#19) from feat/hf-model-queue into main 2026-04-08 23:06:54 -07:00
7c304ebc45 feat: benchmark model picker, category grouping, stats benchmark results
Backend (app/api.py):
- GET /api/benchmark/models — returns installed models grouped by adapter
  type (ZeroShotAdapter, RerankerAdapter, GenerationAdapter, Unknown);
  reads _MODELS_DIR via app.models so test overrides are respected
- GET /api/benchmark/run — add model_names query param (comma-separated);
  when set, passes --models <names...> to benchmark_classifier.py
- GET /api/stats — add benchmark_results field from benchmark_results.json

Frontend:
- BenchmarkView: collapsible Model Selection panel with per-category
  checkboxes, select-all per category (supports indeterminate state),
  collapsed summary badge ("All models (N)" or "N of M selected");
  model_names only sent when a strict subset is selected
- StatsView: Benchmark Results table (accuracy, macro_f1, weighted_f1)
  with best-model highlighting per metric; hidden when no results exist
2026-04-08 23:03:56 -07:00
b6b3d2c390 feat: HuggingFace model management tab
- New /api/models router: HF lookup, approval queue (JSONL persistence),
  SSE download progress via snapshot_download(), installed model listing,
  path-traversal-safe DELETE
- pipeline_tag → adapter type mapping (zero-shot-classification,
  sentence-similarity, text-generation)
- 27 tests covering all endpoints, duplicate detection, path traversal
- ModelsView.vue: HF lookup + add, approval queue, live download progress
  bars via SSE, installed model table with delete
- Sidebar entry (🤗 Models) between Benchmark and Corrections
2026-04-08 22:32:35 -07:00
a7cb3ae62a Merge pull request 'feat: SFT failure_category — classify why a model response was wrong' (#17) from feat/sft-failure-category into main 2026-04-08 22:19:20 -07:00
c5eaacc767 Merge pull request 'feat: Corrections tab — SFT candidate import, review, and JSONL export' (#15) from feat/sft-corrections into main 2026-04-08 22:19:01 -07:00
9633d9a535 feat: add failure_category field to SFT corrections (#16)
Adds optional failure_category to SubmitRequest and candidate records so
reviewers can classify why a model response was wrong, not just what to do
with it. Enables the fine-tune harness to filter training data by failure
type (e.g. exclude scoring artifacts, train only on genuine wrong answers).

Taxonomy: scoring_artifact | style_violation | partial_answer |
          wrong_answer | format_error | hallucination

- app/sft.py: FailureCategory Literal type; SubmitRequest.failure_category;
  stored on candidate record in POST /submit correct branch
- tests/test_sft.py: 3 new tests (stores value, null round-trip, 422 on invalid)
- stores/sft.ts: SftFailureCategory type exported; SftQueueItem + SftLastAction
  updated; setLastAction accepts optional category param
- SftCard.vue: chip-group selector shown during correct/discard/flag flow;
  two-step confirm for discard/flag reveals chips before emitting; category
  forwarded in all emit payloads
- CorrectionsView.vue: handleCorrect/Discard/Flag accept and forward category
  to POST /api/sft/submit body and store.setLastAction
- SftCard.test.ts: 11 new tests covering chip visibility, selection,
  single-active enforcement, pending-action flow, emit payloads, cancel
2026-04-08 22:10:26 -07:00
cfc09b4731 chore: gitignore CLAUDE.md and docs/superpowers (BSL 1.1 compliance) 2026-03-27 01:04:18 -07:00
23 changed files with 5792 additions and 44 deletions

.env.example Normal file

@@ -0,0 +1,19 @@
# Avocet — environment variable configuration
# Copy to .env and fill in values. All keys are optional.
# label_tool.yaml takes precedence over env vars where both exist.
# ── Local inference (Ollama) ───────────────────────────────────────────────────
# OLLAMA_HOST defaults to http://localhost:11434 if unset.
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3.2:3b
# ── cf-orch coordinator (paid/premium tiers) ───────────────────────────────────
# Required for multi-GPU LLM benchmarking via the cf-orch benchmark harness.
# Free-tier users can leave these unset and use Ollama only.
CF_ORCH_URL=http://localhost:7700
CF_LICENSE_KEY=CFG-AVCT-xxxx-xxxx-xxxx
# ── Cloud LLM backends (optional — paid/premium) ──────────────────────────────
# Set one of these to use a cloud LLM instead of a local model.
# ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...

app/api.py

@@ -145,6 +145,16 @@ app = FastAPI(title="Avocet API")
from app.sft import router as sft_router
app.include_router(sft_router, prefix="/api/sft")
from app.models import router as models_router
import app.models as _models_module
app.include_router(models_router, prefix="/api/models")
from app.cforch import router as cforch_router
app.include_router(cforch_router, prefix="/api/cforch")
from app.imitate import router as imitate_router
app.include_router(imitate_router, prefix="/api/imitate")
# In-memory last-action store (single user, local tool — in-memory is fine)
_last_action: dict | None = None
@@ -298,10 +308,18 @@ def get_stats():
        lbl = r.get("label", "")
        if lbl:
            counts[lbl] = counts.get(lbl, 0) + 1
    benchmark_results: dict = {}
    benchmark_path = _DATA_DIR / "benchmark_results.json"
    if benchmark_path.exists():
        try:
            benchmark_results = json.loads(benchmark_path.read_text(encoding="utf-8"))
        except Exception:
            pass
    return {
        "total": len(records),
        "counts": counts,
        "score_file_bytes": _score_file().stat().st_size if _score_file().exists() else 0,
        "benchmark_results": benchmark_results,
    }
@@ -336,6 +354,36 @@ from fastapi.responses import StreamingResponse

# ---------------------------------------------------------------------------
# Benchmark endpoints
# ---------------------------------------------------------------------------

@app.get("/api/benchmark/models")
def get_benchmark_models() -> dict:
    """Return installed models grouped by adapter_type category."""
    models_dir: Path = _models_module._MODELS_DIR
    categories: dict[str, list[dict]] = {
        "ZeroShotAdapter": [],
        "RerankerAdapter": [],
        "GenerationAdapter": [],
        "Unknown": [],
    }
    if models_dir.exists():
        for sub in models_dir.iterdir():
            if not sub.is_dir():
                continue
            info_path = sub / "model_info.json"
            adapter_type = "Unknown"
            repo_id: str | None = None
            if info_path.exists():
                try:
                    info = json.loads(info_path.read_text(encoding="utf-8"))
                    adapter_type = info.get("adapter_type") or info.get("adapter_recommendation") or "Unknown"
                    repo_id = info.get("repo_id")
                except Exception:
                    pass
            bucket = adapter_type if adapter_type in categories else "Unknown"
            entry: dict = {"name": sub.name, "repo_id": repo_id, "adapter_type": adapter_type}
            categories[bucket].append(entry)
    return {"categories": categories}
@app.get("/api/benchmark/results")
def get_benchmark_results():
    """Return the most recently saved benchmark results, or an empty envelope."""

@@ -346,13 +394,17 @@ def get_benchmark_results():

@app.get("/api/benchmark/run")
-def run_benchmark(include_slow: bool = False):
+def run_benchmark(include_slow: bool = False, model_names: str = ""):
    """Spawn the benchmark script and stream stdout as SSE progress events."""
    python_bin = "/devl/miniconda3/envs/job-seeker-classifiers/bin/python"
    script = str(_ROOT / "scripts" / "benchmark_classifier.py")
    cmd = [python_bin, script, "--score", "--save"]
    if include_slow:
        cmd.append("--include-slow")
    if model_names:
        names = [n.strip() for n in model_names.split(",") if n.strip()]
        if names:
            cmd.extend(["--models"] + names)

    def generate():
        try:
app/cforch.py Normal file

@@ -0,0 +1,337 @@
"""Avocet — cf-orch benchmark integration API.

Wraps cf-orch's benchmark.py script and exposes it via the Avocet API.
Config is read from label_tool.yaml under the `cforch:` key.

All endpoints are registered on `router` (a FastAPI APIRouter).
api.py includes this router with prefix="/api/cforch".

Module-level globals (_CONFIG_DIR, _BENCH_RUNNING, _bench_proc) follow the
same testability pattern as sft.py: override _CONFIG_DIR via set_config_dir()
in test fixtures.
"""
from __future__ import annotations

import json
import logging
import os
import re
import subprocess as _subprocess
from pathlib import Path
from typing import Any

import yaml
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse

logger = logging.getLogger(__name__)

_ROOT = Path(__file__).parent.parent
_CONFIG_DIR: Path | None = None  # override in tests
_BENCH_RUNNING: bool = False
_bench_proc: Any = None  # live Popen object while benchmark runs

router = APIRouter()


# ── Testability seams ──────────────────────────────────────────────────────────

def set_config_dir(path: Path | None) -> None:
    global _CONFIG_DIR
    _CONFIG_DIR = path


# ── Internal helpers ───────────────────────────────────────────────────────────

def _config_file() -> Path:
    if _CONFIG_DIR is not None:
        return _CONFIG_DIR / "label_tool.yaml"
    return _ROOT / "config" / "label_tool.yaml"


def _load_cforch_config() -> dict:
    """Read label_tool.yaml cforch section, falling back to environment variables.

    Priority (highest to lowest):
    1. label_tool.yaml cforch: key
    2. Environment variables (CF_ORCH_URL, CF_LICENSE_KEY, OLLAMA_HOST, OLLAMA_MODEL)
    """
    f = _config_file()
    file_cfg: dict = {}
    if f.exists():
        try:
            raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
            file_cfg = raw.get("cforch", {}) or {}
        except yaml.YAMLError as exc:
            logger.warning("Failed to parse cforch config %s: %s", f, exc)

    # Env var fallbacks — only used when the yaml key is absent or empty
    def _coalesce(file_val: str, env_key: str) -> str:
        return file_val if file_val else os.environ.get(env_key, "")

    return {
        **file_cfg,
        "coordinator_url": _coalesce(file_cfg.get("coordinator_url", ""), "CF_ORCH_URL"),
        "license_key": _coalesce(file_cfg.get("license_key", ""), "CF_LICENSE_KEY"),
        "ollama_url": _coalesce(file_cfg.get("ollama_url", ""), "OLLAMA_HOST"),
        "ollama_model": _coalesce(file_cfg.get("ollama_model", ""), "OLLAMA_MODEL"),
    }


def _strip_ansi(text: str) -> str:
    """Remove ANSI escape codes from a string."""
    return re.sub(r'\x1b\[[0-9;]*m', '', text)


def _find_latest_summary(results_dir: str | None) -> Path | None:
    """Find the newest summary.json under results_dir, or None if not found."""
    if not results_dir:
        return None
    rdir = Path(results_dir)
    if not rdir.exists():
        return None
    # Subdirs are named YYYY-MM-DD-HHMMSS; sort lexicographically for chronological order
    subdirs = sorted(
        [d for d in rdir.iterdir() if d.is_dir()],
        key=lambda d: d.name,
    )
    for subdir in reversed(subdirs):
        summary = subdir / "summary.json"
        if summary.exists():
            return summary
    return None


# ── GET /tasks ─────────────────────────────────────────────────────────────────

@router.get("/tasks")
def get_tasks() -> dict:
    """Return task list from bench_tasks.yaml."""
    cfg = _load_cforch_config()
    tasks_path = cfg.get("bench_tasks", "")
    if not tasks_path:
        return {"tasks": [], "types": []}
    p = Path(tasks_path)
    if not p.exists():
        return {"tasks": [], "types": []}
    try:
        raw = yaml.safe_load(p.read_text(encoding="utf-8")) or {}
    except yaml.YAMLError as exc:
        logger.warning("Failed to parse bench_tasks.yaml %s: %s", p, exc)
        return {"tasks": [], "types": []}
    tasks_raw = raw.get("tasks", []) or []
    tasks: list[dict] = []
    seen_types: list[str] = []
    types_set: set[str] = set()
    for t in tasks_raw:
        if not isinstance(t, dict):
            continue
        tasks.append({
            "id": t.get("id", ""),
            "name": t.get("name", ""),
            "type": t.get("type", ""),
            "prompt": (t.get("prompt") or "").strip(),
            "system": (t.get("system") or "").strip(),
        })
        task_type = t.get("type", "")
        if task_type and task_type not in types_set:
            seen_types.append(task_type)
            types_set.add(task_type)
    return {"tasks": tasks, "types": seen_types}


# ── GET /models ────────────────────────────────────────────────────────────────

@router.get("/models")
def get_models() -> dict:
    """Return model list from bench_models.yaml."""
    cfg = _load_cforch_config()
    models_path = cfg.get("bench_models", "")
    if not models_path:
        return {"models": []}
    p = Path(models_path)
    if not p.exists():
        return {"models": []}
    try:
        raw = yaml.safe_load(p.read_text(encoding="utf-8")) or {}
    except yaml.YAMLError as exc:
        logger.warning("Failed to parse bench_models.yaml %s: %s", p, exc)
        return {"models": []}
    models_raw = raw.get("models", []) or []
    models: list[dict] = []
    for m in models_raw:
        if not isinstance(m, dict):
            continue
        models.append({
            "name": m.get("name", ""),
            "id": m.get("id", ""),
            "service": m.get("service", "ollama"),
            "tags": m.get("tags", []) or [],
            "vram_estimate_mb": m.get("vram_estimate_mb", 0),
        })
    return {"models": models}


# ── GET /run ───────────────────────────────────────────────────────────────────

@router.get("/run")
def run_benchmark(
    task_ids: str = "",
    model_tags: str = "",
    coordinator_url: str = "",
    ollama_url: str = "",
) -> StreamingResponse:
    """Spawn cf-orch benchmark.py and stream stdout as SSE progress events."""
    global _BENCH_RUNNING, _bench_proc
    if _BENCH_RUNNING:
        raise HTTPException(409, "A benchmark is already running")
    cfg = _load_cforch_config()
    bench_script = cfg.get("bench_script", "")
    bench_tasks = cfg.get("bench_tasks", "")
    bench_models = cfg.get("bench_models", "")
    results_dir = cfg.get("results_dir", "")
    python_bin = cfg.get("python_bin", "/devl/miniconda3/envs/cf/bin/python")
    cfg_coordinator = cfg.get("coordinator_url", "")
    cfg_ollama = cfg.get("ollama_url", "")
    cfg_license_key = cfg.get("license_key", "")

    def generate():
        global _BENCH_RUNNING, _bench_proc
        if not bench_script or not Path(bench_script).exists():
            yield f"data: {json.dumps({'type': 'error', 'message': 'bench_script not configured or not found'})}\n\n"
            return
        cmd = [
            python_bin,
            bench_script,
            "--tasks", bench_tasks,
            "--models", bench_models,
            "--output", results_dir,
        ]
        if task_ids:
            cmd.extend(["--filter-tasks"] + task_ids.split(","))
        if model_tags:
            cmd.extend(["--filter-tags"] + model_tags.split(","))
        # query param overrides config, config overrides env var (already resolved by _load_cforch_config)
        effective_coordinator = coordinator_url if coordinator_url else cfg_coordinator
        effective_ollama = ollama_url if ollama_url else cfg_ollama
        if effective_coordinator:
            cmd.extend(["--coordinator", effective_coordinator])
        if effective_ollama:
            cmd.extend(["--ollama-url", effective_ollama])
        # Pass license key as env var so subprocess can authenticate with cf-orch
        proc_env = {**os.environ}
        if cfg_license_key:
            proc_env["CF_LICENSE_KEY"] = cfg_license_key
        _BENCH_RUNNING = True
        try:
            proc = _subprocess.Popen(
                cmd,
                stdout=_subprocess.PIPE,
                stderr=_subprocess.STDOUT,
                text=True,
                bufsize=1,
                env=proc_env,
            )
            _bench_proc = proc
            try:
                for line in proc.stdout:
                    line = _strip_ansi(line.rstrip())
                    if line:
                        yield f"data: {json.dumps({'type': 'progress', 'message': line})}\n\n"
                proc.wait()
                if proc.returncode == 0:
                    summary_path = _find_latest_summary(results_dir)
                    if summary_path is not None:
                        try:
                            summary = json.loads(summary_path.read_text(encoding="utf-8"))
                            yield f"data: {json.dumps({'type': 'result', 'summary': summary})}\n\n"
                        except Exception as exc:
                            logger.warning("Failed to read summary.json: %s", exc)
                    yield f"data: {json.dumps({'type': 'complete'})}\n\n"
                else:
                    yield f"data: {json.dumps({'type': 'error', 'message': f'Process exited with code {proc.returncode}'})}\n\n"
            finally:
                _bench_proc = None
        except Exception as exc:
            yield f"data: {json.dumps({'type': 'error', 'message': str(exc)})}\n\n"
        finally:
            _BENCH_RUNNING = False

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )


# ── GET /config ────────────────────────────────────────────────────────────────

@router.get("/config")
def get_cforch_config() -> dict:
    """Return resolved cf-orch connection config (env vars merged with yaml).

    Redacts license_key: only returns whether it is set, not the value.
    Used by the Settings UI to show current connection state.
    """
    cfg = _load_cforch_config()
    return {
        "coordinator_url": cfg.get("coordinator_url", ""),
        "ollama_url": cfg.get("ollama_url", ""),
        "ollama_model": cfg.get("ollama_model", ""),
        "license_key_set": bool(cfg.get("license_key", "")),
        "source": "env" if not _config_file().exists() else "yaml+env",
    }


# ── GET /results ───────────────────────────────────────────────────────────────

@router.get("/results")
def get_results() -> dict:
    """Return the latest benchmark summary.json from results_dir."""
    cfg = _load_cforch_config()
    results_dir = cfg.get("results_dir", "")
    summary_path = _find_latest_summary(results_dir)
    if summary_path is None:
        raise HTTPException(404, "No benchmark results found")
    try:
        return json.loads(summary_path.read_text(encoding="utf-8"))
    except Exception as exc:
        raise HTTPException(500, f"Failed to read summary.json: {exc}") from exc


# ── POST /cancel ───────────────────────────────────────────────────────────────

@router.post("/cancel")
def cancel_benchmark() -> dict:
    """Kill the running benchmark subprocess."""
    global _BENCH_RUNNING, _bench_proc
    if not _BENCH_RUNNING:
        raise HTTPException(404, "No benchmark is currently running")
    if _bench_proc is not None:
        try:
            _bench_proc.terminate()
        except Exception as exc:
            logger.warning("Failed to terminate benchmark process: %s", exc)
    _BENCH_RUNNING = False
    _bench_proc = None
    return {"status": "cancelled"}

app/imitate.py Normal file

@ -0,0 +1,352 @@
"""Avocet — Imitate tab API.
Fetches real samples from sibling CF product APIs, sends them through selected
local LLMs (ollama), and streams responses back to the UI. Results can be
pushed into the SFT corrections queue for human review.
All endpoints registered on `router`. api.py includes this with prefix="/api/imitate".
Module-level globals follow the same testability pattern as cforch.py and sft.py:
override _CONFIG_DIR and _DATA_DIR via set_config_dir() / set_data_dir() in tests.
"""
from __future__ import annotations
import json
import logging
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from urllib.error import URLError
from urllib.request import Request, urlopen
import yaml
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from app.utils import append_jsonl
logger = logging.getLogger(__name__)
_ROOT = Path(__file__).parent.parent
_CONFIG_DIR: Path | None = None
_DATA_DIR: Path = _ROOT / "data"
router = APIRouter()
# ── Testability seams ──────────────────────────────────────────────────────────
def set_config_dir(path: Path | None) -> None:
global _CONFIG_DIR
_CONFIG_DIR = path
def set_data_dir(path: Path) -> None:
global _DATA_DIR
_DATA_DIR = path
# ── Internal helpers ───────────────────────────────────────────────────────────
def _config_file() -> Path:
if _CONFIG_DIR is not None:
return _CONFIG_DIR / "label_tool.yaml"
return _ROOT / "config" / "label_tool.yaml"
def _load_imitate_config() -> dict:
"""Read label_tool.yaml and return the imitate sub-dict (or {} if absent)."""
f = _config_file()
if not f.exists():
return {}
try:
raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
except yaml.YAMLError as exc:
logger.warning("Failed to parse imitate config %s: %s", f, exc)
return {}
return raw.get("imitate", {}) or {}
def _load_cforch_config() -> dict:
"""Read cforch section for ollama_url fallback."""
f = _config_file()
if not f.exists():
return {}
try:
raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
except yaml.YAMLError as exc:
return {}
return raw.get("cforch", {}) or {}
def _ollama_url(cfg: dict) -> str:
cforch = _load_cforch_config()
return cfg.get("ollama_url") or cforch.get("ollama_url") or "http://localhost:11434"
def _http_get_json(url: str, timeout: int = 5) -> Any:
"""Fetch JSON from url; raise URLError on failure."""
req = Request(url, headers={"Accept": "application/json"})
with urlopen(req, timeout=timeout) as resp:
return json.loads(resp.read().decode("utf-8"))
def _is_online(base_url: str, health_path: str = "/api/health") -> bool:
"""Return True if the product's health endpoint responds OK."""
try:
data = _http_get_json(f"{base_url.rstrip('/')}{health_path}", timeout=2)
return bool(data)
except Exception:
return False
def _extract_sample(
raw: Any, text_fields: list[str], sample_index: int = 0
) -> dict[str, Any]:
"""Pull one item from a list or dict response and extract text_fields."""
item: dict[str, Any]
if isinstance(raw, list):
if not raw:
return {}
item = raw[min(sample_index, len(raw) - 1)]
elif isinstance(raw, dict):
# may be {items: [...]} or the item itself
for key in ("items", "results", "data", "jobs", "listings", "pantry",
"saved_searches", "entries", "calls", "records"):
if key in raw and isinstance(raw[key], list):
lst = raw[key]
item = lst[min(sample_index, len(lst) - 1)] if lst else {}
break
else:
item = raw
else:
return {}
parts = []
for field in text_fields:
val = item.get(field)
if val and str(val).strip():
parts.append(f"**{field}**: {val}")
return {"item": item, "text": "\n\n".join(parts)}
def _candidates_file() -> Path:
return _DATA_DIR / "sft_candidates.jsonl"
def _sse(data: dict) -> str:
return f"data: {json.dumps(data)}\n\n"
def _run_ollama_streaming(
ollama_base: str,
model_id: str,
prompt: str,
temperature: float,
) -> tuple[str, int]:
"""Call ollama /api/generate with stream=True; return (full_response, elapsed_ms).
Blocks until the model finishes; yields nothing streaming is handled by
the SSE generator in run_imitate().
"""
url = f"{ollama_base.rstrip('/')}/api/generate"
payload = json.dumps({
"model": model_id,
"prompt": prompt,
"stream": False,
"options": {"temperature": temperature},
}).encode("utf-8")
req = Request(url, data=payload, method="POST",
headers={"Content-Type": "application/json"})
t0 = time.time()
try:
with urlopen(req, timeout=120) as resp:
body = json.loads(resp.read().decode("utf-8"))
elapsed = int((time.time() - t0) * 1000)
return body.get("response", ""), elapsed
except Exception as exc:
elapsed = int((time.time() - t0) * 1000)
raise RuntimeError(str(exc)) from exc
# ── GET /products ──────────────────────────────────────────────────────────────
@router.get("/products")
def get_products() -> dict:
"""List configured CF products with live online status."""
cfg = _load_imitate_config()
products_raw = cfg.get("products", []) or []
products = []
for p in products_raw:
if not isinstance(p, dict):
continue
base_url = p.get("base_url", "")
products.append({
"id": p.get("id", ""),
"name": p.get("name", ""),
"icon": p.get("icon", "📦"),
"description": p.get("description", ""),
"base_url": base_url,
"online": _is_online(base_url, p.get("health_path", "/api/health")) if base_url else False,
})
return {"products": products}
# ── GET /products/{product_id}/sample ─────────────────────────────────────────
@router.get("/products/{product_id}/sample")
def get_sample(product_id: str, index: int = 0) -> dict:
"""Fetch a real sample from the given product's API."""
cfg = _load_imitate_config()
products_raw = cfg.get("products", []) or []
product: dict | None = None
for p in products_raw:
if isinstance(p, dict) and p.get("id") == product_id:
product = p
break
if product is None:
raise HTTPException(404, f"Product '{product_id}' not in config")
base_url = product.get("base_url", "").rstrip("/")
endpoint = product.get("sample_endpoint", "")
if not base_url or not endpoint:
raise HTTPException(422, "Product missing base_url or sample_endpoint")
url = f"{base_url}{endpoint}"
try:
raw = _http_get_json(url, timeout=5)
except URLError as exc:
raise HTTPException(503, f"Product API unreachable: {exc}") from exc
except Exception as exc:
raise HTTPException(502, f"Bad response from product API: {exc}") from exc
text_fields = product.get("text_fields", []) or []
extracted = _extract_sample(raw, text_fields, index)
if not extracted:
raise HTTPException(404, "No sample items returned by product API")
prompt_template = product.get("prompt_template", "{text}")
prompt = prompt_template.replace("{text}", extracted["text"])
return {
"product_id": product_id,
"sample_index": index,
"text": extracted["text"],
"prompt": prompt,
"raw_item": extracted.get("item", {}),
}
# ── GET /run (SSE) ─────────────────────────────────────────────────────────────
@router.get("/run")
def run_imitate(
prompt: str = "",
model_ids: str = "", # comma-separated ollama model IDs
temperature: float = 0.7,
product_id: str = "",
) -> StreamingResponse:
"""Run a prompt through selected ollama models and stream results as SSE."""
if not prompt.strip():
raise HTTPException(422, "prompt is required")
ids = [m.strip() for m in model_ids.split(",") if m.strip()]
if not ids:
raise HTTPException(422, "model_ids is required")
cfg = _load_imitate_config()
ollama_base = _ollama_url(cfg)
def generate():
results: list[dict] = []
yield _sse({"type": "start", "total_models": len(ids)})
for model_id in ids:
yield _sse({"type": "model_start", "model": model_id})
try:
response, elapsed_ms = _run_ollama_streaming(
ollama_base, model_id, prompt, temperature
)
result = {
"model": model_id,
"response": response,
"elapsed_ms": elapsed_ms,
"error": None,
}
except Exception as exc:
result = {
"model": model_id,
"response": "",
"elapsed_ms": 0,
"error": str(exc),
}
results.append(result)
yield _sse({"type": "model_done", **result})
yield _sse({"type": "complete", "results": results})
return StreamingResponse(
generate(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"X-Accel-Buffering": "no",
},
)
# ── POST /push-corrections ─────────────────────────────────────────────────────
class ImitateResult(BaseModel):
model: str
response: str
elapsed_ms: int
error: str | None = None
class PushCorrectionsRequest(BaseModel):
product_id: str
prompt: str
results: list[ImitateResult]
@router.post("/push-corrections")
def push_corrections(req: PushCorrectionsRequest) -> dict:
"""Append imitate results to sft_candidates.jsonl for human review."""
if not req.prompt.strip():
raise HTTPException(422, "prompt is required")
if not req.results:
raise HTTPException(422, "results list is empty")
ts = datetime.now(timezone.utc).isoformat()
records = []
for r in req.results:
if r.error or not r.response.strip():
continue
records.append({
"id": str(uuid.uuid4()),
"source": "imitate",
"product_id": req.product_id,
"prompt_messages": [{"role": "user", "content": req.prompt}],
"model_response": r.response,
"model_id": r.model,
"elapsed_ms": r.elapsed_ms,
"status": "pending",
"created_at": ts,
})
if not records:
raise HTTPException(422, "No non-error results to push")
dest = _candidates_file()
dest.parent.mkdir(parents=True, exist_ok=True)
for record in records:
append_jsonl(dest, record)
return {"pushed": len(records)}
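
The `/run` endpoint above emits `data: <json>\n\n` frames over SSE. A minimal client-side parser for that framing (a sketch; `parse_sse_events` is my name, not part of the app, and the event shapes are taken from the handler above):

```python
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Split an SSE body into its decoded JSON payloads."""
    events = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events

stream = (
    'data: {"type": "start", "total_models": 1}\n\n'
    'data: {"type": "model_start", "model": "llama3.2:3b"}\n\n'
    'data: {"type": "complete", "results": []}\n\n'
)
# → ["start", "model_start", "complete"]
types = [e["type"] for e in parse_sse_events(stream)]
```

This is also how the unit tests can assert on `r.text` without a streaming client: the whole response body is just concatenated `data:` frames.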

app/models.py Normal file

@@ -0,0 +1,448 @@
"""Avocet — HF model lifecycle API.
Handles model metadata lookup, approval queue, download with progress,
and installed model management.
All endpoints are registered on `router` (a FastAPI APIRouter).
api.py includes this router with prefix="/api/models".
Module-level globals (_MODELS_DIR, _QUEUE_DIR) follow the same
testability pattern as sft.py: override them via set_models_dir() and
set_queue_dir() in test fixtures.
"""
from __future__ import annotations
import json
import logging
import shutil
import threading
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from uuid import uuid4
import httpx
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from app.utils import read_jsonl, write_jsonl
try:
from huggingface_hub import snapshot_download
except ImportError: # pragma: no cover
snapshot_download = None # type: ignore[assignment]
logger = logging.getLogger(__name__)
_ROOT = Path(__file__).parent.parent
_MODELS_DIR: Path = _ROOT / "models"
_QUEUE_DIR: Path = _ROOT / "data"
router = APIRouter()
# ── Download progress shared state ────────────────────────────────────────────
# Updated by the background download thread; read by GET /download/stream.
_download_progress: dict[str, Any] = {}
# ── HF pipeline_tag → adapter recommendation ──────────────────────────────────
_TAG_TO_ADAPTER: dict[str, str] = {
"zero-shot-classification": "ZeroShotAdapter",
"text-classification": "ZeroShotAdapter",
"natural-language-inference": "ZeroShotAdapter",
"sentence-similarity": "RerankerAdapter",
"text-ranking": "RerankerAdapter",
"text-generation": "GenerationAdapter",
"text2text-generation": "GenerationAdapter",
}
# ── Testability seams ──────────────────────────────────────────────────────────
def set_models_dir(path: Path) -> None:
global _MODELS_DIR
_MODELS_DIR = path
def set_queue_dir(path: Path) -> None:
global _QUEUE_DIR
_QUEUE_DIR = path
# ── Internal helpers ───────────────────────────────────────────────────────────
def _queue_file() -> Path:
return _QUEUE_DIR / "model_queue.jsonl"
def _read_queue() -> list[dict]:
return read_jsonl(_queue_file())
def _write_queue(records: list[dict]) -> None:
write_jsonl(_queue_file(), records)
def _safe_model_name(repo_id: str) -> str:
"""Convert repo_id to a filesystem-safe directory name (HF convention)."""
return repo_id.replace("/", "--")
def _is_installed(repo_id: str) -> bool:
"""Check if a model is already downloaded in _MODELS_DIR."""
safe_name = _safe_model_name(repo_id)
model_dir = _MODELS_DIR / safe_name
return model_dir.exists() and (
(model_dir / "config.json").exists()
or (model_dir / "training_info.json").exists()
or (model_dir / "model_info.json").exists()
)
def _is_queued(repo_id: str) -> bool:
"""Check if repo_id is already in the queue (non-dismissed)."""
for entry in _read_queue():
if entry.get("repo_id") == repo_id and entry.get("status") != "dismissed":
return True
return False
def _update_queue_entry(entry_id: str, updates: dict) -> dict | None:
"""Update a queue entry by id. Returns updated entry or None if not found."""
records = _read_queue()
for i, r in enumerate(records):
if r.get("id") == entry_id:
records[i] = {**r, **updates}
_write_queue(records)
return records[i]
return None
def _get_queue_entry(entry_id: str) -> dict | None:
for r in _read_queue():
if r.get("id") == entry_id:
return r
return None
# ── Background download ────────────────────────────────────────────────────────
def _run_download(entry_id: str, repo_id: str, pipeline_tag: str | None, adapter_recommendation: str | None) -> None:
"""Background thread: download model via huggingface_hub.snapshot_download."""
global _download_progress
safe_name = _safe_model_name(repo_id)
local_dir = _MODELS_DIR / safe_name
_download_progress = {
"active": True,
"repo_id": repo_id,
"downloaded_bytes": 0,
"total_bytes": 0,
"pct": 0.0,
"done": False,
"error": None,
}
try:
if snapshot_download is None:
raise RuntimeError("huggingface_hub is not installed")
snapshot_download(
repo_id=repo_id,
local_dir=str(local_dir),
)
# Write model_info.json alongside downloaded files
model_info = {
"repo_id": repo_id,
"pipeline_tag": pipeline_tag,
"adapter_recommendation": adapter_recommendation,
"downloaded_at": datetime.now(timezone.utc).isoformat(),
}
local_dir.mkdir(parents=True, exist_ok=True)
(local_dir / "model_info.json").write_text(
json.dumps(model_info, indent=2), encoding="utf-8"
)
_download_progress["done"] = True
_download_progress["pct"] = 100.0
_update_queue_entry(entry_id, {"status": "ready"})
except Exception as exc:
logger.exception("Download failed for %s: %s", repo_id, exc)
_download_progress["error"] = str(exc)
_download_progress["done"] = True
_update_queue_entry(entry_id, {"status": "failed", "error": str(exc)})
finally:
_download_progress["active"] = False
# ── GET /lookup ────────────────────────────────────────────────────────────────
@router.get("/lookup")
def lookup_model(repo_id: str) -> dict:
"""Validate repo_id and fetch metadata from the HF API."""
# Validate: must contain exactly one '/', no whitespace
if "/" not in repo_id or any(c.isspace() for c in repo_id):
raise HTTPException(422, f"Invalid repo_id {repo_id!r}: must be 'owner/model-name' with no whitespace")
hf_url = f"https://huggingface.co/api/models/{repo_id}"
try:
resp = httpx.get(hf_url, timeout=10.0)
except httpx.RequestError as exc:
raise HTTPException(502, f"Network error reaching HuggingFace API: {exc}") from exc
if resp.status_code == 404:
raise HTTPException(404, f"Model {repo_id!r} not found on HuggingFace")
if resp.status_code != 200:
raise HTTPException(502, f"HuggingFace API returned status {resp.status_code}")
data = resp.json()
pipeline_tag = data.get("pipeline_tag")
adapter_recommendation = _TAG_TO_ADAPTER.get(pipeline_tag) if pipeline_tag else None
# Determine compatibility and surface a human-readable warning
_supported = ", ".join(sorted(_TAG_TO_ADAPTER.keys()))
if adapter_recommendation is not None:
compatible = True
warning: str | None = None
elif pipeline_tag is None:
compatible = False
warning = (
"This model has no task tag on HuggingFace — adapter type is unknown. "
"It may not work with Avocet's email classification pipeline."
)
logger.warning("No pipeline_tag for %s — no adapter recommendation", repo_id)
else:
compatible = False
warning = (
f"\"{pipeline_tag}\" models are not supported by Avocet's email classification adapters. "
f"Supported task types: {_supported}."
)
logger.warning("Unsupported pipeline_tag %r for %s", pipeline_tag, repo_id)
# Estimate model size from siblings list
siblings = data.get("siblings") or []
model_size_bytes: int = sum((s.get("size") or 0) for s in siblings if isinstance(s, dict))  # size may be null
# Description: first 300 chars of card data (modelId field used as fallback)
card_data = data.get("cardData") or {}
description_raw = card_data.get("description") or data.get("modelId") or ""
description = description_raw[:300] if description_raw else ""
return {
"repo_id": repo_id,
"pipeline_tag": pipeline_tag,
"adapter_recommendation": adapter_recommendation,
"compatible": compatible,
"warning": warning,
"model_size_bytes": model_size_bytes,
"description": description,
"tags": data.get("tags") or [],
"downloads": data.get("downloads") or 0,
"already_installed": _is_installed(repo_id),
"already_queued": _is_queued(repo_id),
}
# ── GET /queue ─────────────────────────────────────────────────────────────────
@router.get("/queue")
def get_queue() -> list[dict]:
"""Return all non-dismissed queue entries sorted newest-first."""
records = _read_queue()
active = [r for r in records if r.get("status") != "dismissed"]
return sorted(active, key=lambda r: r.get("queued_at", ""), reverse=True)
# ── POST /queue ────────────────────────────────────────────────────────────────
class QueueAddRequest(BaseModel):
repo_id: str
pipeline_tag: str | None = None
adapter_recommendation: str | None = None
@router.post("/queue", status_code=201)
def add_to_queue(req: QueueAddRequest) -> dict:
"""Add a model to the approval queue with status 'pending'."""
if _is_installed(req.repo_id):
raise HTTPException(409, f"{req.repo_id!r} is already installed")
if _is_queued(req.repo_id):
raise HTTPException(409, f"{req.repo_id!r} is already in the queue")
entry = {
"id": str(uuid4()),
"repo_id": req.repo_id,
"pipeline_tag": req.pipeline_tag,
"adapter_recommendation": req.adapter_recommendation,
"status": "pending",
"queued_at": datetime.now(timezone.utc).isoformat(),
}
records = _read_queue()
records.append(entry)
_write_queue(records)
return entry
# ── POST /queue/{id}/approve ───────────────────────────────────────────────────
@router.post("/queue/{entry_id}/approve")
def approve_queue_entry(entry_id: str) -> dict:
"""Approve a pending queue entry and start background download."""
entry = _get_queue_entry(entry_id)
if entry is None:
raise HTTPException(404, f"Queue entry {entry_id!r} not found")
if entry.get("status") != "pending":
raise HTTPException(409, f"Entry is not in pending state (current: {entry.get('status')!r})")
_update_queue_entry(entry_id, {"status": "downloading"})
thread = threading.Thread(
target=_run_download,
args=(entry_id, entry["repo_id"], entry.get("pipeline_tag"), entry.get("adapter_recommendation")),
daemon=True,
name=f"model-download-{entry_id}",
)
thread.start()
return {"ok": True}
# ── DELETE /queue/{id} ─────────────────────────────────────────────────────────
@router.delete("/queue/{entry_id}")
def dismiss_queue_entry(entry_id: str) -> dict:
"""Dismiss (soft-delete) a queue entry."""
entry = _get_queue_entry(entry_id)
if entry is None:
raise HTTPException(404, f"Queue entry {entry_id!r} not found")
_update_queue_entry(entry_id, {"status": "dismissed"})
return {"ok": True}
# ── GET /download/stream ───────────────────────────────────────────────────────
@router.get("/download/stream")
def download_stream() -> StreamingResponse:
"""SSE stream of download progress. Emits a single idle/terminal event when no download is active."""
def generate():
    prog = _download_progress
    if not prog.get("active"):
        # No download in flight: report the last terminal state once, else idle.
        if prog.get("error"):
            yield f"data: {json.dumps({'type': 'error', 'error': prog['error']})}\n\n"
        elif prog.get("done"):
            yield f"data: {json.dumps({'type': 'done', 'repo_id': prog.get('repo_id')})}\n\n"
        else:
            yield f"data: {json.dumps({'type': 'idle'})}\n\n"
        return
# Stream live progress
import time
while True:
p = dict(_download_progress)
if p.get("done"):
if p.get("error"):
yield f"data: {json.dumps({'type': 'error', 'error': p['error']})}\n\n"
else:
yield f"data: {json.dumps({'type': 'done', 'repo_id': p.get('repo_id')})}\n\n"
break
event = json.dumps({
"type": "progress",
"repo_id": p.get("repo_id"),
"downloaded_bytes": p.get("downloaded_bytes", 0),
"total_bytes": p.get("total_bytes", 0),
"pct": p.get("pct", 0.0),
})
yield f"data: {event}\n\n"
time.sleep(0.5)
return StreamingResponse(
generate(),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)
# ── GET /installed ─────────────────────────────────────────────────────────────
@router.get("/installed")
def list_installed() -> list[dict]:
"""Scan _MODELS_DIR and return info on each installed model."""
if not _MODELS_DIR.exists():
return []
results: list[dict] = []
for sub in _MODELS_DIR.iterdir():
if not sub.is_dir():
continue
has_training_info = (sub / "training_info.json").exists()
has_config = (sub / "config.json").exists()
has_model_info = (sub / "model_info.json").exists()
if not (has_training_info or has_config or has_model_info):
continue
model_type = "finetuned" if has_training_info else "downloaded"
# Compute directory size
size_bytes = sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())
# Load adapter/model_id from model_info.json or training_info.json
adapter: str | None = None
model_id: str | None = None
if has_model_info:
try:
info = json.loads((sub / "model_info.json").read_text(encoding="utf-8"))
adapter = info.get("adapter_recommendation")
model_id = info.get("repo_id")
except Exception:
pass
elif has_training_info:
try:
info = json.loads((sub / "training_info.json").read_text(encoding="utf-8"))
adapter = info.get("adapter")
model_id = info.get("base_model") or info.get("model_id")
except Exception:
pass
results.append({
"name": sub.name,
"path": str(sub),
"type": model_type,
"adapter": adapter,
"size_bytes": size_bytes,
"model_id": model_id,
})
return results
# ── DELETE /installed/{name} ───────────────────────────────────────────────────
@router.delete("/installed/{name}")
def delete_installed(name: str) -> dict:
"""Remove an installed model directory by name. Blocks path traversal."""
# Validate: single path component, no slashes or '..'
if "/" in name or "\\" in name or ".." in name or not name or name.startswith("."):
raise HTTPException(400, f"Invalid model name {name!r}: must be a single directory name with no path separators or '..'")
model_path = _MODELS_DIR / name
# Extra safety: confirm resolved path is inside _MODELS_DIR
try:
model_path.resolve().relative_to(_MODELS_DIR.resolve())
except ValueError:
raise HTTPException(400, f"Path traversal detected for name {name!r}") from None
if not model_path.exists():
raise HTTPException(404, f"Installed model {name!r} not found")
shutil.rmtree(model_path)
return {"ok": True}
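
The name guard in `delete_installed` can be exercised in isolation. A standalone sketch of the same predicate (the function name is mine, not part of the module):

```python
def is_safe_model_name(name: str) -> bool:
    """True only for a single, non-hidden path component.

    Mirrors delete_installed's check: non-empty, no leading dot,
    and no '/', '\\', or '..' anywhere in the name.
    """
    if not name or name.startswith("."):
        return False
    return not any(tok in name for tok in ("/", "\\", ".."))

# An HF-convention directory name passes; traversal attempts do not.
assert is_safe_model_name("BAAI--bge-reranker-base")
assert not is_safe_model_name("../etc")
assert not is_safe_model_name("models/evil")
```

The `resolve().relative_to()` check in the endpoint is still worth keeping as a second layer, since it catches anything the string test misses after symlink resolution.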


@@ -51,17 +51,26 @@ def _config_file() -> Path:
return _ROOT / "config" / "label_tool.yaml"
_DEFAULT_BENCH_RESULTS_DIR = "/Library/Development/CircuitForge/circuitforge-orch/scripts/bench_results"
def set_default_bench_results_dir(path: str) -> None:
"""Override the default bench_results_dir — used by tests to avoid real filesystem."""
global _DEFAULT_BENCH_RESULTS_DIR
_DEFAULT_BENCH_RESULTS_DIR = path
def _get_bench_results_dir() -> Path:
f = _config_file()
if not f.exists():
return Path("/nonexistent-bench-results")
try:
raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
except yaml.YAMLError as exc:
logger.warning("Failed to parse SFT config %s: %s", f, exc)
return Path("/nonexistent-bench-results")
d = raw.get("sft", {}).get("bench_results_dir", "")
return Path(d) if d else Path("/nonexistent-bench-results")
if f.exists():
try:
raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
d = raw.get("sft", {}).get("bench_results_dir", "")
if d:
return Path(d)
except yaml.YAMLError as exc:
logger.warning("Failed to parse SFT config %s: %s", f, exc)
return Path(_DEFAULT_BENCH_RESULTS_DIR)
def _candidates_file() -> Path:
@@ -151,10 +160,21 @@ def get_queue(page: int = 1, per_page: int = 20):
# ── POST /submit ───────────────────────────────────────────────────────────
FailureCategory = Literal[
"scoring_artifact",
"style_violation",
"partial_answer",
"wrong_answer",
"format_error",
"hallucination",
]
class SubmitRequest(BaseModel):
id: str
action: Literal["correct", "discard", "flag"]
corrected_response: str | None = None
failure_category: FailureCategory | None = None
@router.post("/submit")
@@ -174,7 +194,12 @@ def post_submit(req: SubmitRequest):
raise HTTPException(409, f"Record is not in needs_review state (current: {record.get('status')})")
if req.action == "correct":
records[idx] = {**record, "status": "approved", "corrected_response": req.corrected_response}
records[idx] = {
**record,
"status": "approved",
"corrected_response": req.corrected_response,
"failure_category": req.failure_category,
}
_write_candidates(records)
append_jsonl(_approved_file(), records[idx])
elif req.action == "discard":


@@ -26,3 +26,66 @@ max_per_account: 500
# produced by circuitforge-orch's benchmark harness.
sft:
bench_results_dir: /path/to/circuitforge-orch/scripts/bench_results
# cf-orch integration — LLM benchmark harness via cf-orch coordinator.
# All keys here override the corresponding environment variables.
# Omit any key to fall back to the env var (see .env.example).
cforch:
# Path to cf-orch's benchmark.py script
bench_script: /path/to/circuitforge-orch/scripts/benchmark.py
# Task and model definition files (yaml)
bench_tasks: /path/to/circuitforge-orch/scripts/bench_tasks.yaml
bench_models: /path/to/circuitforge-orch/scripts/bench_models.yaml
# Where benchmark results are written (also used for SFT candidate discovery)
results_dir: /path/to/circuitforge-orch/scripts/bench_results
# Python interpreter with cf-orch installed
python_bin: /devl/miniconda3/envs/cf/bin/python
# Connection config — override env vars CF_ORCH_URL / CF_LICENSE_KEY / OLLAMA_HOST
# coordinator_url: http://localhost:7700
# license_key: CFG-AVCT-xxxx-xxxx-xxxx
# ollama_url: http://localhost:11434
# ollama_model: llama3.2:3b
# Imitate tab — pull real samples from sibling CF product APIs and run them
# through local LLMs to build a corrections dataset.
# ollama_url defaults to cforch.ollama_url if omitted here.
imitate:
ollama_url: http://localhost:11434 # optional — falls back to cforch.ollama_url
products:
- id: peregrine
name: Peregrine
icon: "🦅"
description: Job search assistant
base_url: http://localhost:8502
sample_endpoint: /api/jobs
text_fields: [title, description]
prompt_template: "Analyze this job listing and identify key requirements:\n\n{text}"
- id: kiwi
name: Kiwi
icon: "🥝"
description: Pantry tracker
base_url: http://localhost:8511
sample_endpoint: /api/inventory
text_fields: [name, category, notes]
prompt_template: "Describe this pantry item and estimate how best to use it:\n\n{text}"
- id: snipe
name: Snipe
icon: "🎯"
description: eBay trust scoring
base_url: http://localhost:8509
sample_endpoint: /api/listings
text_fields: [title, description, seller_info]
prompt_template: "Evaluate the trustworthiness of this listing and flag any red flags:\n\n{text}"
- id: osprey
name: Osprey
icon: "📞"
description: Gov't hold-line automation
base_url: http://localhost:8520
sample_endpoint: /api/calls/recent
text_fields: [agency, issue, notes]
prompt_template: "Draft a concise summary of this government call record:\n\n{text}"
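
Each product's `text_fields` and `prompt_template` drive prompt construction: the named fields are pulled from the fetched sample and substituted into `{text}`. A minimal sketch of that rendering (the helper name and the newline-join rule are assumptions, not the app's exact code):

```python
def build_prompt(template: str, item: dict, text_fields: list[str]) -> str:
    # Concatenate the configured fields, skipping any the sample lacks,
    # then substitute the result into the {text} placeholder.
    text = "\n".join(str(item[f]) for f in text_fields if item.get(f) is not None)
    return template.format(text=text)

item = {"title": "Engineer", "description": "Build things", "extra": "ignored"}
prompt = build_prompt(
    "Analyze this job listing and identify key requirements:\n\n{text}",
    item,
    ["title", "description"],
)
# prompt ends with "Engineer\nBuild things"; "extra" is never included
```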


@@ -22,5 +22,8 @@ dependencies:
# Optional: BGE reranker adapter
# - FlagEmbedding
# CircuitForge shared core (LLM router, tier system, config)
- circuitforge-core>=0.9.0
# Dev
- pytest>=8.0

tests/test_cforch.py Normal file

@@ -0,0 +1,369 @@
"""Tests for app/cforch.py — /api/cforch/* endpoints."""
from __future__ import annotations
import json
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
import yaml
from fastapi.testclient import TestClient
# ── Fixtures ───────────────────────────────────────────────────────────────────
@pytest.fixture(autouse=True)
def reset_cforch_globals(tmp_path):
"""Redirect _CONFIG_DIR to tmp_path and reset running-state globals."""
from app import cforch as cforch_module
prev_config_dir = cforch_module._CONFIG_DIR
prev_running = cforch_module._BENCH_RUNNING
prev_proc = cforch_module._bench_proc
cforch_module.set_config_dir(tmp_path)
cforch_module._BENCH_RUNNING = False
cforch_module._bench_proc = None
yield tmp_path
cforch_module.set_config_dir(prev_config_dir)
cforch_module._BENCH_RUNNING = prev_running
cforch_module._bench_proc = prev_proc
@pytest.fixture
def client():
from app.api import app
return TestClient(app)
@pytest.fixture
def config_dir(reset_cforch_globals):
"""Return the tmp config dir (already set as _CONFIG_DIR)."""
return reset_cforch_globals
def _write_config(config_dir: Path, cforch_cfg: dict) -> None:
"""Write a label_tool.yaml with the given cforch block into config_dir."""
cfg = {"cforch": cforch_cfg}
(config_dir / "label_tool.yaml").write_text(
yaml.dump(cfg), encoding="utf-8"
)
def _write_tasks_yaml(path: Path, tasks: list[dict]) -> None:
path.write_text(yaml.dump({"tasks": tasks}), encoding="utf-8")
def _write_models_yaml(path: Path, models: list[dict]) -> None:
path.write_text(yaml.dump({"models": models}), encoding="utf-8")
# ── GET /tasks ─────────────────────────────────────────────────────────────────
def test_tasks_returns_empty_when_not_configured(client):
"""No config file present — endpoint returns empty lists."""
r = client.get("/api/cforch/tasks")
assert r.status_code == 200
data = r.json()
assert data == {"tasks": [], "types": []}
def test_tasks_parses_yaml(client, config_dir, tmp_path):
tasks_file = tmp_path / "bench_tasks.yaml"
_write_tasks_yaml(tasks_file, [
{"id": "t1", "name": "Task One", "type": "instruction"},
{"id": "t2", "name": "Task Two", "type": "reasoning"},
])
_write_config(config_dir, {"bench_tasks": str(tasks_file)})
r = client.get("/api/cforch/tasks")
assert r.status_code == 200
data = r.json()
assert len(data["tasks"]) == 2
# TaskEntry now includes optional prompt/system fields (default "")
t1 = data["tasks"][0]
assert t1["id"] == "t1" and t1["name"] == "Task One" and t1["type"] == "instruction"
t2 = data["tasks"][1]
assert t2["id"] == "t2" and t2["name"] == "Task Two" and t2["type"] == "reasoning"
assert "instruction" in data["types"]
assert "reasoning" in data["types"]
def test_tasks_returns_types_deduplicated(client, config_dir, tmp_path):
"""Multiple tasks sharing a type — types list must not duplicate."""
tasks_file = tmp_path / "bench_tasks.yaml"
_write_tasks_yaml(tasks_file, [
{"id": "t1", "name": "A", "type": "instruction"},
{"id": "t2", "name": "B", "type": "instruction"},
{"id": "t3", "name": "C", "type": "reasoning"},
])
_write_config(config_dir, {"bench_tasks": str(tasks_file)})
r = client.get("/api/cforch/tasks")
data = r.json()
assert data["types"].count("instruction") == 1
assert len(data["types"]) == 2
# ── GET /models ────────────────────────────────────────────────────────────────
def test_models_returns_empty_when_not_configured(client):
"""No config file present — endpoint returns empty model list."""
r = client.get("/api/cforch/models")
assert r.status_code == 200
assert r.json() == {"models": []}
def test_models_parses_bench_models_yaml(client, config_dir, tmp_path):
models_file = tmp_path / "bench_models.yaml"
_write_models_yaml(models_file, [
{
"name": "llama3",
"id": "llama3:8b",
"service": "ollama",
"tags": ["fast", "small"],
"vram_estimate_mb": 6000,
}
])
_write_config(config_dir, {"bench_models": str(models_file)})
r = client.get("/api/cforch/models")
assert r.status_code == 200
data = r.json()
assert len(data["models"]) == 1
m = data["models"][0]
assert m["name"] == "llama3"
assert m["id"] == "llama3:8b"
assert m["service"] == "ollama"
assert m["tags"] == ["fast", "small"]
assert m["vram_estimate_mb"] == 6000
# ── GET /run ───────────────────────────────────────────────────────────────────
def test_run_returns_409_when_already_running(client):
"""If _BENCH_RUNNING is True, GET /run returns 409."""
from app import cforch as cforch_module
cforch_module._BENCH_RUNNING = True
r = client.get("/api/cforch/run")
assert r.status_code == 409
def test_run_returns_error_when_bench_script_not_configured(client):
"""No config at all — SSE stream contains an error event."""
r = client.get("/api/cforch/run")
assert r.status_code == 200
assert '"type": "error"' in r.text
assert "bench_script not configured" in r.text
def test_run_streams_progress_events(client, config_dir, tmp_path):
"""Mock subprocess — SSE stream emits progress events from stdout."""
bench_script = tmp_path / "fake_benchmark.py"
bench_script.write_text("# fake", encoding="utf-8")
tasks_file = tmp_path / "bench_tasks.yaml"
tasks_file.write_text(yaml.dump({"tasks": []}), encoding="utf-8")
models_file = tmp_path / "bench_models.yaml"
models_file.write_text(yaml.dump({"models": []}), encoding="utf-8")
results_dir = tmp_path / "results"
results_dir.mkdir()
_write_config(config_dir, {
"bench_script": str(bench_script),
"bench_tasks": str(tasks_file),
"bench_models": str(models_file),
"results_dir": str(results_dir),
"python_bin": "/usr/bin/python3",
})
mock_proc = MagicMock()
mock_proc.stdout = iter(["Running task 1\n", "Running task 2\n"])
mock_proc.returncode = 1 # non-zero so we don't need summary.json
def mock_wait():
pass
mock_proc.wait = mock_wait
with patch("app.cforch._subprocess.Popen", return_value=mock_proc):
r = client.get("/api/cforch/run")
assert r.status_code == 200
assert '"type": "progress"' in r.text
assert "Running task 1" in r.text
assert "Running task 2" in r.text
def test_run_emits_result_on_success(client, config_dir, tmp_path):
"""Mock subprocess exit 0 + write fake summary.json — stream emits result event."""
bench_script = tmp_path / "fake_benchmark.py"
bench_script.write_text("# fake", encoding="utf-8")
tasks_file = tmp_path / "bench_tasks.yaml"
tasks_file.write_text(yaml.dump({"tasks": []}), encoding="utf-8")
models_file = tmp_path / "bench_models.yaml"
models_file.write_text(yaml.dump({"models": []}), encoding="utf-8")
results_dir = tmp_path / "results"
run_dir = results_dir / "2026-04-08-120000"
run_dir.mkdir(parents=True)
summary_data = {"score": 0.92, "models_evaluated": 3}
(run_dir / "summary.json").write_text(json.dumps(summary_data), encoding="utf-8")
_write_config(config_dir, {
"bench_script": str(bench_script),
"bench_tasks": str(tasks_file),
"bench_models": str(models_file),
"results_dir": str(results_dir),
"python_bin": "/usr/bin/python3",
})
mock_proc = MagicMock()
mock_proc.stdout = iter([])
mock_proc.returncode = 0
mock_proc.wait = MagicMock()
with patch("app.cforch._subprocess.Popen", return_value=mock_proc):
r = client.get("/api/cforch/run")
assert r.status_code == 200
assert '"type": "result"' in r.text
assert '"score": 0.92' in r.text
assert '"type": "complete"' in r.text
# ── GET /results ───────────────────────────────────────────────────────────────
def test_results_returns_404_when_no_results(client):
"""No results_dir configured — endpoint returns 404."""
r = client.get("/api/cforch/results")
assert r.status_code == 404
def test_results_returns_latest_summary(client, config_dir, tmp_path):
"""Write fake results dir with one subdir containing summary.json."""
results_dir = tmp_path / "results"
run_dir = results_dir / "2026-04-08-150000"
run_dir.mkdir(parents=True)
summary_data = {"score": 0.88, "run": "test"}
(run_dir / "summary.json").write_text(json.dumps(summary_data), encoding="utf-8")
_write_config(config_dir, {"results_dir": str(results_dir)})
r = client.get("/api/cforch/results")
assert r.status_code == 200
data = r.json()
assert data["score"] == 0.88
assert data["run"] == "test"
# ── POST /cancel ───────────────────────────────────────────────────────────────
def test_cancel_returns_404_when_not_running(client):
"""POST /cancel when no benchmark running — returns 404."""
r = client.post("/api/cforch/cancel")
assert r.status_code == 404
def test_cancel_terminates_running_benchmark(client):
"""POST /cancel when benchmark is running — terminates proc and returns cancelled."""
from app import cforch as cforch_module
mock_proc = MagicMock()
cforch_module._BENCH_RUNNING = True
cforch_module._bench_proc = mock_proc
r = client.post("/api/cforch/cancel")
assert r.status_code == 200
assert r.json() == {"status": "cancelled"}
mock_proc.terminate.assert_called_once()
assert cforch_module._BENCH_RUNNING is False
assert cforch_module._bench_proc is None
# ── GET /config ────────────────────────────────────────────────────────────────
def test_config_returns_empty_when_no_yaml_no_env(client, monkeypatch):
"""No yaml, no env vars — all fields empty, license_key_set False."""
for key in ("CF_ORCH_URL", "CF_LICENSE_KEY", "OLLAMA_HOST", "OLLAMA_MODEL"):
monkeypatch.delenv(key, raising=False)
r = client.get("/api/cforch/config")
assert r.status_code == 200
data = r.json()
assert data["coordinator_url"] == ""
assert data["ollama_url"] == ""
assert data["license_key_set"] is False
def test_config_reads_env_vars_when_no_yaml(client, monkeypatch):
"""Env vars populate fields when label_tool.yaml has no cforch section."""
monkeypatch.setenv("CF_ORCH_URL", "http://orch.example.com:7700")
monkeypatch.setenv("CF_LICENSE_KEY", "CFG-AVCT-TEST-TEST-TEST")
monkeypatch.setenv("OLLAMA_HOST", "http://ollama.local:11434")
monkeypatch.setenv("OLLAMA_MODEL", "mistral:7b")
r = client.get("/api/cforch/config")
assert r.status_code == 200
data = r.json()
assert data["coordinator_url"] == "http://orch.example.com:7700"
assert data["ollama_url"] == "http://ollama.local:11434"
assert data["ollama_model"] == "mistral:7b"
assert data["license_key_set"] is True # set, but value not exposed
def test_config_yaml_overrides_env(client, config_dir, monkeypatch):
"""label_tool.yaml cforch values take priority over env vars."""
monkeypatch.setenv("CF_ORCH_URL", "http://env-orch:7700")
monkeypatch.setenv("OLLAMA_HOST", "http://env-ollama:11434")
_write_config(config_dir, {
"coordinator_url": "http://yaml-orch:7700",
"ollama_url": "http://yaml-ollama:11434",
})
r = client.get("/api/cforch/config")
assert r.status_code == 200
data = r.json()
assert data["coordinator_url"] == "http://yaml-orch:7700"
assert data["ollama_url"] == "http://yaml-ollama:11434"
assert data["source"] == "yaml+env"
def test_run_passes_license_key_env_to_subprocess(client, config_dir, tmp_path, monkeypatch):
"""CF_LICENSE_KEY must be forwarded to the benchmark subprocess env."""
monkeypatch.setenv("CF_LICENSE_KEY", "CFG-AVCT-ENV-ONLY-KEY")
bench_script = tmp_path / "benchmark.py"
bench_script.write_text("# stub", encoding="utf-8")
tasks_file = tmp_path / "bench_tasks.yaml"
tasks_file.write_text(yaml.dump({"tasks": []}), encoding="utf-8")
models_file = tmp_path / "bench_models.yaml"
models_file.write_text(yaml.dump({"models": []}), encoding="utf-8")
_write_config(config_dir, {
"bench_script": str(bench_script),
"bench_tasks": str(tasks_file),
"bench_models": str(models_file),
"results_dir": str(tmp_path / "results"),
"python_bin": "/usr/bin/python3",
})
captured_env: dict = {}
def fake_popen(cmd, **kwargs):
captured_env.update(kwargs.get("env", {}))
mock = MagicMock()
mock.stdout = iter([])
mock.returncode = 0
mock.wait = MagicMock()
return mock
with patch("app.cforch._subprocess.Popen", side_effect=fake_popen):
client.get("/api/cforch/run")
assert captured_env.get("CF_LICENSE_KEY") == "CFG-AVCT-ENV-ONLY-KEY"

tests/test_imitate.py Normal file

@@ -0,0 +1,242 @@
"""Tests for app/imitate.py — product registry, sample extraction, corrections push."""
from __future__ import annotations
import json
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from fastapi.testclient import TestClient
from app.api import app
from app import imitate as _imitate_module
# ── Fixtures ───────────────────────────────────────────────────────────────────
@pytest.fixture(autouse=True)
def reset_module_globals(tmp_path):
"""Reset module-level config + data dir globals after each test."""
orig_cfg = _imitate_module._CONFIG_DIR
orig_data = _imitate_module._DATA_DIR
yield
_imitate_module._CONFIG_DIR = orig_cfg
_imitate_module._DATA_DIR = orig_data
@pytest.fixture()
def config_dir(tmp_path) -> Path:
_imitate_module.set_config_dir(tmp_path)
return tmp_path
@pytest.fixture()
def data_dir(tmp_path) -> Path:
_imitate_module.set_data_dir(tmp_path)
return tmp_path
@pytest.fixture()
def cfg_with_products(config_dir: Path) -> Path:
"""Write a label_tool.yaml with two products."""
(config_dir / "label_tool.yaml").write_text(
"""
imitate:
ollama_url: http://localhost:11434
products:
- id: peregrine
name: Peregrine
icon: "🦅"
description: Job search assistant
base_url: http://peregrine.local
sample_endpoint: /api/jobs
text_fields: [title, description]
prompt_template: "Analyze: {text}"
- id: kiwi
name: Kiwi
icon: "🥝"
description: Pantry tracker
base_url: http://kiwi.local
sample_endpoint: /api/inventory
text_fields: [name, notes]
prompt_template: "Describe: {text}"
"""
)
return config_dir
@pytest.fixture()
def client() -> TestClient:
return TestClient(app, raise_server_exceptions=True)
# ── GET /products ──────────────────────────────────────────────────────────────
def test_products_empty_when_no_config(config_dir, client):
"""Returns empty list when label_tool.yaml has no imitate section."""
(config_dir / "label_tool.yaml").write_text("accounts: []\n")
resp = client.get("/api/imitate/products")
assert resp.status_code == 200
assert resp.json()["products"] == []
def test_products_listed(cfg_with_products, client):
"""All configured products are returned with expected fields."""
with patch.object(_imitate_module, "_is_online", return_value=True):
resp = client.get("/api/imitate/products")
assert resp.status_code == 200
products = resp.json()["products"]
assert len(products) == 2
ids = {p["id"] for p in products}
assert ids == {"peregrine", "kiwi"}
peregrine = next(p for p in products if p["id"] == "peregrine")
assert peregrine["name"] == "Peregrine"
assert peregrine["icon"] == "🦅"
assert peregrine["online"] is True
def test_products_offline_when_unreachable(cfg_with_products, client):
"""Products with unreachable base_url are marked offline."""
with patch.object(_imitate_module, "_is_online", return_value=False):
resp = client.get("/api/imitate/products")
assert all(not p["online"] for p in resp.json()["products"])
# ── GET /products/{id}/sample ─────────────────────────────────────────────────
def test_sample_unknown_product(cfg_with_products, client):
"""Returns 404 for a product id not in config."""
resp = client.get("/api/imitate/products/nonexistent/sample")
assert resp.status_code == 404
def test_sample_fetched_from_list(cfg_with_products, client):
"""Extracts first item from a list API response."""
fake_api = [
{"title": "Engineer", "description": "Build things"},
{"title": "Other", "description": "Ignore me"},
]
with patch.object(_imitate_module, "_http_get_json", return_value=fake_api):
resp = client.get("/api/imitate/products/peregrine/sample")
assert resp.status_code == 200
body = resp.json()
assert "Engineer" in body["text"]
assert "Build things" in body["text"]
assert "Analyze:" in body["prompt"]
def test_sample_fetched_from_dict_with_items_key(cfg_with_products, client):
"""Extracts from a wrapper dict with a recognised list key."""
fake_api = {"items": [{"title": "Wrapped Job", "description": "In a wrapper"}]}
with patch.object(_imitate_module, "_http_get_json", return_value=fake_api):
resp = client.get("/api/imitate/products/peregrine/sample")
assert resp.status_code == 200
assert "Wrapped Job" in resp.json()["text"]
def test_sample_503_when_api_unreachable(cfg_with_products, client):
"""Returns 503 when the product API is not reachable."""
from urllib.error import URLError
with patch.object(_imitate_module, "_http_get_json", side_effect=URLError("refused")):
resp = client.get("/api/imitate/products/peregrine/sample")
assert resp.status_code == 503
def test_sample_404_on_empty_list(cfg_with_products, client):
"""Returns 404 when product API returns an empty list."""
with patch.object(_imitate_module, "_http_get_json", return_value=[]):
resp = client.get("/api/imitate/products/peregrine/sample")
assert resp.status_code == 404
# ── POST /push-corrections ─────────────────────────────────────────────────────
def test_push_corrections_appends_jsonl(cfg_with_products, data_dir, client):
"""Successful push writes records to sft_candidates.jsonl."""
payload = {
"product_id": "peregrine",
"prompt": "Analyze this job:",
"results": [
{"model": "qwen2.5:0.5b", "response": "It's a good job.", "elapsed_ms": 800, "error": None},
{"model": "llama3.1:8b", "response": "Strong candidate.", "elapsed_ms": 1500, "error": None},
],
}
resp = client.post("/api/imitate/push-corrections", json=payload)
assert resp.status_code == 200
assert resp.json()["pushed"] == 2
candidates = (data_dir / "sft_candidates.jsonl").read_text().splitlines()
assert len(candidates) == 2
for line in candidates:
record = json.loads(line)
assert record["source"] == "imitate"
assert record["product_id"] == "peregrine"
assert record["status"] == "pending"
assert record["prompt_messages"][0]["role"] == "user"
def test_push_corrections_skips_errors(cfg_with_products, data_dir, client):
"""Results with errors are not written to the corrections file."""
payload = {
"product_id": "peregrine",
"prompt": "Analyze:",
"results": [
{"model": "good-model", "response": "Good answer.", "elapsed_ms": 500, "error": None},
{"model": "bad-model", "response": "", "elapsed_ms": 0, "error": "connection refused"},
],
}
resp = client.post("/api/imitate/push-corrections", json=payload)
assert resp.status_code == 200
assert resp.json()["pushed"] == 1
def test_push_corrections_empty_prompt_422(cfg_with_products, data_dir, client):
"""Empty prompt returns 422."""
payload = {
"product_id": "peregrine",
"prompt": " ",
"results": [{"model": "m", "response": "r", "elapsed_ms": 1, "error": None}],
}
resp = client.post("/api/imitate/push-corrections", json=payload)
assert resp.status_code == 422
def test_push_corrections_all_errors_422(cfg_with_products, data_dir, client):
"""422 when every result has an error (nothing to push)."""
payload = {
"product_id": "peregrine",
"prompt": "Analyze:",
"results": [
{"model": "m", "response": "", "elapsed_ms": 0, "error": "timed out"},
],
}
resp = client.post("/api/imitate/push-corrections", json=payload)
assert resp.status_code == 422
# ── _extract_sample helper ─────────────────────────────────────────────────────
def test_extract_sample_list():
result = _imitate_module._extract_sample(
[{"title": "A", "description": "B"}],
text_fields=["title", "description"],
)
assert "A" in result["text"]
assert "B" in result["text"]
def test_extract_sample_empty_list():
result = _imitate_module._extract_sample([], text_fields=["title"])
assert result == {}
def test_extract_sample_respects_index():
items = [{"title": "First"}, {"title": "Second"}]
result = _imitate_module._extract_sample(items, ["title"], sample_index=1)
assert "Second" in result["text"]
def test_extract_sample_clamps_index():
items = [{"title": "Only"}]
result = _imitate_module._extract_sample(items, ["title"], sample_index=99)
assert "Only" in result["text"]

tests/test_models.py Normal file

@@ -0,0 +1,402 @@
"""Tests for app/models.py — /api/models/* endpoints."""
from __future__ import annotations
import json
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from fastapi.testclient import TestClient
# ── Fixtures ───────────────────────────────────────────────────────────────────
@pytest.fixture(autouse=True)
def reset_models_globals(tmp_path):
"""Redirect module-level dirs to tmp_path and reset download progress."""
from app import models as models_module
prev_models = models_module._MODELS_DIR
prev_queue = models_module._QUEUE_DIR
prev_progress = dict(models_module._download_progress)
models_dir = tmp_path / "models"
queue_dir = tmp_path / "data"
models_dir.mkdir()
queue_dir.mkdir()
models_module.set_models_dir(models_dir)
models_module.set_queue_dir(queue_dir)
models_module._download_progress = {}
yield
models_module.set_models_dir(prev_models)
models_module.set_queue_dir(prev_queue)
models_module._download_progress = prev_progress
@pytest.fixture
def client():
from app.api import app
return TestClient(app)
def _make_hf_response(repo_id: str = "org/model", pipeline_tag: str = "text-classification") -> dict:
"""Minimal HF API response payload."""
return {
"modelId": repo_id,
"pipeline_tag": pipeline_tag,
"tags": ["pytorch", pipeline_tag],
"downloads": 42000,
"siblings": [
{"rfilename": "pytorch_model.bin", "size": 500_000_000},
],
"cardData": {"description": "A test model description."},
}
def _queue_one(client, repo_id: str = "org/model") -> dict:
"""Helper: POST to /queue and return the created entry."""
r = client.post("/api/models/queue", json={
"repo_id": repo_id,
"pipeline_tag": "text-classification",
"adapter_recommendation": "ZeroShotAdapter",
})
assert r.status_code == 201, r.text
return r.json()
# ── GET /lookup ────────────────────────────────────────────────────────────────
def test_lookup_invalid_repo_id_returns_422_no_slash(client):
"""repo_id without a '/' should be rejected with 422."""
r = client.get("/api/models/lookup", params={"repo_id": "noslash"})
assert r.status_code == 422
def test_lookup_invalid_repo_id_returns_422_whitespace(client):
"""repo_id containing whitespace should be rejected with 422."""
r = client.get("/api/models/lookup", params={"repo_id": "org/model name"})
assert r.status_code == 422
def test_lookup_hf_404_returns_404(client):
"""HF API returning 404 should surface as HTTP 404."""
mock_resp = MagicMock()
mock_resp.status_code = 404
with patch("app.models.httpx.get", return_value=mock_resp):
r = client.get("/api/models/lookup", params={"repo_id": "org/nonexistent"})
assert r.status_code == 404
def test_lookup_hf_network_error_returns_502(client):
"""Network error reaching HF API should return 502."""
import httpx as _httpx
with patch("app.models.httpx.get", side_effect=_httpx.RequestError("timeout")):
r = client.get("/api/models/lookup", params={"repo_id": "org/model"})
assert r.status_code == 502
def test_lookup_returns_correct_shape(client):
"""Successful lookup returns all required fields."""
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = _make_hf_response("org/mymodel", "text-classification")
with patch("app.models.httpx.get", return_value=mock_resp):
r = client.get("/api/models/lookup", params={"repo_id": "org/mymodel"})
assert r.status_code == 200
data = r.json()
assert data["repo_id"] == "org/mymodel"
assert data["pipeline_tag"] == "text-classification"
assert data["adapter_recommendation"] == "ZeroShotAdapter"
assert data["model_size_bytes"] == 500_000_000
assert data["downloads"] == 42000
assert data["already_installed"] is False
assert data["already_queued"] is False
def test_lookup_unknown_pipeline_tag_returns_null_adapter(client):
"""An unrecognised pipeline_tag yields adapter_recommendation=null."""
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = _make_hf_response("org/m", "audio-classification")
with patch("app.models.httpx.get", return_value=mock_resp):
r = client.get("/api/models/lookup", params={"repo_id": "org/m"})
assert r.status_code == 200
assert r.json()["adapter_recommendation"] is None
def test_lookup_already_queued_flag(client):
"""already_queued is True when repo_id is in the pending queue."""
_queue_one(client, "org/queued-model")
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = _make_hf_response("org/queued-model")
with patch("app.models.httpx.get", return_value=mock_resp):
r = client.get("/api/models/lookup", params={"repo_id": "org/queued-model"})
assert r.status_code == 200
assert r.json()["already_queued"] is True
# ── GET /queue ─────────────────────────────────────────────────────────────────
def test_queue_empty_initially(client):
r = client.get("/api/models/queue")
assert r.status_code == 200
assert r.json() == []
def test_queue_add_and_list(client):
"""POST then GET /queue should return the entry."""
entry = _queue_one(client, "org/my-model")
r = client.get("/api/models/queue")
assert r.status_code == 200
items = r.json()
assert len(items) == 1
assert items[0]["repo_id"] == "org/my-model"
assert items[0]["status"] == "pending"
assert items[0]["id"] == entry["id"]
def test_queue_add_returns_entry_fields(client):
"""POST /queue returns an entry with all expected fields."""
entry = _queue_one(client)
assert "id" in entry
assert "queued_at" in entry
assert entry["status"] == "pending"
assert entry["pipeline_tag"] == "text-classification"
assert entry["adapter_recommendation"] == "ZeroShotAdapter"
# ── POST /queue — 409 duplicate ────────────────────────────────────────────────
def test_queue_duplicate_returns_409(client):
"""Posting the same repo_id twice should return 409."""
_queue_one(client, "org/dup-model")
r = client.post("/api/models/queue", json={
"repo_id": "org/dup-model",
"pipeline_tag": "text-classification",
"adapter_recommendation": "ZeroShotAdapter",
})
assert r.status_code == 409
def test_queue_multiple_different_models(client):
"""Multiple distinct repo_ids should all be accepted."""
_queue_one(client, "org/model-a")
_queue_one(client, "org/model-b")
_queue_one(client, "org/model-c")
r = client.get("/api/models/queue")
assert r.status_code == 200
assert len(r.json()) == 3
# ── DELETE /queue/{id} — dismiss ──────────────────────────────────────────────
def test_queue_dismiss(client):
"""DELETE /queue/{id} sets status=dismissed; entry not returned by GET /queue."""
entry = _queue_one(client)
entry_id = entry["id"]
r = client.delete(f"/api/models/queue/{entry_id}")
assert r.status_code == 200
assert r.json() == {"ok": True}
r2 = client.get("/api/models/queue")
assert r2.status_code == 200
assert r2.json() == []
def test_queue_dismiss_nonexistent_returns_404(client):
"""DELETE /queue/{id} with unknown id returns 404."""
r = client.delete("/api/models/queue/does-not-exist")
assert r.status_code == 404
def test_queue_dismiss_allows_re_queue(client):
"""After dismissal the same repo_id can be queued again."""
entry = _queue_one(client, "org/requeue-model")
client.delete(f"/api/models/queue/{entry['id']}")
r = client.post("/api/models/queue", json={
"repo_id": "org/requeue-model",
"pipeline_tag": None,
"adapter_recommendation": None,
})
assert r.status_code == 201
# ── POST /queue/{id}/approve ───────────────────────────────────────────────────
def test_approve_nonexistent_returns_404(client):
"""Approving an unknown id returns 404."""
r = client.post("/api/models/queue/ghost-id/approve")
assert r.status_code == 404
def test_approve_non_pending_returns_409(client):
"""Approving an entry that is not in 'pending' state returns 409."""
from app import models as models_module
entry = _queue_one(client)
# Manually flip status to 'failed'
models_module._update_queue_entry(entry["id"], {"status": "failed"})
r = client.post(f"/api/models/queue/{entry['id']}/approve")
assert r.status_code == 409
def test_approve_starts_download_and_returns_ok(client):
"""Approving a pending entry returns {ok: true} and starts a background thread."""
    import time
    entry = _queue_one(client)
    # Patch snapshot_download so the background thread doesn't hit the network.
    with patch("app.models.snapshot_download", side_effect=lambda **kwargs: None):
r = client.post(f"/api/models/queue/{entry['id']}/approve")
assert r.status_code == 200
assert r.json() == {"ok": True}
# Give the background thread a moment to complete while snapshot_download is patched
time.sleep(0.3)
# Queue entry status should have moved to 'downloading' (or 'ready' if fast)
from app import models as models_module
updated = models_module._get_queue_entry(entry["id"])
assert updated is not None, "Queue entry not found — thread may have run after fixture teardown"
assert updated["status"] in ("downloading", "ready", "failed")
# ── GET /download/stream ───────────────────────────────────────────────────────
def test_download_stream_idle_when_no_download(client):
"""GET /download/stream returns a single idle event when nothing is downloading."""
r = client.get("/api/models/download/stream")
assert r.status_code == 200
# SSE body should contain the idle event
assert "idle" in r.text
# ── GET /installed ─────────────────────────────────────────────────────────────
def test_installed_empty(client):
"""GET /installed returns [] when models dir is empty."""
r = client.get("/api/models/installed")
assert r.status_code == 200
assert r.json() == []
def test_installed_detects_downloaded_model(client, tmp_path):
"""A subdir with config.json is surfaced as type='downloaded'."""
from app import models as models_module
model_dir = models_module._MODELS_DIR / "org--mymodel"
model_dir.mkdir()
(model_dir / "config.json").write_text(json.dumps({"model_type": "bert"}), encoding="utf-8")
(model_dir / "model_info.json").write_text(
json.dumps({"repo_id": "org/mymodel", "adapter_recommendation": "ZeroShotAdapter"}),
encoding="utf-8",
)
r = client.get("/api/models/installed")
assert r.status_code == 200
items = r.json()
assert len(items) == 1
assert items[0]["type"] == "downloaded"
assert items[0]["name"] == "org--mymodel"
assert items[0]["adapter"] == "ZeroShotAdapter"
assert items[0]["model_id"] == "org/mymodel"
def test_installed_detects_finetuned_model(client):
"""A subdir with training_info.json is surfaced as type='finetuned'."""
from app import models as models_module
model_dir = models_module._MODELS_DIR / "my-finetuned"
model_dir.mkdir()
(model_dir / "training_info.json").write_text(
json.dumps({"base_model": "org/base", "epochs": 5}), encoding="utf-8"
)
r = client.get("/api/models/installed")
assert r.status_code == 200
items = r.json()
assert len(items) == 1
assert items[0]["type"] == "finetuned"
assert items[0]["name"] == "my-finetuned"
# ── DELETE /installed/{name} ───────────────────────────────────────────────────
def test_delete_installed_removes_directory(client):
"""DELETE /installed/{name} removes the directory and returns {ok: true}."""
from app import models as models_module
model_dir = models_module._MODELS_DIR / "org--removeme"
model_dir.mkdir()
(model_dir / "config.json").write_text("{}", encoding="utf-8")
r = client.delete("/api/models/installed/org--removeme")
assert r.status_code == 200
assert r.json() == {"ok": True}
assert not model_dir.exists()
def test_delete_installed_not_found_returns_404(client):
r = client.delete("/api/models/installed/does-not-exist")
assert r.status_code == 404
def test_delete_installed_path_traversal_blocked(client):
"""DELETE /installed/../../etc must be blocked.
Path traversal normalises to a different URL (/api/etc); if web/dist exists
the StaticFiles mount intercepts it and returns 405 (GET/HEAD only).
"""
r = client.delete("/api/models/installed/../../etc")
assert r.status_code in (400, 404, 405, 422)
def test_delete_installed_dotdot_name_blocked(client):
"""A name containing '..' in any form must be rejected."""
r = client.delete("/api/models/installed/..%2F..%2Fetc")
assert r.status_code in (400, 404, 405, 422)
def test_delete_installed_name_with_slash_blocked(client):
"""A name containing a literal '/' after URL decoding must be rejected."""
    # A second '/' would be parsed by the router as a new path segment, so
    # exercise the validation by calling the endpoint handler directly.
    from fastapi import HTTPException
    from app.models import delete_installed
    with pytest.raises(HTTPException) as exc_info:
        delete_installed("org/traversal")
    assert exc_info.value.status_code in (400, 404)


@@ -8,13 +8,16 @@ from pathlib import Path
@pytest.fixture(autouse=True)
def reset_sft_globals(tmp_path):
from app import sft as sft_module
_prev_data = sft_module._SFT_DATA_DIR
_prev_cfg = sft_module._SFT_CONFIG_DIR
_prev_default = sft_module._DEFAULT_BENCH_RESULTS_DIR
sft_module.set_sft_data_dir(tmp_path)
sft_module.set_sft_config_dir(tmp_path)
sft_module.set_default_bench_results_dir(str(tmp_path / "bench_results"))
yield
sft_module.set_sft_data_dir(_prev_data)
sft_module.set_sft_config_dir(_prev_cfg)
sft_module.set_default_bench_results_dir(_prev_default)
@pytest.fixture
@@ -232,6 +235,41 @@ def test_submit_already_approved_returns_409(client, tmp_path):
assert r.status_code == 409
def test_submit_correct_stores_failure_category(client, tmp_path):
_populate_candidates(tmp_path, [_make_record("a")])
r = client.post("/api/sft/submit", json={
"id": "a", "action": "correct",
"corrected_response": "def add(a, b): return a + b",
"failure_category": "style_violation",
})
assert r.status_code == 200
from app import sft as sft_module
records = sft_module._read_candidates()
assert records[0]["failure_category"] == "style_violation"
def test_submit_correct_null_failure_category(client, tmp_path):
_populate_candidates(tmp_path, [_make_record("a")])
r = client.post("/api/sft/submit", json={
"id": "a", "action": "correct",
"corrected_response": "def add(a, b): return a + b",
})
assert r.status_code == 200
from app import sft as sft_module
records = sft_module._read_candidates()
assert records[0]["failure_category"] is None
def test_submit_invalid_failure_category_returns_422(client, tmp_path):
_populate_candidates(tmp_path, [_make_record("a")])
r = client.post("/api/sft/submit", json={
"id": "a", "action": "correct",
"corrected_response": "def add(a, b): return a + b",
"failure_category": "nonsense",
})
assert r.status_code == 422
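The 422 above implies a validation step over a fixed category set; the set mirrors the frontend's `SftFailureCategory` union, while the helper name and ValueError-to-422 mapping are assumptions:

```python
# Assumed failure-category validation behind /api/sft/submit.
FAILURE_CATEGORIES = {
    "scoring_artifact", "style_violation", "partial_answer",
    "wrong_answer", "format_error", "hallucination",
}

def validate_failure_category(value):
    """None is allowed (the category is optional); unknown strings are rejected."""
    if value is not None and value not in FAILURE_CATEGORIES:
        raise ValueError("invalid failure_category")  # surfaced as HTTP 422
    return value
```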
# ── /api/sft/undo ────────────────────────────────────────────────────────────
def test_undo_restores_discarded_to_needs_review(client, tmp_path):


@@ -66,6 +66,8 @@ const navItems = [
{ path: '/fetch', icon: '📥', label: 'Fetch' },
{ path: '/stats', icon: '📊', label: 'Stats' },
{ path: '/benchmark', icon: '🏁', label: 'Benchmark' },
{ path: '/models', icon: '🤗', label: 'Models' },
{ path: '/imitate', icon: '🪞', label: 'Imitate' },
{ path: '/corrections', icon: '✍️', label: 'Corrections' },
{ path: '/settings', icon: '⚙️', label: 'Settings' },
]


@@ -13,6 +13,7 @@ const LOW_QUALITY_ITEM: SftQueueItem = {
model_response: 'def add(a, b): return a - b',
corrected_response: null, quality_score: 0.2,
failure_reason: 'pattern_match: 0/2 matched',
failure_category: null,
task_id: 'code-fn', task_type: 'code', task_name: 'Code: Write a function',
model_id: 'Qwen/Qwen2.5-3B', model_name: 'Qwen2.5-3B',
node_id: 'heimdall', gpu_id: 0, tokens_per_sec: 38.4,
@@ -68,15 +69,17 @@ describe('SftCard', () => {
expect(w.emitted('correct')).toBeTruthy()
})
it('clicking Discard button then confirming emits discard', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="discard-btn"]').trigger('click')
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
expect(w.emitted('discard')).toBeTruthy()
})
it('clicking Flag Model button then confirming emits flag', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="flag-btn"]').trigger('click')
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
expect(w.emitted('flag')).toBeTruthy()
})
@@ -95,4 +98,82 @@ describe('SftCard', () => {
const w = mount(SftCard, { props: { item } })
expect(w.find('.failure-reason').exists()).toBe(false)
})
// ── Failure category chip-group ───────────────────────────────────
it('failure category section hidden when not correcting and no pending action', () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(false)
})
it('failure category section shown when correcting prop is true', () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(true)
})
it('renders all six category chips when correcting', () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
const chips = w.findAll('.category-chip')
expect(chips).toHaveLength(6)
})
it('clicking a category chip selects it (adds active class)', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
const chip = w.find('[data-testid="category-chip-wrong_answer"]')
await chip.trigger('click')
expect(chip.classes()).toContain('category-chip--active')
})
it('clicking the active chip again deselects it', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
const chip = w.find('[data-testid="category-chip-hallucination"]')
await chip.trigger('click')
expect(chip.classes()).toContain('category-chip--active')
await chip.trigger('click')
expect(chip.classes()).not.toContain('category-chip--active')
})
it('only one chip can be active at a time', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
await w.find('[data-testid="category-chip-wrong_answer"]').trigger('click')
await w.find('[data-testid="category-chip-hallucination"]').trigger('click')
const active = w.findAll('.category-chip--active')
expect(active).toHaveLength(1)
expect(active[0].attributes('data-testid')).toBe('category-chip-hallucination')
})
it('clicking Discard shows pending action row with category section', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="discard-btn"]').trigger('click')
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(true)
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(true)
})
it('clicking Flag shows pending action row', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="flag-btn"]').trigger('click')
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(true)
})
it('confirming discard emits discard with null when no category selected', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="discard-btn"]').trigger('click')
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
expect(w.emitted('discard')).toBeTruthy()
expect(w.emitted('discard')![0]).toEqual([null])
})
it('confirming discard emits discard with selected category', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="discard-btn"]').trigger('click')
await w.find('[data-testid="category-chip-scoring_artifact"]').trigger('click')
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
expect(w.emitted('discard')![0]).toEqual(['scoring_artifact'])
})
it('cancelling pending action hides the pending row', async () => {
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
await w.find('[data-testid="discard-btn"]').trigger('click')
await w.find('[data-testid="cancel-pending-btn"]').trigger('click')
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(false)
})
})


@@ -57,21 +57,52 @@
<button
data-testid="discard-btn"
class="btn-discard"
@click="emitWithCategory('discard')"
> Discard</button>
<button
data-testid="flag-btn"
class="btn-flag"
@click="emitWithCategory('flag')"
> Flag Model</button>
</div>
<!-- Failure category selector (shown when correcting or acting) -->
<div
v-if="correcting || pendingAction"
class="failure-category-section"
data-testid="failure-category-section"
>
<p class="section-label">Failure category <span class="optional-label">(optional)</span></p>
<div class="category-chips" role="group" aria-label="Failure category">
<button
v-for="cat in FAILURE_CATEGORIES"
:key="cat.value"
type="button"
class="category-chip"
:class="{ 'category-chip--active': selectedCategory === cat.value }"
:aria-pressed="selectedCategory === cat.value || undefined"
:data-testid="'category-chip-' + cat.value"
@click="toggleCategory(cat.value)"
>{{ cat.label }}</button>
</div>
<!-- Pending discard/flag confirm row -->
<div v-if="pendingAction" class="pending-action-row" data-testid="pending-action-row">
<button class="btn-confirm" @click="confirmPendingAction" data-testid="confirm-pending-btn">
Confirm {{ pendingAction }}
</button>
<button class="btn-cancel-pending" @click="cancelPendingAction" data-testid="cancel-pending-btn">
Cancel
</button>
</div>
</div>
<!-- Correction area (shown when correcting = true) -->
<div v-if="correcting" data-testid="correction-area">
<SftCorrectionArea
ref="correctionAreaEl"
:described-by="'sft-failure-' + item.id"
@submit="handleSubmitCorrection"
@cancel="$emit('cancel-correction')"
/>
</div>
@@ -80,21 +111,32 @@
<script setup lang="ts">
import { ref, computed } from 'vue'
import type { SftQueueItem, SftFailureCategory } from '../stores/sft'
import SftCorrectionArea from './SftCorrectionArea.vue'
const props = defineProps<{ item: SftQueueItem; correcting?: boolean }>()
const emit = defineEmits<{
correct: []
discard: [category: SftFailureCategory | null]
flag: [category: SftFailureCategory | null]
'submit-correction': [text: string, category: SftFailureCategory | null]
'cancel-correction': []
}>()
const FAILURE_CATEGORIES: { value: SftFailureCategory; label: string }[] = [
{ value: 'scoring_artifact', label: 'Scoring artifact' },
{ value: 'style_violation', label: 'Style violation' },
{ value: 'partial_answer', label: 'Partial answer' },
{ value: 'wrong_answer', label: 'Wrong answer' },
{ value: 'format_error', label: 'Format error' },
{ value: 'hallucination', label: 'Hallucination' },
]
const promptExpanded = ref(false)
const correctionAreaEl = ref<InstanceType<typeof SftCorrectionArea> | null>(null)
const selectedCategory = ref<SftFailureCategory | null>(null)
const pendingAction = ref<'discard' | 'flag' | null>(null)
const qualityClass = computed(() => {
const s = props.item.quality_score
@@ -110,8 +152,34 @@ const qualityLabel = computed(() => {
return 'acceptable'
})
function toggleCategory(cat: SftFailureCategory) {
selectedCategory.value = selectedCategory.value === cat ? null : cat
}
function emitWithCategory(action: 'discard' | 'flag') {
pendingAction.value = action
}
function confirmPendingAction() {
if (!pendingAction.value) return
emit(pendingAction.value, selectedCategory.value)
pendingAction.value = null
selectedCategory.value = null
}
function cancelPendingAction() {
pendingAction.value = null
}
function handleSubmitCorrection(text: string) {
emit('submit-correction', text, selectedCategory.value)
selectedCategory.value = null
}
function resetCorrection() {
correctionAreaEl.value?.reset()
selectedCategory.value = null
pendingAction.value = null
}
defineExpose({ resetCorrection })
@@ -243,4 +311,83 @@ defineExpose({ resetCorrection })
.btn-flag { border-color: var(--color-warning); color: var(--color-warning); }
.btn-flag:hover { background: color-mix(in srgb, var(--color-warning) 10%, transparent); }
/* ── Failure category selector ─────────────────── */
.failure-category-section {
display: flex;
flex-direction: column;
gap: var(--space-2);
}
.optional-label {
font-size: 0.75rem;
font-weight: 400;
color: var(--color-text-muted);
}
.category-chips {
display: flex;
flex-wrap: wrap;
gap: var(--space-2);
}
.category-chip {
padding: var(--space-1) var(--space-3);
border-radius: var(--radius-full);
border: 1px solid var(--color-border);
background: var(--color-surface-alt);
color: var(--color-text-muted);
font-size: 0.78rem;
font-weight: 500;
cursor: pointer;
transition: background var(--transition), color var(--transition), border-color var(--transition);
}
.category-chip:hover {
border-color: var(--color-accent);
color: var(--color-accent);
background: var(--color-accent-light);
}
.category-chip--active {
background: var(--color-accent-light);
border-color: var(--color-accent);
color: var(--color-accent);
font-weight: 700;
}
.pending-action-row {
display: flex;
gap: var(--space-2);
margin-top: var(--space-1);
}
.btn-confirm {
padding: var(--space-1) var(--space-3);
border-radius: var(--radius-md);
border: 1px solid var(--color-accent);
background: var(--color-accent-light);
color: var(--color-accent);
font-size: 0.85rem;
font-weight: 600;
cursor: pointer;
}
.btn-confirm:hover {
background: color-mix(in srgb, var(--color-accent) 15%, transparent);
}
.btn-cancel-pending {
padding: var(--space-1) var(--space-3);
border-radius: var(--radius-md);
border: 1px solid var(--color-border);
background: none;
color: var(--color-text-muted);
font-size: 0.85rem;
cursor: pointer;
}
.btn-cancel-pending:hover {
background: var(--color-surface-alt);
}
</style>


@@ -7,6 +7,8 @@ const StatsView = () => import('../views/StatsView.vue')
const BenchmarkView = () => import('../views/BenchmarkView.vue')
const SettingsView = () => import('../views/SettingsView.vue')
const CorrectionsView = () => import('../views/CorrectionsView.vue')
const ModelsView = () => import('../views/ModelsView.vue')
const ImitateView = () => import('../views/ImitateView.vue')
export const router = createRouter({
history: createWebHashHistory(),
@@ -15,6 +17,8 @@ export const router = createRouter({
{ path: '/fetch', component: FetchView, meta: { title: 'Fetch' } },
{ path: '/stats', component: StatsView, meta: { title: 'Stats' } },
{ path: '/benchmark', component: BenchmarkView, meta: { title: 'Benchmark' } },
{ path: '/models', component: ModelsView, meta: { title: 'Models' } },
{ path: '/imitate', component: ImitateView, meta: { title: 'Imitate' } },
{ path: '/corrections', component: CorrectionsView, meta: { title: 'Corrections' } },
{ path: '/settings', component: SettingsView, meta: { title: 'Settings' } },
],


@@ -2,6 +2,14 @@
import { defineStore } from 'pinia'
import { computed, ref } from 'vue'
export type SftFailureCategory =
| 'scoring_artifact'
| 'style_violation'
| 'partial_answer'
| 'wrong_answer'
| 'format_error'
| 'hallucination'
export interface SftQueueItem {
id: string
source: 'cf-orch-benchmark'
@@ -13,6 +21,7 @@ export interface SftQueueItem {
corrected_response: string | null
quality_score: number // 0.0 to 1.0
failure_reason: string | null
failure_category: SftFailureCategory | null
task_id: string
task_type: string
task_name: string
@@ -26,6 +35,7 @@
export interface SftLastAction {
type: 'correct' | 'discard' | 'flag'
item: SftQueueItem
failure_category?: SftFailureCategory | null
}
export const useSftStore = defineStore('sft', () => {
@@ -39,8 +49,12 @@ export const useSftStore = defineStore('sft', () => {
queue.value.shift()
}
function setLastAction(type: SftLastAction['type'], item: SftQueueItem) {
lastAction.value = { type, item }
function setLastAction(
type: SftLastAction['type'],
item: SftQueueItem,
failure_category?: SftFailureCategory | null,
) {
lastAction.value = { type, item, failure_category }
}
function clearLastAction() {

File diff suppressed because it is too large


@@ -36,6 +36,7 @@
@flag="handleFlag"
@submit-correction="handleCorrect"
@cancel-correction="correcting = false"
ref="sftCardEl"
/>
</div>
</template>
@@ -67,6 +68,7 @@
<script setup lang="ts">
import { ref, onMounted } from 'vue'
import { useSftStore } from '../stores/sft'
import type { SftFailureCategory } from '../stores/sft'
import { useSftKeyboard } from '../composables/useSftKeyboard'
import SftCard from '../components/SftCard.vue'
@@ -76,6 +78,7 @@ const apiError = ref(false)
const correcting = ref(false)
const stats = ref<Record<string, any> | null>(null)
const exportUrl = '/api/sft/export'
const sftCardEl = ref<InstanceType<typeof SftCard> | null>(null)
useSftKeyboard({
onCorrect: () => { if (store.current && !correcting.value) correcting.value = true },
@@ -113,19 +116,21 @@ function startCorrection() {
correcting.value = true
}
async function handleCorrect(text: string) {
async function handleCorrect(text: string, category: SftFailureCategory | null = null) {
if (!store.current) return
const item = store.current
correcting.value = false
try {
const body: Record<string, unknown> = { id: item.id, action: 'correct', corrected_response: text }
if (category != null) body.failure_category = category
const res = await fetch('/api/sft/submit', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ id: item.id, action: 'correct', corrected_response: text }),
body: JSON.stringify(body),
})
if (!res.ok) throw new Error(`HTTP ${res.status}`)
store.removeCurrentFromQueue()
store.setLastAction('correct', item)
store.setLastAction('correct', item, category)
store.totalRemaining = Math.max(0, store.totalRemaining - 1)
fetchStats()
if (store.queue.length < 5) fetchBatch()
@@ -134,18 +139,20 @@ async function handleCorrect(text: string) {
}
}
async function handleDiscard() {
async function handleDiscard(category: SftFailureCategory | null = null) {
if (!store.current) return
const item = store.current
try {
const body: Record<string, unknown> = { id: item.id, action: 'discard' }
if (category != null) body.failure_category = category
const res = await fetch('/api/sft/submit', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ id: item.id, action: 'discard' }),
body: JSON.stringify(body),
})
if (!res.ok) throw new Error(`HTTP ${res.status}`)
store.removeCurrentFromQueue()
store.setLastAction('discard', item)
store.setLastAction('discard', item, category)
store.totalRemaining = Math.max(0, store.totalRemaining - 1)
fetchStats()
if (store.queue.length < 5) fetchBatch()
@@ -154,18 +161,20 @@ async function handleDiscard() {
}
}
async function handleFlag() {
async function handleFlag(category: SftFailureCategory | null = null) {
if (!store.current) return
const item = store.current
try {
const body: Record<string, unknown> = { id: item.id, action: 'flag' }
if (category != null) body.failure_category = category
const res = await fetch('/api/sft/submit', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ id: item.id, action: 'flag' }),
body: JSON.stringify(body),
})
if (!res.ok) throw new Error(`HTTP ${res.status}`)
store.removeCurrentFromQueue()
store.setLastAction('flag', item)
store.setLastAction('flag', item, category)
store.totalRemaining = Math.max(0, store.totalRemaining - 1)
fetchStats()
if (store.queue.length < 5) fetchBatch()


@@ -0,0 +1,898 @@
<template>
<div class="imitate-view">
<header class="bench-header">
<h1 class="page-title">🪞 Imitate</h1>
<p class="page-subtitle">Pull real samples from CF product APIs and compare LLM responses</p>
</header>
<!-- Step 1: Product selection -->
<section class="step-section">
<h2 class="step-heading">1. Select Product</h2>
<div v-if="productsLoading" class="picker-loading">Loading products</div>
<div v-else-if="products.length === 0" class="picker-empty">
No products configured. Add an <code>imitate:</code> section to
<code>config/label_tool.yaml</code>.
</div>
<div v-else class="product-grid">
<button
v-for="p in products"
:key="p.id"
class="product-card"
:class="{
selected: selectedProduct?.id === p.id,
offline: !p.online,
}"
:disabled="!p.online"
:title="p.online ? p.description : `${p.name} is offline`"
@click="selectProduct(p)"
>
<span class="product-icon">{{ p.icon }}</span>
<span class="product-name">{{ p.name }}</span>
<span class="product-status" :class="p.online ? 'status-on' : 'status-off'">
{{ p.online ? 'online' : 'offline' }}
</span>
</button>
</div>
</section>
<!-- Step 2: Sample + Prompt -->
<section v-if="selectedProduct" class="step-section">
<h2 class="step-heading">2. Sample &amp; Prompt</h2>
<div class="sample-toolbar">
<span class="sample-product-label">{{ selectedProduct.icon }} {{ selectedProduct.name }}</span>
<button class="btn-refresh" :disabled="sampleLoading" @click="fetchSample">
{{ sampleLoading ? '⏳ Fetching…' : '🔄 Refresh Sample' }}
</button>
<span v-if="sampleError" class="sample-error">{{ sampleError }}</span>
</div>
<div v-if="sampleLoading" class="picker-loading">Fetching sample from API</div>
<template v-else-if="rawSample">
<!-- Fetched text preview -->
<details class="sample-preview" open>
<summary class="sample-preview-toggle">Raw sample text</summary>
<pre class="sample-text">{{ rawSample.text }}</pre>
</details>
<!-- Prompt editor -->
<label class="prompt-label" for="prompt-editor">Prompt sent to models</label>
<textarea
id="prompt-editor"
class="prompt-editor"
v-model="editedPrompt"
rows="8"
/>
</template>
<div v-else-if="!sampleLoading && selectedProduct" class="picker-empty">
Click "Refresh Sample" to fetch a real sample from {{ selectedProduct.name }}.
</div>
</section>
<!-- Step 3: Models + Run -->
<section v-if="editedPrompt" class="step-section">
<h2 class="step-heading">3. Models &amp; Run</h2>
<!-- Ollama model picker -->
<details class="model-picker" open>
<summary class="picker-summary">
<span class="picker-title">🤖 Ollama Models</span>
<span class="picker-badge">{{ selectedModels.size }} / {{ ollamaModels.length }}</span>
</summary>
<div class="picker-body">
<div v-if="modelsLoading" class="picker-loading">Loading models</div>
<div v-else-if="ollamaModels.length === 0" class="picker-empty">
No ollama models in bench_models.yaml. Add models with <code>service: ollama</code>.
</div>
<template v-else>
<label class="picker-cat-header">
<input
type="checkbox"
:checked="selectedModels.size === ollamaModels.length"
:indeterminate="selectedModels.size > 0 && selectedModels.size < ollamaModels.length"
@change="toggleAllModels(($event.target as HTMLInputElement).checked)"
/>
<span class="picker-cat-name">All ollama models</span>
</label>
<div class="picker-model-list">
<label v-for="m in ollamaModels" :key="m.id" class="picker-model-row">
<input
type="checkbox"
:checked="selectedModels.has(m.id)"
@change="toggleModel(m.id, ($event.target as HTMLInputElement).checked)"
/>
<span class="picker-model-name" :title="m.name">{{ m.name }}</span>
<span class="picker-model-tags">
<span v-for="tag in m.tags.slice(0, 3)" :key="tag" class="tag">{{ tag }}</span>
</span>
</label>
</div>
</template>
</div>
</details>
<!-- Temperature -->
<div class="temp-row">
<label for="temp-slider" class="temp-label">Temperature: <strong>{{ temperature.toFixed(1) }}</strong></label>
<input
id="temp-slider"
type="range" min="0" max="1" step="0.1"
:value="temperature"
@input="temperature = parseFloat(($event.target as HTMLInputElement).value)"
class="temp-slider"
/>
</div>
<!-- Run controls -->
<div class="run-row">
<button
class="btn-run"
:disabled="running || selectedModels.size === 0"
@click="startRun"
>
{{ running ? '⏳ Running…' : '▶ Run' }}
</button>
<button v-if="running" class="btn-cancel" @click="cancelRun">Cancel</button>
</div>
<!-- Progress log -->
<div v-if="runLog.length > 0" class="run-log" aria-live="polite">
<div v-for="(line, i) in runLog" :key="i" class="log-line">{{ line }}</div>
</div>
</section>
<!-- Step 4: Results -->
<section v-if="results.length > 0" class="step-section">
<h2 class="step-heading">4. Results</h2>
<div class="results-grid">
<div
v-for="r in results"
:key="r.model"
class="result-card"
:class="{ 'result-error': !!r.error }"
>
<div class="result-header">
<span class="result-model">{{ r.model }}</span>
<span class="result-meta">
<template v-if="r.error">
<span class="result-err-badge">error</span>
</template>
<template v-else>
{{ (r.elapsed_ms / 1000).toFixed(1) }}s
</template>
</span>
</div>
<pre v-if="r.error" class="result-error-text">{{ r.error }}</pre>
<pre v-else class="result-response">{{ r.response }}</pre>
</div>
</div>
<div class="corrections-row">
<button
class="btn-corrections"
:disabled="pushingCorrections || !selectedProduct || successfulResults.length === 0"
@click="pushCorrections"
>
{{ pushingCorrections ? '⏳ Pushing…' : `✍ Send ${successfulResults.length} to Corrections` }}
</button>
<span v-if="correctionsPushMsg" class="corrections-msg" :class="correctionsPushOk ? 'msg-ok' : 'msg-err'">
{{ correctionsPushMsg }}
</span>
</div>
</section>
</div>
</template>
<script setup lang="ts">
import { ref, computed, onMounted } from 'vue'
// Types
interface Product {
id: string
name: string
icon: string
description: string
base_url: string
online: boolean
}
interface Sample {
product_id: string
sample_index: number
text: string
prompt: string
raw_item: Record<string, unknown>
}
interface ModelEntry {
id: string
name: string
service: string
tags: string[]
vram_estimate_mb: number
}
interface RunResult {
model: string
response: string
elapsed_ms: number
error: string | null
}
// State
const productsLoading = ref(false)
const products = ref<Product[]>([])
const selectedProduct = ref<Product | null>(null)
const sampleLoading = ref(false)
const sampleError = ref<string | null>(null)
const rawSample = ref<Sample | null>(null)
const editedPrompt = ref('')
const modelsLoading = ref(false)
const allModels = ref<ModelEntry[]>([])
const selectedModels = ref<Set<string>>(new Set())
const temperature = ref(0.7)
const running = ref(false)
const eventSource = ref<EventSource | null>(null)
const runLog = ref<string[]>([])
const results = ref<RunResult[]>([])
const pushingCorrections = ref(false)
const correctionsPushMsg = ref<string | null>(null)
const correctionsPushOk = ref(false)
// Computed
const ollamaModels = computed(() =>
allModels.value.filter(m => m.service === 'ollama')
)
const successfulResults = computed(() =>
results.value.filter(r => !r.error && r.response.trim())
)
// Lifecycle
onMounted(async () => {
await Promise.all([loadProducts(), loadModels()])
})
// Methods
async function loadProducts() {
productsLoading.value = true
try {
const resp = await fetch('/api/imitate/products')
if (!resp.ok) throw new Error(`HTTP ${resp.status}`)
const data = await resp.json()
products.value = data.products ?? []
} catch {
products.value = []
} finally {
productsLoading.value = false
}
}
async function loadModels() {
modelsLoading.value = true
try {
const resp = await fetch('/api/cforch/models')
if (!resp.ok) throw new Error(`HTTP ${resp.status}`)
const data = await resp.json()
allModels.value = data.models ?? []
// Select all ollama models by default
for (const m of allModels.value) {
if (m.service === 'ollama') selectedModels.value.add(m.id)
}
} catch {
allModels.value = []
} finally {
modelsLoading.value = false
}
}
async function selectProduct(p: Product) {
selectedProduct.value = p
rawSample.value = null
editedPrompt.value = ''
sampleError.value = null
results.value = []
runLog.value = []
await fetchSample()
}
async function fetchSample() {
if (!selectedProduct.value) return
sampleLoading.value = true
sampleError.value = null
try {
const resp = await fetch(`/api/imitate/products/${selectedProduct.value.id}/sample`)
if (!resp.ok) {
const body = await resp.json().catch(() => ({ detail: 'Unknown error' }))
throw new Error(body.detail ?? `HTTP ${resp.status}`)
}
const data: Sample = await resp.json()
rawSample.value = data
editedPrompt.value = data.prompt
} catch (err: unknown) {
sampleError.value = err instanceof Error ? err.message : String(err)
} finally {
sampleLoading.value = false
}
}
function toggleModel(id: string, checked: boolean) {
// Copy the Set so Vue's reactivity picks up the change
const next = new Set(selectedModels.value)
if (checked) next.add(id)
else next.delete(id)
selectedModels.value = next
}
function toggleAllModels(checked: boolean) {
selectedModels.value = checked
? new Set(ollamaModels.value.map(m => m.id))
: new Set()
}
function startRun() {
if (running.value || !editedPrompt.value.trim() || selectedModels.value.size === 0) return
running.value = true
results.value = []
runLog.value = []
correctionsPushMsg.value = null
const params = new URLSearchParams({
prompt: editedPrompt.value,
model_ids: [...selectedModels.value].join(','),
temperature: temperature.value.toString(),
product_id: selectedProduct.value?.id ?? '',
})
const es = new EventSource(`/api/imitate/run?${params}`)
eventSource.value = es
es.onmessage = (event: MessageEvent) => {
try {
const msg = JSON.parse(event.data)
if (msg.type === 'start') {
runLog.value.push(`Running ${msg.total_models} model(s)…`)
} else if (msg.type === 'model_start') {
runLog.value.push(`${msg.model}`)
} else if (msg.type === 'model_done') {
const status = msg.error
? `✕ error: ${msg.error}`
: `✓ done (${(msg.elapsed_ms / 1000).toFixed(1)}s)`
runLog.value.push(` ${msg.model}: ${status}`)
results.value.push({
model: msg.model,
response: msg.response,
elapsed_ms: msg.elapsed_ms,
error: msg.error ?? null,
})
} else if (msg.type === 'complete') {
runLog.value.push(`Complete. ${results.value.length} responses.`)
running.value = false
es.close()
}
} catch {
// ignore malformed SSE frames
}
}
es.onerror = () => {
runLog.value.push('Connection error — run may be incomplete.')
running.value = false
es.close()
}
}
function cancelRun() {
eventSource.value?.close()
eventSource.value = null
running.value = false
runLog.value.push('Cancelled.')
}
async function pushCorrections() {
if (!selectedProduct.value || successfulResults.value.length === 0) return
pushingCorrections.value = true
correctionsPushMsg.value = null
try {
const resp = await fetch('/api/imitate/push-corrections', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
product_id: selectedProduct.value.id,
prompt: editedPrompt.value,
results: successfulResults.value,
}),
})
if (!resp.ok) {
const body = await resp.json().catch(() => ({ detail: 'Unknown error' }))
throw new Error(body.detail ?? `HTTP ${resp.status}`)
}
const data = await resp.json()
correctionsPushMsg.value = `${data.pushed} record(s) added to Corrections queue.`
correctionsPushOk.value = true
} catch (err: unknown) {
correctionsPushMsg.value = err instanceof Error ? err.message : String(err)
correctionsPushOk.value = false
} finally {
pushingCorrections.value = false
}
}
</script>
<style scoped>
.imitate-view {
max-width: 1100px;
margin: 0 auto;
padding: 1.5rem;
display: flex;
flex-direction: column;
gap: 1.5rem;
}
.bench-header {
display: flex;
flex-direction: column;
gap: 0.25rem;
}
.page-title {
font-size: 1.6rem;
font-weight: 700;
color: var(--color-text, #1a2338);
}
.page-subtitle {
font-size: 0.9rem;
color: var(--color-text-secondary, #6b7a99);
}
/* Steps */
.step-section {
background: var(--color-surface-raised, #e4ebf5);
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
padding: 1.25rem;
display: flex;
flex-direction: column;
gap: 1rem;
}
.step-heading {
font-size: 1rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
text-transform: uppercase;
letter-spacing: 0.05em;
border-bottom: 1px solid var(--color-border, #d0d7e8);
padding-bottom: 0.5rem;
}
/* Product grid */
.product-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(160px, 1fr));
gap: 0.75rem;
}
.product-card {
display: flex;
flex-direction: column;
align-items: center;
gap: 0.35rem;
padding: 1rem 0.75rem;
border: 2px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
background: var(--color-surface, #f0f4fc);
cursor: pointer;
transition: border-color 0.15s, background 0.15s;
font-size: 0.9rem;
}
.product-card:hover:not(:disabled) {
border-color: var(--app-primary, #2A6080);
background: color-mix(in srgb, var(--app-primary, #2A6080) 6%, var(--color-surface, #f0f4fc));
}
.product-card.selected {
border-color: var(--app-primary, #2A6080);
background: color-mix(in srgb, var(--app-primary, #2A6080) 12%, var(--color-surface, #f0f4fc));
}
.product-card.offline {
opacity: 0.45;
cursor: not-allowed;
}
.product-icon {
font-size: 2rem;
}
.product-name {
font-weight: 600;
color: var(--color-text, #1a2338);
}
.product-status {
font-size: 0.72rem;
padding: 0.1rem 0.45rem;
border-radius: 9999px;
font-weight: 600;
}
.status-on {
background: #d1fae5;
color: #065f46;
}
.status-off {
background: #fee2e2;
color: #991b1b;
}
/* Sample panel */
.sample-toolbar {
display: flex;
align-items: center;
gap: 0.75rem;
flex-wrap: wrap;
}
.sample-product-label {
font-weight: 600;
color: var(--app-primary, #2A6080);
}
.sample-error {
color: #b91c1c;
font-size: 0.85rem;
}
.sample-preview {
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
overflow: hidden;
}
.sample-preview-toggle {
padding: 0.5rem 0.75rem;
cursor: pointer;
font-size: 0.85rem;
color: var(--color-text-secondary, #6b7a99);
background: var(--color-surface, #f0f4fc);
user-select: none;
}
.sample-text {
padding: 0.75rem;
font-size: 0.82rem;
white-space: pre-wrap;
word-break: break-word;
max-height: 180px;
overflow-y: auto;
background: var(--color-bg, #f0f4fc);
margin: 0;
color: var(--color-text, #1a2338);
}
.prompt-label {
font-size: 0.85rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
}
.prompt-editor {
width: 100%;
font-family: var(--font-mono, monospace);
font-size: 0.85rem;
padding: 0.75rem;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
background: var(--color-surface, #f0f4fc);
color: var(--color-text, #1a2338);
resize: vertical;
line-height: 1.5;
}
.prompt-editor:focus {
outline: 2px solid var(--app-primary, #2A6080);
outline-offset: -1px;
}
/* Model picker — reuse bench-view classes */
.model-picker {
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
overflow: hidden;
}
.picker-summary {
display: flex;
align-items: center;
justify-content: space-between;
padding: 0.75rem 1rem;
background: var(--color-surface, #f0f4fc);
cursor: pointer;
font-size: 0.95rem;
font-weight: 600;
user-select: none;
list-style: none;
}
.picker-title { flex: 1; }
.picker-badge {
font-size: 0.8rem;
background: var(--app-primary, #2A6080);
color: #fff;
border-radius: 9999px;
padding: 0.15rem 0.6rem;
}
.picker-body {
padding: 0.75rem 1rem;
display: flex;
flex-direction: column;
gap: 0.25rem;
}
.picker-loading, .picker-empty {
font-size: 0.85rem;
color: var(--color-text-secondary, #6b7a99);
padding: 0.5rem 0;
}
.picker-cat-header {
display: flex;
align-items: center;
gap: 0.5rem;
font-weight: 600;
font-size: 0.9rem;
padding: 0.35rem 0;
cursor: pointer;
}
.picker-model-list {
display: flex;
flex-wrap: wrap;
gap: 0.25rem;
padding-left: 1.25rem;
padding-bottom: 0.5rem;
}
.picker-model-row {
display: flex;
align-items: center;
gap: 0.4rem;
font-size: 0.85rem;
cursor: pointer;
padding: 0.2rem 0.5rem;
border-radius: 0.25rem;
min-width: 220px;
}
.picker-model-row:hover {
background: color-mix(in srgb, var(--app-primary, #2A6080) 8%, transparent);
}
.picker-model-name {
flex: 1;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
.picker-model-tags {
display: flex;
gap: 0.2rem;
flex-shrink: 0;
}
.tag {
font-size: 0.68rem;
background: var(--color-border, #d0d7e8);
border-radius: 9999px;
padding: 0.05rem 0.4rem;
color: var(--color-text-secondary, #6b7a99);
white-space: nowrap;
}
/* Temperature */
.temp-row {
display: flex;
align-items: center;
gap: 0.75rem;
}
.temp-label {
font-size: 0.85rem;
white-space: nowrap;
min-width: 160px;
}
.temp-slider {
flex: 1;
accent-color: var(--app-primary, #2A6080);
}
/* Run controls */
.run-row {
display: flex;
align-items: center;
gap: 0.75rem;
}
.btn-run {
background: var(--app-primary, #2A6080);
color: #fff;
border: none;
border-radius: 0.375rem;
padding: 0.55rem 1.25rem;
font-size: 0.9rem;
font-weight: 600;
cursor: pointer;
transition: opacity 0.15s;
}
.btn-run:disabled {
opacity: 0.4;
cursor: not-allowed;
}
.btn-cancel {
background: transparent;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
padding: 0.5rem 0.9rem;
font-size: 0.85rem;
cursor: pointer;
color: var(--color-text-secondary, #6b7a99);
}
.btn-refresh {
background: transparent;
border: 1px solid var(--app-primary, #2A6080);
border-radius: 0.375rem;
padding: 0.35rem 0.8rem;
font-size: 0.85rem;
color: var(--app-primary, #2A6080);
cursor: pointer;
transition: background 0.15s;
}
.btn-refresh:hover:not(:disabled) {
background: color-mix(in srgb, var(--app-primary, #2A6080) 10%, transparent);
}
.btn-refresh:disabled { opacity: 0.5; cursor: not-allowed; }
/* Run log */
.run-log {
background: var(--color-bg, #f0f4fc);
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
padding: 0.75rem;
font-family: var(--font-mono, monospace);
font-size: 0.8rem;
max-height: 140px;
overflow-y: auto;
}
.log-line {
padding: 0.05rem 0;
color: var(--color-text, #1a2338);
}
/* Results */
.results-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
gap: 1rem;
}
.result-card {
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
overflow: hidden;
background: var(--color-surface, #f0f4fc);
display: flex;
flex-direction: column;
}
.result-card.result-error {
border-color: #fca5a5;
}
.result-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 0.5rem 0.75rem;
background: var(--color-surface-raised, #e4ebf5);
border-bottom: 1px solid var(--color-border, #d0d7e8);
}
.result-model {
font-size: 0.82rem;
font-weight: 600;
color: var(--color-text, #1a2338);
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
}
.result-meta {
font-size: 0.75rem;
color: var(--color-text-secondary, #6b7a99);
flex-shrink: 0;
margin-left: 0.5rem;
}
.result-err-badge {
background: #fee2e2;
color: #991b1b;
border-radius: 9999px;
padding: 0.1rem 0.45rem;
font-size: 0.7rem;
font-weight: 600;
}
.result-response, .result-error-text {
padding: 0.75rem;
font-size: 0.82rem;
white-space: pre-wrap;
word-break: break-word;
max-height: 280px;
overflow-y: auto;
margin: 0;
flex: 1;
color: var(--color-text, #1a2338);
}
.result-error-text {
color: #b91c1c;
}
/* Corrections */
.corrections-row {
display: flex;
align-items: center;
gap: 0.75rem;
flex-wrap: wrap;
}
.btn-corrections {
background: var(--color-accent-warm, #b45309);
color: #fff;
border: none;
border-radius: 0.375rem;
padding: 0.55rem 1.25rem;
font-size: 0.9rem;
font-weight: 600;
cursor: pointer;
transition: opacity 0.15s;
}
.btn-corrections:disabled {
opacity: 0.4;
cursor: not-allowed;
}
.corrections-msg {
font-size: 0.85rem;
}
.msg-ok { color: #065f46; }
.msg-err { color: #b91c1c; }
</style>


@@ -0,0 +1,858 @@
<template>
<div class="models-view">
<h1 class="page-title">🤗 Models</h1>
<!-- 1. HF Lookup -->
<section class="section">
<h2 class="section-title">HuggingFace Lookup</h2>
<div class="lookup-row">
<input
v-model="lookupInput"
type="text"
class="lookup-input"
placeholder="org/model or huggingface.co/org/model"
:disabled="lookupLoading"
@keydown.enter="doLookup"
aria-label="HuggingFace model ID"
/>
<button
class="btn-primary"
:disabled="lookupLoading || !lookupInput.trim()"
@click="doLookup"
>
{{ lookupLoading ? 'Looking up…' : 'Lookup' }}
</button>
</div>
<div v-if="lookupError" class="error-notice" role="alert">
{{ lookupError }}
</div>
<div v-if="lookupResult" class="preview-card">
<div class="preview-header">
<span class="preview-repo-id">{{ lookupResult.repo_id }}</span>
<div class="badge-group">
<span v-if="lookupResult.already_installed" class="badge badge-success">Installed</span>
<span v-if="lookupResult.already_queued" class="badge badge-info">In queue</span>
</div>
</div>
<div class="preview-meta">
<span v-if="lookupResult.pipeline_tag" class="chip chip-pipeline">
{{ lookupResult.pipeline_tag }}
</span>
<span v-if="lookupResult.adapter_recommendation" class="chip chip-adapter">
{{ lookupResult.adapter_recommendation }}
</span>
<span v-if="lookupResult.size != null" class="preview-size">
{{ humanBytes(lookupResult.size) }}
</span>
</div>
<p v-if="lookupResult.description" class="preview-desc">
{{ lookupResult.description }}
</p>
<div v-if="lookupResult.warning" class="compat-warning" role="alert">
<span class="compat-warning-icon"></span>
<span>{{ lookupResult.warning }}</span>
</div>
<button
class="btn-primary btn-add-queue"
:class="{ 'btn-add-queue-warn': !lookupResult.compatible }"
:disabled="lookupResult.already_installed || lookupResult.already_queued || addingToQueue"
@click="addToQueue"
>
{{ addingToQueue ? 'Adding…' : lookupResult.compatible ? 'Add to queue' : 'Add anyway' }}
</button>
</div>
</section>
<!-- 2. Approval Queue -->
<section class="section">
<h2 class="section-title">Approval Queue</h2>
<div v-if="pendingModels.length === 0" class="empty-notice">
No models waiting for approval.
</div>
<div v-for="model in pendingModels" :key="model.id" class="model-card">
<div class="model-card-header">
<span class="model-repo-id">{{ model.repo_id }}</span>
<button
class="btn-dismiss"
:aria-label="`Dismiss ${model.repo_id}`"
@click="dismissModel(model.id)"
>
✕
</button>
</div>
<div class="model-meta">
<span v-if="model.pipeline_tag" class="chip chip-pipeline">{{ model.pipeline_tag }}</span>
<span v-if="model.adapter_recommendation" class="chip chip-adapter">{{ model.adapter_recommendation }}</span>
</div>
<div class="model-card-actions">
<button class="btn-primary btn-sm" @click="approveModel(model.id)">
Approve download
</button>
</div>
</div>
</section>
<!-- 3. Active Downloads -->
<section class="section">
<h2 class="section-title">Active Downloads</h2>
<div v-if="downloadingModels.length === 0" class="empty-notice">
No active downloads.
</div>
<div v-for="model in downloadingModels" :key="model.id" class="model-card">
<div class="model-card-header">
<span class="model-repo-id">{{ model.repo_id }}</span>
<span v-if="downloadErrors[model.id]" class="badge badge-error">Error</span>
</div>
<div class="model-meta">
<span v-if="model.pipeline_tag" class="chip chip-pipeline">{{ model.pipeline_tag }}</span>
</div>
<div v-if="downloadErrors[model.id]" class="download-error" role="alert">
{{ downloadErrors[model.id] }}
</div>
<div v-else class="progress-wrap" :aria-label="`Download progress for ${model.repo_id}`">
<div
class="progress-bar"
:style="{ width: `${downloadProgress[model.id] ?? 0}%` }"
role="progressbar"
:aria-valuenow="downloadProgress[model.id] ?? 0"
aria-valuemin="0"
aria-valuemax="100"
/>
<span class="progress-label">
{{ downloadProgress[model.id] == null ? 'Preparing…' : `${downloadProgress[model.id]}%` }}
</span>
</div>
</div>
</section>
<!-- 4. Installed Models -->
<section class="section">
<h2 class="section-title">Installed Models</h2>
<div v-if="installedModels.length === 0" class="empty-notice">
No models installed yet.
</div>
<div v-else class="installed-table-wrap">
<table class="installed-table">
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Adapter</th>
<th>Size</th>
<th></th>
</tr>
</thead>
<tbody>
<tr v-for="model in installedModels" :key="model.name">
<td class="td-name">{{ model.name }}</td>
<td>
<span
class="badge"
:class="model.type === 'finetuned' ? 'badge-accent' : 'badge-info'"
>
{{ model.type }}
</span>
</td>
<td>{{ model.adapter ?? '—' }}</td>
<td>{{ humanBytes(model.size) }}</td>
<td>
<button
class="btn-danger btn-sm"
@click="deleteInstalled(model.name)"
>
Delete
</button>
</td>
</tr>
</tbody>
</table>
</div>
</section>
</div>
</template>
<script setup lang="ts">
import { ref, computed, onMounted, onUnmounted } from 'vue'
// Type definitions
interface LookupResult {
repo_id: string
pipeline_tag: string | null
adapter_recommendation: string | null
compatible: boolean
warning: string | null
size: number | null
description: string | null
already_installed: boolean
already_queued: boolean
}
interface QueuedModel {
id: string
repo_id: string
status: 'pending' | 'downloading' | 'done' | 'error'
pipeline_tag: string | null
adapter_recommendation: string | null
}
interface InstalledModel {
name: string
type: 'finetuned' | 'downloaded'
adapter: string | null
size: number
}
interface SseProgressEvent {
model_id: string
pct: number | null
status: 'progress' | 'done' | 'error'
message?: string
}
// State
const lookupInput = ref('')
const lookupLoading = ref(false)
const lookupError = ref<string | null>(null)
const lookupResult = ref<LookupResult | null>(null)
const addingToQueue = ref(false)
const queuedModels = ref<QueuedModel[]>([])
const installedModels = ref<InstalledModel[]>([])
const downloadProgress = ref<Record<string, number>>({})
const downloadErrors = ref<Record<string, string>>({})
let pollInterval: ReturnType<typeof setInterval> | null = null
let sseSource: EventSource | null = null
// Derived
const pendingModels = computed(() =>
queuedModels.value.filter(m => m.status === 'pending')
)
const downloadingModels = computed(() =>
queuedModels.value.filter(m => m.status === 'downloading')
)
// Helpers
function humanBytes(bytes: number | null): string {
if (bytes == null) return '—'
const units = ['B', 'KB', 'MB', 'GB', 'TB']
let value = bytes
let unitIndex = 0
while (value >= 1024 && unitIndex < units.length - 1) {
value /= 1024
unitIndex++
}
return `${value.toFixed(unitIndex === 0 ? 0 : 1)} ${units[unitIndex]}`
}
function normalizeRepoId(raw: string): string {
return raw.trim().replace(/^https?:\/\/huggingface\.co\//, '')
}
// API calls
async function doLookup() {
const repoId = normalizeRepoId(lookupInput.value)
if (!repoId) return
lookupLoading.value = true
lookupError.value = null
lookupResult.value = null
try {
const res = await fetch(`/api/models/lookup?repo_id=${encodeURIComponent(repoId)}`)
if (res.status === 404) {
lookupError.value = 'Model not found on HuggingFace.'
return
}
if (res.status === 502) {
lookupError.value = 'HuggingFace unreachable. Check your connection and try again.'
return
}
if (!res.ok) {
lookupError.value = `Lookup failed (HTTP ${res.status}).`
return
}
lookupResult.value = await res.json() as LookupResult
} catch {
lookupError.value = 'Network error. Is the Avocet API running?'
} finally {
lookupLoading.value = false
}
}
async function addToQueue() {
if (!lookupResult.value) return
addingToQueue.value = true
try {
const res = await fetch('/api/models/queue', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ repo_id: lookupResult.value.repo_id }),
})
if (res.ok) {
lookupResult.value = { ...lookupResult.value, already_queued: true }
await loadQueue()
}
} catch { /* ignore — already_queued badge won't flip, user can retry */ }
finally {
addingToQueue.value = false
}
}
async function approveModel(id: string) {
try {
const res = await fetch(`/api/models/queue/${encodeURIComponent(id)}/approve`, { method: 'POST' })
if (res.ok) {
await loadQueue()
startSse()
}
} catch { /* ignore */ }
}
async function dismissModel(id: string) {
try {
const res = await fetch(`/api/models/queue/${encodeURIComponent(id)}`, { method: 'DELETE' })
if (res.ok) {
queuedModels.value = queuedModels.value.filter(m => m.id !== id)
}
} catch { /* ignore */ }
}
async function deleteInstalled(name: string) {
if (!window.confirm(`Delete installed model "${name}"? This cannot be undone.`)) return
try {
const res = await fetch(`/api/models/installed/${encodeURIComponent(name)}`, { method: 'DELETE' })
if (res.ok) {
installedModels.value = installedModels.value.filter(m => m.name !== name)
}
} catch { /* ignore */ }
}
async function loadQueue() {
try {
const res = await fetch('/api/models/queue')
if (res.ok) queuedModels.value = await res.json() as QueuedModel[]
} catch { /* non-fatal */ }
}
async function loadInstalled() {
try {
const res = await fetch('/api/models/installed')
if (res.ok) installedModels.value = await res.json() as InstalledModel[]
} catch { /* non-fatal */ }
}
// SSE for download progress
function startSse() {
if (sseSource) return // already connected
sseSource = new EventSource('/api/models/download/stream')
sseSource.addEventListener('message', (e: MessageEvent) => {
let event: SseProgressEvent
try {
event = JSON.parse(e.data as string) as SseProgressEvent
} catch {
return
}
const { model_id, pct, status, message } = event
if (status === 'progress' && pct != null) {
downloadProgress.value = { ...downloadProgress.value, [model_id]: pct }
} else if (status === 'done') {
const updated = { ...downloadProgress.value }
delete updated[model_id]
downloadProgress.value = updated
queuedModels.value = queuedModels.value.filter(m => m.id !== model_id)
loadInstalled()
} else if (status === 'error') {
downloadErrors.value = {
...downloadErrors.value,
[model_id]: message ?? 'Download failed.',
}
}
})
sseSource.onerror = () => {
sseSource?.close()
sseSource = null
}
}
function stopSse() {
sseSource?.close()
sseSource = null
}
// Polling
function startPollingIfDownloading() {
if (pollInterval) return
pollInterval = setInterval(async () => {
await loadQueue()
if (downloadingModels.value.length === 0) {
stopPolling()
}
}, 5000)
}
function stopPolling() {
if (pollInterval) {
clearInterval(pollInterval)
pollInterval = null
}
}
// Lifecycle
onMounted(async () => {
await Promise.all([loadQueue(), loadInstalled()])
if (downloadingModels.value.length > 0) {
startSse()
startPollingIfDownloading()
}
})
onUnmounted(() => {
stopPolling()
stopSse()
})
</script>
<style scoped>
.models-view {
max-width: 760px;
margin: 0 auto;
padding: 1.5rem 1rem 4rem;
display: flex;
flex-direction: column;
gap: 2rem;
}
.page-title {
font-family: var(--font-display, var(--font-body, sans-serif));
font-size: 1.4rem;
font-weight: 700;
color: var(--color-primary, #2d5a27);
}
/* ── Sections ── */
.section {
display: flex;
flex-direction: column;
gap: 0.75rem;
}
.section-title {
font-size: 1rem;
font-weight: 600;
color: var(--color-text, #1a2338);
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--color-border, #a8b8d0);
}
/* ── Lookup row ── */
.lookup-row {
display: flex;
gap: 0.5rem;
flex-wrap: wrap;
}
.lookup-input {
flex: 1;
min-width: 0;
padding: 0.45rem 0.7rem;
border: 1px solid var(--color-border, #a8b8d0);
border-radius: var(--radius-md, 0.5rem);
background: var(--color-surface-raised, #f5f7fc);
color: var(--color-text, #1a2338);
font-size: 0.9rem;
font-family: var(--font-body, sans-serif);
}
.lookup-input:disabled {
opacity: 0.6;
}
.lookup-input::placeholder {
color: var(--color-text-muted, #4a5c7a);
}
/* ── Notices ── */
.error-notice {
padding: 0.6rem 0.8rem;
background: color-mix(in srgb, var(--color-error, #c0392b) 12%, transparent);
border: 1px solid color-mix(in srgb, var(--color-error, #c0392b) 30%, transparent);
border-radius: var(--radius-md, 0.5rem);
color: var(--color-error, #c0392b);
font-size: 0.88rem;
}
.empty-notice {
color: var(--color-text-muted, #4a5c7a);
font-size: 0.9rem;
padding: 0.75rem;
border: 1px dashed var(--color-border, #a8b8d0);
border-radius: var(--radius-md, 0.5rem);
}
/* ── Preview card ── */
.preview-card {
border: 1px solid var(--color-border, #a8b8d0);
border-radius: var(--radius-lg, 1rem);
background: var(--color-surface-raised, #f5f7fc);
padding: 1rem;
display: flex;
flex-direction: column;
gap: 0.6rem;
box-shadow: var(--shadow-sm);
}
.preview-header {
display: flex;
align-items: flex-start;
justify-content: space-between;
gap: 0.5rem;
flex-wrap: wrap;
}
.preview-repo-id {
font-family: var(--font-mono, monospace);
font-size: 0.95rem;
font-weight: 600;
color: var(--color-text, #1a2338);
word-break: break-all;
}
.preview-meta {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
align-items: center;
}
.preview-size {
font-size: 0.8rem;
color: var(--color-text-muted, #4a5c7a);
margin-left: 0.25rem;
}
.preview-desc {
font-size: 0.875rem;
color: var(--color-text-muted, #4a5c7a);
line-height: 1.5;
margin: 0;
display: -webkit-box;
-webkit-line-clamp: 3;
-webkit-box-orient: vertical;
overflow: hidden;
}
.compat-warning {
display: flex;
align-items: flex-start;
gap: 0.5rem;
padding: 0.6rem 0.75rem;
border-radius: var(--radius-sm, 0.25rem);
background: color-mix(in srgb, var(--color-warning, #f59e0b) 12%, transparent);
border: 1px solid color-mix(in srgb, var(--color-warning, #f59e0b) 40%, transparent);
font-size: 0.82rem;
color: var(--color-text, #1a2338);
line-height: 1.45;
}
.compat-warning-icon {
flex-shrink: 0;
line-height: 1.45;
}
.btn-add-queue {
align-self: flex-start;
}
.btn-add-queue-warn {
background: var(--color-surface-raised, #e4ebf5);
color: var(--color-text-secondary, #6b7a99);
border: 1px solid var(--color-border, #d0d7e8);
}
/* ── Model cards (queue + downloads) ── */
.model-card {
border: 1px solid var(--color-border, #a8b8d0);
border-radius: var(--radius-md, 0.5rem);
background: var(--color-surface-raised, #f5f7fc);
padding: 0.75rem 1rem;
display: flex;
flex-direction: column;
gap: 0.5rem;
box-shadow: var(--shadow-sm);
}
.model-card-header {
display: flex;
align-items: center;
justify-content: space-between;
gap: 0.5rem;
}
.model-repo-id {
font-family: var(--font-mono, monospace);
font-size: 0.9rem;
font-weight: 600;
color: var(--color-text, #1a2338);
word-break: break-all;
}
.model-meta {
display: flex;
gap: 0.4rem;
flex-wrap: wrap;
}
.model-card-actions {
display: flex;
gap: 0.5rem;
flex-wrap: wrap;
padding-top: 0.25rem;
}
/* ── Progress bar ── */
.progress-wrap {
position: relative;
height: 1.5rem;
background: var(--color-surface-alt, #dde4f0);
border-radius: var(--radius-full, 9999px);
overflow: hidden;
}
.progress-bar {
position: absolute;
top: 0;
left: 0;
height: 100%;
background: var(--color-accent, #c4732a);
border-radius: var(--radius-full, 9999px);
transition: width 300ms ease;
}
.progress-label {
position: absolute;
inset: 0;
display: flex;
align-items: center;
justify-content: center;
font-size: 0.75rem;
font-weight: 600;
color: var(--color-text, #1a2338);
pointer-events: none;
}
.download-error {
font-size: 0.85rem;
color: var(--color-error, #c0392b);
padding: 0.4rem 0.5rem;
background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
border-radius: var(--radius-sm, 0.25rem);
}
/* ── Installed table ── */
.installed-table-wrap {
overflow-x: auto;
}
.installed-table {
width: 100%;
border-collapse: collapse;
font-size: 0.875rem;
}
.installed-table th {
text-align: left;
padding: 0.4rem 0.6rem;
color: var(--color-text-muted, #4a5c7a);
font-size: 0.78rem;
font-weight: 600;
text-transform: uppercase;
letter-spacing: 0.03em;
border-bottom: 1px solid var(--color-border, #a8b8d0);
white-space: nowrap;
}
.installed-table td {
padding: 0.55rem 0.6rem;
border-bottom: 1px solid var(--color-border-light, #ccd5e6);
vertical-align: middle;
}
.td-name {
font-family: var(--font-mono, monospace);
font-size: 0.85rem;
word-break: break-all;
}
/* ── Badges ── */
.badge-group {
display: flex;
gap: 0.35rem;
flex-wrap: wrap;
align-items: center;
}
.badge {
display: inline-flex;
align-items: center;
padding: 0.15rem 0.55rem;
border-radius: var(--radius-full, 9999px);
font-size: 0.72rem;
font-weight: 700;
letter-spacing: 0.02em;
text-transform: uppercase;
white-space: nowrap;
}
.badge-success {
background: color-mix(in srgb, var(--color-success, #3a7a32) 15%, transparent);
color: var(--color-success, #3a7a32);
}
.badge-info {
background: color-mix(in srgb, var(--color-info, #1e6091) 15%, transparent);
color: var(--color-info, #1e6091);
}
.badge-accent {
background: color-mix(in srgb, var(--color-accent, #c4732a) 15%, transparent);
color: var(--color-accent, #c4732a);
}
.badge-error {
background: color-mix(in srgb, var(--color-error, #c0392b) 15%, transparent);
color: var(--color-error, #c0392b);
}
/* ── Chips ── */
.chip {
display: inline-flex;
align-items: center;
padding: 0.15rem 0.5rem;
border-radius: var(--radius-full, 9999px);
font-size: 0.75rem;
font-weight: 600;
background: var(--color-surface-alt, #dde4f0);
white-space: nowrap;
}
.chip-pipeline {
color: var(--color-primary, #2d5a27);
background: color-mix(in srgb, var(--color-primary, #2d5a27) 12%, var(--color-surface-alt, #dde4f0));
}
.chip-adapter {
color: var(--color-accent, #c4732a);
background: color-mix(in srgb, var(--color-accent, #c4732a) 12%, var(--color-surface-alt, #dde4f0));
}
/* ── Buttons ── */
.btn-primary, .btn-danger {
padding: 0.4rem 0.9rem;
border-radius: var(--radius-md, 0.5rem);
font-size: 0.85rem;
cursor: pointer;
border: 1px solid;
font-family: var(--font-body, sans-serif);
transition: background var(--transition, 200ms ease), color var(--transition, 200ms ease);
}
.btn-sm {
padding: 0.25rem 0.65rem;
font-size: 0.8rem;
}
.btn-primary {
border-color: var(--color-primary, #2d5a27);
background: var(--color-primary, #2d5a27);
color: var(--color-text-inverse, #eaeff8);
}
.btn-primary:hover:not(:disabled) {
background: var(--color-primary-hover, #234820);
border-color: var(--color-primary-hover, #234820);
}
.btn-primary:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.btn-danger {
border-color: var(--color-error, #c0392b);
background: transparent;
color: var(--color-error, #c0392b);
}
.btn-danger:hover {
background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
}
.btn-dismiss {
border: none;
background: transparent;
color: var(--color-text-muted, #4a5c7a);
cursor: pointer;
font-size: 0.9rem;
padding: 0.15rem 0.4rem;
border-radius: var(--radius-sm, 0.25rem);
flex-shrink: 0;
transition: color var(--transition, 200ms ease), background var(--transition, 200ms ease);
}
.btn-dismiss:hover {
color: var(--color-error, #c0392b);
background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
}
/* ── Responsive ── */
@media (max-width: 480px) {
.lookup-row {
flex-direction: column;
}
.lookup-input {
width: 100%;
}
.btn-primary:not(.btn-sm) {
width: 100%;
}
.installed-table th:nth-child(3),
.installed-table td:nth-child(3) {
display: none; /* hide Adapter column on very narrow screens */
}
}
</style>
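The SSE handler in `startSse()` above folds three event kinds (`progress`, `done`, `error`) into component state. As a hedged sketch (names like `applyProgressEvent` and `ProgressState` are illustrative, not part of the component), the same logic can be expressed as a pure reducer that is unit-testable without an `EventSource`:

```typescript
// Illustrative, standalone reducer mirroring the startSse() message handler.
// Event shape matches SseProgressEvent in the component above.
type ProgressEvent = {
  model_id: string
  pct: number | null
  status: 'progress' | 'done' | 'error'
  message?: string
}

type ProgressState = {
  progress: Record<string, number>  // model_id -> percent complete
  errors: Record<string, string>    // model_id -> error message
  finished: string[]                // model_ids whose download completed
}

function applyProgressEvent(state: ProgressState, ev: ProgressEvent): ProgressState {
  if (ev.status === 'progress' && ev.pct != null) {
    // Record/overwrite the latest percentage for this model
    return { ...state, progress: { ...state.progress, [ev.model_id]: ev.pct } }
  }
  if (ev.status === 'done') {
    // Drop the progress entry and mark the model finished
    const { [ev.model_id]: _done, ...rest } = state.progress
    return { ...state, progress: rest, finished: [...state.finished, ev.model_id] }
  }
  if (ev.status === 'error') {
    return {
      ...state,
      errors: { ...state.errors, [ev.model_id]: ev.message ?? 'Download failed.' },
    }
  }
  return state
}
```

Extracting the reducer this way keeps the `EventSource` wiring (connect, parse, close on error) separate from the state transitions, which is where the bugs tend to hide.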


@@ -115,8 +115,18 @@
<h2 class="section-title">cf-orch Integration</h2>
<p class="section-desc">
Import SFT (supervised fine-tuning) candidates from cf-orch benchmark runs.
Connection settings fall back to environment variables
(<code>CF_ORCH_URL</code>, <code>CF_LICENSE_KEY</code>, <code>OLLAMA_HOST</code>)
when not set here.
</p>
<!-- Connection status pill -->
<div v-if="orchConfig" class="orch-status-row">
<span class="orch-status-pill" :class="orchStatusClass">{{ orchStatusLabel }}</span>
<span v-if="orchConfig.source === 'env'" class="orch-source-note">via env vars</span>
<span v-else class="orch-source-note">via label_tool.yaml</span>
</div>
<div class="field-row">
<label class="field field-grow">
<span>bench_results_dir</span>
@@ -181,7 +191,7 @@
</template>
<script setup lang="ts">
-import { ref, onMounted } from 'vue'
+import { ref, computed, onMounted } from 'vue'
import { useApiFetch } from '../composables/useApi'
interface Account {
@@ -199,12 +209,27 @@ const saveOk = ref(true)
const richMotion = ref(localStorage.getItem('cf-avocet-rich-motion') !== 'false')
const keyHints = ref(localStorage.getItem('cf-avocet-key-hints') !== 'false')
-// SFT integration state
+// SFT / cf-orch integration state
const benchResultsDir = ref('')
const runs = ref<Array<{ run_id: string; timestamp: string; candidate_count: number; already_imported: boolean }>>([])
const importingRunId = ref<string | null>(null)
const importResult = ref<{ imported: number; skipped: number } | null>(null)
const saveStatus = ref('')
const orchConfig = ref<{ coordinator_url: string; ollama_url: string; ollama_model: string; license_key_set: boolean; source: string } | null>(null)
const orchStatusClass = computed(() => {
if (!orchConfig.value) return 'status-unknown'
if (orchConfig.value.coordinator_url) return 'status-connected'
if (orchConfig.value.ollama_url) return 'status-local'
return 'status-unconfigured'
})
const orchStatusLabel = computed(() => {
if (!orchConfig.value) return 'Unknown'
if (orchConfig.value.coordinator_url) return '● cf-orch coordinator'
if (orchConfig.value.ollama_url) return '● Ollama (local)'
return '○ Not configured'
})
async function loadSftConfig() {
try {
@@ -218,6 +243,15 @@ async function loadSftConfig() {
}
}
async function loadOrchConfig() {
try {
const res = await fetch('/api/cforch/config')
if (res.ok) orchConfig.value = await res.json()
} catch {
// non-fatal
}
}
async function saveSftConfig() {
saveStatus.value = 'Saving…'
try {
@@ -337,6 +371,7 @@ function onKeyHintsChange() {
onMounted(() => {
reload()
loadSftConfig()
loadOrchConfig()
})
</script>
@@ -564,6 +599,31 @@ onMounted(() => {
width: 100%;
}
.orch-status-row {
display: flex;
align-items: center;
gap: var(--space-2);
margin-bottom: var(--space-3);
}
.orch-status-pill {
font-size: 0.8rem;
font-weight: 600;
padding: var(--space-1) var(--space-3);
border-radius: var(--radius-full);
}
.status-connected { background: color-mix(in srgb, var(--color-success, #3a7a32) 12%, transparent); color: var(--color-success, #3a7a32); }
.status-local { background: color-mix(in srgb, var(--color-primary) 12%, transparent); color: var(--color-primary); }
.status-unconfigured { background: var(--color-surface-alt); color: var(--color-text-muted); }
.status-unknown { background: var(--color-surface-alt); color: var(--color-text-muted); }
.orch-source-note {
font-size: 0.75rem;
color: var(--color-text-muted);
font-style: italic;
}
.runs-table {
width: 100%;
border-collapse: collapse;

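The Settings copy above says connection settings "fall back to environment variables (`CF_ORCH_URL`, `CF_LICENSE_KEY`, `OLLAMA_HOST`) when not set here", and the status pill distinguishes a yaml-sourced config from an env-sourced one. A minimal sketch of that precedence, assuming yaml values win over env values (the function name `resolveOrchConfig` and the merge order are assumptions, not the actual backend implementation):

```typescript
// Hedged sketch: yaml-over-env fallback for cf-orch connection settings.
// Field names mirror the /api/cforch/config response used in SettingsView.
type YamlOrchFields = Partial<Record<'coordinator_url' | 'license_key' | 'ollama_url', string>>

function resolveOrchConfig(fromYaml: YamlOrchFields, env: Record<string, string | undefined>) {
  return {
    coordinator_url: fromYaml.coordinator_url ?? env.CF_ORCH_URL ?? '',
    // Only report whether a key is set; never echo the secret itself
    license_key_set: Boolean(fromYaml.license_key ?? env.CF_LICENSE_KEY),
    ollama_url: fromYaml.ollama_url ?? env.OLLAMA_HOST ?? '',
    // Drives the "via env vars" / "via label_tool.yaml" note in the UI
    source: fromYaml.coordinator_url || fromYaml.ollama_url ? 'yaml' : 'env',
  }
}
```

With this shape, `orchStatusClass` in the component only has to inspect which URL fields ended up non-empty, not where they came from.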

@@ -35,6 +35,77 @@
</div>
</div>
<!-- Benchmark Results -->
<template v-if="benchRows.length > 0">
<h2 class="section-title">🏁 Benchmark Results</h2>
<div class="bench-table-wrap">
<table class="bench-table">
<thead>
<tr>
<th class="bt-model-col">Model</th>
<th
v-for="m in BENCH_METRICS"
:key="m.key as string"
class="bt-metric-col"
>{{ m.label }}</th>
</tr>
</thead>
<tbody>
<tr v-for="row in benchRows" :key="row.name">
<td class="bt-model-cell" :title="row.name">{{ row.name }}</td>
<td
v-for="m in BENCH_METRICS"
:key="m.key as string"
class="bt-metric-cell"
:class="{ 'bt-best': bestByMetric[m.key as string] === row.name }"
>
{{ formatMetric(row.result[m.key]) }}
</td>
</tr>
</tbody>
</table>
</div>
<p class="bench-hint">Highlighted cells mark the best-scoring model for each metric.</p>
</template>
<!-- LLM Benchmark Results -->
<template v-if="llmResults.length > 0">
<h2 class="section-title">🤖 LLM Benchmark</h2>
<div class="bench-table-wrap">
<table class="bench-table">
<thead>
<tr>
<th class="bt-model-col">Model</th>
<th class="bt-metric-col">overall</th>
<th
v-for="col in llmTaskTypeCols"
:key="col"
class="bt-metric-col"
>{{ col }}</th>
<th class="bt-metric-col">tok/s</th>
</tr>
</thead>
<tbody>
<tr v-for="row in llmResults" :key="row.model_id">
<td class="bt-model-cell" :title="row.model_id">{{ row.model_name }}</td>
<td
class="bt-metric-cell"
:class="{ 'bt-best': llmBestByCol['overall'] === row.model_id }"
>{{ llmPct(row.avg_quality_score) }}</td>
<td
v-for="col in llmTaskTypeCols"
:key="col"
class="bt-metric-cell"
:class="{ 'bt-best': llmBestByCol[col] === row.model_id }"
>{{ row.quality_by_task_type[col] != null ? llmPct(row.quality_by_task_type[col]) : '—' }}</td>
<td class="bt-metric-cell">{{ row.avg_tokens_per_sec.toFixed(1) }}</td>
</tr>
</tbody>
</table>
</div>
<p class="bench-hint">Run LLM Eval on the Benchmark tab to refresh. Highlighted = best per column.</p>
</template>
<div class="file-info">
<span class="file-path">Score file: <code>data/email_score.jsonl</code></span>
<span class="file-size">{{ fileSizeLabel }}</span>
@@ -54,10 +125,30 @@
import { ref, computed, onMounted } from 'vue'
import { useApiFetch } from '../composables/useApi'
interface BenchmarkModelResult {
accuracy?: number
macro_f1?: number
weighted_f1?: number
[key: string]: number | undefined
}
interface LlmModelResult {
model_name: string
model_id: string
node_id: string
avg_tokens_per_sec: number
avg_completion_ms: number
avg_quality_score: number
finetune_candidates: number
error_count: number
quality_by_task_type: Record<string, number>
}
interface StatsResponse {
total: number
counts: Record<string, number>
score_file_bytes: number
benchmark_results?: Record<string, BenchmarkModelResult>
}
// Canonical label order + metadata
@@ -108,6 +199,85 @@ const fileSizeLabel = computed(() => {
return `${(b / 1024 / 1024).toFixed(2)} MB`
})
// Benchmark results helpers
const BENCH_METRICS: Array<{ key: keyof BenchmarkModelResult; label: string }> = [
{ key: 'accuracy', label: 'Accuracy' },
{ key: 'macro_f1', label: 'Macro F1' },
{ key: 'weighted_f1', label: 'Weighted F1' },
]
const benchRows = computed(() => {
const br = stats.value.benchmark_results
if (!br || Object.keys(br).length === 0) return []
return Object.entries(br).map(([name, result]) => ({ name, result }))
})
// Find the best model name for each metric
const bestByMetric = computed((): Record<string, string> => {
const result: Record<string, string> = {}
for (const { key } of BENCH_METRICS) {
let bestName = ''
let bestVal = -Infinity
for (const { name, result: r } of benchRows.value) {
const v = r[key]
if (v != null && v > bestVal) { bestVal = v; bestName = name }
}
result[key as string] = bestName
}
return result
})
function formatMetric(v: number | undefined): string {
if (v == null) return '—'
// Values in 0-1 range: format as percentage
if (v <= 1) return `${(v * 100).toFixed(1)}%`
// Already a percentage
return `${v.toFixed(1)}%`
}
// LLM Benchmark results
const llmResults = ref<LlmModelResult[]>([])
const llmTaskTypeCols = computed(() => {
const types = new Set<string>()
for (const r of llmResults.value) {
for (const k of Object.keys(r.quality_by_task_type)) types.add(k)
}
return [...types].sort()
})
const llmBestByCol = computed((): Record<string, string> => {
const best: Record<string, string> = {}
if (llmResults.value.length === 0) return best
let bestId = '', bestVal = -Infinity
for (const r of llmResults.value) {
if (r.avg_quality_score > bestVal) { bestVal = r.avg_quality_score; bestId = r.model_id }
}
best['overall'] = bestId
for (const col of llmTaskTypeCols.value) {
bestId = ''; bestVal = -Infinity
for (const r of llmResults.value) {
const v = r.quality_by_task_type[col]
if (v != null && v > bestVal) { bestVal = v; bestId = r.model_id }
}
best[col] = bestId
}
return best
})
function llmPct(v: number): string {
return `${(v * 100).toFixed(1)}%`
}
async function loadLlmResults() {
const { data } = await useApiFetch<LlmModelResult[]>('/api/cforch/results')
if (Array.isArray(data) && data.length > 0) {
llmResults.value = data
}
}
async function load() {
loading.value = true
error.value = ''
@@ -120,7 +290,10 @@ async function load() {
}
}
-onMounted(load)
+onMounted(() => {
+  load()
+  loadLlmResults()
+})
</script>
<style scoped>
@@ -234,6 +407,79 @@ onMounted(load)
padding: 1rem;
}
/* ── Benchmark Results ──────────────────────────── */
.section-title {
font-family: var(--font-display, var(--font-body, sans-serif));
font-size: 1.05rem;
font-weight: 700;
color: var(--app-primary, #2A6080);
margin: 0;
}
.bench-table-wrap {
overflow-x: auto;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
}
.bench-table {
border-collapse: collapse;
width: 100%;
font-size: 0.82rem;
}
.bt-model-col {
text-align: left;
padding: 0.45rem 0.75rem;
background: var(--color-surface-raised, #e4ebf5);
border-bottom: 1px solid var(--color-border, #d0d7e8);
font-weight: 600;
min-width: 12rem;
}
.bt-metric-col {
text-align: right;
padding: 0.45rem 0.75rem;
background: var(--color-surface-raised, #e4ebf5);
border-bottom: 1px solid var(--color-border, #d0d7e8);
font-weight: 600;
white-space: nowrap;
min-width: 6rem;
}
.bt-model-cell {
padding: 0.4rem 0.75rem;
border-top: 1px solid var(--color-border, #d0d7e8);
font-family: var(--font-mono, monospace);
font-size: 0.76rem;
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
max-width: 16rem;
color: var(--color-text, #1a2338);
}
.bt-metric-cell {
padding: 0.4rem 0.75rem;
border-top: 1px solid var(--color-border, #d0d7e8);
text-align: right;
font-family: var(--font-mono, monospace);
font-variant-numeric: tabular-nums;
color: var(--color-text, #1a2338);
}
.bt-metric-cell.bt-best {
color: var(--color-success, #3a7a32);
font-weight: 700;
background: color-mix(in srgb, var(--color-success, #3a7a32) 8%, transparent);
}
.bench-hint {
font-size: 0.75rem;
color: var(--color-text-secondary, #6b7a99);
margin: 0;
}
@media (max-width: 480px) {
.bar-row {
grid-template-columns: 1.5rem 1fr 1fr 3rem;