feat: Vue 3 label tab — complete card-stack UI with ASMR bucket UX #1
34 changed files with 4382 additions and 4723 deletions
.gitignore (vendored) — 6 changed lines

```
@@ -11,12 +11,6 @@ config/label_tool.yaml
 data/email_score.jsonl
 data/email_label_queue.jsonl
 data/email_compare_sample.jsonl
-data/sft_candidates.jsonl
-data/sft_approved.jsonl
 
 # Conda/pip artifacts
 .env
-
-# Claude context — BSL 1.1, keep out of version control
-CLAUDE.md
-docs/superpowers/
```
CLAUDE.md (new file, +173 lines):
# Avocet — Email Classifier Training Tool

## What it is

Shared infrastructure for building and benchmarking email classifiers across the CircuitForge menagerie.
Named for the avocet's sweeping-bill technique — it sweeps through email streams and filters out categories.

**Pipeline:**

```
Scrape (IMAP, wide search, multi-account) → data/email_label_queue.jsonl
                 ↓
Label (card-stack UI) → data/email_score.jsonl
                 ↓
Benchmark (HuggingFace NLI/reranker) → per-model macro-F1 + latency
```
## Environment

- Python env: `conda run -n job-seeker <cmd>` for basic use (streamlit, yaml, stdlib only)
- Classifier env: `conda run -n job-seeker-classifiers <cmd>` for benchmarks (transformers, FlagEmbedding, gliclass)
- Run tests: `/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v`
  (call the binary directly — `conda run pytest` can spawn runaway processes)
- Create the classifier env: `conda env create -f environment.yml`
## Label Tool (app/label_tool.py)

Card-stack Streamlit UI for manually labeling recruitment emails.

```
conda run -n job-seeker streamlit run app/label_tool.py --server.port 8503
```

- Config: `config/label_tool.yaml` (gitignored — copy from `.example`, or use the ⚙️ Settings tab)
- Queue: `data/email_label_queue.jsonl` (gitignored)
- Output: `data/email_score.jsonl` (gitignored)
- Four tabs: 🃏 Label, 📥 Fetch, 📊 Stats, ⚙️ Settings
- Keyboard shortcuts: 1–9 = label, 0 = Other (wildcard, prompts free-text input), S = skip, U = undo
- Dedup: MD5 of `(subject + body[:100])` — cross-account safe
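A minimal sketch of that dedup key (mirroring `entry_key` in the IMAP fetch module later in this diff — only content fields feed the hash, never the account):

```python
import hashlib

def entry_key(e: dict) -> str:
    # Subject plus the first 100 chars of the body; account and date are
    # deliberately excluded, so the same email fetched from two accounts
    # hashes identically.
    key = e.get("subject", "") + (e.get("body", "") or "")[:100]
    return hashlib.md5(key.encode("utf-8")).hexdigest()

a = {"subject": "Phone screen?", "body": "Hi there", "account": "work"}
b = {"subject": "Phone screen?", "body": "Hi there", "account": "personal"}
print(entry_key(a) == entry_key(b))  # True
```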
### Settings Tab (⚙️)

- Add / edit / remove IMAP accounts via a form UI — no manual YAML editing required
- Per-account fields: display name, host, port, SSL toggle, username, password (masked), folder, days back
- **🔌 Test connection** button per account — connects, logs in, selects the folder, reports the message count
- Global: max emails per account per fetch
- **💾 Save** writes `config/label_tool.yaml`; **↩ Reload** discards unsaved changes
- `_sync_settings_to_state()` collects widget values before any add/remove to avoid index-key drift
## Benchmark (scripts/benchmark_classifier.py)

```
# List available models
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --list-models

# Score against labeled JSONL
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score

# Visual comparison on live IMAP emails
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --compare --limit 20

# Include slow/large models
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score --include-slow

# Export DB-labeled emails (⚠️ LLM-generated labels — review first)
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --export-db --db /path/to/staging.db
```
## Labels (peregrine defaults — configurable per product)

| Label | Key | Meaning |
|-------|-----|---------|
| `interview_scheduled` | 1 | Phone screen, video call, or on-site invitation |
| `offer_received` | 2 | Formal job offer or offer letter |
| `rejected` | 3 | Application declined or not moving forward |
| `positive_response` | 4 | Recruiter interest or request to connect |
| `survey_received` | 5 | Culture-fit survey or assessment invitation |
| `neutral` | 6 | ATS confirmation (application received, etc.) |
| `event_rescheduled` | 7 | Interview or event moved to a new time |
| `digest` | 8 | Job digest or multi-listing email (scrapeable) |
| `new_lead` | 9 | Unsolicited recruiter outreach or cold contact |
| `hired` | h | Offer accepted, onboarding, welcome email, start date |
## Model Registry (13 models, 7 defaults)

See `scripts/benchmark_classifier.py:MODEL_REGISTRY`.
Default models run without `--include-slow`.
Add `--models deberta-small deberta-small-2pass` to test a specific subset.
## Config Files

- `config/label_tool.yaml` — gitignored; multi-account IMAP config
- `config/label_tool.yaml.example` — committed template

## Data Files

- `data/email_score.jsonl` — gitignored; manually-labeled ground truth
- `data/email_score.jsonl.example` — committed sample for CI
- `data/email_label_queue.jsonl` — gitignored; IMAP fetch queue
## Key Design Notes

- `ZeroShotAdapter.load()` instantiates the pipeline object; `classify()` calls the object.
  Tests patch `scripts.classifier_adapters.pipeline` (the module-level factory) with a
  two-level mock: `mock_factory.return_value = MagicMock(return_value={...})`.
- `two_pass=True` on ZeroShotAdapter: the first pass ranks all 6 labels; the second pass re-runs
  with only the top 2, forcing a binary choice. 2× cost, better confidence.
- `--compare` uses the first account in `label_tool.yaml` for live IMAP emails.
- DB export labels are llama3.1:8b-generated — treat them as noisy, not gold truth.
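The two-pass idea above can be sketched in isolation (a toy scorer stands in for the real NLI pipeline; `classify_once` is a hypothetical stand-in, not the ZeroShotAdapter API):

```python
def classify_two_pass(text: str, labels: list[str], classify_once) -> tuple[str, dict]:
    # Pass 1: score every label, keep the two highest.
    scores = classify_once(text, labels)
    top2 = sorted(scores, key=scores.get, reverse=True)[:2]
    # Pass 2: re-run on just the top two, forcing a binary choice.
    final = classify_once(text, top2)
    return max(final, key=final.get), final

# Toy scorer: earlier labels score higher (purely illustrative).
def toy_scorer(text, labels):
    return {lab: 1.0 / (i + 1) for i, lab in enumerate(labels)}

best, final = classify_two_pass(
    "re: phone screen", ["interview_scheduled", "rejected", "digest"], toy_scorer
)
print(best)  # interview_scheduled
```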
## Vue Label UI (app/api.py + web/)

FastAPI on port 8503 serves both the REST API and the built Vue SPA (`web/dist/`).

```
./manage.sh start-api   # build Vue SPA + start FastAPI (binds 0.0.0.0:8503 — LAN accessible)
./manage.sh stop-api
./manage.sh open-api    # xdg-open http://localhost:8503
```

Logs: `log/api.log`
## Email Field Schema — IMPORTANT

Two schemas exist. The normalization layer in `app/api.py` bridges them automatically.

### JSONL on-disk schema (written by `label_tool.py` and its IMAP fetch)

| Field | Type | Notes |
|-------|------|-------|
| `subject` | str | Email subject line |
| `body` | str | Plain-text body, truncated at 800 chars; HTML stripped by `_strip_html()` |
| `from_addr` | str | Sender address string (`"Name <addr>"`) |
| `date` | str | Raw RFC 2822 date string |
| `account` | str | Display name of the IMAP account that fetched it |
| *(no `id`)* | — | Dedup key is MD5 of `(subject + body[:100])` — never stored on disk |
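A representative queue line under this schema (values illustrative; the dedup hash is derived on the fly, never written):

```python
import hashlib
import json

record = {
    "subject": "Invitation to interview",
    "body": "Hi, we'd like to schedule a phone screen next week.",  # ≤800 chars, HTML already stripped
    "from_addr": "Jane Recruiter <jane@example.com>",
    "date": "Mon, 03 Feb 2025 09:15:00 -0500",
    "account": "work-imap",
}
line = json.dumps(record)  # one JSON object per line in the .jsonl file

# No `id` on disk — the dedup key is computed when needed:
dedup = hashlib.md5((record["subject"] + record["body"][:100]).encode("utf-8")).hexdigest()
print(len(dedup))  # 32
```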
### Vue API schema (returned by `GET /api/queue`, required by POST endpoints)

| Field | Type | Notes |
|-------|------|-------|
| `id` | str | MD5 content hash, or the stored `id` if the item has one |
| `subject` | str | Unchanged |
| `body` | str | Unchanged |
| `from` | str | Mapped from `from_addr` (or `from` if already present) |
| `date` | str | Unchanged |
| `source` | str | Mapped from `account` (or `source` if already present) |
### Normalization layer (`_normalize()` in `app/api.py`)

`_normalize(item)` handles the mapping and ID generation. All `GET /api/queue` responses
pass through it. Mutating endpoints (`/api/label`, `/api/skip`, `/api/discard`) look up
items via `_normalize(x)["id"]`, so both real data (no `id`, uses the content hash) and test
fixtures (explicit `id` field) work transparently.
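A sketch consistent with that description (illustrative only — the real `_normalize()` lives in `app/api.py`):

```python
import hashlib

def normalize(item: dict) -> dict:
    # Map on-disk names to the Vue API names, preserving any explicit values.
    out = dict(item)
    out["from"] = item.get("from", item.get("from_addr", ""))
    out["source"] = item.get("source", item.get("account", ""))
    if "id" not in out:
        # Real data carries no id → derive the MD5 content hash.
        key = item.get("subject", "") + (item.get("body", "") or "")[:100]
        out["id"] = hashlib.md5(key.encode("utf-8")).hexdigest()
    return out

disk_item = {"subject": "s", "body": "b", "from_addr": "a@x", "date": "d", "account": "work"}
fixture = {"id": "t1", "subject": "s", "body": "b"}
print(normalize(disk_item)["source"], normalize(fixture)["id"])  # work t1
```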
### Peregrine integration

Peregrine's `staging.db` uses different field names again:

| staging.db column | Maps to avocet JSONL field |
|-------------------|---------------------------|
| `subject` | `subject` |
| `body` | `body` (may contain HTML — run through `_strip_html()` before queuing) |
| `from_address` | `from_addr` |
| `received_date` | `date` |
| `account` or source context | `account` |

When exporting from Peregrine's DB for avocet labeling, transform to the JSONL schema above
(not the Vue API schema). The `--export-db` flag in `benchmark_classifier.py` does this.
Any new export path should also call `_strip_html()` on the body before writing.
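The row transform can be sketched as follows (a hypothetical helper named for illustration — the real logic is behind `--export-db`, and `basic_strip_html` is a crude stand-in for `_strip_html()`):

```python
import re

def basic_strip_html(html: str) -> str:
    # Crude tag-stripping stand-in for _strip_html(), for illustration only.
    return re.sub(r"<[^>]+>", " ", html).strip()

def staging_row_to_jsonl(row: dict) -> dict:
    # Column mapping from the table above; target is the JSONL on-disk schema.
    return {
        "subject": row["subject"],
        "body": basic_strip_html(row["body"])[:800],
        "from_addr": row["from_address"],
        "date": row["received_date"],
        "account": row.get("account", "peregrine"),
    }

row = {"subject": "Offer", "body": "<p>Congrats!</p>",
       "from_address": "hr@x", "received_date": "Tue, 04 Feb 2025"}
print(staging_row_to_jsonl(row)["body"])  # Congrats!
```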
## Relationship to Peregrine

Avocet started as `peregrine/tools/label_tool.py` + `peregrine/scripts/classifier_adapters.py`.
Peregrine retains copies during stabilization; once avocet is proven, peregrine will import from here.
app/api.py — 51 changed lines

```
@@ -142,13 +142,6 @@ def _normalize(item: dict) -> dict:
 app = FastAPI(title="Avocet API")
 
-from app.sft import router as sft_router
-app.include_router(sft_router, prefix="/api/sft")
-
-from app.models import router as models_router
-import app.models as _models_module
-app.include_router(models_router, prefix="/api/models")
-
 # In-memory last-action store (single user, local tool — in-memory is fine)
 _last_action: dict | None = None
```
```
@@ -302,18 +295,10 @@ def get_stats():
         lbl = r.get("label", "")
         if lbl:
             counts[lbl] = counts.get(lbl, 0) + 1
-    benchmark_results: dict = {}
-    benchmark_path = _DATA_DIR / "benchmark_results.json"
-    if benchmark_path.exists():
-        try:
-            benchmark_results = json.loads(benchmark_path.read_text(encoding="utf-8"))
-        except Exception:
-            pass
     return {
         "total": len(records),
         "counts": counts,
         "score_file_bytes": _score_file().stat().st_size if _score_file().exists() else 0,
-        "benchmark_results": benchmark_results,
     }
```
```
@@ -348,36 +333,6 @@ from fastapi.responses import StreamingResponse
 # Benchmark endpoints
 # ---------------------------------------------------------------------------
 
-@app.get("/api/benchmark/models")
-def get_benchmark_models() -> dict:
-    """Return installed models grouped by adapter_type category."""
-    models_dir: Path = _models_module._MODELS_DIR
-    categories: dict[str, list[dict]] = {
-        "ZeroShotAdapter": [],
-        "RerankerAdapter": [],
-        "GenerationAdapter": [],
-        "Unknown": [],
-    }
-    if models_dir.exists():
-        for sub in models_dir.iterdir():
-            if not sub.is_dir():
-                continue
-            info_path = sub / "model_info.json"
-            adapter_type = "Unknown"
-            repo_id: str | None = None
-            if info_path.exists():
-                try:
-                    info = json.loads(info_path.read_text(encoding="utf-8"))
-                    adapter_type = info.get("adapter_type") or info.get("adapter_recommendation") or "Unknown"
-                    repo_id = info.get("repo_id")
-                except Exception:
-                    pass
-            bucket = adapter_type if adapter_type in categories else "Unknown"
-            entry: dict = {"name": sub.name, "repo_id": repo_id, "adapter_type": adapter_type}
-            categories[bucket].append(entry)
-    return {"categories": categories}
-
-
 @app.get("/api/benchmark/results")
 def get_benchmark_results():
     """Return the most recently saved benchmark results, or an empty envelope."""
```
```
@@ -388,17 +343,13 @@ def get_benchmark_results():
 
 @app.get("/api/benchmark/run")
-def run_benchmark(include_slow: bool = False, model_names: str = ""):
+def run_benchmark(include_slow: bool = False):
     """Spawn the benchmark script and stream stdout as SSE progress events."""
     python_bin = "/devl/miniconda3/envs/job-seeker-classifiers/bin/python"
     script = str(_ROOT / "scripts" / "benchmark_classifier.py")
     cmd = [python_bin, script, "--score", "--save"]
     if include_slow:
         cmd.append("--include-slow")
-    if model_names:
-        names = [n.strip() for n in model_names.split(",") if n.strip()]
-        if names:
-            cmd.extend(["--models"] + names)
 
     def generate():
         try:
```
```
@@ -1,6 +1,6 @@
 """Avocet — IMAP fetch utilities.
 
-Shared between app/api.py (FastAPI SSE endpoint) and the label UI.
+Shared between app/api.py (FastAPI SSE endpoint) and app/label_tool.py (Streamlit).
 No Streamlit imports here — stdlib + imaplib only.
 """
 from __future__ import annotations

@@ -8,11 +8,36 @@ from __future__ import annotations
 import email as _email_lib
 import hashlib
 import imaplib
+import re
 from datetime import datetime, timedelta
 from email.header import decode_header as _raw_decode
+from html.parser import HTMLParser
 from typing import Any, Iterator
 
-from app.utils import extract_body, strip_html  # noqa: F401 (strip_html re-exported for callers)
+
+# ── HTML → plain text ────────────────────────────────────────────────────────
+
+class _TextExtractor(HTMLParser):
+    def __init__(self):
+        super().__init__()
+        self._parts: list[str] = []
+
+    def handle_data(self, data: str) -> None:
+        stripped = data.strip()
+        if stripped:
+            self._parts.append(stripped)
+
+    def get_text(self) -> str:
+        return " ".join(self._parts)
+
+
+def strip_html(html_str: str) -> str:
+    try:
+        ex = _TextExtractor()
+        ex.feed(html_str)
+        return ex.get_text()
+    except Exception:
+        return re.sub(r"<[^>]+>", " ", html_str).strip()
+
 
 # ── IMAP decode helpers ───────────────────────────────────────────────────────

@@ -30,6 +55,37 @@ def _decode_str(value: str | None) -> str:
     return " ".join(out).strip()
 
 
+def _extract_body(msg: Any) -> str:
+    if msg.is_multipart():
+        html_fallback: str | None = None
+        for part in msg.walk():
+            ct = part.get_content_type()
+            if ct == "text/plain":
+                try:
+                    charset = part.get_content_charset() or "utf-8"
+                    return part.get_payload(decode=True).decode(charset, errors="replace")
+                except Exception:
+                    pass
+            elif ct == "text/html" and html_fallback is None:
+                try:
+                    charset = part.get_content_charset() or "utf-8"
+                    raw = part.get_payload(decode=True).decode(charset, errors="replace")
+                    html_fallback = strip_html(raw)
+                except Exception:
+                    pass
+        return html_fallback or ""
+    else:
+        try:
+            charset = msg.get_content_charset() or "utf-8"
+            raw = msg.get_payload(decode=True).decode(charset, errors="replace")
+            if msg.get_content_type() == "text/html":
+                return strip_html(raw)
+            return raw
+        except Exception:
+            pass
+    return ""
+
+
 def entry_key(e: dict) -> str:
     """Stable MD5 content-hash for dedup — matches label_tool.py _entry_key."""
     key = (e.get("subject", "") + (e.get("body", "") or "")[:100])

@@ -137,7 +193,7 @@ def fetch_account_stream(
         subj = _decode_str(msg.get("Subject", ""))
         from_addr = _decode_str(msg.get("From", ""))
         date = _decode_str(msg.get("Date", ""))
-        body = extract_body(msg)[:800]
+        body = _extract_body(msg)[:800]
         entry = {"subject": subj, "body": body, "from_addr": from_addr,
                  "date": date, "account": name}
         k = entry_key(entry)
```
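The HTMLParser approach added above can be sanity-checked standalone (a mirror of `_TextExtractor`, not an import of the module itself):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects stripped text nodes, mirroring the _TextExtractor added above.
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())

ex = TextExtractor()
ex.feed("<div><h1>Offer</h1><p>We are pleased to confirm.</p></div>")
print(" ".join(ex.parts))  # Offer We are pleased to confirm.
```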
app/label_tool.py (new file, +1186 lines) — file diff suppressed because it is too large.
app/models.py (deleted — the entire 428-line file was removed)

```
@@ -1,428 +0,0 @@
"""Avocet — HF model lifecycle API.

Handles model metadata lookup, approval queue, download with progress,
and installed model management.

All endpoints are registered on `router` (a FastAPI APIRouter).
api.py includes this router with prefix="/api/models".

Module-level globals (_MODELS_DIR, _QUEUE_DIR) follow the same
testability pattern as sft.py — override them via set_models_dir() and
set_queue_dir() in test fixtures.
"""
from __future__ import annotations

import json
import logging
import shutil
import threading
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from uuid import uuid4

import httpx
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from app.utils import read_jsonl, write_jsonl

try:
    from huggingface_hub import snapshot_download
except ImportError:  # pragma: no cover
    snapshot_download = None  # type: ignore[assignment]

logger = logging.getLogger(__name__)

_ROOT = Path(__file__).parent.parent
_MODELS_DIR: Path = _ROOT / "models"
_QUEUE_DIR: Path = _ROOT / "data"

router = APIRouter()

# ── Download progress shared state ────────────────────────────────────────────
# Updated by the background download thread; read by GET /download/stream.
_download_progress: dict[str, Any] = {}

# ── HF pipeline_tag → adapter recommendation ──────────────────────────────────
_TAG_TO_ADAPTER: dict[str, str] = {
    "zero-shot-classification": "ZeroShotAdapter",
    "text-classification": "ZeroShotAdapter",
    "natural-language-inference": "ZeroShotAdapter",
    "sentence-similarity": "RerankerAdapter",
    "text-ranking": "RerankerAdapter",
    "text-generation": "GenerationAdapter",
    "text2text-generation": "GenerationAdapter",
}


# ── Testability seams ──────────────────────────────────────────────────────────

def set_models_dir(path: Path) -> None:
    global _MODELS_DIR
    _MODELS_DIR = path


def set_queue_dir(path: Path) -> None:
    global _QUEUE_DIR
    _QUEUE_DIR = path


# ── Internal helpers ───────────────────────────────────────────────────────────

def _queue_file() -> Path:
    return _QUEUE_DIR / "model_queue.jsonl"


def _read_queue() -> list[dict]:
    return read_jsonl(_queue_file())


def _write_queue(records: list[dict]) -> None:
    write_jsonl(_queue_file(), records)


def _safe_model_name(repo_id: str) -> str:
    """Convert repo_id to a filesystem-safe directory name (HF convention)."""
    return repo_id.replace("/", "--")
```
```
def _is_installed(repo_id: str) -> bool:
    """Check if a model is already downloaded in _MODELS_DIR."""
    safe_name = _safe_model_name(repo_id)
    model_dir = _MODELS_DIR / safe_name
    return model_dir.exists() and (
        (model_dir / "config.json").exists()
        or (model_dir / "training_info.json").exists()
        or (model_dir / "model_info.json").exists()
    )


def _is_queued(repo_id: str) -> bool:
    """Check if repo_id is already in the queue (non-dismissed)."""
    for entry in _read_queue():
        if entry.get("repo_id") == repo_id and entry.get("status") != "dismissed":
            return True
    return False


def _update_queue_entry(entry_id: str, updates: dict) -> dict | None:
    """Update a queue entry by id. Returns updated entry or None if not found."""
    records = _read_queue()
    for i, r in enumerate(records):
        if r.get("id") == entry_id:
            records[i] = {**r, **updates}
            _write_queue(records)
            return records[i]
    return None


def _get_queue_entry(entry_id: str) -> dict | None:
    for r in _read_queue():
        if r.get("id") == entry_id:
            return r
    return None


# ── Background download ────────────────────────────────────────────────────────

def _run_download(entry_id: str, repo_id: str, pipeline_tag: str | None, adapter_recommendation: str | None) -> None:
    """Background thread: download model via huggingface_hub.snapshot_download."""
    global _download_progress
    safe_name = _safe_model_name(repo_id)
    local_dir = _MODELS_DIR / safe_name

    _download_progress = {
        "active": True,
        "repo_id": repo_id,
        "downloaded_bytes": 0,
        "total_bytes": 0,
        "pct": 0.0,
        "done": False,
        "error": None,
    }

    try:
        if snapshot_download is None:
            raise RuntimeError("huggingface_hub is not installed")

        snapshot_download(
            repo_id=repo_id,
            local_dir=str(local_dir),
        )

        # Write model_info.json alongside downloaded files
        model_info = {
            "repo_id": repo_id,
            "pipeline_tag": pipeline_tag,
            "adapter_recommendation": adapter_recommendation,
            "downloaded_at": datetime.now(timezone.utc).isoformat(),
        }
        local_dir.mkdir(parents=True, exist_ok=True)
        (local_dir / "model_info.json").write_text(
            json.dumps(model_info, indent=2), encoding="utf-8"
        )

        _download_progress["done"] = True
        _download_progress["pct"] = 100.0
        _update_queue_entry(entry_id, {"status": "ready"})

    except Exception as exc:
        logger.exception("Download failed for %s: %s", repo_id, exc)
        _download_progress["error"] = str(exc)
        _download_progress["done"] = True
        _update_queue_entry(entry_id, {"status": "failed", "error": str(exc)})
    finally:
        _download_progress["active"] = False
```
```
# ── GET /lookup ────────────────────────────────────────────────────────────────

@router.get("/lookup")
def lookup_model(repo_id: str) -> dict:
    """Validate repo_id and fetch metadata from the HF API."""
    # Validate: must contain exactly one '/', no whitespace
    if "/" not in repo_id or any(c.isspace() for c in repo_id):
        raise HTTPException(422, f"Invalid repo_id {repo_id!r}: must be 'owner/model-name' with no whitespace")

    hf_url = f"https://huggingface.co/api/models/{repo_id}"
    try:
        resp = httpx.get(hf_url, timeout=10.0)
    except httpx.RequestError as exc:
        raise HTTPException(502, f"Network error reaching HuggingFace API: {exc}") from exc

    if resp.status_code == 404:
        raise HTTPException(404, f"Model {repo_id!r} not found on HuggingFace")
    if resp.status_code != 200:
        raise HTTPException(502, f"HuggingFace API returned status {resp.status_code}")

    data = resp.json()
    pipeline_tag = data.get("pipeline_tag")
    adapter_recommendation = _TAG_TO_ADAPTER.get(pipeline_tag) if pipeline_tag else None
    if pipeline_tag and adapter_recommendation is None:
        logger.warning("Unknown pipeline_tag %r for %s — no adapter recommendation", pipeline_tag, repo_id)

    # Estimate model size from siblings list
    siblings = data.get("siblings") or []
    model_size_bytes: int = sum(s.get("size", 0) for s in siblings if isinstance(s, dict))

    # Description: first 300 chars of card data (modelId field used as fallback)
    card_data = data.get("cardData") or {}
    description_raw = card_data.get("description") or data.get("modelId") or ""
    description = description_raw[:300] if description_raw else ""

    return {
        "repo_id": repo_id,
        "pipeline_tag": pipeline_tag,
        "adapter_recommendation": adapter_recommendation,
        "model_size_bytes": model_size_bytes,
        "description": description,
        "tags": data.get("tags") or [],
        "downloads": data.get("downloads") or 0,
        "already_installed": _is_installed(repo_id),
        "already_queued": _is_queued(repo_id),
    }


# ── GET /queue ─────────────────────────────────────────────────────────────────

@router.get("/queue")
def get_queue() -> list[dict]:
    """Return all non-dismissed queue entries sorted newest-first."""
    records = _read_queue()
    active = [r for r in records if r.get("status") != "dismissed"]
    return sorted(active, key=lambda r: r.get("queued_at", ""), reverse=True)


# ── POST /queue ────────────────────────────────────────────────────────────────

class QueueAddRequest(BaseModel):
    repo_id: str
    pipeline_tag: str | None = None
    adapter_recommendation: str | None = None


@router.post("/queue", status_code=201)
def add_to_queue(req: QueueAddRequest) -> dict:
    """Add a model to the approval queue with status 'pending'."""
    if _is_installed(req.repo_id):
        raise HTTPException(409, f"{req.repo_id!r} is already installed")
    if _is_queued(req.repo_id):
        raise HTTPException(409, f"{req.repo_id!r} is already in the queue")

    entry = {
        "id": str(uuid4()),
        "repo_id": req.repo_id,
        "pipeline_tag": req.pipeline_tag,
        "adapter_recommendation": req.adapter_recommendation,
        "status": "pending",
        "queued_at": datetime.now(timezone.utc).isoformat(),
    }
    records = _read_queue()
    records.append(entry)
    _write_queue(records)
    return entry
```
# ── POST /queue/{id}/approve ───────────────────────────────────────────────────
|
|
||||||
|
|
||||||
@router.post("/queue/{entry_id}/approve")
|
|
||||||
def approve_queue_entry(entry_id: str) -> dict:
|
|
||||||
"""Approve a pending queue entry and start background download."""
|
|
||||||
entry = _get_queue_entry(entry_id)
|
|
||||||
if entry is None:
|
|
||||||
raise HTTPException(404, f"Queue entry {entry_id!r} not found")
|
|
||||||
if entry.get("status") != "pending":
|
|
||||||
raise HTTPException(409, f"Entry is not in pending state (current: {entry.get('status')!r})")
|
|
||||||
|
|
||||||
updated = _update_queue_entry(entry_id, {"status": "downloading"})
|
|
||||||
|
|
||||||
thread = threading.Thread(
|
|
||||||
target=_run_download,
|
|
||||||
args=(entry_id, entry["repo_id"], entry.get("pipeline_tag"), entry.get("adapter_recommendation")),
|
|
||||||
daemon=True,
|
|
||||||
name=f"model-download-{entry_id}",
|
|
||||||
)
|
|
||||||
thread.start()
|
|
||||||
|
|
||||||
return {"ok": True}
|
|
||||||
|
|
||||||
|
|
||||||
# ── DELETE /queue/{id} ─────────────────────────────────────────────────────────


@router.delete("/queue/{entry_id}")
def dismiss_queue_entry(entry_id: str) -> dict:
    """Dismiss (soft-delete) a queue entry."""
    entry = _get_queue_entry(entry_id)
    if entry is None:
        raise HTTPException(404, f"Queue entry {entry_id!r} not found")

    _update_queue_entry(entry_id, {"status": "dismissed"})
    return {"ok": True}

# ── GET /download/stream ───────────────────────────────────────────────────────


@router.get("/download/stream")
def download_stream() -> StreamingResponse:
    """SSE stream of download progress. Yields one idle event if no download active."""

    def generate():
        prog = _download_progress
        if not prog.get("active") and not prog.get("done"):
            yield f"data: {json.dumps({'type': 'idle'})}\n\n"
            return

        if prog.get("done"):
            if prog.get("error"):
                yield f"data: {json.dumps({'type': 'error', 'error': prog['error']})}\n\n"
            else:
                yield f"data: {json.dumps({'type': 'done', 'repo_id': prog.get('repo_id')})}\n\n"
            return

        # Stream live progress
        import time
        while True:
            p = dict(_download_progress)
            if p.get("done"):
                if p.get("error"):
                    yield f"data: {json.dumps({'type': 'error', 'error': p['error']})}\n\n"
                else:
                    yield f"data: {json.dumps({'type': 'done', 'repo_id': p.get('repo_id')})}\n\n"
                break
            event = json.dumps({
                "type": "progress",
                "repo_id": p.get("repo_id"),
                "downloaded_bytes": p.get("downloaded_bytes", 0),
                "total_bytes": p.get("total_bytes", 0),
                "pct": p.get("pct", 0.0),
            })
            yield f"data: {event}\n\n"
            time.sleep(0.5)

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

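Each message in the stream above is a standard Server-Sent Events frame: a `data:` line carrying one JSON object, terminated by a blank line. A minimal framing helper showing the wire format:

```python
import json

def sse_frame(payload: dict) -> str:
    # SSE wire format: "data: <payload>\n\n"; the blank line ends the event.
    return f"data: {json.dumps(payload)}\n\n"

frame = sse_frame({"type": "progress", "pct": 42.0})
```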
# ── GET /installed ─────────────────────────────────────────────────────────────


@router.get("/installed")
def list_installed() -> list[dict]:
    """Scan _MODELS_DIR and return info on each installed model."""
    if not _MODELS_DIR.exists():
        return []

    results: list[dict] = []
    for sub in _MODELS_DIR.iterdir():
        if not sub.is_dir():
            continue

        has_training_info = (sub / "training_info.json").exists()
        has_config = (sub / "config.json").exists()
        has_model_info = (sub / "model_info.json").exists()

        if not (has_training_info or has_config or has_model_info):
            continue

        model_type = "finetuned" if has_training_info else "downloaded"

        # Compute directory size
        size_bytes = sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())

        # Load adapter/model_id from model_info.json or training_info.json
        adapter: str | None = None
        model_id: str | None = None

        if has_model_info:
            try:
                info = json.loads((sub / "model_info.json").read_text(encoding="utf-8"))
                adapter = info.get("adapter_recommendation")
                model_id = info.get("repo_id")
            except Exception:
                pass
        elif has_training_info:
            try:
                info = json.loads((sub / "training_info.json").read_text(encoding="utf-8"))
                adapter = info.get("adapter")
                model_id = info.get("base_model") or info.get("model_id")
            except Exception:
                pass

        results.append({
            "name": sub.name,
            "path": str(sub),
            "type": model_type,
            "adapter": adapter,
            "size_bytes": size_bytes,
            "model_id": model_id,
        })

    return results

# ── DELETE /installed/{name} ───────────────────────────────────────────────────


@router.delete("/installed/{name}")
def delete_installed(name: str) -> dict:
    """Remove an installed model directory by name. Blocks path traversal."""
    # Validate: single path component, no slashes or '..'
    if "/" in name or "\\" in name or ".." in name or not name or name.startswith("."):
        raise HTTPException(400, f"Invalid model name {name!r}: must be a single directory name with no path separators or '..'")

    model_path = _MODELS_DIR / name

    # Extra safety: confirm resolved path is inside _MODELS_DIR
    try:
        model_path.resolve().relative_to(_MODELS_DIR.resolve())
    except ValueError:
        raise HTTPException(400, f"Path traversal detected for name {name!r}")

    if not model_path.exists():
        raise HTTPException(404, f"Installed model {name!r} not found")

    shutil.rmtree(model_path)
    return {"ok": True}

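The two-layer guard above, rejecting suspicious names and then confirming the resolved path stays inside the models directory, can be sketched in isolation (a sketch: `is_safe_model_name` is an illustrative name, not a function in the module):

```python
from pathlib import Path

def is_safe_model_name(base: Path, name: str) -> bool:
    # Layer 1: the name must be a single, non-hidden path component.
    if not name or name.startswith(".") or "/" in name or "\\" in name or ".." in name:
        return False
    # Layer 2: the resolved path must remain inside base (catches symlink tricks).
    try:
        (base / name).resolve().relative_to(base.resolve())
    except ValueError:
        return False
    return True

base = Path("/tmp/models")
```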
326  app/sft.py
@@ -1,326 +0,0 @@
"""Avocet — SFT candidate import and correction API.
|
|
||||||
|
|
||||||
All endpoints are registered on `router` (a FastAPI APIRouter).
|
|
||||||
api.py includes this router with prefix="/api/sft".
|
|
||||||
|
|
||||||
Module-level globals (_SFT_DATA_DIR, _SFT_CONFIG_DIR) follow the same
|
|
||||||
testability pattern as api.py — override them via set_sft_data_dir() and
|
|
||||||
set_sft_config_dir() in test fixtures.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
import logging
|
|
||||||
from datetime import datetime, timezone
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Literal
|
|
||||||
|
|
||||||
import yaml
|
|
||||||
from fastapi import APIRouter, HTTPException
|
|
||||||
from fastapi.responses import StreamingResponse
|
|
||||||
from pydantic import BaseModel
|
|
||||||
|
|
||||||
from app.utils import append_jsonl, read_jsonl, write_jsonl
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
_ROOT = Path(__file__).parent.parent
|
|
||||||
_SFT_DATA_DIR: Path = _ROOT / "data"
|
|
||||||
_SFT_CONFIG_DIR: Path | None = None
|
|
||||||
|
|
||||||
router = APIRouter()
|
|
||||||
|
|
||||||
|
|
||||||
# ── Testability seams ──────────────────────────────────────────────────────

def set_sft_data_dir(path: Path) -> None:
    global _SFT_DATA_DIR
    _SFT_DATA_DIR = path


def set_sft_config_dir(path: Path | None) -> None:
    global _SFT_CONFIG_DIR
    _SFT_CONFIG_DIR = path


# ── Internal helpers ───────────────────────────────────────────────────────

def _config_file() -> Path:
    if _SFT_CONFIG_DIR is not None:
        return _SFT_CONFIG_DIR / "label_tool.yaml"
    return _ROOT / "config" / "label_tool.yaml"


def _get_bench_results_dir() -> Path:
    f = _config_file()
    if not f.exists():
        return Path("/nonexistent-bench-results")
    try:
        raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
    except yaml.YAMLError as exc:
        logger.warning("Failed to parse SFT config %s: %s", f, exc)
        return Path("/nonexistent-bench-results")
    d = raw.get("sft", {}).get("bench_results_dir", "")
    return Path(d) if d else Path("/nonexistent-bench-results")


def _candidates_file() -> Path:
    return _SFT_DATA_DIR / "sft_candidates.jsonl"


def _approved_file() -> Path:
    return _SFT_DATA_DIR / "sft_approved.jsonl"


def _read_candidates() -> list[dict]:
    return read_jsonl(_candidates_file())


def _write_candidates(records: list[dict]) -> None:
    write_jsonl(_candidates_file(), records)


def _is_exportable(r: dict) -> bool:
    """Return True if an approved record is ready to include in SFT export."""
    return (
        r.get("status") == "approved"
        and bool(r.get("corrected_response"))
        and str(r["corrected_response"]).strip() != ""
    )

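A record only reaches the export set when it is approved and carries a non-blank correction. The predicate's behavior, restated with sample records (a sketch mirroring `_is_exportable` above):

```python
def is_exportable(r: dict) -> bool:
    # Approved AND a correction that survives whitespace stripping.
    return (
        r.get("status") == "approved"
        and bool(r.get("corrected_response"))
        and str(r["corrected_response"]).strip() != ""
    )

ok = is_exportable({"status": "approved", "corrected_response": "Fixed answer."})
blank = is_exportable({"status": "approved", "corrected_response": "   "})
wrong_status = is_exportable({"status": "needs_review", "corrected_response": "x"})
```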
# ── GET /runs ──────────────────────────────────────────────────────────────

@router.get("/runs")
def get_runs():
    """List available benchmark runs in the configured bench_results_dir."""
    from scripts.sft_import import discover_runs
    bench_dir = _get_bench_results_dir()
    existing = _read_candidates()
    # benchmark_run_id in each record equals the run's directory name by cf-orch convention
    imported_run_ids = {
        r["benchmark_run_id"]
        for r in existing
        if r.get("benchmark_run_id") is not None
    }
    runs = discover_runs(bench_dir)
    return [
        {
            "run_id": r["run_id"],
            "timestamp": r["timestamp"],
            "candidate_count": r["candidate_count"],
            "already_imported": r["run_id"] in imported_run_ids,
        }
        for r in runs
    ]


# ── POST /import ───────────────────────────────────────────────────────────

class ImportRequest(BaseModel):
    run_id: str


@router.post("/import")
def post_import(req: ImportRequest):
    """Import one benchmark run's sft_candidates.jsonl into the local data dir."""
    from scripts.sft_import import discover_runs, import_run
    bench_dir = _get_bench_results_dir()
    runs = discover_runs(bench_dir)
    run = next((r for r in runs if r["run_id"] == req.run_id), None)
    if run is None:
        raise HTTPException(404, f"Run {req.run_id!r} not found in bench_results_dir")
    return import_run(run["sft_path"], _SFT_DATA_DIR)

# ── GET /queue ─────────────────────────────────────────────────────────────

@router.get("/queue")
def get_queue(page: int = 1, per_page: int = 20):
    """Return paginated needs_review candidates."""
    records = _read_candidates()
    pending = [r for r in records if r.get("status") == "needs_review"]
    start = (page - 1) * per_page
    return {
        "items": pending[start:start + per_page],
        "total": len(pending),
        "page": page,
        "per_page": per_page,
    }

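The queue endpoint uses plain offset pagination: page 1 starts at index 0, and Python slicing clamps out-of-range offsets to an empty page rather than raising. The same shape in isolation:

```python
def paginate(items: list, page: int = 1, per_page: int = 20) -> dict:
    # (page - 1) * per_page is the zero-based offset; slicing never raises,
    # even when the offset runs past the end of the list.
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "total": len(items),
        "page": page,
        "per_page": per_page,
    }

first = paginate(list(range(45)), page=1, per_page=20)
last = paginate(list(range(45)), page=3, per_page=20)
empty = paginate(list(range(45)), page=9, per_page=20)
```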
# ── POST /submit ───────────────────────────────────────────────────────────

FailureCategory = Literal[
    "scoring_artifact",
    "style_violation",
    "partial_answer",
    "wrong_answer",
    "format_error",
    "hallucination",
]


class SubmitRequest(BaseModel):
    id: str
    action: Literal["correct", "discard", "flag"]
    corrected_response: str | None = None
    failure_category: FailureCategory | None = None


@router.post("/submit")
def post_submit(req: SubmitRequest):
    """Record a reviewer decision for one SFT candidate."""
    if req.action == "correct":
        if not req.corrected_response or not req.corrected_response.strip():
            raise HTTPException(422, "corrected_response must be non-empty when action is 'correct'")

    records = _read_candidates()
    idx = next((i for i, r in enumerate(records) if r.get("id") == req.id), None)
    if idx is None:
        raise HTTPException(404, f"Record {req.id!r} not found")

    record = records[idx]
    if record.get("status") != "needs_review":
        raise HTTPException(409, f"Record is not in needs_review state (current: {record.get('status')})")

    if req.action == "correct":
        records[idx] = {
            **record,
            "status": "approved",
            "corrected_response": req.corrected_response,
            "failure_category": req.failure_category,
        }
        _write_candidates(records)
        append_jsonl(_approved_file(), records[idx])
    elif req.action == "discard":
        records[idx] = {**record, "status": "discarded"}
        _write_candidates(records)
    else:  # flag
        records[idx] = {**record, "status": "model_rejected"}
        _write_candidates(records)

    return {"ok": True}

# ── POST /undo ─────────────────────────────────────────────────────────────

class UndoRequest(BaseModel):
    id: str


@router.post("/undo")
def post_undo(req: UndoRequest):
    """Restore a previously actioned candidate back to needs_review."""
    records = _read_candidates()
    idx = next((i for i, r in enumerate(records) if r.get("id") == req.id), None)
    if idx is None:
        raise HTTPException(404, f"Record {req.id!r} not found")

    record = records[idx]
    old_status = record.get("status")
    if old_status == "needs_review":
        raise HTTPException(409, "Record is already in needs_review state")

    records[idx] = {**record, "status": "needs_review", "corrected_response": None}
    _write_candidates(records)

    # If it was approved, remove from the approved file too
    if old_status == "approved":
        approved = read_jsonl(_approved_file())
        write_jsonl(_approved_file(), [r for r in approved if r.get("id") != req.id])

    return {"ok": True}

# ── GET /export ─────────────────────────────────────────────────────────────

@router.get("/export")
def get_export() -> StreamingResponse:
    """Stream approved records as SFT-ready JSONL for download."""
    exportable = [r for r in read_jsonl(_approved_file()) if _is_exportable(r)]

    def generate():
        for r in exportable:
            record = {
                "messages": r.get("prompt_messages", []) + [
                    {"role": "assistant", "content": r["corrected_response"]}
                ]
            }
            yield json.dumps(record) + "\n"

    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return StreamingResponse(
        generate(),
        media_type="application/x-ndjson",
        headers={
            "Content-Disposition": f'attachment; filename="sft_export_{timestamp}.jsonl"'
        },
    )

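Each exported line is a chat-format training record: the original prompt messages plus the corrected response appended as the final assistant turn. A sketch of the per-record transformation (`to_sft_record` is an illustrative name):

```python
import json

def to_sft_record(r: dict) -> str:
    # prompt_messages + one assistant turn carrying the human correction.
    record = {
        "messages": r.get("prompt_messages", []) + [
            {"role": "assistant", "content": r["corrected_response"]}
        ]
    }
    return json.dumps(record)

line = to_sft_record({
    "prompt_messages": [{"role": "user", "content": "Classify this email."}],
    "corrected_response": "job_alert",
})
```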
# ── GET /stats ──────────────────────────────────────────────────────────────

@router.get("/stats")
def get_stats() -> dict[str, object]:
    """Return counts by status, model, and task type."""
    records = _read_candidates()
    by_status: dict[str, int] = {}
    by_model: dict[str, int] = {}
    by_task_type: dict[str, int] = {}

    for r in records:
        status = r.get("status", "unknown")
        by_status[status] = by_status.get(status, 0) + 1
        model = r.get("model_name", "unknown")
        by_model[model] = by_model.get(model, 0) + 1
        task_type = r.get("task_type", "unknown")
        by_task_type[task_type] = by_task_type.get(task_type, 0) + 1

    approved = read_jsonl(_approved_file())
    export_ready = sum(1 for r in approved if _is_exportable(r))

    return {
        "total": len(records),
        "by_status": by_status,
        "by_model": by_model,
        "by_task_type": by_task_type,
        "export_ready": export_ready,
    }

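The three tally dicts above are hand-rolled; the same counts can be expressed with `collections.Counter`, which does the `.get(key, 0) + 1` bookkeeping internally (a sketch, not a suggested refactor of the endpoint):

```python
from collections import Counter

records = [
    {"status": "approved", "model_name": "qwen"},
    {"status": "approved", "model_name": "llama"},
    {"status": "discarded", "model_name": "qwen"},
]

# One Counter per dimension; missing keys default to "unknown" as in the endpoint.
by_status = Counter(r.get("status", "unknown") for r in records)
by_model = Counter(r.get("model_name", "unknown") for r in records)
```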
# ── GET /config ─────────────────────────────────────────────────────────────

@router.get("/config")
def get_sft_config() -> dict:
    """Return the current SFT configuration (bench_results_dir)."""
    f = _config_file()
    if not f.exists():
        return {"bench_results_dir": ""}
    try:
        raw = yaml.safe_load(f.read_text(encoding="utf-8")) or {}
    except yaml.YAMLError:
        return {"bench_results_dir": ""}
    sft_section = raw.get("sft") or {}
    return {"bench_results_dir": sft_section.get("bench_results_dir", "")}


class SftConfigPayload(BaseModel):
    bench_results_dir: str


@router.post("/config")
def post_sft_config(payload: SftConfigPayload) -> dict:
    """Write the bench_results_dir setting to the config file."""
    f = _config_file()
    f.parent.mkdir(parents=True, exist_ok=True)
    try:
        raw = yaml.safe_load(f.read_text(encoding="utf-8")) if f.exists() else {}
        raw = raw or {}
    except yaml.YAMLError:
        raw = {}
    raw["sft"] = {"bench_results_dir": payload.bench_results_dir}
    tmp = f.with_suffix(".tmp")
    tmp.write_text(yaml.dump(raw, allow_unicode=True, sort_keys=False), encoding="utf-8")
    tmp.rename(f)
    return {"ok": True}

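The write path above uses the write-to-temp-then-rename idiom, so a crash mid-write never leaves a truncated config behind. The same pattern in isolation (`atomic_write_text` is an illustrative helper name):

```python
from pathlib import Path
import tempfile

def atomic_write_text(path: Path, text: str) -> None:
    # Write the full content to a sibling temp file, then rename it into
    # place; rename is atomic on POSIX filesystems within one volume.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.rename(path)

target = Path(tempfile.mkdtemp()) / "label_tool.yaml"
atomic_write_text(target, "sft:\n  bench_results_dir: /bench\n")
```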
117  app/utils.py
@@ -1,117 +0,0 @@
"""Shared email utility functions for Avocet.
|
|
||||||
|
|
||||||
Pure-stdlib helpers extracted from the retired label_tool.py Streamlit app.
|
|
||||||
These are reused by the FastAPI backend and the test suite.
|
|
||||||
"""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
import re
|
|
||||||
from html.parser import HTMLParser
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Any
|
|
||||||
|
|
||||||
|
|
||||||
# ── HTML → plain-text extractor ──────────────────────────────────────────────
|
|
||||||
|
|
||||||
class _TextExtractor(HTMLParser):
|
|
||||||
"""Extract visible text from an HTML email body, preserving line breaks."""
|
|
||||||
_BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
|
|
||||||
_SKIP = {"script", "style", "head", "noscript"}
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
super().__init__(convert_charrefs=True)
|
|
||||||
self._parts: list[str] = []
|
|
||||||
self._depth_skip = 0
|
|
||||||
|
|
||||||
def handle_starttag(self, tag, attrs):
|
|
||||||
tag = tag.lower()
|
|
||||||
if tag in self._SKIP:
|
|
||||||
self._depth_skip += 1
|
|
||||||
elif tag in self._BLOCK:
|
|
||||||
self._parts.append("\n")
|
|
||||||
|
|
||||||
def handle_endtag(self, tag):
|
|
||||||
if tag.lower() in self._SKIP:
|
|
||||||
self._depth_skip = max(0, self._depth_skip - 1)
|
|
||||||
|
|
||||||
def handle_data(self, data):
|
|
||||||
if not self._depth_skip:
|
|
||||||
self._parts.append(data)
|
|
||||||
|
|
||||||
def get_text(self) -> str:
|
|
||||||
text = "".join(self._parts)
|
|
||||||
lines = [ln.strip() for ln in text.splitlines()]
|
|
||||||
return "\n".join(ln for ln in lines if ln)
|
|
||||||
|
|
||||||
|
|
||||||
def strip_html(html_str: str) -> str:
|
|
||||||
"""Convert HTML email body to plain text. Pure stdlib, no dependencies."""
|
|
||||||
try:
|
|
||||||
extractor = _TextExtractor()
|
|
||||||
extractor.feed(html_str)
|
|
||||||
return extractor.get_text()
|
|
||||||
except Exception:
|
|
||||||
return re.sub(r"<[^>]+>", " ", html_str).strip()
|
|
||||||
|
|
||||||
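The `except` branch above falls back to a crude regex strip: good enough to salvage the text when the parser chokes on malformed markup. What that fallback does, in isolation:

```python
import re

def regex_strip(html_str: str) -> str:
    # Replace every tag with a space, then trim; layout is lost but text survives.
    return re.sub(r"<[^>]+>", " ", html_str).strip()

out = regex_strip("<p>Hello <b>world</b></p>")
```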

def extract_body(msg: Any) -> str:
    """Return plain-text body. Strips HTML when no text/plain part exists."""
    if msg.is_multipart():
        html_fallback: str | None = None
        for part in msg.walk():
            ct = part.get_content_type()
            if ct == "text/plain":
                try:
                    charset = part.get_content_charset() or "utf-8"
                    return part.get_payload(decode=True).decode(charset, errors="replace")
                except Exception:
                    pass
            elif ct == "text/html" and html_fallback is None:
                try:
                    charset = part.get_content_charset() or "utf-8"
                    raw = part.get_payload(decode=True).decode(charset, errors="replace")
                    html_fallback = strip_html(raw)
                except Exception:
                    pass
        return html_fallback or ""
    else:
        try:
            charset = msg.get_content_charset() or "utf-8"
            raw = msg.get_payload(decode=True).decode(charset, errors="replace")
            if msg.get_content_type() == "text/html":
                return strip_html(raw)
            return raw
        except Exception:
            pass
        return ""

def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file, returning valid records. Skips blank lines and malformed JSON."""
    if not path.exists():
        return []
    records: list[dict] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            pass
    return records


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write records to a JSONL file, overwriting any existing content."""
    path.parent.mkdir(parents=True, exist_ok=True)
    content = "\n".join(json.dumps(r) for r in records)
    path.write_text(content + ("\n" if records else ""), encoding="utf-8")


def append_jsonl(path: Path, record: dict) -> None:
    """Append a single record to a JSONL file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

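The three helpers form a simple JSONL contract: write overwrites, append adds one line, and the reader skips blank and malformed lines. A round-trip sketch of the same contract (the tolerant-reader loop mirrors `read_jsonl` above):

```python
import json
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "demo.jsonl"

# Overwrite with two records, append a third, then inject one malformed line.
path.write_text("\n".join(json.dumps(r) for r in [{"id": 1}, {"id": 2}]) + "\n")
with open(path, "a", encoding="utf-8") as fh:
    fh.write(json.dumps({"id": 3}) + "\n")
    fh.write("{not json}\n")

# Tolerant reader: keep valid lines, drop blanks and broken ones.
records = []
for line in path.read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if not line:
        continue
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        pass
```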
@@ -21,8 +21,3 @@ accounts:
  # Optional: limit emails fetched per account per run (0 = unlimited)
  max_per_account: 500

  # cf-orch SFT candidate import — path to the bench_results/ directory
  # produced by circuitforge-orch's benchmark harness.
  sft:
    bench_results_dir: /path/to/circuitforge-orch/scripts/bench_results
95  docs/plans/2026-03-08-anime-animation-design.md  Normal file
@@ -0,0 +1,95 @@
# Anime.js Animation Integration — Design

**Date:** 2026-03-08
**Status:** Approved
**Branch:** feat/vue-label-tab

## Problem

The current animation system mixes CSS keyframes, CSS transitions, and imperative inline-style bindings across three files. The seams between systems produce:

- Abrupt ball pickup (instant scale/borderRadius jump)
- No spring snap-back on release to no target
- Rigid CSS dismissals with no timing control
- Bucket grid and badge pop on basic `@keyframes`

## Decision

Integrate **Anime.js v4** as a single animation layer. Vue reactive state is unchanged; Anime.js owns all DOM motion imperatively.

## Architecture

One new composable, minimal changes to two existing files, CSS cleanup in two files.

```
web/src/composables/useCardAnimation.ts   ← NEW
web/src/components/EmailCardStack.vue     ← modify
web/src/views/LabelView.vue               ← modify
```

**Data flow:**
```
pointer events → Vue refs (isHeld, deltaX, deltaY, dismissType)
        ↓ watched by
useCardAnimation(cardEl, stackEl, isHeld, ...)
        ↓ imperatively drives
Anime.js → DOM transforms
```

`useCardAnimation` is a pure side-effect composable — it returns nothing to the template. The `cardStyle` computed in `EmailCardStack.vue` is removed; Anime.js owns the element's transform directly.

## Animation Surfaces

### Pickup morph
```
animate(cardEl, { scale: 0.55, borderRadius: '50%', y: -80 }, { duration: 200, ease: spring(1, 80, 10) })
```
Replaces the instant CSS transform jump on `onPointerDown`.

### Drag tracking
Raw `cardEl.style.translate` update on `onPointerMove` — no animation, just position. Easing only at boundaries (pickup / release), not during active drag.

### Snap-back
```
animate(cardEl, { x: 0, y: 0, scale: 1, borderRadius: '1rem' }, { ease: spring(1, 80, 10) })
```
Fires on `onPointerUp` when no zone/bucket target was hit.

### Dismissals (replace CSS `@keyframes`)
- **fileAway** — `animate(cardEl, { y: '-120%', scale: 0.85, opacity: 0 }, { duration: 280, ease: 'out(3)' })`
- **crumple** — 2-step timeline: shrink + redden → `scale(0)` + rotate
- **slideUnder** — `animate(cardEl, { x: '110%', rotate: 5, opacity: 0 }, { duration: 260 })`

### Bucket grid rise
`animate(gridEl, { y: -8, opacity: 0.45 })` when `isHeld` flips to true; reversed on false. Spring easing.

### Badge pop
`animate(badgeEl, { scale: [0.6, 1], opacity: [0, 1] }, { ease: spring(1.5, 80, 8), duration: 300 })` triggered on badge mount via Vue's `onMounted` lifecycle hook in a `BadgePop` wrapper component or a `v-enter-active` transition hook.

## Constraints

### Reduced motion
`useCardAnimation` checks `motion.rich.value` before firing any Anime.js call. If false, all animations are skipped — instant state changes only. Consistent with the existing `useMotion` pattern.

### Bundle size
Anime.js v4 core is ~17KB gzipped. Only `animate`, `spring`, and `createTimeline` are imported — Vite ESM tree-shaking keeps the footprint minimal. The `draggable` module is not used.

### Tests
The existing `EmailCardStack.test.ts` tests assert emit behavior, not animation — they remain passing. Anime.js is mocked at module level in Vitest via `vi.mock('animejs')` where needed.

### CSS cleanup
Remove from `EmailCardStack.vue` and `LabelView.vue`:
- `@keyframes fileAway`, `crumple`, `slideUnder`
- `@keyframes badge-pop`
- `.dismiss-label`, `.dismiss-skip`, `.dismiss-discard` classes (Anime.js fires on element refs directly)
- The `dismissClass` computed in `EmailCardStack.vue`

## Files Changed

| File | Change |
|------|--------|
| `web/package.json` | Add `animejs` dependency |
| `web/src/composables/useCardAnimation.ts` | New — all Anime.js animation logic |
| `web/src/components/EmailCardStack.vue` | Remove `cardStyle` computed + dismiss classes; call `useCardAnimation` |
| `web/src/views/LabelView.vue` | Badge pop + bucket grid rise via Anime.js |
| `web/src/assets/avocet.css` | Remove any global animation keyframes if present |
573  docs/plans/2026-03-08-anime-animation-plan.md  Normal file
@@ -0,0 +1,573 @@
|
# Anime.js Animation Integration — Implementation Plan
|
||||||
|
|
||||||
|
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||||
|
|
||||||
|
**Goal:** Replace the current mixed CSS keyframes / inline-style animation system with Anime.js v4 for all card motion — pickup morph, drag tracking, spring snap-back, dismissals, bucket grid rise, and badge pop.
|
||||||
|
|
||||||
|
**Architecture:** A new `useCardAnimation` composable owns all Anime.js calls imperatively against DOM refs. Vue reactive state (`isHeld`, `deltaX`, `deltaY`, `dismissType`) is unchanged. `cardStyle` computed and `dismissClass` computed are deleted; Anime.js writes to the element directly.
|
||||||
|
|
||||||
|
**Tech Stack:** Anime.js v4 (`animejs`), Vue 3 Composition API, `@vue/test-utils` + Vitest for tests.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 1: Install Anime.js
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `web/package.json`
|
||||||
|
|
||||||
|
**Step 1: Install the package**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd /Library/Development/CircuitForge/avocet/web
|
||||||
|
npm install animejs
|
||||||
|
```
|
||||||
|
|
||||||
|
**Step 2: Verify the import resolves**
|
||||||
|
|
||||||
|
Create a throwaway check — open `web/src/main.ts` briefly and confirm:
|
||||||
|
```ts
|
||||||
|
import { animate, spring } from 'animejs'
|
||||||
|
```
|
||||||
|
resolves without error in the editor (TypeScript types ship with animejs v4).
|
||||||
|
Remove the import immediately after verifying — do not commit it.
|
||||||
|
|
||||||
|
**Step 3: Commit**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd /Library/Development/CircuitForge/avocet/web
|
||||||
|
git add package.json package-lock.json
|
||||||
|
git commit -m "feat(avocet): add animejs v4 dependency"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Task 2: Create `useCardAnimation` composable
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `web/src/composables/useCardAnimation.ts`
|
||||||
|
- Create: `web/src/composables/useCardAnimation.test.ts`
|
||||||
|
|
||||||
|
**Background — Anime.js v4 transform model:**
|
||||||
|
Anime.js v4 tracks `x`, `y`, `scale`, `rotate`, etc. as separate transform components internally.
|
||||||
|
Use `utils.set(el, props)` for instant (no-animation) property updates — this keeps the internal cache consistent.
|
||||||
|
Never mix direct `el.style.transform = "..."` with Anime.js on the same element, or the cache desyncs.
|
||||||
|
|
||||||
|
**Step 1: Write the failing tests**

`web/src/composables/useCardAnimation.test.ts`:

```ts
import { ref } from 'vue'
import { describe, it, expect, vi, beforeEach } from 'vitest'

// Mock animejs before importing the composable
vi.mock('animejs', () => ({
  animate: vi.fn(),
  spring: vi.fn(() => 'mock-spring'),
  utils: { set: vi.fn() },
}))

import { useCardAnimation } from './useCardAnimation'
import { animate, utils } from 'animejs'

const mockAnimate = animate as ReturnType<typeof vi.fn>
const mockSet = utils.set as ReturnType<typeof vi.fn>

function makeEl() {
  return document.createElement('div')
}

describe('useCardAnimation', () => {
  beforeEach(() => {
    vi.clearAllMocks()
  })

  it('pickup() calls animate with ball shape', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { pickup } = useCardAnimation(cardEl, motion)
    pickup()
    // animejs v4 animate() is 2-arg, so assert on exactly (el, params)
    expect(mockAnimate).toHaveBeenCalledWith(
      el,
      expect.objectContaining({ scale: 0.55, borderRadius: '50%' }),
    )
  })

  it('pickup() is a no-op when motion.rich is false', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(false) }
    const { pickup } = useCardAnimation(cardEl, motion)
    pickup()
    expect(mockAnimate).not.toHaveBeenCalled()
  })

  it('setDragPosition() calls utils.set with translated coords', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { setDragPosition } = useCardAnimation(cardEl, motion)
    setDragPosition(50, 30)
    // y = deltaY - 80 = 30 - 80 = -50
    expect(mockSet).toHaveBeenCalledWith(el, expect.objectContaining({ x: 50, y: -50 }))
  })

  it('snapBack() calls animate returning to card shape', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { snapBack } = useCardAnimation(cardEl, motion)
    snapBack()
    expect(mockAnimate).toHaveBeenCalledWith(
      el,
      expect.objectContaining({ x: 0, y: 0, scale: 1 }),
    )
  })

  it('animateDismiss("label") calls animate', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { animateDismiss } = useCardAnimation(cardEl, motion)
    animateDismiss('label')
    expect(mockAnimate).toHaveBeenCalled()
  })

  it('animateDismiss("discard") calls animate', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { animateDismiss } = useCardAnimation(cardEl, motion)
    animateDismiss('discard')
    expect(mockAnimate).toHaveBeenCalled()
  })

  it('animateDismiss("skip") calls animate', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(true) }
    const { animateDismiss } = useCardAnimation(cardEl, motion)
    animateDismiss('skip')
    expect(mockAnimate).toHaveBeenCalled()
  })

  it('animateDismiss is a no-op when motion.rich is false', () => {
    const el = makeEl()
    const cardEl = ref<HTMLElement | null>(el)
    const motion = { rich: ref(false) }
    const { animateDismiss } = useCardAnimation(cardEl, motion)
    animateDismiss('label')
    expect(mockAnimate).not.toHaveBeenCalled()
  })
})
```
**Step 2: Run tests to confirm they fail**

```bash
cd /Library/Development/CircuitForge/avocet/web
npm test -- useCardAnimation
```

Expected: FAIL — "Cannot find module './useCardAnimation'"

**Step 3: Implement the composable**

`web/src/composables/useCardAnimation.ts`:
```ts
import { type Ref } from 'vue'
import { animate, spring, utils } from 'animejs'

const BALL_SCALE = 0.55
const BALL_RADIUS = '50%'
const CARD_RADIUS = '1rem'
const PICKUP_Y_OFFSET = 80 // px above finger
const PICKUP_DURATION = 200
// NOTE: animejs v4 — spring() takes an object, not positional args
const SNAP_SPRING = spring({ mass: 1, stiffness: 80, damping: 10 })

interface Motion { rich: Ref<boolean> }

export function useCardAnimation(
  cardEl: Ref<HTMLElement | null>,
  motion: Motion,
) {
  function pickup() {
    if (!motion.rich.value || !cardEl.value) return
    // NOTE: animejs v4 — animate() is 2-arg; timing options merge into the params object
    animate(cardEl.value, {
      scale: BALL_SCALE,
      borderRadius: BALL_RADIUS,
      y: -PICKUP_Y_OFFSET,
      duration: PICKUP_DURATION,
      ease: SNAP_SPRING,
    })
  }

  function setDragPosition(dx: number, dy: number) {
    if (!cardEl.value) return
    utils.set(cardEl.value, { x: dx, y: dy - PICKUP_Y_OFFSET })
  }

  function snapBack() {
    if (!motion.rich.value || !cardEl.value) return
    // No duration — spring physics determines settling time
    animate(cardEl.value, {
      x: 0,
      y: 0,
      scale: 1,
      borderRadius: CARD_RADIUS,
      ease: SNAP_SPRING,
    })
  }

  function animateDismiss(type: 'label' | 'skip' | 'discard') {
    if (!motion.rich.value || !cardEl.value) return
    const el = cardEl.value
    if (type === 'label') {
      animate(el, { y: '-120%', scale: 0.85, opacity: 0, duration: 280, ease: 'out(3)' })
    } else if (type === 'discard') {
      // Two-step: crumple then shrink (keyframes array in params object)
      animate(el, {
        keyframes: [
          { scale: 0.95, rotate: 2, filter: 'brightness(0.6) sepia(1) hue-rotate(-20deg)', duration: 140 },
          { scale: 0, rotate: 8, opacity: 0, duration: 210 },
        ],
      })
    } else if (type === 'skip') {
      // 2-arg form here too — duration/ease live in the params object
      animate(el, { x: '110%', rotate: 5, opacity: 0, duration: 260, ease: 'out(2)' })
    }
  }

  return { pickup, setDragPosition, snapBack, animateDismiss }
}
```
**Step 4: Run tests — expect pass**

```bash
npm test -- useCardAnimation
```

Expected: All 8 tests PASS.

**Step 5: Commit**

```bash
git add web/src/composables/useCardAnimation.ts web/src/composables/useCardAnimation.test.ts
git commit -m "feat(avocet): add useCardAnimation composable with Anime.js"
```

---
## Task 3: Wire `useCardAnimation` into `EmailCardStack.vue`

**Files:**
- Modify: `web/src/components/EmailCardStack.vue`
- Modify: `web/src/components/EmailCardStack.test.ts`

**What changes:**
- Remove `cardStyle` computed and `:style="cardStyle"` binding
- Remove `dismissClass` computed and `:class="[dismissClass, ...]"` binding (keep `is-held`)
- Remove `deltaX`, `deltaY` reactive refs (position now owned by Anime.js)
- Call `pickup()` in `onPointerDown`, `setDragPosition()` in `onPointerMove`, `snapBack()` in `onPointerUp` (no-target path)
- Watch `props.dismissType` and call `animateDismiss()`
- Remove CSS `@keyframes fileAway`, `crumple`, `slideUnder` and their `.dismiss-*` rule blocks from `<style>`

**Step 1: Update the tests that check dismiss classes**

In `EmailCardStack.test.ts`, the 5 tests checking the `.dismiss-label`, `.dismiss-discard`, and `.dismiss-skip` classes are testing implementation (CSS class names), not behavior. Replace them with a single test that verifies `animateDismiss` is called:
```ts
// Add at the top of the file (after existing imports):
vi.mock('../composables/useCardAnimation', () => ({
  useCardAnimation: vi.fn(() => ({
    pickup: vi.fn(),
    setDragPosition: vi.fn(),
    snapBack: vi.fn(),
    animateDismiss: vi.fn(),
  })),
}))

import { useCardAnimation } from '../composables/useCardAnimation'
```

Replace the five `dismissType` class tests (lines 25–46) with:

```ts
it('calls animateDismiss with type when dismissType prop changes', async () => {
  const w = mount(EmailCardStack, { props: { item, isBucketMode: false, dismissType: null } })
  const { animateDismiss } = (useCardAnimation as ReturnType<typeof vi.fn>).mock.results[0].value
  await w.setProps({ dismissType: 'label' })
  await nextTick()
  expect(animateDismiss).toHaveBeenCalledWith('label')
})
```

Add the `nextTick` import to the test file header if not already present:

```ts
import { nextTick } from 'vue'
```
**Step 2: Run tests to confirm the replaced tests fail**

```bash
npm test -- EmailCardStack
```

Expected: FAIL — `animateDismiss` not called (not yet wired in the component)

**Step 3: Modify `EmailCardStack.vue`**

Script section changes:
```ts
// Change:
//   import { ref, computed } from 'vue'
// to:
import { ref, watch } from 'vue'

// Add import:
import { useCardAnimation } from '../composables/useCardAnimation'

// Remove these refs:
// const deltaX = ref(0)
// const deltaY = ref(0)

// Add after const motion = useMotion():
const { pickup, setDragPosition, snapBack, animateDismiss } = useCardAnimation(cardEl, motion)

// Add watcher:
watch(() => props.dismissType, (type) => {
  if (type) animateDismiss(type)
})

// Remove dismissClass computed entirely.

// In onPointerDown — add after isHeld.value = true:
pickup()

// In onPointerMove — replace deltaX/deltaY assignments with:
const dx = e.clientX - pickupX.value
const dy = e.clientY - pickupY.value
setDragPosition(dx, dy)
// (keep the zone/bucket detection that uses e.clientX/e.clientY — those stay the same)

// In onPointerUp — in the snap-back else branch, replace:
//   deltaX.value = 0
//   deltaY.value = 0
// with:
snapBack()
```
Template changes — on the `.card-wrapper` div:

```html
<!-- Remove: :class="[dismissClass, { 'is-held': isHeld }]" -->
<!-- Replace with: -->
:class="{ 'is-held': isHeld }"
<!-- Remove: :style="cardStyle" -->
```

CSS changes in `<style scoped>` — delete these entire blocks:

```css
@keyframes fileAway { ... }
@keyframes crumple { ... }
@keyframes slideUnder { ... }
.card-wrapper.dismiss-label { ... }
.card-wrapper.dismiss-discard { ... }
.card-wrapper.dismiss-skip { ... }
```

Also delete `--card-dismiss` and `--card-skip` CSS var usages if present.
**Step 4: Run all tests**

```bash
npm test
```

Expected: All pass (both `useCardAnimation.test.ts` and `EmailCardStack.test.ts`).

**Step 5: Commit**

```bash
git add web/src/components/EmailCardStack.vue web/src/components/EmailCardStack.test.ts
git commit -m "feat(avocet): wire Anime.js card animation into EmailCardStack"
```

---
## Task 4: Bucket grid rise animation

**Files:**
- Modify: `web/src/views/LabelView.vue`

**What changes:**
Replace the CSS class-toggle animation on `.bucket-grid-footer.grid-active` with an Anime.js watch in `LabelView.vue`. The `position: sticky → fixed` switch stays as a CSS class (`position` can't be animated), but `translateY` and `opacity` move to Anime.js.
**Step 1: Add gridEl ref and import animate**

In `LabelView.vue` `<script setup>`:

```ts
// Add to imports:
import { ref, onMounted, onUnmounted, watch } from 'vue'
import { animate, spring } from 'animejs'

// Add ref:
const gridEl = ref<HTMLElement | null>(null)
```
**Step 2: Add watcher for isHeld**

```ts
watch(isHeld, (held) => {
  if (!motion.rich.value || !gridEl.value) return
  // animejs v4: 2-arg animate, spring() takes an object
  const ease = spring({ mass: 1, stiffness: 80, damping: 10 })
  animate(gridEl.value,
    held
      ? { y: -8, opacity: 0.45, ease, duration: 250 }
      : { y: 0, opacity: 1, ease, duration: 250 },
  )
})
```
**Step 3: Wire ref in template**

On the `.bucket-grid-footer` div:

```html
<div ref="gridEl" class="bucket-grid-footer" :class="{ 'grid-active': isHeld }">
```

**Step 4: Remove CSS transition from `.bucket-grid-footer`**

In `LabelView.vue <style scoped>`, delete the `transition:` declaration from `.bucket-grid-footer`:

```css
/* DELETE this declaration: */
transition: transform 250ms cubic-bezier(0.34, 1.56, 0.64, 1),
            opacity 200ms ease,
            background 200ms ease;
```

Keep the `.bucket-grid-footer.grid-active` rules (`transform: translateY(-8px)`, `opacity: 0.45`) as-is: they are the fallback for reduced-motion (and no-JS) users. The watcher's guard (`if (!motion.rich.value)`) means those users never hit Anime.js, so the CSS class handles them instantly.
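For reference, the guard itself reduces to a tiny pure function (hypothetical sketch — the real `useMotion` composable presumably wraps `window.matchMedia` in a reactive ref; `richMotionEnabled` is an illustrative name):

```typescript
// Hypothetical model of the motion.rich guard (not the real useMotion composable).
type MediaMatcher = (query: string) => boolean

function richMotionEnabled(matches: MediaMatcher): boolean {
  // Rich animations only when the user has NOT requested reduced motion.
  return !matches('(prefers-reduced-motion: reduce)')
}

// Browser usage: richMotionEnabled(q => window.matchMedia(q).matches)
console.log(richMotionEnabled(() => true))  // reduced motion requested → false
console.log(richMotionEnabled(() => false)) // no preference → true
```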
**Step 5: Run tests**

```bash
npm test
```

Expected: All pass (LabelView has no dedicated tests, but the full suite should be green).

**Step 6: Commit**

```bash
git add web/src/views/LabelView.vue
git commit -m "feat(avocet): animate bucket grid rise with Anime.js spring"
```

---
## Task 5: Badge pop animation

**Files:**
- Modify: `web/src/views/LabelView.vue`

**What changes:**
Replace `@keyframes badge-pop` (scale + opacity keyframes) with a Vue `<Transition>` `@enter` hook that calls `animate()`. Badges already appear/disappear via `v-if`, so they have a natural mount/unmount lifecycle.

**Step 1: Wrap each badge in a `<Transition>`**

In the `LabelView.vue` template, each badge `<span v-if="...">` gets wrapped:

```html
<Transition @enter="onBadgeEnter" :css="false">
  <span v-if="onRoll" class="badge badge-roll">🔥 On a roll!</span>
</Transition>
<Transition @enter="onBadgeEnter" :css="false">
  <span v-if="speedRound" class="badge badge-speed">⚡ Speed round!</span>
</Transition>
<!-- repeat for all 6 badges -->
```

`:css="false"` tells Vue not to apply any CSS transition classes — Anime.js owns the enter animation entirely.
**Step 2: Add `onBadgeEnter` hook**

```ts
function onBadgeEnter(el: Element, done: () => void) {
  if (!motion.rich.value) { done(); return }
  // animejs v4: 2-arg animate; spring() takes an object, not positional args
  animate(el as HTMLElement, {
    scale: [0.6, 1],
    opacity: [0, 1],
    ease: spring({ mass: 1.5, stiffness: 80, damping: 8 }),
    duration: 300,
    onComplete: done,
  })
}
```
**Step 3: Remove `@keyframes badge-pop` from CSS**

In `LabelView.vue <style scoped>`:

```css
/* DELETE: */
@keyframes badge-pop {
  from { transform: scale(0.6); opacity: 0; }
  to { transform: scale(1); opacity: 1; }
}

/* DELETE the animation line from .badge: */
animation: badge-pop 0.3s cubic-bezier(0.34, 1.56, 0.64, 1);
```

**Step 4: Run tests**

```bash
npm test
```

Expected: All pass.

**Step 5: Commit**

```bash
git add web/src/views/LabelView.vue
git commit -m "feat(avocet): badge pop via Anime.js spring transition hook"
```

---
## Task 6: Build and smoke test

**Step 1: Build the SPA**

```bash
cd /Library/Development/CircuitForge/avocet
./manage.sh start-api
```

(This builds Vue + starts FastAPI on port 8503.)

**Step 2: Open the app**

```bash
./manage.sh open-api
```

**Step 3: Manual smoke test checklist**

- [ ] Pick up a card — ball morph is smooth (not an instant jump)
- [ ] Drag the ball around — follows the finger with no lag
- [ ] Release in center — springs back to card shape with a bounce
- [ ] Release in left zone — discard fires (card crumples)
- [ ] Release in right zone — skip fires (card slides right)
- [ ] Release on a bucket — label fires (card files up)
- [ ] Fling left fast — discard fires
- [ ] Bucket grid rises smoothly on pickup, falls on release
- [ ] Badge (label 10 in a row for 🔥) pops in with spring
- [ ] Reduced motion: toggle in system settings → no animations, instant behavior
- [ ] Keyboard labels (1–9) still work (pointer events unchanged)

**Step 4: Final commit if all green**

```bash
git add -A
git commit -m "feat(avocet): complete Anime.js animation integration"
```
docs/superpowers/plans/2026-03-15-finetune-classifier.md (new file, 1861 lines) — file diff suppressed because it is too large

docs/superpowers/specs/2026-03-15-finetune-classifier-design.md (new file, 254 lines)
@@ -0,0 +1,254 @@
# Fine-tune Email Classifier — Design Spec

**Date:** 2026-03-15
**Status:** Approved
**Scope:** Avocet — `scripts/`, `app/api.py`, `web/src/views/BenchmarkView.vue`, `environment.yml`

---

## Problem

The benchmark baseline shows zero-shot macro-F1 of 0.366 for the best models (`deberta-zeroshot`, `deberta-base-anli`). Zero-shot inference cannot improve with more labeled data. Fine-tuning the fastest models (`deberta-small` at 111ms, `bge-m3` at 123ms) on the growing labeled dataset is the path to meaningful accuracy gains.

---
## Constraints

- 501 labeled samples after dropping 2 non-canonical `profile_alert` rows
- Heavy class imbalance: `digest` 29%, `neutral` 26%, `new_lead` 2.6%, `survey_received` 3%
- 8.2 GB VRAM (shared with Peregrine vLLM during dev)
- Target models: `cross-encoder/nli-deberta-v3-small` (100M params), `MoritzLaurer/bge-m3-zeroshot-v2.0` (600M params)
- Output: local `models/avocet-{name}/` directory
- UI-triggerable via web interface (SSE streaming log)
- Stack: transformers 4.57.3, torch 2.10.0, accelerate 1.12.0, sklearn, CUDA 8.2GB

---

## Environment changes

`environment.yml` must add:
- `scikit-learn` — required for `train_test_split(stratify=...)` and `f1_score`
- `peft` is NOT used by this spec; it is available in the env but not required here

---
## Architecture

### New file: `scripts/finetune_classifier.py`

CLI entry point for fine-tuning. All prints use `flush=True` so stdout is SSE-streamable.

```
python scripts/finetune_classifier.py --model deberta-small [--epochs 5]
```

Supported `--model` values: `deberta-small`, `bge-m3`

**Model registry** (internal to this script):

| Key | Base model ID | Max tokens | fp16 | Batch size | Grad accum steps | Gradient checkpointing |
|-----|--------------|------------|------|------------|------------------|------------------------|
| `deberta-small` | `cross-encoder/nli-deberta-v3-small` | 512 | No | 16 | 1 | No |
| `bge-m3` | `MoritzLaurer/bge-m3-zeroshot-v2.0` | 512 | Yes | 4 | 4 | Yes |

`bge-m3` uses `fp16=True` (halves optimizer state from ~4.8GB to ~2.4GB) with batch size 4 + gradient accumulation 4 = effective batch 16, matching `deberta-small`. These settings are required to fit within 8.2GB VRAM. Still stop Peregrine vLLM before running bge-m3 fine-tuning.
### Modified: `scripts/classifier_adapters.py`

Add `FineTunedAdapter(ClassifierAdapter)`:
- Takes `model_dir: str` (path to a `models/avocet-*/` checkpoint)
- Loads via `pipeline("text-classification", model=model_dir)`
- `classify()` input format: **`f"{subject} [SEP] {body[:400]}"`** — must match the training format exactly. Do NOT use the zero-shot adapters' `f"Subject: {subject}\n\n{body[:600]}"` format; distribution shift will degrade accuracy.
- Returns the top predicted label directly (single forward pass — no per-label NLI scoring loop)
- Expected inference speed: ~10–20ms/email vs 111–338ms for zero-shot
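The format constraint is the part most worth pinning down; a minimal sketch of the input builder (the helper name `adapter_input` is illustrative, not from the codebase):

```python
def adapter_input(subject: str, body: str) -> str:
    # Must match the training format exactly: "{subject} [SEP] {body[:400]}".
    # Using the zero-shot adapters' "Subject: ..." format here would be a
    # train/inference distribution shift.
    return f"{subject} [SEP] {body[:400]}"

print(adapter_input("Interview invite", "x" * 500))  # body truncated to 400 chars
```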
### Modified: `scripts/benchmark_classifier.py`

At startup, scan `models/` for subdirectories containing `training_info.json`. Register each as a dynamic entry in the model registry using `FineTunedAdapter`. Silently skip if `models/` does not exist. Existing CLI behaviour is unchanged.
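A sketch of that discovery scan, assuming the `models/<name>/training_info.json` layout described below (the function name is illustrative):

```python
import json
from pathlib import Path

def discover_finetuned(models_dir: str = "models") -> list[dict]:
    # Collect training_info.json from each model subdirectory; sorted for
    # deterministic registry order. Silently skip when models/ is absent.
    root = Path(models_dir)
    if not root.exists():
        return []
    return [
        json.loads(info.read_text())
        for info in sorted(root.glob("*/training_info.json"))
    ]
```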
### Modified: `app/api.py`

Two new GET endpoints (GET required for `EventSource` compatibility):

**`GET /api/finetune/status`**
Scans `models/` for `training_info.json` files. Returns:
```json
[
  {
    "name": "avocet-deberta-small",
    "base_model": "cross-encoder/nli-deberta-v3-small",
    "val_macro_f1": 0.712,
    "timestamp": "2026-03-15T12:00:00Z",
    "sample_count": 401
  }
]
```
Returns `[]` if no fine-tuned models exist.
**`GET /api/finetune/run?model=deberta-small&epochs=5`**
Spawns `finetune_classifier.py` via the `job-seeker-classifiers` Python binary. Streams stdout as SSE `{"type":"progress","message":"..."}` events. Emits `{"type":"complete"}` on clean exit, `{"type":"error","message":"..."}` on non-zero exit. Same implementation pattern as `/api/benchmark/run`.
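The SSE wire framing those events ride on is simple enough to pin down (sketch; the `sse_event` helper is illustrative, and the FastAPI streaming-response plumbing is omitted):

```python
import json

def sse_event(payload: dict) -> str:
    # EventSource wire format: each event is "data: <json>" followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"

print(sse_event({"type": "progress", "message": "epoch 1/5"}), end="")
```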
### Modified: `web/src/views/BenchmarkView.vue`

**Trained models badge row** (top of view, shown only when fine-tuned models exist):
Shows each fine-tuned model name + val macro-F1 chip. Fetched from `/api/finetune/status` on mount.

**Fine-tune section** (collapsible, below the benchmark charts):
- Dropdown: `deberta-small` | `bge-m3`
- Number input: epochs (default 5, range 1–20)
- Run button → streams into the existing log component
- On `complete`: auto-triggers `/api/benchmark/run` (with `--save`) so the charts update immediately

---
## Training Pipeline

### Data preparation

1. Load `data/email_score.jsonl`
2. Drop rows where `label` is not in the canonical `LABELS` (removes `profile_alert` etc.)
3. Check for classes with < 2 **total** samples (before any split). Drop those classes and warn. Additionally warn — but do not skip — classes with < 5 training samples, noting eval F1 for those classes will be unreliable.
4. Input text: `f"{subject} [SEP] {body[:400]}"` — fits within 512 tokens for both target models
5. Stratified 80/20 train/val split via `sklearn.model_selection.train_test_split(stratify=labels)`
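Step 3's drop/warn logic can be sketched like this (assumed helper, not from the codebase):

```python
from collections import Counter

def partition_classes(labels: list[str]) -> tuple[list[str], set[str], set[str]]:
    counts = Counter(labels)
    dropped = {c for c, n in counts.items() if n < 2}        # can't stratify: drop + warn
    low_data = {c for c, n in counts.items() if 2 <= n < 5}  # keep, but warn: F1 unreliable
    kept = [l for l in labels if l not in dropped]
    return kept, dropped, low_data

kept, dropped, low = partition_classes(["digest"] * 6 + ["new_lead"] * 3 + ["survey_received"])
print(dropped, low)  # {'survey_received'} {'new_lead'}
```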
### Class weighting

Compute per-class weights: `total_samples / (n_classes × class_count)`. Pass them to a `WeightedTrainer` subclass:
```python
import torch.nn.functional as F
from transformers import Trainer

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs is required — it absorbs num_items_in_batch, added in Transformers 4.38.
        # Do not remove it; removing it causes a TypeError on the first training step.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Move class_weights to the same device as the logits — required for GPU training.
        # class_weights is created on CPU; logits are on cuda:0 during training.
        weight = self.class_weights.to(outputs.logits.device)
        loss = F.cross_entropy(outputs.logits, labels, weight=weight)
        return (loss, outputs) if return_outputs else loss
```
### Model setup

```python
AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=10,
    ignore_mismatched_sizes=True,  # see note below
    id2label=id2label,
    label2id=label2id,
)
```
**Note on `ignore_mismatched_sizes=True`:** The pretrained NLI head is a 3-class linear projection. It mismatches the 10-class head constructed by `num_labels=10`, so its weights are skipped during loading. PyTorch initializes the new head from scratch using the model's default init scheme. The backbone weights load normally. Do not set this to `False` — it will raise a shape error.
### Training config and `compute_metrics`

The Trainer requires a `compute_metrics` callback that takes an `EvalPrediction` (logits + label_ids) and returns a dict with a `macro_f1` key. This is distinct from the existing `compute_metrics` in `classifier_adapters.py` (which operates on string predictions):

```python
from sklearn.metrics import accuracy_score, f1_score
from transformers import EvalPrediction

def compute_metrics_for_trainer(eval_pred: EvalPrediction) -> dict:
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
        "accuracy": accuracy_score(labels, preds),
    }
```
`TrainingArguments` must include:
- `load_best_model_at_end=True`
- `metric_for_best_model="macro_f1"`
- `greater_is_better=True`

These are required for `EarlyStoppingCallback` to work correctly. Without `load_best_model_at_end=True`, `EarlyStoppingCallback` raises `AssertionError` on init.

| Hyperparameter | deberta-small | bge-m3 |
|----------------|---------------|--------|
| Epochs | 5 (default, CLI-overridable) | 5 |
| Batch size | 16 | 4 |
| Gradient accumulation | 1 | 4 (effective batch = 16) |
| Learning rate | 2e-5 | 2e-5 |
| LR schedule | Linear with 10% warmup | same |
| Optimizer | AdamW | AdamW |
| fp16 | No | Yes |
| Gradient checkpointing | No | Yes |
| Eval strategy | Every epoch | Every epoch |
| Best checkpoint | By `macro_f1` | same |
| Early stopping patience | 3 epochs | 3 epochs |
|
|
||||||
|
Saved to `models/avocet-{name}/`:
|
||||||
|
- Model weights + tokenizer (standard HuggingFace format)
|
||||||
|
- `training_info.json`:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"name": "avocet-deberta-small",
|
||||||
|
"base_model_id": "cross-encoder/nli-deberta-v3-small",
|
||||||
|
"timestamp": "2026-03-15T12:00:00Z",
|
||||||
|
"epochs_run": 5,
|
||||||
|
"val_macro_f1": 0.712,
|
||||||
|
"val_accuracy": 0.798,
|
||||||
|
"sample_count": 401,
|
||||||
|
"label_counts": { "digest": 116, "neutral": 104, ... }
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Data Flow

```
email_score.jsonl
      │
      ▼
finetune_classifier.py
      ├── drop non-canonical labels
      ├── check for < 2 total samples per class (drop + warn)
      ├── stratified 80/20 split
      ├── tokenize (subject [SEP] body[:400])
      ├── compute class weights
      ├── WeightedTrainer + EarlyStoppingCallback
      └── save → models/avocet-{name}/
              │
              ├── FineTunedAdapter (classifier_adapters.py)
              │     ├── pipeline("text-classification")
              │     ├── input: subject [SEP] body[:400]  ← must match training format
              │     └── ~10–20ms/email inference
              │
              └── training_info.json
                    └── /api/finetune/status
                          └── BenchmarkView badge row
```

---
## Error Handling

- **Insufficient data (< 2 total samples in a class):** Drop the class before the split; print a warning with the class name and count.
- **Low-data warning (< 5 training samples in a class):** Warn but continue; note that eval F1 for that class will be unreliable.
- **VRAM OOM on bge-m3:** Surface as a clear SSE error message. Suggest stopping Peregrine vLLM first (it holds ~5.7GB).
- **Missing score file:** Raise `FileNotFoundError` with an actionable message (same pattern as `load_scoring_jsonl`).
- **Model dir already exists:** Overwrite with a warning log line. Re-running always produces a fresh checkpoint.

---
## Testing
|
||||||
|
|
||||||
|
- Unit test `WeightedTrainer.compute_loss` with a mock model and known label distribution — verify weighted loss differs from unweighted; verify `**kwargs` does not raise `TypeError`
|
||||||
|
- Unit test `compute_metrics_for_trainer` — verify `macro_f1` key in output, correct value on known inputs
|
||||||
|
- Unit test `FineTunedAdapter.classify` with a mock pipeline — verify it returns a string from `LABELS` using `subject [SEP] body[:400]` format
|
||||||
|
- Unit test auto-discovery in `benchmark_classifier.py` — mock `models/` dir with two `training_info.json` files, verify both appear in the active registry
|
||||||
|
- Integration test: fine-tune on `data/email_score.jsonl.example` (8 samples, 5 of 10 labels represented, 1 epoch, `--model deberta-small`). The 5 missing labels trigger the `< 2 total samples` drop path — the test must verify the drop warning is emitted for each missing label rather than treating it as a failure. Verify `models/avocet-deberta-small/training_info.json` is written with correct keys.
---

## Out of Scope

- Pushing fine-tuned weights to HuggingFace Hub (future)
- Cross-validation or k-fold evaluation (future — dataset too small to be meaningful now)
- Hyperparameter search (future)
- LoRA/PEFT adapter fine-tuning (future — relevant if model sizes grow beyond available VRAM)
- Fine-tuning models other than `deberta-small` and `bge-m3`
234
manage.sh

@@ -19,8 +19,9 @@ LOG_FILE="${LOG_DIR}/label_tool.log"
 DEFAULT_PORT=8503
 
 CONDA_BASE="${CONDA_BASE:-/devl/miniconda3}"
-ENV_UI="${AVOCET_ENV:-cf}"
+ENV_UI="job-seeker"
 ENV_BM="job-seeker-classifiers"
+STREAMLIT="${CONDA_BASE}/envs/${ENV_UI}/bin/streamlit"
 PYTHON_BM="${CONDA_BASE}/envs/${ENV_BM}/bin/python"
 PYTHON_UI="${CONDA_BASE}/envs/${ENV_UI}/bin/python"

@@ -78,11 +79,13 @@ usage() {
     echo ""
     echo " Usage: ./manage.sh <command> [args]"
     echo ""
-    echo " Vue UI + FastAPI:"
-    echo -e " ${GREEN}start${NC} Build Vue SPA + start FastAPI on port 8503"
-    echo -e " ${GREEN}stop${NC} Stop FastAPI server"
-    echo -e " ${GREEN}restart${NC} Stop + rebuild + restart FastAPI server"
-    echo -e " ${GREEN}open${NC} Open Vue UI in browser (http://localhost:8503)"
+    echo " Label tool:"
+    echo -e " ${GREEN}start${NC} Start label tool UI (port collision-safe)"
+    echo -e " ${GREEN}stop${NC} Stop label tool UI"
+    echo -e " ${GREEN}restart${NC} Restart label tool UI"
+    echo -e " ${GREEN}status${NC} Show running state and port"
+    echo -e " ${GREEN}logs${NC} Tail label tool log output"
+    echo -e " ${GREEN}open${NC} Open label tool in browser"
     echo ""
     echo " Benchmark:"
     echo -e " ${GREEN}benchmark [args]${NC} Run benchmark_classifier.py (args passed through)"

@@ -90,8 +93,13 @@ usage() {
     echo -e " ${GREEN}score [args]${NC} Shortcut: --score [args]"
     echo -e " ${GREEN}compare [args]${NC} Shortcut: --compare [args]"
     echo ""
+    echo " Vue API:"
+    echo -e " ${GREEN}start-api${NC} Build Vue SPA + start FastAPI on port 8503"
+    echo -e " ${GREEN}stop-api${NC} Stop FastAPI server"
+    echo -e " ${GREEN}restart-api${NC} Stop + rebuild + restart FastAPI server"
+    echo -e " ${GREEN}open-api${NC} Open Vue UI in browser (http://localhost:8503)"
+    echo ""
     echo " Dev:"
-    echo -e " ${GREEN}dev${NC} Hot-reload: uvicorn --reload (:8503) + Vite HMR (:5173)"
     echo -e " ${GREEN}test${NC} Run pytest suite"
     echo ""
     echo " Port defaults to ${DEFAULT_PORT}; auto-increments if occupied."

@@ -113,102 +121,102 @@ shift || true
 case "$CMD" in
 
     start)
-        API_PID_FILE=".avocet-api.pid"
-        API_PORT=8503
-        if [[ -f "$API_PID_FILE" ]] && kill -0 "$(<"$API_PID_FILE")" 2>/dev/null; then
-            warn "API already running (PID $(<"$API_PID_FILE")) → http://localhost:${API_PORT}"
+        pid=$(_running_pid)
+        if [[ -n "$pid" ]]; then
+            port=$(_running_port)
+            warn "Already running (PID ${pid}) on port ${port} → http://localhost:${port}"
             exit 0
         fi
 
+        if [[ ! -x "$STREAMLIT" ]]; then
+            error "Streamlit not found at ${STREAMLIT}\nActivate env: conda run -n ${ENV_UI} ..."
+        fi
+
+        port=$(_find_free_port "$DEFAULT_PORT")
         mkdir -p "$LOG_DIR"
-        API_LOG="${LOG_DIR}/api.log"
-        info "Building Vue SPA…"
-        (cd web && npm run build) >> "$API_LOG" 2>&1
-        info "Starting FastAPI on port ${API_PORT}…"
-        nohup "$PYTHON_UI" -m uvicorn app.api:app \
-            --host 0.0.0.0 --port "$API_PORT" \
-            >> "$API_LOG" 2>&1 &
-        echo $! > "$API_PID_FILE"
-        # Poll until port is actually bound (up to 10 s), not just process alive
-        for _i in $(seq 1 20); do
-            sleep 0.5
-            if (echo "" >/dev/tcp/127.0.0.1/"$API_PORT") 2>/dev/null; then
-                success "Avocet started → http://localhost:${API_PORT} (PID $(<"$API_PID_FILE"))"
-                break
-            fi
-            if ! kill -0 "$(<"$API_PID_FILE")" 2>/dev/null; then
-                rm -f "$API_PID_FILE"
-                error "Server died during startup. Check ${API_LOG}"
-            fi
-        done
-        if ! (echo "" >/dev/tcp/127.0.0.1/"$API_PORT") 2>/dev/null; then
-            error "Server did not bind to port ${API_PORT} within 10 s. Check ${API_LOG}"
-        fi
+        info "Starting label tool on port ${port}…"
+        nohup "$STREAMLIT" run app/label_tool.py \
+            --server.port "$port" \
+            --server.headless true \
+            --server.fileWatcherType none \
+            >"$LOG_FILE" 2>&1 &
+
+        pid=$!
+        echo "$pid" > "$PID_FILE"
+        echo "$port" > "$PORT_FILE"
+
+        # Wait briefly and confirm the process survived
+        sleep 1
+        if kill -0 "$pid" 2>/dev/null; then
+            success "Avocet label tool started → http://localhost:${port} (PID ${pid})"
+            success "Logs: ${LOG_FILE}"
+        else
+            rm -f "$PID_FILE" "$PORT_FILE"
+            error "Process died immediately. Check ${LOG_FILE} for details."
+        fi
         ;;
 
     stop)
-        API_PID_FILE=".avocet-api.pid"
-        if [[ ! -f "$API_PID_FILE" ]]; then
+        pid=$(_running_pid)
+        if [[ -z "$pid" ]]; then
             warn "Not running."
             exit 0
         fi
-        PID="$(<"$API_PID_FILE")"
-        if kill -0 "$PID" 2>/dev/null; then
-            kill "$PID" && rm -f "$API_PID_FILE"
-            success "Stopped (PID ${PID})."
-        else
-            warn "Stale PID file (process ${PID} not running). Cleaning up."
-            rm -f "$API_PID_FILE"
-        fi
+        info "Stopping label tool (PID ${pid})…"
+        kill "$pid"
+        # Wait up to 5 s for clean exit
+        for _ in $(seq 1 10); do
+            kill -0 "$pid" 2>/dev/null || break
+            sleep 0.5
+        done
+        if kill -0 "$pid" 2>/dev/null; then
+            warn "Process did not exit cleanly; sending SIGKILL…"
+            kill -9 "$pid" 2>/dev/null || true
+        fi
+        rm -f "$PID_FILE" "$PORT_FILE"
+        success "Stopped."
         ;;
 
     restart)
-        bash "$0" stop
+        pid=$(_running_pid)
+        if [[ -n "$pid" ]]; then
+            info "Stopping existing process (PID ${pid})…"
+            kill "$pid"
+            for _ in $(seq 1 10); do
+                kill -0 "$pid" 2>/dev/null || break
+                sleep 0.5
+            done
+            kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null || true
+            rm -f "$PID_FILE" "$PORT_FILE"
+        fi
         exec bash "$0" start
         ;;
 
-    dev)
-        API_PORT=8503
-        VITE_PORT=5173
-        DEV_API_PID_FILE=".avocet-dev-api.pid"
-        mkdir -p "$LOG_DIR"
-        DEV_API_LOG="${LOG_DIR}/dev-api.log"
-
-        if [[ -f "$DEV_API_PID_FILE" ]] && kill -0 "$(<"$DEV_API_PID_FILE")" 2>/dev/null; then
-            warn "Dev API already running (PID $(<"$DEV_API_PID_FILE"))"
-        else
-            info "Starting uvicorn with --reload on port ${API_PORT}…"
-            nohup "$PYTHON_UI" -m uvicorn app.api:app \
-                --host 0.0.0.0 --port "$API_PORT" --reload \
-                >> "$DEV_API_LOG" 2>&1 &
-            echo $! > "$DEV_API_PID_FILE"
-            # Wait for API to bind
-            for _i in $(seq 1 20); do
-                sleep 0.5
-                (echo "" >/dev/tcp/127.0.0.1/"$API_PORT") 2>/dev/null && break
-                if ! kill -0 "$(<"$DEV_API_PID_FILE")" 2>/dev/null; then
-                    rm -f "$DEV_API_PID_FILE"
-                    error "Dev API died during startup. Check ${DEV_API_LOG}"
-                fi
-            done
-            success "API (hot-reload) → http://localhost:${API_PORT}"
-        fi
-
-        # Kill API on exit (Ctrl+C or Vite exits)
-        _cleanup_dev() {
-            local pid
-            pid=$(<"$DEV_API_PID_FILE" 2>/dev/null || true)
-            [[ -n "$pid" ]] && kill "$pid" 2>/dev/null && rm -f "$DEV_API_PID_FILE"
-            info "Dev servers stopped."
-        }
-        trap _cleanup_dev EXIT INT TERM
-
-        info "Starting Vite HMR on port ${VITE_PORT} (proxy /api → :${API_PORT})…"
-        success "Frontend (HMR) → http://localhost:${VITE_PORT}"
-        (cd web && npm run dev -- --host 0.0.0.0 --port "$VITE_PORT")
+    status)
+        pid=$(_running_pid)
+        if [[ -n "$pid" ]]; then
+            port=$(_running_port)
+            success "Running — PID ${pid} port ${port} → http://localhost:${port}"
+        else
+            warn "Not running."
+        fi
+        ;;
+
+    logs)
+        if [[ ! -f "$LOG_FILE" ]]; then
+            warn "No log file found at ${LOG_FILE}. Has the tool been started?"
+            exit 0
+        fi
+        info "Tailing ${LOG_FILE} (Ctrl-C to stop)"
+        tail -f "$LOG_FILE"
         ;;
 
     open)
-        URL="http://localhost:8503"
+        port=$(_running_port)
+        pid=$(_running_pid)
+        [[ -z "$pid" ]] && warn "Label tool does not appear to be running. Start with: ./manage.sh start"
+        URL="http://localhost:${port}"
         info "Opening ${URL}"
         if command -v xdg-open &>/dev/null; then
             xdg-open "$URL"

@@ -249,6 +257,72 @@ case "$CMD" in
         exec "$0" benchmark --compare "$@"
         ;;
 
+    start-api)
+        API_PID_FILE=".avocet-api.pid"
+        API_PORT=8503
+        if [[ -f "$API_PID_FILE" ]] && kill -0 "$(<"$API_PID_FILE")" 2>/dev/null; then
+            warn "API already running (PID $(<"$API_PID_FILE")) → http://localhost:${API_PORT}"
+            exit 0
+        fi
+        mkdir -p "$LOG_DIR"
+        API_LOG="${LOG_DIR}/api.log"
+        info "Building Vue SPA…"
+        (cd web && npm run build) >> "$API_LOG" 2>&1
+        info "Starting FastAPI on port ${API_PORT}…"
+        nohup "$PYTHON_UI" -m uvicorn app.api:app \
+            --host 0.0.0.0 --port "$API_PORT" \
+            >> "$API_LOG" 2>&1 &
+        echo $! > "$API_PID_FILE"
+        # Poll until port is actually bound (up to 10 s), not just process alive
+        for _i in $(seq 1 20); do
+            sleep 0.5
+            if (echo "" >/dev/tcp/127.0.0.1/"$API_PORT") 2>/dev/null; then
+                success "Avocet API started → http://localhost:${API_PORT} (PID $(<"$API_PID_FILE"))"
+                break
+            fi
+            if ! kill -0 "$(<"$API_PID_FILE")" 2>/dev/null; then
+                rm -f "$API_PID_FILE"
+                error "API died during startup. Check ${API_LOG}"
+            fi
+        done
+        if ! (echo "" >/dev/tcp/127.0.0.1/"$API_PORT") 2>/dev/null; then
+            error "API did not bind to port ${API_PORT} within 10 s. Check ${API_LOG}"
+        fi
+        ;;
+
+    stop-api)
+        API_PID_FILE=".avocet-api.pid"
+        if [[ ! -f "$API_PID_FILE" ]]; then
+            warn "API not running."
+            exit 0
+        fi
+        PID="$(<"$API_PID_FILE")"
+        if kill -0 "$PID" 2>/dev/null; then
+            kill "$PID" && rm -f "$API_PID_FILE"
+            success "API stopped (PID ${PID})."
+        else
+            warn "Stale PID file (process ${PID} not running). Cleaning up."
+            rm -f "$API_PID_FILE"
+        fi
+        ;;
+
+    restart-api)
+        bash "$0" stop-api
+        exec bash "$0" start-api
+        ;;
+
+    open-api)
+        URL="http://localhost:8503"
+        info "Opening ${URL}"
+        if command -v xdg-open &>/dev/null; then
+            xdg-open "$URL"
+        elif command -v open &>/dev/null; then
+            open "$URL"
+        else
+            echo "$URL"
+        fi
+        ;;
+
     help|--help|-h)
         usage
         ;;
@@ -1,110 +0,0 @@
-"""Avocet — SFT candidate run discovery and JSONL import.
-
-No FastAPI dependency — pure Python file operations.
-Used by app/sft.py endpoints and can be run standalone.
-"""
-from __future__ import annotations
-
-import json
-import logging
-from pathlib import Path
-
-logger = logging.getLogger(__name__)
-
-_CANDIDATES_FILENAME = "sft_candidates.jsonl"
-
-
-def discover_runs(bench_results_dir: Path) -> list[dict]:
-    """Return one entry per run subdirectory that contains sft_candidates.jsonl.
-
-    Sorted newest-first by directory name (directories are named YYYY-MM-DD-HHMMSS
-    by the cf-orch benchmark harness, so lexicographic order is chronological).
-
-    Each entry: {run_id, timestamp, candidate_count, sft_path}
-    """
-    if not bench_results_dir.exists() or not bench_results_dir.is_dir():
-        return []
-    runs = []
-    for subdir in bench_results_dir.iterdir():
-        if not subdir.is_dir():
-            continue
-        sft_path = subdir / _CANDIDATES_FILENAME
-        if not sft_path.exists():
-            continue
-        records = _read_jsonl(sft_path)
-        runs.append({
-            "run_id": subdir.name,
-            "timestamp": subdir.name,
-            "candidate_count": len(records),
-            "sft_path": sft_path,
-        })
-    runs.sort(key=lambda r: r["run_id"], reverse=True)
-    return runs
-
-
-def import_run(sft_path: Path, data_dir: Path) -> dict[str, int]:
-    """Append records from sft_path into data_dir/sft_candidates.jsonl.
-
-    Deduplicates on the `id` field — records whose id already exists in the
-    destination file are skipped silently. Records missing an `id` field are
-    also skipped (malformed input from a partial benchmark write).
-
-    Returns {imported: N, skipped: M}.
-    """
-    dest = data_dir / _CANDIDATES_FILENAME
-    existing_ids = _read_existing_ids(dest)
-
-    new_records: list[dict] = []
-    skipped = 0
-    for record in _read_jsonl(sft_path):
-        if "id" not in record:
-            logger.warning("Skipping record missing 'id' field in %s", sft_path)
-            continue  # malformed — skip without crashing
-        if record["id"] in existing_ids:
-            skipped += 1
-            continue
-        new_records.append(record)
-        existing_ids.add(record["id"])
-
-    if new_records:
-        dest.parent.mkdir(parents=True, exist_ok=True)
-        with open(dest, "a", encoding="utf-8") as fh:
-            for r in new_records:
-                fh.write(json.dumps(r) + "\n")
-
-    return {"imported": len(new_records), "skipped": skipped}
-
-
-def _read_jsonl(path: Path) -> list[dict]:
-    """Read a JSONL file, returning valid records. Skips blank lines and malformed JSON."""
-    if not path.exists():
-        return []
-    records: list[dict] = []
-    for line in path.read_text(encoding="utf-8").splitlines():
-        line = line.strip()
-        if not line:
-            continue
-        try:
-            records.append(json.loads(line))
-        except json.JSONDecodeError as exc:
-            logger.warning("Skipping malformed JSON line in %s: %s", path, exc)
-    return records
-
-
-def _read_existing_ids(path: Path) -> set[str]:
-    """Read only the id field from each line of a JSONL file."""
-    if not path.exists():
-        return set()
-    ids: set[str] = set()
-    with path.open() as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            try:
-                record = json.loads(line)
-                if "id" in record:
-                    ids.add(record["id"])
-            except json.JSONDecodeError:
-                pass  # corrupt line, skip silently (ids file is our own output)
-    return ids
@@ -5,83 +5,83 @@ These functions are stdlib-only and safe to test without an IMAP connection.
 from email.mime.multipart import MIMEMultipart
 from email.mime.text import MIMEText
 
-from app.utils import extract_body, strip_html
+from app.label_tool import _extract_body, _strip_html
 
 
-# ── strip_html ──────────────────────────────────────────────────────────────
+# ── _strip_html ──────────────────────────────────────────────────────────────
 
 def test_strip_html_removes_tags():
-    assert strip_html("<p>Hello <b>world</b></p>") == "Hello world"
+    assert _strip_html("<p>Hello <b>world</b></p>") == "Hello world"
 
 
 def test_strip_html_skips_script_content():
-    result = strip_html("<script>doEvil()</script><p>real</p>")
+    result = _strip_html("<script>doEvil()</script><p>real</p>")
     assert "doEvil" not in result
     assert "real" in result
 
 
 def test_strip_html_skips_style_content():
-    result = strip_html("<style>.foo{color:red}</style><p>visible</p>")
+    result = _strip_html("<style>.foo{color:red}</style><p>visible</p>")
     assert ".foo" not in result
     assert "visible" in result
 
 
 def test_strip_html_handles_br_as_newline():
-    result = strip_html("line1<br>line2")
+    result = _strip_html("line1<br>line2")
     assert "line1" in result
     assert "line2" in result
 
 
 def test_strip_html_decodes_entities():
     # convert_charrefs=True on HTMLParser handles &amp; etc.
-    result = strip_html("<p>Hello &amp; welcome</p>")
+    result = _strip_html("<p>Hello &amp; welcome</p>")
     assert "&amp;" not in result
     assert "Hello" in result
     assert "welcome" in result
 
 
 def test_strip_html_empty_string():
-    assert strip_html("") == ""
+    assert _strip_html("") == ""
 
 
 def test_strip_html_plain_text_passthrough():
-    assert strip_html("no tags here") == "no tags here"
+    assert _strip_html("no tags here") == "no tags here"
 
 
-# ── extract_body ────────────────────────────────────────────────────────────
+# ── _extract_body ────────────────────────────────────────────────────────────
 
 def test_extract_body_prefers_plain_over_html():
     msg = MIMEMultipart("alternative")
     msg.attach(MIMEText("plain body", "plain"))
     msg.attach(MIMEText("<html><body>html body</body></html>", "html"))
-    assert extract_body(msg) == "plain body"
+    assert _extract_body(msg) == "plain body"
 
 
 def test_extract_body_falls_back_to_html_when_no_plain():
     msg = MIMEMultipart("alternative")
     msg.attach(MIMEText("<html><body><p>HTML only email</p></body></html>", "html"))
-    result = extract_body(msg)
+    result = _extract_body(msg)
     assert "HTML only email" in result
     assert "<" not in result  # no raw HTML tags leaked through
 
 
 def test_extract_body_non_multipart_html_stripped():
     msg = MIMEText("<html><body><p>Solo HTML</p></body></html>", "html")
-    result = extract_body(msg)
+    result = _extract_body(msg)
     assert "Solo HTML" in result
     assert "<html>" not in result
 
 
 def test_extract_body_non_multipart_plain_unchanged():
     msg = MIMEText("just plain text", "plain")
-    assert extract_body(msg) == "just plain text"
+    assert _extract_body(msg) == "just plain text"
 
 
 def test_extract_body_empty_message():
     msg = MIMEText("", "plain")
-    assert extract_body(msg) == ""
+    assert _extract_body(msg) == ""
 
 
 def test_extract_body_multipart_empty_returns_empty():
     msg = MIMEMultipart("alternative")
-    assert extract_body(msg) == ""
+    assert _extract_body(msg) == ""
@ -1,399 +0,0 @@
|
||||||
"""Tests for app/models.py — /api/models/* endpoints."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import json
|
|
||||||
from pathlib import Path
|
|
||||||
from unittest.mock import MagicMock, patch
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from fastapi.testclient import TestClient
|
|
||||||
|
|
||||||
|
|
||||||
# ── Fixtures ───────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
@pytest.fixture(autouse=True)
|
|
||||||
def reset_models_globals(tmp_path):
|
|
||||||
"""Redirect module-level dirs to tmp_path and reset download progress."""
|
|
||||||
from app import models as models_module
|
|
||||||
|
|
||||||
prev_models = models_module._MODELS_DIR
|
|
||||||
prev_queue = models_module._QUEUE_DIR
|
|
||||||
prev_progress = dict(models_module._download_progress)
|
|
||||||
|
|
||||||
models_dir = tmp_path / "models"
|
|
||||||
queue_dir = tmp_path / "data"
|
|
||||||
models_dir.mkdir()
|
|
||||||
queue_dir.mkdir()
|
|
||||||
|
|
||||||
models_module.set_models_dir(models_dir)
|
|
||||||
models_module.set_queue_dir(queue_dir)
|
|
||||||
models_module._download_progress = {}
|
|
||||||
|
|
||||||
yield
|
|
||||||
|
|
||||||
models_module.set_models_dir(prev_models)
|
|
||||||
models_module.set_queue_dir(prev_queue)
|
|
||||||
models_module._download_progress = prev_progress
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def client():
|
|
||||||
from app.api import app
|
|
||||||
return TestClient(app)
|
|
||||||
|
|
||||||
|
|
||||||
def _make_hf_response(repo_id: str = "org/model", pipeline_tag: str = "text-classification") -> dict:
|
|
||||||
"""Minimal HF API response payload."""
|
|
||||||
return {
|
|
||||||
"modelId": repo_id,
|
|
||||||
"pipeline_tag": pipeline_tag,
|
|
||||||
"tags": ["pytorch", pipeline_tag],
|
|
||||||
"downloads": 42000,
|
|
||||||
"siblings": [
|
|
||||||
{"rfilename": "pytorch_model.bin", "size": 500_000_000},
|
|
||||||
],
|
|
||||||
"cardData": {"description": "A test model description."},
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _queue_one(client, repo_id: str = "org/model") -> dict:
|
|
||||||
"""Helper: POST to /queue and return the created entry."""
|
|
||||||
r = client.post("/api/models/queue", json={
|
|
||||||
"repo_id": repo_id,
|
|
||||||
"pipeline_tag": "text-classification",
|
|
||||||
"adapter_recommendation": "ZeroShotAdapter",
|
|
||||||
})
|
|
||||||
assert r.status_code == 201, r.text
|
|
||||||
return r.json()
|
|
||||||
|
|
||||||
|
|
||||||
# ── GET /lookup ────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
def test_lookup_invalid_repo_id_returns_422_no_slash(client):
|
|
||||||
"""repo_id without a '/' should be rejected with 422."""
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "noslash"})
|
|
||||||
assert r.status_code == 422
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_invalid_repo_id_returns_422_whitespace(client):
|
|
||||||
"""repo_id containing whitespace should be rejected with 422."""
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/model name"})
|
|
||||||
assert r.status_code == 422
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_hf_404_returns_404(client):
|
|
||||||
"""HF API returning 404 should surface as HTTP 404."""
|
|
||||||
mock_resp = MagicMock()
|
|
||||||
mock_resp.status_code = 404
|
|
||||||
|
|
||||||
with patch("app.models.httpx.get", return_value=mock_resp):
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/nonexistent"})
|
|
||||||
|
|
||||||
assert r.status_code == 404
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_hf_network_error_returns_502(client):
|
|
||||||
"""Network error reaching HF API should return 502."""
|
|
||||||
import httpx as _httpx
|
|
||||||
|
|
||||||
with patch("app.models.httpx.get", side_effect=_httpx.RequestError("timeout")):
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/model"})
|
|
||||||
|
|
||||||
assert r.status_code == 502
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_returns_correct_shape(client):
|
|
||||||
"""Successful lookup returns all required fields."""
|
|
||||||
mock_resp = MagicMock()
|
|
||||||
mock_resp.status_code = 200
|
|
||||||
mock_resp.json.return_value = _make_hf_response("org/mymodel", "text-classification")
|
|
||||||
|
|
||||||
with patch("app.models.httpx.get", return_value=mock_resp):
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/mymodel"})
|
|
||||||
|
|
||||||
assert r.status_code == 200
|
|
||||||
data = r.json()
|
|
||||||
assert data["repo_id"] == "org/mymodel"
|
|
||||||
assert data["pipeline_tag"] == "text-classification"
|
|
||||||
assert data["adapter_recommendation"] == "ZeroShotAdapter"
|
|
||||||
assert data["model_size_bytes"] == 500_000_000
|
|
||||||
assert data["downloads"] == 42000
|
|
||||||
assert data["already_installed"] is False
|
|
||||||
assert data["already_queued"] is False
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_unknown_pipeline_tag_returns_null_adapter(client):
|
|
||||||
"""An unrecognised pipeline_tag yields adapter_recommendation=null."""
|
|
||||||
mock_resp = MagicMock()
|
|
||||||
mock_resp.status_code = 200
|
|
||||||
mock_resp.json.return_value = _make_hf_response("org/m", "audio-classification")
|
|
||||||
|
|
||||||
with patch("app.models.httpx.get", return_value=mock_resp):
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/m"})
|
|
||||||
|
|
||||||
assert r.status_code == 200
|
|
||||||
assert r.json()["adapter_recommendation"] is None
|
|
||||||
|
|
||||||
|
|
||||||
def test_lookup_already_queued_flag(client):
|
|
||||||
"""already_queued is True when repo_id is in the pending queue."""
|
|
||||||
_queue_one(client, "org/queued-model")
|
|
||||||
|
|
||||||
mock_resp = MagicMock()
|
|
||||||
mock_resp.status_code = 200
|
|
||||||
mock_resp.json.return_value = _make_hf_response("org/queued-model")
|
|
||||||
|
|
||||||
with patch("app.models.httpx.get", return_value=mock_resp):
|
|
||||||
r = client.get("/api/models/lookup", params={"repo_id": "org/queued-model"})
|
|
||||||
|
|
||||||
assert r.status_code == 200
|
|
||||||
assert r.json()["already_queued"] is True
|
|
||||||
|
|
||||||
|
|
||||||
# ── GET /queue ─────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
def test_queue_empty_initially(client):
|
|
||||||
r = client.get("/api/models/queue")
|
|
||||||
assert r.status_code == 200
|
|
||||||
assert r.json() == []
|
|
||||||
|
|
||||||
|
|
||||||
def test_queue_add_and_list(client):
|
|
||||||
"""POST then GET /queue should return the entry."""
|
|
||||||
entry = _queue_one(client, "org/my-model")
|
|
||||||
|
|
||||||
r = client.get("/api/models/queue")
|
|
||||||
assert r.status_code == 200
|
|
||||||
items = r.json()
|
|
||||||
assert len(items) == 1
|
|
||||||
assert items[0]["repo_id"] == "org/my-model"
|
|
||||||
assert items[0]["status"] == "pending"
|
|
||||||
assert items[0]["id"] == entry["id"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_queue_add_returns_entry_fields(client):
|
|
||||||
"""POST /queue returns an entry with all expected fields."""
|
|
||||||
entry = _queue_one(client)
|
|
||||||
assert "id" in entry
|
|
||||||
assert "queued_at" in entry
|
|
    assert entry["status"] == "pending"
    assert entry["pipeline_tag"] == "text-classification"
    assert entry["adapter_recommendation"] == "ZeroShotAdapter"


# ── POST /queue — 409 duplicate ────────────────────────────────────────────────

def test_queue_duplicate_returns_409(client):
    """Posting the same repo_id twice should return 409."""
    _queue_one(client, "org/dup-model")

    r = client.post("/api/models/queue", json={
        "repo_id": "org/dup-model",
        "pipeline_tag": "text-classification",
        "adapter_recommendation": "ZeroShotAdapter",
    })
    assert r.status_code == 409


def test_queue_multiple_different_models(client):
    """Multiple distinct repo_ids should all be accepted."""
    _queue_one(client, "org/model-a")
    _queue_one(client, "org/model-b")
    _queue_one(client, "org/model-c")

    r = client.get("/api/models/queue")
    assert r.status_code == 200
    assert len(r.json()) == 3


# ── DELETE /queue/{id} — dismiss ──────────────────────────────────────────────

def test_queue_dismiss(client):
    """DELETE /queue/{id} sets status=dismissed; entry not returned by GET /queue."""
    entry = _queue_one(client)
    entry_id = entry["id"]

    r = client.delete(f"/api/models/queue/{entry_id}")
    assert r.status_code == 200
    assert r.json() == {"ok": True}

    r2 = client.get("/api/models/queue")
    assert r2.status_code == 200
    assert r2.json() == []


def test_queue_dismiss_nonexistent_returns_404(client):
    """DELETE /queue/{id} with unknown id returns 404."""
    r = client.delete("/api/models/queue/does-not-exist")
    assert r.status_code == 404


def test_queue_dismiss_allows_re_queue(client):
    """After dismissal the same repo_id can be queued again."""
    entry = _queue_one(client, "org/requeue-model")
    client.delete(f"/api/models/queue/{entry['id']}")

    r = client.post("/api/models/queue", json={
        "repo_id": "org/requeue-model",
        "pipeline_tag": None,
        "adapter_recommendation": None,
    })
    assert r.status_code == 201


# ── POST /queue/{id}/approve ───────────────────────────────────────────────────

def test_approve_nonexistent_returns_404(client):
    """Approving an unknown id returns 404."""
    r = client.post("/api/models/queue/ghost-id/approve")
    assert r.status_code == 404


def test_approve_non_pending_returns_409(client):
    """Approving an entry that is not in 'pending' state returns 409."""
    from app import models as models_module

    entry = _queue_one(client)
    # Manually flip status to 'failed'
    models_module._update_queue_entry(entry["id"], {"status": "failed"})

    r = client.post(f"/api/models/queue/{entry['id']}/approve")
    assert r.status_code == 409


def test_approve_starts_download_and_returns_ok(client):
    """Approving a pending entry returns {ok: true} and starts a background thread."""
    import time

    entry = _queue_one(client)

    # Patch snapshot_download so the thread doesn't actually hit the network.
    def _fake_snapshot_download(**kwargs):
        pass

    with patch("app.models.snapshot_download", side_effect=_fake_snapshot_download):
        r = client.post(f"/api/models/queue/{entry['id']}/approve")
        assert r.status_code == 200
        assert r.json() == {"ok": True}
        # Give the background thread a moment to complete while snapshot_download is patched
        time.sleep(0.3)

    # Queue entry status should have moved to 'downloading' (or 'ready' if fast)
    from app import models as models_module
    updated = models_module._get_queue_entry(entry["id"])
    assert updated is not None, "Queue entry not found — thread may have run after fixture teardown"
    assert updated["status"] in ("downloading", "ready", "failed")

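The approve flow this test exercises (flip status, spawn a daemon thread, mark the entry ready or failed when the download finishes) can be sketched roughly as follows. The in-memory `QUEUE` and the no-op `snapshot_download` are stand-ins for the app's real store and Hugging Face downloader; everything beyond the status names the test asserts is an assumption:

```python
import threading

# Stand-ins for the app's queue store and downloader (assumptions for this sketch).
QUEUE = {"e1": {"repo_id": "org/m", "status": "pending"}}

def snapshot_download(repo_id):
    pass  # the real endpoint would fetch the HF repo here

def approve(entry_id):
    entry = QUEUE[entry_id]
    if entry["status"] != "pending":
        raise ValueError("409: only pending entries can be approved")
    entry["status"] = "downloading"

    def _worker():
        try:
            snapshot_download(entry["repo_id"])
            entry["status"] = "ready"
        except Exception:
            entry["status"] = "failed"

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    return t  # returned so a caller can join(); the endpoint itself replies {"ok": True}

approve("e1").join()
print(QUEUE["e1"]["status"])  # → ready
```

Returning the thread is what lets a test join() instead of sleeping; the `time.sleep(0.3)` above is the weaker workaround when the endpoint keeps the thread private.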
# ── GET /download/stream ───────────────────────────────────────────────────────

def test_download_stream_idle_when_no_download(client):
    """GET /download/stream returns a single idle event when nothing is downloading."""
    r = client.get("/api/models/download/stream")
    assert r.status_code == 200
    # SSE body should contain the idle event
    assert "idle" in r.text


# ── GET /installed ─────────────────────────────────────────────────────────────

def test_installed_empty(client):
    """GET /installed returns [] when models dir is empty."""
    r = client.get("/api/models/installed")
    assert r.status_code == 200
    assert r.json() == []


def test_installed_detects_downloaded_model(client):
    """A subdir with config.json is surfaced as type='downloaded'."""
    from app import models as models_module

    model_dir = models_module._MODELS_DIR / "org--mymodel"
    model_dir.mkdir()
    (model_dir / "config.json").write_text(json.dumps({"model_type": "bert"}), encoding="utf-8")
    (model_dir / "model_info.json").write_text(
        json.dumps({"repo_id": "org/mymodel", "adapter_recommendation": "ZeroShotAdapter"}),
        encoding="utf-8",
    )

    r = client.get("/api/models/installed")
    assert r.status_code == 200
    items = r.json()
    assert len(items) == 1
    assert items[0]["type"] == "downloaded"
    assert items[0]["name"] == "org--mymodel"
    assert items[0]["adapter"] == "ZeroShotAdapter"
    assert items[0]["model_id"] == "org/mymodel"


def test_installed_detects_finetuned_model(client):
    """A subdir with training_info.json is surfaced as type='finetuned'."""
    from app import models as models_module

    model_dir = models_module._MODELS_DIR / "my-finetuned"
    model_dir.mkdir()
    (model_dir / "training_info.json").write_text(
        json.dumps({"base_model": "org/base", "epochs": 5}), encoding="utf-8"
    )

    r = client.get("/api/models/installed")
    assert r.status_code == 200
    items = r.json()
    assert len(items) == 1
    assert items[0]["type"] == "finetuned"
    assert items[0]["name"] == "my-finetuned"


# ── DELETE /installed/{name} ───────────────────────────────────────────────────

def test_delete_installed_removes_directory(client):
    """DELETE /installed/{name} removes the directory and returns {ok: true}."""
    from app import models as models_module

    model_dir = models_module._MODELS_DIR / "org--removeme"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}", encoding="utf-8")

    r = client.delete("/api/models/installed/org--removeme")
    assert r.status_code == 200
    assert r.json() == {"ok": True}
    assert not model_dir.exists()


def test_delete_installed_not_found_returns_404(client):
    r = client.delete("/api/models/installed/does-not-exist")
    assert r.status_code == 404


def test_delete_installed_path_traversal_blocked(client):
    """DELETE /installed/../../etc must be blocked (400 or 422)."""
    r = client.delete("/api/models/installed/../../etc")
    assert r.status_code in (400, 404, 422)


def test_delete_installed_dotdot_name_blocked(client):
    """A name containing '..' in any form must be rejected."""
    r = client.delete("/api/models/installed/..%2F..%2Fetc")
    assert r.status_code in (400, 404, 422)


def test_delete_installed_name_with_slash_blocked(client):
    """A name containing a literal '/' after URL decoding must be rejected."""
    from fastapi import HTTPException
    from app.models import delete_installed

    # The router would parse a second '/' as a new path segment, so we test
    # the delete logic directly with a slash-containing name.
    with pytest.raises(HTTPException) as exc_info:
        delete_installed("org/traversal")
    assert exc_info.value.status_code in (400, 404)

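The three traversal tests above pin down the contract for a name check before any filesystem access. A minimal sketch of such a validator (the helper name and `MODELS_DIR` location are assumptions for illustration, not the app's actual code):

```python
from pathlib import Path

MODELS_DIR = Path("/tmp/avocet-models")  # hypothetical root for this sketch

def validate_installed_name(name: str) -> Path:
    # Reject separators and parent references before touching the filesystem.
    if not name or "/" in name or "\\" in name or ".." in name:
        raise ValueError(f"invalid model name: {name!r}")
    target = (MODELS_DIR / name).resolve()
    # Belt and braces: the resolved path must still live under MODELS_DIR.
    if MODELS_DIR.resolve() not in target.parents:
        raise ValueError(f"path escapes models dir: {name!r}")
    return target
```

The `..%2F..%2Fetc` case in the test decodes to `../../etc` and is caught by the `..` check; a FastAPI route would translate the `ValueError` into the 400 the tests allow.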
@ -1,377 +0,0 @@
"""API integration tests for app/sft.py — /api/sft/* endpoints."""
import json
import pytest
from fastapi.testclient import TestClient
from pathlib import Path


@pytest.fixture(autouse=True)
def reset_sft_globals(tmp_path):
    from app import sft as sft_module
    _prev_data = sft_module._SFT_DATA_DIR
    _prev_cfg = sft_module._SFT_CONFIG_DIR
    sft_module.set_sft_data_dir(tmp_path)
    sft_module.set_sft_config_dir(tmp_path)
    yield
    sft_module.set_sft_data_dir(_prev_data)
    sft_module.set_sft_config_dir(_prev_cfg)


@pytest.fixture
def client():
    from app.api import app
    return TestClient(app)


def _make_record(id: str, run_id: str = "2026-04-07-143022") -> dict:
    return {
        "id": id, "source": "cf-orch-benchmark",
        "benchmark_run_id": run_id, "timestamp": "2026-04-07T10:00:00Z",
        "status": "needs_review",
        "prompt_messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a Python function that adds two numbers."},
        ],
        "model_response": "def add(a, b): return a - b",
        "corrected_response": None,
        "quality_score": 0.2, "failure_reason": "pattern_match: 0/2 matched",
        "task_id": "code-fn", "task_type": "code",
        "task_name": "Code: Write a Python function",
        "model_id": "Qwen/Qwen2.5-3B", "model_name": "Qwen2.5-3B",
        "node_id": "heimdall", "gpu_id": 0, "tokens_per_sec": 38.4,
    }


def _write_run(tmp_path, run_id: str, records: list[dict]) -> Path:
    run_dir = tmp_path / "bench_results" / run_id
    run_dir.mkdir(parents=True)
    sft_path = run_dir / "sft_candidates.jsonl"
    sft_path.write_text(
        "\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8"
    )
    return sft_path


def _write_config(tmp_path, bench_results_dir: Path) -> None:
    import yaml
    cfg = {"sft": {"bench_results_dir": str(bench_results_dir)}}
    (tmp_path / "label_tool.yaml").write_text(
        yaml.dump(cfg, allow_unicode=True), encoding="utf-8"
    )


# ── /api/sft/runs ──────────────────────────────────────────────────────────

def test_runs_returns_empty_when_no_config(client):
    r = client.get("/api/sft/runs")
    assert r.status_code == 200
    assert r.json() == []


def test_runs_returns_available_runs(client, tmp_path):
    _write_run(tmp_path, "2026-04-07-143022", [_make_record("a"), _make_record("b")])
    _write_config(tmp_path, tmp_path / "bench_results")
    r = client.get("/api/sft/runs")
    assert r.status_code == 200
    data = r.json()
    assert len(data) == 1
    assert data[0]["run_id"] == "2026-04-07-143022"
    assert data[0]["candidate_count"] == 2
    assert data[0]["already_imported"] is False


def test_runs_marks_already_imported(client, tmp_path):
    _write_run(tmp_path, "2026-04-07-143022", [_make_record("a")])
    _write_config(tmp_path, tmp_path / "bench_results")
    from app import sft as sft_module
    candidates = sft_module._candidates_file()
    candidates.parent.mkdir(parents=True, exist_ok=True)
    candidates.write_text(
        json.dumps(_make_record("a", run_id="2026-04-07-143022")) + "\n",
        encoding="utf-8"
    )
    r = client.get("/api/sft/runs")
    assert r.json()[0]["already_imported"] is True


# ── /api/sft/import ─────────────────────────────────────────────────────────

def test_import_adds_records(client, tmp_path):
    _write_run(tmp_path, "2026-04-07-143022", [_make_record("a"), _make_record("b")])
    _write_config(tmp_path, tmp_path / "bench_results")
    r = client.post("/api/sft/import", json={"run_id": "2026-04-07-143022"})
    assert r.status_code == 200
    assert r.json() == {"imported": 2, "skipped": 0}


def test_import_is_idempotent(client, tmp_path):
    _write_run(tmp_path, "2026-04-07-143022", [_make_record("a")])
    _write_config(tmp_path, tmp_path / "bench_results")
    client.post("/api/sft/import", json={"run_id": "2026-04-07-143022"})
    r = client.post("/api/sft/import", json={"run_id": "2026-04-07-143022"})
    assert r.json() == {"imported": 0, "skipped": 1}


def test_import_unknown_run_returns_404(client, tmp_path):
    _write_config(tmp_path, tmp_path / "bench_results")
    r = client.post("/api/sft/import", json={"run_id": "nonexistent"})
    assert r.status_code == 404


# ── /api/sft/queue ──────────────────────────────────────────────────────────

def _populate_candidates(tmp_path, records: list[dict]) -> None:
    from app import sft as sft_module
    path = sft_module._candidates_file()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        "\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8"
    )


def test_queue_returns_needs_review_only(client, tmp_path):
    records = [
        _make_record("a"),  # needs_review
        {**_make_record("b"), "status": "approved"},  # should not appear
        {**_make_record("c"), "status": "discarded"},  # should not appear
    ]
    _populate_candidates(tmp_path, records)
    r = client.get("/api/sft/queue")
    assert r.status_code == 200
    data = r.json()
    assert data["total"] == 1
    assert len(data["items"]) == 1
    assert data["items"][0]["id"] == "a"


def test_queue_pagination(client, tmp_path):
    records = [_make_record(str(i)) for i in range(25)]
    _populate_candidates(tmp_path, records)
    r = client.get("/api/sft/queue?page=1&per_page=10")
    data = r.json()
    assert data["total"] == 25
    assert len(data["items"]) == 10
    r2 = client.get("/api/sft/queue?page=3&per_page=10")
    assert len(r2.json()["items"]) == 5


def test_queue_empty_when_no_file(client):
    r = client.get("/api/sft/queue")
    assert r.status_code == 200
    assert r.json() == {"items": [], "total": 0, "page": 1, "per_page": 20}

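The queue shape asserted in the three tests above (filter to `needs_review`, then slice by page) reduces to a few lines. This is a sketch of the expected behaviour under those assertions, not the app's actual implementation:

```python
def queue_page(records: list[dict], page: int = 1, per_page: int = 20) -> dict:
    # Only undecided candidates are served to the card stack.
    items = [r for r in records if r["status"] == "needs_review"]
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "total": len(items),
        "page": page,
        "per_page": per_page,
    }

records = [{"id": str(i), "status": "needs_review"} for i in range(25)]
print(len(queue_page(records, page=3, per_page=10)["items"]))  # → 5
```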
# ── /api/sft/submit ─────────────────────────────────────────────────────────

def test_submit_correct_sets_approved(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
    })
    assert r.status_code == 200
    from app import sft as sft_module
    records = sft_module._read_candidates()
    assert records[0]["status"] == "approved"
    assert records[0]["corrected_response"] == "def add(a, b): return a + b"


def test_submit_correct_also_appends_to_approved_file(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
    })
    from app import sft as sft_module
    from app.utils import read_jsonl
    approved = read_jsonl(sft_module._approved_file())
    assert len(approved) == 1
    assert approved[0]["id"] == "a"


def test_submit_discard_sets_discarded(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={"id": "a", "action": "discard"})
    assert r.status_code == 200
    from app import sft as sft_module
    assert sft_module._read_candidates()[0]["status"] == "discarded"


def test_submit_flag_sets_model_rejected(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={"id": "a", "action": "flag"})
    assert r.status_code == 200
    from app import sft as sft_module
    assert sft_module._read_candidates()[0]["status"] == "model_rejected"


def test_submit_correct_empty_response_returns_422(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct", "corrected_response": " ",
    })
    assert r.status_code == 422


def test_submit_correct_null_response_returns_422(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct", "corrected_response": None,
    })
    assert r.status_code == 422


def test_submit_unknown_id_returns_404(client, tmp_path):
    r = client.post("/api/sft/submit", json={"id": "nope", "action": "discard"})
    assert r.status_code == 404


def test_submit_already_approved_returns_409(client, tmp_path):
    _populate_candidates(tmp_path, [{**_make_record("a"), "status": "approved"}])
    r = client.post("/api/sft/submit", json={"id": "a", "action": "discard"})
    assert r.status_code == 409


def test_submit_correct_stores_failure_category(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
        "failure_category": "style_violation",
    })
    assert r.status_code == 200
    from app import sft as sft_module
    records = sft_module._read_candidates()
    assert records[0]["failure_category"] == "style_violation"


def test_submit_correct_null_failure_category(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
    })
    assert r.status_code == 200
    from app import sft as sft_module
    records = sft_module._read_candidates()
    assert records[0]["failure_category"] is None


def test_submit_invalid_failure_category_returns_422(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
        "failure_category": "nonsense",
    })
    assert r.status_code == 422


# ── /api/sft/undo ────────────────────────────────────────────────────────────

def test_undo_restores_discarded_to_needs_review(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    client.post("/api/sft/submit", json={"id": "a", "action": "discard"})
    r = client.post("/api/sft/undo", json={"id": "a"})
    assert r.status_code == 200
    from app import sft as sft_module
    assert sft_module._read_candidates()[0]["status"] == "needs_review"


def test_undo_removes_approved_from_approved_file(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    client.post("/api/sft/submit", json={
        "id": "a", "action": "correct",
        "corrected_response": "def add(a, b): return a + b",
    })
    client.post("/api/sft/undo", json={"id": "a"})
    from app import sft as sft_module
    from app.utils import read_jsonl
    approved = read_jsonl(sft_module._approved_file())
    assert not any(r["id"] == "a" for r in approved)


def test_undo_already_needs_review_returns_409(client, tmp_path):
    _populate_candidates(tmp_path, [_make_record("a")])
    r = client.post("/api/sft/undo", json={"id": "a"})
    assert r.status_code == 409


# ── /api/sft/export ──────────────────────────────────────────────────────────

def test_export_returns_approved_as_sft_jsonl(client, tmp_path):
    from app import sft as sft_module
    from app.utils import write_jsonl
    approved = {
        **_make_record("a"),
        "status": "approved",
        "corrected_response": "def add(a, b): return a + b",
        "prompt_messages": [
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": "Write a Python add function."},
        ],
    }
    write_jsonl(sft_module._approved_file(), [approved])
    _populate_candidates(tmp_path, [approved])

    r = client.get("/api/sft/export")
    assert r.status_code == 200
    assert "application/x-ndjson" in r.headers["content-type"]
    lines = [l for l in r.text.splitlines() if l.strip()]
    assert len(lines) == 1
    record = json.loads(lines[0])
    assert record["messages"][-1] == {
        "role": "assistant", "content": "def add(a, b): return a + b"
    }
    assert record["messages"][0]["role"] == "system"
    assert record["messages"][1]["role"] == "user"

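The export assertions above imply a simple record-to-chat transform: keep the prompt turns and append the human-corrected response as the assistant turn. A sketch of that transform (the real `/api/sft/export` implementation is not shown in this diff, so the function name and body are assumptions):

```python
import json

def to_sft_example(record: dict) -> str:
    # prompt turns plus the corrected answer as the final assistant turn
    messages = list(record["prompt_messages"])
    messages.append({"role": "assistant", "content": record["corrected_response"]})
    return json.dumps({"messages": messages})

record = {
    "prompt_messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python add function."},
    ],
    "corrected_response": "def add(a, b): return a + b",
}
line = to_sft_example(record)
print(json.loads(line)["messages"][-1]["role"])  # → assistant
```

One such JSON object per line, served as `application/x-ndjson`, is exactly what the test parses back.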

def test_export_excludes_non_approved(client, tmp_path):
    from app import sft as sft_module
    from app.utils import write_jsonl
    records = [
        {**_make_record("a"), "status": "discarded", "corrected_response": None},
        {**_make_record("b"), "status": "needs_review", "corrected_response": None},
    ]
    write_jsonl(sft_module._approved_file(), records)
    r = client.get("/api/sft/export")
    assert r.text.strip() == ""


def test_export_empty_when_no_approved_file(client):
    r = client.get("/api/sft/export")
    assert r.status_code == 200
    assert r.text.strip() == ""


# ── /api/sft/stats ───────────────────────────────────────────────────────────

def test_stats_counts_by_status(client, tmp_path):
    from app import sft as sft_module
    from app.utils import write_jsonl
    records = [
        _make_record("a"),
        {**_make_record("b"), "status": "approved", "corrected_response": "ok"},
        {**_make_record("c"), "status": "discarded"},
        {**_make_record("d"), "status": "model_rejected"},
    ]
    _populate_candidates(tmp_path, records)
    write_jsonl(sft_module._approved_file(), [records[1]])
    r = client.get("/api/sft/stats")
    assert r.status_code == 200
    data = r.json()
    assert data["total"] == 4
    assert data["by_status"]["needs_review"] == 1
    assert data["by_status"]["approved"] == 1
    assert data["by_status"]["discarded"] == 1
    assert data["by_status"]["model_rejected"] == 1
    assert data["export_ready"] == 1


def test_stats_empty_when_no_data(client):
    r = client.get("/api/sft/stats")
    assert r.status_code == 200
    data = r.json()
    assert data["total"] == 0
    assert data["export_ready"] == 0

@ -1,95 +0,0 @@
"""Unit tests for scripts/sft_import.py — run discovery and JSONL deduplication."""
import json
from pathlib import Path


def _write_candidates(path: Path, records: list[dict]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")


def _make_record(id: str, run_id: str = "run1") -> dict:
    return {
        "id": id, "source": "cf-orch-benchmark",
        "benchmark_run_id": run_id, "timestamp": "2026-04-07T10:00:00Z",
        "status": "needs_review", "prompt_messages": [],
        "model_response": "bad", "corrected_response": None,
        "quality_score": 0.3, "failure_reason": "missing patterns",
        "task_id": "code-fn", "task_type": "code", "task_name": "Code: fn",
        "model_id": "Qwen/Qwen2.5-3B", "model_name": "Qwen2.5-3B",
        "node_id": "heimdall", "gpu_id": 0, "tokens_per_sec": 38.4,
    }


def test_discover_runs_empty_when_dir_missing(tmp_path):
    from scripts.sft_import import discover_runs
    result = discover_runs(tmp_path / "nonexistent")
    assert result == []


def test_discover_runs_returns_runs(tmp_path):
    from scripts.sft_import import discover_runs
    run_dir = tmp_path / "2026-04-07-143022"
    _write_candidates(run_dir / "sft_candidates.jsonl", [_make_record("a"), _make_record("b")])
    result = discover_runs(tmp_path)
    assert len(result) == 1
    assert result[0]["run_id"] == "2026-04-07-143022"
    assert result[0]["candidate_count"] == 2
    assert "sft_path" in result[0]


def test_discover_runs_skips_dirs_without_sft_file(tmp_path):
    from scripts.sft_import import discover_runs
    (tmp_path / "2026-04-07-no-sft").mkdir()
    result = discover_runs(tmp_path)
    assert result == []


def test_discover_runs_sorted_newest_first(tmp_path):
    from scripts.sft_import import discover_runs
    for name in ["2026-04-05-120000", "2026-04-07-143022", "2026-04-06-090000"]:
        run_dir = tmp_path / name
        _write_candidates(run_dir / "sft_candidates.jsonl", [_make_record("x")])
    result = discover_runs(tmp_path)
    assert [r["run_id"] for r in result] == [
        "2026-04-07-143022", "2026-04-06-090000", "2026-04-05-120000"
    ]


def test_import_run_imports_new_records(tmp_path):
    from scripts.sft_import import import_run
    sft_path = tmp_path / "run1" / "sft_candidates.jsonl"
    _write_candidates(sft_path, [_make_record("a"), _make_record("b")])
    result = import_run(sft_path, tmp_path)
    assert result == {"imported": 2, "skipped": 0}
    dest = tmp_path / "sft_candidates.jsonl"
    lines = [json.loads(l) for l in dest.read_text().splitlines() if l.strip()]
    assert len(lines) == 2


def test_import_run_deduplicates_on_id(tmp_path):
    from scripts.sft_import import import_run
    sft_path = tmp_path / "run1" / "sft_candidates.jsonl"
    _write_candidates(sft_path, [_make_record("a"), _make_record("b")])
    import_run(sft_path, tmp_path)
    result = import_run(sft_path, tmp_path)  # second import
    assert result == {"imported": 0, "skipped": 2}
    dest = tmp_path / "sft_candidates.jsonl"
    lines = [l for l in dest.read_text().splitlines() if l.strip()]
    assert len(lines) == 2  # no duplicates

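The dedup behaviour pinned down by the two import tests above is: skip any record whose `id` is already present in the destination JSONL, append the rest. A sketch (the function and file names follow the tests themselves; the body is an assumption, and the real script also warns on records missing an `id`, a branch omitted here):

```python
import json
from pathlib import Path

def import_run(sft_path: Path, data_dir: Path) -> dict:
    dest = data_dir / "sft_candidates.jsonl"
    # Collect ids already imported so a re-run is a no-op.
    seen = set()
    if dest.exists():
        for line in dest.read_text(encoding="utf-8").splitlines():
            if line.strip():
                seen.add(json.loads(line)["id"])
    imported = skipped = 0
    with dest.open("a", encoding="utf-8") as fh:
        for line in sft_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if record["id"] in seen:
                skipped += 1
                continue
            fh.write(json.dumps(record) + "\n")
            seen.add(record["id"])
            imported += 1
    return {"imported": imported, "skipped": skipped}
```

Appending rather than rewriting keeps earlier labelling state in the destination file untouched.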

def test_import_run_skips_records_missing_id(tmp_path, caplog):
    import logging
    from scripts.sft_import import import_run
    sft_path = tmp_path / "run1" / "sft_candidates.jsonl"
    sft_path.parent.mkdir()
    sft_path.write_text(
        json.dumps({"model_response": "bad", "status": "needs_review"}) + "\n"
        + json.dumps({"id": "abc123", "model_response": "good", "status": "needs_review"}) + "\n"
    )
    with caplog.at_level(logging.WARNING, logger="scripts.sft_import"):
        result = import_run(sft_path, tmp_path)
    assert result == {"imported": 1, "skipped": 0}
    assert "missing 'id'" in caplog.text

@ -66,8 +66,6 @@ const navItems = [
   { path: '/fetch', icon: '📥', label: 'Fetch' },
   { path: '/stats', icon: '📊', label: 'Stats' },
   { path: '/benchmark', icon: '🏁', label: 'Benchmark' },
-  { path: '/models', icon: '🤗', label: 'Models' },
-  { path: '/corrections', icon: '✍️', label: 'Corrections' },
   { path: '/settings', icon: '⚙️', label: 'Settings' },
 ]

@ -1,179 +0,0 @@
import { mount } from '@vue/test-utils'
import SftCard from './SftCard.vue'
import type { SftQueueItem } from '../stores/sft'
import { describe, it, expect } from 'vitest'

const LOW_QUALITY_ITEM: SftQueueItem = {
  id: 'abc', source: 'cf-orch-benchmark', benchmark_run_id: 'run1',
  timestamp: '2026-04-07T10:00:00Z', status: 'needs_review',
  prompt_messages: [
    { role: 'system', content: 'You are a coding assistant.' },
    { role: 'user', content: 'Write a Python add function.' },
  ],
  model_response: 'def add(a, b): return a - b',
  corrected_response: null, quality_score: 0.2,
  failure_reason: 'pattern_match: 0/2 matched',
  failure_category: null,
  task_id: 'code-fn', task_type: 'code', task_name: 'Code: Write a function',
  model_id: 'Qwen/Qwen2.5-3B', model_name: 'Qwen2.5-3B',
  node_id: 'heimdall', gpu_id: 0, tokens_per_sec: 38.4,
}

const MID_QUALITY_ITEM: SftQueueItem = { ...LOW_QUALITY_ITEM, id: 'mid', quality_score: 0.55 }
const HIGH_QUALITY_ITEM: SftQueueItem = { ...LOW_QUALITY_ITEM, id: 'hi', quality_score: 0.72 }

describe('SftCard', () => {
  it('renders model name chip', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.text()).toContain('Qwen2.5-3B')
  })

  it('renders task type chip', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.text()).toContain('code')
  })

  it('renders failure reason', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.text()).toContain('pattern_match: 0/2 matched')
  })

  it('renders model response', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.text()).toContain('def add(a, b): return a - b')
  })

  it('quality chip shows numeric value for low quality', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.text()).toContain('0.20')
  })

  it('quality chip has low-quality class when score < 0.4', () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    expect(w.find('[data-testid="quality-chip"]').classes()).toContain('quality-low')
  })

  it('quality chip has mid-quality class when score is 0.4 to <0.7', () => {
    const w = mount(SftCard, { props: { item: MID_QUALITY_ITEM } })
    expect(w.find('[data-testid="quality-chip"]').classes()).toContain('quality-mid')
  })

  it('quality chip has acceptable class when score >= 0.7', () => {
    const w = mount(SftCard, { props: { item: HIGH_QUALITY_ITEM } })
    expect(w.find('[data-testid="quality-chip"]').classes()).toContain('quality-ok')
  })

  it('clicking Correct button emits correct', async () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    await w.find('[data-testid="correct-btn"]').trigger('click')
    expect(w.emitted('correct')).toBeTruthy()
  })

  it('clicking Discard button then confirming emits discard', async () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    await w.find('[data-testid="discard-btn"]').trigger('click')
    await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
    expect(w.emitted('discard')).toBeTruthy()
  })

  it('clicking Flag Model button then confirming emits flag', async () => {
    const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
    await w.find('[data-testid="flag-btn"]').trigger('click')
    await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
    expect(w.emitted('flag')).toBeTruthy()
  })

  it('correction area hidden initially', () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
expect(w.find('[data-testid="correction-area"]').exists()).toBe(false)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('correction area shown when correcting prop is true', () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
expect(w.find('[data-testid="correction-area"]').exists()).toBe(true)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('renders nothing for failure reason when null', () => {
|
|
||||||
const item = { ...LOW_QUALITY_ITEM, failure_reason: null }
|
|
||||||
const w = mount(SftCard, { props: { item } })
|
|
||||||
expect(w.find('.failure-reason').exists()).toBe(false)
|
|
||||||
})
|
|
||||||
|
|
||||||
// ── Failure category chip-group ───────────────────────────────────
|
|
||||||
it('failure category section hidden when not correcting and no pending action', () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(false)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('failure category section shown when correcting prop is true', () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(true)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('renders all six category chips when correcting', () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
const chips = w.findAll('.category-chip')
|
|
||||||
expect(chips).toHaveLength(6)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('clicking a category chip selects it (adds active class)', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
const chip = w.find('[data-testid="category-chip-wrong_answer"]')
|
|
||||||
await chip.trigger('click')
|
|
||||||
expect(chip.classes()).toContain('category-chip--active')
|
|
||||||
})
|
|
||||||
|
|
||||||
it('clicking the active chip again deselects it', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
const chip = w.find('[data-testid="category-chip-hallucination"]')
|
|
||||||
await chip.trigger('click')
|
|
||||||
expect(chip.classes()).toContain('category-chip--active')
|
|
||||||
await chip.trigger('click')
|
|
||||||
expect(chip.classes()).not.toContain('category-chip--active')
|
|
||||||
})
|
|
||||||
|
|
||||||
it('only one chip can be active at a time', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM, correcting: true } })
|
|
||||||
await w.find('[data-testid="category-chip-wrong_answer"]').trigger('click')
|
|
||||||
await w.find('[data-testid="category-chip-hallucination"]').trigger('click')
|
|
||||||
const active = w.findAll('.category-chip--active')
|
|
||||||
expect(active).toHaveLength(1)
|
|
||||||
expect(active[0].attributes('data-testid')).toBe('category-chip-hallucination')
|
|
||||||
})
|
|
||||||
|
|
||||||
it('clicking Discard shows pending action row with category section', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
await w.find('[data-testid="discard-btn"]').trigger('click')
|
|
||||||
expect(w.find('[data-testid="failure-category-section"]').exists()).toBe(true)
|
|
||||||
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(true)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('clicking Flag shows pending action row', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
await w.find('[data-testid="flag-btn"]').trigger('click')
|
|
||||||
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(true)
|
|
||||||
})
|
|
||||||
|
|
||||||
it('confirming discard emits discard with null when no category selected', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
await w.find('[data-testid="discard-btn"]').trigger('click')
|
|
||||||
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
|
|
||||||
expect(w.emitted('discard')).toBeTruthy()
|
|
||||||
expect(w.emitted('discard')![0]).toEqual([null])
|
|
||||||
})
|
|
||||||
|
|
||||||
it('confirming discard emits discard with selected category', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
await w.find('[data-testid="discard-btn"]').trigger('click')
|
|
||||||
await w.find('[data-testid="category-chip-scoring_artifact"]').trigger('click')
|
|
||||||
await w.find('[data-testid="confirm-pending-btn"]').trigger('click')
|
|
||||||
expect(w.emitted('discard')![0]).toEqual(['scoring_artifact'])
|
|
||||||
})
|
|
||||||
|
|
||||||
it('cancelling pending action hides the pending row', async () => {
|
|
||||||
const w = mount(SftCard, { props: { item: LOW_QUALITY_ITEM } })
|
|
||||||
await w.find('[data-testid="discard-btn"]').trigger('click')
|
|
||||||
await w.find('[data-testid="cancel-pending-btn"]').trigger('click')
|
|
||||||
expect(w.find('[data-testid="pending-action-row"]').exists()).toBe(false)
|
|
||||||
})
|
|
||||||
})
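The discard/flag flow exercised by the last few tests is a small state machine: clicking Discard or Flag stages a pending action, Confirm emits it together with whatever category is selected, and Cancel clears it. A minimal framework-free sketch of that flow — `PendingState`, `stage`, `confirm`, and `cancel` are hypothetical names for illustration, not part of the component's API:

```typescript
// Pure sketch of SftCard's pending-action flow (hypothetical helper names).
type Category = string | null

interface PendingState {
  pending: 'discard' | 'flag' | null
  category: Category
}

// Clicking Discard or Flag stages the action without emitting yet.
function stage(s: PendingState, action: 'discard' | 'flag'): PendingState {
  return { ...s, pending: action }
}

// Confirm emits [action, category] and resets both fields, matching
// confirmPendingAction in the component.
function confirm(s: PendingState): {
  emitted: ['discard' | 'flag', Category] | null
  next: PendingState
} {
  if (!s.pending) return { emitted: null, next: s }
  return { emitted: [s.pending, s.category], next: { pending: null, category: null } }
}

// Cancel clears the staged action but keeps any chip selection.
function cancel(s: PendingState): PendingState {
  return { ...s, pending: null }
}

// Mirrors the spec: discard with no category confirms as ['discard', null].
let s: PendingState = { pending: null, category: null }
s = stage(s, 'discard')
console.log(confirm(s).emitted) // [ 'discard', null ]
console.log(cancel(s).pending)  // null
```

This is only the branching logic; the real component additionally drives visibility of the pending-action row from `pending`.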
@@ -1,393 +0,0 @@
<template>
  <article class="sft-card">
    <!-- Chips row -->
    <div class="chips-row">
      <span class="chip chip-model">{{ item.model_name }}</span>
      <span class="chip chip-task">{{ item.task_type }}</span>
      <span class="chip chip-node">{{ item.node_id }} · GPU {{ item.gpu_id }}</span>
      <span class="chip chip-speed">{{ item.tokens_per_sec.toFixed(1) }} tok/s</span>
      <span
        class="chip quality-chip"
        :class="qualityClass"
        data-testid="quality-chip"
        :title="qualityLabel"
      >
        {{ item.quality_score.toFixed(2) }} · {{ qualityLabel }}
      </span>
    </div>

    <!-- Failure reason -->
    <p v-if="item.failure_reason" class="failure-reason">{{ item.failure_reason }}</p>

    <!-- Prompt (collapsible) -->
    <div class="prompt-section">
      <button
        class="prompt-toggle"
        :aria-expanded="promptExpanded"
        @click="promptExpanded = !promptExpanded"
      >
        {{ promptExpanded ? 'Hide prompt ↑' : 'Show full prompt ↓' }}
      </button>
      <div v-if="promptExpanded" class="prompt-messages">
        <div
          v-for="(msg, i) in item.prompt_messages"
          :key="i"
          class="prompt-message"
          :class="`role-${msg.role}`"
        >
          <span class="role-label">{{ msg.role }}</span>
          <pre class="message-content">{{ msg.content }}</pre>
        </div>
      </div>
    </div>

    <!-- Model response -->
    <div class="model-response-section">
      <p class="section-label">Model output (incorrect)</p>
      <pre class="model-response">{{ item.model_response }}</pre>
    </div>

    <!-- Action bar -->
    <div class="action-bar">
      <button
        data-testid="correct-btn"
        class="btn-correct"
        @click="$emit('correct')"
      >✓ Correct</button>
      <button
        data-testid="discard-btn"
        class="btn-discard"
        @click="emitWithCategory('discard')"
      >✕ Discard</button>
      <button
        data-testid="flag-btn"
        class="btn-flag"
        @click="emitWithCategory('flag')"
      >⚑ Flag Model</button>
    </div>

    <!-- Failure category selector (shown when correcting or acting) -->
    <div
      v-if="correcting || pendingAction"
      class="failure-category-section"
      data-testid="failure-category-section"
    >
      <p class="section-label">Failure category <span class="optional-label">(optional)</span></p>
      <div class="category-chips" role="group" aria-label="Failure category">
        <button
          v-for="cat in FAILURE_CATEGORIES"
          :key="cat.value"
          type="button"
          class="category-chip"
          :class="{ 'category-chip--active': selectedCategory === cat.value }"
          :aria-pressed="selectedCategory === cat.value || undefined"
          :data-testid="'category-chip-' + cat.value"
          @click="toggleCategory(cat.value)"
        >{{ cat.label }}</button>
      </div>

      <!-- Pending discard/flag confirm row -->
      <div v-if="pendingAction" class="pending-action-row" data-testid="pending-action-row">
        <button class="btn-confirm" @click="confirmPendingAction" data-testid="confirm-pending-btn">
          Confirm {{ pendingAction }}
        </button>
        <button class="btn-cancel-pending" @click="cancelPendingAction" data-testid="cancel-pending-btn">
          Cancel
        </button>
      </div>
    </div>

    <!-- Correction area (shown when correcting = true) -->
    <div v-if="correcting" data-testid="correction-area">
      <SftCorrectionArea
        ref="correctionAreaEl"
        :described-by="'sft-failure-' + item.id"
        @submit="handleSubmitCorrection"
        @cancel="$emit('cancel-correction')"
      />
    </div>
  </article>
</template>

<script setup lang="ts">
import { ref, computed } from 'vue'
import type { SftQueueItem, SftFailureCategory } from '../stores/sft'
import SftCorrectionArea from './SftCorrectionArea.vue'

const props = defineProps<{ item: SftQueueItem; correcting?: boolean }>()

const emit = defineEmits<{
  correct: []
  discard: [category: SftFailureCategory | null]
  flag: [category: SftFailureCategory | null]
  'submit-correction': [text: string, category: SftFailureCategory | null]
  'cancel-correction': []
}>()

const FAILURE_CATEGORIES: { value: SftFailureCategory; label: string }[] = [
  { value: 'scoring_artifact', label: 'Scoring artifact' },
  { value: 'style_violation', label: 'Style violation' },
  { value: 'partial_answer', label: 'Partial answer' },
  { value: 'wrong_answer', label: 'Wrong answer' },
  { value: 'format_error', label: 'Format error' },
  { value: 'hallucination', label: 'Hallucination' },
]

const promptExpanded = ref(false)
const correctionAreaEl = ref<InstanceType<typeof SftCorrectionArea> | null>(null)
const selectedCategory = ref<SftFailureCategory | null>(null)
const pendingAction = ref<'discard' | 'flag' | null>(null)

const qualityClass = computed(() => {
  const s = props.item.quality_score
  if (s < 0.4) return 'quality-low'
  if (s < 0.7) return 'quality-mid'
  return 'quality-ok'
})

const qualityLabel = computed(() => {
  const s = props.item.quality_score
  if (s < 0.4) return 'low quality'
  if (s < 0.7) return 'fair'
  return 'acceptable'
})

function toggleCategory(cat: SftFailureCategory) {
  selectedCategory.value = selectedCategory.value === cat ? null : cat
}

function emitWithCategory(action: 'discard' | 'flag') {
  pendingAction.value = action
}

function confirmPendingAction() {
  if (!pendingAction.value) return
  emit(pendingAction.value, selectedCategory.value)
  pendingAction.value = null
  selectedCategory.value = null
}

function cancelPendingAction() {
  pendingAction.value = null
}

function handleSubmitCorrection(text: string) {
  emit('submit-correction', text, selectedCategory.value)
  selectedCategory.value = null
}

function resetCorrection() {
  correctionAreaEl.value?.reset()
  selectedCategory.value = null
  pendingAction.value = null
}

defineExpose({ resetCorrection })
</script>

<style scoped>
.sft-card {
  background: var(--color-surface-raised);
  border: 1px solid var(--color-border);
  border-radius: var(--radius-lg);
  padding: var(--space-4);
  display: flex;
  flex-direction: column;
  gap: var(--space-3);
}

.chips-row {
  display: flex;
  flex-wrap: wrap;
  gap: var(--space-2);
}

.chip {
  padding: var(--space-1) var(--space-2);
  border-radius: var(--radius-full);
  font-size: 0.78rem;
  font-weight: 600;
  white-space: nowrap;
}

.chip-model { background: var(--color-primary-light, #e8f2e7); color: var(--color-primary); }
.chip-task { background: var(--color-surface-alt); color: var(--color-text-muted); }
.chip-node { background: var(--color-surface-alt); color: var(--color-text-muted); }
.chip-speed { background: var(--color-surface-alt); color: var(--color-text-muted); }

.quality-chip { color: #fff; }
.quality-low { background: var(--color-error, #c0392b); }
.quality-mid { background: var(--color-warning, #d4891a); }
.quality-ok { background: var(--color-success, #3a7a32); }

.failure-reason {
  font-size: 0.82rem;
  color: var(--color-text-muted);
  font-style: italic;
}

.prompt-toggle {
  background: none;
  border: none;
  color: var(--color-accent);
  font-size: 0.85rem;
  cursor: pointer;
  padding: 0;
  text-decoration: underline;
}

.prompt-messages {
  margin-top: var(--space-2);
  display: flex;
  flex-direction: column;
  gap: var(--space-2);
}

.prompt-message {
  display: flex;
  flex-direction: column;
  gap: var(--space-1);
}

.role-label {
  font-size: 0.75rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  color: var(--color-text-muted);
}

.message-content {
  font-family: var(--font-mono);
  font-size: 0.82rem;
  white-space: pre-wrap;
  background: var(--color-surface-alt);
  padding: var(--space-2) var(--space-3);
  border-radius: var(--radius-md);
  max-height: 200px;
  overflow-y: auto;
}

.section-label {
  font-size: 0.82rem;
  font-weight: 600;
  color: var(--color-text-muted);
  margin-bottom: var(--space-1);
}

.model-response {
  font-family: var(--font-mono);
  font-size: 0.88rem;
  white-space: pre-wrap;
  background: color-mix(in srgb, var(--color-error, #c0392b) 8%, var(--color-surface-alt));
  border-left: 3px solid var(--color-error, #c0392b);
  padding: var(--space-3);
  border-radius: var(--radius-md);
  max-height: 300px;
  overflow-y: auto;
}

.action-bar {
  display: flex;
  gap: var(--space-3);
  flex-wrap: wrap;
}

.action-bar button {
  padding: var(--space-2) var(--space-4);
  border-radius: var(--radius-md);
  border: 1px solid var(--color-border);
  font-size: 0.9rem;
  cursor: pointer;
  background: var(--color-surface-raised);
  color: var(--color-text);
}

.btn-correct { border-color: var(--color-success); color: var(--color-success); }
.btn-correct:hover { background: color-mix(in srgb, var(--color-success) 10%, transparent); }

.btn-discard { border-color: var(--color-error); color: var(--color-error); }
.btn-discard:hover { background: color-mix(in srgb, var(--color-error) 10%, transparent); }

.btn-flag { border-color: var(--color-warning); color: var(--color-warning); }
.btn-flag:hover { background: color-mix(in srgb, var(--color-warning) 10%, transparent); }

/* ── Failure category selector ─────────────────── */
.failure-category-section {
  display: flex;
  flex-direction: column;
  gap: var(--space-2);
}

.optional-label {
  font-size: 0.75rem;
  font-weight: 400;
  color: var(--color-text-muted);
}

.category-chips {
  display: flex;
  flex-wrap: wrap;
  gap: var(--space-2);
}

.category-chip {
  padding: var(--space-1) var(--space-3);
  border-radius: var(--radius-full);
  border: 1px solid var(--color-border);
  background: var(--color-surface-alt);
  color: var(--color-text-muted);
  font-size: 0.78rem;
  font-weight: 500;
  cursor: pointer;
  transition: background var(--transition), color var(--transition), border-color var(--transition);
}

.category-chip:hover {
  border-color: var(--color-accent);
  color: var(--color-accent);
  background: var(--color-accent-light);
}

.category-chip--active {
  background: var(--color-accent-light);
  border-color: var(--color-accent);
  color: var(--color-accent);
  font-weight: 700;
}

.pending-action-row {
  display: flex;
  gap: var(--space-2);
  margin-top: var(--space-1);
}

.btn-confirm {
  padding: var(--space-1) var(--space-3);
  border-radius: var(--radius-md);
  border: 1px solid var(--color-accent);
  background: var(--color-accent-light);
  color: var(--color-accent);
  font-size: 0.85rem;
  font-weight: 600;
  cursor: pointer;
}

.btn-confirm:hover {
  background: color-mix(in srgb, var(--color-accent) 15%, transparent);
}

.btn-cancel-pending {
  padding: var(--space-1) var(--space-3);
  border-radius: var(--radius-md);
  border: 1px solid var(--color-border);
  background: none;
  color: var(--color-text-muted);
  font-size: 0.85rem;
  cursor: pointer;
}

.btn-cancel-pending:hover {
  background: var(--color-surface-alt);
}
</style>
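The quality chip's banding in `qualityClass`/`qualityLabel` is a three-band threshold on `quality_score` (below 0.4 low, below 0.7 mid, otherwise acceptable). A standalone sketch of that mapping — `bandQuality` and `QualityBand` are hypothetical names for illustration, not exports of the component:

```typescript
// Standalone mirror of the thresholds in SftCard.vue's qualityClass/qualityLabel.
// Both computeds branch on the same cutoffs, so one function covers both.
interface QualityBand {
  cls: string   // CSS class applied to the chip
  label: string // human-readable title/text
}

function bandQuality(score: number): QualityBand {
  if (score < 0.4) return { cls: 'quality-low', label: 'low quality' }
  if (score < 0.7) return { cls: 'quality-mid', label: 'fair' }
  return { cls: 'quality-ok', label: 'acceptable' }
}

// The three fixture items in the spec (0.2 / 0.55 / 0.72) land one per band.
console.log(bandQuality(0.2).cls)  // "quality-low"
console.log(bandQuality(0.55).cls) // "quality-mid"
console.log(bandQuality(0.72).cls) // "quality-ok"
```

Note the boundaries are half-open: a score of exactly 0.4 is mid and exactly 0.7 is acceptable, which matches the `< 0.4` / `< 0.7` comparisons in the component.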
@@ -1,68 +0,0 @@
import { mount } from '@vue/test-utils'
import SftCorrectionArea from './SftCorrectionArea.vue'
import { describe, it, expect } from 'vitest'

describe('SftCorrectionArea', () => {
  it('renders a textarea', () => {
    const w = mount(SftCorrectionArea)
    expect(w.find('textarea').exists()).toBe(true)
  })

  it('submit button is disabled when textarea is empty', () => {
    const w = mount(SftCorrectionArea)
    const btn = w.find('[data-testid="submit-btn"]')
    expect((btn.element as HTMLButtonElement).disabled).toBe(true)
  })

  it('submit button is disabled when textarea is whitespace only', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').setValue('   ')
    const btn = w.find('[data-testid="submit-btn"]')
    expect((btn.element as HTMLButtonElement).disabled).toBe(true)
  })

  it('submit button is enabled when textarea has content', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').setValue('def add(a, b): return a + b')
    const btn = w.find('[data-testid="submit-btn"]')
    expect((btn.element as HTMLButtonElement).disabled).toBe(false)
  })

  it('clicking submit emits submit with trimmed text', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').setValue('  def add(a, b): return a + b  ')
    await w.find('[data-testid="submit-btn"]').trigger('click')
    expect(w.emitted('submit')?.[0]).toEqual(['def add(a, b): return a + b'])
  })

  it('clicking cancel emits cancel', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('[data-testid="cancel-btn"]').trigger('click')
    expect(w.emitted('cancel')).toBeTruthy()
  })

  it('Escape key emits cancel', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').trigger('keydown', { key: 'Escape' })
    expect(w.emitted('cancel')).toBeTruthy()
  })

  it('Ctrl+Enter emits submit when text is non-empty', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').setValue('correct answer')
    await w.find('textarea').trigger('keydown', { key: 'Enter', ctrlKey: true })
    expect(w.emitted('submit')?.[0]).toEqual(['correct answer'])
  })

  it('Ctrl+Enter does not emit submit when text is empty', async () => {
    const w = mount(SftCorrectionArea)
    await w.find('textarea').trigger('keydown', { key: 'Enter', ctrlKey: true })
    expect(w.emitted('submit')).toBeFalsy()
  })

  it('omits aria-describedby when describedBy prop is not provided', () => {
    const w = mount(SftCorrectionArea)
    const textarea = w.find('textarea')
    expect(textarea.attributes('aria-describedby')).toBeUndefined()
  })
})
@@ -1,130 +0,0 @@
<template>
  <div class="correction-area">
    <label class="correction-label" for="correction-textarea">
      Write the corrected response:
    </label>
    <textarea
      id="correction-textarea"
      ref="textareaEl"
      v-model="text"
      class="correction-textarea"
      aria-label="Write corrected response"
      aria-required="true"
      :aria-describedby="describedBy || undefined"
      placeholder="Write the response this model should have given..."
      rows="4"
      @keydown.escape="$emit('cancel')"
      @keydown.enter.ctrl.prevent="submitIfValid"
      @keydown.enter.meta.prevent="submitIfValid"
    />
    <div class="correction-actions">
      <button
        data-testid="submit-btn"
        class="btn-submit"
        :disabled="!isValid"
        @click="submitIfValid"
      >
        Submit correction
      </button>
      <button data-testid="cancel-btn" class="btn-cancel" @click="$emit('cancel')">
        Cancel
      </button>
    </div>
  </div>
</template>

<script setup lang="ts">
import { ref, computed, onMounted } from 'vue'

const props = withDefaults(defineProps<{ describedBy?: string }>(), { describedBy: undefined })

const emit = defineEmits<{ submit: [text: string]; cancel: [] }>()

const text = ref('')
const textareaEl = ref<HTMLTextAreaElement | null>(null)
const isValid = computed(() => text.value.trim().length > 0)

onMounted(() => textareaEl.value?.focus())

function submitIfValid() {
  if (isValid.value) emit('submit', text.value.trim())
}

function reset() {
  text.value = ''
}

defineExpose({ reset })
</script>

<style scoped>
.correction-area {
  display: flex;
  flex-direction: column;
  gap: var(--space-3);
  padding: var(--space-4);
  border-top: 1px solid var(--color-border);
  background: var(--color-surface-alt, var(--color-surface));
  border-radius: 0 0 var(--radius-lg) var(--radius-lg);
}

.correction-label {
  font-size: 0.85rem;
  font-weight: 600;
  color: var(--color-text-muted);
}

.correction-textarea {
  width: 100%;
  min-height: 7rem;
  padding: var(--space-3);
  border: 1px solid var(--color-border);
  border-radius: var(--radius-md);
  background: var(--color-surface-raised);
  color: var(--color-text);
  font-family: var(--font-mono);
  font-size: 0.88rem;
  line-height: 1.5;
  resize: vertical;
}

.correction-textarea:focus {
  outline: 2px solid var(--color-primary);
  outline-offset: 1px;
}

.correction-actions {
  display: flex;
  gap: var(--space-3);
  align-items: center;
}

.btn-submit {
  padding: var(--space-2) var(--space-4);
  background: var(--color-primary);
  color: var(--color-text-inverse, #fff);
  border: none;
  border-radius: var(--radius-md);
  font-size: 0.9rem;
  cursor: pointer;
}

.btn-submit:disabled {
  opacity: 0.45;
  cursor: not-allowed;
}

.btn-submit:not(:disabled):hover {
  background: var(--color-primary-hover, var(--color-primary));
}

.btn-cancel {
  background: none;
  border: none;
  color: var(--color-text-muted);
  font-size: 0.9rem;
  cursor: pointer;
  text-decoration: underline;
  padding: 0;
}
</style>
@@ -1,42 +0,0 @@
// src/composables/useSftKeyboard.ts
import { onUnmounted, getCurrentInstance } from 'vue'

interface Options {
  onCorrect: () => void
  onDiscard: () => void
  onFlag: () => void
  onEscape: () => void
  onSubmit: () => void
  isEditing: () => boolean // returns true when correction area is open
}

export function useSftKeyboard(opts: Options) {
  function handler(e: KeyboardEvent) {
    // Never intercept keys when focus is in an input (correction textarea handles its own keys)
    if (e.target instanceof HTMLInputElement) return
    // Likewise for textareas: the correction textarea handles Escape and Ctrl+Enter itself
    if (e.target instanceof HTMLTextAreaElement) return

    const k = e.key.toLowerCase()

    // When the correction area is open, only handle Escape
    if (opts.isEditing()) {
      if (k === 'escape') opts.onEscape()
      return
    }

    if (k === 'c') { opts.onCorrect(); return }
    if (k === 'd') { opts.onDiscard(); return }
    if (k === 'f') { opts.onFlag(); return }
    if (k === 'escape') { opts.onEscape(); return }
  }

  window.addEventListener('keydown', handler)
  const cleanup = () => window.removeEventListener('keydown', handler)

  if (getCurrentInstance()) {
    onUnmounted(cleanup)
  }

  return { cleanup }
}
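The handler's precedence (text fields win, then editing mode narrows everything to Escape, then the c/d/f shortcuts) can be exercised outside Vue as a pure function. A sketch under the assumption that only the branching matters — `dispatchKey`, `SftAction`, and the `inTextField` flag (standing in for the `HTMLInputElement`/`HTMLTextAreaElement` target checks) are hypothetical:

```typescript
type SftAction = 'correct' | 'discard' | 'flag' | 'escape' | null

// Pure mirror of the keydown branching in useSftKeyboard.ts.
function dispatchKey(key: string, isEditing: boolean, inTextField: boolean): SftAction {
  // Text fields handle their own keys; never intercept.
  if (inTextField) return null

  const k = key.toLowerCase()

  // Correction area open: only Escape passes through.
  if (isEditing) {
    return k === 'escape' ? 'escape' : null
  }

  if (k === 'c') return 'correct'
  if (k === 'd') return 'discard'
  if (k === 'f') return 'flag'
  if (k === 'escape') return 'escape'
  return null
}

console.log(dispatchKey('C', false, false))     // "correct" (case-insensitive)
console.log(dispatchKey('d', true, false))      // null — editing mode swallows shortcuts
console.log(dispatchKey('Escape', true, false)) // "escape"
console.log(dispatchKey('c', false, true))      // null — focus is in a text field
```

Factoring the branching out like this is also one way the composable could be unit-tested without mounting a component or dispatching real DOM events.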
@@ -6,8 +6,6 @@ const FetchView = () => import('../views/FetchView.vue')
 const StatsView = () => import('../views/StatsView.vue')
 const BenchmarkView = () => import('../views/BenchmarkView.vue')
 const SettingsView = () => import('../views/SettingsView.vue')
-const CorrectionsView = () => import('../views/CorrectionsView.vue')
-const ModelsView = () => import('../views/ModelsView.vue')

 export const router = createRouter({
   history: createWebHashHistory(),
@@ -16,8 +14,6 @@ export const router = createRouter({
     { path: '/fetch', component: FetchView, meta: { title: 'Fetch' } },
     { path: '/stats', component: StatsView, meta: { title: 'Stats' } },
     { path: '/benchmark', component: BenchmarkView, meta: { title: 'Benchmark' } },
-    { path: '/models', component: ModelsView, meta: { title: 'Models' } },
-    { path: '/corrections', component: CorrectionsView, meta: { title: 'Corrections' } },
     { path: '/settings', component: SettingsView, meta: { title: 'Settings' } },
   ],
 })
|
||||||
|
|
|
||||||
|
|
@@ -1,78 +0,0 @@
-import { setActivePinia, createPinia } from 'pinia'
-import { useSftStore } from './sft'
-import type { SftQueueItem } from './sft'
-import { beforeEach, describe, it, expect } from 'vitest'
-
-function makeMockItem(overrides: Partial<SftQueueItem> = {}): SftQueueItem {
-  return {
-    id: 'abc',
-    source: 'cf-orch-benchmark',
-    benchmark_run_id: 'run1',
-    timestamp: '2026-04-07T10:00:00Z',
-    status: 'needs_review',
-    prompt_messages: [
-      { role: 'system', content: 'You are a coding assistant.' },
-      { role: 'user', content: 'Write a Python add function.' },
-    ],
-    model_response: 'def add(a, b): return a - b',
-    corrected_response: null,
-    quality_score: 0.2,
-    failure_reason: 'pattern_match: 0/2 matched',
-    task_id: 'code-fn',
-    task_type: 'code',
-    task_name: 'Code: Write a Python function',
-    model_id: 'Qwen/Qwen2.5-3B',
-    model_name: 'Qwen2.5-3B',
-    node_id: 'heimdall',
-    gpu_id: 0,
-    tokens_per_sec: 38.4,
-    ...overrides,
-  }
-}
-
-describe('useSftStore', () => {
-  beforeEach(() => setActivePinia(createPinia()))
-
-  it('starts with empty queue', () => {
-    const store = useSftStore()
-    expect(store.queue).toEqual([])
-    expect(store.current).toBeNull()
-  })
-
-  it('current returns first item', () => {
-    const store = useSftStore()
-    store.queue = [makeMockItem()]
-    expect(store.current?.id).toBe('abc')
-  })
-
-  it('removeCurrentFromQueue removes first item', () => {
-    const store = useSftStore()
-    const second = makeMockItem({ id: 'def' })
-    store.queue = [makeMockItem(), second]
-    store.removeCurrentFromQueue()
-    expect(store.queue[0].id).toBe('def')
-  })
-
-  it('restoreItem adds to front of queue', () => {
-    const store = useSftStore()
-    const second = makeMockItem({ id: 'def' })
-    store.queue = [second]
-    store.restoreItem(makeMockItem())
-    expect(store.queue[0].id).toBe('abc')
-    expect(store.queue[1].id).toBe('def')
-  })
-
-  it('setLastAction records the action', () => {
-    const store = useSftStore()
-    store.setLastAction('discard', makeMockItem())
-    expect(store.lastAction?.type).toBe('discard')
-    expect(store.lastAction?.item.id).toBe('abc')
-  })
-
-  it('clearLastAction nulls lastAction', () => {
-    const store = useSftStore()
-    store.setLastAction('flag', makeMockItem())
-    store.clearLastAction()
-    expect(store.lastAction).toBeNull()
-  })
-})
@@ -1,72 +0,0 @@
-// src/stores/sft.ts
-import { defineStore } from 'pinia'
-import { computed, ref } from 'vue'
-
-export type SftFailureCategory =
-  | 'scoring_artifact'
-  | 'style_violation'
-  | 'partial_answer'
-  | 'wrong_answer'
-  | 'format_error'
-  | 'hallucination'
-
-export interface SftQueueItem {
-  id: string
-  source: 'cf-orch-benchmark'
-  benchmark_run_id: string
-  timestamp: string
-  status: 'needs_review' | 'approved' | 'discarded' | 'model_rejected'
-  prompt_messages: { role: string; content: string }[]
-  model_response: string
-  corrected_response: string | null
-  quality_score: number // 0.0 to 1.0
-  failure_reason: string | null
-  failure_category: SftFailureCategory | null
-  task_id: string
-  task_type: string
-  task_name: string
-  model_id: string
-  model_name: string
-  node_id: string
-  gpu_id: number
-  tokens_per_sec: number
-}
-
-export interface SftLastAction {
-  type: 'correct' | 'discard' | 'flag'
-  item: SftQueueItem
-  failure_category?: SftFailureCategory | null
-}
-
-export const useSftStore = defineStore('sft', () => {
-  const queue = ref<SftQueueItem[]>([])
-  const totalRemaining = ref(0)
-  const lastAction = ref<SftLastAction | null>(null)
-
-  const current = computed(() => queue.value[0] ?? null)
-
-  function removeCurrentFromQueue() {
-    queue.value.shift()
-  }
-
-  function setLastAction(
-    type: SftLastAction['type'],
-    item: SftQueueItem,
-    failure_category?: SftFailureCategory | null,
-  ) {
-    lastAction.value = { type, item, failure_category }
-  }
-
-  function clearLastAction() {
-    lastAction.value = null
-  }
-
-  function restoreItem(item: SftQueueItem) {
-    queue.value.unshift(item)
-  }
-
-  return {
-    queue, totalRemaining, lastAction, current,
-    removeCurrentFromQueue, setLastAction, clearLastAction, restoreItem,
-  }
-})
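Review note: the store above is plain head-of-array bookkeeping (`shift` to consume, `unshift` to restore). A Pinia-free sketch of the same semantics; the `Item` type and function names here are illustrative only:

```typescript
interface Item { id: string }

// Consume the current (first) item, returning the new queue.
function removeCurrent(queue: Item[]): Item[] {
  return queue.slice(1)
}

// Put an undone item back at the front so it becomes current again.
function restore(queue: Item[], item: Item): Item[] {
  return [item, ...queue]
}
```

Written immutably here for testability; the store mutates `queue.value` in place, which is equivalent for this access pattern.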
@@ -24,54 +24,6 @@
       </div>
     </header>
 
-    <!-- Model Picker -->
-    <details class="model-picker" ref="pickerEl">
-      <summary class="picker-summary">
-        <span class="picker-title">🎯 Model Selection</span>
-        <span class="picker-badge">{{ pickerSummaryText }}</span>
-      </summary>
-      <div class="picker-body">
-        <div v-if="modelsLoading" class="picker-loading">Loading models…</div>
-        <div v-else-if="Object.keys(modelCategories).length === 0" class="picker-empty">
-          No models found — check API connection.
-        </div>
-        <template v-else>
-          <div
-            v-for="(models, category) in modelCategories"
-            :key="category"
-            class="picker-category"
-          >
-            <label class="picker-cat-header">
-              <input
-                type="checkbox"
-                :checked="isCategoryAllSelected(models)"
-                :indeterminate="isCategoryIndeterminate(models)"
-                @change="toggleCategory(models, ($event.target as HTMLInputElement).checked)"
-              />
-              <span class="picker-cat-name">{{ category }}</span>
-              <span class="picker-cat-count">({{ models.length }})</span>
-            </label>
-            <div v-if="models.length === 0" class="picker-no-models">No models installed</div>
-            <div v-else class="picker-model-list">
-              <label
-                v-for="m in models"
-                :key="m.name"
-                class="picker-model-row"
-              >
-                <input
-                  type="checkbox"
-                  :checked="selectedModels.has(m.name)"
-                  @change="toggleModel(m.name, ($event.target as HTMLInputElement).checked)"
-                />
-                <span class="picker-model-name" :title="m.repo_id ?? m.name">{{ m.name }}</span>
-                <span class="picker-adapter-type">{{ m.adapter_type }}</span>
-              </label>
-            </div>
-          </div>
-        </template>
-      </div>
-    </details>
-
     <!-- Trained models badge row -->
     <div v-if="fineTunedModels.length > 0" class="trained-models-row">
       <span class="trained-label">Trained:</span>
@@ -272,16 +224,6 @@ const LABEL_META: Record<string, { emoji: string }> = {
 }
 
 // ── Types ────────────────────────────────────────────────────────────────────
-interface AvailableModel {
-  name: string
-  repo_id?: string
-  adapter_type: string
-}
-
-interface ModelCategoriesResponse {
-  categories: Record<string, AvailableModel[]>
-}
-
 interface FineTunedModel {
   name: string
   base_model_id?: string
@@ -312,13 +254,6 @@ const runError = ref('')
 const includeSlow = ref(false)
 const logEl = ref<HTMLElement | null>(null)
 
-// Model picker state
-const modelCategories = ref<Record<string, AvailableModel[]>>({})
-const selectedModels = ref<Set<string>>(new Set())
-const allModels = ref<string[]>([])
-const modelsLoading = ref(false)
-const pickerEl = ref<HTMLDetailsElement | null>(null)
-
 // Fine-tune state
 const fineTunedModels = ref<FineTunedModel[]>([])
 const ftModel = ref('deberta-small')
@@ -339,52 +274,6 @@ async function cancelFinetune() {
   await fetch('/api/finetune/cancel', { method: 'POST' }).catch(() => {})
 }
 
-// ── Model picker computed ─────────────────────────────────────────────────────
-const pickerSummaryText = computed(() => {
-  const total = allModels.value.length
-  if (total === 0) return 'No models available'
-  const selected = selectedModels.value.size
-  if (selected === total) return `All models (${total})`
-  return `${selected} of ${total} selected`
-})
-
-function isCategoryAllSelected(models: AvailableModel[]): boolean {
-  return models.length > 0 && models.every(m => selectedModels.value.has(m.name))
-}
-
-function isCategoryIndeterminate(models: AvailableModel[]): boolean {
-  const someSelected = models.some(m => selectedModels.value.has(m.name))
-  return someSelected && !isCategoryAllSelected(models)
-}
-
-function toggleModel(name: string, checked: boolean) {
-  const next = new Set(selectedModels.value)
-  if (checked) next.add(name)
-  else next.delete(name)
-  selectedModels.value = next
-}
-
-function toggleCategory(models: AvailableModel[], checked: boolean) {
-  const next = new Set(selectedModels.value)
-  for (const m of models) {
-    if (checked) next.add(m.name)
-    else next.delete(m.name)
-  }
-  selectedModels.value = next
-}
-
-async function loadModelCategories() {
-  modelsLoading.value = true
-  const { data } = await useApiFetch<ModelCategoriesResponse>('/api/benchmark/models')
-  modelsLoading.value = false
-  if (data?.categories) {
-    modelCategories.value = data.categories
-    const flat = Object.values(data.categories).flat().map(m => m.name)
-    allModels.value = flat
-    selectedModels.value = new Set(flat)
-  }
-}
-
 // ── Derived ──────────────────────────────────────────────────────────────────
 const modelNames = computed(() => Object.keys(results.value?.models ?? {}))
 const modelCount = computed(() => modelNames.value.length)
@@ -466,16 +355,7 @@ function startBenchmark() {
   runError.value = ''
   runCancelled.value = false
 
-  const params = new URLSearchParams()
-  if (includeSlow.value) params.set('include_slow', 'true')
-  // Only send model_names when a subset is selected (not all, not none)
-  const total = allModels.value.length
-  const selected = selectedModels.value.size
-  if (total > 0 && selected > 0 && selected < total) {
-    params.set('model_names', [...selectedModels.value].join(','))
-  }
-  const qs = params.toString()
-  const url = `/api/benchmark/run${qs ? `?${qs}` : ''}`
+  const url = `/api/benchmark/run${includeSlow.value ? '?include_slow=true' : ''}`
   useApiSSE(
     url,
     async (event) => {
@@ -547,7 +427,6 @@ function startFinetune() {
 onMounted(() => {
   loadResults()
   loadFineTunedModels()
-  loadModelCategories()
 })
 </script>
 
@@ -883,134 +762,6 @@ onMounted(() => {
   font-weight: 700;
 }
 
-/* ── Model Picker ───────────────────────────────────────── */
-.model-picker {
-  border: 1px solid var(--color-border, #d0d7e8);
-  border-radius: 0.5rem;
-  overflow: hidden;
-}
-
-.picker-summary {
-  display: flex;
-  align-items: center;
-  gap: 0.6rem;
-  padding: 0.65rem 0.9rem;
-  cursor: pointer;
-  user-select: none;
-  list-style: none;
-  background: var(--color-surface-raised, #e4ebf5);
-}
-.picker-summary::-webkit-details-marker { display: none; }
-.picker-summary::before { content: '▶ '; font-size: 0.65rem; color: var(--color-text-secondary, #6b7a99); }
-details[open] .picker-summary::before { content: '▼ '; }
-
-.picker-title {
-  font-size: 0.9rem;
-  font-weight: 600;
-  color: var(--color-text, #1a2338);
-}
-
-.picker-badge {
-  font-size: 0.75rem;
-  color: var(--color-text-secondary, #6b7a99);
-  background: var(--color-surface, #fff);
-  border: 1px solid var(--color-border, #d0d7e8);
-  padding: 0.15rem 0.5rem;
-  border-radius: 1rem;
-  font-family: var(--font-mono, monospace);
-  margin-left: auto;
-}
-
-.picker-body {
-  padding: 0.75rem;
-  border-top: 1px solid var(--color-border, #d0d7e8);
-  display: flex;
-  flex-direction: column;
-  gap: 0.75rem;
-}
-
-.picker-loading,
-.picker-empty {
-  font-size: 0.85rem;
-  color: var(--color-text-secondary, #6b7a99);
-  padding: 0.5rem 0;
-}
-
-.picker-category {
-  display: flex;
-  flex-direction: column;
-  gap: 0.3rem;
-}
-
-.picker-cat-header {
-  display: flex;
-  align-items: center;
-  gap: 0.45rem;
-  font-size: 0.82rem;
-  font-weight: 700;
-  color: var(--color-text, #1a2338);
-  text-transform: uppercase;
-  letter-spacing: 0.04em;
-  cursor: pointer;
-}
-
-.picker-cat-count {
-  font-weight: 400;
-  color: var(--color-text-secondary, #6b7a99);
-  font-family: var(--font-mono, monospace);
-  font-size: 0.75rem;
-  text-transform: none;
-  letter-spacing: 0;
-}
-
-.picker-no-models {
-  font-size: 0.78rem;
-  color: var(--color-text-secondary, #6b7a99);
-  opacity: 0.65;
-  padding-left: 1.4rem;
-  font-style: italic;
-}
-
-.picker-model-list {
-  display: flex;
-  flex-wrap: wrap;
-  gap: 0.35rem 0.75rem;
-  padding-left: 1.4rem;
-}
-
-.picker-model-row {
-  display: flex;
-  align-items: center;
-  gap: 0.35rem;
-  font-size: 0.82rem;
-  cursor: pointer;
-  color: var(--color-text, #1a2338);
-}
-
-.picker-model-name {
-  font-family: var(--font-mono, monospace);
-  font-size: 0.78rem;
-  white-space: nowrap;
-  max-width: 18ch;
-  overflow: hidden;
-  text-overflow: ellipsis;
-}
-
-.picker-adapter-type {
-  font-size: 0.68rem;
-  color: var(--color-text-secondary, #6b7a99);
-  background: var(--color-surface-raised, #e4ebf5);
-  border: 1px solid var(--color-border, #d0d7e8);
-  border-radius: 0.25rem;
-  padding: 0.05rem 0.3rem;
-  font-family: var(--font-mono, monospace);
-}
-
-@media (max-width: 600px) {
-  .picker-model-list { padding-left: 0; }
-  .picker-model-name { max-width: 14ch; }
-}
-
 /* ── Fine-tune section ──────────────────────────────────── */
 .ft-section {
   border: 1px solid var(--color-border, #d0d7e8);
@@ -1,328 +0,0 @@
-<template>
-  <div class="corrections-view">
-    <header class="cv-header">
-      <span class="queue-count">
-        <template v-if="loading">Loading…</template>
-        <template v-else-if="store.totalRemaining > 0">
-          {{ store.totalRemaining }} remaining
-        </template>
-        <span v-else class="queue-empty-label">All caught up</span>
-      </span>
-      <div class="header-actions">
-        <button @click="handleUndo" :disabled="!store.lastAction" class="btn-action">↩ Undo</button>
-      </div>
-    </header>
-
-    <!-- States -->
-    <div v-if="loading" class="skeleton-card" aria-label="Loading candidates" />
-
-    <div v-else-if="apiError" class="error-display" role="alert">
-      <p>Couldn't reach Avocet API.</p>
-      <button @click="fetchBatch" class="btn-action">Retry</button>
-    </div>
-
-    <div v-else-if="!store.current" class="empty-state">
-      <p>No candidates need review.</p>
-      <p class="empty-hint">Import a benchmark run from the Settings tab to get started.</p>
-    </div>
-
-    <template v-else>
-      <div class="card-wrapper">
-        <SftCard
-          :item="store.current"
-          :correcting="correcting"
-          @correct="startCorrection"
-          @discard="handleDiscard"
-          @flag="handleFlag"
-          @submit-correction="handleCorrect"
-          @cancel-correction="correcting = false"
-          ref="sftCardEl"
-        />
-      </div>
-    </template>
-
-    <!-- Stats footer -->
-    <footer v-if="stats" class="stats-footer">
-      <span class="stat">✓ {{ stats.by_status?.approved ?? 0 }} approved</span>
-      <span class="stat">✕ {{ stats.by_status?.discarded ?? 0 }} discarded</span>
-      <span class="stat">⚑ {{ stats.by_status?.model_rejected ?? 0 }} flagged</span>
-      <a
-        v-if="(stats.export_ready ?? 0) > 0"
-        :href="exportUrl"
-        download
-        class="btn-export"
-      >
-        ⬇ Export {{ stats.export_ready }} corrections
-      </a>
-    </footer>
-
-    <!-- Undo toast (inline — UndoToast.vue uses label store's LastAction shape, not SFT's) -->
-    <div v-if="store.lastAction" class="undo-toast">
-      <span>Last: {{ store.lastAction.type }}</span>
-      <button @click="handleUndo" class="btn-undo">↩ Undo</button>
-      <button @click="store.clearLastAction()" class="btn-dismiss">✕</button>
-    </div>
-  </div>
-</template>
-
-<script setup lang="ts">
-import { ref, onMounted } from 'vue'
-import { useSftStore } from '../stores/sft'
-import type { SftFailureCategory } from '../stores/sft'
-import { useSftKeyboard } from '../composables/useSftKeyboard'
-import SftCard from '../components/SftCard.vue'
-
-const store = useSftStore()
-const loading = ref(false)
-const apiError = ref(false)
-const correcting = ref(false)
-const stats = ref<Record<string, any> | null>(null)
-const exportUrl = '/api/sft/export'
-const sftCardEl = ref<InstanceType<typeof SftCard> | null>(null)
-
-useSftKeyboard({
-  onCorrect: () => { if (store.current && !correcting.value) correcting.value = true },
-  onDiscard: () => { if (store.current && !correcting.value) handleDiscard() },
-  onFlag: () => { if (store.current && !correcting.value) handleFlag() },
-  onEscape: () => { correcting.value = false },
-  onSubmit: () => {},
-  isEditing: () => correcting.value,
-})
-
-async function fetchBatch() {
-  loading.value = true
-  apiError.value = false
-  try {
-    const res = await fetch('/api/sft/queue?per_page=20')
-    if (!res.ok) throw new Error('API error')
-    const data = await res.json()
-    store.queue = data.items
-    store.totalRemaining = data.total
-  } catch {
-    apiError.value = true
-  } finally {
-    loading.value = false
-  }
-}
-
-async function fetchStats() {
-  try {
-    const res = await fetch('/api/sft/stats')
-    if (res.ok) stats.value = await res.json()
-  } catch { /* ignore */ }
-}
-
-function startCorrection() {
-  correcting.value = true
-}
-
-async function handleCorrect(text: string, category: SftFailureCategory | null = null) {
-  if (!store.current) return
-  const item = store.current
-  correcting.value = false
-  try {
-    const body: Record<string, unknown> = { id: item.id, action: 'correct', corrected_response: text }
-    if (category != null) body.failure_category = category
-    const res = await fetch('/api/sft/submit', {
-      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify(body),
-    })
-    if (!res.ok) throw new Error(`HTTP ${res.status}`)
-    store.removeCurrentFromQueue()
-    store.setLastAction('correct', item, category)
-    store.totalRemaining = Math.max(0, store.totalRemaining - 1)
-    fetchStats()
-    if (store.queue.length < 5) fetchBatch()
-  } catch (err) {
-    console.error('handleCorrect failed:', err)
-  }
-}
-
-async function handleDiscard(category: SftFailureCategory | null = null) {
-  if (!store.current) return
-  const item = store.current
-  try {
-    const body: Record<string, unknown> = { id: item.id, action: 'discard' }
-    if (category != null) body.failure_category = category
-    const res = await fetch('/api/sft/submit', {
-      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify(body),
-    })
-    if (!res.ok) throw new Error(`HTTP ${res.status}`)
-    store.removeCurrentFromQueue()
-    store.setLastAction('discard', item, category)
-    store.totalRemaining = Math.max(0, store.totalRemaining - 1)
-    fetchStats()
-    if (store.queue.length < 5) fetchBatch()
-  } catch (err) {
-    console.error('handleDiscard failed:', err)
-  }
-}
-
-async function handleFlag(category: SftFailureCategory | null = null) {
-  if (!store.current) return
-  const item = store.current
-  try {
-    const body: Record<string, unknown> = { id: item.id, action: 'flag' }
-    if (category != null) body.failure_category = category
-    const res = await fetch('/api/sft/submit', {
-      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify(body),
-    })
-    if (!res.ok) throw new Error(`HTTP ${res.status}`)
-    store.removeCurrentFromQueue()
-    store.setLastAction('flag', item, category)
-    store.totalRemaining = Math.max(0, store.totalRemaining - 1)
-    fetchStats()
-    if (store.queue.length < 5) fetchBatch()
-  } catch (err) {
-    console.error('handleFlag failed:', err)
-  }
-}
-
-async function handleUndo() {
-  if (!store.lastAction) return
-  const action = store.lastAction
-  const { item } = action
-  try {
-    const res = await fetch('/api/sft/undo', {
-      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify({ id: item.id }),
-    })
-    if (!res.ok) throw new Error(`HTTP ${res.status}`)
-    store.restoreItem(item)
-    store.totalRemaining++
-    store.clearLastAction()
-    fetchStats()
-  } catch (err) {
-    // Backend did not restore — clear the undo UI without restoring queue state
-    console.error('handleUndo failed:', err)
-    store.clearLastAction()
-  }
-}
-
-onMounted(() => {
-  fetchBatch()
-  fetchStats()
-})
-</script>
-
-<style scoped>
-.corrections-view {
-  display: flex;
-  flex-direction: column;
-  min-height: 100dvh;
-  padding: var(--space-4);
-  gap: var(--space-4);
-  max-width: 760px;
-  margin: 0 auto;
-}
-
-.cv-header {
-  display: flex;
-  justify-content: space-between;
-  align-items: center;
-}
-
-.queue-count {
-  font-size: 1rem;
-  font-weight: 600;
-  color: var(--color-text);
-}
-
-.queue-empty-label { color: var(--color-text-muted); }
-
-.btn-action {
-  padding: var(--space-2) var(--space-3);
-  border: 1px solid var(--color-border);
-  border-radius: var(--radius-md);
-  background: var(--color-surface-raised);
-  cursor: pointer;
-  font-size: 0.88rem;
-}
-
-.btn-action:disabled { opacity: 0.4; cursor: not-allowed; }
-
-.skeleton-card {
-  height: 320px;
-  background: var(--color-surface-alt);
-  border-radius: var(--radius-lg);
-  animation: pulse 1.5s ease-in-out infinite;
-}
-
-@keyframes pulse {
-  0%, 100% { opacity: 1; }
-  50% { opacity: 0.5; }
-}
-
-@media (prefers-reduced-motion: reduce) {
-  .skeleton-card { animation: none; }
-}
-
-.error-display, .empty-state {
-  text-align: center;
-  padding: var(--space-12);
-  color: var(--color-text-muted);
-}
-
-.empty-hint { font-size: 0.88rem; margin-top: var(--space-2); }
-
-.stats-footer {
-  display: flex;
-  gap: var(--space-4);
-  align-items: center;
-  flex-wrap: wrap;
-  padding: var(--space-3) 0;
-  border-top: 1px solid var(--color-border-light);
-  font-size: 0.85rem;
-  color: var(--color-text-muted);
-}
-
-.btn-export {
-  margin-left: auto;
-  padding: var(--space-2) var(--space-3);
-  background: var(--color-primary);
-  color: var(--color-text-inverse, #fff);
-  border-radius: var(--radius-md);
-  text-decoration: none;
-  font-size: 0.88rem;
-}
-
-.undo-toast {
-  position: fixed;
-  bottom: var(--space-6);
-  left: 50%;
-  transform: translateX(-50%);
-  display: flex;
-  align-items: center;
-  gap: var(--space-3);
-  background: var(--color-surface-raised);
-  border: 1px solid var(--color-border);
-  border-radius: var(--radius-md);
-  padding: var(--space-3) var(--space-4);
-  box-shadow: 0 4px 12px rgba(0,0,0,0.15);
-  font-size: 0.9rem;
-}
-
-.btn-undo {
-  background: var(--color-primary);
-  color: var(--color-text-inverse, #fff);
-  border: none;
-  border-radius: var(--radius-sm);
-  padding: var(--space-1) var(--space-3);
-  cursor: pointer;
-  font-size: 0.88rem;
-}
-
-.btn-dismiss {
-  background: none;
-  border: none;
-  color: var(--color-text-muted);
-  cursor: pointer;
-  font-size: 1rem;
-}
-</style>
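Review note: the three submit handlers above repeat the same post-submit bookkeeping (clamp the remaining count at zero, refill when the local queue drops under five items). A hypothetical pure sketch of that shared rule, in case it ever gets factored out; the name `afterSubmit` and the refill threshold constant are assumptions drawn from the code above, not part of this PR:

```typescript
// Shared post-submit bookkeeping: totalRemaining never goes negative,
// and a refill is requested once fewer than 5 items remain locally.
const REFILL_THRESHOLD = 5

function afterSubmit(
  queueLength: number,
  totalRemaining: number,
): { totalRemaining: number; shouldRefill: boolean } {
  return {
    totalRemaining: Math.max(0, totalRemaining - 1),
    shouldRefill: queueLength < REFILL_THRESHOLD,
  }
}
```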
@ -1,826 +0,0 @@
<template>
  <div class="models-view">
    <h1 class="page-title">🤗 Models</h1>

    <!-- ── 1. HF Lookup ───────────────────────────────── -->
    <section class="section">
      <h2 class="section-title">HuggingFace Lookup</h2>

      <div class="lookup-row">
        <input
          v-model="lookupInput"
          type="text"
          class="lookup-input"
          placeholder="org/model or huggingface.co/org/model"
          :disabled="lookupLoading"
          @keydown.enter="doLookup"
          aria-label="HuggingFace model ID"
        />
        <button
          class="btn-primary"
          :disabled="lookupLoading || !lookupInput.trim()"
          @click="doLookup"
        >
          {{ lookupLoading ? 'Looking up…' : 'Lookup' }}
        </button>
      </div>

      <div v-if="lookupError" class="error-notice" role="alert">
        {{ lookupError }}
      </div>

      <div v-if="lookupResult" class="preview-card">
        <div class="preview-header">
          <span class="preview-repo-id">{{ lookupResult.repo_id }}</span>
          <div class="badge-group">
            <span v-if="lookupResult.already_installed" class="badge badge-success">Installed</span>
            <span v-if="lookupResult.already_queued" class="badge badge-info">In queue</span>
          </div>
        </div>

        <div class="preview-meta">
          <span v-if="lookupResult.pipeline_tag" class="chip chip-pipeline">
            {{ lookupResult.pipeline_tag }}
          </span>
          <span v-if="lookupResult.adapter_recommendation" class="chip chip-adapter">
            {{ lookupResult.adapter_recommendation }}
          </span>
          <span v-if="lookupResult.size != null" class="preview-size">
            {{ humanBytes(lookupResult.size) }}
          </span>
        </div>

        <p v-if="lookupResult.description" class="preview-desc">
          {{ lookupResult.description }}
        </p>

        <button
          class="btn-primary btn-add-queue"
          :disabled="lookupResult.already_installed || lookupResult.already_queued || addingToQueue"
          @click="addToQueue"
        >
          {{ addingToQueue ? 'Adding…' : 'Add to queue' }}
        </button>
      </div>
    </section>

    <!-- ── 2. Approval Queue ──────────────────────────── -->
    <section class="section">
      <h2 class="section-title">Approval Queue</h2>

      <div v-if="pendingModels.length === 0" class="empty-notice">
        No models waiting for approval.
      </div>

      <div v-for="model in pendingModels" :key="model.id" class="model-card">
        <div class="model-card-header">
          <span class="model-repo-id">{{ model.repo_id }}</span>
          <button
            class="btn-dismiss"
            :aria-label="`Dismiss ${model.repo_id}`"
            @click="dismissModel(model.id)"
          >
            ✕
          </button>
        </div>
        <div class="model-meta">
          <span v-if="model.pipeline_tag" class="chip chip-pipeline">{{ model.pipeline_tag }}</span>
          <span v-if="model.adapter_recommendation" class="chip chip-adapter">{{ model.adapter_recommendation }}</span>
        </div>
        <div class="model-card-actions">
          <button class="btn-primary btn-sm" @click="approveModel(model.id)">
            Approve download
          </button>
        </div>
      </div>
    </section>

    <!-- ── 3. Active Downloads ────────────────────────── -->
    <section class="section">
      <h2 class="section-title">Active Downloads</h2>

      <div v-if="downloadingModels.length === 0" class="empty-notice">
        No active downloads.
      </div>

      <div v-for="model in downloadingModels" :key="model.id" class="model-card">
        <div class="model-card-header">
          <span class="model-repo-id">{{ model.repo_id }}</span>
          <span v-if="downloadErrors[model.id]" class="badge badge-error">Error</span>
        </div>
        <div class="model-meta">
          <span v-if="model.pipeline_tag" class="chip chip-pipeline">{{ model.pipeline_tag }}</span>
        </div>

        <div v-if="downloadErrors[model.id]" class="download-error" role="alert">
          {{ downloadErrors[model.id] }}
        </div>
        <div v-else class="progress-wrap" :aria-label="`Download progress for ${model.repo_id}`">
          <div
            class="progress-bar"
            :style="{ width: `${downloadProgress[model.id] ?? 0}%` }"
            role="progressbar"
            :aria-valuenow="downloadProgress[model.id] ?? 0"
            aria-valuemin="0"
            aria-valuemax="100"
          />
          <span class="progress-label">
            {{ downloadProgress[model.id] == null ? 'Preparing…' : `${downloadProgress[model.id]}%` }}
          </span>
        </div>
      </div>
    </section>

    <!-- ── 4. Installed Models ────────────────────────── -->
    <section class="section">
      <h2 class="section-title">Installed Models</h2>

      <div v-if="installedModels.length === 0" class="empty-notice">
        No models installed yet.
      </div>

      <div v-else class="installed-table-wrap">
        <table class="installed-table">
          <thead>
            <tr>
              <th>Name</th>
              <th>Type</th>
              <th>Adapter</th>
              <th>Size</th>
              <th></th>
            </tr>
          </thead>
          <tbody>
            <tr v-for="model in installedModels" :key="model.name">
              <td class="td-name">{{ model.name }}</td>
              <td>
                <span
                  class="badge"
                  :class="model.type === 'finetuned' ? 'badge-accent' : 'badge-info'"
                >
                  {{ model.type }}
                </span>
              </td>
              <td>{{ model.adapter ?? '—' }}</td>
              <td>{{ humanBytes(model.size) }}</td>
              <td>
                <button
                  class="btn-danger btn-sm"
                  @click="deleteInstalled(model.name)"
                >
                  Delete
                </button>
              </td>
            </tr>
          </tbody>
        </table>
      </div>
    </section>
  </div>
</template>

<script setup lang="ts">
import { ref, computed, onMounted, onUnmounted } from 'vue'

// ── Type definitions ──────────────────────────────────

interface LookupResult {
  repo_id: string
  pipeline_tag: string | null
  adapter_recommendation: string | null
  size: number | null
  description: string | null
  already_installed: boolean
  already_queued: boolean
}

interface QueuedModel {
  id: string
  repo_id: string
  status: 'pending' | 'downloading' | 'done' | 'error'
  pipeline_tag: string | null
  adapter_recommendation: string | null
}

interface InstalledModel {
  name: string
  type: 'finetuned' | 'downloaded'
  adapter: string | null
  size: number
}

interface SseProgressEvent {
  model_id: string
  pct: number | null
  status: 'progress' | 'done' | 'error'
  message?: string
}

// ── State ─────────────────────────────────────────────

const lookupInput = ref('')
const lookupLoading = ref(false)
const lookupError = ref<string | null>(null)
const lookupResult = ref<LookupResult | null>(null)
const addingToQueue = ref(false)

const queuedModels = ref<QueuedModel[]>([])
const installedModels = ref<InstalledModel[]>([])

const downloadProgress = ref<Record<string, number>>({})
const downloadErrors = ref<Record<string, string>>({})

let pollInterval: ReturnType<typeof setInterval> | null = null
let sseSource: EventSource | null = null

// ── Derived ───────────────────────────────────────────

const pendingModels = computed(() =>
  queuedModels.value.filter(m => m.status === 'pending')
)

const downloadingModels = computed(() =>
  queuedModels.value.filter(m => m.status === 'downloading')
)

// ── Helpers ───────────────────────────────────────────

function humanBytes(bytes: number | null): string {
  if (bytes == null) return '—'
  const units = ['B', 'KB', 'MB', 'GB', 'TB']
  let value = bytes
  let unitIndex = 0
  while (value >= 1024 && unitIndex < units.length - 1) {
    value /= 1024
    unitIndex++
  }
  return `${value.toFixed(unitIndex === 0 ? 0 : 1)} ${units[unitIndex]}`
}

function normalizeRepoId(raw: string): string {
  return raw.trim().replace(/^https?:\/\/huggingface\.co\//, '')
}

// ── API calls ─────────────────────────────────────────

async function doLookup() {
  const repoId = normalizeRepoId(lookupInput.value)
  if (!repoId) return

  lookupLoading.value = true
  lookupError.value = null
  lookupResult.value = null

  try {
    const res = await fetch(`/api/models/lookup?repo_id=${encodeURIComponent(repoId)}`)
    if (res.status === 404) {
      lookupError.value = 'Model not found on HuggingFace.'
      return
    }
    if (res.status === 502) {
      lookupError.value = 'HuggingFace unreachable. Check your connection and try again.'
      return
    }
    if (!res.ok) {
      lookupError.value = `Lookup failed (HTTP ${res.status}).`
      return
    }
    lookupResult.value = await res.json() as LookupResult
  } catch {
    lookupError.value = 'Network error. Is the Avocet API running?'
  } finally {
    lookupLoading.value = false
  }
}

async function addToQueue() {
  if (!lookupResult.value) return
  addingToQueue.value = true
  try {
    const res = await fetch('/api/models/queue', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ repo_id: lookupResult.value.repo_id }),
    })
    if (res.ok) {
      lookupResult.value = { ...lookupResult.value, already_queued: true }
      await loadQueue()
    }
  } catch { /* ignore — already_queued badge won't flip, user can retry */ }
  finally {
    addingToQueue.value = false
  }
}

async function approveModel(id: string) {
  try {
    const res = await fetch(`/api/models/queue/${encodeURIComponent(id)}/approve`, { method: 'POST' })
    if (res.ok) {
      await loadQueue()
      startSse()
    }
  } catch { /* ignore */ }
}

async function dismissModel(id: string) {
  try {
    const res = await fetch(`/api/models/queue/${encodeURIComponent(id)}`, { method: 'DELETE' })
    if (res.ok) {
      queuedModels.value = queuedModels.value.filter(m => m.id !== id)
    }
  } catch { /* ignore */ }
}

async function deleteInstalled(name: string) {
  if (!window.confirm(`Delete installed model "${name}"? This cannot be undone.`)) return
  try {
    const res = await fetch(`/api/models/installed/${encodeURIComponent(name)}`, { method: 'DELETE' })
    if (res.ok) {
      installedModels.value = installedModels.value.filter(m => m.name !== name)
    }
  } catch { /* ignore */ }
}

async function loadQueue() {
  try {
    const res = await fetch('/api/models/queue')
    if (res.ok) queuedModels.value = await res.json() as QueuedModel[]
  } catch { /* non-fatal */ }
}

async function loadInstalled() {
  try {
    const res = await fetch('/api/models/installed')
    if (res.ok) installedModels.value = await res.json() as InstalledModel[]
  } catch { /* non-fatal */ }
}

// ── SSE for download progress ─────────────────────────

function startSse() {
  if (sseSource) return // already connected

  sseSource = new EventSource('/api/models/download/stream')

  sseSource.addEventListener('message', (e: MessageEvent) => {
    let event: SseProgressEvent
    try {
      event = JSON.parse(e.data as string) as SseProgressEvent
    } catch {
      return
    }

    const { model_id, pct, status, message } = event

    if (status === 'progress' && pct != null) {
      downloadProgress.value = { ...downloadProgress.value, [model_id]: pct }
    } else if (status === 'done') {
      const updated = { ...downloadProgress.value }
      delete updated[model_id]
      downloadProgress.value = updated

      queuedModels.value = queuedModels.value.filter(m => m.id !== model_id)
      loadInstalled()
    } else if (status === 'error') {
      downloadErrors.value = {
        ...downloadErrors.value,
        [model_id]: message ?? 'Download failed.',
      }
    }
  })

  sseSource.onerror = () => {
    sseSource?.close()
    sseSource = null
  }
}

function stopSse() {
  sseSource?.close()
  sseSource = null
}

// ── Polling ───────────────────────────────────────────

function startPollingIfDownloading() {
  if (pollInterval) return
  pollInterval = setInterval(async () => {
    await loadQueue()
    if (downloadingModels.value.length === 0) {
      stopPolling()
    }
  }, 5000)
}

function stopPolling() {
  if (pollInterval) {
    clearInterval(pollInterval)
    pollInterval = null
  }
}

// ── Lifecycle ─────────────────────────────────────────

onMounted(async () => {
  await Promise.all([loadQueue(), loadInstalled()])

  if (downloadingModels.value.length > 0) {
    startSse()
    startPollingIfDownloading()
  }
})

onUnmounted(() => {
  stopPolling()
  stopSse()
})
</script>

<style scoped>
.models-view {
  max-width: 760px;
  margin: 0 auto;
  padding: 1.5rem 1rem 4rem;
  display: flex;
  flex-direction: column;
  gap: 2rem;
}

.page-title {
  font-family: var(--font-display, var(--font-body, sans-serif));
  font-size: 1.4rem;
  font-weight: 700;
  color: var(--color-primary, #2d5a27);
}

/* ── Sections ── */
.section {
  display: flex;
  flex-direction: column;
  gap: 0.75rem;
}

.section-title {
  font-size: 1rem;
  font-weight: 600;
  color: var(--color-text, #1a2338);
  padding-bottom: 0.4rem;
  border-bottom: 1px solid var(--color-border, #a8b8d0);
}

/* ── Lookup row ── */
.lookup-row {
  display: flex;
  gap: 0.5rem;
  flex-wrap: wrap;
}

.lookup-input {
  flex: 1;
  min-width: 0;
  padding: 0.45rem 0.7rem;
  border: 1px solid var(--color-border, #a8b8d0);
  border-radius: var(--radius-md, 0.5rem);
  background: var(--color-surface-raised, #f5f7fc);
  color: var(--color-text, #1a2338);
  font-size: 0.9rem;
  font-family: var(--font-body, sans-serif);
}

.lookup-input:disabled {
  opacity: 0.6;
}

.lookup-input::placeholder {
  color: var(--color-text-muted, #4a5c7a);
}

/* ── Notices ── */
.error-notice {
  padding: 0.6rem 0.8rem;
  background: color-mix(in srgb, var(--color-error, #c0392b) 12%, transparent);
  border: 1px solid color-mix(in srgb, var(--color-error, #c0392b) 30%, transparent);
  border-radius: var(--radius-md, 0.5rem);
  color: var(--color-error, #c0392b);
  font-size: 0.88rem;
}

.empty-notice {
  color: var(--color-text-muted, #4a5c7a);
  font-size: 0.9rem;
  padding: 0.75rem;
  border: 1px dashed var(--color-border, #a8b8d0);
  border-radius: var(--radius-md, 0.5rem);
}

/* ── Preview card ── */
.preview-card {
  border: 1px solid var(--color-border, #a8b8d0);
  border-radius: var(--radius-lg, 1rem);
  background: var(--color-surface-raised, #f5f7fc);
  padding: 1rem;
  display: flex;
  flex-direction: column;
  gap: 0.6rem;
  box-shadow: var(--shadow-sm);
}

.preview-header {
  display: flex;
  align-items: flex-start;
  justify-content: space-between;
  gap: 0.5rem;
  flex-wrap: wrap;
}

.preview-repo-id {
  font-family: var(--font-mono, monospace);
  font-size: 0.95rem;
  font-weight: 600;
  color: var(--color-text, #1a2338);
  word-break: break-all;
}

.preview-meta {
  display: flex;
  gap: 0.4rem;
  flex-wrap: wrap;
  align-items: center;
}

.preview-size {
  font-size: 0.8rem;
  color: var(--color-text-muted, #4a5c7a);
  margin-left: 0.25rem;
}

.preview-desc {
  font-size: 0.875rem;
  color: var(--color-text-muted, #4a5c7a);
  line-height: 1.5;
  margin: 0;
  display: -webkit-box;
  -webkit-line-clamp: 3;
  -webkit-box-orient: vertical;
  overflow: hidden;
}

.btn-add-queue {
  align-self: flex-start;
}

/* ── Model cards (queue + downloads) ── */
.model-card {
  border: 1px solid var(--color-border, #a8b8d0);
  border-radius: var(--radius-md, 0.5rem);
  background: var(--color-surface-raised, #f5f7fc);
  padding: 0.75rem 1rem;
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
  box-shadow: var(--shadow-sm);
}

.model-card-header {
  display: flex;
  align-items: center;
  justify-content: space-between;
  gap: 0.5rem;
}

.model-repo-id {
  font-family: var(--font-mono, monospace);
  font-size: 0.9rem;
  font-weight: 600;
  color: var(--color-text, #1a2338);
  word-break: break-all;
}

.model-meta {
  display: flex;
  gap: 0.4rem;
  flex-wrap: wrap;
}

.model-card-actions {
  display: flex;
  gap: 0.5rem;
  flex-wrap: wrap;
  padding-top: 0.25rem;
}

/* ── Progress bar ── */
.progress-wrap {
  position: relative;
  height: 1.5rem;
  background: var(--color-surface-alt, #dde4f0);
  border-radius: var(--radius-full, 9999px);
  overflow: hidden;
}

.progress-bar {
  position: absolute;
  top: 0;
  left: 0;
  height: 100%;
  background: var(--color-accent, #c4732a);
  border-radius: var(--radius-full, 9999px);
  transition: width 300ms ease;
}

.progress-label {
  position: absolute;
  inset: 0;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.75rem;
  font-weight: 600;
  color: var(--color-text, #1a2338);
  pointer-events: none;
}

.download-error {
  font-size: 0.85rem;
  color: var(--color-error, #c0392b);
  padding: 0.4rem 0.5rem;
  background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
  border-radius: var(--radius-sm, 0.25rem);
}

/* ── Installed table ── */
.installed-table-wrap {
  overflow-x: auto;
}

.installed-table {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.875rem;
}

.installed-table th {
  text-align: left;
  padding: 0.4rem 0.6rem;
  color: var(--color-text-muted, #4a5c7a);
  font-size: 0.78rem;
  font-weight: 600;
  text-transform: uppercase;
  letter-spacing: 0.03em;
  border-bottom: 1px solid var(--color-border, #a8b8d0);
  white-space: nowrap;
}

.installed-table td {
  padding: 0.55rem 0.6rem;
  border-bottom: 1px solid var(--color-border-light, #ccd5e6);
  vertical-align: middle;
}

.td-name {
  font-family: var(--font-mono, monospace);
  font-size: 0.85rem;
  word-break: break-all;
}

/* ── Badges ── */
.badge-group {
  display: flex;
  gap: 0.35rem;
  flex-wrap: wrap;
  align-items: center;
}

.badge {
  display: inline-flex;
  align-items: center;
  padding: 0.15rem 0.55rem;
  border-radius: var(--radius-full, 9999px);
  font-size: 0.72rem;
  font-weight: 700;
  letter-spacing: 0.02em;
  text-transform: uppercase;
  white-space: nowrap;
}

.badge-success {
  background: color-mix(in srgb, var(--color-success, #3a7a32) 15%, transparent);
  color: var(--color-success, #3a7a32);
}

.badge-info {
  background: color-mix(in srgb, var(--color-info, #1e6091) 15%, transparent);
  color: var(--color-info, #1e6091);
}

.badge-accent {
  background: color-mix(in srgb, var(--color-accent, #c4732a) 15%, transparent);
  color: var(--color-accent, #c4732a);
}

.badge-error {
  background: color-mix(in srgb, var(--color-error, #c0392b) 15%, transparent);
  color: var(--color-error, #c0392b);
}

/* ── Chips ── */
.chip {
  display: inline-flex;
  align-items: center;
  padding: 0.15rem 0.5rem;
  border-radius: var(--radius-full, 9999px);
  font-size: 0.75rem;
  font-weight: 600;
  background: var(--color-surface-alt, #dde4f0);
  white-space: nowrap;
}

.chip-pipeline {
  color: var(--color-primary, #2d5a27);
  background: color-mix(in srgb, var(--color-primary, #2d5a27) 12%, var(--color-surface-alt, #dde4f0));
}

.chip-adapter {
  color: var(--color-accent, #c4732a);
  background: color-mix(in srgb, var(--color-accent, #c4732a) 12%, var(--color-surface-alt, #dde4f0));
}

/* ── Buttons ── */
.btn-primary, .btn-danger {
  padding: 0.4rem 0.9rem;
  border-radius: var(--radius-md, 0.5rem);
  font-size: 0.85rem;
  cursor: pointer;
  border: 1px solid;
  font-family: var(--font-body, sans-serif);
  transition: background var(--transition, 200ms ease), color var(--transition, 200ms ease);
}

.btn-sm {
  padding: 0.25rem 0.65rem;
  font-size: 0.8rem;
}

.btn-primary {
  border-color: var(--color-primary, #2d5a27);
  background: var(--color-primary, #2d5a27);
  color: var(--color-text-inverse, #eaeff8);
}

.btn-primary:hover:not(:disabled) {
  background: var(--color-primary-hover, #234820);
  border-color: var(--color-primary-hover, #234820);
}

.btn-primary:disabled {
  opacity: 0.5;
  cursor: not-allowed;
}

.btn-danger {
  border-color: var(--color-error, #c0392b);
  background: transparent;
  color: var(--color-error, #c0392b);
}

.btn-danger:hover {
  background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
}

.btn-dismiss {
  border: none;
  background: transparent;
  color: var(--color-text-muted, #4a5c7a);
  cursor: pointer;
  font-size: 0.9rem;
  padding: 0.15rem 0.4rem;
  border-radius: var(--radius-sm, 0.25rem);
  flex-shrink: 0;
  transition: color var(--transition, 200ms ease), background var(--transition, 200ms ease);
}

.btn-dismiss:hover {
  color: var(--color-error, #c0392b);
  background: color-mix(in srgb, var(--color-error, #c0392b) 10%, transparent);
}

/* ── Responsive ── */
@media (max-width: 480px) {
  .lookup-row {
    flex-direction: column;
  }

  .lookup-input {
    width: 100%;
  }

  .btn-primary:not(.btn-sm) {
    width: 100%;
  }

  .installed-table th:nth-child(3),
  .installed-table td:nth-child(3) {
    display: none; /* hide Adapter column on very narrow screens */
  }
}
</style>
@ -110,63 +110,6 @@
      </label>
    </section>

    <!-- cf-orch SFT Integration section -->
    <section class="section">
      <h2 class="section-title">cf-orch Integration</h2>
      <p class="section-desc">
        Import SFT (supervised fine-tuning) candidates from cf-orch benchmark runs.
      </p>

      <div class="field-row">
        <label class="field field-grow">
          <span>bench_results_dir</span>
          <input
            id="bench-results-dir"
            v-model="benchResultsDir"
            type="text"
            placeholder="/path/to/circuitforge-orch/scripts/bench_results"
          />
        </label>
      </div>

      <div class="account-actions">
        <button class="btn-primary" @click="saveSftConfig">Save</button>
        <button class="btn-secondary" @click="scanRuns">Scan for runs</button>
        <span v-if="saveStatus" class="save-status">{{ saveStatus }}</span>
      </div>

      <table v-if="runs.length > 0" class="runs-table">
        <thead>
          <tr>
            <th>Timestamp</th>
            <th>Candidates</th>
            <th>Imported</th>
            <th></th>
          </tr>
        </thead>
        <tbody>
          <tr v-for="run in runs" :key="run.run_id">
            <td>{{ run.timestamp }}</td>
            <td>{{ run.candidate_count }}</td>
            <td>{{ run.already_imported ? '✓' : '—' }}</td>
            <td>
              <button
                class="btn-import"
                :disabled="run.already_imported || importingRunId === run.run_id"
                @click="importRun(run.run_id)"
              >
                {{ importingRunId === run.run_id ? 'Importing…' : 'Import' }}
              </button>
            </td>
          </tr>
        </tbody>
      </table>

      <div v-if="importResult" class="import-result">
        Imported {{ importResult.imported }}, skipped {{ importResult.skipped }}.
      </div>
    </section>

    <!-- Save / Reload -->
    <div class="save-bar">
      <button class="btn-primary" :disabled="saving" @click="save">
@ -199,64 +142,6 @@ const saveOk = ref(true)
const richMotion = ref(localStorage.getItem('cf-avocet-rich-motion') !== 'false')
const keyHints = ref(localStorage.getItem('cf-avocet-key-hints') !== 'false')

// SFT integration state
const benchResultsDir = ref('')
const runs = ref<Array<{ run_id: string; timestamp: string; candidate_count: number; already_imported: boolean }>>([])
const importingRunId = ref<string | null>(null)
const importResult = ref<{ imported: number; skipped: number } | null>(null)
const saveStatus = ref('')

async function loadSftConfig() {
  try {
    const res = await fetch('/api/sft/config')
    if (res.ok) {
      const data = await res.json()
      benchResultsDir.value = data.bench_results_dir ?? ''
    }
  } catch {
    // non-fatal — leave field empty
  }
}

async function saveSftConfig() {
  saveStatus.value = 'Saving…'
  try {
    const res = await fetch('/api/sft/config', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ bench_results_dir: benchResultsDir.value }),
    })
    saveStatus.value = res.ok ? 'Saved.' : 'Error saving.'
  } catch {
    saveStatus.value = 'Error saving.'
  }
  setTimeout(() => { saveStatus.value = '' }, 2000)
}

async function scanRuns() {
  try {
    const res = await fetch('/api/sft/runs')
    if (res.ok) runs.value = await res.json()
  } catch { /* ignore */ }
}

async function importRun(runId: string) {
  importingRunId.value = runId
  importResult.value = null
  try {
    const res = await fetch('/api/sft/import', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ run_id: runId }),
    })
    if (res.ok) {
      importResult.value = await res.json()
      scanRuns() // refresh already_imported flags
    }
  } catch { /* ignore */ }
  importingRunId.value = null
}

async function reload() {
  const { data } = await useApiFetch<{ accounts: Account[]; max_per_account: number }>('/api/config')
  if (data) {
@ -334,10 +219,7 @@ function onKeyHintsChange() {
  document.documentElement.classList.toggle('hide-key-hints', !keyHints.value)
}

onMounted(() => {
  reload()
  loadSftConfig()
})
</script>

<style scoped>
@ -546,61 +428,4 @@ onMounted(() => {
|
||||||
border: 1px dashed var(--color-border, #d0d7e8);
|
border: 1px dashed var(--color-border, #d0d7e8);
|
||||||
border-radius: 0.5rem;
|
border-radius: 0.5rem;
|
||||||
}
|
}
|
||||||
|
|
||||||
.section-desc {
|
|
||||||
color: var(--color-text-secondary, #6b7a99);
|
|
||||||
font-size: 0.88rem;
|
|
||||||
line-height: 1.5;
|
|
||||||
}
|
|
||||||
|
|
||||||
.field-input {
|
|
||||||
padding: 0.4rem 0.6rem;
|
|
||||||
border: 1px solid var(--color-border, #d0d7e8);
|
|
||||||
border-radius: 0.375rem;
|
|
||||||
background: var(--color-surface, #fff);
|
|
||||||
color: var(--color-text, #1a2338);
|
|
||||||
font-size: 0.9rem;
|
|
||||||
font-family: var(--font-body, sans-serif);
|
|
||||||
width: 100%;
|
|
||||||
}
|
|
||||||
|
|
||||||
.runs-table {
|
|
||||||
width: 100%;
|
|
||||||
border-collapse: collapse;
|
|
||||||
margin-top: var(--space-3, 0.75rem);
|
|
||||||
font-size: 0.88rem;
|
|
||||||
}
|
|
||||||
|
|
||||||
.runs-table th,
|
|
||||||
.runs-table td {
|
|
||||||
padding: var(--space-2, 0.5rem) var(--space-3, 0.75rem);
|
|
||||||
text-align: left;
|
|
||||||
border-bottom: 1px solid var(--color-border, #d0d7e8);
|
|
||||||
}
|
|
||||||
|
|
||||||
.btn-import {
|
|
||||||
padding: var(--space-1, 0.25rem) var(--space-3, 0.75rem);
|
|
||||||
border: 1px solid var(--app-primary, #2A6080);
|
|
||||||
border-radius: var(--radius-sm, 0.25rem);
|
|
||||||
background: none;
|
|
||||||
color: var(--app-primary, #2A6080);
|
|
||||||
cursor: pointer;
|
|
||||||
font-size: 0.85rem;
|
|
||||||
}
|
|
||||||
|
|
||||||
.btn-import:disabled {
|
|
||||||
opacity: 0.45;
|
|
||||||
cursor: not-allowed;
|
|
||||||
}
|
|
||||||
|
|
||||||
.import-result {
|
|
||||||
margin-top: var(--space-2, 0.5rem);
|
|
||||||
font-size: 0.88rem;
|
|
||||||
color: var(--color-text-secondary, #6b7a99);
|
|
||||||
}
|
|
||||||
|
|
||||||
.save-status {
|
|
||||||
font-size: 0.85rem;
|
|
||||||
color: var(--color-text-secondary, #6b7a99);
|
|
||||||
}
|
|
||||||
</style>
|
</style>
|

@@ -35,39 +35,6 @@
      </div>
    </div>

    <!-- Benchmark Results -->
    <template v-if="benchRows.length > 0">
      <h2 class="section-title">🏁 Benchmark Results</h2>
      <div class="bench-table-wrap">
        <table class="bench-table">
          <thead>
            <tr>
              <th class="bt-model-col">Model</th>
              <th
                v-for="m in BENCH_METRICS"
                :key="m.key as string"
                class="bt-metric-col"
              >{{ m.label }}</th>
            </tr>
          </thead>
          <tbody>
            <tr v-for="row in benchRows" :key="row.name">
              <td class="bt-model-cell" :title="row.name">{{ row.name }}</td>
              <td
                v-for="m in BENCH_METRICS"
                :key="m.key as string"
                class="bt-metric-cell"
                :class="{ 'bt-best': bestByMetric[m.key as string] === row.name }"
              >
                {{ formatMetric(row.result[m.key]) }}
              </td>
            </tr>
          </tbody>
        </table>
      </div>
      <p class="bench-hint">Highlighted cells are the best-scoring model per metric.</p>
    </template>

    <div class="file-info">
      <span class="file-path">Score file: <code>data/email_score.jsonl</code></span>
      <span class="file-size">{{ fileSizeLabel }}</span>

@@ -87,18 +54,10 @@
import { ref, computed, onMounted } from 'vue'
import { useApiFetch } from '../composables/useApi'

interface BenchmarkModelResult {
  accuracy?: number
  macro_f1?: number
  weighted_f1?: number
  [key: string]: number | undefined
}

interface StatsResponse {
  total: number
  counts: Record<string, number>
  score_file_bytes: number
  benchmark_results?: Record<string, BenchmarkModelResult>
}

// Canonical label order + metadata

@@ -149,42 +108,6 @@ const fileSizeLabel = computed(() => {
  return `${(b / 1024 / 1024).toFixed(2)} MB`
})

// Benchmark results helpers
const BENCH_METRICS: Array<{ key: keyof BenchmarkModelResult; label: string }> = [
  { key: 'accuracy', label: 'Accuracy' },
  { key: 'macro_f1', label: 'Macro F1' },
  { key: 'weighted_f1', label: 'Weighted F1' },
]

const benchRows = computed(() => {
  const br = stats.value.benchmark_results
  if (!br || Object.keys(br).length === 0) return []
  return Object.entries(br).map(([name, result]) => ({ name, result }))
})

// Find the best model name for each metric
const bestByMetric = computed((): Record<string, string> => {
  const result: Record<string, string> = {}
  for (const { key } of BENCH_METRICS) {
    let bestName = ''
    let bestVal = -Infinity
    for (const { name, result: r } of benchRows.value) {
      const v = r[key]
      if (v != null && v > bestVal) { bestVal = v; bestName = name }
    }
    result[key as string] = bestName
  }
  return result
})
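Extracted as a plain function over hypothetical benchmark rows, the best-per-metric reduction above behaves like this (strict `>` means an earlier row keeps the highlight on ties; the model names and scores here are illustrative only):

```typescript
// Plain-function version of the bestByMetric computed, for illustration.
type Row = { name: string; result: Record<string, number | undefined> }

function bestPerMetric(rows: Row[], metrics: string[]): Record<string, string> {
  const best: Record<string, string> = {}
  for (const key of metrics) {
    let bestName = ''
    let bestVal = -Infinity
    for (const { name, result } of rows) {
      const v = result[key]
      // Strictly greater: on a tie, the earlier row wins the highlight
      if (v != null && v > bestVal) { bestVal = v; bestName = name }
    }
    best[key] = bestName
  }
  return best
}

// Illustrative rows — not real benchmark output
const sampleRows: Row[] = [
  { name: 'bge-reranker-v2-m3', result: { accuracy: 0.91, macro_f1: 0.88 } },
  { name: 'deberta-v3-nli',     result: { accuracy: 0.89, macro_f1: 0.90 } },
]
```

Missing metrics are simply skipped, so a model absent from one column can still win another.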

function formatMetric(v: number | undefined): string {
  if (v == null) return '—'
  // Values in 0-1 range: format as percentage
  if (v <= 1) return `${(v * 100).toFixed(1)}%`
  // Already a percentage
  return `${v.toFixed(1)}%`
}
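Standalone, the formatter above can be sanity-checked: values at or below 1 are treated as fractions and scaled, larger values are assumed to already be percentages (values stored as exactly 1 render as 100%, a known ambiguity of the heuristic):

```typescript
// Same logic as the component's formatMetric, reproduced for a quick check.
function formatMetric(v: number | undefined): string {
  if (v == null) return '—'
  // Fractions in [0, 1] are scaled to percent
  if (v <= 1) return `${(v * 100).toFixed(1)}%`
  // Anything larger is assumed to already be a percentage
  return `${v.toFixed(1)}%`
}

// formatMetric(0.8123)  → '81.2%'
// formatMetric(81.23)   → '81.2%'
// formatMetric(undefined) → '—'
```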

async function load() {
  loading.value = true
  error.value = ''

@@ -311,79 +234,6 @@ onMounted(load)
  padding: 1rem;
}

/* ── Benchmark Results ──────────────────────────── */
.section-title {
  font-family: var(--font-display, var(--font-body, sans-serif));
  font-size: 1.05rem;
  font-weight: 700;
  color: var(--app-primary, #2A6080);
  margin: 0;
}

.bench-table-wrap {
  overflow-x: auto;
  border: 1px solid var(--color-border, #d0d7e8);
  border-radius: 0.5rem;
}

.bench-table {
  border-collapse: collapse;
  width: 100%;
  font-size: 0.82rem;
}

.bt-model-col {
  text-align: left;
  padding: 0.45rem 0.75rem;
  background: var(--color-surface-raised, #e4ebf5);
  border-bottom: 1px solid var(--color-border, #d0d7e8);
  font-weight: 600;
  min-width: 12rem;
}

.bt-metric-col {
  text-align: right;
  padding: 0.45rem 0.75rem;
  background: var(--color-surface-raised, #e4ebf5);
  border-bottom: 1px solid var(--color-border, #d0d7e8);
  font-weight: 600;
  white-space: nowrap;
  min-width: 6rem;
}

.bt-model-cell {
  padding: 0.4rem 0.75rem;
  border-top: 1px solid var(--color-border, #d0d7e8);
  font-family: var(--font-mono, monospace);
  font-size: 0.76rem;
  white-space: nowrap;
  overflow: hidden;
  text-overflow: ellipsis;
  max-width: 16rem;
  color: var(--color-text, #1a2338);
}

.bt-metric-cell {
  padding: 0.4rem 0.75rem;
  border-top: 1px solid var(--color-border, #d0d7e8);
  text-align: right;
  font-family: var(--font-mono, monospace);
  font-variant-numeric: tabular-nums;
  color: var(--color-text, #1a2338);
}

.bt-metric-cell.bt-best {
  color: var(--color-success, #3a7a32);
  font-weight: 700;
  background: color-mix(in srgb, var(--color-success, #3a7a32) 8%, transparent);
}

.bench-hint {
  font-size: 0.75rem;
  color: var(--color-text-secondary, #6b7a99);
  margin: 0;
}

@media (max-width: 480px) {
  .bar-row {
    grid-template-columns: 1.5rem 1fr 1fr 3rem;

@@ -4,11 +4,6 @@ import UnoCSS from 'unocss/vite'
export default defineConfig({
  plugins: [vue(), UnoCSS()],
  server: {
    proxy: {
      '/api': 'http://localhost:8503',
    },
  },
  test: {
    environment: 'jsdom',
    globals: true,