avocet/docs/superpowers/plans/2026-03-15-finetune-classifier.md


# Fine-tune Email Classifier Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Fine-tune `deberta-small` and `bge-m3` on the labeled dataset, surface trained models in the benchmark harness, and expose a UI-triggerable training workflow with SSE streaming logs.

**Architecture:** A new CLI script (`scripts/finetune_classifier.py`) handles data prep, weighted training, and checkpoint saving. A new `FineTunedAdapter` in `classifier_adapters.py` loads saved checkpoints for inference. `benchmark_classifier.py` auto-discovers these adapters at startup via `training_info.json` files. Two GET endpoints in `api.py` expose status and streaming run. `BenchmarkView.vue` adds a badge row and collapsible fine-tune section.

**Tech Stack:** transformers 4.57.3, torch 2.10.0, accelerate 1.12.0, scikit-learn (new), FastAPI SSE, Vue 3 + EventSource
---
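For orientation, the `training_info.json` contract that ties training, discovery, and the UI together carries roughly the fields below. This is a sketch assembled from the auto-discovery tests later in this plan; the values are illustrative, not real training results.

```python
import json

# Illustrative training_info.json payload. Field names mirror what the
# auto-discovery tests in this plan assert on; all values are made up.
training_info = {
    "name": "avocet-deberta-small",
    "base_model_id": "cross-encoder/nli-deberta-v3-small",
    "timestamp": "2026-03-15T12:00:00Z",
    "val_macro_f1": 0.72,
    "val_accuracy": 0.80,
    "sample_count": 401,
}
print(json.dumps(training_info, indent=2))
```

Discovery later adds a `model_dir` key pointing at the checkpoint directory, so downstream consumers never have to reconstruct paths.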
## File Structure
| File | Action | Responsibility |
|------|--------|---------------|
| `environment.yml` | Modify | Add `scikit-learn` dependency |
| `scripts/classifier_adapters.py` | Modify | Add `FineTunedAdapter` class |
| `scripts/benchmark_classifier.py` | Modify | Add `_MODELS_DIR`, `discover_finetuned_models()`, merge into model registry at startup |
| `scripts/finetune_classifier.py` | Create | Full training pipeline: data prep, class weights, `WeightedTrainer`, CLI |
| `app/api.py` | Modify | Add `GET /api/finetune/status` and `GET /api/finetune/run` |
| `web/src/views/BenchmarkView.vue` | Modify | Add trained models badge row + collapsible fine-tune section |
| `tests/test_classifier_adapters.py` | Modify | Add `FineTunedAdapter` unit tests |
| `tests/test_benchmark_classifier.py` | Modify | Add auto-discovery unit tests |
| `tests/test_finetune.py` | Create | Unit tests for data pipeline, `WeightedTrainer`, `compute_metrics_for_trainer` |
| `tests/test_api.py` | Modify | Add tests for `/api/finetune/status` and `/api/finetune/run` |
---
## Chunk 1: Foundation — FineTunedAdapter + Auto-discovery
### Task 1: Add scikit-learn to environment.yml
**Files:**
- Modify: `environment.yml`
- [ ] **Step 1: Add scikit-learn**
Edit `environment.yml` — add `scikit-learn>=1.4` in the pip section after `accelerate`:
```yaml
- scikit-learn>=1.4
```
- [ ] **Step 2: Verify environment.yml is valid YAML**
```bash
python -c "import yaml; yaml.safe_load(open('environment.yml'))" && echo OK
```
Expected: `OK`
- [ ] **Step 3: Commit**
```bash
git add environment.yml
git commit -m "chore(avocet): add scikit-learn to classifier env"
```
---
### Task 2: FineTunedAdapter — write failing tests
**Files:**
- Modify: `tests/test_classifier_adapters.py`
- [ ] **Step 1: Write the failing tests**
Append to `tests/test_classifier_adapters.py`:
```python
# ---- FineTunedAdapter tests ----

def test_finetuned_adapter_classify_calls_pipeline_with_sep_format(tmp_path):
    """classify() must format input as 'subject [SEP] body[:400]' — not the zero-shot format."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import FineTunedAdapter

    mock_result = [{"label": "digest", "score": 0.95}]
    mock_pipe_instance = MagicMock(return_value=mock_result)
    mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
    adapter = FineTunedAdapter("avocet-deberta-small", str(tmp_path))
    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        result = adapter.classify("Test subject", "Test body")
    assert result == "digest"
    call_args = mock_pipe_instance.call_args[0][0]
    assert "[SEP]" in call_args
    assert "Test subject" in call_args
    assert "Test body" in call_args


def test_finetuned_adapter_truncates_body_to_400():
    """Body must be truncated to 400 chars in the [SEP] format."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import FineTunedAdapter, LABELS

    long_body = "x" * 800
    mock_result = [{"label": "neutral", "score": 0.9}]
    mock_pipe_instance = MagicMock(return_value=mock_result)
    mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
    adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        adapter.classify("Subject", long_body)
    call_text = mock_pipe_instance.call_args[0][0]
    # "Subject [SEP] " prefix + 400 body chars = 414 chars max
    assert len(call_text) <= 420


def test_finetuned_adapter_returns_label_string():
    """classify() must return a plain string, not a dict."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import FineTunedAdapter

    mock_result = [{"label": "interview_scheduled", "score": 0.87}]
    mock_pipe_instance = MagicMock(return_value=mock_result)
    mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
    adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        result = adapter.classify("S", "B")
    assert isinstance(result, str)
    assert result == "interview_scheduled"


def test_finetuned_adapter_lazy_loads_pipeline():
    """Pipeline factory must not be called until classify() is first called."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import FineTunedAdapter

    mock_pipe_factory = MagicMock(return_value=MagicMock(return_value=[{"label": "neutral", "score": 0.9}]))
    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
        assert not mock_pipe_factory.called
        adapter.classify("s", "b")
        assert mock_pipe_factory.called


def test_finetuned_adapter_unload_clears_pipeline():
    """unload() must set _pipeline to None so memory is released."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import FineTunedAdapter

    mock_pipe_factory = MagicMock(return_value=MagicMock(return_value=[{"label": "neutral", "score": 0.9}]))
    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
        adapter.classify("s", "b")
        assert adapter._pipeline is not None
        adapter.unload()
        assert adapter._pipeline is None
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -k "finetuned" -v
```
Expected: `ImportError` or `AttributeError`, since `FineTunedAdapter` is not yet defined.
---
### Task 3: FineTunedAdapter — implement
**Files:**
- Modify: `scripts/classifier_adapters.py`
- [ ] **Step 1: Add FineTunedAdapter to `__all__`**
In `scripts/classifier_adapters.py`, add `"FineTunedAdapter"` to `__all__`.
- [ ] **Step 2: Implement FineTunedAdapter**
Append after `RerankerAdapter`:
```python
class FineTunedAdapter(ClassifierAdapter):
    """Loads a fine-tuned checkpoint from a local models/ directory.

    Uses pipeline("text-classification") for a single forward pass.
    Input format: 'subject [SEP] body[:400]' — must match training format exactly.
    Expected inference speed: ~10–20 ms/email vs 111–338 ms for zero-shot.
    """

    def __init__(self, name: str, model_dir: str) -> None:
        self._name = name
        self._model_dir = model_dir
        self._pipeline: Any = None

    @property
    def name(self) -> str:
        return self._name

    @property
    def model_id(self) -> str:
        return self._model_dir

    def load(self) -> None:
        import scripts.classifier_adapters as _mod  # noqa: PLC0415

        _pipe_fn = _mod.pipeline
        if _pipe_fn is None:
            raise ImportError("transformers not installed")
        self._pipeline = _pipe_fn("text-classification", model=self._model_dir)

    def unload(self) -> None:
        self._pipeline = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipeline is None:
            self.load()
        text = f"{subject} [SEP] {body[:400]}"
        result = self._pipeline(text)
        return result[0]["label"]
```
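The `subject [SEP] body[:400]` contract can be sanity-checked in isolation. The sketch below mirrors the f-string in `classify()` as a standalone helper (no transformers required); `format_input` is a hypothetical name, not part of the adapter.

```python
# Standalone sketch of the input-format contract: mirrors the f-string used
# by FineTunedAdapter.classify(), not an import of the real adapter.
def format_input(subject: str, body: str) -> str:
    return f"{subject} [SEP] {body[:400]}"

text = format_input("Subject", "x" * 800)
# "Subject [SEP] " is 14 chars; the body is capped at 400 chars.
print(len(text))  # 414
```

This is why the truncation test above allows up to 420 characters: the subject and separator add a small fixed prefix on top of the 400-character body cap.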
- [ ] **Step 3: Run tests to verify they pass**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -k "finetuned" -v
```
Expected: 5 tests PASS.
- [ ] **Step 4: Run full adapter test suite to verify no regressions**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v
```
Expected: All tests PASS.
- [ ] **Step 5: Commit**
```bash
git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat(avocet): add FineTunedAdapter for local checkpoint inference"
```
---
### Task 4: Auto-discovery in benchmark_classifier.py — write failing tests
**Files:**
- Modify: `tests/test_benchmark_classifier.py`
- [ ] **Step 1: Write the failing tests**
Append to `tests/test_benchmark_classifier.py`:
```python
# ---- Auto-discovery tests ----

def test_discover_finetuned_models_finds_training_info_files(tmp_path):
    """discover_finetuned_models() must return one entry per training_info.json found."""
    import json
    from scripts.benchmark_classifier import discover_finetuned_models

    # Create two fake model directories
    for name in ("avocet-deberta-small", "avocet-bge-m3"):
        model_dir = tmp_path / name
        model_dir.mkdir()
        info = {
            "name": name,
            "base_model_id": "cross-encoder/nli-deberta-v3-small",
            "timestamp": "2026-03-15T12:00:00Z",
            "val_macro_f1": 0.72,
            "val_accuracy": 0.80,
            "sample_count": 401,
        }
        (model_dir / "training_info.json").write_text(json.dumps(info))
    results = discover_finetuned_models(tmp_path)
    assert len(results) == 2
    names = {r["name"] for r in results}
    assert "avocet-deberta-small" in names
    assert "avocet-bge-m3" in names


def test_discover_finetuned_models_returns_empty_when_no_models_dir():
    """discover_finetuned_models() must return [] silently if models/ doesn't exist."""
    from pathlib import Path
    from scripts.benchmark_classifier import discover_finetuned_models

    results = discover_finetuned_models(Path("/nonexistent/path/models"))
    assert results == []


def test_discover_finetuned_models_skips_dirs_without_training_info(tmp_path):
    """Subdirs without training_info.json are silently skipped."""
    from scripts.benchmark_classifier import discover_finetuned_models

    # A dir WITHOUT training_info.json
    (tmp_path / "some-other-dir").mkdir()
    results = discover_finetuned_models(tmp_path)
    assert results == []


def test_active_models_includes_discovered_finetuned(tmp_path):
    """The active models dict must include FineTunedAdapter entries for discovered models."""
    import json
    from unittest.mock import patch
    from scripts.benchmark_classifier import _active_models
    from scripts.classifier_adapters import FineTunedAdapter

    model_dir = tmp_path / "avocet-deberta-small"
    model_dir.mkdir()
    (model_dir / "training_info.json").write_text(json.dumps({
        "name": "avocet-deberta-small",
        "base_model_id": "cross-encoder/nli-deberta-v3-small",
        "val_macro_f1": 0.72,
        "sample_count": 401,
    }))
    with patch("scripts.benchmark_classifier._MODELS_DIR", tmp_path):
        models = _active_models(include_slow=False)
    assert "avocet-deberta-small" in models
    assert isinstance(models["avocet-deberta-small"]["adapter_instance"], FineTunedAdapter)
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -k "discover or active_models" -v
```
Expected: `ImportError`, since `discover_finetuned_models` and `_MODELS_DIR` are not yet defined.
---
### Task 5: Auto-discovery — implement in benchmark_classifier.py
**Files:**
- Modify: `scripts/benchmark_classifier.py`
- [ ] **Step 1: Add imports and _MODELS_DIR**
Near the top of `scripts/benchmark_classifier.py`, after the existing imports, add:
```python
from scripts.classifier_adapters import FineTunedAdapter
```
And define `_MODELS_DIR` (after `_ROOT` is defined — find where `_ROOT = Path(__file__).parent.parent` is, or add it):
```python
_ROOT = Path(__file__).parent.parent
_MODELS_DIR = _ROOT / "models"
```
(If `_ROOT` already exists in the file, only add `_MODELS_DIR`.)
- [ ] **Step 2: Add discover_finetuned_models()**
Add after the `MODEL_REGISTRY` dict:
```python
def discover_finetuned_models(models_dir: Path | None = None) -> list[dict]:
    """Scan models/ for subdirs containing training_info.json.

    Returns a list of training_info dicts, each with an added 'model_dir' key.
    Returns [] silently if models_dir does not exist.
    """
    if models_dir is None:
        models_dir = _MODELS_DIR
    if not models_dir.exists():
        return []
    found = []
    for sub in models_dir.iterdir():
        if not sub.is_dir():
            continue
        info_path = sub / "training_info.json"
        if not info_path.exists():
            continue
        info = json.loads(info_path.read_text(encoding="utf-8"))
        info["model_dir"] = str(sub)
        found.append(info)
    return found
```
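A quick end-to-end check of the discovery contract. This is a standalone sketch: the scan loop below mirrors `discover_finetuned_models()` rather than importing it, so it runs outside the repo.

```python
import json
import tempfile
from pathlib import Path

# Build a fake models/ tree: one valid checkpoint dir, one stray dir.
root = Path(tempfile.mkdtemp())
(root / "avocet-deberta-small").mkdir()
(root / "avocet-deberta-small" / "training_info.json").write_text(
    json.dumps({"name": "avocet-deberta-small", "val_macro_f1": 0.72})
)
(root / "scratch").mkdir()  # no training_info.json, so it must be skipped

# Same scan logic as discover_finetuned_models() above.
found = []
for sub in root.iterdir():
    info_path = sub / "training_info.json"
    if sub.is_dir() and info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
        info["model_dir"] = str(sub)
        found.append(info)

print([f["name"] for f in found])  # ['avocet-deberta-small']
```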
- [ ] **Step 3: Add _active_models() function**
Add after `discover_finetuned_models()`:
```python
def _active_models(include_slow: bool = False) -> dict[str, dict]:
    """Return the active model registry, merged with any discovered fine-tuned models."""
    active = {
        key: {**entry, "adapter_instance": entry["adapter"](
            key,
            entry["model_id"],
            **entry.get("kwargs", {}),
        )}
        for key, entry in MODEL_REGISTRY.items()
        if include_slow or entry.get("default", False)
    }
    for info in discover_finetuned_models():
        name = info["name"]
        active[name] = {
            "adapter_instance": FineTunedAdapter(name, info["model_dir"]),
            "params": "fine-tuned",
            "default": True,
        }
    return active
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -k "discover or active_models" -v
```
Expected: 4 tests PASS.
- [ ] **Step 5: Run full benchmark test suite**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -v
```
Expected: All tests PASS. (Existing tests that construct adapters directly from `MODEL_REGISTRY` still work because we only added new functions.)
- [ ] **Step 6: Commit**
```bash
git add scripts/benchmark_classifier.py tests/test_benchmark_classifier.py
git commit -m "feat(avocet): auto-discover fine-tuned models in benchmark harness"
```
---
## Chunk 2: Training Script — finetune_classifier.py
### Task 6: Data loading and class weights — write failing tests
**Files:**
- Create: `tests/test_finetune.py`
- [ ] **Step 1: Create test file with data pipeline tests**
Create `tests/test_finetune.py`:
```python
"""Tests for finetune_classifier — no model downloads required."""
from __future__ import annotations

import json

import pytest


# ---- Data loading tests ----

def test_load_and_prepare_data_drops_non_canonical_labels(tmp_path):
    """Rows with labels not in LABELS must be silently dropped."""
    from scripts.finetune_classifier import load_and_prepare_data
    from scripts.classifier_adapters import LABELS

    rows = [
        {"subject": "s1", "body": "b1", "label": "digest"},
        {"subject": "s2", "body": "b2", "label": "profile_alert"},  # non-canonical
        {"subject": "s3", "body": "b3", "label": "neutral"},
    ]
    score_file = tmp_path / "email_score.jsonl"
    score_file.write_text("\n".join(json.dumps(r) for r in rows))
    texts, labels = load_and_prepare_data(score_file)
    assert len(texts) == 2
    assert all(l in LABELS for l in labels)


def test_load_and_prepare_data_formats_input_as_sep():
    """Input text must be 'subject [SEP] body[:400]'."""
    import json
    import os
    import tempfile
    from pathlib import Path

    from scripts.finetune_classifier import load_and_prepare_data

    with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
        f.write(json.dumps({"subject": "Hello", "body": "World" * 100, "label": "neutral"}) + "\n")
        fname = f.name
    try:
        texts, labels = load_and_prepare_data(Path(fname))
    finally:
        os.unlink(fname)
    assert texts[0].startswith("Hello [SEP] ")
    assert len(texts[0]) <= len("Hello [SEP] ") + 400 + 5  # small buffer for truncation


def test_load_and_prepare_data_raises_on_missing_file():
    """FileNotFoundError must be raised with actionable message."""
    from pathlib import Path

    from scripts.finetune_classifier import load_and_prepare_data

    with pytest.raises(FileNotFoundError, match="email_score.jsonl"):
        load_and_prepare_data(Path("/nonexistent/email_score.jsonl"))


def test_load_and_prepare_data_drops_class_with_fewer_than_2_samples(tmp_path, capsys):
    """Classes with < 2 total samples must be dropped with a warning."""
    from scripts.finetune_classifier import load_and_prepare_data

    rows = [
        {"subject": "s1", "body": "b", "label": "digest"},
        {"subject": "s2", "body": "b", "label": "digest"},
        {"subject": "s3", "body": "b", "label": "new_lead"},  # only 1 sample — drop
    ]
    score_file = tmp_path / "email_score.jsonl"
    score_file.write_text("\n".join(json.dumps(r) for r in rows))
    texts, labels = load_and_prepare_data(score_file)
    captured = capsys.readouterr()
    assert "new_lead" not in labels
    assert "new_lead" in captured.out  # warning printed


# ---- Class weights tests ----

def test_compute_class_weights_returns_tensor_for_each_class():
    """compute_class_weights must return a float tensor of length n_classes."""
    import torch

    from scripts.finetune_classifier import compute_class_weights

    label_ids = [0, 0, 0, 1, 1, 2]  # 3 classes, imbalanced
    weights = compute_class_weights(label_ids, n_classes=3)
    assert isinstance(weights, torch.Tensor)
    assert weights.shape == (3,)
    assert all(w > 0 for w in weights)


def test_compute_class_weights_upweights_minority():
    """Minority classes must receive higher weight than majority classes."""
    from scripts.finetune_classifier import compute_class_weights

    # Class 0: 10 samples, Class 1: 2 samples
    label_ids = [0] * 10 + [1] * 2
    weights = compute_class_weights(label_ids, n_classes=2)
    assert weights[1] > weights[0]


# ---- compute_metrics_for_trainer tests ----

def test_compute_metrics_for_trainer_returns_macro_f1_key():
    """Must return a dict with 'macro_f1' key."""
    import numpy as np

    from scripts.finetune_classifier import compute_metrics_for_trainer
    from transformers import EvalPrediction

    logits = np.array([[2.0, 0.1], [0.1, 2.0], [2.0, 0.1]])
    labels = np.array([0, 1, 0])
    pred = EvalPrediction(predictions=logits, label_ids=labels)
    result = compute_metrics_for_trainer(pred)
    assert "macro_f1" in result
    assert result["macro_f1"] == pytest.approx(1.0)


def test_compute_metrics_for_trainer_returns_accuracy_key():
    """Must also return 'accuracy' key."""
    import numpy as np

    from scripts.finetune_classifier import compute_metrics_for_trainer
    from transformers import EvalPrediction

    logits = np.array([[2.0, 0.1], [0.1, 2.0]])
    labels = np.array([0, 1])
    pred = EvalPrediction(predictions=logits, label_ids=labels)
    result = compute_metrics_for_trainer(pred)
    assert "accuracy" in result
    assert result["accuracy"] == pytest.approx(1.0)
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
```
Expected: `ModuleNotFoundError`, since `scripts.finetune_classifier` has not been created yet.
---
### Task 7: Implement data loading and class weights in finetune_classifier.py
**Files:**
- Create: `scripts/finetune_classifier.py`
- [ ] **Step 1: Create finetune_classifier.py with data loading + class weights**
Create `scripts/finetune_classifier.py`:
```python
"""Fine-tune email classifiers on the labeled dataset.

CLI entry point. All prints use flush=True so stdout is SSE-streamable.

Usage:
    python scripts/finetune_classifier.py --model deberta-small [--epochs 5]

Supported --model values: deberta-small, bge-m3
"""
from __future__ import annotations

import argparse
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    EvalPrediction,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)

sys.path.insert(0, str(Path(__file__).parent.parent))
from scripts.classifier_adapters import LABELS

_ROOT = Path(__file__).parent.parent

# ---------------------------------------------------------------------------
# Model registry
# ---------------------------------------------------------------------------
_MODEL_CONFIG: dict[str, dict[str, Any]] = {
    "deberta-small": {
        "base_model_id": "cross-encoder/nli-deberta-v3-small",
        "max_tokens": 512,
        "fp16": False,
        "batch_size": 16,
        "grad_accum": 1,
        "gradient_checkpointing": False,
    },
    "bge-m3": {
        "base_model_id": "MoritzLaurer/bge-m3-zeroshot-v2.0",
        "max_tokens": 512,
        "fp16": True,
        "batch_size": 4,
        "grad_accum": 4,
        "gradient_checkpointing": True,
    },
}


# ---------------------------------------------------------------------------
# Data preparation
# ---------------------------------------------------------------------------
def load_and_prepare_data(score_file: Path) -> tuple[list[str], list[str]]:
    """Load email_score.jsonl and return (texts, labels) ready for training.

    - Drops rows with non-canonical labels (warns).
    - Drops classes with < 2 total samples (warns).
    - Warns (but continues) for classes with < 5 training samples.
    - Input text format: 'subject [SEP] body[:400]'
    """
    if not score_file.exists():
        raise FileNotFoundError(
            f"Score file not found: {score_file}\n"
            "Run the label tool first to create email_score.jsonl"
        )
    lines = score_file.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(l) for l in lines if l.strip()]
    # Drop non-canonical labels
    canonical = set(LABELS)
    kept = []
    for r in rows:
        lbl = r.get("label", "")
        if lbl not in canonical:
            print(f"[data] Dropping row with non-canonical label: {lbl!r}", flush=True)
            continue
        kept.append(r)
    # Count samples per class
    from collections import Counter
    counts = Counter(r["label"] for r in kept)
    # Drop classes with < 2 total samples
    drop_classes = {lbl for lbl, cnt in counts.items() if cnt < 2}
    for lbl in sorted(drop_classes):
        print(
            f"[data] WARNING: Dropping class {lbl!r} — only {counts[lbl]} total sample(s). "
            "Need at least 2 for stratified split.",
            flush=True,
        )
    kept = [r for r in kept if r["label"] not in drop_classes]
    # Warn for classes with < 5 samples (after drops)
    counts = Counter(r["label"] for r in kept)
    for lbl, cnt in sorted(counts.items()):
        if cnt < 5:
            print(
                f"[data] WARNING: Class {lbl!r} has only {cnt} sample(s). "
                "Eval F1 for this class will be unreliable.",
                flush=True,
            )
    texts = [f"{r['subject']} [SEP] {r['body'][:400]}" for r in kept]
    labels = [r["label"] for r in kept]
    return texts, labels


# ---------------------------------------------------------------------------
# Class weights
# ---------------------------------------------------------------------------
def compute_class_weights(label_ids: list[int], n_classes: int) -> torch.Tensor:
    """Compute per-class weights: total / (n_classes * class_count).

    Returns a CPU float tensor of shape (n_classes,).
    """
    from collections import Counter
    counts = Counter(label_ids)
    total = len(label_ids)
    weights = []
    for i in range(n_classes):
        cnt = counts.get(i, 1)  # avoid division by zero for unseen classes
        weights.append(total / (n_classes * cnt))
    return torch.tensor(weights, dtype=torch.float32)


# ---------------------------------------------------------------------------
# compute_metrics callback for Trainer
# ---------------------------------------------------------------------------
def compute_metrics_for_trainer(eval_pred: EvalPrediction) -> dict:
    """Trainer callback: EvalPrediction → {macro_f1, accuracy}.

    Distinct from compute_metrics() in classifier_adapters.py (which operates
    on string predictions). This one operates on numpy logits + label_ids.
    """
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
        "accuracy": accuracy_score(labels, preds),
    }
```
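A worked example of the weight formula, as a pure-Python sketch of the arithmetic (no torch required). For the imbalanced case used in the tests above, `total / (n_classes * class_count)` gives:

```python
from collections import Counter

# Same formula as compute_class_weights above: total / (n_classes * class_count).
label_ids = [0] * 10 + [1] * 2   # class 0: 10 samples, class 1: 2 samples
n_classes = 2
counts = Counter(label_ids)
total = len(label_ids)

weights = [total / (n_classes * counts[i]) for i in range(n_classes)]
print(weights)  # [0.6, 3.0]: the minority class gets 5x the weight
```

A perfectly balanced dataset would yield a weight of 1.0 for every class, so the formula reduces to plain cross-entropy when there is no imbalance.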
- [ ] **Step 2: Run data pipeline tests**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
```
Expected: All 8 tests PASS. (Note: the `compute_metrics_for_trainer` tests require transformers — run in the `job-seeker-classifiers` env if needed.)
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
```
Expected: All 8 tests PASS.
- [ ] **Step 3: Commit**
```bash
git add scripts/finetune_classifier.py tests/test_finetune.py
git commit -m "feat(avocet): add finetune data pipeline + class weights + compute_metrics"
```
---
### Task 8: WeightedTrainer — write failing tests
**Files:**
- Modify: `tests/test_finetune.py`
- [ ] **Step 1: Append WeightedTrainer tests**
Append to `tests/test_finetune.py`:
```python
# ---- WeightedTrainer tests ----

def test_weighted_trainer_compute_loss_returns_scalar():
    """compute_loss must return a scalar tensor when return_outputs=False."""
    import torch
    from unittest.mock import MagicMock

    from scripts.finetune_classifier import WeightedTrainer

    # Minimal mock model that returns logits
    n_classes = 3
    batch = 4
    logits = torch.randn(batch, n_classes)
    mock_outputs = MagicMock()
    mock_outputs.logits = logits
    mock_model = MagicMock(return_value=mock_outputs)
    # Build a trainer with class weights
    weights = torch.ones(n_classes)
    trainer = WeightedTrainer.__new__(WeightedTrainer)
    trainer.class_weights = weights
    inputs = {
        "input_ids": torch.zeros(batch, 10, dtype=torch.long),
        "labels": torch.randint(0, n_classes, (batch,)),
    }
    loss = trainer.compute_loss(mock_model, inputs, return_outputs=False)
    assert isinstance(loss, torch.Tensor)
    assert loss.ndim == 0  # scalar


def test_weighted_trainer_compute_loss_accepts_kwargs():
    """compute_loss must not raise TypeError when called with num_items_in_batch kwarg.

    Transformers 4.38+ passes this extra kwarg — **kwargs absorbs it.
    """
    import torch
    from unittest.mock import MagicMock

    from scripts.finetune_classifier import WeightedTrainer

    n_classes = 3
    batch = 2
    logits = torch.randn(batch, n_classes)
    mock_outputs = MagicMock()
    mock_outputs.logits = logits
    mock_model = MagicMock(return_value=mock_outputs)
    trainer = WeightedTrainer.__new__(WeightedTrainer)
    trainer.class_weights = torch.ones(n_classes)
    inputs = {
        "input_ids": torch.zeros(batch, 5, dtype=torch.long),
        "labels": torch.randint(0, n_classes, (batch,)),
    }
    # Must not raise TypeError
    loss = trainer.compute_loss(mock_model, inputs, return_outputs=False,
                                num_items_in_batch=batch)
    assert isinstance(loss, torch.Tensor)


def test_weighted_trainer_weighted_loss_differs_from_unweighted():
    """Weighted loss must differ from uniform-weight loss for imbalanced inputs."""
    import torch
    from unittest.mock import MagicMock

    from scripts.finetune_classifier import WeightedTrainer

    n_classes = 2
    batch = 4
    # Mixed labels with non-neutral logits: per-sample losses differ by class,
    # so re-weighting changes the (weight-normalized) mean loss. With identical
    # per-sample losses the weighted mean would cancel and equal the uniform one.
    labels = torch.tensor([0, 0, 0, 1])
    logits = torch.tensor([[2.0, 0.0]] * batch)
    mock_outputs = MagicMock()
    mock_outputs.logits = logits
    # Uniform weights
    trainer_uniform = WeightedTrainer.__new__(WeightedTrainer)
    trainer_uniform.class_weights = torch.ones(n_classes)
    inputs_uniform = {"input_ids": torch.zeros(batch, 5, dtype=torch.long), "labels": labels.clone()}
    loss_uniform = trainer_uniform.compute_loss(MagicMock(return_value=mock_outputs),
                                                inputs_uniform)
    # Heavily imbalanced weights: class 1 much more important
    trainer_weighted = WeightedTrainer.__new__(WeightedTrainer)
    trainer_weighted.class_weights = torch.tensor([0.1, 10.0])
    inputs_weighted = {"input_ids": torch.zeros(batch, 5, dtype=torch.long), "labels": labels.clone()}
    mock_outputs2 = MagicMock()
    mock_outputs2.logits = logits.clone()
    loss_weighted = trainer_weighted.compute_loss(MagicMock(return_value=mock_outputs2),
                                                  inputs_weighted)
    assert not torch.isclose(loss_uniform, loss_weighted)


def test_weighted_trainer_compute_loss_returns_outputs_when_requested():
    """compute_loss with return_outputs=True must return (loss, outputs) tuple."""
    import torch
    from unittest.mock import MagicMock

    from scripts.finetune_classifier import WeightedTrainer

    n_classes = 3
    batch = 2
    logits = torch.randn(batch, n_classes)
    mock_outputs = MagicMock()
    mock_outputs.logits = logits
    mock_model = MagicMock(return_value=mock_outputs)
    trainer = WeightedTrainer.__new__(WeightedTrainer)
    trainer.class_weights = torch.ones(n_classes)
    inputs = {
        "input_ids": torch.zeros(batch, 5, dtype=torch.long),
        "labels": torch.randint(0, n_classes, (batch,)),
    }
    result = trainer.compute_loss(mock_model, inputs, return_outputs=True)
    assert isinstance(result, tuple)
    loss, outputs = result
    assert isinstance(loss, torch.Tensor)
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "weighted_trainer" -v
```
Expected: `ImportError`, since `WeightedTrainer` is not yet defined.
---
### Task 9: Implement WeightedTrainer
**Files:**
- Modify: `scripts/finetune_classifier.py`
- [ ] **Step 1: Add WeightedTrainer class**
Append to `scripts/finetune_classifier.py` after `compute_metrics_for_trainer`:
```python
# ---------------------------------------------------------------------------
# Weighted Trainer
# ---------------------------------------------------------------------------
class WeightedTrainer(Trainer):
    """Trainer subclass that applies per-class weights to cross-entropy loss.

    Handles class imbalance by down-weighting majority classes and up-weighting
    minority classes. Attach class_weights (CPU float tensor) before training.
    """

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs is required — absorbs num_items_in_batch added in Transformers 4.38.
        # Do not remove it; removing it causes TypeError on the first training step.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Move class_weights to the same device as logits — required for GPU training.
        # class_weights is created on CPU; logits are on cuda:0 during training.
        weight = self.class_weights.to(outputs.logits.device)
        loss = F.cross_entropy(outputs.logits, labels, weight=weight)
        return (loss, outputs) if return_outputs else loss
```
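To see what the class weights actually do to the loss, here is the arithmetic `F.cross_entropy(..., weight=w)` performs under its default mean reduction, as a pure-Python sketch (no torch required; the logits and labels are made up for illustration):

```python
import math

def weighted_ce(logits, labels, weights):
    """Mean-reduced weighted cross-entropy, matching torch's semantics:
    sum(w[y_i] * loss_i) / sum(w[y_i])."""
    num = den = 0.0
    for row, y in zip(logits, labels):
        z = max(row)  # shift by the max for numerical stability (log-sum-exp)
        log_prob = row[y] - (z + math.log(sum(math.exp(v - z) for v in row)))
        num += weights[y] * -log_prob
        den += weights[y]
    return num / den

logits = [[2.0, 0.0]] * 4        # model leans toward class 0 on every sample
labels = [0, 0, 0, 1]            # one minority-class (class 1) sample
uniform = weighted_ce(logits, labels, [1.0, 1.0])
upweighted = weighted_ce(logits, labels, [0.1, 10.0])
# Up-weighting class 1 pulls the mean toward its (large) per-sample loss.
print(round(uniform, 4), round(upweighted, 4))  # 0.6269 2.0687
```

Note the normalization by `sum(w[y_i])`: when every sample in a batch shares the same label, or every per-sample loss is equal, the weights cancel out, which is why meaningful weighting effects only show up on mixed batches.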
- [ ] **Step 2: Run WeightedTrainer tests**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "weighted_trainer" -v
```
Expected: 4 tests PASS.
- [ ] **Step 3: Run full test_finetune.py**
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v
```
Expected: All tests PASS.
- [ ] **Step 4: Commit**
```bash
git add scripts/finetune_classifier.py tests/test_finetune.py
git commit -m "feat(avocet): add WeightedTrainer with device-aware class weights"
```
---
### Task 10: Implement run_finetune() and CLI
**Files:**
- Modify: `scripts/finetune_classifier.py`
- [ ] **Step 1: Add run_finetune() and CLI to finetune_classifier.py**
Append to `scripts/finetune_classifier.py`:
```python
# ---------------------------------------------------------------------------
# Training dataset wrapper
# ---------------------------------------------------------------------------
from torch.utils.data import Dataset as TorchDataset


class _EmailDataset(TorchDataset):
    def __init__(self, encodings: dict, label_ids: list[int]) -> None:
        self.encodings = encodings
        self.label_ids = label_ids

    def __len__(self) -> int:
        return len(self.label_ids)

    def __getitem__(self, idx: int) -> dict:
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.label_ids[idx], dtype=torch.long)
        return item


# ---------------------------------------------------------------------------
# Main training function
# ---------------------------------------------------------------------------
def run_finetune(model_key: str, epochs: int = 5) -> None:
    """Fine-tune the specified model on data/email_score.jsonl.

    Saves model + tokenizer + training_info.json to models/avocet-{model_key}/.
    All prints use flush=True for SSE streaming.
    """
    if model_key not in _MODEL_CONFIG:
        raise ValueError(f"Unknown model key: {model_key!r}. Choose from: {list(_MODEL_CONFIG)}")
    config = _MODEL_CONFIG[model_key]
    base_model_id = config["base_model_id"]
    output_dir = _ROOT / "models" / f"avocet-{model_key}"
    print(f"[finetune] Model: {model_key} ({base_model_id})", flush=True)
    print(f"[finetune] Output: {output_dir}", flush=True)
    if output_dir.exists():
        print(f"[finetune] WARNING: {output_dir} already exists — will overwrite.", flush=True)
    # --- Data ---
    score_file = _ROOT / "data" / "email_score.jsonl"
    print(f"[finetune] Loading data from {score_file} ...", flush=True)
    texts, str_labels = load_and_prepare_data(score_file)
    present_labels = sorted(set(str_labels))
    label2id = {l: i for i, l in enumerate(present_labels)}
    id2label = {i: l for l, i in label2id.items()}
    n_classes = len(present_labels)
    label_ids = [label2id[l] for l in str_labels]
    print(f"[finetune] {len(texts)} samples, {n_classes} classes", flush=True)
    # Stratified 80/20 split
    (train_texts, val_texts,
     train_label_ids, val_label_ids) = train_test_split(
        texts, label_ids,
        test_size=0.2,
        stratify=label_ids,
        random_state=42,
    )
    print(f"[finetune] Train: {len(train_texts)}, Val: {len(val_texts)}", flush=True)
    # Warn for classes with < 5 training samples
    from collections import Counter
    train_counts = Counter(train_label_ids)
    for cls_id, cnt in train_counts.items():
        if cnt < 5:
            print(
                f"[finetune] WARNING: Class {id2label[cls_id]!r} has {cnt} training sample(s). "
                "Eval F1 for this class will be unreliable.",
                flush=True,
            )
    # --- Tokenize ---
    print("[finetune] Loading tokenizer ...", flush=True)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    train_enc = tokenizer(train_texts, truncation=True,
                          max_length=config["max_tokens"], padding=True)
    val_enc = tokenizer(val_texts, truncation=True,
                        max_length=config["max_tokens"], padding=True)
    train_dataset = _EmailDataset(train_enc, train_label_ids)
    val_dataset = _EmailDataset(val_enc, val_label_ids)
    # --- Class weights ---
    class_weights = compute_class_weights(train_label_ids, n_classes)
print(f"[finetune] Class weights: {dict(zip(present_labels, class_weights.tolist()))}", flush=True)
# --- Model ---
print(f"[finetune] Loading model ...", flush=True)
model = AutoModelForSequenceClassification.from_pretrained(
base_model_id,
num_labels=n_classes,
ignore_mismatched_sizes=True, # NLI head (3-class) → new head (n_classes)
id2label=id2label,
label2id=label2id,
)
if config["gradient_checkpointing"]:
model.gradient_checkpointing_enable()
# --- TrainingArguments ---
training_args = TrainingArguments(
output_dir=str(output_dir),
num_train_epochs=epochs,
per_device_train_batch_size=config["batch_size"],
per_device_eval_batch_size=config["batch_size"],
gradient_accumulation_steps=config["grad_accum"],
learning_rate=2e-5,
lr_scheduler_type="linear",
warmup_ratio=0.1,
fp16=config["fp16"],
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="macro_f1",
greater_is_better=True,
logging_steps=10,
report_to="none",
save_total_limit=2,
)
trainer = WeightedTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics_for_trainer,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.class_weights = class_weights
# --- Train ---
print(f"[finetune] Starting training ({epochs} epochs) ...", flush=True)
train_result = trainer.train()
print(f"[finetune] Training complete. Steps: {train_result.global_step}", flush=True)
# --- Evaluate ---
print(f"[finetune] Evaluating best checkpoint ...", flush=True)
metrics = trainer.evaluate()
val_macro_f1 = metrics.get("eval_macro_f1", 0.0)
val_accuracy = metrics.get("eval_accuracy", 0.0)
print(f"[finetune] Val macro-F1: {val_macro_f1:.4f}, Accuracy: {val_accuracy:.4f}", flush=True)
# --- Save model + tokenizer ---
print(f"[finetune] Saving model to {output_dir} ...", flush=True)
trainer.save_model(str(output_dir))
tokenizer.save_pretrained(str(output_dir))
# --- Write training_info.json ---
    label_counts = dict(Counter(str_labels))
info = {
"name": f"avocet-{model_key}",
"base_model_id": base_model_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"epochs_run": epochs,
"val_macro_f1": round(val_macro_f1, 4),
"val_accuracy": round(val_accuracy, 4),
"sample_count": len(train_texts),
"label_counts": label_counts,
}
info_path = output_dir / "training_info.json"
info_path.write_text(json.dumps(info, indent=2), encoding="utf-8")
print(f"[finetune] Saved training_info.json: val_macro_f1={val_macro_f1:.4f}", flush=True)
print(f"[finetune] Done.", flush=True)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Fine-tune an email classifier")
parser.add_argument(
"--model",
choices=list(_MODEL_CONFIG),
required=True,
help="Model key to fine-tune",
)
parser.add_argument(
"--epochs",
type=int,
default=5,
help="Number of training epochs (default: 5)",
)
args = parser.parse_args()
run_finetune(args.model, args.epochs)
```
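`compute_class_weights` is defined in an earlier chunk; if it follows scikit-learn's `"balanced"` scheme (`n_samples / (n_classes * count)`), the arithmetic reduces to this dependency-free sketch (an assumption about that earlier code, shown only to make the weight printout at the top of training interpretable):

```python
from collections import Counter

def balanced_weights(label_ids: list[int], n_classes: int) -> list[float]:
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = Counter(label_ids)
    n = len(label_ids)
    return [n / (n_classes * counts[c]) for c in range(n_classes)]

# 8 train samples: class 0 appears 6x, class 1 only 2x -> class 1 weighted 3x higher.
print(balanced_weights([0] * 6 + [1] * 2, 2))  # -> [0.666..., 2.0]
```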
- [ ] **Step 2: Run all finetune tests**
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v
```
Expected: All tests PASS (run_finetune itself is tested in the integration test — Task 11).
- [ ] **Step 3: Commit**
```bash
git add scripts/finetune_classifier.py
git commit -m "feat(avocet): add run_finetune() training loop and CLI"
```
---
### Task 11: Integration test — finetune on example data
**Files:**
- Modify: `tests/test_finetune.py`
The example file `data/email_score.jsonl.example` has 8 samples with 5 of 10 labels represented. The 5 missing labels trigger the `< 2 total samples` drop path.
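Assuming `load_and_prepare_data` (defined in an earlier chunk) drops any class with fewer than 2 total samples so the stratified split can succeed, the drop path amounts to something like this sketch (the function name and message wording here are illustrative, not the plan's actual helper):

```python
from collections import Counter

def drop_rare_classes(texts: list[str], labels: list[str], min_samples: int = 2):
    """Keep only samples whose label occurs at least min_samples times;
    classes below the threshold cannot survive a stratified train/val split."""
    counts = Counter(labels)
    for lbl, cnt in sorted(counts.items()):
        if cnt < min_samples:
            print(f"WARNING: Dropping class {lbl!r}: {cnt} < {min_samples} total samples")
    kept = [(t, l) for t, l in zip(texts, labels) if counts[l] >= min_samples]
    return [t for t, _ in kept], [l for _, l in kept]

# "offer" has only 1 sample, so it is dropped and 2 samples remain.
texts, labels = drop_rare_classes(["a", "b", "c"], ["spam", "spam", "offer"])
```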
- [ ] **Step 1: Append integration test**
Append to `tests/test_finetune.py`:
```python
# ---- Integration test ----
def test_integration_finetune_on_example_data(tmp_path):
"""Fine-tune deberta-small on example data for 1 epoch.
Uses data/email_score.jsonl.example (8 samples, 5 labels represented).
The 5 missing labels must trigger the < 2 samples drop warning.
Verifies training_info.json is written with correct keys.
NOTE: This test requires the job-seeker-classifiers conda env and downloads
the deberta-small model on first run (~100MB). Skip in CI if model not cached.
    It is excluded from default runs via the -k "not integration" filter used in later steps.
"""
import shutil
from scripts.finetune_classifier import run_finetune, _ROOT
from scripts import finetune_classifier as ft_mod
example_file = _ROOT / "data" / "email_score.jsonl.example"
if not example_file.exists():
pytest.skip("email_score.jsonl.example not found")
# Patch _ROOT to use tmp_path so model saves there, not production models/
orig_root = ft_mod._ROOT
ft_mod._ROOT = tmp_path
# Also copy the example file to tmp_path/data/
(tmp_path / "data").mkdir()
shutil.copy(example_file, tmp_path / "data" / "email_score.jsonl")
try:
import io
from contextlib import redirect_stdout
captured = io.StringIO()
with redirect_stdout(captured):
run_finetune("deberta-small", epochs=1)
output = captured.getvalue()
finally:
ft_mod._ROOT = orig_root
    # At least one of the 5 missing labels should trigger a drop warning
    assert "< 2 total samples" in output or "WARNING: Dropping class" in output
# training_info.json must exist with correct keys
info_path = tmp_path / "models" / "avocet-deberta-small" / "training_info.json"
assert info_path.exists(), "training_info.json not written"
import json
info = json.loads(info_path.read_text())
for key in ("name", "base_model_id", "timestamp", "epochs_run",
"val_macro_f1", "val_accuracy", "sample_count", "label_counts"):
assert key in info, f"Missing key: {key}"
assert info["name"] == "avocet-deberta-small"
assert info["epochs_run"] == 1
```
- [ ] **Step 2: Run unit tests only (fast path, no model download)**
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v -k "not integration"
```
Expected: All non-integration tests PASS.
- [ ] **Step 3: Run integration test (requires model download ~100MB)**
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py::test_integration_finetune_on_example_data -v -s
```
Expected: PASS. Check output for drop warnings for missing labels.
- [ ] **Step 4: Commit**
```bash
git add tests/test_finetune.py
git commit -m "test(avocet): add integration test for finetune_classifier on example data"
```
---
## Chunk 3: API Endpoints + BenchmarkView UI
### Task 12: API endpoints — write failing tests
**Files:**
- Modify: `tests/test_api.py`
- [ ] **Step 1: Append finetune endpoint tests**
Append to `tests/test_api.py`:
```python
# ---- /api/finetune/status tests ----
def test_finetune_status_returns_empty_when_no_models_dir(client):
"""GET /api/finetune/status must return [] if models/ does not exist."""
r = client.get("/api/finetune/status")
assert r.status_code == 200
assert r.json() == []
def test_finetune_status_returns_training_info(client, tmp_path):
"""GET /api/finetune/status must return one entry per training_info.json found."""
import json
from app import api as api_module
    # Create a fake models dir next to _DATA_DIR (patched to tmp_path by the test fixtures)
models_dir = api_module._DATA_DIR.parent / "models"
model_dir = models_dir / "avocet-deberta-small"
model_dir.mkdir(parents=True)
info = {
"name": "avocet-deberta-small",
"base_model_id": "cross-encoder/nli-deberta-v3-small",
"val_macro_f1": 0.712,
"timestamp": "2026-03-15T12:00:00Z",
"sample_count": 401,
}
(model_dir / "training_info.json").write_text(json.dumps(info))
r = client.get("/api/finetune/status")
assert r.status_code == 200
data = r.json()
assert len(data) == 1
assert data[0]["name"] == "avocet-deberta-small"
assert data[0]["val_macro_f1"] == pytest.approx(0.712)
def test_finetune_run_streams_sse_events(client):
"""GET /api/finetune/run must return text/event-stream content type."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter(["Training epoch 1\n", "Done\n"])
mock_proc.returncode = 0
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert r.status_code == 200
assert "text/event-stream" in r.headers.get("content-type", "")
def test_finetune_run_emits_complete_on_success(client):
"""GET /api/finetune/run must emit a complete event on clean exit."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter(["progress line\n"])
mock_proc.returncode = 0
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert '{"type": "complete"}' in r.text
def test_finetune_run_emits_error_on_nonzero_exit(client):
"""GET /api/finetune/run must emit an error event on non-zero exit."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter([])
mock_proc.returncode = 1
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert '"type": "error"' in r.text
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -k "finetune" -v
```
Expected: 404 or connection errors — endpoints not yet defined.
---
### Task 13: Implement finetune API endpoints
**Files:**
- Modify: `app/api.py`
- [ ] **Step 1: Add finetune endpoints to api.py**
In `app/api.py`, add after the benchmark endpoints section (after the `run_benchmark` function, before the `fetch_stream` function):
```python
# ---------------------------------------------------------------------------
# Fine-tune endpoints
# ---------------------------------------------------------------------------
@app.get("/api/finetune/status")
def get_finetune_status():
"""Scan models/ for training_info.json files. Returns [] if none exist."""
    models_dir = _DATA_DIR.parent / "models"  # sibling of data/, so tests can patch _DATA_DIR
if not models_dir.exists():
return []
results = []
for sub in models_dir.iterdir():
if not sub.is_dir():
continue
info_path = sub / "training_info.json"
if not info_path.exists():
continue
try:
info = json.loads(info_path.read_text(encoding="utf-8"))
results.append(info)
except Exception:
pass
return results
@app.get("/api/finetune/run")
def run_finetune(model: str = "deberta-small", epochs: int = 5):
"""Spawn finetune_classifier.py and stream stdout as SSE progress events."""
import subprocess
python_bin = "/devl/miniconda3/envs/job-seeker-classifiers/bin/python"
script = str(_ROOT / "scripts" / "finetune_classifier.py")
cmd = [python_bin, script, "--model", model, "--epochs", str(epochs)]
def generate():
try:
proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
cwd=str(_ROOT),
)
for line in proc.stdout:
line = line.rstrip()
if line:
yield f"data: {json.dumps({'type': 'progress', 'message': line})}\n\n"
proc.wait()
if proc.returncode == 0:
yield f"data: {json.dumps({'type': 'complete'})}\n\n"
else:
yield f"data: {json.dumps({'type': 'error', 'message': f'Process exited with code {proc.returncode}'})}\n\n"
except Exception as exc:
yield f"data: {json.dumps({'type': 'error', 'message': str(exc)})}\n\n"
return StreamingResponse(
generate(),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)
```
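The endpoint frames every event as `data: <json>\n\n`, which is the contract the browser `EventSource` used in Task 14 depends on. A minimal framing/parsing sketch of that format (illustrative, not part of the plan's code):

```python
import json

def sse_frame(payload: dict) -> str:
    """Serialize a payload the way the endpoint's generate() does."""
    return f"data: {json.dumps(payload)}\n\n"

def sse_parse(stream: str) -> list[dict]:
    """Decode a concatenated SSE stream back into payload dicts."""
    out = []
    for block in stream.split("\n\n"):
        if block.startswith("data: "):
            out.append(json.loads(block[len("data: "):]))
    return out

stream = sse_frame({"type": "progress", "message": "epoch 1"}) + sse_frame({"type": "complete"})
assert sse_parse(stream)[-1] == {"type": "complete"}
```

This is also why the tests can simply assert on substrings like `'{"type": "complete"}'` in the raw response body.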
- [ ] **Step 2: Run finetune API tests**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -k "finetune" -v
```
Expected: All 5 finetune tests PASS.
- [ ] **Step 3: Run full API test suite**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -v
```
Expected: All tests PASS.
- [ ] **Step 4: Commit**
```bash
git add app/api.py tests/test_api.py
git commit -m "feat(avocet): add /api/finetune/status and /api/finetune/run endpoints"
```
---
### Task 14: BenchmarkView.vue — trained models badge row + fine-tune section
**Files:**
- Modify: `web/src/views/BenchmarkView.vue`
The BenchmarkView already has:
- Macro-F1 bar chart
- Latency bar chart
- Per-label F1 heatmap
- Benchmark run button with SSE log
Add:
1. **Trained models badge row** at the top (conditional on `fineTunedModels.length > 0`)
2. **Fine-tune section** (collapsible, at the bottom): model dropdown, epoch input, run button → SSE log, on `complete` auto-trigger benchmark run
- [ ] **Step 1: Read current BenchmarkView.vue**
```bash
cat web/src/views/BenchmarkView.vue
```
(Use this to understand the existing structure before editing — identify where to insert each new section.)
- [ ] **Step 2: Add fineTunedModels state and fetch logic**
In the `<script setup>` section, add after the existing reactive state:
```ts
// Fine-tuned models
const fineTunedModels = ref<Array<{
name: string
base_model: string
val_macro_f1: number
timestamp: string
sample_count: number
}>>([])
const finetune = reactive({
model: 'deberta-small',
epochs: 5,
running: false,
log: [] as string[],
es: null as EventSource | null,
})
async function fetchFineTunedModels() {
try {
const r = await fetch('/api/finetune/status')
fineTunedModels.value = await r.json()
} catch { /* silent */ }
}
function runFinetune() {
if (finetune.running) return
finetune.running = true
finetune.log = []
finetune.es?.close()
const url = `/api/finetune/run?model=${finetune.model}&epochs=${finetune.epochs}`
finetune.es = new EventSource(url)
finetune.es.onmessage = (e) => {
const msg = JSON.parse(e.data)
if (msg.type === 'progress') {
finetune.log.push(msg.message)
} else if (msg.type === 'complete') {
finetune.running = false
finetune.es?.close()
fetchFineTunedModels()
runBenchmark() // auto-trigger benchmark to update charts
} else if (msg.type === 'error') {
finetune.running = false
finetune.es?.close()
finetune.log.push(`ERROR: ${msg.message}`)
}
}
}
```
Add `fetchFineTunedModels()` to the `onMounted` call alongside the existing `fetchResults()`.
- [ ] **Step 3: Add trained models badge row to template**
In the `<template>`, add at the very top of the main content area (before the chart sections), conditional on `fineTunedModels.length > 0`:
```html
<!-- Trained models badge row -->
<div v-if="fineTunedModels.length > 0" class="trained-models-row">
<span class="trained-label">Trained models:</span>
<span
v-for="m in fineTunedModels"
:key="m.name"
class="trained-badge"
    :title="`Base: ${m.base_model_id} | ${m.sample_count} samples | ${m.timestamp}`"
>
{{ m.name }}
<span class="trained-f1">F1 {{ (m.val_macro_f1 * 100).toFixed(1) }}%</span>
</span>
</div>
```
- [ ] **Step 4: Add fine-tune collapsible section to template**
Add at the bottom of the main content area, after the benchmark log section:
```html
<!-- Fine-tune section -->
<details class="finetune-section">
<summary class="finetune-summary">Fine-tune a model</summary>
<div class="finetune-controls">
<label class="ft-label">
Model
<select v-model="finetune.model" class="ft-select">
<option value="deberta-small">deberta-small (100M, fast)</option>
<option value="bge-m3">bge-m3 (600M, slow — stop Peregrine vLLM first)</option>
</select>
</label>
<label class="ft-label">
Epochs
<input
v-model.number="finetune.epochs"
type="number"
min="1"
max="20"
class="ft-epochs"
/>
</label>
<button
class="ft-run-btn"
:disabled="finetune.running"
@click="runFinetune"
>
{{ finetune.running ? 'Training…' : 'Run fine-tune' }}
</button>
</div>
<div v-if="finetune.log.length > 0" class="ft-log">
<div v-for="(line, i) in finetune.log" :key="i" class="ft-log-line">{{ line }}</div>
</div>
</details>
```
- [ ] **Step 5: Add styles**
Add to the `<style scoped>` section:
```css
/* Trained models badge row */
.trained-models-row {
display: flex;
flex-wrap: wrap;
align-items: center;
gap: 0.5rem;
padding: 0.75rem 1rem;
background: var(--color-surface-raised, #e4ebf5);
border-radius: 0.5rem;
margin-bottom: 1rem;
}
.trained-label {
font-size: 0.8rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
text-transform: uppercase;
letter-spacing: 0.04em;
}
.trained-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
padding: 0.25rem 0.6rem;
background: var(--app-primary, #2A6080);
color: white;
border-radius: 1rem;
font-size: 0.82rem;
cursor: default;
}
.trained-f1 {
background: rgba(255,255,255,0.2);
border-radius: 0.75rem;
padding: 0.1rem 0.4rem;
font-size: 0.75rem;
font-weight: 700;
}
/* Fine-tune section */
.finetune-section {
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
padding: 0;
margin-top: 1.5rem;
}
.finetune-summary {
padding: 0.75rem 1rem;
cursor: pointer;
font-weight: 600;
color: var(--color-text, #1a2338);
list-style: none;
user-select: none;
}
.finetune-summary::-webkit-details-marker { display: none; }
.finetune-summary::before {
content: '▶ ';
font-size: 0.7rem;
color: var(--color-text-secondary, #6b7a99);
}
details[open] .finetune-summary::before { content: '▼ '; }
.finetune-controls {
display: flex;
flex-wrap: wrap;
gap: 1rem;
align-items: flex-end;
padding: 0.75rem 1rem 1rem;
border-top: 1px solid var(--color-border, #d0d7e8);
}
.ft-label {
display: flex;
flex-direction: column;
gap: 0.3rem;
font-size: 0.82rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
}
.ft-select {
padding: 0.35rem 0.6rem;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
background: var(--color-surface, #f0f4fb);
font-size: 0.9rem;
color: var(--color-text, #1a2338);
min-width: 260px;
}
.ft-epochs {
width: 70px;
padding: 0.35rem 0.5rem;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
background: var(--color-surface, #f0f4fb);
font-size: 0.9rem;
color: var(--color-text, #1a2338);
text-align: center;
}
.ft-run-btn {
padding: 0.45rem 1.2rem;
background: var(--app-primary, #2A6080);
color: white;
border: none;
border-radius: 0.375rem;
font-size: 0.9rem;
font-weight: 600;
cursor: pointer;
transition: opacity 0.15s;
}
.ft-run-btn:disabled {
opacity: 0.55;
cursor: not-allowed;
}
.ft-log {
margin: 0 1rem 1rem;
padding: 0.5rem 0.75rem;
background: var(--color-surface, #f0f4fb);
border-radius: 0.375rem;
max-height: 260px;
overflow-y: auto;
font-family: var(--font-mono, monospace);
font-size: 0.78rem;
}
.ft-log-line {
line-height: 1.6;
color: var(--color-text, #1a2338);
white-space: pre-wrap;
word-break: break-all;
}
```
- [ ] **Step 6: Build and verify**
```bash
cd /Library/Development/CircuitForge/avocet/web && npm run build
```
Expected: Build succeeds with no errors.
- [ ] **Step 7: Start dev server and verify in browser**
```bash
./manage.sh start-api
```
Open http://localhost:8503 → navigate to Benchmark:
- Badge row not visible (no trained models yet — correct)
- Fine-tune section visible as collapsed `<details>`
- Click to expand → dropdown and epoch input visible
- Without a model trained, status returns `[]` (correct)
- [ ] **Step 8: Commit**
```bash
git add web/src/views/BenchmarkView.vue
git commit -m "feat(avocet): add fine-tune section and trained models badge row to BenchmarkView"
```
---
### Task 15: Final verification — full test suite
- [ ] **Step 1: Run full test suite**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v -k "not integration"
```
Expected: All non-integration tests PASS.
- [ ] **Step 2: Run in classifier env (catches transformers-specific tests)**
```bash
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/ -v -k "not integration"
```
Expected: All non-integration tests PASS.
- [ ] **Step 3: Build Vue SPA**
```bash
cd /Library/Development/CircuitForge/avocet/web && npm run build
```
Expected: No TypeScript or build errors.
- [ ] **Step 4: Final commit**
```bash
git add -A
git status # verify nothing unexpected staged
git commit -m "feat(avocet): finetune classifier feature complete"
```