Fine-tune Email Classifier Implementation Plan
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Fine-tune deberta-small and bge-m3 on the labeled dataset, surface trained models in the benchmark harness, and expose a UI-triggerable training workflow with SSE streaming logs.
Architecture: A new CLI script (scripts/finetune_classifier.py) handles data prep, weighted training, and checkpoint saving. A new FineTunedAdapter in classifier_adapters.py loads saved checkpoints for inference. benchmark_classifier.py auto-discovers these adapters at startup via training_info.json files. Two GET endpoints in api.py expose status and streaming run. BenchmarkView.vue adds a badge row and collapsible fine-tune section.
Tech Stack: transformers 4.57.3, torch 2.10.0, accelerate 1.12.0, scikit-learn (new), FastAPI SSE, Vue 3 + EventSource
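For orientation, here is a minimal sketch of the on-disk layout the auto-discovery and status endpoints rely on. The field names match the tasks below; the values are illustrative only.

```python
# Hypothetical layout after a successful run of scripts/finetune_classifier.py:
#
#   models/
#     avocet-deberta-small/
#       config.json, model.safetensors, tokenizer files, ...
#       training_info.json   <- discovery marker read by the harness and the API
#
# Shape of training_info.json as written by run_finetune() later in this plan
# (values here are made up for illustration):
example_training_info = {
    "name": "avocet-deberta-small",
    "base_model_id": "cross-encoder/nli-deberta-v3-small",
    "timestamp": "2026-03-15T12:00:00Z",
    "epochs_run": 5,
    "val_macro_f1": 0.72,
    "val_accuracy": 0.80,
    "sample_count": 401,                              # size of the training split
    "label_counts": {"digest": 120, "neutral": 90},   # per-label totals (truncated)
}
```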
File Structure
| File | Action | Responsibility |
|---|---|---|
| environment.yml | Modify | Add scikit-learn dependency |
| scripts/classifier_adapters.py | Modify | Add FineTunedAdapter class |
| scripts/benchmark_classifier.py | Modify | Add _MODELS_DIR, discover_finetuned_models(), merge into model registry at startup |
| scripts/finetune_classifier.py | Create | Full training pipeline: data prep, class weights, WeightedTrainer, CLI |
| app/api.py | Modify | Add GET /api/finetune/status and GET /api/finetune/run |
| web/src/views/BenchmarkView.vue | Modify | Add trained models badge row + collapsible fine-tune section |
| tests/test_classifier_adapters.py | Modify | Add FineTunedAdapter unit tests |
| tests/test_benchmark_classifier.py | Modify | Add auto-discovery unit tests |
| tests/test_finetune.py | Create | Unit tests for data pipeline, WeightedTrainer, compute_metrics_for_trainer |
| tests/test_api.py | Modify | Add tests for /api/finetune/status and /api/finetune/run |
Chunk 1: Foundation — FineTunedAdapter + Auto-discovery
Task 1: Add scikit-learn to environment.yml
Files:
- Modify: environment.yml
- Step 1: Add scikit-learn
Edit environment.yml — add scikit-learn>=1.4 in the pip section after accelerate:
- scikit-learn>=1.4
- Step 2: Verify environment.yml is valid YAML
python -c "import yaml; yaml.safe_load(open('environment.yml'))" && echo OK
Expected: OK
- Step 3: Commit
git add environment.yml
git commit -m "chore(avocet): add scikit-learn to classifier env"
Task 2: FineTunedAdapter — write failing tests
Files:
- Modify: tests/test_classifier_adapters.py
- Step 1: Write the failing tests
Append to tests/test_classifier_adapters.py:
# ---- FineTunedAdapter tests ----
def test_finetuned_adapter_classify_calls_pipeline_with_sep_format(tmp_path):
"""classify() must format input as 'subject [SEP] body[:400]' — not the zero-shot format."""
from unittest.mock import MagicMock, patch
from scripts.classifier_adapters import FineTunedAdapter
mock_result = [{"label": "digest", "score": 0.95}]
mock_pipe_instance = MagicMock(return_value=mock_result)
mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
adapter = FineTunedAdapter("avocet-deberta-small", str(tmp_path))
with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
result = adapter.classify("Test subject", "Test body")
assert result == "digest"
call_args = mock_pipe_instance.call_args[0][0]
assert "[SEP]" in call_args
assert "Test subject" in call_args
assert "Test body" in call_args
def test_finetuned_adapter_truncates_body_to_400():
"""Body must be truncated to 400 chars in the [SEP] format."""
from unittest.mock import MagicMock, patch
from scripts.classifier_adapters import FineTunedAdapter, LABELS
long_body = "x" * 800
mock_result = [{"label": "neutral", "score": 0.9}]
mock_pipe_instance = MagicMock(return_value=mock_result)
mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
adapter.classify("Subject", long_body)
call_text = mock_pipe_instance.call_args[0][0]
# "Subject [SEP] " prefix + 400 body chars = 414 chars max
assert len(call_text) <= 420
def test_finetuned_adapter_returns_label_string():
"""classify() must return a plain string, not a dict."""
from unittest.mock import MagicMock, patch
from scripts.classifier_adapters import FineTunedAdapter
mock_result = [{"label": "interview_scheduled", "score": 0.87}]
mock_pipe_instance = MagicMock(return_value=mock_result)
mock_pipe_factory = MagicMock(return_value=mock_pipe_instance)
adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
result = adapter.classify("S", "B")
assert isinstance(result, str)
assert result == "interview_scheduled"
def test_finetuned_adapter_lazy_loads_pipeline():
"""Pipeline factory must not be called until classify() is first called."""
from unittest.mock import MagicMock, patch
from scripts.classifier_adapters import FineTunedAdapter
mock_pipe_factory = MagicMock(return_value=MagicMock(return_value=[{"label": "neutral", "score": 0.9}]))
with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
assert not mock_pipe_factory.called
adapter.classify("s", "b")
assert mock_pipe_factory.called
def test_finetuned_adapter_unload_clears_pipeline():
"""unload() must set _pipeline to None so memory is released."""
from unittest.mock import MagicMock, patch
from scripts.classifier_adapters import FineTunedAdapter
mock_pipe_factory = MagicMock(return_value=MagicMock(return_value=[{"label": "neutral", "score": 0.9}]))
with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
adapter = FineTunedAdapter("avocet-deberta-small", "/fake/path")
adapter.classify("s", "b")
assert adapter._pipeline is not None
adapter.unload()
assert adapter._pipeline is None
- Step 2: Run tests to verify they fail
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -k "finetuned" -v
Expected: ImportError or AttributeError — FineTunedAdapter not yet defined.
Task 3: FineTunedAdapter — implement
Files:
- Modify: scripts/classifier_adapters.py
- Step 1: Add FineTunedAdapter to __all__
In scripts/classifier_adapters.py, add "FineTunedAdapter" to __all__.
- Step 2: Implement FineTunedAdapter
Append after RerankerAdapter:
class FineTunedAdapter(ClassifierAdapter):
"""Loads a fine-tuned checkpoint from a local models/ directory.
Uses pipeline("text-classification") for a single forward pass.
Input format: 'subject [SEP] body[:400]' — must match training format exactly.
Expected inference speed: ~10–20ms/email vs 111–338ms for zero-shot.
"""
def __init__(self, name: str, model_dir: str) -> None:
self._name = name
self._model_dir = model_dir
self._pipeline: Any = None
@property
def name(self) -> str:
return self._name
@property
def model_id(self) -> str:
return self._model_dir
def load(self) -> None:
import scripts.classifier_adapters as _mod # noqa: PLC0415
_pipe_fn = _mod.pipeline
if _pipe_fn is None:
raise ImportError("transformers not installed")
self._pipeline = _pipe_fn("text-classification", model=self._model_dir)
def unload(self) -> None:
self._pipeline = None
def classify(self, subject: str, body: str) -> str:
if self._pipeline is None:
self.load()
text = f"{subject} [SEP] {body[:400]}"
result = self._pipeline(text)
return result[0]["label"]
- Step 3: Run tests to verify they pass
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -k "finetuned" -v
Expected: 5 tests PASS.
- Step 4: Run full adapter test suite to verify no regressions
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v
Expected: All tests PASS.
- Step 5: Commit
git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat(avocet): add FineTunedAdapter for local checkpoint inference"
Task 4: Auto-discovery in benchmark_classifier.py — write failing tests
Files:
- Modify: tests/test_benchmark_classifier.py
- Step 1: Write the failing tests
Append to tests/test_benchmark_classifier.py:
# ---- Auto-discovery tests ----
def test_discover_finetuned_models_finds_training_info_files(tmp_path):
"""discover_finetuned_models() must return one entry per training_info.json found."""
import json
from scripts.benchmark_classifier import discover_finetuned_models
# Create two fake model directories
for name in ("avocet-deberta-small", "avocet-bge-m3"):
model_dir = tmp_path / name
model_dir.mkdir()
info = {
"name": name,
"base_model_id": "cross-encoder/nli-deberta-v3-small",
"timestamp": "2026-03-15T12:00:00Z",
"val_macro_f1": 0.72,
"val_accuracy": 0.80,
"sample_count": 401,
}
(model_dir / "training_info.json").write_text(json.dumps(info))
results = discover_finetuned_models(tmp_path)
assert len(results) == 2
names = {r["name"] for r in results}
assert "avocet-deberta-small" in names
assert "avocet-bge-m3" in names
def test_discover_finetuned_models_returns_empty_when_no_models_dir():
"""discover_finetuned_models() must return [] silently if models/ doesn't exist."""
from pathlib import Path
from scripts.benchmark_classifier import discover_finetuned_models
results = discover_finetuned_models(Path("/nonexistent/path/models"))
assert results == []
def test_discover_finetuned_models_skips_dirs_without_training_info(tmp_path):
"""Subdirs without training_info.json are silently skipped."""
from scripts.benchmark_classifier import discover_finetuned_models
# A dir WITHOUT training_info.json
(tmp_path / "some-other-dir").mkdir()
results = discover_finetuned_models(tmp_path)
assert results == []
def test_active_models_includes_discovered_finetuned(tmp_path):
"""The active models dict must include FineTunedAdapter entries for discovered models."""
import json
from unittest.mock import patch
from scripts.benchmark_classifier import _active_models
from scripts.classifier_adapters import FineTunedAdapter
model_dir = tmp_path / "avocet-deberta-small"
model_dir.mkdir()
(model_dir / "training_info.json").write_text(json.dumps({
"name": "avocet-deberta-small",
"base_model_id": "cross-encoder/nli-deberta-v3-small",
"val_macro_f1": 0.72,
"sample_count": 401,
}))
with patch("scripts.benchmark_classifier._MODELS_DIR", tmp_path):
models = _active_models(include_slow=False)
assert "avocet-deberta-small" in models
assert isinstance(models["avocet-deberta-small"]["adapter_instance"], FineTunedAdapter)
- Step 2: Run tests to verify they fail
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -k "discover or active_models" -v
Expected: ImportError — discover_finetuned_models and _MODELS_DIR not yet defined.
Task 5: Auto-discovery — implement in benchmark_classifier.py
Files:
- Modify: scripts/benchmark_classifier.py
- Step 1: Add imports and _MODELS_DIR
Near the top of scripts/benchmark_classifier.py, after the existing imports, add:
from scripts.classifier_adapters import FineTunedAdapter
And define _MODELS_DIR (after _ROOT is defined — find where _ROOT = Path(__file__).parent.parent is, or add it):
_ROOT = Path(__file__).parent.parent
_MODELS_DIR = _ROOT / "models"
(If _ROOT already exists in the file, only add _MODELS_DIR.)
- Step 2: Add discover_finetuned_models()
Add after the MODEL_REGISTRY dict:
def discover_finetuned_models(models_dir: Path | None = None) -> list[dict]:
"""Scan models/ for subdirs containing training_info.json.
Returns a list of training_info dicts, each with an added 'model_dir' key.
Returns [] silently if models_dir does not exist.
"""
if models_dir is None:
models_dir = _MODELS_DIR
if not models_dir.exists():
return []
found = []
for sub in models_dir.iterdir():
if not sub.is_dir():
continue
info_path = sub / "training_info.json"
if not info_path.exists():
continue
info = json.loads(info_path.read_text(encoding="utf-8"))
info["model_dir"] = str(sub)
found.append(info)
return found
- Step 3: Add _active_models() function
Add after discover_finetuned_models():
def _active_models(include_slow: bool = False) -> dict[str, dict]:
"""Return the active model registry, merged with any discovered fine-tuned models."""
active = {
key: {**entry, "adapter_instance": entry["adapter"](
key,
entry["model_id"],
**entry.get("kwargs", {}),
)}
for key, entry in MODEL_REGISTRY.items()
if include_slow or entry.get("default", False)
}
for info in discover_finetuned_models():
name = info["name"]
active[name] = {
"adapter_instance": FineTunedAdapter(name, info["model_dir"]),
"params": "fine-tuned",
"default": True,
}
return active
- Step 4: Run tests to verify they pass
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -k "discover or active_models" -v
Expected: 4 tests PASS.
- Step 5: Run full benchmark test suite
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -v
Expected: All tests PASS. (Existing tests that construct adapters directly from MODEL_REGISTRY still work because we only added new functions.)
- Step 6: Commit
git add scripts/benchmark_classifier.py tests/test_benchmark_classifier.py
git commit -m "feat(avocet): auto-discover fine-tuned models in benchmark harness"
Chunk 2: Training Script — finetune_classifier.py
Task 6: Data loading and class weights — write failing tests
Files:
- Create: tests/test_finetune.py
- Step 1: Create test file with data pipeline tests
Create tests/test_finetune.py:
"""Tests for finetune_classifier — no model downloads required."""
from __future__ import annotations
import json
import pytest
# ---- Data loading tests ----
def test_load_and_prepare_data_drops_non_canonical_labels(tmp_path):
"""Rows with labels not in LABELS must be silently dropped."""
from scripts.finetune_classifier import load_and_prepare_data
from scripts.classifier_adapters import LABELS
rows = [
{"subject": "s1", "body": "b1", "label": "digest"},
{"subject": "s2", "body": "b2", "label": "profile_alert"}, # non-canonical
{"subject": "s3", "body": "b3", "label": "neutral"},
]
score_file = tmp_path / "email_score.jsonl"
score_file.write_text("\n".join(json.dumps(r) for r in rows))
texts, labels = load_and_prepare_data(score_file)
assert len(texts) == 2
assert all(l in LABELS for l in labels)
def test_load_and_prepare_data_formats_input_as_sep():
"""Input text must be 'subject [SEP] body[:400]'."""
import json
from pathlib import Path
from scripts.finetune_classifier import load_and_prepare_data
import tempfile, os
with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f:
f.write(json.dumps({"subject": "Hello", "body": "World" * 100, "label": "neutral"}) + "\n")
fname = f.name
try:
texts, labels = load_and_prepare_data(Path(fname))
finally:
os.unlink(fname)
assert texts[0].startswith("Hello [SEP] ")
assert len(texts[0]) <= len("Hello [SEP] ") + 400 + 5 # small buffer for truncation
def test_load_and_prepare_data_raises_on_missing_file():
"""FileNotFoundError must be raised with actionable message."""
from pathlib import Path
from scripts.finetune_classifier import load_and_prepare_data
with pytest.raises(FileNotFoundError, match="email_score.jsonl"):
load_and_prepare_data(Path("/nonexistent/email_score.jsonl"))
def test_load_and_prepare_data_drops_class_with_fewer_than_2_samples(tmp_path, capsys):
"""Classes with < 2 total samples must be dropped with a warning."""
from scripts.finetune_classifier import load_and_prepare_data
rows = [
{"subject": "s1", "body": "b", "label": "digest"},
{"subject": "s2", "body": "b", "label": "digest"},
{"subject": "s3", "body": "b", "label": "new_lead"}, # only 1 sample — drop
]
score_file = tmp_path / "email_score.jsonl"
score_file.write_text("\n".join(json.dumps(r) for r in rows))
texts, labels = load_and_prepare_data(score_file)
captured = capsys.readouterr()
assert "new_lead" not in labels
assert "new_lead" in captured.out # warning printed
# ---- Class weights tests ----
def test_compute_class_weights_returns_tensor_for_each_class():
"""compute_class_weights must return a float tensor of length n_classes."""
import torch
from scripts.finetune_classifier import compute_class_weights
label_ids = [0, 0, 0, 1, 1, 2] # 3 classes, imbalanced
weights = compute_class_weights(label_ids, n_classes=3)
assert isinstance(weights, torch.Tensor)
assert weights.shape == (3,)
assert all(w > 0 for w in weights)
def test_compute_class_weights_upweights_minority():
"""Minority classes must receive higher weight than majority classes."""
from scripts.finetune_classifier import compute_class_weights
# Class 0: 10 samples, Class 1: 2 samples
label_ids = [0] * 10 + [1] * 2
weights = compute_class_weights(label_ids, n_classes=2)
assert weights[1] > weights[0]
# ---- compute_metrics_for_trainer tests ----
def test_compute_metrics_for_trainer_returns_macro_f1_key():
"""Must return a dict with 'macro_f1' key."""
import numpy as np
from scripts.finetune_classifier import compute_metrics_for_trainer
from transformers import EvalPrediction
logits = np.array([[2.0, 0.1], [0.1, 2.0], [2.0, 0.1]])
labels = np.array([0, 1, 0])
pred = EvalPrediction(predictions=logits, label_ids=labels)
result = compute_metrics_for_trainer(pred)
assert "macro_f1" in result
assert result["macro_f1"] == pytest.approx(1.0)
def test_compute_metrics_for_trainer_returns_accuracy_key():
"""Must also return 'accuracy' key."""
import numpy as np
from scripts.finetune_classifier import compute_metrics_for_trainer
from transformers import EvalPrediction
logits = np.array([[2.0, 0.1], [0.1, 2.0]])
labels = np.array([0, 1])
pred = EvalPrediction(predictions=logits, label_ids=labels)
result = compute_metrics_for_trainer(pred)
assert "accuracy" in result
assert result["accuracy"] == pytest.approx(1.0)
- Step 2: Run tests to verify they fail
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
Expected: ModuleNotFoundError — scripts.finetune_classifier not yet created.
Task 7: Implement data loading and class weights in finetune_classifier.py
Files:
- Create: scripts/finetune_classifier.py
- Step 1: Create finetune_classifier.py with data loading + class weights
Create scripts/finetune_classifier.py:
"""Fine-tune email classifiers on the labeled dataset.
CLI entry point. All prints use flush=True so stdout is SSE-streamable.
Usage:
python scripts/finetune_classifier.py --model deberta-small [--epochs 5]
Supported --model values: deberta-small, bge-m3
"""
from __future__ import annotations
import argparse
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
EvalPrediction,
Trainer,
TrainingArguments,
EarlyStoppingCallback,
)
sys.path.insert(0, str(Path(__file__).parent.parent))
from scripts.classifier_adapters import LABELS
_ROOT = Path(__file__).parent.parent
# ---------------------------------------------------------------------------
# Model registry
# ---------------------------------------------------------------------------
_MODEL_CONFIG: dict[str, dict[str, Any]] = {
"deberta-small": {
"base_model_id": "cross-encoder/nli-deberta-v3-small",
"max_tokens": 512,
"fp16": False,
"batch_size": 16,
"grad_accum": 1,
"gradient_checkpointing": False,
},
"bge-m3": {
"base_model_id": "MoritzLaurer/bge-m3-zeroshot-v2.0",
"max_tokens": 512,
"fp16": True,
"batch_size": 4,
"grad_accum": 4,
"gradient_checkpointing": True,
},
}
# ---------------------------------------------------------------------------
# Data preparation
# ---------------------------------------------------------------------------
def load_and_prepare_data(score_file: Path) -> tuple[list[str], list[str]]:
"""Load email_score.jsonl and return (texts, labels) ready for training.
- Drops rows with non-canonical labels (warns).
- Drops classes with < 2 total samples (warns).
- Warns (but continues) for classes with < 5 training samples.
- Input text format: 'subject [SEP] body[:400]'
"""
if not score_file.exists():
raise FileNotFoundError(
f"Score file not found: {score_file}\n"
"Run the label tool first to create email_score.jsonl"
)
lines = score_file.read_text(encoding="utf-8").splitlines()
rows = [json.loads(l) for l in lines if l.strip()]
# Drop non-canonical labels
canonical = set(LABELS)
kept = []
for r in rows:
lbl = r.get("label", "")
if lbl not in canonical:
print(f"[data] Dropping row with non-canonical label: {lbl!r}", flush=True)
continue
kept.append(r)
# Count samples per class
from collections import Counter
counts = Counter(r["label"] for r in kept)
# Drop classes with < 2 total samples
drop_classes = {lbl for lbl, cnt in counts.items() if cnt < 2}
for lbl in sorted(drop_classes):
print(
f"[data] WARNING: Dropping class {lbl!r} — only {counts[lbl]} total sample(s). "
"Need at least 2 for stratified split.",
flush=True,
)
kept = [r for r in kept if r["label"] not in drop_classes]
# Warn for classes with < 5 samples (after drops)
counts = Counter(r["label"] for r in kept)
for lbl, cnt in sorted(counts.items()):
if cnt < 5:
print(
f"[data] WARNING: Class {lbl!r} has only {cnt} sample(s). "
"Eval F1 for this class will be unreliable.",
flush=True,
)
texts = [f"{r['subject']} [SEP] {r['body'][:400]}" for r in kept]
labels = [r["label"] for r in kept]
return texts, labels
# ---------------------------------------------------------------------------
# Class weights
# ---------------------------------------------------------------------------
def compute_class_weights(label_ids: list[int], n_classes: int) -> torch.Tensor:
"""Compute per-class weights: total / (n_classes * class_count).
Returns a CPU float tensor of shape (n_classes,).
"""
from collections import Counter
counts = Counter(label_ids)
total = len(label_ids)
weights = []
for i in range(n_classes):
cnt = counts.get(i, 1) # avoid division by zero for unseen classes
weights.append(total / (n_classes * cnt))
return torch.tensor(weights, dtype=torch.float32)
# ---------------------------------------------------------------------------
# compute_metrics callback for Trainer
# ---------------------------------------------------------------------------
def compute_metrics_for_trainer(eval_pred: EvalPrediction) -> dict:
"""Trainer callback: EvalPrediction → {macro_f1, accuracy}.
Distinct from compute_metrics() in classifier_adapters.py (which operates
on string predictions). This one operates on numpy logits + label_ids.
"""
logits, labels = eval_pred
preds = logits.argmax(axis=-1)
return {
"macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
"accuracy": accuracy_score(labels, preds),
}
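Before running the tests, a quick worked example of the weight formula above (total / (n_classes * class_count)), using the same 10-vs-2 imbalance as the unit test. This is only an arithmetic sketch, not part of the file:

```python
from collections import Counter

label_ids = [0] * 10 + [1] * 2   # 12 samples: class 0 majority, class 1 minority
counts = Counter(label_ids)      # Counter({0: 10, 1: 2})
total, n_classes = len(label_ids), 2
weights = [total / (n_classes * counts[i]) for i in range(n_classes)]
# class 0: 12 / (2 * 10) = 0.6; class 1: 12 / (2 * 2) = 3.0 -> minority upweighted
assert weights == [0.6, 3.0]
```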
- Step 2: Run data pipeline tests
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
Expected: All 8 tests PASS. (Note: the compute_metrics_for_trainer tests require transformers — run in the job-seeker-classifiers env if needed.)
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -k "load_and_prepare or class_weights or compute_metrics_for_trainer" -v
Expected: All 8 tests PASS.
- Step 3: Commit
git add scripts/finetune_classifier.py tests/test_finetune.py
git commit -m "feat(avocet): add finetune data pipeline + class weights + compute_metrics"
Task 8: WeightedTrainer — write failing tests
Files:
- Modify: tests/test_finetune.py
- Step 1: Append WeightedTrainer tests
Append to tests/test_finetune.py:
# ---- WeightedTrainer tests ----
def test_weighted_trainer_compute_loss_returns_scalar():
"""compute_loss must return a scalar tensor when return_outputs=False."""
import torch
from unittest.mock import MagicMock
from scripts.finetune_classifier import WeightedTrainer
# Minimal mock model that returns logits
n_classes = 3
batch = 4
logits = torch.randn(batch, n_classes)
mock_outputs = MagicMock()
mock_outputs.logits = logits
mock_model = MagicMock(return_value=mock_outputs)
# Build a trainer with class weights
weights = torch.ones(n_classes)
trainer = WeightedTrainer.__new__(WeightedTrainer)
trainer.class_weights = weights
inputs = {
"input_ids": torch.zeros(batch, 10, dtype=torch.long),
"labels": torch.randint(0, n_classes, (batch,)),
}
loss = trainer.compute_loss(mock_model, inputs, return_outputs=False)
assert isinstance(loss, torch.Tensor)
assert loss.ndim == 0 # scalar
def test_weighted_trainer_compute_loss_accepts_kwargs():
"""compute_loss must not raise TypeError when called with num_items_in_batch kwarg.
Recent Transformers releases (including the pinned 4.57.3) pass this extra kwarg — **kwargs absorbs it.
"""
import torch
from unittest.mock import MagicMock
from scripts.finetune_classifier import WeightedTrainer
n_classes = 3
batch = 2
logits = torch.randn(batch, n_classes)
mock_outputs = MagicMock()
mock_outputs.logits = logits
mock_model = MagicMock(return_value=mock_outputs)
trainer = WeightedTrainer.__new__(WeightedTrainer)
trainer.class_weights = torch.ones(n_classes)
inputs = {
"input_ids": torch.zeros(batch, 5, dtype=torch.long),
"labels": torch.randint(0, n_classes, (batch,)),
}
# Must not raise TypeError
loss = trainer.compute_loss(mock_model, inputs, return_outputs=False,
num_items_in_batch=batch)
assert isinstance(loss, torch.Tensor)
def test_weighted_trainer_weighted_loss_differs_from_unweighted():
"""Weighted loss must differ from uniform-weight loss for imbalanced inputs."""
import torch
from unittest.mock import MagicMock
from scripts.finetune_classifier import WeightedTrainer
n_classes = 2
batch = 4
# All labels are class 0 (majority class scenario)
labels = torch.zeros(batch, dtype=torch.long)
logits = torch.zeros(batch, n_classes) # neutral logits
mock_outputs = MagicMock()
mock_outputs.logits = logits
# Uniform weights
trainer_uniform = WeightedTrainer.__new__(WeightedTrainer)
trainer_uniform.class_weights = torch.ones(n_classes)
inputs_uniform = {"input_ids": torch.zeros(batch, 5, dtype=torch.long), "labels": labels.clone()}
loss_uniform = trainer_uniform.compute_loss(MagicMock(return_value=mock_outputs),
inputs_uniform)
# Heavily imbalanced weights: class 1 much more important
trainer_weighted = WeightedTrainer.__new__(WeightedTrainer)
trainer_weighted.class_weights = torch.tensor([0.1, 10.0])
inputs_weighted = {"input_ids": torch.zeros(batch, 5, dtype=torch.long), "labels": labels.clone()}
mock_outputs2 = MagicMock()
mock_outputs2.logits = logits.clone()
loss_weighted = trainer_weighted.compute_loss(MagicMock(return_value=mock_outputs2),
inputs_weighted)
assert not torch.isclose(loss_uniform, loss_weighted)
def test_weighted_trainer_compute_loss_returns_outputs_when_requested():
"""compute_loss with return_outputs=True must return (loss, outputs) tuple."""
import torch
from unittest.mock import MagicMock
from scripts.finetune_classifier import WeightedTrainer
n_classes = 3
batch = 2
logits = torch.randn(batch, n_classes)
mock_outputs = MagicMock()
mock_outputs.logits = logits
mock_model = MagicMock(return_value=mock_outputs)
trainer = WeightedTrainer.__new__(WeightedTrainer)
trainer.class_weights = torch.ones(n_classes)
inputs = {
"input_ids": torch.zeros(batch, 5, dtype=torch.long),
"labels": torch.randint(0, n_classes, (batch,)),
}
result = trainer.compute_loss(mock_model, inputs, return_outputs=True)
assert isinstance(result, tuple)
loss, outputs = result
assert isinstance(loss, torch.Tensor)
- Step 2: Run tests to verify they fail
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "weighted_trainer" -v
Expected: ImportError — WeightedTrainer not yet defined.
Task 9: Implement WeightedTrainer
Files:
- Modify: scripts/finetune_classifier.py
- Step 1: Add WeightedTrainer class
Append to scripts/finetune_classifier.py after compute_metrics_for_trainer:
# ---------------------------------------------------------------------------
# Weighted Trainer
# ---------------------------------------------------------------------------
class WeightedTrainer(Trainer):
"""Trainer subclass that applies per-class weights to cross-entropy loss.
Handles class imbalance by down-weighting majority classes and up-weighting
minority classes. Attach class_weights (CPU float tensor) before training.
"""
def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
# **kwargs is required — absorbs the num_items_in_batch kwarg passed by recent Transformers releases (including the pinned 4.57.3).
# Do not remove it; removing it causes TypeError on the first training step.
labels = inputs.pop("labels")
outputs = model(**inputs)
# Move class_weights to the same device as logits — required for GPU training.
# class_weights is created on CPU; logits are on cuda:0 during training.
weight = self.class_weights.to(outputs.logits.device)
loss = F.cross_entropy(outputs.logits, labels, weight=weight)
return (loss, outputs) if return_outputs else loss
- Step 2: Run WeightedTrainer tests
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_finetune.py -k "weighted_trainer" -v
Expected: 4 tests PASS.
- Step 3: Run full test_finetune.py
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v
Expected: All tests PASS.
- Step 4: Commit
git add scripts/finetune_classifier.py tests/test_finetune.py
git commit -m "feat(avocet): add WeightedTrainer with device-aware class weights"
Task 10: Implement run_finetune() and CLI
Files:
- Modify: scripts/finetune_classifier.py
- Step 1: Add run_finetune() and CLI to finetune_classifier.py
Append to scripts/finetune_classifier.py:
# ---------------------------------------------------------------------------
# Training dataset wrapper
# ---------------------------------------------------------------------------
from torch.utils.data import Dataset as TorchDataset
class _EmailDataset(TorchDataset):
def __init__(self, encodings: dict, label_ids: list[int]) -> None:
self.encodings = encodings
self.label_ids = label_ids
def __len__(self) -> int:
return len(self.label_ids)
def __getitem__(self, idx: int) -> dict:
item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
item["labels"] = torch.tensor(self.label_ids[idx], dtype=torch.long)
return item
# ---------------------------------------------------------------------------
# Main training function
# ---------------------------------------------------------------------------
def run_finetune(model_key: str, epochs: int = 5) -> None:
"""Fine-tune the specified model on data/email_score.jsonl.
Saves model + tokenizer + training_info.json to models/avocet-{model_key}/.
All prints use flush=True for SSE streaming.
"""
if model_key not in _MODEL_CONFIG:
raise ValueError(f"Unknown model key: {model_key!r}. Choose from: {list(_MODEL_CONFIG)}")
config = _MODEL_CONFIG[model_key]
base_model_id = config["base_model_id"]
output_dir = _ROOT / "models" / f"avocet-{model_key}"
print(f"[finetune] Model: {model_key} ({base_model_id})", flush=True)
print(f"[finetune] Output: {output_dir}", flush=True)
if output_dir.exists():
print(f"[finetune] WARNING: {output_dir} already exists — will overwrite.", flush=True)
# --- Data ---
score_file = _ROOT / "data" / "email_score.jsonl"
print(f"[finetune] Loading data from {score_file} ...", flush=True)
texts, str_labels = load_and_prepare_data(score_file)
present_labels = sorted(set(str_labels))
label2id = {l: i for i, l in enumerate(present_labels)}
id2label = {i: l for l, i in label2id.items()}
n_classes = len(present_labels)
label_ids = [label2id[l] for l in str_labels]
print(f"[finetune] {len(texts)} samples, {n_classes} classes", flush=True)
# Stratified 80/20 split
(train_texts, val_texts,
train_label_ids, val_label_ids) = train_test_split(
texts, label_ids,
test_size=0.2,
stratify=label_ids,
random_state=42,
)
print(f"[finetune] Train: {len(train_texts)}, Val: {len(val_texts)}", flush=True)
# Warn for classes with < 5 training samples
from collections import Counter
train_counts = Counter(train_label_ids)
for cls_id, cnt in train_counts.items():
if cnt < 5:
print(
f"[finetune] WARNING: Class {id2label[cls_id]!r} has {cnt} training sample(s). "
"Eval F1 for this class will be unreliable.",
flush=True,
)
# --- Tokenize ---
print(f"[finetune] Loading tokenizer ...", flush=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
train_enc = tokenizer(train_texts, truncation=True,
max_length=config["max_tokens"], padding=True)
val_enc = tokenizer(val_texts, truncation=True,
max_length=config["max_tokens"], padding=True)
train_dataset = _EmailDataset(train_enc, train_label_ids)
val_dataset = _EmailDataset(val_enc, val_label_ids)
# --- Class weights ---
class_weights = compute_class_weights(train_label_ids, n_classes)
print(f"[finetune] Class weights: {dict(zip(present_labels, class_weights.tolist()))}", flush=True)
# --- Model ---
print(f"[finetune] Loading model ...", flush=True)
model = AutoModelForSequenceClassification.from_pretrained(
base_model_id,
num_labels=n_classes,
ignore_mismatched_sizes=True, # NLI head (3-class) → new head (n_classes)
id2label=id2label,
label2id=label2id,
)
if config["gradient_checkpointing"]:
model.gradient_checkpointing_enable()
# --- TrainingArguments ---
training_args = TrainingArguments(
output_dir=str(output_dir),
num_train_epochs=epochs,
per_device_train_batch_size=config["batch_size"],
per_device_eval_batch_size=config["batch_size"],
gradient_accumulation_steps=config["grad_accum"],
learning_rate=2e-5,
lr_scheduler_type="linear",
warmup_ratio=0.1,
fp16=config["fp16"],
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="macro_f1",
greater_is_better=True,
logging_steps=10,
report_to="none",
save_total_limit=2,
)
trainer = WeightedTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics_for_trainer,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.class_weights = class_weights
# --- Train ---
print(f"[finetune] Starting training ({epochs} epochs) ...", flush=True)
train_result = trainer.train()
print(f"[finetune] Training complete. Steps: {train_result.global_step}", flush=True)
# --- Evaluate ---
print(f"[finetune] Evaluating best checkpoint ...", flush=True)
metrics = trainer.evaluate()
val_macro_f1 = metrics.get("eval_macro_f1", 0.0)
val_accuracy = metrics.get("eval_accuracy", 0.0)
print(f"[finetune] Val macro-F1: {val_macro_f1:.4f}, Accuracy: {val_accuracy:.4f}", flush=True)
# --- Save model + tokenizer ---
print(f"[finetune] Saving model to {output_dir} ...", flush=True)
trainer.save_model(str(output_dir))
tokenizer.save_pretrained(str(output_dir))
# --- Write training_info.json ---
from collections import Counter
label_counts = dict(Counter(str_labels))
info = {
"name": f"avocet-{model_key}",
"base_model_id": base_model_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"epochs_run": epochs,
"val_macro_f1": round(val_macro_f1, 4),
"val_accuracy": round(val_accuracy, 4),
"sample_count": len(train_texts),
"label_counts": label_counts,
}
info_path = output_dir / "training_info.json"
info_path.write_text(json.dumps(info, indent=2), encoding="utf-8")
print(f"[finetune] Saved training_info.json: val_macro_f1={val_macro_f1:.4f}", flush=True)
print(f"[finetune] Done.", flush=True)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Fine-tune an email classifier")
parser.add_argument(
"--model",
choices=list(_MODEL_CONFIG),
required=True,
help="Model key to fine-tune",
)
parser.add_argument(
"--epochs",
type=int,
default=5,
help="Number of training epochs (default: 5)",
)
args = parser.parse_args()
run_finetune(args.model, args.epochs)
- Step 2: Run all finetune tests
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v
Expected: All tests PASS (run_finetune itself is tested in the integration test — Task 11).
- Step 3: Commit
git add scripts/finetune_classifier.py
git commit -m "feat(avocet): add run_finetune() training loop and CLI"
Task 11: Integration test — finetune on example data
Files:
- Modify: tests/test_finetune.py

The example file data/email_score.jsonl.example has 8 samples with 5 of the 10 canonical labels represented. Labels that appear with only a single sample exercise the < 2 total samples drop path.
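For reference, a minimal sketch of the row shape load_and_prepare_data() expects. The field values below are invented; only subject, body, and label are read:

```python
import json

# One hypothetical line of data/email_score.jsonl (or the .example file):
row = {
    "subject": "Interview availability",
    "body": "Hi, could you share a few times next week ...",
    "label": "interview_scheduled",  # must be one of the canonical LABELS
}
print(json.dumps(row))
```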
- Step 1: Append integration test
Append to tests/test_finetune.py:
# ---- Integration test ----
def test_integration_finetune_on_example_data(tmp_path):
"""Fine-tune deberta-small on example data for 1 epoch.
Uses data/email_score.jsonl.example (8 samples, 5 labels represented).
Labels represented by only a single sample must trigger the < 2 samples drop warning.
Verifies training_info.json is written with correct keys.
NOTE: This test requires the job-seeker-classifiers conda env and downloads
the deberta-small model on first run (~100MB). Skip in CI if model not cached.
Mark with @pytest.mark.slow to exclude from default runs.
"""
import shutil
from scripts.finetune_classifier import run_finetune, _ROOT
from scripts import finetune_classifier as ft_mod
example_file = _ROOT / "data" / "email_score.jsonl.example"
if not example_file.exists():
pytest.skip("email_score.jsonl.example not found")
# Patch _ROOT to use tmp_path so model saves there, not production models/
orig_root = ft_mod._ROOT
ft_mod._ROOT = tmp_path
# Also copy the example file to tmp_path/data/
(tmp_path / "data").mkdir()
shutil.copy(example_file, tmp_path / "data" / "email_score.jsonl")
try:
import io
from contextlib import redirect_stdout
captured = io.StringIO()
with redirect_stdout(captured):
run_finetune("deberta-small", epochs=1)
output = captured.getvalue()
finally:
ft_mod._ROOT = orig_root
# Labels with only a single sample should each trigger a drop warning
assert "< 2 total samples" in output or "WARNING: Dropping class" in output
# training_info.json must exist with correct keys
info_path = tmp_path / "models" / "avocet-deberta-small" / "training_info.json"
assert info_path.exists(), "training_info.json not written"
import json
info = json.loads(info_path.read_text())
for key in ("name", "base_model_id", "timestamp", "epochs_run",
"val_macro_f1", "val_accuracy", "sample_count", "label_counts"):
assert key in info, f"Missing key: {key}"
assert info["name"] == "avocet-deberta-small"
assert info["epochs_run"] == 1
- Step 2: Run unit tests only (fast path, no model download)
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py -v -k "not integration"
Expected: All non-integration tests PASS.
- Step 3: Run integration test (requires model download ~100MB)
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/test_finetune.py::test_integration_finetune_on_example_data -v -s
Expected: PASS. Check output for drop warnings for missing labels.
- Step 4: Commit
git add tests/test_finetune.py
git commit -m "test(avocet): add integration test for finetune_classifier on example data"
Chunk 3: API Endpoints + BenchmarkView UI
Task 12: API endpoints — write failing tests
Files:
- Modify: tests/test_api.py
- Step 1: Append finetune endpoint tests
Append to tests/test_api.py:
# ---- /api/finetune/status tests ----
def test_finetune_status_returns_empty_when_no_models_dir(client):
"""GET /api/finetune/status must return [] if models/ does not exist."""
r = client.get("/api/finetune/status")
assert r.status_code == 200
assert r.json() == []
def test_finetune_status_returns_training_info(client, tmp_path):
"""GET /api/finetune/status must return one entry per training_info.json found."""
import json
from app import api as api_module
# Create a fake models dir under tmp_path (data dir)
models_dir = api_module._DATA_DIR.parent / "models"
model_dir = models_dir / "avocet-deberta-small"
model_dir.mkdir(parents=True)
info = {
"name": "avocet-deberta-small",
"base_model_id": "cross-encoder/nli-deberta-v3-small",
"val_macro_f1": 0.712,
"timestamp": "2026-03-15T12:00:00Z",
"sample_count": 401,
}
(model_dir / "training_info.json").write_text(json.dumps(info))
r = client.get("/api/finetune/status")
assert r.status_code == 200
data = r.json()
assert len(data) == 1
assert data[0]["name"] == "avocet-deberta-small"
assert data[0]["val_macro_f1"] == pytest.approx(0.712)
def test_finetune_run_streams_sse_events(client):
"""GET /api/finetune/run must return text/event-stream content type."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter(["Training epoch 1\n", "Done\n"])
mock_proc.returncode = 0
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert r.status_code == 200
assert "text/event-stream" in r.headers.get("content-type", "")
def test_finetune_run_emits_complete_on_success(client):
"""GET /api/finetune/run must emit a complete event on clean exit."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter(["progress line\n"])
mock_proc.returncode = 0
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert '{"type": "complete"}' in r.text
def test_finetune_run_emits_error_on_nonzero_exit(client):
"""GET /api/finetune/run must emit an error event on non-zero exit."""
import subprocess
from unittest.mock import patch, MagicMock
mock_proc = MagicMock()
mock_proc.stdout = iter([])
mock_proc.returncode = 1
mock_proc.wait = MagicMock()
with patch("subprocess.Popen", return_value=mock_proc):
r = client.get("/api/finetune/run?model=deberta-small&epochs=1")
assert '"type": "error"' in r.text
- Step 2: Run tests to verify they fail
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -k "finetune" -v
Expected: 404 or connection errors — endpoints not yet defined.
Task 13: Implement finetune API endpoints
Files:
- Modify: app/api.py
- Step 1: Add finetune endpoints to api.py
In app/api.py, add after the benchmark endpoints section (after the run_benchmark function, before the fetch_stream function):
# ---------------------------------------------------------------------------
# Fine-tune endpoints
# ---------------------------------------------------------------------------
@app.get("/api/finetune/status")
def get_finetune_status():
"""Scan models/ for training_info.json files. Returns [] if none exist."""
models_dir = _ROOT / "models"
if not models_dir.exists():
return []
results = []
for sub in models_dir.iterdir():
if not sub.is_dir():
continue
info_path = sub / "training_info.json"
if not info_path.exists():
continue
try:
info = json.loads(info_path.read_text(encoding="utf-8"))
results.append(info)
except Exception:
pass
return results
@app.get("/api/finetune/run")
def run_finetune(model: str = "deberta-small", epochs: int = 5):
"""Spawn finetune_classifier.py and stream stdout as SSE progress events."""
import subprocess
python_bin = "/devl/miniconda3/envs/job-seeker-classifiers/bin/python"
script = str(_ROOT / "scripts" / "finetune_classifier.py")
cmd = [python_bin, script, "--model", model, "--epochs", str(epochs)]
def generate():
try:
proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
cwd=str(_ROOT),
)
for line in proc.stdout:
line = line.rstrip()
if line:
yield f"data: {json.dumps({'type': 'progress', 'message': line})}\n\n"
proc.wait()
if proc.returncode == 0:
yield f"data: {json.dumps({'type': 'complete'})}\n\n"
else:
yield f"data: {json.dumps({'type': 'error', 'message': f'Process exited with code {proc.returncode}'})}\n\n"
except Exception as exc:
yield f"data: {json.dumps({'type': 'error', 'message': str(exc)})}\n\n"
return StreamingResponse(
generate(),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
)
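For reference, a small sketch of what one frame of this stream looks like on the wire, mirroring the yield statements above; useful when wiring up the EventSource consumer in Task 14. The log line shown is invented:

```python
import json

def sse_frame(payload: dict) -> str:
    # One "data:" line followed by a blank line, as emitted by generate() above.
    return f"data: {json.dumps(payload)}\n\n"

print(sse_frame({"type": "progress", "message": "[finetune] Train: 320, Val: 81"}), end="")
print(sse_frame({"type": "complete"}), end="")  # or {"type": "error", "message": "..."} on failure
```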
- Step 2: Run finetune API tests
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -k "finetune" -v
Expected: All 5 finetune tests PASS.
- Step 3: Run full API test suite
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_api.py -v
Expected: All tests PASS.
- Step 4: Commit
git add app/api.py tests/test_api.py
git commit -m "feat(avocet): add /api/finetune/status and /api/finetune/run endpoints"
Task 14: BenchmarkView.vue — trained models badge row + fine-tune section
Files:
- Modify: web/src/views/BenchmarkView.vue
The BenchmarkView already has:
- Macro-F1 bar chart
- Latency bar chart
- Per-label F1 heatmap
- Benchmark run button with SSE log
Add:
- Trained models badge row at the top (conditional on fineTunedModels.length > 0)
- Fine-tune section (collapsible, at the bottom): model dropdown, epoch input, run button → SSE log; on complete, auto-trigger a benchmark run
- Step 1: Read current BenchmarkView.vue
cat web/src/views/BenchmarkView.vue
(Use this to understand the existing structure before editing — identify where to insert each new section.)
- Step 2: Add fineTunedModels state and fetch logic
In the <script setup> section, add after the existing reactive state:
// Fine-tuned models
const fineTunedModels = ref<Array<{
name: string
base_model: string
val_macro_f1: number
timestamp: string
sample_count: number
}>>([])
const finetune = reactive({
model: 'deberta-small',
epochs: 5,
running: false,
log: [] as string[],
es: null as EventSource | null,
})
async function fetchFineTunedModels() {
try {
const r = await fetch('/api/finetune/status')
fineTunedModels.value = await r.json()
} catch { /* silent */ }
}
function runFinetune() {
if (finetune.running) return
finetune.running = true
finetune.log = []
finetune.es?.close()
const url = `/api/finetune/run?model=${finetune.model}&epochs=${finetune.epochs}`
finetune.es = new EventSource(url)
finetune.es.onmessage = (e) => {
const msg = JSON.parse(e.data)
if (msg.type === 'progress') {
finetune.log.push(msg.message)
} else if (msg.type === 'complete') {
finetune.running = false
finetune.es?.close()
fetchFineTunedModels()
runBenchmark() // auto-trigger benchmark to update charts
} else if (msg.type === 'error') {
finetune.running = false
finetune.es?.close()
finetune.log.push(`ERROR: ${msg.message}`)
}
}
}
Add fetchFineTunedModels() to the onMounted call alongside the existing fetchResults().
- Step 3: Add trained models badge row to template
In the <template>, add at the very top of the main content area (before the chart sections), conditional on fineTunedModels.length > 0:
<!-- Trained models badge row -->
<div v-if="fineTunedModels.length > 0" class="trained-models-row">
<span class="trained-label">Trained models:</span>
<span
v-for="m in fineTunedModels"
:key="m.name"
class="trained-badge"
:title="`Base: ${m.base_model_id} | ${m.sample_count} samples | ${m.timestamp}`"
>
{{ m.name }}
<span class="trained-f1">F1 {{ (m.val_macro_f1 * 100).toFixed(1) }}%</span>
</span>
</div>
- Step 4: Add fine-tune collapsible section to template
Add at the bottom of the main content area, after the benchmark log section:
<!-- Fine-tune section -->
<details class="finetune-section">
<summary class="finetune-summary">Fine-tune a model</summary>
<div class="finetune-controls">
<label class="ft-label">
Model
<select v-model="finetune.model" class="ft-select">
<option value="deberta-small">deberta-small (100M, fast)</option>
<option value="bge-m3">bge-m3 (600M, slow — stop Peregrine vLLM first)</option>
</select>
</label>
<label class="ft-label">
Epochs
<input
v-model.number="finetune.epochs"
type="number"
min="1"
max="20"
class="ft-epochs"
/>
</label>
<button
class="ft-run-btn"
:disabled="finetune.running"
@click="runFinetune"
>
{{ finetune.running ? 'Training…' : 'Run fine-tune' }}
</button>
</div>
<div v-if="finetune.log.length > 0" class="ft-log">
<div v-for="(line, i) in finetune.log" :key="i" class="ft-log-line">{{ line }}</div>
</div>
</details>
- Step 5: Add styles
Add to the <style scoped> section:
/* Trained models badge row */
.trained-models-row {
display: flex;
flex-wrap: wrap;
align-items: center;
gap: 0.5rem;
padding: 0.75rem 1rem;
background: var(--color-surface-raised, #e4ebf5);
border-radius: 0.5rem;
margin-bottom: 1rem;
}
.trained-label {
font-size: 0.8rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
text-transform: uppercase;
letter-spacing: 0.04em;
}
.trained-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
padding: 0.25rem 0.6rem;
background: var(--app-primary, #2A6080);
color: white;
border-radius: 1rem;
font-size: 0.82rem;
cursor: default;
}
.trained-f1 {
background: rgba(255,255,255,0.2);
border-radius: 0.75rem;
padding: 0.1rem 0.4rem;
font-size: 0.75rem;
font-weight: 700;
}
/* Fine-tune section */
.finetune-section {
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.5rem;
padding: 0;
margin-top: 1.5rem;
}
.finetune-summary {
padding: 0.75rem 1rem;
cursor: pointer;
font-weight: 600;
color: var(--color-text, #1a2338);
list-style: none;
user-select: none;
}
.finetune-summary::-webkit-details-marker { display: none; }
.finetune-summary::before {
content: '▶ ';
font-size: 0.7rem;
color: var(--color-text-secondary, #6b7a99);
}
details[open] .finetune-summary::before { content: '▼ '; }
.finetune-controls {
display: flex;
flex-wrap: wrap;
gap: 1rem;
align-items: flex-end;
padding: 0.75rem 1rem 1rem;
border-top: 1px solid var(--color-border, #d0d7e8);
}
.ft-label {
display: flex;
flex-direction: column;
gap: 0.3rem;
font-size: 0.82rem;
font-weight: 600;
color: var(--color-text-secondary, #6b7a99);
}
.ft-select {
padding: 0.35rem 0.6rem;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
background: var(--color-surface, #f0f4fb);
font-size: 0.9rem;
color: var(--color-text, #1a2338);
min-width: 260px;
}
.ft-epochs {
width: 70px;
padding: 0.35rem 0.5rem;
border: 1px solid var(--color-border, #d0d7e8);
border-radius: 0.375rem;
background: var(--color-surface, #f0f4fb);
font-size: 0.9rem;
color: var(--color-text, #1a2338);
text-align: center;
}
.ft-run-btn {
padding: 0.45rem 1.2rem;
background: var(--app-primary, #2A6080);
color: white;
border: none;
border-radius: 0.375rem;
font-size: 0.9rem;
font-weight: 600;
cursor: pointer;
transition: opacity 0.15s;
}
.ft-run-btn:disabled {
opacity: 0.55;
cursor: not-allowed;
}
.ft-log {
margin: 0 1rem 1rem;
padding: 0.5rem 0.75rem;
background: var(--color-surface, #f0f4fb);
border-radius: 0.375rem;
max-height: 260px;
overflow-y: auto;
font-family: var(--font-mono, monospace);
font-size: 0.78rem;
}
.ft-log-line {
line-height: 1.6;
color: var(--color-text, #1a2338);
white-space: pre-wrap;
word-break: break-all;
}
- Step 6: Build and verify
cd /Library/Development/CircuitForge/avocet/web && npm run build
Expected: Build succeeds with no errors.
- Step 7: Start dev server and verify in browser
./manage.sh start-api
Open http://localhost:8503 → navigate to Benchmark:
- Badge row not visible (no trained models yet — correct)
- Fine-tune section visible as a collapsed <details>
- Click to expand → dropdown and epoch input visible
- Without a model trained, /api/finetune/status returns [] (correct)
- Step 8: Commit
git add web/src/views/BenchmarkView.vue
git commit -m "feat(avocet): add fine-tune section and trained models badge row to BenchmarkView"
Task 15: Final verification — full test suite
- Step 1: Run full test suite
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v -k "not integration"
Expected: All non-integration tests PASS.
- Step 2: Run in classifier env (catches transformers-specific tests)
/devl/miniconda3/envs/job-seeker-classifiers/bin/pytest tests/ -v -k "not integration"
Expected: All non-integration tests PASS.
- Step 3: Build Vue SPA
cd /Library/Development/CircuitForge/avocet/web && npm run build
Expected: No TypeScript or build errors.
- Step 4: Final commit
git add -A
git status # verify nothing unexpected staged
git commit -m "feat(avocet): finetune classifier feature complete"