peregrine/docs/plans/2026-02-26-email-classifier-benchmark-plan.md


Email Classifier Benchmark Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Build a benchmark harness that evaluates 9 HuggingFace classifiers against our 6 email labels in two modes: live IMAP visual table (--compare) and labeled-JSONL metrics (--score).

Architecture: Standalone scripts (benchmark_classifier.py + classifier_adapters.py) in a new isolated conda env (job-seeker-classifiers). Three adapter types (ZeroShot NLI, GLiClass, Reranker) normalize each model's output to our 6 labels. IMAP fetching uses stdlib only — no dependency on imap_sync.py or LLMRouter.

Tech Stack: Python 3.11, transformers zero-shot pipeline, gliclass, FlagEmbedding, torch (CUDA auto-select), pytest, unittest.mock


Labels constant (referenced throughout)

LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]
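Each of the three adapter types returns a differently shaped raw output, and all must normalize to exactly one of these six strings. An illustrative sketch of the three normalizations (fabricated raw outputs, stdlib only; the real shapes come from each library):

```python
LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]

# Three raw output shapes the adapters normalize (illustrative values):
zeroshot_out = {"labels": ["rejected", "neutral"], "scores": [0.9, 0.1]}
gliclass_out = [{"label": "rejected", "score": 0.9}, {"label": "neutral", "score": 0.1}]
reranker_scores = [0.1, 0.05, 0.85, 0.05, 0.02, 0.03]  # one score per LABELS entry

picks = [
    zeroshot_out["labels"][0],                             # already sorted by score
    max(gliclass_out, key=lambda r: r["score"])["label"],  # explicit argmax
    LABELS[reranker_scores.index(max(reranker_scores))],   # index-aligned argmax
]
print(picks)  # → ['rejected', 'rejected', 'rejected']
```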

Task 1: Conda environment

Files:

  • Create: scripts/classifier_service/environment.yml

Step 1: Create the environment file

name: job-seeker-classifiers
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pip
  - pip:
    - torch>=2.1.0
    - transformers>=4.40.0
    - accelerate>=0.26.0
    - sentencepiece>=0.1.99
    - protobuf>=4.25.0
    - gliclass>=0.1.0
    - FlagEmbedding>=1.2.0
    - pyyaml>=6.0
    - tqdm>=4.66.0
    - pytest>=8.0.0

Step 2: Create the environment

conda env create -f scripts/classifier_service/environment.yml

Expected: env job-seeker-classifiers created successfully.

Step 3: Verify torch + CUDA

conda run -n job-seeker-classifiers python -c "import torch; print(torch.cuda.is_available())"

Expected: True (if GPU available).

Step 4: Commit

git add scripts/classifier_service/environment.yml
git commit -m "feat: add job-seeker-classifiers conda env for HF classifier benchmark"

Task 2: Data directory + gitignore + example scoring file

Files:

  • Modify: .gitignore
  • Create: data/email_score.jsonl.example

Step 1: Update .gitignore

Add to .gitignore:

data/email_score.jsonl
data/email_compare_sample.jsonl

Step 2: Create the example scoring file

Create data/email_score.jsonl.example with fake-but-realistic emails:

{"subject": "Interview Invitation — Senior Engineer", "body": "Hi Alex, we'd love to schedule a 30-min phone screen. Are you available Thursday at 2pm? Please reply to confirm.", "label": "interview_scheduled"}
{"subject": "Your application to Acme Corp", "body": "Thank you for your interest in the Senior Engineer role. After careful consideration, we have decided to move forward with other candidates whose experience more closely matches our current needs.", "label": "rejected"}
{"subject": "Offer Letter — Product Manager at Initech", "body": "Dear Alex, we are thrilled to extend an offer of employment for the Product Manager position. Please find the attached offer letter outlining compensation and start date.", "label": "offer_received"}
{"subject": "Quick question about your background", "body": "Hi Alex, I came across your profile and would love to connect. We have a few roles that seem like a great match. Would you be open to a brief chat this week?", "label": "positive_response"}
{"subject": "Company Culture Survey — Acme Corp", "body": "Alex, as part of our evaluation process, we invite all candidates to complete our culture fit assessment. The survey takes approximately 15 minutes. Please click the link below.", "label": "survey_received"}
{"subject": "Application Received — DataCo", "body": "Thank you for submitting your application for the Data Engineer role at DataCo. We have received your materials and will be in touch if your qualifications match our needs.", "label": "neutral"}
{"subject": "Following up on your application", "body": "Hi Alex, I wanted to follow up on your recent application. Your background looks interesting and we'd like to learn more. Can we set up a quick call?", "label": "positive_response"}
{"subject": "We're moving forward with other candidates", "body": "Dear Alex, thank you for taking the time to interview with us. After thoughtful consideration, we have decided not to move forward with your candidacy at this time.", "label": "rejected"}
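Before committing a hand-labeled file, a quick sanity check that every row parses and uses a known label can save a debugging cycle later. This is a hypothetical helper, not one of the plan's scripts:

```python
import json

LABELS = {
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
}

def validate_jsonl_lines(lines):
    """Return (line_no, problem) tuples; an empty list means the file is clean."""
    problems = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are tolerated by the loader
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc}"))
            continue
        missing = {"subject", "body", "label"} - row.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif row["label"] not in LABELS:
            problems.append((i, f"unknown label: {row['label']!r}"))
    return problems

sample = [
    '{"subject": "Hi", "body": "Body", "label": "neutral"}',
    '{"subject": "Oops", "body": "Body", "label": "maybe"}',
]
print(validate_jsonl_lines(sample))  # → [(2, "unknown label: 'maybe'")]
```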

Step 3: Commit

git add .gitignore data/email_score.jsonl.example
git commit -m "feat: add scoring JSONL example and gitignore for benchmark data files"

Task 3: ClassifierAdapter ABC + compute_metrics()

Files:

  • Create: scripts/classifier_adapters.py
  • Create: tests/test_classifier_adapters.py

Step 1: Write the failing tests

Create tests/test_classifier_adapters.py:

"""Tests for classifier_adapters — no model downloads required."""
import pytest


def test_labels_constant_has_six_items():
    from scripts.classifier_adapters import LABELS
    assert len(LABELS) == 6
    assert "interview_scheduled" in LABELS
    assert "neutral" in LABELS


def test_compute_metrics_perfect_predictions():
    from scripts.classifier_adapters import compute_metrics, LABELS
    gold  = ["rejected", "interview_scheduled", "neutral"]
    preds = ["rejected", "interview_scheduled", "neutral"]
    m = compute_metrics(preds, gold, LABELS)
    assert m["rejected"]["f1"] == pytest.approx(1.0)
    assert m["__accuracy__"] == pytest.approx(1.0)
    assert m["__macro_f1__"] == pytest.approx(1.0)


def test_compute_metrics_all_wrong():
    from scripts.classifier_adapters import compute_metrics, LABELS
    gold  = ["rejected",  "rejected"]
    preds = ["neutral",   "interview_scheduled"]
    m = compute_metrics(preds, gold, LABELS)
    assert m["rejected"]["recall"] == pytest.approx(0.0)
    assert m["__accuracy__"] == pytest.approx(0.0)


def test_compute_metrics_partial():
    from scripts.classifier_adapters import compute_metrics, LABELS
    gold  = ["rejected", "neutral", "rejected"]
    preds = ["rejected", "neutral", "interview_scheduled"]
    m = compute_metrics(preds, gold, LABELS)
    assert m["rejected"]["precision"] == pytest.approx(1.0)
    assert m["rejected"]["recall"]    == pytest.approx(0.5)
    assert m["neutral"]["f1"]         == pytest.approx(1.0)
    assert m["__accuracy__"]          == pytest.approx(2 / 3)


def test_compute_metrics_empty():
    from scripts.classifier_adapters import compute_metrics, LABELS
    m = compute_metrics([], [], LABELS)
    assert m["__accuracy__"] == pytest.approx(0.0)


def test_classifier_adapter_is_abstract():
    from scripts.classifier_adapters import ClassifierAdapter
    with pytest.raises(TypeError):
        ClassifierAdapter()

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v

Expected: ModuleNotFoundError: No module named 'scripts.classifier_adapters'

Step 3: Create scripts/classifier_adapters.py with ABC + metrics

"""Classifier adapters for email classification benchmark.

Each adapter wraps a HuggingFace model and normalizes output to LABELS.
Models load lazily on first classify() call; call unload() to free VRAM.
"""
from __future__ import annotations

import abc
from collections import defaultdict
from typing import Any

LABELS: list[str] = [
    "interview_scheduled",
    "offer_received",
    "rejected",
    "positive_response",
    "survey_received",
    "neutral",
]

# Natural-language descriptions used by the RerankerAdapter.
LABEL_DESCRIPTIONS: dict[str, str] = {
    "interview_scheduled": "scheduling an interview, phone screen, or video call",
    "offer_received": "a formal job offer or employment offer letter",
    "rejected": "application rejected or not moving forward with candidacy",
    "positive_response": "positive recruiter interest or request to connect",
    "survey_received": "invitation to complete a culture-fit survey or assessment",
    "neutral": "automated ATS confirmation or unrelated email",
}


def _cuda_available() -> bool:
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False


def compute_metrics(
    predictions: list[str],
    gold: list[str],
    labels: list[str],
) -> dict[str, Any]:
    """Return per-label precision/recall/F1 + macro_f1 + accuracy."""
    tp: dict[str, int] = defaultdict(int)
    fp: dict[str, int] = defaultdict(int)
    fn: dict[str, int] = defaultdict(int)

    for pred, true in zip(predictions, gold):
        if pred == true:
            tp[pred] += 1
        else:
            fp[pred] += 1
            fn[true] += 1

    result: dict[str, Any] = {}
    for label in labels:
        denom_p = tp[label] + fp[label]
        denom_r = tp[label] + fn[label]
        p = tp[label] / denom_p if denom_p else 0.0
        r = tp[label] / denom_r if denom_r else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        result[label] = {
            "precision": p,
            "recall": r,
            "f1": f1,
            "support": denom_r,
        }

    result["__macro_f1__"] = (
        sum(v["f1"] for v in result.values() if isinstance(v, dict)) / len(labels)
    )
    result["__accuracy__"] = sum(tp.values()) / len(predictions) if predictions else 0.0
    return result


class ClassifierAdapter(abc.ABC):
    """Abstract base for all email classifier adapters."""

    @property
    @abc.abstractmethod
    def name(self) -> str: ...

    @property
    @abc.abstractmethod
    def model_id(self) -> str: ...

    @abc.abstractmethod
    def load(self) -> None:
        """Download/load the model into memory."""

    @abc.abstractmethod
    def unload(self) -> None:
        """Release model from memory."""

    @abc.abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return one of LABELS for the given email."""

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v

Expected: 6 tests pass.
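For intuition on what the partial-credit test asserts: with gold = [rejected, neutral, rejected] and preds = [rejected, neutral, interview_scheduled], the "rejected" label has 1 true positive, 0 false positives, and 1 false negative. The same arithmetic, standalone (no module import needed):

```python
gold  = ["rejected", "neutral", "rejected"]
preds = ["rejected", "neutral", "interview_scheduled"]

# Per-label counts for "rejected":
tp = sum(p == g == "rejected" for p, g in zip(preds, gold))                  # 1
fp = sum(p == "rejected" and g != "rejected" for p, g in zip(preds, gold))   # 0
fn = sum(g == "rejected" and p != "rejected" for p, g in zip(preds, gold))   # 1

precision = tp / (tp + fp)                          # 1.0
recall    = tp / (tp + fn)                          # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.666...
print(precision, recall, round(f1, 3))  # → 1.0 0.5 0.667
```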

Step 5: Commit

git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat: ClassifierAdapter ABC + compute_metrics() with full test coverage"

Task 4: ZeroShotAdapter

Files:

  • Modify: scripts/classifier_adapters.py
  • Modify: tests/test_classifier_adapters.py

Step 1: Add failing tests

Append to tests/test_classifier_adapters.py:

def test_zeroshot_adapter_classify_mocked():
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import ZeroShotAdapter

    mock_pipeline = MagicMock()
    # pipeline() is a factory: calling it returns the pipeline object, which is
    # itself called with (text, labels, ...). Mock both layers so classify() works.
    mock_pipeline.return_value = MagicMock(return_value={
        "labels": ["rejected", "neutral", "interview_scheduled"],
        "scores": [0.85, 0.10, 0.05],
    })

    with patch("scripts.classifier_adapters.pipeline", mock_pipeline):
        adapter = ZeroShotAdapter("test-zs", "some/model")
        adapter.load()
        result = adapter.classify("We went with another candidate", "Thank you for applying.")

    assert result == "rejected"
    call_args = mock_pipeline.return_value.call_args
    assert "We went with another candidate" in call_args[0][0]


def test_zeroshot_adapter_unload_clears_pipeline():
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import ZeroShotAdapter

    with patch("scripts.classifier_adapters.pipeline", MagicMock()):
        adapter = ZeroShotAdapter("test-zs", "some/model")
        adapter.load()
        assert adapter._pipeline is not None
        adapter.unload()
        assert adapter._pipeline is None


def test_zeroshot_adapter_lazy_loads():
    """classify() loads the model automatically if not already loaded."""
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import ZeroShotAdapter

    mock_pipe_factory = MagicMock()
    mock_pipe_factory.return_value = MagicMock(return_value={
        "labels": ["neutral"], "scores": [1.0]
    })

    with patch("scripts.classifier_adapters.pipeline", mock_pipe_factory):
        adapter = ZeroShotAdapter("test-zs", "some/model")
        adapter.classify("subject", "body")  # no explicit load() call

    mock_pipe_factory.assert_called_once()

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py::test_zeroshot_adapter_classify_mocked -v

Expected: AttributeError — ZeroShotAdapter not defined.

Step 3: Add import shim + ZeroShotAdapter to classifier_adapters.py

Add after the _cuda_available() helper:

# Lazy import shim — lets tests patch 'scripts.classifier_adapters.pipeline'
try:
    from transformers import pipeline  # type: ignore[assignment]
except ImportError:
    pipeline = None  # type: ignore[assignment]

Add after ClassifierAdapter:

class ZeroShotAdapter(ClassifierAdapter):
    """Wraps any transformers zero-shot-classification pipeline."""

    def __init__(self, name: str, model_id: str) -> None:
        self._name = name
        self._model_id = model_id
        self._pipeline: Any = None

    @property
    def name(self) -> str:
        return self._name

    @property
    def model_id(self) -> str:
        return self._model_id

    def load(self) -> None:
        if pipeline is None:
            raise ImportError("transformers not installed (pip install transformers)")
        device = 0 if _cuda_available() else -1  # 0 = first GPU, -1 = CPU
        # Use the module-level `pipeline` shim so tests can patch it.
        self._pipeline = pipeline(
            "zero-shot-classification",
            model=self._model_id,
            device=device,
        )

    def unload(self) -> None:
        self._pipeline = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipeline is None:
            self.load()
        text = f"Subject: {subject}\n\n{body[:600]}"
        result = self._pipeline(text, LABELS, multi_label=False)
        return result["labels"][0]

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v

Expected: 9 tests pass.

Step 5: Commit

git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat: ZeroShotAdapter — wraps transformers zero-shot-classification pipeline"

Task 5: GLiClassAdapter

Files:

  • Modify: scripts/classifier_adapters.py
  • Modify: tests/test_classifier_adapters.py

Step 1: Add failing tests

Append to tests/test_classifier_adapters.py:

def test_gliclass_adapter_classify_mocked():
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import GLiClassAdapter

    mock_pipeline_instance = MagicMock()
    mock_pipeline_instance.return_value = [[
        {"label": "interview_scheduled", "score": 0.91},
        {"label": "neutral", "score": 0.05},
        {"label": "rejected", "score": 0.04},
    ]]

    with patch("scripts.classifier_adapters.GLiClassModel") as _mc, \
         patch("scripts.classifier_adapters.AutoTokenizer") as _mt, \
         patch("scripts.classifier_adapters.ZeroShotClassificationPipeline",
               return_value=mock_pipeline_instance):
        adapter = GLiClassAdapter("test-gli", "some/gliclass-model")
        adapter.load()
        result = adapter.classify("Interview invitation", "Let's schedule a call.")

    assert result == "interview_scheduled"


def test_gliclass_adapter_returns_highest_score():
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import GLiClassAdapter

    mock_pipeline_instance = MagicMock()
    mock_pipeline_instance.return_value = [[
        {"label": "neutral", "score": 0.02},
        {"label": "offer_received", "score": 0.88},
        {"label": "rejected", "score": 0.10},
    ]]

    with patch("scripts.classifier_adapters.GLiClassModel"), \
         patch("scripts.classifier_adapters.AutoTokenizer"), \
         patch("scripts.classifier_adapters.ZeroShotClassificationPipeline",
               return_value=mock_pipeline_instance):
        adapter = GLiClassAdapter("test-gli", "some/model")
        adapter.load()
        result = adapter.classify("Offer letter enclosed", "Dear Alex, we are pleased to offer...")

    assert result == "offer_received"

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py::test_gliclass_adapter_classify_mocked -v

Expected: AttributeError — GLiClassAdapter not defined.

Step 3: Add gliclass import shims + GLiClassAdapter

Add import shims near top of scripts/classifier_adapters.py (after the pipeline shim):

try:
    from gliclass import GLiClassModel, ZeroShotClassificationPipeline  # type: ignore
    from transformers import AutoTokenizer
except ImportError:
    GLiClassModel = None  # type: ignore
    ZeroShotClassificationPipeline = None  # type: ignore
    AutoTokenizer = None  # type: ignore

Add class after ZeroShotAdapter:

class GLiClassAdapter(ClassifierAdapter):
    """Wraps knowledgator GLiClass models via the gliclass library."""

    def __init__(self, name: str, model_id: str) -> None:
        self._name = name
        self._model_id = model_id
        self._pipeline: Any = None

    @property
    def name(self) -> str:
        return self._name

    @property
    def model_id(self) -> str:
        return self._model_id

    def load(self) -> None:
        if GLiClassModel is None:
            raise ImportError("gliclass not installed — run: pip install gliclass")
        device = "cuda:0" if _cuda_available() else "cpu"
        model = GLiClassModel.from_pretrained(self._model_id)
        tokenizer = AutoTokenizer.from_pretrained(self._model_id)
        self._pipeline = ZeroShotClassificationPipeline(
            model,
            tokenizer,
            classification_type="single-label",
            device=device,
        )

    def unload(self) -> None:
        self._pipeline = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipeline is None:
            self.load()
        text = f"Subject: {subject}\n\n{body[:600]}"
        # threshold=0.0 ensures all labels are scored; we pick the max.
        results = self._pipeline(text, LABELS, threshold=0.0)[0]
        return max(results, key=lambda r: r["score"])["label"]

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v

Expected: 11 tests pass.

Step 5: Commit

git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat: GLiClassAdapter — wraps gliclass zero-shot pipeline"

Task 6: RerankerAdapter

Files:

  • Modify: scripts/classifier_adapters.py
  • Modify: tests/test_classifier_adapters.py

Step 1: Add failing tests

Append to tests/test_classifier_adapters.py:

def test_reranker_adapter_picks_highest_score():
    from unittest.mock import MagicMock, patch
    from scripts.classifier_adapters import RerankerAdapter, LABELS

    mock_reranker = MagicMock()
    # Scores for each label pair — "rejected" (index 2) gets the highest
    mock_reranker.compute_score.return_value = [0.1, 0.05, 0.85, 0.05, 0.02, 0.03]

    with patch("scripts.classifier_adapters.FlagReranker", return_value=mock_reranker):
        adapter = RerankerAdapter("test-rr", "BAAI/bge-reranker-v2-m3")
        adapter.load()
        result = adapter.classify(
            "We regret to inform you",
            "After careful consideration we are moving forward with other candidates.",
        )

    assert result == "rejected"
    pairs = mock_reranker.compute_score.call_args[0][0]
    assert len(pairs) == len(LABELS)


def test_reranker_adapter_descriptions_cover_all_labels():
    from scripts.classifier_adapters import LABEL_DESCRIPTIONS, LABELS
    assert set(LABEL_DESCRIPTIONS.keys()) == set(LABELS)

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py::test_reranker_adapter_picks_highest_score -v

Expected: AttributeError — RerankerAdapter not defined.

Step 3: Add FlagEmbedding import shim + RerankerAdapter

Add import shim in scripts/classifier_adapters.py:

try:
    from FlagEmbedding import FlagReranker  # type: ignore
except ImportError:
    FlagReranker = None  # type: ignore

Add class after GLiClassAdapter:

class RerankerAdapter(ClassifierAdapter):
    """Uses a BGE reranker to score (email, label_description) pairs."""

    def __init__(self, name: str, model_id: str) -> None:
        self._name = name
        self._model_id = model_id
        self._reranker: Any = None

    @property
    def name(self) -> str:
        return self._name

    @property
    def model_id(self) -> str:
        return self._model_id

    def load(self) -> None:
        if FlagReranker is None:
            raise ImportError("FlagEmbedding not installed — run: pip install FlagEmbedding")
        self._reranker = FlagReranker(self._model_id, use_fp16=_cuda_available())

    def unload(self) -> None:
        self._reranker = None

    def classify(self, subject: str, body: str) -> str:
        if self._reranker is None:
            self.load()
        text = f"Subject: {subject}\n\n{body[:600]}"
        pairs = [[text, LABEL_DESCRIPTIONS[label]] for label in LABELS]
        scores: list[float] = self._reranker.compute_score(pairs, normalize=True)
        return LABELS[scores.index(max(scores))]

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py -v

Expected: 13 tests pass.
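The RerankerAdapter decision rule, score each (email, label description) pair and take the argmax, can be sketched without FlagEmbedding by substituting a toy scorer. Word overlap stands in for the reranker's semantic relevance score; everything here is illustrative:

```python
LABELS = ["interview_scheduled", "offer_received", "rejected"]
DESCRIPTIONS = {
    "interview_scheduled": "scheduling an interview or call",
    "offer_received": "a formal job offer",
    "rejected": "application rejected",
}

def toy_score(text: str, desc: str) -> float:
    # Stand-in scorer: word overlap. A real reranker scores semantic relevance.
    return len(set(text.lower().split()) & set(desc.lower().split()))

def classify(text: str) -> str:
    pairs = [(text, DESCRIPTIONS[lbl]) for lbl in LABELS]
    scores = [toy_score(t, d) for t, d in pairs]
    return LABELS[scores.index(max(scores))]  # argmax over label descriptions

print(classify("we would like to extend a formal job offer"))  # → offer_received
```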

Step 5: Commit

git add scripts/classifier_adapters.py tests/test_classifier_adapters.py
git commit -m "feat: RerankerAdapter — scores (email, label_description) pairs via BGE reranker"

Task 7: MODEL_REGISTRY + --list-models + CLI skeleton

Files:

  • Create: scripts/benchmark_classifier.py
  • Create: tests/test_benchmark_classifier.py

Step 1: Write failing tests

Create tests/test_benchmark_classifier.py:

"""Tests for benchmark_classifier — no model downloads required."""
import pytest


def test_registry_has_nine_models():
    from scripts.benchmark_classifier import MODEL_REGISTRY
    assert len(MODEL_REGISTRY) == 9


def test_registry_default_count():
    from scripts.benchmark_classifier import MODEL_REGISTRY
    defaults = [k for k, v in MODEL_REGISTRY.items() if v["default"]]
    assert len(defaults) == 5


def test_registry_entries_have_required_keys():
    from scripts.benchmark_classifier import MODEL_REGISTRY
    from scripts.classifier_adapters import ClassifierAdapter
    for name, entry in MODEL_REGISTRY.items():
        assert "adapter" in entry, f"{name} missing 'adapter'"
        assert "model_id" in entry, f"{name} missing 'model_id'"
        assert "params" in entry, f"{name} missing 'params'"
        assert "default" in entry, f"{name} missing 'default'"
        assert issubclass(entry["adapter"], ClassifierAdapter), \
            f"{name} adapter must be a ClassifierAdapter subclass"


def test_load_scoring_jsonl(tmp_path):
    from scripts.benchmark_classifier import load_scoring_jsonl
    import json
    f = tmp_path / "score.jsonl"
    rows = [
        {"subject": "Hi", "body": "Body text", "label": "neutral"},
        {"subject": "Interview", "body": "Schedule a call", "label": "interview_scheduled"},
    ]
    f.write_text("\n".join(json.dumps(r) for r in rows))
    result = load_scoring_jsonl(str(f))
    assert len(result) == 2
    assert result[0]["label"] == "neutral"


def test_load_scoring_jsonl_missing_file():
    from scripts.benchmark_classifier import load_scoring_jsonl
    with pytest.raises(FileNotFoundError):
        load_scoring_jsonl("/nonexistent/path.jsonl")

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -v

Expected: ModuleNotFoundError: No module named 'scripts.benchmark_classifier'

Step 3: Create benchmark_classifier.py with registry + skeleton

#!/usr/bin/env python
"""
Email classifier benchmark — compare HuggingFace models against our 6 labels.

Usage:
    # List available models
    conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --list-models

    # Score against labeled JSONL
    conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score

    # Visual comparison on live IMAP emails
    conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --compare --limit 20

    # Include slow/large models
    conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score --include-slow
"""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).parent.parent))

from scripts.classifier_adapters import (
    LABELS,
    ClassifierAdapter,
    GLiClassAdapter,
    RerankerAdapter,
    ZeroShotAdapter,
    compute_metrics,
)

# ---------------------------------------------------------------------------
# Model registry
# ---------------------------------------------------------------------------

MODEL_REGISTRY: dict[str, dict[str, Any]] = {
    "deberta-zeroshot": {
        "adapter": ZeroShotAdapter,
        "model_id": "MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0",
        "params": "400M",
        "default": True,
    },
    "deberta-small": {
        "adapter": ZeroShotAdapter,
        "model_id": "cross-encoder/nli-deberta-v3-small",
        "params": "100M",
        "default": True,
    },
    "gliclass-large": {
        "adapter": GLiClassAdapter,
        "model_id": "knowledgator/gliclass-instruct-large-v1.0",
        "params": "400M",
        "default": True,
    },
    "bart-mnli": {
        "adapter": ZeroShotAdapter,
        "model_id": "facebook/bart-large-mnli",
        "params": "400M",
        "default": True,
    },
    "bge-m3-zeroshot": {
        "adapter": ZeroShotAdapter,
        "model_id": "MoritzLaurer/bge-m3-zeroshot-v2.0",
        "params": "600M",
        "default": True,
    },
    "bge-reranker": {
        "adapter": RerankerAdapter,
        "model_id": "BAAI/bge-reranker-v2-m3",
        "params": "600M",
        "default": False,
    },
    "deberta-xlarge": {
        "adapter": ZeroShotAdapter,
        "model_id": "microsoft/deberta-xlarge-mnli",
        "params": "750M",
        "default": False,
    },
    "mdeberta-mnli": {
        "adapter": ZeroShotAdapter,
        "model_id": "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
        "params": "300M",
        "default": False,
    },
    "xlm-roberta-anli": {
        "adapter": ZeroShotAdapter,
        "model_id": "vicgalle/xlm-roberta-large-xnli-anli",
        "params": "600M",
        "default": False,
    },
}

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def load_scoring_jsonl(path: str) -> list[dict[str, str]]:
    """Load labeled examples from a JSONL file for benchmark scoring."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"Scoring file not found: {path}\n"
            f"Copy data/email_score.jsonl.example → data/email_score.jsonl and label your emails."
        )
    rows = []
    with p.open() as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def _active_models(include_slow: bool) -> dict[str, dict[str, Any]]:
    return {k: v for k, v in MODEL_REGISTRY.items() if v["default"] or include_slow}


# ---------------------------------------------------------------------------
# Subcommands
# ---------------------------------------------------------------------------

def cmd_list_models(_args: argparse.Namespace) -> None:
    print(f"\n{'Name':<20} {'Params':<8} {'Default':<20} {'Adapter':<15} Model ID")
    print("-" * 100)
    for name, entry in MODEL_REGISTRY.items():
        adapter_name = entry["adapter"].__name__
        default_flag = "yes" if entry["default"] else "(--include-slow)"
        print(f"{name:<20} {entry['params']:<8} {default_flag:<20} {adapter_name:<15} {entry['model_id']}")
    print()


def cmd_score(_args: argparse.Namespace) -> None:
    raise NotImplementedError("--score implemented in Task 8")


def cmd_compare(_args: argparse.Namespace) -> None:
    raise NotImplementedError("--compare implemented in Task 9")


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Benchmark HuggingFace email classifiers against our 6 labels."
    )
    parser.add_argument("--list-models", action="store_true", help="Show model registry and exit")
    parser.add_argument("--score", action="store_true", help="Score against labeled JSONL")
    parser.add_argument("--compare", action="store_true", help="Visual table on live IMAP emails")
    parser.add_argument("--score-file", default="data/email_score.jsonl", help="Path to labeled JSONL")
    parser.add_argument("--limit", type=int, default=20, help="Max emails for --compare")
    parser.add_argument("--days", type=int, default=90, help="Days back for IMAP search")
    parser.add_argument("--include-slow", action="store_true", help="Include non-default heavy models")
    parser.add_argument("--models", nargs="+", help="Override: run only these model names")

    args = parser.parse_args()

    if args.list_models:
        cmd_list_models(args)
    elif args.score:
        cmd_score(args)
    elif args.compare:
        cmd_compare(args)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -v

Expected: 5 tests pass.

Step 5: Smoke-test --list-models

conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --list-models

Expected: table of 9 models printed without error.
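The default/--include-slow filtering in _active_models() is simple enough to verify in isolation. A standalone sketch with a toy registry (names are made up):

```python
registry = {
    "fast-a": {"default": True},
    "fast-b": {"default": True},
    "slow-c": {"default": False},
}

def active_models(include_slow: bool) -> list[str]:
    # Mirrors _active_models(): defaults always run; slow models are opt-in.
    return [k for k, v in registry.items() if v["default"] or include_slow]

print(active_models(False))  # → ['fast-a', 'fast-b']
print(active_models(True))   # → ['fast-a', 'fast-b', 'slow-c']
```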

Step 6: Commit

git add scripts/benchmark_classifier.py tests/test_benchmark_classifier.py
git commit -m "feat: benchmark_classifier skeleton — MODEL_REGISTRY, --list-models, CLI"

Task 8: --score mode

Files:

  • Modify: scripts/benchmark_classifier.py
  • Modify: tests/test_benchmark_classifier.py

Step 1: Add failing tests

Append to tests/test_benchmark_classifier.py:

def test_run_scoring_with_mock_adapters(tmp_path):
    """run_scoring() returns per-model metrics using mock adapters."""
    import json
    from unittest.mock import MagicMock
    from scripts.benchmark_classifier import run_scoring

    score_file = tmp_path / "score.jsonl"
    rows = [
        {"subject": "Interview", "body": "Let's schedule", "label": "interview_scheduled"},
        {"subject": "Sorry", "body": "We went with others", "label": "rejected"},
        {"subject": "Offer", "body": "We are pleased", "label": "offer_received"},
    ]
    score_file.write_text("\n".join(json.dumps(r) for r in rows))

    perfect = MagicMock()
    perfect.name = "perfect"
    perfect.classify.side_effect = lambda s, b: (
        "interview_scheduled" if "Interview" in s else
        "rejected" if "Sorry" in s else "offer_received"
    )

    bad = MagicMock()
    bad.name = "bad"
    bad.classify.return_value = "neutral"

    results = run_scoring([perfect, bad], str(score_file))

    assert results["perfect"]["__accuracy__"] == pytest.approx(1.0)
    assert results["bad"]["__accuracy__"] == pytest.approx(0.0)
    assert "latency_ms" in results["perfect"]


def test_run_scoring_handles_classify_error(tmp_path):
    """run_scoring() falls back to 'neutral' on exception and continues."""
    import json
    from unittest.mock import MagicMock
    from scripts.benchmark_classifier import run_scoring

    score_file = tmp_path / "score.jsonl"
    score_file.write_text(json.dumps({"subject": "Hi", "body": "Body", "label": "neutral"}))

    broken = MagicMock()
    broken.name = "broken"
    broken.classify.side_effect = RuntimeError("model crashed")

    results = run_scoring([broken], str(score_file))
    assert "broken" in results

Step 2: Run tests — expect FAIL

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py::test_run_scoring_with_mock_adapters -v

Expected: ImportError — run_scoring not defined.

Step 3: Implement run_scoring() and cmd_score()

Add import time at the top of benchmark_classifier.py. Then add run_scoring():

def run_scoring(
    adapters: list[ClassifierAdapter],
    score_file: str,
) -> dict[str, Any]:
    """Run all adapters against a labeled JSONL. Returns per-adapter metrics."""
    rows = load_scoring_jsonl(score_file)
    gold = [r["label"] for r in rows]
    results: dict[str, Any] = {}

    for adapter in adapters:
        preds: list[str] = []
        t0 = time.monotonic()
        for row in rows:
            try:
                pred = adapter.classify(row["subject"], row["body"])
            except Exception as exc:
                print(f"  [{adapter.name}] ERROR on '{row['subject'][:40]}': {exc}", flush=True)
                pred = "neutral"
            preds.append(pred)
        elapsed_ms = (time.monotonic() - t0) * 1000
        metrics = compute_metrics(preds, gold, LABELS)
        metrics["latency_ms"] = round(elapsed_ms / max(len(rows), 1), 1)  # guard against an empty file
        results[adapter.name] = metrics
        adapter.unload()

    return results
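run_scoring() delegates metric computation to compute_metrics() from an earlier task. As a self-contained reference for what those numbers mean, per-label F1 and macro-F1 can be sketched like this (illustrative only, not the plan's actual compute_metrics()):

```python
def per_label_f1(preds: list[str], gold: list[str], labels: list[str]) -> dict:
    """Sketch of per-label precision/recall/F1 plus unweighted macro-F1."""
    out: dict = {}
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, gold))
        fp = sum(p == label and g != label for p, g in zip(preds, gold))
        fn = sum(g == label and p != label for p, g in zip(preds, gold))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[label] = {"precision": prec, "recall": rec, "f1": f1}
    out["__macro_f1__"] = sum(out[lb]["f1"] for lb in labels) / len(labels)
    return out


m = per_label_f1(
    preds=["rejected", "neutral", "offer_received"],
    gold=["rejected", "rejected", "offer_received"],
    labels=["rejected", "neutral", "offer_received"],
)
print(round(m["rejected"]["f1"], 3), round(m["__macro_f1__"], 3))  # → 0.667 0.556
```

Macro-F1 weights every label equally, which is why it is the headline metric here: a rare label like offer_received counts as much as neutral.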

Replace the cmd_score stub:

def cmd_score(args: argparse.Namespace) -> None:
    active = _active_models(args.include_slow)
    if args.models:
        active = {k: v for k, v in active.items() if k in args.models}

    adapters = [
        entry["adapter"](name, entry["model_id"])
        for name, entry in active.items()
    ]

    print(f"\nScoring {len(adapters)} model(s) against {args.score_file}\n")
    results = run_scoring(adapters, args.score_file)

    # Summary table
    col = 12
    print(f"{'Model':<22}{'macro-F1':>{col}}{'Accuracy':>{col}}{'ms/email':>{col}}")
    print("-" * (22 + col * 3))
    for name, m in results.items():
        print(
            f"{name:<22}"
            f"{m['__macro_f1__']:>{col}.3f}"
            f"{m['__accuracy__']:>{col}.3f}"
            f"{m['latency_ms']:>{col}.1f}"
        )

    # Per-label F1 breakdown
    print("\nPer-label F1:")
    names = list(results.keys())
    print(f"{'Label':<25}" + "".join(f"{n[:11]:>{col}}" for n in names))
    print("-" * (25 + col * len(names)))
    for label in LABELS:
        row_str = f"{label:<25}"
        for m in results.values():
            row_str += f"{m[label]['f1']:>{col}.3f}"
        print(row_str)
    print()
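For the tables above to render, compute_metrics() (implemented in an earlier task) must return one dict per adapter of roughly this shape; the values below are made up for illustration:

```python
# Hypothetical per-adapter metrics dict that cmd_score's tables index into.
# compute_metrics() produces these entries; run_scoring() adds latency_ms.
metrics = {
    "__accuracy__": 0.667,   # fraction of exact label matches
    "__macro_f1__": 0.583,   # unweighted mean of per-label F1
    "latency_ms": 12.3,      # average wall-clock ms per email
    "interview_scheduled": {"precision": 1.0, "recall": 0.5, "f1": 0.667},
    "neutral": {"precision": 0.5, "recall": 1.0, "f1": 0.667},
    # ...one nested entry per label in LABELS...
}
label_keys = sorted(k for k in metrics if not k.startswith("__") and k != "latency_ms")
print(label_keys)  # → ['interview_scheduled', 'neutral']
```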

Step 4: Run tests — expect PASS

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py -v

Expected: 7 tests pass.

Step 5: Commit

git add scripts/benchmark_classifier.py tests/test_benchmark_classifier.py
git commit -m "feat: --score mode with macro-F1, accuracy, latency, and per-label F1 table"

Task 9: --compare mode (stdlib IMAP + table output)

Files:

  • Modify: scripts/benchmark_classifier.py

Step 1: Add IMAP fetch helpers

Add after the _active_models() helper in benchmark_classifier.py:

import email as _email_lib
import imaplib
from datetime import datetime, timedelta
from pathlib import Path  # may already be imported at the top of the script

_BROAD_TERMS = [
    "interview", "opportunity", "offer letter",
    "job offer", "application", "recruiting",
]


def _load_imap_config() -> dict[str, Any]:
    import yaml
    cfg_path = Path(__file__).parent.parent / "config" / "email.yaml"
    with cfg_path.open() as f:
        return yaml.safe_load(f)


def _imap_connect(cfg: dict[str, Any]) -> imaplib.IMAP4_SSL:
    conn = imaplib.IMAP4_SSL(cfg["host"], cfg.get("port", 993))
    conn.login(cfg["username"], cfg["password"])
    return conn


def _decode_part(part: Any) -> str:
    charset = part.get_content_charset() or "utf-8"
    try:
        return part.get_payload(decode=True).decode(charset, errors="replace")
    except Exception:
        return ""


def _parse_uid(conn: imaplib.IMAP4_SSL, uid: bytes) -> dict[str, str] | None:
    from email.header import decode_header

    try:
        _, data = conn.uid("fetch", uid, "(RFC822)")
        raw = data[0][1]
        msg = _email_lib.message_from_bytes(raw)
        # Decode RFC 2047 encoded-words so non-ASCII subjects render readably.
        subject = "".join(
            chunk.decode(enc or "utf-8", errors="replace")
            if isinstance(chunk, bytes) else chunk
            for chunk, enc in decode_header(str(msg.get("subject", "")))
        ).strip()
        body = ""
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    body = _decode_part(part)
                    break
        else:
            body = _decode_part(msg)
        return {"subject": subject, "body": body}
    except Exception:
        return None


def _fetch_imap_sample(limit: int, days: int) -> list[dict[str, str]]:
    cfg = _load_imap_config()
    conn = _imap_connect(cfg)
    since = (datetime.now() - timedelta(days=days)).strftime("%d-%b-%Y")
    conn.select("INBOX")

    seen_uids: dict[bytes, None] = {}
    for term in _BROAD_TERMS:
        _, data = conn.uid("search", None, f'(SUBJECT "{term}" SINCE {since})')
        for uid in (data[0] or b"").split():
            seen_uids[uid] = None

    sample = list(seen_uids.keys())[:limit]
    emails = []
    for uid in sample:
        parsed = _parse_uid(conn, uid)
        if parsed:
            emails.append(parsed)
    try:
        conn.logout()
    except Exception:
        pass
    return emails
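The text/plain extraction used by _parse_uid() can be exercised offline by hand-building a multipart message and parsing it with the stdlib (a standalone sketch, independent of IMAP):

```python
import email

# A minimal multipart/alternative message like those fetched over IMAP.
raw = (
    b"Subject: Interview invitation\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Let's schedule a call.\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>Let's schedule a call.</p>\r\n"
    b"--XX--\r\n"
)

msg = email.message_from_bytes(raw)
body = ""
for part in msg.walk():  # same walk-and-break strategy as _parse_uid()
    if part.get_content_type() == "text/plain":
        body = part.get_payload(decode=True).decode("utf-8", errors="replace")
        break
print(msg["Subject"], "|", body.strip())
```

Walking in document order and breaking at the first text/plain part is what keeps the HTML alternative out of the classifier input.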

Step 2: Replace cmd_compare stub

def cmd_compare(args: argparse.Namespace) -> None:
    active = _active_models(args.include_slow)
    if args.models:
        active = {k: v for k, v in active.items() if k in args.models}

    print(f"Fetching up to {args.limit} emails from IMAP …")
    emails = _fetch_imap_sample(args.limit, args.days)
    print(f"Fetched {len(emails)} emails. Loading {len(active)} model(s) …\n")

    adapters = [
        entry["adapter"](name, entry["model_id"])
        for name, entry in active.items()
    ]
    model_names = [a.name for a in adapters]

    col = 22
    subj_w = 50
    print(f"{'Subject':<{subj_w}}" + "".join(f"{n:<{col}}" for n in model_names))
    print("-" * (subj_w + col * len(model_names)))

    for row in emails:
        short_subj = row["subject"][:subj_w - 1]  # truncate; keeps at least one space before the next column
        line = f"{short_subj:<{subj_w}}"
        for adapter in adapters:
            try:
                label = adapter.classify(row["subject"], row["body"])
            except Exception as exc:
                label = f"ERR:{str(exc)[:8]}"
            line += f"{label:<{col}}"
        print(line, flush=True)

    for adapter in adapters:
        adapter.unload()
    print()

Step 3: Run full test suite

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_benchmark_classifier.py tests/test_classifier_adapters.py -v

Expected: all 13 tests pass.

Step 4: Commit

git add scripts/benchmark_classifier.py
git commit -m "feat: --compare mode — stdlib IMAP fetch + side-by-side model label table"

Task 10: First real benchmark run

No code changes — first live execution.

Step 1: Create your labeled scoring file

cp data/email_score.jsonl.example data/email_score.jsonl

Open data/email_score.jsonl and replace the fake examples with at least 10 real emails from your inbox. Format per line:

{"subject": "actual subject", "body": "first 600 chars of body", "label": "one_of_six_labels"}

Valid labels: interview_scheduled, offer_received, rejected, positive_response, survey_received, neutral
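Label typos in a hand-edited JSONL are easy to miss and would silently depress every model's score. A quick validator can catch them before a run (a sketch; feed it the lines of your data/email_score.jsonl):

```python
import json

LABELS = {
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
}

def validate_scoring_lines(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for malformed JSONL rows."""
    problems = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc}"))
            continue
        missing = {"subject", "body", "label"} - row.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif row["label"] not in LABELS:
            problems.append((i, f"unknown label: {row['label']}"))
    return problems

sample = [
    '{"subject": "Interview", "body": "Call tomorrow", "label": "interview_scheduled"}',
    '{"subject": "Update", "body": "We went with others", "label": "regected"}',
]
print(validate_scoring_lines(sample))  # → [(2, "unknown label: regected")]
```

An empty return list means every row is well-formed and uses one of the six valid labels.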

Step 2: Run --score with default models

conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score

Models download on first run (~400–600 MB each) — allow a few minutes.

Step 3: Run --compare on live IMAP

conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --compare --limit 15

Step 4: Run slow models (optional)

conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score --include-slow

Step 5: Capture results (optional)

conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score \
  > docs/plans/2026-02-26-benchmark-results.txt 2>&1
git add docs/plans/2026-02-26-benchmark-results.txt
git commit -m "docs: initial HF classifier benchmark results"

Quick Reference

# Create env (once)
conda env create -f scripts/classifier_service/environment.yml

# List models
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --list-models

# Score against labeled data (5 default models)
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score

# Live IMAP visual table
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --compare --limit 20

# Single model only
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score --models deberta-zeroshot

# Run all tests (job-seeker env — mocks only, no downloads)
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_classifier_adapters.py tests/test_benchmark_classifier.py -v