Email Classifier Benchmark — Design

Date: 2026-02-26
Status: Approved

Problem

The current classify_stage_signal() in scripts/imap_sync.py uses llama3.1:8b via Ollama for 6-label email classification. This is slow and requires a running Ollama instance, and its accuracy has not been measured against alternatives. This design establishes a benchmark harness to evaluate HuggingFace-native classifiers as potential replacements.

Labels

interview_scheduled  offer_received  rejected
positive_response    survey_received  neutral

Approach: Standalone Benchmark Script (Approach B)

Two new scripts, plus supporting data and environment files; nothing in imap_sync.py changes until a winner is chosen.

scripts/
  benchmark_classifier.py     — CLI entry point
  classifier_adapters.py      — adapter classes (reusable by imap_sync later)

data/
  email_eval.jsonl            — labeled ground truth (gitignored — contains email content)
  email_eval.jsonl.example    — committed example with fake emails

scripts/classifier_service/
  environment.yml             — new conda env: job-seeker-classifiers

Adapter Pattern

ClassifierAdapter (ABC)
  .classify(subject, body) → str   # one of the 6 labels
  .name → str
  .model_id → str
  .load() / .unload()              # explicit lifecycle
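
A minimal Python sketch of this interface. The `LABELS` constant, the constructor signature, and the context-manager helpers are assumptions added for illustration; the design itself only fixes `classify`, `name`, `model_id`, `load`, and `unload`.

```python
from abc import ABC, abstractmethod

# The six labels from the "Labels" section; the constant name is an assumption.
LABELS = (
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
)


class ClassifierAdapter(ABC):
    """Common interface shared by every benchmarked classifier."""

    def __init__(self, name: str, model_id: str) -> None:
        self.name = name          # short registry name, e.g. "bart-mnli"
        self.model_id = model_id  # HuggingFace model identifier

    @abstractmethod
    def load(self) -> None:
        """Load model weights; must be called before classify()."""

    @abstractmethod
    def unload(self) -> None:
        """Release the model so the next adapter can reuse the memory."""

    @abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return exactly one of LABELS for the given email."""

    # Optional convenience (not part of the design): `with adapter:` pairs
    # load() and unload() even if classification raises.
    def __enter__(self) -> "ClassifierAdapter":
        self.load()
        return self

    def __exit__(self, *exc_info) -> bool:
        self.unload()
        return False
```

Keeping `load()` and `unload()` explicit lets the harness control when each model occupies memory.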

ZeroShotAdapter(ClassifierAdapter)
  # uses transformers pipeline("zero-shot-classification")
  # candidate_labels = list of 6 labels
  # works for: DeBERTa, BART-MNLI, BGE-M3-ZeroShot, XLM-RoBERTa
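
One way the zero-shot adapter could look, as a sketch rather than the final implementation. It is written standalone so it runs on its own; in classifier_adapters.py it would subclass ClassifierAdapter. Joining subject and body with a blank line, the `device` default, and the `pipeline_factory` hook (used only to inject a fake pipeline in tests) are assumptions.

```python
from typing import Callable, Optional

LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]


class ZeroShotAdapter:
    """Wraps transformers' zero-shot-classification pipeline."""

    def __init__(self, name: str, model_id: str, device: int = -1,
                 pipeline_factory: Optional[Callable[[], Callable]] = None):
        self.name = name
        self.model_id = model_id
        self.device = device              # -1 = CPU, 0 = first CUDA device
        self._factory = pipeline_factory  # hypothetical test hook
        self._pipe = None

    def load(self) -> None:
        if self._factory is not None:
            self._pipe = self._factory()
            return
        # Imported lazily so the module can be imported without torch installed.
        from transformers import pipeline
        self._pipe = pipeline("zero-shot-classification",
                              model=self.model_id, device=self.device)

    def unload(self) -> None:
        self._pipe = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipe is None:
            raise RuntimeError(f"{self.name}: call load() before classify()")
        text = f"{subject}\n\n{body}"  # assumption: subject and body are concatenated
        result = self._pipe(text, candidate_labels=LABELS)
        # The pipeline returns labels sorted by descending score.
        return result["labels"][0]
```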

GLiClassAdapter(ClassifierAdapter)
  # uses gliclass library (pip install gliclass)
  # GLiClassModel + ZeroShotClassificationPipeline
  # works for: gliclass-instruct-large-v1.0

RerankerAdapter(ClassifierAdapter)
  # uses FlagEmbedding reranker.compute_score()
  # scores (email_text, label_description) pairs; highest = predicted label
  # works for: bge-reranker-v2-m3
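
The reranker approach can be sketched as follows. The label descriptions are placeholders invented for illustration (the design does not specify their wording), and `reranker_factory` is a hypothetical hook for injecting a fake scorer in tests. The adapter is standalone here; in classifier_adapters.py it would subclass ClassifierAdapter.

```python
from typing import Callable, Optional

# Hypothetical descriptions; the real wording would need tuning against the eval set.
LABEL_DESCRIPTIONS = {
    "interview_scheduled": "The email schedules or confirms a job interview.",
    "offer_received": "The email extends a job offer.",
    "rejected": "The email says the application was not successful.",
    "positive_response": "The email shows interest in the candidate.",
    "survey_received": "The email asks the recipient to complete a survey.",
    "neutral": "The email carries no hiring decision or next step.",
}


class RerankerAdapter:
    """Scores (email_text, label_description) pairs; the top score wins."""

    def __init__(self, name: str, model_id: str,
                 reranker_factory: Optional[Callable[[], object]] = None):
        self.name = name
        self.model_id = model_id
        self._factory = reranker_factory
        self._reranker = None

    def load(self) -> None:
        if self._factory is not None:
            self._reranker = self._factory()
            return
        from FlagEmbedding import FlagReranker  # lazy: only this adapter needs it
        self._reranker = FlagReranker(self.model_id)

    def unload(self) -> None:
        self._reranker = None

    def classify(self, subject: str, body: str) -> str:
        if self._reranker is None:
            raise RuntimeError(f"{self.name}: call load() before classify()")
        text = f"{subject}\n\n{body}"
        labels = list(LABEL_DESCRIPTIONS)
        pairs = [[text, LABEL_DESCRIPTIONS[label]] for label in labels]
        scores = self._reranker.compute_score(pairs)  # one score per pair
        best = max(range(len(labels)), key=lambda i: scores[i])
        return labels[best]
```

Each email costs six forward passes here (one per label), which is likely part of why this model sits behind `--include-slow`.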

Model Registry

Short name	Model	Params	Adapter	Default
`deberta-zeroshot`	MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0	400M	ZeroShot	✅
`deberta-small`	cross-encoder/nli-deberta-v3-small	100M	ZeroShot	✅
`gliclass-large`	knowledgator/gliclass-instruct-large-v1.0	400M	GLiClass	✅
`bart-mnli`	facebook/bart-large-mnli	400M	ZeroShot	✅
`bge-m3-zeroshot`	MoritzLaurer/bge-m3-zeroshot-v2.0	600M	ZeroShot	✅
`bge-reranker`	BAAI/bge-reranker-v2-m3	600M	Reranker	❌ (`--include-slow`)
`deberta-xlarge`	microsoft/deberta-xlarge-mnli	750M	ZeroShot	❌ (`--include-slow`)
`mdeberta-mnli`	MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	300M	ZeroShot	❌ (`--include-slow`)
`xlm-roberta-anli`	vicgalle/xlm-roberta-large-xnli-anli	600M	ZeroShot	❌ (`--include-slow`)
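
The registry above could be held as a plain dictionary. This is a sketch: the `ModelSpec` dataclass, `MODEL_REGISTRY`, and `select_models` are illustrative names, not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    model_id: str
    params: str    # approximate parameter count, as listed in the table
    adapter: str   # "ZeroShot", "GLiClass", or "Reranker"
    default: bool  # False = only run with --include-slow


MODEL_REGISTRY = {
    "deberta-zeroshot": ModelSpec("MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0", "400M", "ZeroShot", True),
    "deberta-small": ModelSpec("cross-encoder/nli-deberta-v3-small", "100M", "ZeroShot", True),
    "gliclass-large": ModelSpec("knowledgator/gliclass-instruct-large-v1.0", "400M", "GLiClass", True),
    "bart-mnli": ModelSpec("facebook/bart-large-mnli", "400M", "ZeroShot", True),
    "bge-m3-zeroshot": ModelSpec("MoritzLaurer/bge-m3-zeroshot-v2.0", "600M", "ZeroShot", True),
    "bge-reranker": ModelSpec("BAAI/bge-reranker-v2-m3", "600M", "Reranker", False),
    "deberta-xlarge": ModelSpec("microsoft/deberta-xlarge-mnli", "750M", "ZeroShot", False),
    "mdeberta-mnli": ModelSpec("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", "300M", "ZeroShot", False),
    "xlm-roberta-anli": ModelSpec("vicgalle/xlm-roberta-large-xnli-anli", "600M", "ZeroShot", False),
}


def select_models(include_slow: bool = False) -> list:
    """Return short names to run: defaults only, or all with include_slow."""
    return [name for name, spec in MODEL_REGISTRY.items()
            if spec.default or include_slow]
```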

CLI Modes

`--compare` (live IMAP, visual table)

Extends the pattern of test_email_classify.py. Pulls emails via IMAP and shows a table with one row per email and one column per classifier:

Subject                                              | Phrase | llama3 | deberta-zs | deberta-sm | gliclass | bart | bge-m3

Phrase-filter column shows BLOCK/pass (same gate as production)
llama3 column = current production baseline
HF model columns follow
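
A rough sketch of how one table row might be rendered. The column widths, the `-` placeholder for missing predictions, and the function names are assumptions; only the column order and the BLOCK/pass convention come from the layout above.

```python
COLUMNS = ["llama3", "deberta-zs", "deberta-sm", "gliclass", "bart", "bge-m3"]
SUBJECT_WIDTH = 52  # assumed width; long subjects are truncated
CELL_WIDTH = 19     # fits the longest label, "interview_scheduled"


def format_header() -> str:
    cells = ["Subject".ljust(SUBJECT_WIDTH), "Phrase"]
    cells += [col.ljust(CELL_WIDTH) for col in COLUMNS]
    return " | ".join(cells)


def format_row(subject: str, phrase_blocked: bool, predictions: dict) -> str:
    """predictions maps column name -> label (or "ERR"); missing columns show "-"."""
    cells = [
        subject[:SUBJECT_WIDTH].ljust(SUBJECT_WIDTH),
        ("BLOCK" if phrase_blocked else "pass").ljust(6),
    ]
    cells += [predictions.get(col, "-").ljust(CELL_WIDTH) for col in COLUMNS]
    return " | ".join(cells)
```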

`--eval` (ground-truth evaluation)

Reads data/email_eval.jsonl, runs all models, reports per-label and aggregate metrics:

Per-label: precision, recall, F1
Aggregate: macro-F1, accuracy
Latency: ms/email per model
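
These metrics can be computed without extra dependencies. The sketch below assumes macro-F1 is the unweighted mean of per-label F1 over all labels passed in, with a label that never occurs scoring 0.0. The function names are illustrative.

```python
import time
from collections import Counter


def evaluate(y_true, y_pred, labels):
    """Per-label precision/recall/F1 plus macro-F1 and accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    per_label = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}

    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(labels) if labels else 0.0
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return {"per_label": per_label, "macro_f1": macro_f1, "accuracy": accuracy}


def mean_latency_ms(classify, emails):
    """Average wall-clock milliseconds per classify(subject, body) call."""
    if not emails:
        return 0.0
    start = time.perf_counter()
    for subject, body in emails:
        classify(subject, body)
    return (time.perf_counter() - start) * 1000.0 / len(emails)
```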

JSONL format:

{"subject": "Interview invitation", "body": "We'd like to schedule...", "label": "interview_scheduled"}
{"subject": "Your application", "body": "We regret to inform you...", "label": "rejected"}
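
A possible loader for this format, as a sketch. The function name and the validation rules (required keys, known labels, blank lines skipped) are assumptions. The missing-file message follows the error-handling rule of pointing at the committed `.example` file.

```python
import json
from pathlib import Path

LABELS = {
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
}
REQUIRED_KEYS = {"subject", "body", "label"}


def load_eval_set(path="data/email_eval.jsonl"):
    """Read labeled emails, one JSON object per line."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{p} not found; see {p}.example for the expected format"
        )
    records = []
    with p.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"{p}:{lineno}: missing keys {sorted(missing)}")
            if record["label"] not in LABELS:
                raise ValueError(f"{p}:{lineno}: unknown label {record['label']!r}")
            records.append(record)
    return records
```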

`--list-models`

Prints the registry with sizes, adapter types, and default/slow flags.
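
The three modes and the `--include-slow` flag map naturally onto argparse. In this sketch, treating the modes as mutually exclusive and required is an assumption; the design does not say whether they can be combined.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="benchmark_classifier.py",
        description="Benchmark HuggingFace classifiers for email labeling.",
    )
    # Assumption: exactly one mode per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--compare", action="store_true",
                      help="pull live emails via IMAP and show a comparison table")
    mode.add_argument("--eval", action="store_true",
                      help="score all models against data/email_eval.jsonl")
    mode.add_argument("--list-models", action="store_true",
                      help="print the model registry and exit")
    parser.add_argument("--include-slow", action="store_true",
                        help="also run models that are off by default")
    return parser
```

Usage: `build_parser().parse_args(["--eval", "--include-slow"])` yields a namespace with `eval=True` and `include_slow=True`.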

Conda Environment

New env job-seeker-classifiers, isolated from the existing job-seeker env (which has no torch).

Key deps:

torch (CUDA-enabled)
transformers
gliclass
FlagEmbedding (for bge-reranker only)
sentence-transformers (optional, for future embedding-based approaches)

GPU

Auto-select: device="cuda" when available, with CPU fallback. No GPU pinning is needed: models load one at a time, so VRAM pressure is sequential, not cumulative.
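
A sketch of device selection and the one-model-at-a-time loop. The function names are illustrative, and the cache release after each model is an assumption about how to keep VRAM use sequential in practice. Adapters are expected to expose `name`, `load()`, `unload()`, and `classify(subject, body)`.

```python
import gc


def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch missing, e.g. outside the classifiers env
    return "cuda" if torch.cuda.is_available() else "cpu"


def release_gpu_memory() -> None:
    """Drop unreferenced tensors and return cached VRAM to the driver."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_sequentially(adapters, emails):
    """Load, run, and unload each adapter in turn; never two models at once."""
    results = {}
    for adapter in adapters:
        adapter.load()
        try:
            results[adapter.name] = [
                adapter.classify(subject, body) for subject, body in emails
            ]
        finally:
            adapter.unload()       # runs even if classify() raises
            release_gpu_memory()
    return results
```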

Error Handling

Model load failures: skip that column, print warning, continue
Classification errors: show ERR in cell, continue
IMAP failures: propagate (same as existing harness)
Missing eval file: clear error message pointing to data/email_eval.jsonl.example
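
The first two rules can be sketched as a per-model helper. The function name and the `None` return value for a skipped column are assumptions. Adapters are expected to expose `name`, `load()`, `unload()`, and `classify(subject, body)`.

```python
import sys


def run_column(adapter, emails):
    """Return one cell per email, or None if the model could not be loaded."""
    try:
        adapter.load()
    except Exception as exc:
        # Model load failure: warn and let the caller drop this column.
        print(f"warning: skipping {adapter.name}: {exc}", file=sys.stderr)
        return None

    cells = []
    try:
        for subject, body in emails:
            try:
                cells.append(adapter.classify(subject, body))
            except Exception:
                cells.append("ERR")  # classification error: mark cell, continue
    finally:
        adapter.unload()
    return cells
```

IMAP failures are deliberately not caught here, so they propagate as in the existing harness.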

What Does Not Change (Yet)

scripts/imap_sync.py — production classifier unchanged
scripts/llm_router.py — unchanged
staging.db schema — unchanged

After benchmark results are reviewed, a separate PR will wire the winning model into classify_stage_signal() as an opt-in backend in llm_router.py.
