docs: email classifier benchmark design — adapter pattern, 9-model registry, compare+eval modes

2026-02-26 22:56:11 -08:00 · 2026-02-26 22:56:11 -08:00 · a7fe4d9ff4
commit a7fe4d9ff4
parent ae7c985fab
1 changed files with 132 additions and 0 deletions
--- a/docs/plans/2026-02-26-email-classifier-benchmark-design.md
+++ b/docs/plans/2026-02-26-email-classifier-benchmark-design.md
@@ -0,0 +1,132 @@
+# Email Classifier Benchmark — Design
+
+**Date:** 2026-02-26
+**Status:** Approved
+
+## Problem
+
+The current `classify_stage_signal()` in `scripts/imap_sync.py` uses `llama3.1:8b` via
+Ollama for 6-label email classification. It is slow, requires a running Ollama instance,
+and has not had its accuracy verified against alternatives. This design establishes a
+benchmark harness to evaluate HuggingFace-native classifiers as potential replacements.
+
+## Labels
+
+```
+interview_scheduled  offer_received  rejected
+positive_response    survey_received  neutral
+```
+
+## Approach: Standalone Benchmark Script (Approach B)
+
+Two new scripts, plus supporting data and environment files; nothing in `imap_sync.py` changes until a winner is chosen.
+
+```
+scripts/
+  benchmark_classifier.py     — CLI entry point
+  classifier_adapters.py      — adapter classes (reusable by imap_sync later)
+
+data/
+  email_eval.jsonl            — labeled ground truth (gitignored — contains email content)
+  email_eval.jsonl.example    — committed example with fake emails
+
+scripts/classifier_service/
+  environment.yml             — new conda env: job-seeker-classifiers
+```
+
+## Adapter Pattern
+
+```
+ClassifierAdapter (ABC)
+  .classify(subject, body) → str   # one of the 6 labels
+  .name → str
+  .model_id → str
+  .load() / .unload()              # explicit lifecycle
+
+ZeroShotAdapter(ClassifierAdapter)
+  # uses transformers pipeline("zero-shot-classification")
+  # candidate_labels = list of 6 labels
+  # works for: DeBERTa, BART-MNLI, BGE-M3-ZeroShot, XLM-RoBERTa
+
+GLiClassAdapter(ClassifierAdapter)
+  # uses gliclass library (pip install gliclass)
+  # GLiClassModel + ZeroShotClassificationPipeline
+  # works for: gliclass-instruct-large-v1.0
+
+RerankerAdapter(ClassifierAdapter)
+  # uses FlagEmbedding reranker.compute_score()
+  # scores (email_text, label_description) pairs; highest = predicted label
+  # works for: bge-reranker-v2-m3
+```
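The outline above could translate into Python roughly as follows. This is a minimal sketch, not the committed implementation: the constructor signature, the `LABELS` constant, the `device` argument, and the way subject and body are joined into one string are all assumptions made for illustration. Only the zero-shot adapter is shown, and `transformers` is imported lazily so the module can be imported without it.

```python
from abc import ABC, abstractmethod

# The six labels from the design doc.
LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]


class ClassifierAdapter(ABC):
    """Common interface for every benchmarked classifier."""

    def __init__(self, name: str, model_id: str) -> None:
        self.name = name          # short registry name, e.g. "deberta-zeroshot"
        self.model_id = model_id  # Hugging Face model id

    @abstractmethod
    def load(self) -> None:
        """Load model weights (explicit lifecycle)."""

    @abstractmethod
    def unload(self) -> None:
        """Drop references so memory can be reclaimed."""

    @abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return one of LABELS."""


class ZeroShotAdapter(ClassifierAdapter):
    def __init__(self, name: str, model_id: str, device: int = -1) -> None:
        super().__init__(name, model_id)
        self.device = device  # assumption: -1 = CPU, 0 = first CUDA device
        self._pipe = None

    def load(self) -> None:
        # Imported here so that importing this module does not require transformers.
        from transformers import pipeline
        self._pipe = pipeline(
            "zero-shot-classification", model=self.model_id, device=self.device
        )

    def unload(self) -> None:
        self._pipe = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipe is None:
            raise RuntimeError(f"{self.name}: call load() before classify()")
        # Assumption: subject and body are concatenated into a single text.
        text = f"Subject: {subject}\n\n{body}"
        result = self._pipe(text, candidate_labels=LABELS)
        return result["labels"][0]  # the pipeline returns labels sorted by score
```

Keeping `load()` and `unload()` separate from the constructor lets the benchmark build all adapters cheaply and pay the model-loading cost only for the one currently running.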
+
+## Model Registry
+
+| Short name | Model | Params | Adapter | Default |
+|------------|-------|--------|---------|---------|
+| `deberta-zeroshot` | MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0 | 400M | ZeroShot | ✅ |
+| `deberta-small` | cross-encoder/nli-deberta-v3-small | 100M | ZeroShot | ✅ |
+| `gliclass-large` | knowledgator/gliclass-instruct-large-v1.0 | 400M | GLiClass | ✅ |
+| `bart-mnli` | facebook/bart-large-mnli | 400M | ZeroShot | ✅ |
+| `bge-m3-zeroshot` | MoritzLaurer/bge-m3-zeroshot-v2.0 | 600M | ZeroShot | ✅ |
+| `bge-reranker` | BAAI/bge-reranker-v2-m3 | 600M | Reranker | ❌ (`--include-slow`) |
+| `deberta-xlarge` | microsoft/deberta-xlarge-mnli | 750M | ZeroShot | ❌ (`--include-slow`) |
+| `mdeberta-mnli` | MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 300M | ZeroShot | ❌ (`--include-slow`) |
+| `xlm-roberta-anli` | vicgalle/xlm-roberta-large-xnli-anli | 600M | ZeroShot | ❌ (`--include-slow`) |
+
+## CLI Modes
+
+### `--compare` (live IMAP, visual table)
+Extends the pattern of `test_email_classify.py`. Pulls emails via IMAP, shows a table:
+```
+Subject                                              | Phrase | llama3 | deberta-zs | deberta-sm | gliclass | bart | bge-m3
+```
+- Phrase-filter column shows BLOCK/pass (same gate as production)
+- `llama3` column = current production baseline
+- HF model columns follow
+
+### `--eval` (ground-truth evaluation)
+Reads `data/email_eval.jsonl`, runs all models, reports per-label and aggregate metrics:
+- Per-label: precision, recall, F1
+- Aggregate: macro-F1, accuracy
+- Latency: ms/email per model
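The per-label and aggregate metrics listed above can be computed without extra dependencies. The following is a sketch under two assumptions that the design does not spell out: a metric with a zero denominator is reported as 0.0, and macro-F1 averages over every label passed in, including labels that never occur in the data. The function name `evaluate` is hypothetical.

```python
from collections import Counter


def evaluate(y_true, y_pred, labels):
    """Compute per-label precision/recall/F1 plus macro-F1 and accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    per_label = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}

    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(labels)
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return {"per_label": per_label, "macro_f1": macro_f1, "accuracy": accuracy}
```

A prediction that is not a valid label (for example an `ERR` cell) counts as a miss for the true label, so it lowers recall and accuracy without needing special handling.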
+
+JSONL format:
+```jsonl
+{"subject": "Interview invitation", "body": "We'd like to schedule...", "label": "interview_scheduled"}
+{"subject": "Your application", "body": "We regret to inform you...", "label": "rejected"}
+```
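A loader for this format might validate each record as it is read. This is a sketch only: the function names, the skipping of blank lines, and the exact error wording are assumptions; the three keys and the six labels come from the design.

```python
import json
from pathlib import Path

LABELS = {
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
}
REQUIRED_KEYS = ("subject", "body", "label")


def parse_eval_lines(lines):
    """Parse an iterable of JSONL lines into validated records."""
    records = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # assumption: blank lines are tolerated
        record = json.loads(raw)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        if record["label"] not in LABELS:
            raise ValueError(f"line {lineno}: unknown label {record['label']!r}")
        records.append(record)
    return records


def load_eval_set(path="data/email_eval.jsonl"):
    """Read the ground-truth file, pointing at the example file if it is absent."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; see {path}.example for the expected format"
        )
    with path.open(encoding="utf-8") as handle:
        return parse_eval_lines(handle)
```

Rejecting unknown labels at load time keeps typos in the ground truth from silently depressing every model's score.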
+
+### `--list-models`
+Prints the registry with sizes, adapter types, and default/slow flags.
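The three modes and the `--include-slow` flag could be wired up with `argparse` along these lines. Treating the modes as mutually exclusive and required is an assumption of this sketch, as are the help strings and the `build_parser` name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="benchmark_classifier.py",
        description="Benchmark email classifiers against the production baseline.",
    )
    # Assumption: exactly one mode must be chosen per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--compare", action="store_true",
                      help="pull live emails via IMAP and print a side-by-side table")
    mode.add_argument("--eval", action="store_true",
                      help="score all models against data/email_eval.jsonl")
    mode.add_argument("--list-models", action="store_true",
                      help="print the model registry and exit")
    parser.add_argument("--include-slow", action="store_true",
                        help="also run models that are not enabled by default")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

For example, `python scripts/benchmark_classifier.py --eval --include-slow` would parse to `eval=True`, `include_slow=True`, with the other mode flags left `False`.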
+
+## Conda Environment
+
+New env `job-seeker-classifiers`, isolated from the existing `job-seeker` env, which does not include torch.
+
+Key deps:
+- `torch` (CUDA-enabled)
+- `transformers`
+- `gliclass`
+- `FlagEmbedding` (for bge-reranker only)
+- `sentence-transformers` (optional, for future embedding-based approaches)
+
+## GPU
+
+The device is auto-selected: `device="cuda"` when available, otherwise CPU. There is no GPU
+pinning; models load one at a time, so only one model occupies VRAM at any moment.
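Device selection and VRAM release between models might look like the sketch below. Both helper names are hypothetical, and the `ImportError` fallback is an added assumption so the snippet also runs where torch is not installed; in the `job-seeker-classifiers` env torch would always be present.

```python
import gc


def pick_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: degrade gracefully without torch
    return "cuda" if torch.cuda.is_available() else "cpu"


def release_vram() -> None:
    """Call after an adapter's unload() so the next model starts from a clean slate."""
    gc.collect()  # drop Python references to the unloaded model first
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the driver
```

Calling `release_vram()` between models is what keeps peak usage close to the size of the largest single model rather than the sum of all of them.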
+
+## Error Handling
+
+- Model load failures: skip that column, print warning, continue
+- Classification errors: show `ERR` in cell, continue
+- IMAP failures: propagate (same as existing harness)
+- Missing eval file: clear error message pointing to `data/email_eval.jsonl.example`
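The first two rules above amount to two nested `try` blocks around the adapter lifecycle. The sketch below illustrates that shape; `run_models` and `StubAdapter` are hypothetical names, and the stub exists only so the example runs without any model installed.

```python
import sys


class StubAdapter:
    """Stand-in for a real adapter, used only to demonstrate the control flow."""

    def __init__(self, name, fail_load=False, fail_on=None):
        self.name = name
        self.fail_load = fail_load
        self.fail_on = fail_on  # subject that triggers a classification error

    def load(self):
        if self.fail_load:
            raise OSError("weights not found")

    def unload(self):
        pass

    def classify(self, subject, body):
        if subject == self.fail_on:
            raise ValueError("bad input")
        return "neutral"


def run_models(adapters, emails):
    """Return {model name: [label or "ERR" per email]}, skipping models that fail to load."""
    columns = {}
    for adapter in adapters:
        try:
            adapter.load()
        except Exception as exc:
            # Load failure: warn, skip this column, keep going.
            print(f"warning: skipping {adapter.name}: {exc}", file=sys.stderr)
            continue
        try:
            cells = []
            for subject, body in emails:
                try:
                    cells.append(adapter.classify(subject, body))
                except Exception:
                    cells.append("ERR")  # per-email failure becomes an ERR cell
            columns[adapter.name] = cells
        finally:
            adapter.unload()  # always release the model before loading the next
    return columns
```

Given one healthy stub, one that fails to load, and one that fails on a specific subject, the result contains two columns, and only the failing email's cell reads `ERR`.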
+
+## What Does Not Change (Yet)
+
+- `scripts/imap_sync.py` — production classifier unchanged
+- `scripts/llm_router.py` — unchanged
+- `staging.db` schema — unchanged
+
+After benchmark results are reviewed, a separate PR will add the winning model as an
+opt-in backend in `llm_router.py` and wire it into `classify_stage_signal()`.