From 7bbcde3f6927e8432ed68fb38a0ee31c78da7cab Mon Sep 17 00:00:00 2001
From: pyr0ball
Date: Thu, 26 Feb 2026 22:56:11 -0800
Subject: [PATCH] docs: email classifier benchmark design — adapter pattern,
 9-model registry, compare+eval modes
---
 ...02-26-email-classifier-benchmark-design.md | 132 ++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/plans/2026-02-26-email-classifier-benchmark-design.md

diff --git a/docs/plans/2026-02-26-email-classifier-benchmark-design.md b/docs/plans/2026-02-26-email-classifier-benchmark-design.md
new file mode 100644
index 0000000..23ba35f
--- /dev/null
+++ b/docs/plans/2026-02-26-email-classifier-benchmark-design.md
@@ -0,0 +1,132 @@

# Email Classifier Benchmark — Design

**Date:** 2026-02-26
**Status:** Approved

## Problem

The current `classify_stage_signal()` in `scripts/imap_sync.py` uses `llama3.1:8b` via
Ollama for 6-label email classification. This is slow, requires a running Ollama instance,
and its accuracy against alternatives is unverified. This design establishes a benchmark
harness to evaluate HuggingFace-native classifiers as potential replacements.

## Labels

```
interview_scheduled  offer_received   rejected
positive_response    survey_received  neutral
```

## Approach: Standalone Benchmark Script (Approach B)

Two new scripts (plus eval data and a conda env spec); nothing in `imap_sync.py` changes
until a winner is chosen.

```
scripts/
  benchmark_classifier.py    — CLI entry point
  classifier_adapters.py     — adapter classes (reusable by imap_sync later)

data/
  email_eval.jsonl           — labeled ground truth (gitignored — contains email content)
  email_eval.jsonl.example   — committed example with fake emails

scripts/classifier_service/
  environment.yml            — new conda env: job-seeker-classifiers
```

## Adapter Pattern

```
ClassifierAdapter (ABC)
  .classify(subject, body) → str   # one of the 6 labels
  .name → str
  .model_id → str
  .load() / .unload()              # explicit lifecycle

ZeroShotAdapter(ClassifierAdapter)
  # uses transformers pipeline("zero-shot-classification")
  # candidate_labels = list of 6 labels
  # works for: DeBERTa, BART-MNLI, BGE-M3-ZeroShot, XLM-RoBERTa

GLiClassAdapter(ClassifierAdapter)
  # uses gliclass library (pip install gliclass)
  # GLiClassModel + ZeroShotClassificationPipeline
  # works for: gliclass-instruct-large-v1.0

RerankerAdapter(ClassifierAdapter)
  # uses FlagEmbedding reranker.compute_score()
  # scores (email_text, label_description) pairs; highest = predicted label
  # works for: bge-reranker-v2-m3
```

## Model Registry

| Short name | Model | Params | Adapter | Default |
|------------|-------|--------|---------|---------|
| `deberta-zeroshot` | MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0 | 400M | ZeroShot | ✅ |
| `deberta-small` | cross-encoder/nli-deberta-v3-small | 100M | ZeroShot | ✅ |
| `gliclass-large` | knowledgator/gliclass-instruct-large-v1.0 | 400M | GLiClass | ✅ |
| `bart-mnli` | facebook/bart-large-mnli | 400M | ZeroShot | ✅ |
| `bge-m3-zeroshot` | MoritzLaurer/bge-m3-zeroshot-v2.0 | 600M | ZeroShot | ✅ |
| `bge-reranker` | BAAI/bge-reranker-v2-m3 | 600M | Reranker | ❌ (`--include-slow`) |
| `deberta-xlarge` | microsoft/deberta-xlarge-mnli | 750M | ZeroShot | ❌ (`--include-slow`) |
| `mdeberta-mnli` | MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 300M | ZeroShot | ❌ (`--include-slow`) |
| `xlm-roberta-anli` | vicgalle/xlm-roberta-large-xnli-anli | 600M | ZeroShot | ❌ (`--include-slow`) |

## CLI Modes

### `--compare` (live IMAP, visual table)

Extends the pattern of `test_email_classify.py`. Pulls emails via IMAP and shows a table:

```
Subject | Phrase | llama3 | deberta-zs | deberta-sm | gliclass | bart | bge-m3
```

- Phrase-filter column shows BLOCK/pass (same gate as production)
- `llama3` column = current production baseline
- HF model columns follow

### `--eval` (ground-truth evaluation)

Reads `data/email_eval.jsonl`, runs all models, and reports per-label and aggregate metrics:

- Per-label: precision, recall, F1
- Aggregate: macro-F1, accuracy
- Latency: ms/email per model

JSONL format:

```jsonl
{"subject": "Interview invitation", "body": "We'd like to schedule...", "label": "interview_scheduled"}
{"subject": "Your application", "body": "We regret to inform you...", "label": "rejected"}
```

### `--list-models`

Prints the registry with sizes, adapter types, and default/slow flags.

## Conda Environment

New env `job-seeker-classifiers` — isolated from `job-seeker` (no torch there).

Key deps:

- `torch` (CUDA-enabled)
- `transformers`
- `gliclass`
- `FlagEmbedding` (for bge-reranker only)
- `sentence-transformers` (optional, for future embedding-based approaches)

## GPU

Auto-select (`device="cuda"` when available, CPU fallback). No GPU pinning — models
load one at a time, so VRAM pressure is sequential, not cumulative.
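
The adapter lifecycle and device auto-selection described above can be sketched in a few lines. This is a sketch only: the class and method names follow the Adapter Pattern section, but the exact `transformers` pipeline wiring, the `LABELS` constant, and the subject/body concatenation are illustrative assumptions, not settled implementation.

```python
# Sketch of ClassifierAdapter + ZeroShotAdapter per this design.
# The pipeline wiring and LABELS constant are illustrative assumptions.
from abc import ABC, abstractmethod

LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]

class ClassifierAdapter(ABC):
    name: str
    model_id: str

    @abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return one of the 6 labels."""

    def load(self) -> None:    # explicit lifecycle; default is a no-op
        pass

    def unload(self) -> None:
        pass

class ZeroShotAdapter(ClassifierAdapter):
    def __init__(self, name: str, model_id: str):
        self.name = name
        self.model_id = model_id
        self._pipe = None  # loaded lazily so the registry imports without torch

    def load(self) -> None:
        # Deferred imports keep torch out of environments that never benchmark.
        import torch
        from transformers import pipeline
        device = 0 if torch.cuda.is_available() else -1  # GPU auto-select
        self._pipe = pipeline("zero-shot-classification",
                              model=self.model_id, device=device)

    def unload(self) -> None:
        self._pipe = None  # drop the pipeline so the next model can claim VRAM

    def classify(self, subject: str, body: str) -> str:
        result = self._pipe(f"{subject}\n\n{body}", candidate_labels=LABELS)
        return result["labels"][0]  # pipeline sorts labels by score, best first
```

Because models are benchmarked one at a time, `unload()` only needs to release the pipeline reference before the next adapter's `load()` runs, which is what keeps VRAM pressure sequential rather than cumulative.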

## Error Handling

- Model load failures: skip that column, print a warning, continue
- Classification errors: show `ERR` in the cell, continue
- IMAP failures: propagate (same as the existing harness)
- Missing eval file: clear error message pointing to `data/email_eval.jsonl.example`

## What Does Not Change (Yet)

- `scripts/imap_sync.py` — production classifier unchanged
- `scripts/llm_router.py` — unchanged
- `staging.db` schema — unchanged

After benchmark results are reviewed, a separate PR will wire the winning model
into `classify_stage_signal()` as an opt-in backend in `llm_router.py`.
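
As a reference point for reviewing results, the per-label and aggregate metrics described under `--eval` reduce to a small amount of pure Python, with no sklearn dependency. A sketch (the `eval_metrics` helper name is illustrative, not part of the planned CLI):

```python
# Sketch of the --eval metrics: per-label precision/recall/F1, macro-F1, accuracy.
# eval_metrics is an illustrative name, not a committed interface.
from collections import Counter

def eval_metrics(gold: list, pred: list, labels: list) -> dict:
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # true label gets a false negative
    per_label = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label[lab] = {"precision": prec, "recall": rec, "f1": f1}
    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(labels)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return {"per_label": per_label, "macro_f1": macro_f1, "accuracy": accuracy}
```

Macro-F1 averages F1 over labels with equal weight, which is the right aggregate here: the eval set will likely be dominated by `neutral` and `rejected`, and plain accuracy would hide weak recall on rare labels like `offer_received`.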