docs: email classifier benchmark design — adapter pattern, 9-model registry, compare+eval modes
parent ae7c985fab
commit a7fe4d9ff4
1 changed files with 132 additions and 0 deletions
docs/plans/2026-02-26-email-classifier-benchmark-design.md (Normal file)
@@ -0,0 +1,132 @@
# Email Classifier Benchmark — Design

**Date:** 2026-02-26
**Status:** Approved

## Problem

The current `classify_stage_signal()` in `scripts/imap_sync.py` uses `llama3.1:8b` via
Ollama for 6-label email classification. This is slow and requires a running Ollama
instance, and its accuracy has not been verified against alternatives. This design
establishes a benchmark harness to evaluate HuggingFace-native classifiers as potential
replacements.

## Labels

```
interview_scheduled  offer_received   rejected
positive_response    survey_received  neutral
```

## Approach: Standalone Benchmark Script (Approach B)

Two new scripts plus supporting data and environment files; nothing in `imap_sync.py` changes until a winner is chosen.

```
scripts/
  benchmark_classifier.py    # CLI entry point
  classifier_adapters.py     # adapter classes (reusable by imap_sync later)

data/
  email_eval.jsonl           # labeled ground truth (gitignored: contains email content)
  email_eval.jsonl.example   # committed example with fake emails

scripts/classifier_service/
  environment.yml            # new conda env: job-seeker-classifiers
```

## Adapter Pattern

```
ClassifierAdapter (ABC)
  .classify(subject, body) → str   # one of the 6 labels
  .name → str
  .model_id → str
  .load() / .unload()              # explicit lifecycle

ZeroShotAdapter(ClassifierAdapter)
  # uses transformers pipeline("zero-shot-classification")
  # candidate_labels = list of 6 labels
  # works for: DeBERTa, BART-MNLI, BGE-M3-ZeroShot, XLM-RoBERTa

GLiClassAdapter(ClassifierAdapter)
  # uses gliclass library (pip install gliclass)
  # GLiClassModel + ZeroShotClassificationPipeline
  # works for: gliclass-instruct-large-v1.0

RerankerAdapter(ClassifierAdapter)
  # uses FlagEmbedding reranker.compute_score()
  # scores (email_text, label_description) pairs; highest = predicted label
  # works for: bge-reranker-v2-m3
```
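
The interface above could be sketched in Python roughly as follows. This is a minimal illustration, not the final `classifier_adapters.py`: the constructor signature, the `device` parameter, and the way subject and body are joined into one text are assumptions made for the example.

```python
from abc import ABC, abstractmethod

LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]


class ClassifierAdapter(ABC):
    """Common interface: one adapter instance wraps one model."""

    def __init__(self, name: str, model_id: str) -> None:
        self.name = name
        self.model_id = model_id

    def load(self) -> None:
        """Load model weights. Subclasses override this."""

    def unload(self) -> None:
        """Release the model so the next adapter can reuse the memory."""

    @abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return exactly one of LABELS."""


class ZeroShotAdapter(ClassifierAdapter):
    def __init__(self, name: str, model_id: str, device: int = -1) -> None:
        super().__init__(name, model_id)
        self.device = device  # -1 = CPU, 0 = first CUDA device
        self._pipe = None

    def load(self) -> None:
        # Imported lazily so that merely importing this module
        # does not require transformers to be installed.
        from transformers import pipeline

        self._pipe = pipeline(
            "zero-shot-classification", model=self.model_id, device=self.device
        )

    def unload(self) -> None:
        self._pipe = None

    def classify(self, subject: str, body: str) -> str:
        if self._pipe is None:
            raise RuntimeError(f"{self.name}: call load() before classify()")
        text = f"Subject: {subject}\n\n{body}"  # assumed input format
        result = self._pipe(text, candidate_labels=LABELS)
        return result["labels"][0]  # pipeline sorts labels by descending score
```

Keeping `load()` separate from `__init__` lets the benchmark build every adapter up front while holding only one model in memory at a time.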

## Model Registry

| Short name | Model | Params | Adapter | Default |
|------------|-------|--------|---------|---------|
| `deberta-zeroshot` | MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0 | 400M | ZeroShot | ✅ |
| `deberta-small` | cross-encoder/nli-deberta-v3-small | 100M | ZeroShot | ✅ |
| `gliclass-large` | knowledgator/gliclass-instruct-large-v1.0 | 400M | GLiClass | ✅ |
| `bart-mnli` | facebook/bart-large-mnli | 400M | ZeroShot | ✅ |
| `bge-m3-zeroshot` | MoritzLaurer/bge-m3-zeroshot-v2.0 | 600M | ZeroShot | ✅ |
| `bge-reranker` | BAAI/bge-reranker-v2-m3 | 600M | Reranker | ❌ (`--include-slow`) |
| `deberta-xlarge` | microsoft/deberta-xlarge-mnli | 750M | ZeroShot | ❌ (`--include-slow`) |
| `mdeberta-mnli` | MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 300M | ZeroShot | ❌ (`--include-slow`) |
| `xlm-roberta-anli` | vicgalle/xlm-roberta-large-xnli-anli | 600M | ZeroShot | ❌ (`--include-slow`) |
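
One possible in-code form of this table is a plain dict keyed by short name. The `ModelSpec` dataclass and the `select_models()` helper are hypothetical names used only for this sketch; the rows mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    model_id: str
    params: str    # approximate size, as listed in the table
    adapter: str   # "ZeroShot", "GLiClass", or "Reranker"
    default: bool  # False = only run with --include-slow


MODEL_REGISTRY = {
    "deberta-zeroshot": ModelSpec("MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0", "400M", "ZeroShot", True),
    "deberta-small": ModelSpec("cross-encoder/nli-deberta-v3-small", "100M", "ZeroShot", True),
    "gliclass-large": ModelSpec("knowledgator/gliclass-instruct-large-v1.0", "400M", "GLiClass", True),
    "bart-mnli": ModelSpec("facebook/bart-large-mnli", "400M", "ZeroShot", True),
    "bge-m3-zeroshot": ModelSpec("MoritzLaurer/bge-m3-zeroshot-v2.0", "600M", "ZeroShot", True),
    "bge-reranker": ModelSpec("BAAI/bge-reranker-v2-m3", "600M", "Reranker", False),
    "deberta-xlarge": ModelSpec("microsoft/deberta-xlarge-mnli", "750M", "ZeroShot", False),
    "mdeberta-mnli": ModelSpec("MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", "300M", "ZeroShot", False),
    "xlm-roberta-anli": ModelSpec("vicgalle/xlm-roberta-large-xnli-anli", "600M", "ZeroShot", False),
}


def select_models(include_slow: bool = False) -> list[str]:
    """Return short names in table order, adding non-default models on request."""
    return [name for name, spec in MODEL_REGISTRY.items() if spec.default or include_slow]
```

With this shape, `select_models()` yields the five default models and `select_models(include_slow=True)` yields all nine.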

## CLI Modes

### `--compare` (live IMAP, visual table)
Extends the pattern of `test_email_classify.py`. Pulls emails via IMAP and shows a table:
```
Subject | Phrase | llama3 | deberta-zs | deberta-sm | gliclass | bart | bge-m3
```
- Phrase-filter column shows BLOCK/pass (same gate as production)
- `llama3` column = current production baseline
- HF model columns follow
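
A row of that table could be rendered with a small helper like the one below. `format_row` and the column widths are hypothetical, chosen only to illustrate fixed-width cells with truncation of long subjects.

```python
def format_row(cells, widths):
    """Pad or truncate each cell to its column width and join with ' | '."""
    out = []
    for cell, width in zip(cells, widths):
        text = str(cell)
        if len(text) > width:
            text = text[: width - 1] + "…"  # mark truncated subjects
        out.append(text.ljust(width))
    return " | ".join(out)


# Assumed layout: one wide subject column, then one column per classifier.
HEADER = ["Subject", "Phrase", "llama3", "deberta-zs", "deberta-sm", "gliclass", "bart", "bge-m3"]
WIDTHS = [30, 6, 19, 19, 19, 19, 19, 19]

if __name__ == "__main__":
    print(format_row(HEADER, WIDTHS))
    print(format_row(["Interview invitation", "pass"] + ["interview_scheduled"] * 6, WIDTHS))
```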

### `--eval` (ground-truth evaluation)
Reads `data/email_eval.jsonl`, runs all models, reports per-label and aggregate metrics:
- Per-label: precision, recall, F1
- Aggregate: macro-F1, accuracy
- Latency: ms/email per model
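
The metrics listed above can be computed without extra dependencies. The sketch below is one way to do it; the function names are hypothetical, and it assumes that a label with no predictions or no true examples scores 0.0 rather than raising.

```python
import time


def evaluate(y_true, y_pred, labels):
    """Per-label precision/recall/F1 plus macro-F1 and accuracy."""
    pairs = list(zip(y_true, y_pred))
    per_label = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro-F1 is the unweighted mean of per-label F1 scores.
    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(labels)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"per_label": per_label, "macro_f1": macro_f1, "accuracy": accuracy}


def ms_per_email(classify, emails):
    """Average wall-clock milliseconds per classify(subject, body) call."""
    start = time.perf_counter()
    for subject, body in emails:
        classify(subject, body)
    return (time.perf_counter() - start) * 1000 / len(emails)
```

Macro-F1 weights every label equally, which matters here because rare labels such as `offer_received` would otherwise be drowned out by `neutral`.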

JSONL format:
```jsonl
{"subject": "Interview invitation", "body": "We'd like to schedule...", "label": "interview_scheduled"}
{"subject": "Your application", "body": "We regret to inform you...", "label": "rejected"}
```
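
A loader for this format might look like the following. It takes any iterable of lines so it works with an open file or an in-memory list; the validation rules (required keys, known labels, blank lines skipped) are assumptions, not something the design specifies.

```python
import json

LABELS = {
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
}
REQUIRED_KEYS = {"subject", "body", "label"}


def load_eval_rows(lines):
    """Parse JSONL lines into dicts, rejecting malformed or mislabeled rows."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        if row["label"] not in LABELS:
            raise ValueError(f"line {lineno}: unknown label {row['label']!r}")
        rows.append(row)
    return rows
```

Typical use would be `with open("data/email_eval.jsonl", encoding="utf-8") as f: rows = load_eval_rows(f)`.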

### `--list-models`
Prints the registry with sizes, adapter types, and default/slow flags.
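
The three modes and the `--include-slow` flag map naturally onto `argparse`. The sketch below assumes the modes are mutually exclusive and that exactly one is required; the design does not state this, so treat it as one possible reading. The help strings are also illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="benchmark_classifier.py",
        description="Benchmark HuggingFace classifiers against the llama3 baseline.",
    )
    # Assumption: exactly one mode per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--compare", action="store_true",
                      help="pull emails via IMAP and print a side-by-side table")
    mode.add_argument("--eval", action="store_true",
                      help="score models against data/email_eval.jsonl")
    mode.add_argument("--list-models", action="store_true",
                      help="print the model registry and exit")
    parser.add_argument("--include-slow", action="store_true",
                        help="also run models that are off by default")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```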

## Conda Environment

New env `job-seeker-classifiers`, isolated from `job-seeker` (which has no torch).

Key deps:
- `torch` (CUDA-enabled)
- `transformers`
- `gliclass`
- `FlagEmbedding` (for bge-reranker only)
- `sentence-transformers` (optional, for future embedding-based approaches)

## GPU

Auto-select (`device="cuda"` when available, CPU fallback). No GPU pinning: models
load one at a time, so VRAM pressure is sequential, not cumulative.
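
The device choice and the one-model-at-a-time loop could be sketched as below. `pick_device`, `release_memory`, and `run_sequentially` are hypothetical helpers; adapters are assumed to expose `name`, `load()`, `unload()`, and `classify()` as in the adapter pattern.

```python
import gc


def pick_device() -> str:
    """Return "cuda" when torch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def release_memory() -> None:
    """Drop unreferenced objects and hand cached VRAM back to the driver."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_sequentially(adapters, emails):
    """Load, run, and unload each adapter in turn so only one model is resident."""
    results = {}
    for adapter in adapters:
        adapter.load()
        try:
            results[adapter.name] = [adapter.classify(s, b) for s, b in emails]
        finally:
            # Unload even if classification raised, so the next model has room.
            adapter.unload()
            release_memory()
    return results
```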

## Error Handling

- Model load failures: skip that column, print warning, continue
- Classification errors: show `ERR` in cell, continue
- IMAP failures: propagate (same as existing harness)
- Missing eval file: clear error message pointing to `data/email_eval.jsonl.example`
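
Three of these rules translate into small wrappers; IMAP failures need no wrapper because they simply propagate. The helper names and the exact wording of the messages below are assumptions for illustration.

```python
import sys
from pathlib import Path


def try_load(adapter) -> bool:
    """Load one adapter; on failure warn and report False so its column is skipped."""
    try:
        adapter.load()
    except Exception as exc:
        print(f"WARNING: skipping {adapter.name}: {exc}", file=sys.stderr)
        return False
    return True


def classify_cell(adapter, subject: str, body: str) -> str:
    """Return the predicted label, or "ERR" if this one email fails."""
    try:
        return adapter.classify(subject, body)
    except Exception:
        return "ERR"


def require_eval_file(path: str = "data/email_eval.jsonl") -> Path:
    """Exit with a pointer to the committed example when the eval file is absent."""
    eval_path = Path(path)
    if not eval_path.is_file():
        raise SystemExit(
            f"Eval file not found: {eval_path}\n"
            f"See {eval_path}.example for the expected format."
        )
    return eval_path
```

Catching broad `Exception` is deliberate in this sketch: a benchmark run over nine models should not die because one model or one email misbehaves.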

## What Does Not Change (Yet)

- `scripts/imap_sync.py`: production classifier unchanged
- `scripts/llm_router.py`: unchanged
- `staging.db` schema: unchanged

After benchmark results are reviewed, a separate PR will wire the winning model
into `classify_stage_signal()` as an opt-in backend in `llm_router.py`.