pyr0ball a06b133a6e docs(avocet): document email field schemas and normalization layer

2026-03-03 18:43:41 -08:00


Avocet — Email Classifier Training Tool

What it is

Shared infrastructure for building and benchmarking email classifiers across the CircuitForge menagerie. Named for the avocet's sweeping-bill technique — it sweeps through email streams and filters out categories.

Pipeline:

Scrape (IMAP, wide search, multi-account) → data/email_label_queue.jsonl
                ↓
Label (card-stack UI)                      → data/email_score.jsonl
                ↓
Benchmark (HuggingFace NLI/reranker)       → per-model macro-F1 + latency
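The benchmark stage reports macro-F1 per model. As a rough illustration of that metric (not necessarily how benchmark_classifier.py computes it), the sketch below averages per-label F1 with equal weight for every label, so rare labels count as much as common ones. The function name `macro_f1` and the choice to score every label seen in either list are assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Latency can be measured separately by wrapping each model's classify call in `time.perf_counter()` readings.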

Environment

Python env: conda run -n job-seeker <cmd> for basic use (streamlit, yaml, stdlib only)
Classifier env: conda run -n job-seeker-classifiers <cmd> for benchmark (transformers, FlagEmbedding, gliclass)
Run tests: /devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v (direct binary — conda run pytest can spawn runaway processes)
Create classifier env: conda env create -f environment.yml

Label Tool (app/label_tool.py)

Card-stack Streamlit UI for manually labeling recruitment emails.

conda run -n job-seeker streamlit run app/label_tool.py --server.port 8503

Config: config/label_tool.yaml (gitignored — copy from .example, or use ⚙️ Settings tab)
Queue: data/email_label_queue.jsonl (gitignored)
Output: data/email_score.jsonl (gitignored)
Four tabs: 🃏 Label, 📥 Fetch, 📊 Stats, ⚙️ Settings
Keyboard shortcuts: 1–9 = label, 0 = Other (wildcard, prompts free-text input), S = skip, U = undo
Dedup: MD5 of (subject + body[:100]) — cross-account safe
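The dedup rule above can be sketched as follows. This is a minimal illustration, not the code in label_tool.py: the helper names `dedup_key` and `dedupe`, the UTF-8 encoding, and the plain concatenation with no separator are assumptions.

```python
import hashlib


def dedup_key(subject: str, body: str) -> str:
    """MD5 hex digest of the subject plus the first 100 characters of the body."""
    return hashlib.md5((subject + body[:100]).encode("utf-8")).hexdigest()


def dedupe(emails):
    """Keep the first email seen for each content key, whatever account fetched it."""
    seen = set()
    unique = []
    for email in emails:
        key = dedup_key(email.get("subject", ""), email.get("body", ""))
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

Because the key depends only on content, the same message fetched through two accounts collapses to a single queue entry.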

Settings Tab (⚙️)

Add / edit / remove IMAP accounts via form UI — no manual YAML editing required
Per-account fields: display name, host, port, SSL toggle, username, password (masked), folder, days back
🔌 Test connection button per account — connects, logs in, selects folder, reports message count
Global: max emails per account per fetch
💾 Save writes config/label_tool.yaml; ↩ Reload discards unsaved changes
_sync_settings_to_state() collects widget values before any add/remove to avoid index-key drift
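The steps behind the Test connection button (connect, log in, select folder, report message count) map directly onto the standard-library `imaplib` client. The sketch below is one possible shape for that check; the function name `check_imap_connection` and its parameters are hypothetical and do not come from label_tool.py.

```python
import imaplib


def check_imap_connection(host, port, username, password,
                          folder="INBOX", use_ssl=True):
    """Connect, log in, select the folder, and return its message count."""
    client_cls = imaplib.IMAP4_SSL if use_ssl else imaplib.IMAP4
    conn = client_cls(host, port)
    try:
        conn.login(username, password)
        # readonly=True avoids changing message flags during a test
        status, data = conn.select(folder, readonly=True)
        if status != "OK":
            raise RuntimeError(f"could not select folder {folder!r}: {data}")
        return int(data[0])  # SELECT returns the message count as bytes
    finally:
        try:
            conn.logout()
        except Exception:
            pass  # the connection may already be closed
```

A failed login raises `imaplib.IMAP4.error`, which a UI can catch and show next to the account.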

Benchmark (scripts/benchmark_classifier.py)

# List available models
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --list-models

# Score against labeled JSONL
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score

# Visual comparison on live IMAP emails
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --compare --limit 20

# Include slow/large models
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --score --include-slow

# Export DB-labeled emails (⚠️ LLM-generated labels — review first)
conda run -n job-seeker-classifiers python scripts/benchmark_classifier.py --export-db --db /path/to/staging.db

Labels (peregrine defaults — configurable per product)

| Label | Key | Meaning |
| --- | --- | --- |
| `interview_scheduled` | 1 | Phone screen, video call, or on-site invitation |
| `offer_received` | 2 | Formal job offer or offer letter |
| `rejected` | 3 | Application declined or not moving forward |
| `positive_response` | 4 | Recruiter interest or request to connect |
| `survey_received` | 5 | Culture-fit survey or assessment invitation |
| `neutral` | 6 | ATS confirmation (application received, etc.) |
| `event_rescheduled` | 7 | Interview or event moved to a new time |
| `digest` | 8 | Job digest or multi-listing email (scrapeable) |
| `new_lead` | 9 | Unsolicited recruiter outreach or cold contact |
| `hired` | h | Offer accepted, onboarding, welcome email, start date |

Model Registry (13 models, 7 defaults)

See scripts/benchmark_classifier.py:MODEL_REGISTRY. Default models run without --include-slow. Add --models deberta-small deberta-small-2pass to test a specific subset.

Config Files

config/label_tool.yaml — gitignored; multi-account IMAP config
config/label_tool.yaml.example — committed template

Data Files

data/email_score.jsonl — gitignored; manually-labeled ground truth
data/email_score.jsonl.example — committed sample for CI
data/email_label_queue.jsonl — gitignored; IMAP fetch queue

Key Design Notes

ZeroShotAdapter.load() instantiates the pipeline object; classify() calls the object. Tests patch scripts.classifier_adapters.pipeline (the module-level factory) with a two-level mock: mock_factory.return_value = MagicMock(return_value={...}).
two_pass=True on ZeroShotAdapter: the first pass ranks all candidate labels; the second pass re-runs with only the top 2, forcing a binary choice. Roughly 2× the inference cost, in exchange for a higher-confidence final prediction.
--compare uses the first account in label_tool.yaml for live IMAP emails.
DB export labels are llama3.1:8b-generated — treat as noisy, not gold truth.
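The two-pass idea from the notes above can be sketched independently of any model. In this illustration `score` is a hypothetical callable standing in for the zero-shot pipeline object: it takes the text and a list of candidate labels and returns a score per label. The name `classify_two_pass` is also an assumption, not the adapter's real method.

```python
from typing import Callable, Dict, List

# Hypothetical scorer signature: (text, candidate_labels) -> {label: score}
Scorer = Callable[[str, List[str]], Dict[str, float]]


def classify_two_pass(text: str, labels: List[str], score: Scorer) -> str:
    """Rank all labels, then re-score only the top two and return the winner."""
    first = score(text, labels)
    top_two = sorted(first, key=first.get, reverse=True)[:2]
    second = score(text, top_two)  # binary choice between the finalists
    return max(second, key=second.get)
```

The scorer runs twice per email, which is where the doubled cost comes from.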

Vue Label UI (app/api.py + web/)

FastAPI on port 8503 serves both the REST API and the built Vue SPA (web/dist/).

./manage.sh start-api    # build Vue SPA + start FastAPI (binds 0.0.0.0:8503 — LAN accessible)
./manage.sh stop-api
./manage.sh open-api     # xdg-open http://localhost:8503

Logs: log/api.log

Email Field Schema — IMPORTANT

Two schemas exist. The normalization layer in app/api.py bridges them automatically.

JSONL on-disk schema (written by `label_tool.py`, including its IMAP fetch)

| Field | Type | Notes |
| --- | --- | --- |
| `subject` | str | Email subject line |
| `body` | str | Plain-text body, truncated at 800 chars; HTML stripped by `_strip_html()` |
| `from_addr` | str | Sender address string (`"Name <addr>"`) |
| `date` | str | Raw RFC 2822 date string |
| `account` | str | Display name of the IMAP account that fetched it |
| (no `id`) | n/a | Dedup key is MD5 of `(subject + body[:100])`; never stored on disk |

Vue API schema (returned by `GET /api/queue`, required by POST endpoints)

| Field | Type | Notes |
| --- | --- | --- |
| `id` | str | MD5 content hash, or stored `id` if item has one |
| `subject` | str | Unchanged |
| `body` | str | Unchanged |
| `from` | str | Mapped from `from_addr` (or `from` if already present) |
| `date` | str | Unchanged |
| `source` | str | Mapped from `account` (or `source` if already present) |

Normalization layer (`_normalize()` in `app/api.py`)

_normalize(item) handles the mapping and ID generation. All GET /api/queue responses pass through it. Mutating endpoints (/api/label, /api/skip, /api/discard) look up items via _normalize(x)["id"], so both real data (no id, uses content hash) and test fixtures (explicit id field) work transparently.
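A minimal sketch of what such a normalization function might look like, based only on the two schema tables above. The real `_normalize()` in app/api.py may differ; the name `normalize`, the empty-string defaults, and the precedence when both old and new field names are present are assumptions.

```python
import hashlib


def normalize(item: dict) -> dict:
    """Map an on-disk JSONL record (or a test fixture) to the Vue API schema."""
    subject = item.get("subject", "")
    body = item.get("body", "")
    # Same content key as the dedup rule: MD5 of subject + body[:100]
    content_hash = hashlib.md5((subject + body[:100]).encode("utf-8")).hexdigest()
    return {
        "id": item.get("id") or content_hash,      # stored id wins if present
        "subject": subject,
        "body": body,
        "from": item.get("from") or item.get("from_addr", ""),
        "date": item.get("date", ""),
        "source": item.get("source") or item.get("account", ""),
    }


def find_by_id(queue, item_id):
    """Look up a queue item the way a mutating endpoint might, via its normalized id."""
    return next((x for x in queue if normalize(x)["id"] == item_id), None)
```

With this shape, a record read from disk and a fixture carrying an explicit `id` can both be located by the same lookup.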

Peregrine integration

Peregrine's staging.db uses different field names again:

| staging.db column | Maps to avocet JSONL field |
| --- | --- |
| `subject` | `subject` |
| `body` | `body` (may contain HTML; run through `_strip_html()` before queuing) |
| `from_address` | `from_addr` |
| `received_date` | `date` |
| `account` or source context | `account` |

When exporting from Peregrine's DB for avocet labeling, transform to the JSONL schema above (not the Vue API schema). The --export-db flag in benchmark_classifier.py does this. Any new export path should also call _strip_html() on the body before writing.
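For a new export path, the column mapping above might be applied roughly as follows. This is a sketch under stated assumptions: `strip_html` is a simplified regex stand-in for the project's `_strip_html()` (whose actual behavior is not shown here), `staging_row_to_jsonl` and its `default_account` parameter are hypothetical, and the 800-character truncation mirrors the on-disk schema note.

```python
import json
import re
from html import unescape


def strip_html(text: str) -> str:
    """Simplified stand-in for _strip_html(): drop tags, decode entities."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def staging_row_to_jsonl(row: dict, default_account: str = "") -> str:
    """Convert one staging.db row (as a dict) into one on-disk JSONL line."""
    record = {
        "subject": row.get("subject") or "",
        "body": strip_html(row.get("body") or "")[:800],
        "from_addr": row.get("from_address") or "",
        "date": row.get("received_date") or "",
        "account": row.get("account") or default_account,
    }
    # No "id" field: the dedup key is derived from content, not stored
    return json.dumps(record, ensure_ascii=False)
```

Rows fetched with `sqlite3.Row` as the row factory can be passed in as `dict(row)`.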

Relationship to Peregrine

Avocet started as peregrine/tools/label_tool.py + peregrine/scripts/classifier_adapters.py. Peregrine retains copies during stabilization; once avocet is proven, peregrine will import from here.
