docs: digest email parser design — LinkedIn/Adzuna/Ladders registry + Avocet bucket

2026-03-05 12:56:53 -08:00 · 2026-03-05 12:56:53 -08:00 · cc0b8d716c
commit cc0b8d716c
parent 9229f9ce69
1 changed files with 242 additions and 0 deletions
--- a/docs/plans/2026-03-05-digest-parsers-design.md
+++ b/docs/plans/2026-03-05-digest-parsers-design.md
@ -0,0 +1,242 @@
+# Digest Email Parsers — Design
+
+**Date:** 2026-03-05
+**Products:** Peregrine (primary), Avocet (bucket)
+**Status:** Design approved, ready for implementation planning
+
+---
+
+## Problem
+
+Peregrine's `imap_sync.py` can extract leads from digest emails, but only for LinkedIn — the
+parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
+unhandled. Additionally, any digest email from an unknown sender is silently dropped with no
+way to collect samples for building new parsers.
+
+---
+
+## Solution Overview
+
+Two complementary changes:
+
+1. **`peregrine/scripts/digest_parsers.py`** — a standalone parser module with a sender registry
+   and dispatcher. `imap_sync.py` calls a single function; the registry handles dispatch.
+   LinkedIn parser moves here; Adzuna and Ladders parsers are built against real IMAP samples.
+
+2. **Avocet digest bucket** — when a user labels an email as `digest` in the Avocet label UI,
+   the email is appended to `data/digest_samples.jsonl`. This file is the corpus for building
+   and testing new parsers for senders not yet in the registry.
+
+---
+
+## Architecture
+
+### Production path (Peregrine)
+
+```
+imap_sync._scan_unmatched_leads()
+    │
+    ├─ parse_digest(from_addr, body)
+    │       │
+    │       ├─ None  → unknown sender → fall through to LLM extraction (unchanged)
+    │       ├─ []    → known sender, nothing found → skip
+    │       └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
+    │
+    └─ continue  (digest email consumed; does not reach LLM path)
+```
+
+### Sample collection path (Avocet)
+
+```
+Avocet label UI
+    │
+    └─ label == "digest"
+            │
+            └─ append to data/digest_samples.jsonl
+                    │
+                    └─ used as reference for building new parsers
+```
+
+---
+
+## Module: `peregrine/scripts/digest_parsers.py`
+
+### Parser interface
+
+Each parser function:
+
+```python
+def parse_<source>(body: str) -> list[dict]
+```
+
+Returns zero or more job dicts:
+
+```python
+{
+    "title":    str,   # job title
+    "company":  str,   # company name
+    "location": str,   # location string (may be empty)
+    "url":      str,   # canonical URL, tracking params stripped
+    "source":   str,   # "linkedin" | "adzuna" | "theladders"
+}
+```
+
+### Dispatcher
+
+```python
+DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
+    "jobalerts@linkedin.com":  ("linkedin",   parse_linkedin),
+    "noreply@adzuna.com":      ("adzuna",     parse_adzuna),
+    "noreply@theladders.com":  ("theladders", parse_theladders),
+}
+
+def parse_digest(from_addr: str, body: str) -> list[dict] | None:
+    """
+    Dispatch to the appropriate parser based on sender address.
+
+    Returns:
+        None        — no parser matched (not a known digest sender)
+        []          — parser matched, no extractable jobs found
+        [dict, ...] — one dict per job card extracted
+    """
+    addr = from_addr.lower()
+    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
+        if sender in addr:
+            return parse_fn(body)
+    return None
+```
+
+Sender matching is a substring check, tolerant of display-name wrappers
+(`"LinkedIn <jobalerts@linkedin.com>"` matches correctly).
+
+### Parsers
+
+**`parse_linkedin`** — moved verbatim from `imap_sync.parse_linkedin_alert()`, renamed.
+No behavior change.
+
+**`parse_adzuna`** — built against real Adzuna digest email bodies pulled from the
+configured IMAP account during implementation. Expected format: job blocks separated
+by consistent delimiters with title, company, location, and a trackable URL per block.
+
+**`parse_theladders`** — same approach. The Ladders already has a web scraper in
+`scripts/custom_boards/theladders.py`; URL canonicalization patterns from there apply here.
+
+---
+
+## Changes to `imap_sync.py`
+
+Replace the LinkedIn-specific block in `_scan_unmatched_leads()` (~lines 561–585):
+
+**Before:**
+```python
+if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
+    cards = parse_linkedin_alert(parsed["body"])
+    for card in cards:
+        # ... LinkedIn-specific insert ...
+    known_message_ids.add(mid)
+    continue
+```
+
+**After:**
+```python
+from scripts.digest_parsers import parse_digest  # top of file
+
+cards = parse_digest(parsed["from_addr"], parsed["body"])
+if cards is not None:
+    for card in cards:
+        if card["url"] in existing_urls:
+            continue
+        job_id = insert_job(db_path, {
+            "title":      card["title"],
+            "company":    card["company"],
+            "url":        card["url"],
+            "source":     card["source"],
+            "location":   card["location"],
+            "is_remote":  0,
+            "salary":     "",
+            "description": "",
+            "date_found": datetime.now().isoformat()[:10],
+        })
+        if job_id:
+            submit_task(db_path, "scrape_url", job_id)
+            existing_urls.add(card["url"])
+            new_leads += 1
+            print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
+    known_message_ids.add(mid)
+    continue
+```
+
+`parse_digest` returning `None` falls through to the existing LLM extraction path — all
+non-digest recruitment emails are completely unaffected.
+
+---
+
+## Avocet: Digest Bucket
+
+### File
+
+`avocet/data/digest_samples.jsonl` — gitignored. An `.example` entry is committed.
+
+Schema matches the existing label queue (JSONL on-disk schema):
+
+```json
+{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
+```
+
+### Trigger
+
+In `app/label_tool.py` and `app/api.py`: when a `digest` label is applied, append the
+email to `digest_samples.jsonl` alongside the normal write to `email_score.jsonl`.
+
+No Peregrine dependency — if the file path doesn't exist the `data/` directory is created
+automatically. Avocet remains fully standalone.
+
+### Usage
+
+When a new digest sender appears in the wild:
+1. Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
+2. Inspect samples, write `parse_<source>(body)` in `digest_parsers.py`
+3. Add the sender string to `DIGEST_PARSERS`
+4. Add fixture test in `peregrine/tests/test_digest_parsers.py`
+
+---
+
+## Testing
+
+### `peregrine/tests/test_digest_parsers.py`
+
+- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
+- Each parser: valid body → expected cards returned
+- Each parser: empty / malformed body → `[]`, no exception
+- Dispatcher: known sender → correct parser invoked
+- Dispatcher: unknown sender → `None`
+- URL canonicalization: tracking params stripped, canonical form asserted
+- Dedup within digest: same URL appearing twice in one email → one card
+
+### `avocet/tests/test_digest_bucket.py`
+
+- `digest` label → row appended to `digest_samples.jsonl`
+- Any other label → `digest_samples.jsonl` not touched
+- First write creates `data/` directory if absent
+
+---
+
+## Files Changed / Created
+
+| File | Change |
+|------|--------|
+| `peregrine/scripts/digest_parsers.py` | **New** — parser module |
+| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
+| `peregrine/tests/test_digest_parsers.py` | **New** — parser unit tests |
+| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on `digest` label |
+| `avocet/app/api.py` | Same — digest bucket write in label endpoint |
+| `avocet/tests/test_digest_bucket.py` | **New** — bucket write tests |
+| `avocet/data/digest_samples.jsonl.example` | **New** — committed sample for reference |
+
+---
+
+## Out of Scope
+
+- Avocet → Peregrine direct import trigger (deferred; bucket is sufficient for now)
+- `background_tasks` integration for digest re-processing (not needed with bucket approach)
+- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)