docs: digest email parser design — LinkedIn/Adzuna/Ladders registry + Avocet bucket
This commit is contained in:
parent 9229f9ce69
commit cc0b8d716c
1 changed files with 242 additions and 0 deletions

docs/plans/2026-03-05-digest-parsers-design.md (new file)
# Digest Email Parsers — Design

**Date:** 2026-03-05

**Products:** Peregrine (primary), Avocet (bucket)

**Status:** Design approved, ready for implementation planning

---

## Problem

Peregrine's `imap_sync.py` can extract leads from digest emails, but only for LinkedIn — the
parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
unhandled. Additionally, any digest email from an unknown sender is silently dropped with no
way to collect samples for building new parsers.

---

## Solution Overview

Two complementary changes:

1. **`peregrine/scripts/digest_parsers.py`** — a standalone parser module with a sender registry
   and dispatcher. `imap_sync.py` calls a single function; the registry handles dispatch.
   The LinkedIn parser moves here; Adzuna and Ladders parsers are built against real IMAP samples.

2. **Avocet digest bucket** — when a user labels an email as `digest` in the Avocet label UI,
   the email is appended to `data/digest_samples.jsonl`. This file is the corpus for building
   and testing new parsers for senders not yet in the registry.

---

## Architecture

### Production path (Peregrine)
```
imap_sync._scan_unmatched_leads()
│
├─ parse_digest(from_addr, body)
│   │
│   ├─ None  → unknown sender → fall through to LLM extraction (unchanged)
│   ├─ []    → known sender, nothing found → skip
│   └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
│
└─ continue (digest email consumed; does not reach LLM path)
```
### Sample collection path (Avocet)

```
Avocet label UI
│
└─ label == "digest"
    │
    └─ append to data/digest_samples.jsonl
        │
        └─ used as reference for building new parsers
```

---
## Module: `peregrine/scripts/digest_parsers.py`

### Parser interface

Each parser function:

```python
def parse_<source>(body: str) -> list[dict]
```

Returns zero or more job dicts:

```python
{
    "title": str,     # job title
    "company": str,   # company name
    "location": str,  # location string (may be empty)
    "url": str,       # canonical URL, tracking params stripped
    "source": str,    # "linkedin" | "adzuna" | "theladders"
}
```
### Dispatcher

```python
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
    "jobalerts@linkedin.com": ("linkedin", parse_linkedin),
    "noreply@adzuna.com": ("adzuna", parse_adzuna),
    "noreply@theladders.com": ("theladders", parse_theladders),
}


def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """
    Dispatch to the appropriate parser based on sender address.

    Returns:
        None        — no parser matched (not a known digest sender)
        []          — parser matched, no extractable jobs found
        [dict, ...] — one dict per job card extracted
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None
```

Sender matching is a substring check, tolerant of display-name wrappers
(`"LinkedIn <jobalerts@linkedin.com>"` matches correctly).
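For illustration, here is the same wrapped header matched two ways: the substring check the dispatcher relies on, and a stricter `email.utils.parseaddr` route should exact address matching ever be needed. The latter is only a comparison sketch, not part of this design:

```python
from email.utils import parseaddr

# The registry key is a bare address, but a From header usually isn't.
raw = 'LinkedIn <jobalerts@linkedin.com>'

# Substring check, as parse_digest does it:
assert "jobalerts@linkedin.com" in raw.lower()

# Stricter alternative (not used by the design): parse the address out first.
name, addr = parseaddr(raw)
print(addr)  # → jobalerts@linkedin.com
```

The substring check is deliberately loose: it needs no header parsing and still works if a mail client reformats the display name.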
### Parsers

**`parse_linkedin`** — moved verbatim from `imap_sync.parse_linkedin_alert()`, renamed.
No behavior change.

**`parse_adzuna`** — built against real Adzuna digest email bodies pulled from the
configured IMAP account during implementation. Expected format: job blocks separated
by consistent delimiters with title, company, location, and a trackable URL per block.

**`parse_theladders`** — same approach. The Ladders already has a web scraper in
`scripts/custom_boards/theladders.py`; URL canonicalization patterns from there apply here.
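The canonicalization step can be sketched with the standard library. This sketch simply drops the query string and fragment, which is an assumption; the real rules come from inspecting each sender's links (and, for The Ladders, from the existing scraper):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_url(url: str) -> str:
    """Strip tracking state so one job posting dedupes to one URL (sketch)."""
    parts = urlsplit(url)
    # Assumption for this sketch: digest links carry only tracking state in
    # the query and fragment, so both can be dropped wholesale.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonicalize_url(
    "https://www.theladders.com/job/data-engineer-123?utm_source=digest&trk=abc#top"
))
# → https://www.theladders.com/job/data-engineer-123
```

If a sender encodes the job ID in a query parameter, the keep-list approach replaces the drop-everything approach; that is exactly what the IMAP samples decide.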
---

## Changes to `imap_sync.py`

Replace the LinkedIn-specific block in `_scan_unmatched_leads()` (~lines 561–585):

**Before:**

```python
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        # ... LinkedIn-specific insert ...
    known_message_ids.add(mid)
    continue
```

**After:**

```python
from scripts.digest_parsers import parse_digest  # top of file

cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": card["source"],
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
        existing_urls.add(card["url"])
        new_leads += 1
        print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue
```
`parse_digest` returning `None` falls through to the existing LLM extraction path — all
non-digest recruitment emails are completely unaffected.

---

## Avocet: Digest Bucket

### File

`avocet/data/digest_samples.jsonl` — gitignored. An `.example` entry is committed.

Schema matches the existing label queue (JSONL on-disk schema):

```json
{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
```

### Trigger

In `app/label_tool.py` and `app/api.py`: when a `digest` label is applied, append the
email to `digest_samples.jsonl` alongside the normal write to `email_score.jsonl`.

No Peregrine dependency — if the `data/` directory doesn't exist, it is created
automatically on first write. Avocet remains fully standalone.
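A minimal sketch of the bucket write; the helper name is hypothetical, and the real hook lives inside Avocet's label handlers:

```python
import json
import tempfile
from pathlib import Path

def append_digest_sample(data_dir: Path, email: dict) -> None:
    """Append one labeled email to the digest sample corpus (sketch)."""
    data_dir.mkdir(parents=True, exist_ok=True)  # first write creates data/
    with (data_dir / "digest_samples.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(email, ensure_ascii=False) + "\n")

# Demo against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp) / "data"
    append_digest_sample(bucket, {"subject": "Your job alert",
                                  "from_addr": "noreply@example-board.com"})
    print((bucket / "digest_samples.jsonl").read_text().strip())
```

Append-only JSONL keeps the write atomic enough for a single-user tool and needs no schema migration when fields are added later.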
### Usage

When a new digest sender appears in the wild:

1. Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
2. Inspect samples, write `parse_<source>(body)` in `digest_parsers.py`
3. Add the sender string to `DIGEST_PARSERS`
4. Add a fixture test in `peregrine/tests/test_digest_parsers.py`

---

## Testing

### `peregrine/tests/test_digest_parsers.py`

- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
- Each parser: valid body → expected cards returned
- Each parser: empty / malformed body → `[]`, no exception
- Dispatcher: known sender → correct parser invoked
- Dispatcher: unknown sender → `None`
- URL canonicalization: tracking params stripped, canonical form asserted
- Dedup within digest: same URL appearing twice in one email → one card
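The last bullet can be made concrete with a short sketch; the helper name is hypothetical, and each parser would apply something like it before returning:

```python
def dedupe_cards(cards: list[dict]) -> list[dict]:
    """Keep the first card per canonical URL within one digest (sketch)."""
    seen: set[str] = set()
    unique = []
    for card in cards:
        if card["url"] in seen:
            continue
        seen.add(card["url"])
        unique.append(card)
    return unique

cards = [
    {"title": "Data Engineer", "url": "https://example.com/job/1"},
    {"title": "Data Engineer", "url": "https://example.com/job/1"},  # duplicate
    {"title": "ML Engineer",   "url": "https://example.com/job/2"},
]
print(len(dedupe_cards(cards)))  # → 2
```

Deduping on the canonical URL (after tracking params are stripped) is what makes this reliable; deduping on the raw URL would let two differently-tracked links through.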
### `avocet/tests/test_digest_bucket.py`

- `digest` label → row appended to `digest_samples.jsonl`
- Any other label → `digest_samples.jsonl` not touched
- First write creates `data/` directory if absent

---

## Files Changed / Created

| File | Change |
|------|--------|
| `peregrine/scripts/digest_parsers.py` | **New** — parser module |
| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
| `peregrine/tests/test_digest_parsers.py` | **New** — parser unit tests |
| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on `digest` label |
| `avocet/app/api.py` | Same — digest bucket write in label endpoint |
| `avocet/tests/test_digest_bucket.py` | **New** — bucket write tests |
| `avocet/data/digest_samples.jsonl.example` | **New** — committed sample for reference |

---

## Out of Scope

- Avocet → Peregrine direct import trigger (deferred; bucket is sufficient for now)
- `background_tasks` integration for digest re-processing (not needed with bucket approach)
- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)