Digest Email Parsers — Design
Date: 2026-03-05
Products: Peregrine (primary), Avocet (bucket)
Status: Design approved, ready for implementation planning
Problem
Peregrine's imap_sync.py can extract leads from digest emails, but only for LinkedIn — the
parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
unhandled. Additionally, any digest email from an unknown sender is silently dropped with no
way to collect samples for building new parsers.
Solution Overview
Two complementary changes:
- `peregrine/scripts/digest_parsers.py` — a standalone parser module with a sender registry and dispatcher. `imap_sync.py` calls a single function; the registry handles dispatch. The LinkedIn parser moves here; Adzuna and Ladders parsers are built against real IMAP samples.
- Avocet digest bucket — when a user labels an email as `digest` in the Avocet label UI, the email is appended to `data/digest_samples.jsonl`. This file is the corpus for building and testing new parsers for senders not yet in the registry.
Architecture
Production path (Peregrine)
imap_sync._scan_unmatched_leads()
│
├─ parse_digest(from_addr, body)
│ │
│ ├─ None → unknown sender → fall through to LLM extraction (unchanged)
│ ├─ [] → known sender, nothing found → skip
│ └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
│
└─ continue (digest email consumed; does not reach LLM path)
Sample collection path (Avocet)
Avocet label UI
│
└─ label == "digest"
│
└─ append to data/digest_samples.jsonl
│
└─ used as reference for building new parsers
Module: peregrine/scripts/digest_parsers.py
Parser interface
Each parser function:
def parse_<source>(body: str) -> list[dict]
Returns zero or more job dicts:
{
    "title": str,     # job title
    "company": str,   # company name
    "location": str,  # location string (may be empty)
    "url": str,       # canonical URL, tracking params stripped
    "source": str,    # "linkedin" | "adzuna" | "theladders"
}
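To make the contract concrete, here is a minimal sketch of a parser against a hypothetical plain-text format where each job block is a `Title | Company | Location` line followed by a URL line. This format is an assumption for illustration only; the real `parse_adzuna` and `parse_theladders` are written against actual IMAP samples.

```python
def parse_example(body: str) -> list[dict]:
    """Sketch of the parser contract against a made-up digest format:
        Senior Engineer | Acme Corp | Remote
        https://example.com/jobs/123?utm_source=alert
    Malformed blocks are skipped, never raised on.
    """
    jobs: list[dict] = []
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    for header, url_line in zip(lines, lines[1:]):
        if not url_line.startswith("http"):
            continue
        parts = [p.strip() for p in header.split("|")]
        if len(parts) != 3:
            continue  # malformed block: skip it
        title, company, location = parts
        jobs.append({
            "title": title,
            "company": company,
            "location": location,
            "url": url_line.split("?")[0],  # naive tracking-param strip
            "source": "example",
        })
    return jobs
```

Note the two failure modes line up with the dispatcher contract: an empty or unrecognizable body yields `[]`, never an exception.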
Dispatcher
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
    "jobalerts@linkedin.com": ("linkedin", parse_linkedin),
    "noreply@adzuna.com": ("adzuna", parse_adzuna),
    "noreply@theladders.com": ("theladders", parse_theladders),
}
def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """
    Dispatch to the appropriate parser based on sender address.

    Returns:
        None        — no parser matched (not a known digest sender)
        []          — parser matched, no extractable jobs found
        [dict, ...] — one dict per job card extracted
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None
Sender matching is a substring check, tolerant of display-name wrappers
("LinkedIn <jobalerts@linkedin.com>" matches correctly).
Parsers
parse_linkedin — moved verbatim from imap_sync.parse_linkedin_alert(), renamed.
No behavior change.
parse_adzuna — built against real Adzuna digest email bodies pulled from the
configured IMAP account during implementation. Expected format: job blocks separated
by consistent delimiters with title, company, location, and a trackable URL per block.
parse_theladders — same approach. The Ladders already has a web scraper in
scripts/custom_boards/theladders.py; URL canonicalization patterns from there apply here.
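The URL canonicalization mentioned above can be sketched with `urllib.parse`. The tracking-parameter names listed here are assumptions for illustration; the real list comes from inspecting each sender's URLs (and, for The Ladders, from the patterns already in `scripts/custom_boards/theladders.py`).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative tracking-param names; the production set is derived from
# real digest URLs per sender.
_TRACKING = {"utm_source", "utm_medium", "utm_campaign", "refId", "trackingId"}

def canonicalize_url(url: str) -> str:
    """Drop known tracking query params; keep all other params and the path."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in _TRACKING]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Keeping non-tracking params (e.g. a job id carried in the query string) matters for dedup: two digests pointing at the same posting should canonicalize to the same string.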
Changes to imap_sync.py
Replace the LinkedIn-specific block in _scan_unmatched_leads() (~lines 561–585):
Before:
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        ...  # LinkedIn-specific insert
    known_message_ids.add(mid)
    continue
After:
from scripts.digest_parsers import parse_digest  # top of file

cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": card["source"],
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
            existing_urls.add(card["url"])
            new_leads += 1
            print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue
When parse_digest returns None, the email falls through to the existing LLM extraction path, so non-digest recruitment emails are unaffected.
Avocet: Digest Bucket
File
avocet/data/digest_samples.jsonl — gitignored. An .example entry is committed.
Schema matches the existing label queue (JSONL on-disk schema):
{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
Trigger
In app/label_tool.py and app/api.py: when a digest label is applied, append the
email to digest_samples.jsonl alongside the normal write to email_score.jsonl.
No Peregrine dependency — if the file path doesn't exist the data/ directory is created
automatically. Avocet remains fully standalone.
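A minimal sketch of the bucket write, assuming the same JSONL schema as above. The helper name and signature are assumptions; the real hook lives inline in the label handlers in `app/label_tool.py` and `app/api.py`.

```python
import json
from pathlib import Path

def append_digest_sample(data_dir: Path, email: dict) -> None:
    """Append one digest-labeled email to the sample corpus,
    creating the data/ directory on first write."""
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / "digest_samples.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(email, ensure_ascii=False) + "\n")
```

Append-only JSONL keeps the write trivially safe to call from both the CLI label tool and the API endpoint without coordination.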
Usage
When a new digest sender appears in the wild:
- Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
- Inspect samples, write `parse_<source>(body)` in `digest_parsers.py`
- Add the sender string to `DIGEST_PARSERS`
- Add a fixture test in `peregrine/tests/test_digest_parsers.py`
Testing
peregrine/tests/test_digest_parsers.py
- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
- Each parser: valid body → expected cards returned
- Each parser: empty / malformed body → `[]`, no exception
- Dispatcher: known sender → correct parser invoked
- Dispatcher: unknown sender → `None`
- URL canonicalization: tracking params stripped, canonical form asserted
- Dedup within digest: same URL appearing twice in one email → one card
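The within-digest dedup case can be pinned down with a small helper; whether dedup lives in each parser or in the dispatcher is an implementation choice, and the helper name here is an assumption.

```python
def dedup_cards(cards: list[dict]) -> list[dict]:
    """Within-digest dedup: keep the first card per canonical URL."""
    seen: set[str] = set()
    out: list[dict] = []
    for card in cards:
        if card["url"] in seen:
            continue
        seen.add(card["url"])
        out.append(card)
    return out
```

This only covers duplicates inside one email; cross-email dedup is already handled by the `existing_urls` check in `imap_sync.py`.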
avocet/tests/test_digest_bucket.py
- `digest` label → row appended to `digest_samples.jsonl`
- Any other label → `digest_samples.jsonl` not touched
- First write creates `data/` directory if absent
Files Changed / Created
| File | Change |
|---|---|
| `peregrine/scripts/digest_parsers.py` | New — parser module |
| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
| `peregrine/tests/test_digest_parsers.py` | New — parser unit tests |
| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on digest label |
| `avocet/app/api.py` | Same — digest bucket write in label endpoint |
| `avocet/tests/test_digest_bucket.py` | New — bucket write tests |
| `avocet/data/digest_samples.jsonl.example` | New — committed sample for reference |
Out of Scope
- Avocet → Peregrine direct import trigger (deferred; bucket is sufficient for now)
- `background_tasks` integration for digest re-processing (not needed with bucket approach)
- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)