docs: digest email parser design — LinkedIn/Adzuna/Ladders registry + Avocet bucket
This commit is contained in:
parent 9229f9ce69
commit cc0b8d716c
1 changed files with 242 additions and 0 deletions

docs/plans/2026-03-05-digest-parsers-design.md (new file)
# Digest Email Parsers — Design

**Date:** 2026-03-05

**Products:** Peregrine (primary), Avocet (bucket)

**Status:** Design approved, ready for implementation planning

---

## Problem

Peregrine's `imap_sync.py` can extract leads from digest emails, but only for LinkedIn — the
parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
unhandled. Additionally, any digest email from an unknown sender is silently dropped with no
way to collect samples for building new parsers.

---

## Solution Overview

Two complementary changes:

1. **`peregrine/scripts/digest_parsers.py`** — a standalone parser module with a sender registry
   and dispatcher. `imap_sync.py` calls a single function; the registry handles dispatch.
   The LinkedIn parser moves here; Adzuna and Ladders parsers are built against real IMAP samples.

2. **Avocet digest bucket** — when a user labels an email as `digest` in the Avocet label UI,
   the email is appended to `data/digest_samples.jsonl`. This file is the corpus for building
   and testing new parsers for senders not yet in the registry.

---

## Architecture

### Production path (Peregrine)
```
imap_sync._scan_unmatched_leads()
│
├─ parse_digest(from_addr, body)
│   │
│   ├─ None  → unknown sender → fall through to LLM extraction (unchanged)
│   ├─ []    → known sender, nothing found → skip
│   └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
│
└─ continue (digest email consumed; does not reach LLM path)
```
### Sample collection path (Avocet)

```
Avocet label UI
│
└─ label == "digest"
    │
    └─ append to data/digest_samples.jsonl
        │
        └─ used as reference for building new parsers
```

---
## Module: `peregrine/scripts/digest_parsers.py`

### Parser interface

Each parser function:

```python
def parse_<source>(body: str) -> list[dict]
```

Returns zero or more job dicts:

```python
{
    "title": str,     # job title
    "company": str,   # company name
    "location": str,  # location string (may be empty)
    "url": str,       # canonical URL, tracking params stripped
    "source": str,    # "linkedin" | "adzuna" | "theladders"
}
```
### Dispatcher

```python
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
    "jobalerts@linkedin.com": ("linkedin", parse_linkedin),
    "noreply@adzuna.com": ("adzuna", parse_adzuna),
    "noreply@theladders.com": ("theladders", parse_theladders),
}


def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """
    Dispatch to the appropriate parser based on sender address.

    Returns:
        None        — no parser matched (not a known digest sender)
        []          — parser matched, no extractable jobs found
        [dict, ...] — one dict per job card extracted
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None
```

Sender matching is a substring check, tolerant of display-name wrappers
(`"LinkedIn <jobalerts@linkedin.com>"` matches correctly).
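For illustration, here is the same wrapped header matched two ways: the substring check the dispatcher relies on, and a stricter `email.utils.parseaddr` route should exact address matching ever be needed. The latter is only a comparison sketch, not part of this design:

```python
from email.utils import parseaddr

# The registry key is a bare address, but a From header usually isn't.
raw = 'LinkedIn <jobalerts@linkedin.com>'

# Substring check, as parse_digest does it:
assert "jobalerts@linkedin.com" in raw.lower()

# Stricter alternative (not used by the design): parse the address out first.
name, addr = parseaddr(raw)
print(addr)  # → jobalerts@linkedin.com
```

The substring check is deliberately loose: it needs no header parsing and still works if a mail client reformats the display name.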
### Parsers

**`parse_linkedin`** — moved verbatim from `imap_sync.parse_linkedin_alert()`, renamed.
No behavior change.

**`parse_adzuna`** — built against real Adzuna digest email bodies pulled from the
configured IMAP account during implementation. Expected format: job blocks separated
by consistent delimiters with title, company, location, and a trackable URL per block.

**`parse_theladders`** — same approach. The Ladders already has a web scraper in
`scripts/custom_boards/theladders.py`; URL canonicalization patterns from there apply here.
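The canonicalization step can be sketched with the standard library. This sketch simply drops the query string and fragment, which is an assumption; the real rules come from inspecting each sender's links (and, for The Ladders, from the existing scraper):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_url(url: str) -> str:
    """Strip tracking state so one job posting dedupes to one URL (sketch)."""
    parts = urlsplit(url)
    # Assumption for this sketch: digest links carry only tracking state in
    # the query and fragment, so both can be dropped wholesale.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonicalize_url(
    "https://www.theladders.com/job/data-engineer-123?utm_source=digest&trk=abc#top"
))
# → https://www.theladders.com/job/data-engineer-123
```

If a sender encodes the job ID in a query parameter, the keep-list approach replaces the drop-everything approach; that is exactly what the IMAP samples decide.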
---

## Changes to `imap_sync.py`

Replace the LinkedIn-specific block in `_scan_unmatched_leads()` (~lines 561–585):

**Before:**

```python
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        # ... LinkedIn-specific insert ...
    known_message_ids.add(mid)
    continue
```

**After:**

```python
from scripts.digest_parsers import parse_digest  # top of file

cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": card["source"],
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
        existing_urls.add(card["url"])
        new_leads += 1
        print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue
```
`parse_digest` returning `None` falls through to the existing LLM extraction path — all
non-digest recruitment emails are completely unaffected.

---

## Avocet: Digest Bucket

### File

`avocet/data/digest_samples.jsonl` — gitignored. An `.example` entry is committed.

Schema matches the existing label queue (JSONL on-disk schema):

```json
{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
```

### Trigger

In `app/label_tool.py` and `app/api.py`: when a `digest` label is applied, append the
email to `digest_samples.jsonl` alongside the normal write to `email_score.jsonl`.

No Peregrine dependency — if the `data/` directory doesn't exist, it is created
automatically on first write. Avocet remains fully standalone.
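A minimal sketch of the bucket write; the helper name is hypothetical, and the real hook lives inside Avocet's label handlers:

```python
import json
import tempfile
from pathlib import Path

def append_digest_sample(data_dir: Path, email: dict) -> None:
    """Append one labeled email to the digest sample corpus (sketch)."""
    data_dir.mkdir(parents=True, exist_ok=True)  # first write creates data/
    with (data_dir / "digest_samples.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(email, ensure_ascii=False) + "\n")

# Demo against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp) / "data"
    append_digest_sample(bucket, {"subject": "Your job alert",
                                  "from_addr": "noreply@example-board.com"})
    print((bucket / "digest_samples.jsonl").read_text().strip())
```

Append-only JSONL keeps the write atomic enough for a single-user tool and needs no schema migration when fields are added later.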
### Usage

When a new digest sender appears in the wild:

1. Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
2. Inspect samples, write `parse_<source>(body)` in `digest_parsers.py`
3. Add the sender string to `DIGEST_PARSERS`
4. Add a fixture test in `peregrine/tests/test_digest_parsers.py`

---

## Testing

### `peregrine/tests/test_digest_parsers.py`

- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
- Each parser: valid body → expected cards returned
- Each parser: empty / malformed body → `[]`, no exception
- Dispatcher: known sender → correct parser invoked
- Dispatcher: unknown sender → `None`
- URL canonicalization: tracking params stripped, canonical form asserted
- Dedup within digest: same URL appearing twice in one email → one card
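The last bullet can be made concrete with a short sketch; the helper name is hypothetical, and each parser would apply something like it before returning:

```python
def dedupe_cards(cards: list[dict]) -> list[dict]:
    """Keep the first card per canonical URL within one digest (sketch)."""
    seen: set[str] = set()
    unique = []
    for card in cards:
        if card["url"] in seen:
            continue
        seen.add(card["url"])
        unique.append(card)
    return unique

cards = [
    {"title": "Data Engineer", "url": "https://example.com/job/1"},
    {"title": "Data Engineer", "url": "https://example.com/job/1"},  # duplicate
    {"title": "ML Engineer",   "url": "https://example.com/job/2"},
]
print(len(dedupe_cards(cards)))  # → 2
```

Deduping on the canonical URL (after tracking params are stripped) is what makes this reliable; deduping on the raw URL would let two differently-tracked links through.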
### `avocet/tests/test_digest_bucket.py`

- `digest` label → row appended to `digest_samples.jsonl`
- Any other label → `digest_samples.jsonl` not touched
- First write creates `data/` directory if absent

---

## Files Changed / Created

| File | Change |
|------|--------|
| `peregrine/scripts/digest_parsers.py` | **New** — parser module |
| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
| `peregrine/tests/test_digest_parsers.py` | **New** — parser unit tests |
| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on `digest` label |
| `avocet/app/api.py` | Same — digest bucket write in label endpoint |
| `avocet/tests/test_digest_bucket.py` | **New** — bucket write tests |
| `avocet/data/digest_samples.jsonl.example` | **New** — committed sample for reference |

---

## Out of Scope

- Avocet → Peregrine direct import trigger (deferred; bucket is sufficient for now)
- `background_tasks` integration for digest re-processing (not needed with bucket approach)
- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)