peregrine/docs/plans/2026-03-05-digest-parsers-design.md

242 lines
7.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Digest Email Parsers — Design
**Date:** 2026-03-05
**Products:** Peregrine (primary), Avocet (bucket)
**Status:** Design approved, ready for implementation planning
---
## Problem
Peregrine's `imap_sync.py` can extract leads from digest emails, but only for LinkedIn — the
parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
unhandled. Additionally, any digest email from an unknown sender is silently dropped with no
way to collect samples for building new parsers.
---
## Solution Overview
Two complementary changes:
1. **`peregrine/scripts/digest_parsers.py`** — a standalone parser module with a sender registry
and dispatcher. `imap_sync.py` calls a single function; the registry handles dispatch.
LinkedIn parser moves here; Adzuna and Ladders parsers are built against real IMAP samples.
2. **Avocet digest bucket** — when a user labels an email as `digest` in the Avocet label UI,
the email is appended to `data/digest_samples.jsonl`. This file is the corpus for building
and testing new parsers for senders not yet in the registry.
---
## Architecture
### Production path (Peregrine)
```
imap_sync._scan_unmatched_leads()
├─ parse_digest(from_addr, body)
│ │
│ ├─ None → unknown sender → fall through to LLM extraction (unchanged)
│ ├─ [] → known sender, nothing found → skip
│ └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
└─ continue (digest email consumed; does not reach LLM path)
```
### Sample collection path (Avocet)
```
Avocet label UI
└─ label == "digest"
└─ append to data/digest_samples.jsonl
└─ used as reference for building new parsers
```
---
## Module: `peregrine/scripts/digest_parsers.py`
### Parser interface
Each parser function:
```python
def parse_<source>(body: str) -> list[dict]
```
Returns zero or more job dicts:
```python
{
"title": str, # job title
"company": str, # company name
"location": str, # location string (may be empty)
"url": str, # canonical URL, tracking params stripped
"source": str, # "linkedin" | "adzuna" | "theladders"
}
```
### Dispatcher
```python
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
"jobalerts@linkedin.com": ("linkedin", parse_linkedin),
"noreply@adzuna.com": ("adzuna", parse_adzuna),
"noreply@theladders.com": ("theladders", parse_theladders),
}
def parse_digest(from_addr: str, body: str) -> list[dict] | None:
"""
Dispatch to the appropriate parser based on sender address.
Returns:
None — no parser matched (not a known digest sender)
[] — parser matched, no extractable jobs found
[dict, ...] — one dict per job card extracted
"""
addr = from_addr.lower()
for sender, (source, parse_fn) in DIGEST_PARSERS.items():
if sender in addr:
return parse_fn(body)
return None
```
Sender matching is a substring check, tolerant of display-name wrappers
(`"LinkedIn <jobalerts@linkedin.com>"` matches correctly).
### Parsers
**`parse_linkedin`** — moved verbatim from `imap_sync.parse_linkedin_alert()`, renamed.
No behavior change.
**`parse_adzuna`** — built against real Adzuna digest email bodies pulled from the
configured IMAP account during implementation. Expected format: job blocks separated
by consistent delimiters with title, company, location, and a trackable URL per block.
**`parse_theladders`** — same approach. The Ladders already has a web scraper in
`scripts/custom_boards/theladders.py`; URL canonicalization patterns from there apply here.
---
## Changes to `imap_sync.py`
Replace the LinkedIn-specific block in `_scan_unmatched_leads()` (~lines 561585):
**Before:**
```python
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
cards = parse_linkedin_alert(parsed["body"])
for card in cards:
# ... LinkedIn-specific insert ...
known_message_ids.add(mid)
continue
```
**After:**
```python
from scripts.digest_parsers import parse_digest # top of file
cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
for card in cards:
if card["url"] in existing_urls:
continue
job_id = insert_job(db_path, {
"title": card["title"],
"company": card["company"],
"url": card["url"],
"source": card["source"],
"location": card["location"],
"is_remote": 0,
"salary": "",
"description": "",
"date_found": datetime.now().isoformat()[:10],
})
if job_id:
submit_task(db_path, "scrape_url", job_id)
existing_urls.add(card["url"])
new_leads += 1
print(f"[imap] digest ({card['source']}) → {card['company']}{card['title']}")
known_message_ids.add(mid)
continue
```
`parse_digest` returning `None` falls through to the existing LLM extraction path — all
non-digest recruitment emails are completely unaffected.
---
## Avocet: Digest Bucket
### File
`avocet/data/digest_samples.jsonl` — gitignored. An `.example` entry is committed.
Schema matches the existing label queue (JSONL on-disk schema):
```json
{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
```
### Trigger
In `app/label_tool.py` and `app/api.py`: when a `digest` label is applied, append the
email to `digest_samples.jsonl` alongside the normal write to `email_score.jsonl`.
No Peregrine dependency — if the file path doesn't exist the `data/` directory is created
automatically. Avocet remains fully standalone.
### Usage
When a new digest sender appears in the wild:
1. Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
2. Inspect samples, write `parse_<source>(body)` in `digest_parsers.py`
3. Add the sender string to `DIGEST_PARSERS`
4. Add fixture test in `peregrine/tests/test_digest_parsers.py`
---
## Testing
### `peregrine/tests/test_digest_parsers.py`
- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
- Each parser: valid body → expected cards returned
- Each parser: empty / malformed body → `[]`, no exception
- Dispatcher: known sender → correct parser invoked
- Dispatcher: unknown sender → `None`
- URL canonicalization: tracking params stripped, canonical form asserted
- Dedup within digest: same URL appearing twice in one email → one card
### `avocet/tests/test_digest_bucket.py`
- `digest` label → row appended to `digest_samples.jsonl`
- Any other label → `digest_samples.jsonl` not touched
- First write creates `data/` directory if absent
---
## Files Changed / Created
| File | Change |
|------|--------|
| `peregrine/scripts/digest_parsers.py` | **New** — parser module |
| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
| `peregrine/tests/test_digest_parsers.py` | **New** — parser unit tests |
| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on `digest` label |
| `avocet/app/api.py` | Same — digest bucket write in label endpoint |
| `avocet/tests/test_digest_bucket.py` | **New** — bucket write tests |
| `avocet/data/digest_samples.jsonl.example` | **New** — committed sample for reference |
---
## Out of Scope
- Avocet → Peregrine direct import trigger (deferred; bucket is sufficient for now)
- `background_tasks` integration for digest re-processing (not needed with bucket approach)
- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)