peregrine/docs/plans/2026-03-05-digest-parsers-plan.md

29 KiB

Digest Email Parsers Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Extract job listings from LinkedIn, Adzuna, and The Ladders digest emails into Peregrine leads, with an Avocet bucket that collects digest samples for future parser development.

Architecture: New peregrine/scripts/digest_parsers.py exposes a parse_digest(from_addr, body) dispatcher backed by a sender registry. imap_sync.py replaces its inline LinkedIn block with one dispatcher call. Avocet's two label paths (label_tool.py + api.py) append digest-labeled emails to data/digest_samples.jsonl. Adzuna and Ladders parsers are built from real IMAP samples fetched in Task 2.

Tech Stack: Python stdlib only — re, json, pathlib. No new dependencies.


Task 1: Create digest_parsers.py with dispatcher + LinkedIn parser

Files:

  • Create: peregrine/scripts/digest_parsers.py
  • Create: peregrine/tests/test_digest_parsers.py

Context: parse_linkedin_alert() currently lives inline in imap_sync.py. We move it here (renamed parse_linkedin) and wrap it in a dispatcher. All other parsers plug into the same registry.

Run all tests with:

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v

Step 1: Write the failing tests

Create peregrine/tests/test_digest_parsers.py:

"""Tests for digest email parser registry."""
import pytest
from scripts.digest_parsers import parse_digest, parse_linkedin

# ── LinkedIn fixture ──────────────────────────────────────────────────────────
# Mirrors the plain-text format LinkedIn Job Alert emails actually send.
# Each job block is separated by a line of 10+ dashes.
LINKEDIN_BODY = """\
Software Engineer
Acme Corp
San Francisco, CA

View job: https://www.linkedin.com/comm/jobs/view/1111111111/?refId=abc&trackingId=xyz

--------------------------------------------------
Senior Developer
Widget Inc
Remote

View job: https://www.linkedin.com/comm/jobs/view/2222222222/?refId=def
"""

LINKEDIN_BODY_EMPTY = "No jobs matched your alert this week."

LINKEDIN_BODY_NO_URL = """\
Software Engineer
Acme Corp
San Francisco, CA

--------------------------------------------------
"""


def test_dispatcher_linkedin_sender():
    cards = parse_digest("LinkedIn <jobalerts@linkedin.com>", LINKEDIN_BODY)
    assert cards is not None
    assert len(cards) == 2


def test_dispatcher_unknown_sender_returns_none():
    result = parse_digest("noreply@randomboard.com", LINKEDIN_BODY)
    assert result is None


def test_dispatcher_case_insensitive_sender():
    cards = parse_digest("JOBALERTS@LINKEDIN.COM", LINKEDIN_BODY)
    assert cards is not None


def test_parse_linkedin_returns_correct_fields():
    cards = parse_linkedin(LINKEDIN_BODY)
    assert cards[0]["title"] == "Software Engineer"
    assert cards[0]["company"] == "Acme Corp"
    assert cards[0]["location"] == "San Francisco, CA"
    assert cards[0]["source"] == "linkedin"


def test_parse_linkedin_url_canonicalized():
    """Tracking params stripped; canonical jobs/view/<id>/ form."""
    cards = parse_linkedin(LINKEDIN_BODY)
    assert cards[0]["url"] == "https://www.linkedin.com/jobs/view/1111111111/"
    assert "refId" not in cards[0]["url"]
    assert "trackingId" not in cards[0]["url"]


def test_parse_linkedin_empty_body_returns_empty_list():
    assert parse_linkedin(LINKEDIN_BODY_EMPTY) == []


def test_parse_linkedin_block_without_url_skipped():
    cards = parse_linkedin(LINKEDIN_BODY_NO_URL)
    assert cards == []

Step 2: Run tests to verify they fail

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v

Expected: ImportError: cannot import name 'parse_digest'


Step 3: Write digest_parsers.py

Create peregrine/scripts/digest_parsers.py:

"""Digest email parser registry for Peregrine.

Each parser extracts job listings from a known digest sender's plain-text body.
New parsers are added by decorating with @_register(sender_substring, source_name).

Usage:
    from scripts.digest_parsers import parse_digest

    cards = parse_digest(from_addr, body)
    # None  → unknown sender (fall through to LLM path)
    # []    → known sender, nothing extractable
    # [...] → list of {title, company, location, url, source} dicts
"""
from __future__ import annotations

import re
from typing import Callable

# ── Registry ──────────────────────────────────────────────────────────────────

# Maps sender substring (lowercased) → (source_name, parse_fn)
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {}


def _register(sender: str, source: str):
    """Decorator to register a parser for a given sender substring."""
    def decorator(fn: Callable[[str], list[dict]]):
        DIGEST_PARSERS[sender.lower()] = (source, fn)
        return fn
    return decorator


def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """Dispatch to the appropriate parser based on sender address.

    Returns:
        None        — no parser matched (caller should use LLM fallback)
        []          — known sender, no extractable jobs
        [dict, ...] — one dict per job card with keys:
                      title, company, location, url, source
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None


# ── Shared helpers ─────────────────────────────────────────────────────────────

_LINKEDIN_SKIP_PHRASES = {
    "promoted", "easily apply", "apply now", "job alert",
    "unsubscribe", "linkedin corporation",
}


# ── LinkedIn Job Alert ─────────────────────────────────────────────────────────

@_register("jobalerts@linkedin.com", "linkedin")
def parse_linkedin(body: str) -> list[dict]:
    """Parse LinkedIn Job Alert digest email body.

    Blocks are separated by lines of 10+ dashes. Each block contains:
        Line 0: job title
        Line 1: company
        Line 2: location (optional)
        'View job: <url>'  →  canonicalized to /jobs/view/<id>/
    """
    jobs = []
    blocks = re.split(r"\n\s*-{10,}\s*\n", body)
    for block in blocks:
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]

        url = None
        for line in lines:
            m = re.search(r"View job:\s*(https?://\S+)", line, re.IGNORECASE)
            if m:
                raw_url = m.group(1)
                job_id_m = re.search(r"/jobs/view/(\d+)", raw_url)
                if job_id_m:
                    url = f"https://www.linkedin.com/jobs/view/{job_id_m.group(1)}/"
                break
        if not url:
            continue

        content = [
            ln for ln in lines
            if not any(p in ln.lower() for p in _LINKEDIN_SKIP_PHRASES)
            and not ln.lower().startswith("view job:")
            and not ln.startswith("http")
        ]
        if len(content) < 2:
            continue

        jobs.append({
            "title":    content[0],
            "company":  content[1],
            "location": content[2] if len(content) > 2 else "",
            "url":      url,
            "source":   "linkedin",
        })
    return jobs


# ── Adzuna Job Alert ───────────────────────────────────────────────────────────

@_register("noreply@adzuna.com", "adzuna")
def parse_adzuna(body: str) -> list[dict]:
    """Parse Adzuna job alert digest email body.

    TODO: implement after reviewing samples in avocet/data/digest_samples.jsonl
    See Task 3 in docs/plans/2026-03-05-digest-parsers-plan.md
    """
    return []


# ── The Ladders Job Alert ──────────────────────────────────────────────────────

@_register("noreply@theladders.com", "theladders")
def parse_theladders(body: str) -> list[dict]:
    """Parse The Ladders job alert digest email body.

    TODO: implement after reviewing samples in avocet/data/digest_samples.jsonl
    See Task 4 in docs/plans/2026-03-05-digest-parsers-plan.md
    """
    return []

Step 4: Run tests to verify they pass

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v

Expected: all 8 tests PASS

Step 5: Commit

git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: digest parser registry + LinkedIn parser (moved from imap_sync)"

Task 2: Fetch digest samples from IMAP

Files:

  • Create: avocet/scripts/fetch_digest_samples.py

Context: We need real Adzuna and Ladders email bodies to write parsers against. This one-off script searches the configured IMAP account by sender domain and writes results to data/digest_samples.jsonl. Run it once; the output file feeds Tasks 3 and 4.


Step 1: Create the fetch script

Create avocet/scripts/fetch_digest_samples.py:

#!/usr/bin/env python3
"""Fetch digest email samples from IMAP into data/digest_samples.jsonl.

Searches for emails from known digest sender domains, deduplicates against
any existing samples, and appends new ones.

Usage:
    conda run -n job-seeker python scripts/fetch_digest_samples.py

Reads config/label_tool.yaml for IMAP credentials (first account used).
"""
from __future__ import annotations

import imaplib
import json
import sys
from pathlib import Path

import yaml

ROOT = Path(__file__).parent.parent
CONFIG = ROOT / "config" / "label_tool.yaml"
OUTPUT = ROOT / "data" / "digest_samples.jsonl"

# Sender domains to search — add new ones here as needed
DIGEST_SENDERS = [
    "adzuna.com",
    "theladders.com",
    "jobalerts@linkedin.com",
]

# Import shared helpers from avocet
sys.path.insert(0, str(ROOT))
from app.imap_fetch import _decode_str, _extract_body, entry_key  # noqa: E402


def _load_existing_keys() -> set[str]:
    if not OUTPUT.exists():
        return set()
    keys = set()
    for line in OUTPUT.read_text().splitlines():
        try:
            keys.add(entry_key(json.loads(line)))
        except Exception:
            pass
    return keys


def main() -> None:
    cfg = yaml.safe_load(CONFIG.read_text())
    accounts = cfg.get("accounts", [])
    if not accounts:
        print("No accounts configured in config/label_tool.yaml")
        sys.exit(1)

    acc = accounts[0]
    host = acc.get("host", "imap.gmail.com")
    port = int(acc.get("port", 993))
    use_ssl = acc.get("use_ssl", True)
    username = acc["username"]
    password = acc["password"]
    folder = acc.get("folder", "INBOX")
    days_back = int(acc.get("days_back", 90))

    from datetime import datetime, timedelta
    import email as _email_lib

    since = (datetime.now() - timedelta(days=days_back)).strftime("%d-%b-%Y")

    conn = (imaplib.IMAP4_SSL if use_ssl else imaplib.IMAP4)(host, port)
    conn.login(username, password)
    conn.select(folder, readonly=True)

    known_keys = _load_existing_keys()
    found: list[dict] = []
    seen_uids: dict[bytes, None] = {}

    for sender in DIGEST_SENDERS:
        try:
            _, data = conn.search(None, f'(FROM "{sender}" SINCE "{since}")')
            for uid in (data[0] or b"").split():
                seen_uids[uid] = None
        except Exception as exc:
            print(f"  search error for {sender!r}: {exc}")

    print(f"Found {len(seen_uids)} candidate UIDs across {len(DIGEST_SENDERS)} senders")

    for uid in seen_uids:
        try:
            _, raw_data = conn.fetch(uid, "(RFC822)")
            if not raw_data or not raw_data[0]:
                continue
            msg = _email_lib.message_from_bytes(raw_data[0][1])
            entry = {
                "subject":   _decode_str(msg.get("Subject", "")),
                "body":      _extract_body(msg)[:2000],  # larger cap for parser dev
                "from_addr": _decode_str(msg.get("From", "")),
                "date":      _decode_str(msg.get("Date", "")),
                "account":   acc.get("name", username),
            }
            k = entry_key(entry)
            if k not in known_keys:
                known_keys.add(k)
                found.append(entry)
        except Exception as exc:
            print(f"  fetch error uid {uid}: {exc}")

    conn.logout()

    if not found:
        print("No new digest samples found.")
        return

    OUTPUT.parent.mkdir(exist_ok=True)
    with OUTPUT.open("a", encoding="utf-8") as f:
        for entry in found:
            f.write(json.dumps(entry) + "\n")

    print(f"Wrote {len(found)} new samples to {OUTPUT}")


if __name__ == "__main__":
    main()

Step 2: Run the fetch script

cd /Library/Development/CircuitForge/avocet
conda run -n job-seeker python scripts/fetch_digest_samples.py

Expected output: Wrote N new samples to data/digest_samples.jsonl

Step 3: Inspect the samples

# View first few entries — look at from_addr and body for Adzuna and Ladders format
conda run -n job-seeker python -c "
import json
from pathlib import Path
for line in Path('data/digest_samples.jsonl').read_text().splitlines()[:10]:
    e = json.loads(line)
    print('FROM:', e['from_addr'])
    print('SUBJECT:', e['subject'])
    print('BODY[:500]:', e['body'][:500])
    print('---')
"

Note down:

  • The exact sender addresses for Adzuna and Ladders (update DIGEST_PARSERS in digest_parsers.py if different from noreply@adzuna.com / noreply@theladders.com)
  • The structure of each job block in the body (separator lines, field order, URL format)

Step 4: Commit

cd /Library/Development/CircuitForge/avocet
git add scripts/fetch_digest_samples.py
git commit -m "feat: fetch_digest_samples script for building new parsers"

Task 3: Build and test Adzuna parser

Files:

  • Modify: peregrine/scripts/digest_parsers.py — implement parse_adzuna
  • Modify: peregrine/tests/test_digest_parsers.py — add Adzuna fixtures + tests

Context: After running Task 2, you have real Adzuna email bodies in avocet/data/digest_samples.jsonl. Inspect them (see Task 2 Step 3), identify the structure, then write the test fixture from a real sample before implementing the parser.


Step 1: Write a failing Adzuna test

Inspect a real Adzuna sample from data/digest_samples.jsonl and identify:

  • How job blocks are separated (blank lines? dashes? headers?)
  • Field order (title first? company first?)
  • Where the job URL appears and what format it uses
  • Any noise lines to filter (unsubscribe, promo text, etc.)

Add to peregrine/tests/test_digest_parsers.py:

from scripts.digest_parsers import parse_adzuna

# Replace ADZUNA_BODY with a real excerpt from avocet/data/digest_samples.jsonl
# Copy 2-3 job blocks verbatim; replace real company names with "Test Co" etc. if desired
ADZUNA_BODY = """
<paste real Adzuna body excerpt here — 2-3 job blocks>
"""

def test_dispatcher_adzuna_sender():
    # Update sender string if real sender differs from noreply@adzuna.com
    cards = parse_digest("noreply@adzuna.com", ADZUNA_BODY)
    assert cards is not None
    assert len(cards) >= 1

def test_parse_adzuna_fields():
    cards = parse_adzuna(ADZUNA_BODY)
    assert cards[0]["title"]   # non-empty
    assert cards[0]["company"] # non-empty
    assert cards[0]["url"].startswith("http")
    assert cards[0]["source"] == "adzuna"

def test_parse_adzuna_url_no_tracking():
    """Adzuna URLs often contain tracking params — strip them."""
    cards = parse_adzuna(ADZUNA_BODY)
    # Adjust assertion to match actual URL format once you've seen real samples
    for card in cards:
        assert "utm_" not in card["url"]

def test_parse_adzuna_empty_body():
    assert parse_adzuna("No jobs this week.") == []

Step 2: Run tests to verify they fail

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py::test_parse_adzuna_fields -v

Expected: FAIL (stub returns [])

Step 3: Implement parse_adzuna in digest_parsers.py

Replace the stub body of parse_adzuna based on the actual email structure you observed. Pattern to follow (adapt field positions to match Adzuna's actual format):

@_register("noreply@adzuna.com", "adzuna")  # update sender if needed
def parse_adzuna(body: str) -> list[dict]:
    jobs = []
    # Split on whatever delimiter Adzuna uses between blocks
    # e.g.: blocks = re.split(r"\n\s*\n{2,}", body)  # double blank line
    # For each block, extract title, company, location, url
    # Strip tracking params from URL: re.sub(r"\?.*", "", url) or parse with urllib
    return jobs

If Adzuna sender differs from noreply@adzuna.com, update the @_register decorator and the DIGEST_PARSERS key in the registry (they're set by the decorator — just change the decorator argument).

Step 4: Run all digest tests

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v

Expected: all tests PASS

Step 5: Commit

cd /Library/Development/CircuitForge/peregrine
git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: Adzuna digest email parser"

Task 4: Build and test The Ladders parser

Files:

  • Modify: peregrine/scripts/digest_parsers.py — implement parse_theladders
  • Modify: peregrine/tests/test_digest_parsers.py — add Ladders fixtures + tests

Context: Same approach as Task 3. The Ladders already has a web scraper in scripts/custom_boards/theladders.py — check it for URL patterns that may apply here.


Step 1: Write failing Ladders tests

Inspect a real Ladders sample from avocet/data/digest_samples.jsonl. Add to test file:

from scripts.digest_parsers import parse_theladders

# Replace with real Ladders body excerpt
LADDERS_BODY = """
<paste real Ladders body excerpt here — 2-3 job blocks>
"""

def test_dispatcher_ladders_sender():
    cards = parse_digest("noreply@theladders.com", LADDERS_BODY)
    assert cards is not None
    assert len(cards) >= 1

def test_parse_theladders_fields():
    cards = parse_theladders(LADDERS_BODY)
    assert cards[0]["title"]
    assert cards[0]["company"]
    assert cards[0]["url"].startswith("http")
    assert cards[0]["source"] == "theladders"

def test_parse_theladders_empty_body():
    assert parse_theladders("No new jobs.") == []

Step 2: Run tests to verify they fail

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py::test_parse_theladders_fields -v

Expected: FAIL

Step 3: Implement parse_theladders

Replace the stub. The Ladders URLs often use redirect wrappers — canonicalize to the theladders.com/job/<id> form if possible, otherwise just strip tracking params.

Step 4: Run all digest tests

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v

Expected: all tests PASS

Step 5: Commit

git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: The Ladders digest email parser"

Task 5: Update imap_sync.py to use the dispatcher

Files:

  • Modify: peregrine/scripts/imap_sync.py

Context: The LinkedIn-specific block in _scan_unmatched_leads() (search for _LINKEDIN_ALERT_SENDER) gets replaced with a generic parse_digest() call. The existing behavior is preserved — only the dispatch mechanism changes.


Step 1: Add the import

At the top of imap_sync.py, alongside other local imports, add:

from scripts.digest_parsers import parse_digest

Step 2: Find the LinkedIn-specific block

Search for _LINKEDIN_ALERT_SENDER in imap_sync.py. The block looks like:

if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        ...
    known_message_ids.add(mid)
    continue

Step 3: Replace with the generic dispatcher

# ── Digest email — dispatch to parser registry ────────────────────────
cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title":      card["title"],
            "company":    card["company"],
            "url":        card["url"],
            "source":     card["source"],
            "location":   card["location"],
            "is_remote":  0,
            "salary":     "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
            existing_urls.add(card["url"])
            new_leads += 1
            print(f"[imap] digest ({card['source']}) → {card['company']}{card['title']}")
    known_message_ids.add(mid)
    continue

Step 4: Remove the now-unused parse_linkedin_alert import/definition

parse_linkedin_alert was defined in imap_sync.py. It's now parse_linkedin in digest_parsers.py. Delete the old function from imap_sync.py. Also remove _LINKEDIN_ALERT_SENDER constant if it's no longer referenced.

Step 5: Run the full test suite

/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v

Expected: all existing tests still pass; no regressions

Step 6: Commit

git add scripts/imap_sync.py
git commit -m "refactor: imap_sync uses digest_parsers dispatcher; remove inline LinkedIn parser"

Task 6: Avocet digest bucket

Files:

  • Modify: avocet/app/label_tool.py
  • Modify: avocet/app/api.py
  • Create: avocet/tests/test_digest_bucket.py
  • Create: avocet/data/digest_samples.jsonl.example

Context: When either label path (_do_label in the Streamlit UI or POST /api/label in the FastAPI app) assigns the digest label, the full email record is appended to data/digest_samples.jsonl. This is the sample corpus for building future parsers.


Step 1: Write failing tests

Create avocet/tests/test_digest_bucket.py:

"""Tests for digest sample bucket write behavior."""
import json
import pytest
from pathlib import Path
from unittest.mock import patch, MagicMock


# ── Helpers ───────────────────────────────────────────────────────────────────

def _read_bucket(tmp_path: Path) -> list[dict]:
    bucket = tmp_path / "data" / "digest_samples.jsonl"
    if not bucket.exists():
        return []
    return [json.loads(line) for line in bucket.read_text().splitlines() if line.strip()]


SAMPLE_ENTRY = {
    "subject":   "10 new jobs for you",
    "body":      "Software Engineer\nAcme Corp\nRemote\nView job: https://example.com/123",
    "from_addr": "noreply@adzuna.com",
    "date":      "Mon, 03 Mar 2026 09:00:00 +0000",
    "account":   "test@example.com",
}


# ── api.py bucket tests ───────────────────────────────────────────────────────

def test_api_digest_label_writes_to_bucket(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 1
    assert rows[0]["from_addr"] == "noreply@adzuna.com"


def test_api_non_digest_label_does_not_write(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    # _append_digest_sample should only be called for digest; confirm it writes when called
    # Confirm that callers gate on label == "digest" — tested via integration below
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 1  # called directly, always writes


def test_api_digest_creates_data_dir(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "nonexistent" / "data"
    assert not data_dir.exists()
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    assert data_dir.exists()


def test_api_digest_appends_multiple(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    _append_digest_sample({**SAMPLE_ENTRY, "subject": "5 more jobs"}, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 2

Step 2: Run tests to verify they fail

/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_bucket.py -v

Expected: ImportError: cannot import name '_append_digest_sample'


Step 3: Add _append_digest_sample to api.py

In avocet/app/api.py, add this helper (near the top, after the imports and _DATA_DIR constant):

_DIGEST_SAMPLES_FILE = _DATA_DIR / "digest_samples.jsonl"


def _append_digest_sample(entry: dict, data_dir: Path | None = None) -> None:
    """Append a digest-labeled email to the sample corpus."""
    target_dir = data_dir if data_dir is not None else _DATA_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    bucket = target_dir / "digest_samples.jsonl"
    record = {
        "subject":   entry.get("subject", ""),
        "body":      entry.get("body", ""),
        "from_addr": entry.get("from_addr", entry.get("from", "")),
        "date":      entry.get("date", ""),
        "account":   entry.get("account", entry.get("source", "")),
    }
    with bucket.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

Then in post_label() (around line 127, after _append_jsonl(_score_file(), record)):

    if req.label == "digest":
        _append_digest_sample(match)

Step 4: Add the same write to label_tool.py

In avocet/app/label_tool.py, add a module-level constant after _SCORE_FILE:

_DIGEST_SAMPLES_FILE = _ROOT / "data" / "digest_samples.jsonl"

In _do_label() (around line 728, after _append_jsonl(_SCORE_FILE, row)):

            if label == "digest":
                _append_jsonl(
                    _DIGEST_SAMPLES_FILE,
                    {
                        "subject":   entry.get("subject", ""),
                        "body":      (entry.get("body", ""))[:2000],
                        "from_addr": entry.get("from_addr", ""),
                        "date":      entry.get("date", ""),
                        "account":   entry.get("account", ""),
                    },
                )

(_append_jsonl already exists in label_tool.py at line ~396 — reuse it.)

Step 5: Create the example file

Create avocet/data/digest_samples.jsonl.example:

{"subject": "10 new Software Engineer jobs for you", "body": "Software Engineer\nAcme Corp\nSan Francisco, CA\n\nView job: https://www.linkedin.com/jobs/view/1234567890/\n", "from_addr": "LinkedIn <jobalerts@linkedin.com>", "date": "Mon, 03 Mar 2026 09:00:00 +0000", "account": "example@gmail.com"}

Step 6: Update .gitignore in avocet

Verify data/digest_samples.jsonl is gitignored. Open avocet/.gitignore — it should already have data/*.jsonl. If not, add:

data/digest_samples.jsonl

Step 7: Run all avocet tests

/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v

Expected: all tests PASS

Step 8: Commit

cd /Library/Development/CircuitForge/avocet
git add app/api.py app/label_tool.py tests/test_digest_bucket.py data/digest_samples.jsonl.example
git commit -m "feat: digest sample bucket — write digest-labeled emails to digest_samples.jsonl"

Summary

Task Repo Commit message
1 peregrine feat: digest parser registry + LinkedIn parser (moved from imap_sync)
2 avocet feat: fetch_digest_samples script for building new parsers
3 peregrine feat: Adzuna digest email parser
4 peregrine feat: The Ladders digest email parser
5 peregrine refactor: imap_sync uses digest_parsers dispatcher; remove inline LinkedIn parser
6 avocet feat: digest sample bucket — write digest-labeled emails to digest_samples.jsonl

Tasks 1, 2, and 6 are independent and can be done in any order. Tasks 3 and 4 depend on Task 2 (samples needed before implementing parsers). Task 5 depends on Tasks 1, 3, and 4 (all parsers should be ready before switching imap_sync).