peregrine/docs/plans/2026-02-24-craigslist-scraper-plan.md
pyr0ball c368c7a977 chore: seed Peregrine from personal job-seeker (pre-generalization)
App: Peregrine
Company: Circuit Forge LLC
Source: github.com/pyr0ball/job-seeker (personal fork, not linked)
2026-02-24 18:25:39 -08:00

728 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Craigslist Scraper Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add a Craigslist RSS-based job scraper to `scripts/custom_boards/craigslist.py`, wired into the existing discovery pipeline, with LLM extraction of company name and salary from the fetched posting body.
**Architecture:** RSS fetch per metro × title → `scrape_url` background task fills description → new `enrich_craigslist` task type extracts company/salary via LLM. Config-driven metro list in `config/craigslist.yaml`. Integrates via the existing `CUSTOM_SCRAPERS` registry in `discover.py`.
**Tech Stack:** Python 3.11, `requests`, `xml.etree.ElementTree` (stdlib), `PyYAML`, `email.utils.parsedate_to_datetime` (stdlib), existing `llm_router.py`
**Test runner:** `/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v`
---
## Task 1: Config files + .gitignore
**Files:**
- Create: `config/craigslist.yaml.example`
- Create: `config/craigslist.yaml`
- Modify: `.gitignore`
**Step 1: Create `config/craigslist.yaml.example`**
```yaml
# Craigslist metro subdomains to search.
# Copy to config/craigslist.yaml and adjust for your markets.
# Full subdomain list: https://www.craigslist.org/about/sites
metros:
- sfbay
- newyork
- chicago
- losangeles
- seattle
- austin
# Maps search profile location strings → Craigslist metro subdomain.
# Locations not listed here are silently skipped.
location_map:
"San Francisco Bay Area, CA": sfbay
"New York, NY": newyork
"Chicago, IL": chicago
"Los Angeles, CA": losangeles
"Seattle, WA": seattle
"Austin, TX": austin
# Craigslist job category. Defaults to 'jjj' (general jobs) if omitted.
# Other options: csr (customer service), mar (marketing), sof (software/qa/dba)
# category: jjj
```
**Step 2: Create `config/craigslist.yaml`** (personal config — gitignored)
Copy `.example` as-is (Meghan targets sfbay + remote, so this default is correct).
**Step 3: Add to `.gitignore`**
Add `config/craigslist.yaml` after the existing `config/adzuna.yaml` line:
```
config/adzuna.yaml
config/craigslist.yaml
```
**Step 4: Commit**
```bash
git add config/craigslist.yaml.example .gitignore
git commit -m "feat: add craigslist config template and gitignore entry"
```
---
## Task 2: Core scraper tests (write failing first)
**Files:**
- Create: `tests/test_craigslist.py`
**Step 1: Create `tests/test_craigslist.py` with all fixtures and tests**
```python
"""Tests for Craigslist RSS scraper."""
from datetime import datetime, timezone, timedelta
from email.utils import format_datetime
from unittest.mock import patch, MagicMock
import xml.etree.ElementTree as ET
import pytest
import requests
# ── RSS fixture helpers ────────────────────────────────────────────────────────
def _make_rss(items: list[dict]) -> bytes:
"""Build minimal Craigslist-style RSS XML from a list of item dicts."""
channel = ET.Element("channel")
for item_data in items:
item = ET.SubElement(channel, "item")
for tag, value in item_data.items():
el = ET.SubElement(item, tag)
el.text = value
rss = ET.Element("rss")
rss.append(channel)
return ET.tostring(rss, encoding="utf-8", xml_declaration=True)
def _pubdate(hours_ago: float = 1.0) -> str:
"""Return an RFC 2822 pubDate string for N hours ago."""
dt = datetime.now(tz=timezone.utc) - timedelta(hours=hours_ago)
return format_datetime(dt)
def _mock_resp(content: bytes, status_code: int = 200) -> MagicMock:
mock = MagicMock()
mock.status_code = status_code
mock.content = content
mock.raise_for_status = MagicMock()
if status_code >= 400:
mock.raise_for_status.side_effect = requests.HTTPError(f"HTTP {status_code}")
return mock
# ── Fixtures ──────────────────────────────────────────────────────────────────
_SAMPLE_RSS = _make_rss([{
"title": "Customer Success Manager",
"link": "https://sfbay.craigslist.org/jjj/d/csm-role/1234567890.html",
"description": "Great CSM role at Acme Corp. Salary $120k.",
"pubDate": _pubdate(1),
}])
_TWO_ITEM_RSS = _make_rss([
{
"title": "Customer Success Manager",
"link": "https://sfbay.craigslist.org/jjj/d/csm-role/1111111111.html",
"description": "CSM role 1.",
"pubDate": _pubdate(1),
},
{
"title": "Account Manager",
"link": "https://sfbay.craigslist.org/jjj/d/am-role/2222222222.html",
"description": "AM role.",
"pubDate": _pubdate(2),
},
])
_OLD_ITEM_RSS = _make_rss([{
"title": "Old Job",
"link": "https://sfbay.craigslist.org/jjj/d/old-job/9999999999.html",
"description": "Very old posting.",
"pubDate": _pubdate(hours_ago=500),
}])
_TWO_METRO_CONFIG = {
"metros": ["sfbay", "newyork"],
"location_map": {
"San Francisco Bay Area, CA": "sfbay",
"New York, NY": "newyork",
},
"category": "jjj",
}
_SINGLE_METRO_CONFIG = {
"metros": ["sfbay"],
"location_map": {"San Francisco Bay Area, CA": "sfbay"},
}
_PROFILE = {"titles": ["Customer Success Manager"], "hours_old": 240}
# ── Tests ─────────────────────────────────────────────────────────────────────
def test_scrape_returns_empty_on_missing_config(tmp_path):
"""Missing craigslist.yaml → returns [] without raising."""
with patch("scripts.custom_boards.craigslist._CONFIG_PATH",
tmp_path / "craigslist.yaml"):
import importlib
import scripts.custom_boards.craigslist as cl
importlib.reload(cl)
result = cl.scrape(_PROFILE, "San Francisco Bay Area, CA")
assert result == []
def test_scrape_remote_hits_all_metros():
"""location='Remote' triggers one RSS fetch per configured metro."""
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_TWO_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
return_value=_mock_resp(_SAMPLE_RSS)) as mock_get:
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "Remote")
assert mock_get.call_count == 2
fetched_urls = [call.args[0] for call in mock_get.call_args_list]
assert any("sfbay" in u for u in fetched_urls)
assert any("newyork" in u for u in fetched_urls)
assert all(r["is_remote"] for r in result)
def test_scrape_location_map_resolves():
"""Known location string maps to exactly one metro."""
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_TWO_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
return_value=_mock_resp(_SAMPLE_RSS)) as mock_get:
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "San Francisco Bay Area, CA")
assert mock_get.call_count == 1
assert "sfbay" in mock_get.call_args.args[0]
assert len(result) == 1
assert result[0]["is_remote"] is False
def test_scrape_location_not_in_map_returns_empty():
"""Location not in location_map → [] without raising."""
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_SINGLE_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get") as mock_get:
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "Portland, OR")
assert result == []
mock_get.assert_not_called()
def test_hours_old_filter():
"""Items older than hours_old are excluded."""
profile = {"titles": ["Customer Success Manager"], "hours_old": 48}
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_SINGLE_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
return_value=_mock_resp(_OLD_ITEM_RSS)):
from scripts.custom_boards import craigslist
result = craigslist.scrape(profile, "San Francisco Bay Area, CA")
assert result == []
def test_dedup_within_run():
"""Same URL from two different metros is only returned once."""
same_url_rss = _make_rss([{
"title": "CSM Role",
"link": "https://sfbay.craigslist.org/jjj/d/csm/1234.html",
"description": "Same job.",
"pubDate": _pubdate(1),
}])
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_TWO_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
return_value=_mock_resp(same_url_rss)):
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "Remote")
urls = [r["url"] for r in result]
assert len(urls) == len(set(urls))
def test_http_error_graceful():
"""HTTP error → [] without raising."""
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_SINGLE_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
side_effect=requests.RequestException("timeout")):
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "San Francisco Bay Area, CA")
assert result == []
def test_results_wanted_cap():
"""Never returns more than results_wanted items."""
with patch("scripts.custom_boards.craigslist._load_config",
return_value=_TWO_METRO_CONFIG):
with patch("scripts.custom_boards.craigslist.requests.get",
return_value=_mock_resp(_TWO_ITEM_RSS)):
from scripts.custom_boards import craigslist
result = craigslist.scrape(_PROFILE, "Remote", results_wanted=1)
assert len(result) <= 1
```
**Step 2: Run tests to verify they all fail**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_craigslist.py -v
```
Expected: `ModuleNotFoundError: No module named 'scripts.custom_boards.craigslist'`
---
## Task 3: Implement `scripts/custom_boards/craigslist.py`
**Files:**
- Create: `scripts/custom_boards/craigslist.py`
**Step 1: Create the scraper**
```python
"""Craigslist job scraper — RSS-based.
Uses Craigslist's native RSS feed endpoint for discovery.
Full job description is populated by the scrape_url background task.
Company name and salary (not structured in Craigslist listings) are
extracted from the description body by the enrich_craigslist task.
Config: config/craigslist.yaml (gitignored — metro list + location map)
config/craigslist.yaml.example (committed template)
Returns a list of dicts compatible with scripts.db.insert_job().
"""
from __future__ import annotations
import time
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path
from urllib.parse import quote_plus
import requests
import yaml
_CONFIG_PATH = Path(__file__).parent.parent.parent / "config" / "craigslist.yaml"
_DEFAULT_CATEGORY = "jjj"
_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)
}
_TIMEOUT = 15
_SLEEP = 0.5 # seconds between requests — easy to make configurable later
def _load_config() -> dict:
if not _CONFIG_PATH.exists():
raise FileNotFoundError(
f"Craigslist config not found: {_CONFIG_PATH}\n"
"Copy config/craigslist.yaml.example → config/craigslist.yaml "
"and configure your target metros."
)
cfg = yaml.safe_load(_CONFIG_PATH.read_text()) or {}
if not cfg.get("metros"):
raise ValueError(
"config/craigslist.yaml must contain at least one entry under 'metros'."
)
return cfg
def _rss_url(metro: str, category: str, query: str) -> str:
return (
f"https://{metro}.craigslist.org/search/{category}"
f"?query={quote_plus(query)}&format=rss&sort=date"
)
def _parse_pubdate(pubdate_str: str) -> datetime | None:
"""Parse an RSS pubDate string to a timezone-aware datetime."""
try:
return parsedate_to_datetime(pubdate_str)
except Exception:
return None
def _fetch_rss(url: str) -> list[dict]:
"""Fetch and parse a Craigslist RSS feed. Returns list of raw item dicts."""
resp = requests.get(url, headers=_HEADERS, timeout=_TIMEOUT)
resp.raise_for_status()
try:
root = ET.fromstring(resp.content)
except ET.ParseError as exc:
raise ValueError(f"Malformed RSS XML: {exc}") from exc
items = []
for item in root.findall(".//item"):
def _text(tag: str, _item=item) -> str:
el = _item.find(tag)
return (el.text or "").strip() if el is not None else ""
items.append({
"title": _text("title"),
"link": _text("link"),
"description": _text("description"),
"pubDate": _text("pubDate"),
})
return items
def scrape(profile: dict, location: str, results_wanted: int = 50) -> list[dict]:
"""Fetch jobs from Craigslist RSS for a single location.
Args:
profile: Search profile dict from search_profiles.yaml.
location: Location string (e.g. "Remote" or "San Francisco Bay Area, CA").
results_wanted: Maximum results to return across all metros and titles.
Returns:
List of job dicts with keys: title, company, url, source, location,
is_remote, salary, description.
company/salary are empty — filled later by enrich_craigslist task.
"""
try:
cfg = _load_config()
except (FileNotFoundError, ValueError) as exc:
print(f" [craigslist] Skipped — {exc}")
return []
metros_all: list[str] = cfg.get("metros", [])
location_map: dict[str, str] = cfg.get("location_map", {})
category: str = cfg.get("category") or _DEFAULT_CATEGORY
is_remote_search = location.lower() == "remote"
if is_remote_search:
metros = metros_all
else:
metro = location_map.get(location)
if not metro:
print(f" [craigslist] No metro mapping for '{location}' — skipping")
return []
metros = [metro]
titles: list[str] = profile.get("titles", [])
hours_old: int = profile.get("hours_old", 240)
cutoff = datetime.now(tz=timezone.utc).timestamp() - (hours_old * 3600)
seen_urls: set[str] = set()
results: list[dict] = []
for metro in metros:
if len(results) >= results_wanted:
break
for title in titles:
if len(results) >= results_wanted:
break
url = _rss_url(metro, category, title)
try:
items = _fetch_rss(url)
except requests.RequestException as exc:
print(f" [craigslist] HTTP error ({metro}/{title}): {exc}")
time.sleep(_SLEEP)
continue
except ValueError as exc:
print(f" [craigslist] Parse error ({metro}/{title}): {exc}")
time.sleep(_SLEEP)
continue
for item in items:
if len(results) >= results_wanted:
break
item_url = item.get("link", "")
if not item_url or item_url in seen_urls:
continue
pub = _parse_pubdate(item.get("pubDate", ""))
if pub and pub.timestamp() < cutoff:
continue
seen_urls.add(item_url)
results.append({
"title": item.get("title", ""),
"company": "",
"url": item_url,
"source": "craigslist",
"location": f"{metro} (Craigslist)",
"is_remote": is_remote_search,
"salary": "",
"description": "",
})
time.sleep(_SLEEP)
return results[:results_wanted]
```
**Step 2: Run tests**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_craigslist.py -v
```
Expected: all 8 PASS
**Step 3: Run full test suite to check for regressions**
```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```
Expected: all existing tests still PASS
**Step 4: Commit**
```bash
git add scripts/custom_boards/craigslist.py tests/test_craigslist.py
git commit -m "feat: add Craigslist RSS scraper to custom_boards"
```
---
## Task 4: Wire into discover.py + search_profiles.yaml
**Files:**
- Modify: `scripts/discover.py:20-32`
- Modify: `config/search_profiles.yaml`
**Step 1: Add to `CUSTOM_SCRAPERS` registry in `discover.py`**
Find this block (around line 20):
```python
from scripts.custom_boards import adzuna as _adzuna
from scripts.custom_boards import theladders as _theladders
```
Replace with:
```python
from scripts.custom_boards import adzuna as _adzuna
from scripts.custom_boards import theladders as _theladders
from scripts.custom_boards import craigslist as _craigslist
```
Find:
```python
CUSTOM_SCRAPERS: dict[str, object] = {
"adzuna": _adzuna.scrape,
"theladders": _theladders.scrape,
}
```
Replace with:
```python
CUSTOM_SCRAPERS: dict[str, object] = {
"adzuna": _adzuna.scrape,
"theladders": _theladders.scrape,
"craigslist": _craigslist.scrape,
}
```
**Step 2: Add `craigslist` to relevant profiles in `config/search_profiles.yaml`**
For each profile that has `custom_boards:`, add `- craigslist`. Example — the `cs_leadership` profile currently has:
```yaml
custom_boards:
- adzuna
- theladders
```
Change to:
```yaml
custom_boards:
- adzuna
- theladders
- craigslist
```
Repeat for all profiles where Craigslist makes sense (all of them — remote + SF Bay Area are both mapped).
**Step 3: Verify discover.py imports cleanly**
```bash
conda run -n job-seeker python -c "from scripts.discover import CUSTOM_SCRAPERS; print(list(CUSTOM_SCRAPERS.keys()))"
```
Expected: `['adzuna', 'theladders', 'craigslist']`
**Step 4: Commit**
```bash
git add scripts/discover.py config/search_profiles.yaml
git commit -m "feat: register craigslist scraper in discover.py and search profiles"
```
---
## Task 5: LLM enrichment — extract company + salary for Craigslist jobs
**Files:**
- Modify: `scripts/enrich_descriptions.py`
- Modify: `scripts/task_runner.py`
**Step 1: Read `scripts/task_runner.py`** to understand the `scrape_url` completion handler before editing.
**Step 2: Add `enrich_craigslist_fields()` to `enrich_descriptions.py`**
Add this function after `enrich_all_descriptions` (before `if __name__ == "__main__"`):
```python
def enrich_craigslist_fields(
db_path: Path = DEFAULT_DB,
job_id: int = None,
) -> dict:
"""
Use LLM to extract company name and salary from a Craigslist job description.
Called after scrape_url populates the description for a craigslist job.
Only runs when: source='craigslist', company='', description non-empty.
Returns dict with keys 'company' and/or 'salary' (may be empty strings).
"""
import sqlite3 as _sq
conn = _sq.connect(db_path)
conn.row_factory = _sq.Row
row = conn.execute(
"SELECT id, description, company, source FROM jobs WHERE id=?", (job_id,)
).fetchone()
conn.close()
if not row:
return {}
if row["source"] != "craigslist":
return {}
if row["company"]: # already populated
return {}
if not (row["description"] or "").strip():
return {}
sys.path.insert(0, str(Path(__file__).parent.parent))
from scripts.llm_router import LLMRouter
prompt = (
"Extract the following from this job posting. "
"Return JSON only, no commentary.\n\n"
'{"company": "<company name or empty string>", '
'"salary": "<salary/compensation or empty string>"}\n\n'
f"Posting:\n{row['description'][:3000]}"
)
try:
router = LLMRouter()
raw = router.complete(prompt)
except Exception as exc:
print(f"[enrich_craigslist] LLM error for job {job_id}: {exc}")
return {}
import json, re
try:
# Strip markdown code fences if present
clean = re.sub(r"```(?:json)?|```", "", raw).strip()
fields = json.loads(clean)
except (json.JSONDecodeError, ValueError):
print(f"[enrich_craigslist] Could not parse LLM response for job {job_id}: {raw!r}")
return {}
extracted = {
k: (fields.get(k) or "").strip()
for k in ("company", "salary")
if (fields.get(k) or "").strip()
}
if extracted:
from scripts.db import update_job_fields
update_job_fields(db_path, job_id, extracted)
print(f"[enrich_craigslist] job {job_id}: "
f"company={extracted.get('company', '—')} "
f"salary={extracted.get('salary', '—')}")
return extracted
```
Also add `import sys` to the top of `enrich_descriptions.py` if not already present.
**Step 3: Add `enrich_craigslist` task type to `task_runner.py`**
In `_run_task`, add a new `elif` branch. Find the block that handles `scrape_url` and add after it:
```python
elif task_type == "enrich_craigslist":
from scripts.enrich_descriptions import enrich_craigslist_fields
extracted = enrich_craigslist_fields(db_path, job_id)
company = extracted.get("company", "")
msg = f"company={company}" if company else "no company found"
update_task_status(db_path, task_id, "completed", error=msg)
return
```
**Step 4: Auto-submit `enrich_craigslist` after `scrape_url` for Craigslist jobs**
Still in `task_runner.py`, find the `scrape_url` completion handler. After the `update_task_status` call for `scrape_url`, add:
```python
# Auto-enrich company/salary for Craigslist jobs
import sqlite3 as _sq
_conn = _sq.connect(db_path)
_conn.row_factory = _sq.Row
_job = _conn.execute(
"SELECT source, company FROM jobs WHERE id=?", (job_id,)
).fetchone()
_conn.close()
if _job and _job["source"] == "craigslist" and not _job["company"]:
submit_task(db_path, "enrich_craigslist", job_id)
```
**Step 5: Smoke test — run a discovery cycle and check a craigslist job**
```bash
conda run -n job-seeker python -c "
from scripts.custom_boards.craigslist import scrape
jobs = scrape({'titles': ['Customer Success Manager'], 'hours_old': 48}, 'San Francisco Bay Area, CA', results_wanted=3)
for j in jobs:
print(j['title'], '|', j['url'])
"
```
Expected: 03 job dicts printed (may be 0 if no recent postings — that's fine).
**Step 6: Commit**
```bash
git add scripts/enrich_descriptions.py scripts/task_runner.py
git commit -m "feat: add enrich_craigslist task for LLM company/salary extraction"
```
---
## Final: push to remote
```bash
git push origin main
```