peregrine/docs/plans/2026-02-24-craigslist-scraper-design.md
pyr0ball f11a38eb0b chore: seed Peregrine from personal job-seeker (pre-generalization)
App: Peregrine
Company: Circuit Forge LLC
Source: github.com/pyr0ball/job-seeker (personal fork, not linked)
2026-02-24 18:25:39 -08:00


Design: Craigslist Custom Board Scraper

Date: 2026-02-24
Status: Approved


Overview

Add a Craigslist scraper to scripts/custom_boards/craigslist.py following the existing adzuna/theladders pattern. Craigslist is regional (one subdomain per metro), has no native remote filter, and exposes an RSS feed that gives clean structured data without Playwright.

Discovery uses RSS for speed and reliability. Full job description is populated by the existing scrape_url background task. Company name and salary — not present in Craigslist listings as structured fields — are extracted from the description body by the existing enrich_descriptions LLM pipeline after the posting is fetched.


Files

| Action | File |
|--------|------|
| Create | scripts/custom_boards/craigslist.py |
| Create | config/craigslist.yaml (gitignored) |
| Create | config/craigslist.yaml.example |
| Create | tests/test_craigslist.py |
| Modify | scripts/discover.py: add to CUSTOM_SCRAPERS registry |
| Modify | scripts/enrich_descriptions.py: add company/salary extraction for craigslist source |
| Modify | config/search_profiles.yaml: add craigslist to custom_boards on relevant profiles |
| Modify | .gitignore: add config/craigslist.yaml |

Config (config/craigslist.yaml)

Gitignored; config/craigslist.yaml.example is committed alongside it as a template.

# Craigslist metro subdomains to search.
# Full list at: https://www.craigslist.org/about/sites
metros:
  - sfbay
  - newyork
  - chicago
  - losangeles
  - seattle
  - austin

# Maps search profile location strings to a single metro subdomain.
# Locations not listed here are skipped silently.
location_map:
  "San Francisco Bay Area, CA": sfbay
  "New York, NY": newyork
  "Chicago, IL": chicago
  "Los Angeles, CA": losangeles
  "Seattle, WA": seattle
  "Austin, TX": austin

# Craigslist job category. Defaults to 'jjj' (general jobs) if omitted.
# Other useful values: csr (customer service), mar (marketing), sof (software)
# category: jjj
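
As a rough illustration of how metros and location_map are meant to interact, here is a minimal sketch of the metro lookup. It assumes the YAML has already been loaded into a plain dict; the helper name resolve_metros is hypothetical and not part of the existing codebase. The "remote" rule comes from the scrape flow described in the next section.

```python
def resolve_metros(config: dict, location: str) -> list:
    """Return the metro subdomains to search for one profile location.

    Hypothetical helper: `config` is the parsed craigslist.yaml as a dict.
    """
    metros = config.get("metros") or []
    location_map = config.get("location_map") or {}

    if location.lower() == "remote":
        # Craigslist has no remote filter, so every configured metro is searched.
        return list(metros)

    metro = location_map.get(location)
    # Unmapped locations produce no metros and are skipped without a warning.
    return [metro] if metro else []


config = {
    "metros": ["sfbay", "newyork"],
    "location_map": {"San Francisco Bay Area, CA": "sfbay"},
}
print(resolve_metros(config, "Remote"))
print(resolve_metros(config, "San Francisco Bay Area, CA"))
print(resolve_metros(config, "Denver, CO"))
```

Returning an empty list for unmapped locations lets the caller treat "nothing to search" and "no results" the same way.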

Scraper Architecture

RSS URL pattern

https://{metro}.craigslist.org/search/{category}?query={title}&format=rss&sort=date

Default category: jjj. Overridable via category key in config.
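
The pattern above can be sketched as a small URL builder. The function name build_rss_url and the DEFAULT_CATEGORY constant are illustrative assumptions; the one substantive choice is to URL-encode the title so that multi-word titles produce a valid query string.

```python
from urllib.parse import urlencode

DEFAULT_CATEGORY = "jjj"  # general jobs, per the config default


def build_rss_url(metro: str, title: str, category: str = DEFAULT_CATEGORY) -> str:
    """Build the RSS search URL for one metro and one job title (hypothetical helper)."""
    # urlencode escapes spaces and punctuation in the title.
    params = urlencode({"query": title, "format": "rss", "sort": "date"})
    return f"https://{metro}.craigslist.org/search/{category}?{params}"


print(build_rss_url("sfbay", "data engineer"))
# https://sfbay.craigslist.org/search/jjj?query=data+engineer&format=rss&sort=date
```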

scrape(profile, location, results_wanted) flow

  1. Load config/craigslist.yaml — return [] with a printed warning if missing or malformed
  2. Determine metros to search:
    • location.lower() == "remote" → all configured metros (Craigslist has no native remote filter)
    • Any other string → location_map.get(location) → single metro; skip silently if not mapped
  3. For each metro × each title in profile["titles"]:
    • Fetch RSS via requests.get with a standard User-Agent header
    • Parse with xml.etree.ElementTree (stdlib — no extra deps)
    • Filter <item> entries by <pubDate> against profile["hours_old"]
    • Extract title, URL, and description snippet from each item
    • time.sleep(0.5) between fetches (polite pacing; easy to make configurable later)
  4. Dedup by URL within the run via a seen_urls set
  5. Stop when results_wanted is reached
  6. Return list of job dicts
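
Steps 3 to 5 can be sketched with the standard library alone. This is a simplified illustration, not the planned implementation: the function names recent_items and collect are hypothetical, the fixture is hand-written, and the parsing assumes plain item, link, and pubDate elements as described above. If the real feed uses namespaced elements, the tag lookups would need adjusting. Fetching and pacing are left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def recent_items(xml_text: str, hours_old: int, now: Optional[datetime] = None) -> list:
    """Parse one RSS document and keep items published within hours_old."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours_old)
    items = []
    for item in ET.fromstring(xml_text).iter("item"):
        link = (item.findtext("link") or "").strip()
        pub = item.findtext("pubDate")
        if not link or not pub:
            continue  # incomplete item, nothing usable
        try:
            published = parsedate_to_datetime(pub)
        except (TypeError, ValueError):
            continue  # unparseable date, skip the item
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published < cutoff:
            continue  # older than the profile's hours_old window
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "url": link,
            "snippet": (item.findtext("description") or "").strip(),
        })
    return items


def collect(feeds: list, hours_old: int, results_wanted: int,
            now: Optional[datetime] = None) -> list:
    """Dedup by URL across feeds and stop once results_wanted is reached."""
    seen_urls = set()
    results = []
    for xml_text in feeds:
        for item in recent_items(xml_text, hours_old, now):
            if item["url"] in seen_urls:
                continue
            seen_urls.add(item["url"])
            results.append(item)
            if len(results) >= results_wanted:
                return results
    return results


FIXTURE = """<rss version="2.0"><channel>
<item><title>Data Engineer</title><link>https://sfbay.craigslist.org/a/1.html</link>
<pubDate>Tue, 24 Feb 2026 12:00:00 +0000</pubDate><description>Build pipelines</description></item>
<item><title>Old Job</title><link>https://sfbay.craigslist.org/a/2.html</link>
<pubDate>Tue, 10 Feb 2026 12:00:00 +0000</pubDate><description>Stale</description></item>
</channel></rss>"""

NOW = datetime(2026, 2, 24, 18, 0, tzinfo=timezone.utc)
# The same feed passed twice stands in for one posting surfacing in two metros.
print(collect([FIXTURE, FIXTURE], hours_old=24, results_wanted=10, now=NOW))
```

Passing now explicitly keeps the age filter deterministic, which is convenient for the fixture-based tests listed later in this document.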

Return dict shape

{
    "title":       "<RSS item title, cleaned>",
    "company":     "",              # not in Craigslist — filled by LLM enrichment
    "url":         "<item link>",
    "source":      "craigslist",
    "location":    "<metro> (Craigslist)",
    "is_remote":   True,            # if remote search, else False
    "salary":      "",              # not reliably structured — filled by LLM enrichment
    "description": "",              # scrape_url background task fills this in
}
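
A small constructor keeps every returned dict in this shape. The name make_job is hypothetical, and "cleaned" is interpreted here as whitespace normalisation only, which is an assumption rather than something this design specifies.

```python
def make_job(title: str, url: str, metro: str, remote_search: bool) -> dict:
    """Build one job dict in the shape shown above (hypothetical helper)."""
    return {
        "title": " ".join(title.split()),   # collapse stray whitespace (assumed cleaning rule)
        "company": "",                      # filled later by LLM enrichment
        "url": url,
        "source": "craigslist",
        "location": f"{metro} (Craigslist)",
        "is_remote": remote_search,         # True only when the search location was "remote"
        "salary": "",                       # filled later by LLM enrichment
        "description": "",                  # filled later by the scrape_url task
    }


job = make_job("  Data   Engineer ", "https://sfbay.craigslist.org/a/1.html", "sfbay", True)
print(job["title"], "|", job["location"])
```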

Error handling

  • Missing or malformed config → [] + printed warning, never raises
  • requests.RequestException → skip that metro/title, print warning, continue
  • Malformed RSS XML → skip that response, print warning, continue
  • HTTP non-200 → skip, print status code
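
The skip-and-continue behaviour for bad responses might look like the sketch below. To stay dependency-free it takes the status code and body as plain arguments and omits the network call itself, so the requests.RequestException case is not shown. The function name parse_response and the log prefix are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_response(status_code: int, body: str, label: str) -> Optional[ET.Element]:
    """Return the parsed RSS root, or None if the response should be skipped."""
    if status_code != 200:
        print(f"[craigslist] {label}: HTTP {status_code}, skipping")
        return None
    try:
        return ET.fromstring(body)
    except ET.ParseError as exc:
        print(f"[craigslist] {label}: malformed RSS ({exc}), skipping")
        return None


# Compare against None explicitly: an Element with no children is falsy.
print(parse_response(503, "", "sfbay/data engineer") is None)
print(parse_response(200, "<rss><channel", "sfbay/data engineer") is None)
print(parse_response(200, "<rss><channel/></rss>", "sfbay/data engineer") is None)
```

Returning None instead of raising lets the metro × title loop move on to the next fetch without a try block at every call site.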

LLM Enrichment for company/salary

Craigslist postings frequently include company name and salary in the body text, but not as structured fields. After scrape_url populates description, the enrich_descriptions task handles extraction.

Trigger condition: source == "craigslist" AND company == "" AND description != ""

Prompt addition: Extend the existing enrichment prompt to also extract:

  • Company name (if present in the posting body)
  • Salary or compensation range (if mentioned)

Results written back via update_job_fields. If the LLM cannot extract a company name, the field stays blank — this is expected and acceptable for Craigslist.
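
The trigger condition above can be expressed as a small predicate. The function name is hypothetical, and treating a missing or None field the same as an empty string is an assumption beyond what the condition states.

```python
def needs_craigslist_enrichment(job: dict) -> bool:
    """True when a Craigslist job has a description but no company yet."""
    return (
        job.get("source") == "craigslist"
        and not job.get("company")          # "" (or missing) means not yet extracted
        and bool(job.get("description"))    # scrape_url must have run first
    )


print(needs_craigslist_enrichment(
    {"source": "craigslist", "company": "", "description": "Acme is hiring..."}
))
print(needs_craigslist_enrichment(
    {"source": "craigslist", "company": "", "description": ""}
))
```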


discover.py Integration

One import plus a one-line addition to the CUSTOM_SCRAPERS registry:

from scripts.custom_boards import craigslist as _craigslist

CUSTOM_SCRAPERS: dict[str, object] = {
    "adzuna":      _adzuna.scrape,
    "theladders":  _theladders.scrape,
    "craigslist":  _craigslist.scrape,   # new
}

Add craigslist to custom_boards in config/search_profiles.yaml for relevant profiles.


Tests (tests/test_craigslist.py)

All tests use mocked requests.get with fixture RSS XML — no network calls.

| Test | Asserts |
|------|---------|
| test_scrape_returns_empty_on_missing_config | Missing yaml → [], no raise |
| test_scrape_remote_hits_all_metros | location="Remote" → one fetch per configured metro |
| test_scrape_location_map_resolves | "San Francisco Bay Area, CA" → sfbay only |
| test_scrape_location_not_in_map_returns_empty | Unknown location → [], no raise |
| test_hours_old_filter | Items older than hours_old are excluded |
| test_dedup_within_run | Same URL appearing in two metros only returned once |
| test_http_error_graceful | RequestException → [], no raise |
| test_results_wanted_cap | Never returns more than results_wanted |

Out of Scope

  • Playwright-based scraping (RSS is sufficient; Playwright adds a dep for no gain)
  • Craigslist subcategory multi-search per profile (config category override is sufficient)
  • Salary/company extraction directly in the scraper (LLM enrichment is the right layer)
  • Windows support (deferred globally)