# Adding a Custom Job Board Scraper
Peregrine supports pluggable custom job board scrapers. Standard boards are handled by the JobSpy library; custom scrapers cover boards with non-standard APIs, paywalls, or server-side-rendered pages.

This guide walks through adding a new scraper from scratch.
## Step 1 — Create the scraper module
Create `scripts/custom_boards/myboard.py`. Every custom scraper must implement one function:
```python
# scripts/custom_boards/myboard.py
import requests
from datetime import datetime


def scrape(profile: dict, db_path: str) -> list[dict]:
    """
    Scrape job listings from MyBoard for the given search profile.

    Args:
        profile: The active search profile dict from search_profiles.yaml.
            Keys include: titles (list), locations (list),
            hours_old (int), results_per_board (int).
        db_path: Absolute path to staging.db. Use this if you need to
            check for existing URLs before returning.

    Returns:
        List of job dicts. Each dict must contain at minimum:
            title (str) — job title
            company (str) — company name
            url (str) — canonical job URL (used as unique key)
            source (str) — board identifier, e.g. "myboard"
            location (str) — "Remote" or "City, State"
            is_remote (bool) — True if remote
            salary (str) — salary string or "" if unknown
            description (str) — full job description text or "" if unavailable
            date_found (str) — ISO 8601 datetime string, e.g. "2026-02-25T12:00:00"
    """
    jobs = []
    for title in profile.get("titles", []):
        for location in profile.get("locations", []):
            results = _fetch_from_myboard(title, location, profile)
            jobs.extend(results)
    return jobs


def _fetch_from_myboard(title: str, location: str, profile: dict) -> list[dict]:
    """Internal helper — call the board's API and transform results."""
    params = {
        "q": title,
        "l": location,
        "limit": profile.get("results_per_board", 50),
    }
    try:
        resp = requests.get(
            "https://api.myboard.com/jobs",
            params=params,
            timeout=15,
        )
        resp.raise_for_status()
        data = resp.json()
    except Exception as e:
        print(f"[myboard] fetch error: {e}")
        return []

    jobs = []
    for item in data.get("results", []):
        jobs.append({
            "title": item.get("title", ""),
            "company": item.get("company", ""),
            "url": item.get("url", ""),
            "source": "myboard",
            "location": item.get("location", ""),
            "is_remote": "remote" in item.get("location", "").lower(),
            "salary": item.get("salary", ""),
            "description": item.get("description", ""),
            "date_found": datetime.utcnow().isoformat(),
        })
    return jobs
```
### Required fields

| Field | Type | Notes |
|---|---|---|
| `title` | str | Job title |
| `company` | str | Company name |
| `url` | str | Unique key — must be stable and canonical |
| `source` | str | Short board identifier, e.g. `"myboard"` |
| `location` | str | `"Remote"` or `"City, ST"` |
| `is_remote` | bool | `True` if remote |
| `salary` | str | Salary string or `""` |
| `description` | str | Full description text or `""` |
| `date_found` | str | ISO 8601 UTC datetime |
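During development it can help to assert this contract before returning from `scrape()`. The helper below is an illustrative sketch, not part of Peregrine:

```python
# Hypothetical contract checker for the required-fields table above.
REQUIRED_FIELDS = {
    "title": str, "company": str, "url": str, "source": str,
    "location": str, "is_remote": bool, "salary": str,
    "description": str, "date_found": str,
}


def validate_job(job: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in job:
            problems.append(f"missing field: {field}")
        elif not isinstance(job[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # url doubles as the unique key, so it must not be blank
    if "url" in job and not job["url"]:
        problems.append("url must be non-empty (it is the unique key)")
    return problems
```

Calling `validate_job(job)` on each dict before returning, and printing any problems, catches shape bugs long before they reach the database.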
### Deduplication

`discover.py` deduplicates by `url` before inserting into the database. If a job with the same URL already exists, it is silently skipped. You do not need to handle deduplication inside your scraper.
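For illustration, URL-based deduplication amounts to roughly the following. This is a sketch only: the `jobs` table name and `url` column are assumptions, not the actual staging.db schema:

```python
import sqlite3


def filter_new_jobs(jobs: list[dict], db_path: str) -> list[dict]:
    """Drop any job whose URL already exists in the database.

    Assumes a `jobs` table with a `url` column; adjust to the real schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        existing = {row[0] for row in conn.execute("SELECT url FROM jobs")}
    finally:
        conn.close()
    return [j for j in jobs if j["url"] not in existing]
```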
### Rate limiting

Be a good citizen:

- Add a `time.sleep(0.5)` between paginated requests
- Respect `Retry-After` headers
- Do not scrape faster than a human browsing the site
- If the site provides an official API, prefer that over scraping HTML
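The first two points can be combined in a polite pagination loop. A sketch under assumptions: the `page` query parameter and the endpoint's paging behavior are illustrative, not documented MyBoard behavior:

```python
import time

import requests


def fetch_all_pages(params: dict, max_pages: int = 10) -> list[dict]:
    """Fetch paginated results, sleeping between requests and honoring Retry-After."""
    results = []
    for page in range(1, max_pages + 1):
        resp = requests.get("https://api.myboard.com/jobs",
                            params={**params, "page": page}, timeout=15)
        if resp.status_code == 429:
            # Back off for as long as the server asks, then retry this page once
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            resp = requests.get("https://api.myboard.com/jobs",
                                params={**params, "page": page}, timeout=15)
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break  # an empty page means we are past the last result
        results.extend(batch)
        time.sleep(0.5)  # be gentle between pages
    return results
```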
### Credentials

If your scraper requires API keys or credentials:

- Create `config/myboard.yaml.example` as a template
- Create `config/myboard.yaml` (gitignored) for live credentials
- Read it in your scraper with `yaml.safe_load(open("config/myboard.yaml"))`
- Document the credential setup in comments at the top of your module
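Loading the credentials file with a clear error when it is missing might look like this (a sketch; the helper name and key names are illustrative):

```python
from pathlib import Path

import yaml  # third-party: PyYAML


def load_credentials(path: str = "config/myboard.yaml") -> dict:
    """Read the gitignored credentials file, failing with a helpful message."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{path} not found: copy {path}.example and fill in your API key"
        )
    with p.open() as f:
        return yaml.safe_load(f) or {}
```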
## Step 2 — Register the scraper

Open `scripts/discover.py` and add your scraper to the `CUSTOM_SCRAPERS` dict:

```python
from scripts.custom_boards import adzuna, theladders, craigslist, myboard

CUSTOM_SCRAPERS = {
    "adzuna": adzuna.scrape,
    "theladders": theladders.scrape,
    "craigslist": craigslist.scrape,
    "myboard": myboard.scrape,  # add this line
}
```
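Conceptually, discover.py consumes this dict by dispatching on each profile's `custom_boards` list. The function below is a simplified illustration, not the actual implementation:

```python
def run_custom_scrapers(profile: dict, db_path: str,
                        custom_scrapers: dict) -> list[dict]:
    """Invoke each registered scraper named in the profile's custom_boards."""
    jobs = []
    for board in profile.get("custom_boards", []):
        scrape_fn = custom_scrapers.get(board)
        if scrape_fn is None:
            print(f"[discover] unknown custom board: {board}")
            continue
        jobs.extend(scrape_fn(profile, db_path))
    return jobs
```

This is why the dict key ("myboard") must match the name used in search_profiles.yaml: an unregistered name is skipped rather than scraped.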
## Step 3 — Activate in a search profile

Open `config/search_profiles.yaml` and add `myboard` to `custom_boards` in any profile:

```yaml
profiles:
  - name: cs_leadership
    boards:
      - linkedin
      - indeed
    custom_boards:
      - adzuna
      - myboard  # add this line
    titles:
      - Customer Success Manager
    locations:
      - Remote
```
## Step 4 — Write a test

Create `tests/test_myboard.py`. Mock the HTTP call to avoid hitting the live API during tests:

```python
# tests/test_myboard.py
from unittest.mock import patch

from scripts.custom_boards.myboard import scrape

MOCK_RESPONSE = {
    "results": [
        {
            "title": "Customer Success Manager",
            "company": "Acme Corp",
            "url": "https://myboard.com/jobs/12345",
            "location": "Remote",
            "salary": "$80,000 - $100,000",
            "description": "We are looking for a CSM...",
        }
    ]
}


def test_scrape_returns_correct_shape():
    profile = {
        "titles": ["Customer Success Manager"],
        "locations": ["Remote"],
        "results_per_board": 10,
        "hours_old": 240,
    }
    with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
        mock_get.return_value.ok = True
        mock_get.return_value.raise_for_status = lambda: None
        mock_get.return_value.json.return_value = MOCK_RESPONSE

        jobs = scrape(profile, db_path="nonexistent.db")

    assert len(jobs) == 1
    job = jobs[0]

    # Required fields
    for field in (
        "title", "company", "url", "source", "location",
        "is_remote", "salary", "description", "date_found",
    ):
        assert field in job

    assert job["source"] == "myboard"
    assert job["title"] == "Customer Success Manager"
    assert job["url"] == "https://myboard.com/jobs/12345"


def test_scrape_handles_http_error_gracefully():
    profile = {
        "titles": ["Customer Success Manager"],
        "locations": ["Remote"],
        "results_per_board": 10,
        "hours_old": 240,
    }
    with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
        mock_get.side_effect = Exception("Connection refused")
        jobs = scrape(profile, db_path="nonexistent.db")
    assert jobs == []
```
## Existing Scrapers as Reference

| Scraper | Notes |
|---|---|
| `scripts/custom_boards/adzuna.py` | REST API with `app_id` + `app_key` authentication |
| `scripts/custom_boards/theladders.py` | SSR scraper using `curl_cffi` to parse `__NEXT_DATA__` JSON embedded in the page |
| `scripts/custom_boards/craigslist.py` | RSS feed scraper |