Adds a full MkDocs documentation site under docs/ with Material theme. Getting Started: installation walkthrough, 7-step first-run wizard guide, Docker Compose profile reference with GPU memory guidance and preflight.py description. User Guide: job discovery (search profiles, custom boards, enrichment), job review (sorting, match scores, batch actions), apply workspace (cover letter gen, PDF export, mark applied), interviews (kanban stages, company research auto-trigger, survey assistant), email sync (IMAP, Gmail App Password, classification labels, stage auto-updates), integrations (all 13 drivers with tier requirements), settings (every tab documented). Developer Guide: contributing (dev env setup, code style, branch naming, PR checklist), architecture (ASCII layer diagram, design decisions), adding scrapers (full scrape() interface, registration, search profile config, test patterns), adding integrations (IntegrationBase full interface, auto- discovery, tier gating, test patterns), testing (patterns, fixtures, what not to test). Reference: tier system (full FEATURES table, can_use/tier_label API, dev override, adding gates), LLM router (backend types, complete() signature, fallback chains, vision routing, __auto__ resolution, adding backends), config files (every file with field-level docs and gitignore status). Also adds CONTRIBUTING.md at repo root pointing to the docs site.
244 lines
7.3 KiB
Markdown
244 lines
7.3 KiB
Markdown
# Adding a Custom Job Board Scraper
|
|
|
|
Peregrine supports pluggable custom job board scrapers. Standard boards use the JobSpy library. Custom scrapers handle boards with non-standard APIs, paywalls, or SSR-rendered pages.
|
|
|
|
This guide walks through adding a new scraper from scratch.
|
|
|
|
---
|
|
|
|
## Step 1 — Create the scraper module
|
|
|
|
Create `scripts/custom_boards/myboard.py`. Every custom scraper must implement one function:
|
|
|
|
```python
|
|
# scripts/custom_boards/myboard.py
|
|
|
|
def scrape(profile: dict, db_path: str) -> list[dict]:
|
|
"""
|
|
Scrape job listings from MyBoard for the given search profile.
|
|
|
|
Args:
|
|
profile: The active search profile dict from search_profiles.yaml.
|
|
Keys include: titles (list), locations (list),
|
|
hours_old (int), results_per_board (int).
|
|
db_path: Absolute path to staging.db. Use this if you need to
|
|
check for existing URLs before returning.
|
|
|
|
Returns:
|
|
List of job dicts. Each dict must contain at minimum:
|
|
title (str) — job title
|
|
company (str) — company name
|
|
url (str) — canonical job URL (used as unique key)
|
|
source (str) — board identifier, e.g. "myboard"
|
|
location (str) — "Remote" or "City, State"
|
|
is_remote (bool) — True if remote
|
|
salary (str) — salary string or "" if unknown
|
|
description (str) — full job description text or "" if unavailable
|
|
date_found (str) — ISO 8601 datetime string, e.g. "2026-02-25T12:00:00"
|
|
"""
|
|
jobs = []
|
|
|
|
for title in profile.get("titles", []):
|
|
for location in profile.get("locations", []):
|
|
results = _fetch_from_myboard(title, location, profile)
|
|
jobs.extend(results)
|
|
|
|
return jobs
|
|
|
|
|
|
def _fetch_from_myboard(title: str, location: str, profile: dict) -> list[dict]:
|
|
"""Internal helper — call the board's API and transform results."""
|
|
import requests
|
|
from datetime import datetime
|
|
|
|
params = {
|
|
"q": title,
|
|
"l": location,
|
|
"limit": profile.get("results_per_board", 50),
|
|
}
|
|
|
|
try:
|
|
resp = requests.get(
|
|
"https://api.myboard.com/jobs",
|
|
params=params,
|
|
timeout=15,
|
|
)
|
|
resp.raise_for_status()
|
|
data = resp.json()
|
|
except Exception as e:
|
|
print(f"[myboard] fetch error: {e}")
|
|
return []
|
|
|
|
jobs = []
|
|
for item in data.get("results", []):
|
|
jobs.append({
|
|
"title": item.get("title", ""),
|
|
"company": item.get("company", ""),
|
|
"url": item.get("url", ""),
|
|
"source": "myboard",
|
|
"location": item.get("location", ""),
|
|
"is_remote": "remote" in item.get("location", "").lower(),
|
|
"salary": item.get("salary", ""),
|
|
"description": item.get("description", ""),
|
|
"date_found": datetime.utcnow().isoformat(),
|
|
})
|
|
|
|
return jobs
|
|
```
|
|
|
|
### Required fields
|
|
|
|
| Field | Type | Notes |
|
|
|-------|------|-------|
|
|
| `title` | str | Job title |
|
|
| `company` | str | Company name |
|
|
| `url` | str | **Unique key** — must be stable and canonical |
|
|
| `source` | str | Short board identifier, e.g. `"myboard"` |
|
|
| `location` | str | `"Remote"` or `"City, ST"` |
|
|
| `is_remote` | bool | `True` if remote |
|
|
| `salary` | str | Salary string or `""` |
|
|
| `description` | str | Full description text or `""` |
|
|
| `date_found` | str | ISO 8601 UTC datetime |
|
|
|
|
### Deduplication
|
|
|
|
`discover.py` deduplicates by `url` before inserting into the database. If a job with the same URL already exists, it is silently skipped. You do not need to handle deduplication inside your scraper.
|
|
|
|
### Rate limiting
|
|
|
|
Be a good citizen:
|
|
- Add a `time.sleep(0.5)` between paginated requests
|
|
- Respect `Retry-After` headers
|
|
- Do not scrape faster than a human browsing the site
|
|
- If the site provides an official API, prefer that over scraping HTML
|
|
|
|
### Credentials
|
|
|
|
If your scraper requires API keys or credentials:
|
|
- Create `config/myboard.yaml.example` as a template
|
|
- Create `config/myboard.yaml` (gitignored) for live credentials
|
|
- Read it in your scraper with `yaml.safe_load(open("config/myboard.yaml"))`
|
|
- Document the credential setup in comments at the top of your module
|
|
|
|
---
|
|
|
|
## Step 2 — Register the scraper
|
|
|
|
Open `scripts/discover.py` and add your scraper to the `CUSTOM_SCRAPERS` dict:
|
|
|
|
```python
|
|
from scripts.custom_boards import adzuna, theladders, craigslist, myboard
|
|
|
|
CUSTOM_SCRAPERS = {
|
|
"adzuna": adzuna.scrape,
|
|
"theladders": theladders.scrape,
|
|
"craigslist": craigslist.scrape,
|
|
"myboard": myboard.scrape, # add this line
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Step 3 — Activate in a search profile
|
|
|
|
Open `config/search_profiles.yaml` and add `myboard` to `custom_boards` in any profile:
|
|
|
|
```yaml
|
|
profiles:
|
|
- name: cs_leadership
|
|
boards:
|
|
- linkedin
|
|
- indeed
|
|
custom_boards:
|
|
- adzuna
|
|
- myboard # add this line
|
|
titles:
|
|
- Customer Success Manager
|
|
locations:
|
|
- Remote
|
|
```
|
|
|
|
---
|
|
|
|
## Step 4 — Write a test
|
|
|
|
Create `tests/test_myboard.py`. Mock the HTTP call to avoid hitting the live API during tests:
|
|
|
|
```python
|
|
# tests/test_myboard.py
|
|
|
|
from unittest.mock import patch
|
|
from scripts.custom_boards.myboard import scrape
|
|
|
|
MOCK_RESPONSE = {
|
|
"results": [
|
|
{
|
|
"title": "Customer Success Manager",
|
|
"company": "Acme Corp",
|
|
"url": "https://myboard.com/jobs/12345",
|
|
"location": "Remote",
|
|
"salary": "$80,000 - $100,000",
|
|
"description": "We are looking for a CSM...",
|
|
}
|
|
]
|
|
}
|
|
|
|
def test_scrape_returns_correct_shape():
|
|
profile = {
|
|
"titles": ["Customer Success Manager"],
|
|
"locations": ["Remote"],
|
|
"results_per_board": 10,
|
|
"hours_old": 240,
|
|
}
|
|
|
|
with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
|
|
mock_get.return_value.ok = True
|
|
mock_get.return_value.raise_for_status = lambda: None
|
|
mock_get.return_value.json.return_value = MOCK_RESPONSE
|
|
|
|
jobs = scrape(profile, db_path="nonexistent.db")
|
|
|
|
assert len(jobs) == 1
|
|
job = jobs[0]
|
|
|
|
# Required fields
|
|
assert "title" in job
|
|
assert "company" in job
|
|
assert "url" in job
|
|
assert "source" in job
|
|
assert "location" in job
|
|
assert "is_remote" in job
|
|
assert "salary" in job
|
|
assert "description" in job
|
|
assert "date_found" in job
|
|
|
|
assert job["source"] == "myboard"
|
|
assert job["title"] == "Customer Success Manager"
|
|
assert job["url"] == "https://myboard.com/jobs/12345"
|
|
|
|
|
|
def test_scrape_handles_http_error_gracefully():
|
|
profile = {
|
|
"titles": ["Customer Success Manager"],
|
|
"locations": ["Remote"],
|
|
"results_per_board": 10,
|
|
"hours_old": 240,
|
|
}
|
|
|
|
with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
|
|
mock_get.side_effect = Exception("Connection refused")
|
|
|
|
jobs = scrape(profile, db_path="nonexistent.db")
|
|
|
|
assert jobs == []
|
|
```
|
|
|
|
---
|
|
|
|
## Existing Scrapers as Reference
|
|
|
|
| Scraper | Notes |
|
|
|---------|-------|
|
|
| `scripts/custom_boards/adzuna.py` | REST API with `app_id` + `app_key` authentication |
|
|
| `scripts/custom_boards/theladders.py` | SSR scraper using `curl_cffi` to parse `__NEXT_DATA__` JSON embedded in the page |
|
|
| `scripts/custom_boards/craigslist.py` | RSS feed scraper |
|