Compare commits

...

7 commits

Author SHA1 Message Date
94fd2ea332 feat: add Jobgether recruiter framing to cover letter generation
Some checks failed
CI / test (push) Failing after 32s
When source == "jobgether", build_prompt() injects a recruiter context
note directing the LLM to address the Jobgether recruiter using
"Your client [at {company}] will appreciate..." framing rather than
addressing the employer directly. generate() and task_runner both
thread the is_jobgether flag through automatically.
2026-03-15 09:45:51 -07:00
e6257249cb feat: add Jobgether URL detection and scraper to scrape_url.py 2026-03-15 09:45:50 -07:00
4028667c62 feat: filter Jobgether listings via blocklist 2026-03-15 09:45:50 -07:00
a005397d5d docs: update spec — Jobgether discovery scraper not viable (Cloudflare + robots.txt) 2026-03-15 09:45:50 -07:00
17f7baae3c docs: add Jobgether integration implementation plan 2026-03-15 09:45:50 -07:00
ef9cb29518 docs: add cover letter recruiter framing to Jobgether spec 2026-03-15 09:45:50 -07:00
fe2f27c2a2 docs: add Jobgether integration design spec 2026-03-15 09:45:50 -07:00
10 changed files with 1050 additions and 3 deletions

View file

@ -3,7 +3,8 @@
# Company name blocklist — partial case-insensitive match on the company field.
# e.g. "Amazon" blocks any listing where company contains "amazon".
companies: []
companies:
- jobgether
# Industry/content blocklist — blocked if company name OR job description contains any keyword.
# Use this for industries you will never work in regardless of company.

View file

@ -0,0 +1,700 @@
# Jobgether Integration Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Filter Jobgether listings out of all other scrapers, add a dedicated Jobgether scraper and URL scraper (Playwright-based), and add recruiter-aware cover letter framing for Jobgether jobs.
**Architecture:** Blocklist config handles filtering with zero code changes. A new `_scrape_jobgether()` in `scrape_url.py` handles manual URL imports via Playwright with URL slug fallback. A new `scripts/custom_boards/jobgether.py` handles discovery. Cover letter framing is an `is_jobgether` flag threaded from `task_runner.py``generate()``build_prompt()`.
**Tech Stack:** Python, Playwright (already installed), SQLite, PyTest, YAML config
**Spec:** `/Library/Development/CircuitForge/peregrine/docs/superpowers/specs/2026-03-15-jobgether-integration-design.md`
---
## Worktree Setup
- [ ] **Create worktree for this feature**
```bash
cd /Library/Development/CircuitForge/peregrine
git worktree add .worktrees/jobgether-integration -b feature/jobgether-integration
```
All implementation work happens in `/Library/Development/CircuitForge/peregrine/.worktrees/jobgether-integration/`.
---
## Chunk 1: Blocklist filter + scrape_url.py
### Task 1: Add Jobgether to blocklist
**Files:**
- Modify: `/Library/Development/CircuitForge/peregrine/config/blocklist.yaml`
- [ ] **Step 1: Edit blocklist.yaml**
```yaml
companies:
- jobgether
```
- [ ] **Step 2: Verify the existing `_is_blocklisted` test passes (or write one)**
Check `/Library/Development/CircuitForge/peregrine/tests/test_discover.py` for existing blocklist tests. If none cover company matching, add:
```python
def test_is_blocklisted_jobgether():
from scripts.discover import _is_blocklisted
blocklist = {"companies": ["jobgether"], "industries": [], "locations": []}
assert _is_blocklisted({"company": "Jobgether", "location": "", "description": ""}, blocklist)
assert _is_blocklisted({"company": "jobgether inc", "location": "", "description": ""}, blocklist)
assert not _is_blocklisted({"company": "Acme Corp", "location": "", "description": ""}, blocklist)
```
Run: `conda run -n job-seeker python -m pytest tests/test_discover.py -v -k "blocklist"`
Expected: PASS
- [ ] **Step 3: Commit**
```bash
git add config/blocklist.yaml tests/test_discover.py
git commit -m "feat: filter Jobgether listings via blocklist"
```
---
### Task 2: Add Jobgether detection to scrape_url.py
**Files:**
- Modify: `/Library/Development/CircuitForge/peregrine/scripts/scrape_url.py`
- Modify: `/Library/Development/CircuitForge/peregrine/tests/test_scrape_url.py`
- [ ] **Step 1: Write failing tests**
In `/Library/Development/CircuitForge/peregrine/tests/test_scrape_url.py`, add:
```python
def test_detect_board_jobgether():
from scripts.scrape_url import _detect_board
assert _detect_board("https://jobgether.com/offer/69b42d9d24d79271ee0618e8-csm---resware") == "jobgether"
assert _detect_board("https://www.jobgether.com/offer/abc-role---company") == "jobgether"
def test_jobgether_slug_company_extraction():
from scripts.scrape_url import _company_from_jobgether_url
assert _company_from_jobgether_url(
"https://jobgether.com/offer/69b42d9d24d79271ee0618e8-customer-success-manager---resware"
) == "Resware"
assert _company_from_jobgether_url(
"https://jobgether.com/offer/abc123-director-of-cs---acme-corp"
) == "Acme Corp"
assert _company_from_jobgether_url(
"https://jobgether.com/offer/abc123-no-separator-here"
) == ""
def test_scrape_jobgether_no_playwright(tmp_path):
"""When Playwright is unavailable, _scrape_jobgether falls back to URL slug for company."""
# Patch playwright.sync_api to None in sys.modules so the local import inside
# _scrape_jobgether raises ImportError at call time (local imports run at call time,
# not at module load time — so no reload needed).
import sys
import unittest.mock as mock
url = "https://jobgether.com/offer/69b42d9d24d79271ee0618e8-customer-success-manager---resware"
with mock.patch.dict(sys.modules, {"playwright": None, "playwright.sync_api": None}):
from scripts.scrape_url import _scrape_jobgether
result = _scrape_jobgether(url)
assert result.get("company") == "Resware"
assert result.get("source") == "jobgether"
```
Run: `conda run -n job-seeker python -m pytest tests/test_scrape_url.py::test_detect_board_jobgether tests/test_scrape_url.py::test_jobgether_slug_company_extraction tests/test_scrape_url.py::test_scrape_jobgether_no_playwright -v`
Expected: FAIL (functions not yet defined)
- [ ] **Step 2: Add `_company_from_jobgether_url()` to scrape_url.py**
Add after the `_STRIP_PARAMS` block (around line 34):
```python
def _company_from_jobgether_url(url: str) -> str:
"""Extract company name from Jobgether offer URL slug.
Slug format: /offer/{24-hex-hash}-{title-slug}---{company-slug}
Triple-dash separator delimits title from company.
Returns title-cased company name, or "" if pattern not found.
"""
m = re.search(r"---([^/?]+)$", urlparse(url).path)
if not m:
print(f"[scrape_url] Jobgether URL slug: no company separator found in {url}")
return ""
return m.group(1).replace("-", " ").title()
```
- [ ] **Step 3: Add `"jobgether"` branch to `_detect_board()`**
In `/Library/Development/CircuitForge/peregrine/scripts/scrape_url.py`, modify `_detect_board()` (add before `return "generic"`):
```python
if "jobgether.com" in url_lower:
return "jobgether"
```
- [ ] **Step 4: Add `_scrape_jobgether()` function**
Add after `_scrape_glassdoor()` (around line 137):
```python
def _scrape_jobgether(url: str) -> dict:
"""Scrape a Jobgether offer page using Playwright to bypass 403.
Falls back to URL slug for company name when Playwright is unavailable.
Does not use requests — no raise_for_status().
"""
try:
from playwright.sync_api import sync_playwright
except ImportError:
company = _company_from_jobgether_url(url)
if company:
print(f"[scrape_url] Jobgether: Playwright not installed, using slug fallback → {company}")
return {"company": company, "source": "jobgether"} if company else {}
try:
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
try:
ctx = browser.new_context(user_agent=_HEADERS["User-Agent"])
page = ctx.new_page()
page.goto(url, timeout=30_000)
page.wait_for_load_state("networkidle", timeout=20_000)
result = page.evaluate("""() => {
const title = document.querySelector('h1')?.textContent?.trim() || '';
const company = document.querySelector('[class*="company"], [class*="employer"], [data-testid*="company"]')
?.textContent?.trim() || '';
const location = document.querySelector('[class*="location"], [data-testid*="location"]')
?.textContent?.trim() || '';
const desc = document.querySelector('[class*="description"], [class*="job-desc"], article')
?.innerText?.trim() || '';
return { title, company, location, description: desc };
}""")
finally:
browser.close()
# Fall back to slug for company if DOM extraction missed it
if not result.get("company"):
result["company"] = _company_from_jobgether_url(url)
result["source"] = "jobgether"
return {k: v for k, v in result.items() if v}
except Exception as exc:
print(f"[scrape_url] Jobgether Playwright error for {url}: {exc}")
# Last resort: slug fallback
company = _company_from_jobgether_url(url)
return {"company": company, "source": "jobgether"} if company else {}
```
> ⚠️ **The CSS selectors in the `page.evaluate()` call are placeholders.** Before committing, inspect `https://jobgether.com/offer/` in a browser to find the actual class names for title, company, location, and description. Update the selectors accordingly.
- [ ] **Step 5: Add dispatch branch in `scrape_job_url()`**
In the `if board == "linkedin":` dispatch chain (around line 208), add before the `else`:
```python
elif board == "jobgether":
fields = _scrape_jobgether(url)
```
- [ ] **Step 6: Run tests to verify they pass**
Run: `conda run -n job-seeker python -m pytest tests/test_scrape_url.py -v`
Expected: All PASS (including pre-existing tests)
- [ ] **Step 7: Commit**
```bash
git add scripts/scrape_url.py tests/test_scrape_url.py
git commit -m "feat: add Jobgether URL detection and scraper to scrape_url.py"
```
---
## Chunk 2: Jobgether custom board scraper
> ⚠️ **Pre-condition:** Before writing the scraper, inspect `https://jobgether.com/remote-jobs` live to determine the actual URL/filter param format and DOM card selectors. Use the Playwright MCP browser tool or Chrome devtools. Record: (1) the query param for job title search, (2) the job card CSS selectors for title, company, URL, location, salary.
### Task 3: Inspect Jobgether search live
**Files:** None (research step)
- [ ] **Step 1: Navigate to Jobgether remote jobs and inspect search params**
Using browser devtools or Playwright network capture, navigate to `https://jobgether.com/remote-jobs`, search for "Customer Success Manager", and capture:
- The resulting URL (query params)
- Network requests (XHR/fetch) if the page uses API calls
- CSS selectors for job card elements
Record findings here before proceeding.
- [ ] **Step 2: Test a Playwright page.evaluate() extraction manually**
```python
# Run interactively to validate selectors
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False) # headless=False to see the page
page = browser.new_page()
page.goto("https://jobgether.com/remote-jobs")
page.wait_for_load_state("networkidle")
# Test your selectors here
cards = page.query_selector_all("[YOUR_CARD_SELECTOR]")
print(len(cards))
browser.close()
```
---
### Task 4: Write jobgether.py scraper
**Files:**
- Create: `/Library/Development/CircuitForge/peregrine/scripts/custom_boards/jobgether.py`
- Modify: `/Library/Development/CircuitForge/peregrine/tests/test_discover.py` (or create `tests/test_jobgether.py`)
- [ ] **Step 1: Write failing test**
In `/Library/Development/CircuitForge/peregrine/tests/test_discover.py` (or a new `tests/test_jobgether.py`):
```python
def test_jobgether_scraper_returns_empty_on_missing_playwright(monkeypatch):
"""Graceful fallback when Playwright is unavailable."""
import scripts.custom_boards.jobgether as jg
monkeypatch.setattr("scripts.custom_boards.jobgether.sync_playwright", None)
result = jg.scrape({"titles": ["Customer Success Manager"]}, "Remote", results_wanted=5)
assert result == []
def test_jobgether_scraper_respects_results_wanted(monkeypatch):
"""Scraper caps results at results_wanted."""
import scripts.custom_boards.jobgether as jg
fake_jobs = [
{"title": f"CSM {i}", "href": f"/offer/abc{i}-csm---acme", "company": f"Acme {i}",
"location": "Remote", "is_remote": True, "salary": ""}
for i in range(20)
]
class FakePage:
def goto(self, *a, **kw): pass
def wait_for_load_state(self, *a, **kw): pass
def evaluate(self, _): return fake_jobs
class FakeCtx:
def new_page(self): return FakePage()
class FakeBrowser:
def new_context(self, **kw): return FakeCtx()
def close(self): pass
class FakeChromium:
def launch(self, **kw): return FakeBrowser()
class FakeP:
chromium = FakeChromium()
def __enter__(self): return self
def __exit__(self, *a): pass
monkeypatch.setattr("scripts.custom_boards.jobgether.sync_playwright", lambda: FakeP())
result = jg.scrape({"titles": ["CSM"]}, "Remote", results_wanted=5)
assert len(result) <= 5
```
Run: `conda run -n job-seeker python -m pytest tests/ -v -k "jobgether"`
Expected: FAIL (module not found)
- [ ] **Step 2: Create `scripts/custom_boards/jobgether.py`**
```python
"""Jobgether scraper — Playwright-based (requires chromium installed).
Jobgether (jobgether.com) is a remote-work job aggregator. It blocks plain
requests with 403, so we use Playwright to render the page and extract cards.
Install Playwright: conda run -n job-seeker pip install playwright &&
conda run -n job-seeker python -m playwright install chromium
Returns a list of dicts compatible with scripts.db.insert_job().
"""
from __future__ import annotations
import re
import time
from typing import Any
_BASE = "https://jobgether.com"
_SEARCH_PATH = "/remote-jobs"
# TODO: Replace with confirmed query param key after live inspection (Task 3)
_QUERY_PARAM = "search"
# Module-level import so tests can monkeypatch scripts.custom_boards.jobgether.sync_playwright
try:
from playwright.sync_api import sync_playwright
except ImportError:
sync_playwright = None
def scrape(profile: dict, location: str, results_wanted: int = 50) -> list[dict]:
"""
Scrape job listings from Jobgether using Playwright.
Args:
profile: Search profile dict (uses 'titles').
location: Location string — Jobgether is remote-focused; location used
only if the site exposes a location filter.
results_wanted: Maximum results to return across all titles.
Returns:
List of job dicts with keys: title, company, url, source, location,
is_remote, salary, description.
"""
if sync_playwright is None:
print(
" [jobgether] playwright not installed.\n"
" Install: conda run -n job-seeker pip install playwright && "
"conda run -n job-seeker python -m playwright install chromium"
)
return []
results: list[dict] = []
seen_urls: set[str] = set()
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
ctx = browser.new_context(
user_agent=(
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
)
)
page = ctx.new_page()
for title in profile.get("titles", []):
if len(results) >= results_wanted:
break
# TODO: Confirm URL param format from live inspection (Task 3)
url = f"{_BASE}{_SEARCH_PATH}?{_QUERY_PARAM}={title.replace(' ', '+')}"
try:
page.goto(url, timeout=30_000)
page.wait_for_load_state("networkidle", timeout=20_000)
except Exception as exc:
print(f" [jobgether] Page load error for '{title}': {exc}")
continue
# TODO: Replace JS selector with confirmed card selector from Task 3
try:
raw_jobs: list[dict[str, Any]] = page.evaluate(_extract_jobs_js())
except Exception as exc:
print(f" [jobgether] JS extract error for '{title}': {exc}")
continue
if not raw_jobs:
print(f" [jobgether] No cards found for '{title}' — selector may need updating")
continue
for job in raw_jobs:
href = job.get("href", "")
if not href:
continue
full_url = _BASE + href if href.startswith("/") else href
if full_url in seen_urls:
continue
seen_urls.add(full_url)
results.append({
"title": job.get("title", ""),
"company": job.get("company", ""),
"url": full_url,
"source": "jobgether",
"location": job.get("location") or "Remote",
"is_remote": True, # Jobgether is remote-focused
"salary": job.get("salary") or "",
"description": "", # not in card view; scrape_url fills in
})
if len(results) >= results_wanted:
break
time.sleep(1) # polite pacing between titles
browser.close()
return results[:results_wanted]
def _extract_jobs_js() -> str:
"""JS to run in page context — extracts job data from rendered card elements.
TODO: Replace selectors with confirmed values from Task 3 live inspection.
"""
return """() => {
// TODO: replace '[class*=job-card]' with confirmed card selector
const cards = document.querySelectorAll('[class*="job-card"], [data-testid*="job"]');
return Array.from(cards).map(card => {
// TODO: replace these selectors with confirmed values
const titleEl = card.querySelector('h2, h3, [class*="title"]');
const companyEl = card.querySelector('[class*="company"], [class*="employer"]');
const linkEl = card.querySelector('a');
const salaryEl = card.querySelector('[class*="salary"]');
const locationEl = card.querySelector('[class*="location"]');
return {
title: titleEl ? titleEl.textContent.trim() : null,
company: companyEl ? companyEl.textContent.trim() : null,
href: linkEl ? linkEl.getAttribute('href') : null,
salary: salaryEl ? salaryEl.textContent.trim() : null,
location: locationEl ? locationEl.textContent.trim() : null,
is_remote: true,
};
}).filter(j => j.title && j.href);
}"""
```
- [ ] **Step 3: Run tests**
Run: `conda run -n job-seeker python -m pytest tests/ -v -k "jobgether"`
Expected: PASS
- [ ] **Step 4: Commit**
```bash
git add scripts/custom_boards/jobgether.py tests/test_discover.py
git commit -m "feat: add Jobgether custom board scraper (selectors pending live inspection)"
```
---
## Chunk 3: Registration, config, cover letter framing
### Task 5: Register scraper in discover.py + update search_profiles.yaml
**Files:**
- Modify: `/Library/Development/CircuitForge/peregrine/scripts/discover.py`
- Modify: `/Library/Development/CircuitForge/peregrine/config/search_profiles.yaml`
- Modify: `/Library/Development/CircuitForge/peregrine/config/search_profiles.yaml.example` (if it exists)
- [ ] **Step 1: Add import to discover.py import block (lines 2022)**
`jobgether.py` absorbs the Playwright `ImportError` internally (module-level `try/except`), so it always imports successfully. Match the existing pattern exactly:
```python
from scripts.custom_boards import jobgether as _jobgether
```
- [ ] **Step 2: Add to CUSTOM_SCRAPERS dict literal (lines 3034)**
```python
CUSTOM_SCRAPERS: dict[str, object] = {
"adzuna": _adzuna.scrape,
"theladders": _theladders.scrape,
"craigslist": _craigslist.scrape,
"jobgether": _jobgether.scrape,
}
```
When Playwright is absent, `_jobgether.scrape()` returns `[]` gracefully — no special guard needed in `discover.py`.
- [ ] **Step 3: Add `jobgether` to remote-eligible profiles in search_profiles.yaml**
Add `- jobgether` to the `custom_boards` list for every profile that has `Remote` in its `locations`. Based on the current file, that means: `cs_leadership`, `music_industry`, `animal_welfare`, `education`. Do NOT add it to `default` (locations: San Francisco CA only).
- [ ] **Step 4: Run discover tests**
Run: `conda run -n job-seeker python -m pytest tests/test_discover.py -v`
Expected: All PASS
- [ ] **Step 5: Commit**
```bash
git add scripts/discover.py config/search_profiles.yaml
git commit -m "feat: register Jobgether scraper and add to remote search profiles"
```
---
### Task 6: Cover letter recruiter framing
**Files:**
- Modify: `/Library/Development/CircuitForge/peregrine/scripts/generate_cover_letter.py`
- Modify: `/Library/Development/CircuitForge/peregrine/scripts/task_runner.py`
- Modify: `/Library/Development/CircuitForge/peregrine/tests/test_match.py` or add `tests/test_cover_letter.py`
- [ ] **Step 1: Write failing test**
Create or add to `/Library/Development/CircuitForge/peregrine/tests/test_cover_letter.py`:
```python
def test_build_prompt_jobgether_framing_unknown_company():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Jobgether",
description="CSM role at an undisclosed company.",
examples=[],
is_jobgether=True,
)
assert "Your client" in prompt
assert "recruiter" in prompt.lower() or "jobgether" in prompt.lower()
def test_build_prompt_jobgether_framing_known_company():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Resware",
description="CSM role at Resware.",
examples=[],
is_jobgether=True,
)
assert "Your client at Resware" in prompt
def test_build_prompt_no_jobgether_framing_by_default():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Acme Corp",
description="CSM role.",
examples=[],
)
assert "Your client" not in prompt
```
Run: `conda run -n job-seeker python -m pytest tests/test_cover_letter.py -v`
Expected: FAIL
- [ ] **Step 2: Add `is_jobgether` to `build_prompt()` in generate_cover_letter.py**
Modify the `build_prompt()` signature (line 186):
```python
def build_prompt(
title: str,
company: str,
description: str,
examples: list[dict],
mission_hint: str | None = None,
is_jobgether: bool = False,
) -> str:
```
Add the recruiter hint block after the `mission_hint` block (after line 203):
```python
if is_jobgether:
if company and company.lower() != "jobgether":
recruiter_note = (
f"🤝 Recruiter context: This listing is posted by Jobgether on behalf of "
f"{company}. Address the cover letter to the Jobgether recruiter, not directly "
f"to the hiring company. Use framing like 'Your client at {company} will "
f"appreciate...' rather than addressing {company} directly. The role "
f"requirements are those of the actual employer."
)
else:
recruiter_note = (
"🤝 Recruiter context: This listing is posted by Jobgether on behalf of an "
"undisclosed employer. Address the cover letter to the Jobgether recruiter. "
"Use framing like 'Your client will appreciate...' rather than addressing "
"the company directly."
)
parts.append(f"{recruiter_note}\n")
```
- [ ] **Step 3: Add `is_jobgether` to `generate()` signature**
Modify `generate()` (line 233):
```python
def generate(
title: str,
company: str,
description: str = "",
previous_result: str = "",
feedback: str = "",
is_jobgether: bool = False,
_router=None,
) -> str:
```
Pass it through to `build_prompt()` (line 254):
```python
prompt = build_prompt(title, company, description, examples,
mission_hint=mission_hint, is_jobgether=is_jobgether)
```
- [ ] **Step 4: Pass `is_jobgether` from task_runner.py**
In `/Library/Development/CircuitForge/peregrine/scripts/task_runner.py`, modify the `generate()` call inside the `cover_letter` task block (`elif task_type == "cover_letter":` starts at line 152; the `generate()` call is at ~line 156):
```python
elif task_type == "cover_letter":
import json as _json
p = _json.loads(params or "{}")
from scripts.generate_cover_letter import generate
result = generate(
job.get("title", ""),
job.get("company", ""),
job.get("description", ""),
previous_result=p.get("previous_result", ""),
feedback=p.get("feedback", ""),
is_jobgether=job.get("source") == "jobgether",
)
update_cover_letter(db_path, job_id, result)
```
- [ ] **Step 5: Run tests**
Run: `conda run -n job-seeker python -m pytest tests/test_cover_letter.py -v`
Expected: All PASS
- [ ] **Step 6: Run full test suite**
Run: `conda run -n job-seeker python -m pytest tests/ -v`
Expected: All PASS
- [ ] **Step 7: Commit**
```bash
git add scripts/generate_cover_letter.py scripts/task_runner.py tests/test_cover_letter.py
git commit -m "feat: add Jobgether recruiter framing to cover letter generation"
```
---
## Final: Merge
- [ ] **Merge worktree branch to main**
```bash
cd /Library/Development/CircuitForge/peregrine
git merge feature/jobgether-integration
git worktree remove .worktrees/jobgether-integration
```
- [ ] **Push to remote**
```bash
git push origin main
```
---
## Manual verification after merge
1. Add the stuck Jobgether manual import (job 2286) — delete the old stuck row and re-add the URL via "Add Jobs by URL" in the Home page. Verify the scraper resolves company = "Resware".
2. Run a short discovery (`discover.py` with `results_per_board: 5`) and confirm no `company="Jobgether"` rows appear in `staging.db`.
3. Generate a cover letter for a Jobgether-sourced job and confirm recruiter framing appears.

View file

@ -0,0 +1,173 @@
# Jobgether Integration Design
**Date:** 2026-03-15
**Status:** Approved
**Scope:** Peregrine — discovery pipeline + manual URL import
---
## Problem
Jobgether is a job aggregator that posts listings on LinkedIn and other boards with `company = "Jobgether"` rather than the actual employer. This causes two problems:
1. **Misleading listings** — Jobs appear to be at "Jobgether" rather than the real hiring company. Meg sees "Jobgether" as employer throughout the pipeline (Job Review, cover letters, company research).
2. **Broken manual import** — Direct `jobgether.com` URLs return HTTP 403 when scraped with plain `requests`, leaving jobs stuck as `title = "Importing…"`.
**Evidence from DB:** 29+ Jobgether-sourced LinkedIn listings with `company = "Jobgether"`. Actual employer is intentionally withheld by Jobgether's business model ("on behalf of a partner company").
---
## Decision: Option A — Filter + Dedicated Scraper
Drop Jobgether listings from other scrapers entirely and replace with a direct Jobgether scraper that retrieves accurate company names. Existing Jobgether-via-LinkedIn listings in the DB are left as-is for manual review/rejection.
**Why not Option B (follow-through):** LinkedIn→Jobgether→employer is a two-hop chain where the employer is deliberately hidden. Jobgether blocks `requests`. Not worth the complexity for unreliable data.
---
## Components
### 1. Jobgether company filter — `config/blocklist.yaml`
Add `"jobgether"` to the `companies` list in `config/blocklist.yaml`. The existing `_is_blocklisted()` function in `discover.py` already performs a partial case-insensitive match on the company field and applies to all scrapers (JobSpy boards + all custom boards). No code change required.
```yaml
companies:
- jobgether
```
This is the correct mechanism — it is user-visible, config-driven, and applies uniformly. Log output already reports blocklisted jobs per run.
### 2. URL handling in `scrape_url.py`
Three changes required:
**a) `_detect_board()`** — add `"jobgether"` branch returning `"jobgether"` when `"jobgether.com"` is in the URL. Must be added before the `return "generic"` fallback.
**b) dispatch block in `scrape_job_url()`** — add `elif board == "jobgether": fields = _scrape_jobgether(url)` to the `if/elif` chain (lines 208215). Without this, the new `_detect_board()` branch silently falls through to `_scrape_generic()`.
**c) `_scrape_jobgether(url)`** — Playwright-based scraper to bypass 403. Extracts:
- `title` — job title from page heading
- `company` — actual employer name (visible on Jobgether offer pages)
- `location` — remote/location info
- `description` — full job description
- `source = "jobgether"`
Playwright errors (`playwright.sync_api.Error`, `TimeoutError`) are not subclasses of `requests.RequestException` but are caught by the existing broad `except Exception` handler in `scrape_job_url()` — no changes needed to the error handling block.
**URL slug fallback for company name (manual import path only):** Jobgether offer URLs follow the pattern:
```
https://jobgether.com/offer/{24-hex-hash}-{title-slug}---{company-slug}
```
When Playwright is unavailable, parse `company-slug` using:
```python
m = re.search(r'---([^/?]+)$', parsed_path)
company = m.group(1).replace("-", " ").title() if m else ""
```
Example: `/offer/69b42d9d24d79271ee0618e8-customer-success-manager---resware``"Resware"`.
This fallback is scoped to `_scrape_jobgether()` in `scrape_url.py` only; the discovery scraper always gets company name from the rendered DOM. `_scrape_jobgether()` does not make any `requests` calls — there is no `raise_for_status()` — so the `requests.RequestException` handler in `scrape_job_url()` is irrelevant to this path; only the broad `except Exception` applies.
**Pre-implementation checkpoint:** Confirm that Jobgether offer URLs have no tracking query params beyond UTM (already covered by `_STRIP_PARAMS`). No `canonicalize_url()` changes are expected but verify before implementation.
### 3. `scripts/custom_boards/jobgether.py`
Playwright-based search scraper following the same interface as `theladders.py`:
```python
def scrape(profile: dict, location: str, results_wanted: int = 50) -> list[dict]
```
- Base URL: `https://jobgether.com/remote-jobs`
- Search strategy: iterate over `profile["titles"]`, apply search/filter params
- **Pre-condition — do not begin implementation of this file until live URL inspection is complete.** Use browser dev tools or a Playwright `page.on("request")` capture to determine the actual query parameter format for title/location filtering. Jobgether may use URL query params, path segments, or JS-driven state — this cannot be assumed from the URL alone.
- Extraction: job cards from rendered DOM (Playwright `page.evaluate()`)
- Returns standard job dicts: `title, company, url, source, location, is_remote, salary, description`
- `source = "jobgether"`
- Graceful `ImportError` handling if Playwright not installed (same pattern as `theladders.py`)
- Polite pacing: 1s sleep between title iterations
- Company name comes from DOM; URL slug parse is not needed in this path
### 4. Registration + config
**`discover.py` — import block (lines 2022):**
```python
from scripts.custom_boards import jobgether as _jobgether
```
**`discover.py` — `CUSTOM_SCRAPERS` dict literal (lines 3034):**
```python
CUSTOM_SCRAPERS: dict[str, object] = {
"adzuna": _adzuna.scrape,
"theladders": _theladders.scrape,
"craigslist": _craigslist.scrape,
"jobgether": _jobgether.scrape, # ← add this line
}
```
**`config/search_profiles.yaml` (and `.example`):**
Add `jobgether` to `custom_boards` for any profile that includes `Remote` in its `locations` list. Jobgether is a remote-work-focused aggregator; adding it to location-specific non-remote profiles is not useful. Do not add a `custom_boards` key to profiles that don't already have one unless they are remote-eligible.
```yaml
custom_boards:
- jobgether
```
---
## Data Flow
```
discover.py
├── JobSpy boards → _is_blocklisted(company="jobgether") → drop → DB insert
├── custom: adzuna → _is_blocklisted(company="jobgether") → drop → DB insert
├── custom: theladders → _is_blocklisted(company="jobgether") → drop → DB insert
├── custom: craigslist → _is_blocklisted(company="jobgether") → drop → DB insert
└── custom: jobgether → (company = real employer, never "jobgether") → DB insert
scrape_url.py
└── jobgether.com URL → _detect_board() = "jobgether"
→ _scrape_jobgether()
├── Playwright available → full job fields from page
└── Playwright unavailable → company from URL slug only
```
---
## Implementation Notes
- **Slug fallback None-guard:** The regex `r'---([^/?]+)$'` returns a wrong value (not `None`) if the URL slug doesn't follow the expected format. Add a logged warning and return `""` rather than title-casing garbage.
- **Import guard in `discover.py`:** Wrap the `jobgether` import with `try/except ImportError`, setting `_jobgether = None`, and gate the `CUSTOM_SCRAPERS` registration with `if _jobgether is not None`. This ensures the graceful ImportError in `jobgether.py` (for missing Playwright) propagates cleanly to the caller rather than crashing discovery.
### 5. Cover letter recruiter framing — `scripts/generate_cover_letter.py`
When `source = "jobgether"`, inject a system hint that shifts the cover letter addressee from the employer to the Jobgether recruiter. Use Policy A: recruiter framing applies for all Jobgether-sourced jobs regardless of whether the real company name was resolved.
- If company is known (e.g. "Resware"): *"Your client at Resware will appreciate..."*
- If company is unknown: *"Your client will appreciate..."*
The real company name is always stored in the DB as resolved by the scraper — this is internal knowledge only. The framing shift is purely in the generated letter text, not in how the job is stored or displayed.
Implementation: add an `is_jobgether` flag to the cover letter prompt context (same pattern as `mission_hint` injection). Add a conditional block in the system prompt / Para 1 instructions when the flag is true.
---
## Out of Scope
- Retroactively fixing existing `company = "Jobgether"` rows in the DB (left for manual review/rejection)
- Jobgether discovery scraper — **decided against during implementation (2026-03-15)**: Cloudflare Turnstile blocks all headless browsers on all Jobgether pages; `filter-api.jobgether.com` requires auth; `robots.txt` blocks all bots. The email digest → manual URL paste → slug company extraction flow covers the actual use case.
- Jobgether authentication / logged-in scraping
- Pagination
- Dedup between Jobgether and other boards (existing URL dedup handles this)
---
## Files Changed
| File | Change |
|------|--------|
| `config/blocklist.yaml` | Add `"jobgether"` to `companies` list |
| `scripts/discover.py` | Add import + entry in `CUSTOM_SCRAPERS` dict literal |
| `scripts/scrape_url.py` | Add `_detect_board` branch, dispatch branch, `_scrape_jobgether()` |
| `scripts/custom_boards/jobgether.py` | New file — Playwright search scraper |
| `config/search_profiles.yaml` | Add `jobgether` to `custom_boards` |
| `config/search_profiles.yaml.example` | Same |

View file

@ -189,6 +189,7 @@ def build_prompt(
description: str,
examples: list[dict],
mission_hint: str | None = None,
is_jobgether: bool = False,
) -> str:
parts = [SYSTEM_CONTEXT.strip(), ""]
if examples:
@ -202,6 +203,24 @@ def build_prompt(
if mission_hint:
parts.append(f"⭐ Mission alignment note (for Para 3): {mission_hint}\n")
if is_jobgether:
if company and company.lower() != "jobgether":
recruiter_note = (
f"🤝 Recruiter context: This listing is posted by Jobgether on behalf of "
f"{company}. Address the cover letter to the Jobgether recruiter, not directly "
f"to the hiring company. Use framing like 'Your client at {company} will "
f"appreciate...' rather than addressing {company} directly. The role "
f"requirements are those of the actual employer."
)
else:
recruiter_note = (
"🤝 Recruiter context: This listing is posted by Jobgether on behalf of an "
"undisclosed employer. Address the cover letter to the Jobgether recruiter. "
"Use framing like 'Your client will appreciate...' rather than addressing "
"the company directly."
)
parts.append(f"{recruiter_note}\n")
parts.append(f"Now write a new cover letter for:")
parts.append(f" Role: {title}")
parts.append(f" Company: {company}")
@ -236,6 +255,7 @@ def generate(
description: str = "",
previous_result: str = "",
feedback: str = "",
is_jobgether: bool = False,
_router=None,
) -> str:
"""Generate a cover letter and return it as a string.
@ -251,7 +271,8 @@ def generate(
mission_hint = detect_mission_alignment(company, description)
if mission_hint:
print(f"[cover-letter] Mission alignment detected for {company}", file=sys.stderr)
prompt = build_prompt(title, company, description, examples, mission_hint=mission_hint)
prompt = build_prompt(title, company, description, examples,
mission_hint=mission_hint, is_jobgether=is_jobgether)
if previous_result:
prompt += f"\n\n---\nPrevious draft:\n{previous_result}"

View file

@ -33,6 +33,20 @@ _STRIP_PARAMS = {
"eid", "otpToken", "ssid", "fmid",
}
def _company_from_jobgether_url(url: str) -> str:
"""Extract company name from Jobgether offer URL slug.
Slug format: /offer/{24-hex-hash}-{title-slug}---{company-slug}
Triple-dash separator delimits title from company.
Returns title-cased company name, or "" if pattern not found.
"""
m = re.search(r"---([^/?]+)$", urlparse(url).path)
if not m:
print(f"[scrape_url] Jobgether URL slug: no company separator found in {url}")
return ""
return m.group(1).replace("-", " ").title()
_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
@ -51,6 +65,8 @@ def _detect_board(url: str) -> str:
return "indeed"
if "glassdoor.com" in url_lower:
return "glassdoor"
if "jobgether.com" in url_lower:
return "jobgether"
return "generic"
@ -136,6 +152,55 @@ def _scrape_glassdoor(url: str) -> dict:
return {}
def _scrape_jobgether(url: str) -> dict:
"""Scrape a Jobgether offer page using Playwright to bypass 403.
Falls back to URL slug for company name when Playwright is unavailable.
Does not use requests no raise_for_status().
"""
try:
from playwright.sync_api import sync_playwright
except ImportError:
company = _company_from_jobgether_url(url)
if company:
print(f"[scrape_url] Jobgether: Playwright not installed, using slug fallback → {company}")
return {"company": company, "source": "jobgether"} if company else {}
try:
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
try:
ctx = browser.new_context(user_agent=_HEADERS["User-Agent"])
page = ctx.new_page()
page.goto(url, timeout=30_000)
page.wait_for_load_state("networkidle", timeout=20_000)
result = page.evaluate("""() => {
const title = document.querySelector('h1')?.textContent?.trim() || '';
const company = document.querySelector('[class*="company"], [class*="employer"], [data-testid*="company"]')
?.textContent?.trim() || '';
const location = document.querySelector('[class*="location"], [data-testid*="location"]')
?.textContent?.trim() || '';
const desc = document.querySelector('[class*="description"], [class*="job-desc"], article')
?.innerText?.trim() || '';
return { title, company, location, description: desc };
}""")
finally:
browser.close()
# Fall back to slug for company if DOM extraction missed it
if not result.get("company"):
result["company"] = _company_from_jobgether_url(url)
result["source"] = "jobgether"
return {k: v for k, v in result.items() if v}
except Exception as exc:
print(f"[scrape_url] Jobgether Playwright error for {url}: {exc}")
company = _company_from_jobgether_url(url)
return {"company": company, "source": "jobgether"} if company else {}
def _parse_json_ld_or_og(html: str) -> dict:
"""Extract job fields from JSON-LD structured data, then og: meta tags."""
soup = BeautifulSoup(html, "html.parser")
@ -211,6 +276,8 @@ def scrape_job_url(db_path: Path = DEFAULT_DB, job_id: int = None) -> dict:
fields = _scrape_indeed(url)
elif board == "glassdoor":
fields = _scrape_glassdoor(url)
elif board == "jobgether":
fields = _scrape_jobgether(url)
else:
fields = _scrape_generic(url)
except requests.RequestException as exc:

View file

@ -169,6 +169,7 @@ def _run_task(db_path: Path, task_id: int, task_type: str, job_id: int,
job.get("description", ""),
previous_result=p.get("previous_result", ""),
feedback=p.get("feedback", ""),
is_jobgether=job.get("source") == "jobgether",
)
update_cover_letter(db_path, job_id, result)

View file

@ -115,3 +115,41 @@ def test_generate_calls_llm_router():
mock_router.complete.assert_called_once()
assert "Alex Rivera" in result
# ── Jobgether recruiter framing tests ─────────────────────────────────────────
def test_build_prompt_jobgether_framing_unknown_company():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Jobgether",
description="CSM role at an undisclosed company.",
examples=[],
is_jobgether=True,
)
assert "Your client" in prompt
assert "jobgether" in prompt.lower()
def test_build_prompt_jobgether_framing_known_company():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Resware",
description="CSM role at Resware.",
examples=[],
is_jobgether=True,
)
assert "Your client at Resware" in prompt
def test_build_prompt_no_jobgether_framing_by_default():
from scripts.generate_cover_letter import build_prompt
prompt = build_prompt(
title="Customer Success Manager",
company="Acme Corp",
description="CSM role.",
examples=[],
)
assert "Your client" not in prompt

View file

@ -79,10 +79,12 @@ class TestTaskRunnerCoverLetterParams:
"""Invoke _run_task for cover_letter and return captured generate() kwargs."""
captured = {}
def mock_generate(title, company, description="", previous_result="", feedback="", _router=None):
def mock_generate(title, company, description="", previous_result="", feedback="",
is_jobgether=False, _router=None):
captured.update({
"title": title, "company": company,
"previous_result": previous_result, "feedback": feedback,
"is_jobgether": is_jobgether,
})
return "Generated letter"

View file

@ -183,3 +183,14 @@ def test_discover_custom_board_deduplicates(tmp_path):
assert count == 0 # duplicate skipped
assert len(get_jobs_by_status(db_path, "pending")) == 1
# ── Blocklist integration ─────────────────────────────────────────────────────
def test_is_blocklisted_jobgether():
"""_is_blocklisted filters jobs from Jobgether (case-insensitive)."""
from scripts.discover import _is_blocklisted
blocklist = {"companies": ["jobgether"], "industries": [], "locations": []}
assert _is_blocklisted({"company": "Jobgether", "location": "", "description": ""}, blocklist)
assert _is_blocklisted({"company": "jobgether inc", "location": "", "description": ""}, blocklist)
assert not _is_blocklisted({"company": "Acme Corp", "location": "", "description": ""}, blocklist)

View file

@ -133,3 +133,36 @@ def test_scrape_url_graceful_on_http_error(tmp_path):
row = conn.execute("SELECT id FROM jobs WHERE id=?", (job_id,)).fetchone()
conn.close()
assert row is not None
def test_detect_board_jobgether():
from scripts.scrape_url import _detect_board
assert _detect_board("https://jobgether.com/offer/69b42d9d24d79271ee0618e8-csm---resware") == "jobgether"
assert _detect_board("https://www.jobgether.com/offer/abc-role---company") == "jobgether"
def test_jobgether_slug_company_extraction():
from scripts.scrape_url import _company_from_jobgether_url
assert _company_from_jobgether_url(
"https://jobgether.com/offer/69b42d9d24d79271ee0618e8-customer-success-manager---resware"
) == "Resware"
assert _company_from_jobgether_url(
"https://jobgether.com/offer/abc123-director-of-cs---acme-corp"
) == "Acme Corp"
assert _company_from_jobgether_url(
"https://jobgether.com/offer/abc123-no-separator-here"
) == ""
def test_scrape_jobgether_no_playwright(tmp_path):
"""When Playwright is unavailable, _scrape_jobgether falls back to URL slug for company."""
import sys
import unittest.mock as mock
url = "https://jobgether.com/offer/69b42d9d24d79271ee0618e8-customer-success-manager---resware"
with mock.patch.dict(sys.modules, {"playwright": None, "playwright.sync_api": None}):
from scripts.scrape_url import _scrape_jobgether
result = _scrape_jobgether(url)
assert result.get("company") == "Resware"
assert result.get("source") == "jobgether"