docs: add Jobgether non-headless Playwright scraper to backlog
Xvfb-based Playwright can bypass Cloudflare Turnstile on jobgether.com. Live inspection confirmed selectors; deferred pending Xvfb integration.
parent 9c36c578ef
commit 22696b4e50
1 changed file with 55 additions and 0 deletions
@@ -50,6 +50,36 @@ community-contributed and CF-freebie scrapers (free, MIT, in `scripts/plugins/`)
---

## Discovery — Jobgether Non-Headless Scraper

Design doc: `peregrine/docs/superpowers/specs/2026-03-15-jobgether-integration-design.md`

**Background:** Headless Playwright is blocked by Cloudflare Turnstile on all `jobgether.com` pages.
A non-headless Playwright instance backed by `Xvfb` (virtual framebuffer) renders as a real browser and
bypasses Turnstile. Heimdall already has Xvfb available.

**Live-inspection findings (2026-03-15):**
- Search URL: `https://jobgether.com/search-offers?keyword=<query>`
- Job cards: `div.new-opportunity` (one per listing)
- Card URL: `div.new-opportunity > a[href*="/offer/"]` (`href` attr)
- Title: `#offer-body h3`
- Company: `#offer-body p.font-medium`
- Dedup: existing URL-based dedup in `discover.py` covers Jobgether↔other-board overlap
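
The findings above can be gathered into one small module. This is a minimal sketch, not the planned `jobgether.py`: `build_search_url`, `normalize_card`, and `extract_cards` are hypothetical names, and the record shape is an assumption. It also assumes that each `div.new-opportunity` card contains its own `#offer-body` element, which the inspection notes do not state. `extract_cards` expects a Playwright sync-API `Page` that has already loaded the search page.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://jobgether.com"

# Selectors recorded during the 2026-03-15 live inspection.
SELECTORS = {
    "card": "div.new-opportunity",
    "link": 'a[href*="/offer/"]',
    "title": "#offer-body h3",
    "company": "#offer-body p.font-medium",
}


def build_search_url(query: str) -> str:
    """Build the search URL, URL-encoding the keyword."""
    return f"{BASE_URL}/search-offers?{urlencode({'keyword': query})}"


def normalize_card(href: str, title: str, company: str) -> dict:
    """Turn raw card values into a record with an absolute URL (hypothetical shape)."""
    return {
        "url": urljoin(BASE_URL, href),
        "title": title.strip(),
        "company": company.strip(),
    }


def extract_cards(page) -> list[dict]:
    """Read every job card from a Playwright sync-API Page.

    Assumption: title and company are looked up inside each card, and the
    link is matched as a descendant of the card rather than a strict child.
    """
    records = []
    for card in page.locator(SELECTORS["card"]).all():
        link = card.locator(SELECTORS["link"])
        if link.count() == 0:
            continue  # card without an offer link
        href = link.first.get_attribute("href")
        title = card.locator(SELECTORS["title"]).first.inner_text()
        company = card.locator(SELECTORS["company"]).first.inner_text()
        records.append(normalize_card(href, title, company))
    return records
```

Returning absolute URLs matters because the existing dedup in `discover.py` is URL-based; whether it needs further normalization (for example, stripping query strings) is not covered by the notes.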

**Implementation tasks (blocked until Xvfb-Playwright integration is in place):**
- [ ] Add `Xvfb` launch helper to `scripts/custom_boards/` (shared util, or inline in scraper)
- [ ] Implement `scripts/custom_boards/jobgether.py` using `p.chromium.launch(headless=False)` with `DISPLAY=:99`
- [ ] Pre-launch `Xvfb :99 -screen 0 1280x720x24` (or assert `DISPLAY` is already set)
- [ ] Register `jobgether` in `discover.py` `CUSTOM_SCRAPERS` (currently omitted: no viable scraper)
- [ ] Add `jobgether` to `custom_boards` in remote-eligible profiles in `config/search_profiles.yaml`
- [ ] Remove or update the "Jobgether discovery scraper — decided against" note in the design spec
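
The first three tasks could look roughly like the sketch below. It is one possible shape for the launch helper, not a settled design: `xvfb_command`, `ensure_display`, and `fetch_search_html` are hypothetical names, the one-second startup wait is a crude assumption, and the Playwright import is deferred so the helper can be imported on machines without Playwright installed.

```python
import os
import shutil
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display: str = ":99", screen: str = "1280x720x24") -> list[str]:
    """Command line from the task list: Xvfb :99 -screen 0 1280x720x24."""
    return ["Xvfb", display, "-screen", "0", screen]


@contextmanager
def ensure_display(display: str = ":99"):
    """Yield a usable DISPLAY, starting Xvfb only when none is set."""
    existing = os.environ.get("DISPLAY")
    if existing:
        yield existing  # reuse the display that is already configured
        return
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb is not installed and DISPLAY is not set")
    proc = subprocess.Popen(xvfb_command(display))
    os.environ["DISPLAY"] = display
    try:
        time.sleep(1.0)  # crude wait for the X server to come up (assumption)
        yield display
    finally:
        os.environ.pop("DISPLAY", None)
        proc.terminate()
        proc.wait()


def fetch_search_html(url: str) -> str:
    """Load a page in non-headless Chromium; requires the `playwright` package."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with ensure_display(), sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

Reusing an existing `DISPLAY` keeps the helper safe on hosts such as Heimdall where Xvfb may already be running, and it covers the "or assert `DISPLAY` is already set" alternative in the task list.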

**Pre-condition:** Validate the Xvfb approach manually (`headless=False` + `DISPLAY=:99`) before implementing.
The `filter-api.jobgether.com` endpoint still requires auth and `robots.txt` still blocks bots;
confirm that Turnstile acceptance is the only remaining blocker before beginning.

---

## Settings / Data Management

- **Backup / Restore / Teleport** — Settings panel option to export a full config snapshot (user.yaml + all gitignored configs) as a zip, restore from a snapshot, and "teleport" (export + import to a new machine or Docker volume). Useful for migrations, multi-machine setups, and safe wizard testing.
@@ -63,6 +93,31 @@ community-contributed and CF-freebie scrapers (free, MIT, in `scripts/plugins/`)
---

## LinkedIn Import

Shipped in v0.4.0. Ongoing maintenance and known decisions:

- **Selector maintenance** — LinkedIn changes its DOM periodically. When import stops working, update
  CSS selectors in `scripts/linkedin_utils.py` only (all other files import from there). Real `data-section`
  attribute values (as of 2025 DOM): `summary`, `currentPositionsDetails`, `educationsDetails`,
  `certifications`, `posts`, `volunteering`, `publications`, `projects`.

- **Data export zip is the recommended path for full history** — LinkedIn's unauthenticated public profile
  page is server-side degraded: experience titles, past roles, education, and skills are blurred/omitted.
  Only the following are available without login: name, About summary (truncated), current employer name,
  certifications. The "Import from LinkedIn data export zip" expander (Settings → Resume Profile and
  Wizard step 3) is the correct path for full career history. The UI already shows an `ℹ️` callout
  explaining this.

- **LinkedIn OAuth — decided: not viable** — LinkedIn's OAuth API is restricted to approved partner
  programs. Even if approved, it only grants name + email (not career history, experience, or skills).
  This is a deliberate LinkedIn platform restriction, not a technical gap. Do not pursue this path.

- **Selector test harness** (future) — A lightweight test that fetches a known-public LinkedIn profile
  and asserts that at least N fields are non-empty would catch DOM breakage before users report it.
  Low priority until selector breakage becomes a recurring support issue.
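
The assertion half of the proposed selector test harness could be as small as the sketch below. It is illustrative only: `non_empty_fields` and `check_profile` are hypothetical names, the profile is assumed to be a dict keyed by the `data-section` values listed above, and the fetch step is left out because the interface of `scripts/linkedin_utils.py` is not described here.

```python
# data-section values listed in the selector-maintenance note (as of 2025 DOM).
SECTIONS = (
    "summary", "currentPositionsDetails", "educationsDetails", "certifications",
    "posts", "volunteering", "publications", "projects",
)


def _has_content(value) -> bool:
    """Treat None, blank strings, and empty collections as empty."""
    if isinstance(value, str):
        return bool(value.strip())
    return bool(value)


def non_empty_fields(profile: dict) -> list[str]:
    """Names of known sections that came back with content."""
    return [name for name in SECTIONS if _has_content(profile.get(name))]


def check_profile(profile: dict, minimum: int) -> None:
    """Raise AssertionError when fewer than `minimum` sections are non-empty."""
    found = non_empty_fields(profile)
    if len(found) < minimum:
        raise AssertionError(
            f"only {len(found)} of {minimum} required sections non-empty: {found}"
        )
```

A threshold rather than an exact field list keeps the check tolerant of the degraded unauthenticated page, where only a few sections are expected to be present.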

---

## Cover Letter / Resume Generation

- ~~**Iterative refinement feedback loop**~~ — ✅ Done (`94225c9`): `generate()` accepts `previous_result`/`feedback`; task_runner parses params JSON; Apply Workspace has "Refine with Feedback" expander. Same pattern available for wizard `expand_bullets` via `_run_wizard_generate`.