App: Peregrine Company: Circuit Forge LLC Source: github.com/pyr0ball/job-seeker (personal fork, not linked)
108 lines
3.9 KiB
Markdown
108 lines
3.9 KiB
Markdown
# Design: Job Ingestion Improvements
|
|
|
|
**Date:** 2026-02-24
|
|
**Status:** Approved
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
Three improvements to how jobs enter the pipeline:
|
|
|
|
1. **Auto-parse LinkedIn Job Alert emails** — digest emails from `jobalerts-noreply@linkedin.com`
|
|
contain multiple structured job cards in plain text. Currently ingested as a single confusing
|
|
email lead. Instead, parse each card into a separate pending job and scrape it via a background
|
|
task.
|
|
|
|
2. **`scrape_url` background task** — new task type that takes a job record's URL, fetches
|
|
the full listing (title, company, description, salary, location), and updates the job row.
|
|
Shared by both the LinkedIn alert parser and the manual URL import feature.
|
|
|
|
3. **Add Job(s) by URL on Home page** — paste one URL per line, or upload a CSV with a URL
|
|
column. Each URL is inserted as a pending job and queued for background scraping.
|
|
|
|
---
|
|
|
|
## `scrape_url` Worker (`scripts/scrape_url.py`)
|
|
|
|
Single public function: `scrape_job_url(db_path, job_id) -> dict`
|
|
|
|
Board detection from URL hostname:
|
|
|
|
| URL pattern | Board | Scrape method |
|
|
|---|---|---|
|
|
| `linkedin.com/jobs/view/<id>/` | LinkedIn | LinkedIn guest jobs API (`/jobs-guest/jobs/api/jobPosting/<id>`) |
|
|
| `indeed.com/viewjob?jk=<key>` | Indeed | requests + BeautifulSoup HTML parse |
|
|
| `glassdoor.com/...` | Glassdoor | JobSpy internal scraper (same as `enrich_descriptions.py`) |
|
|
| anything else | generic | requests + JSON-LD → og:tags fallback |
|
|
|
|
On success: `UPDATE jobs SET title, company, description, salary, location, is_remote WHERE id=?`
|
|
On failure: job remains pending with its URL intact — user can still approve/reject it.
|
|
|
|
Requires a new `update_job_fields(db_path, job_id, fields: dict)` helper in `db.py`.
|
|
|
|
---
|
|
|
|
## LinkedIn Alert Parser (`imap_sync.py`)
|
|
|
|
New function `parse_linkedin_alert(body: str) -> list[dict]`
|
|
|
|
The plain-text body has a reliable block structure:
|
|
```
|
|
<Title>
|
|
<Company>
|
|
<Location>
|
|
[optional social proof lines like "2 school alumni"]
|
|
View job: https://www.linkedin.com/comm/jobs/view/<ID>/?<tracking>
|
|
|
|
---------------------------------------------------------
|
|
|
|
<next job block...>
|
|
```
|
|
|
|
Parser:
|
|
1. Split on lines of 10+ dashes
|
|
2. For each block: filter out social-proof lines (alumni, "Apply with", "actively hiring", etc.)
|
|
3. Extract: title (line 1), company (line 2), location (line 3), URL (line starting "View job:")
|
|
4. Canonicalize URL: strip tracking params → `https://www.linkedin.com/jobs/view/<id>/`
|
|
|
|
Detection in `_scan_unmatched_leads`: if `from_addr` contains
|
|
`jobalerts-noreply@linkedin.com`, skip the LLM path and call `parse_linkedin_alert` instead.
|
|
Each parsed card → `insert_job()` + `submit_task(db, "scrape_url", job_id)`.
|
|
The email itself is not stored as an email lead — it's a batch import trigger.
|
|
|
|
---
|
|
|
|
## Home Page URL Import
|
|
|
|
New section on `app/Home.py` between Email Sync and Danger Zone.
|
|
|
|
Two tabs:
|
|
- **Paste URLs** — `st.text_area`, one URL per line
|
|
- **Upload CSV** — `st.file_uploader`, auto-detects first column value starting with `http`
|
|
|
|
Both routes call a shared `_queue_url_imports(db_path, urls)` helper that:
|
|
1. Filters URLs already in the DB (dedup by URL)
|
|
2. Calls `insert_job({title="Importing…", source="manual", url=url, ...})`
|
|
3. Calls `submit_task(db, "scrape_url", job_id)` per new job
|
|
4. Shows `st.success(f"Queued N job(s)")`
|
|
|
|
A `@st.fragment(run_every=3)` status block below the form polls active `scrape_url` tasks
|
|
and shows per-job status (⏳ / ✅ / ❌ title - company).
|
|
|
|
---
|
|
|
|
## Search Settings (already applied)
|
|
|
|
`config/search_profiles.yaml`:
|
|
- `hours_old: 120 → 240` (cover LinkedIn's algo-sorted alerts)
|
|
- `results_per_board: 50 → 75`
|
|
- Added title: `Customer Engagement Manager`
|
|
|
|
---
|
|
|
|
## Out of Scope
|
|
|
|
- Scraping all 551 historical LinkedIn alert emails (run email sync going forward)
|
|
- Deduplication against Notion (URL dedup in SQLite is sufficient)
|
|
- Authentication-required boards (Indeed Easy Apply, etc.)
|