App: Peregrine
Company: Circuit Forge LLC
Source: github.com/pyr0ball/job-seeker (personal fork, not linked)
# Design: Job Ingestion Improvements
Date: 2026-02-24 · Status: Approved

## Overview
Three improvements to how jobs enter the pipeline:

- **Auto-parse LinkedIn Job Alert emails** — digest emails from `jobalerts-noreply@linkedin.com` contain multiple structured job cards in plain text. They are currently ingested as a single confusing email lead. Instead, parse each card into a separate pending job and scrape it via a background task.
- **`scrape_url` background task** — a new task type that takes a job record's URL, fetches the full listing (title, company, description, salary, location), and updates the job row. Shared by both the LinkedIn alert parser and the manual URL import feature.
- **Add Job(s) by URL on the Home page** — paste one URL per line, or upload a CSV with a URL column. Each URL is inserted as a pending job and queued for background scraping.
## `scrape_url` Worker (`scripts/scrape_url.py`)
Single public function: `scrape_job_url(db_path, job_id) -> dict`
Board detection from URL hostname:

| URL pattern | Board | Scrape method |
|---|---|---|
| `linkedin.com/jobs/view/<id>/` | LinkedIn | guest jobs API (`/jobs-guest/jobs/api/jobPosting/<id>`) |
| `indeed.com/viewjob?jk=<key>` | Indeed | requests + BeautifulSoup HTML parse |
| `glassdoor.com/...` | Glassdoor | JobSpy internal scraper (same as `enrich_descriptions.py`) |
| anything else | generic | requests + JSON-LD → `og:` tags fallback |
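The hostname dispatch in the table above can be sketched as follows. This is a minimal illustration, not the actual worker; the function name `detect_board` and the returned labels are hypothetical.

```python
from urllib.parse import urlparse

def detect_board(url: str) -> str:
    """Map a job URL to a scrape strategy by hostname (illustrative sketch)."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    if host.endswith("linkedin.com"):
        return "linkedin"   # guest jobs API
    if host.endswith("indeed.com"):
        return "indeed"     # requests + BeautifulSoup
    if host.endswith("glassdoor.com"):
        return "glassdoor"  # JobSpy internal scraper
    return "generic"        # JSON-LD → og: tags fallback
```

Matching on the hostname suffix (rather than substring search over the full URL) avoids false positives from tracking parameters that embed another board's domain.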
On success: `UPDATE jobs SET title, company, description, salary, location, is_remote WHERE id = ?`
On failure: the job remains pending with its URL intact — the user can still approve/reject it.
Requires a new `update_job_fields(db_path, job_id, fields: dict)` helper in `db.py`.
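A minimal sketch of what `update_job_fields` could look like, assuming the `jobs` table columns listed above; the column whitelist and exact schema are assumptions, not the project's actual code:

```python
import sqlite3

# Columns the scraper is allowed to touch (assumed schema).
_SCRAPABLE = {"title", "company", "description", "salary", "location", "is_remote"}

def update_job_fields(db_path: str, job_id: int, fields: dict) -> None:
    """Update only whitelisted columns on one job row (illustrative sketch)."""
    cols = {k: v for k, v in fields.items() if k in _SCRAPABLE}
    if not cols:
        return
    assignments = ", ".join(f"{k} = ?" for k in cols)
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            f"UPDATE jobs SET {assignments} WHERE id = ?",
            [*cols.values(), job_id],
        )
```

Whitelisting the column names keeps the dynamically built SQL safe; the values themselves go through `?` placeholders.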
## LinkedIn Alert Parser (`imap_sync.py`)
New function: `parse_linkedin_alert(body: str) -> list[dict]`
The plain-text body has a reliable block structure:

```
<Title>
<Company>
<Location>
[optional social-proof lines like "2 school alumni"]
View job: https://www.linkedin.com/comm/jobs/view/<ID>/?<tracking>
---------------------------------------------------------
<next job block...>
```
Parser:
- Split on lines of 10+ dashes
- For each block: filter out social-proof lines (alumni, "Apply with", "actively hiring", etc.)
- Extract: title (line 1), company (line 2), location (line 3), URL (line starting with "View job:")
- Canonicalize the URL: strip tracking params → `https://www.linkedin.com/jobs/view/<id>/`
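The steps above can be sketched as a small parser. This is an illustration under the block structure shown earlier, not the shipped code; the social-proof patterns are a partial, assumed list.

```python
import re

# Assumed (non-exhaustive) markers of social-proof noise lines.
_SOCIAL_PROOF = re.compile(r"alumni|Apply with|actively hiring", re.IGNORECASE)

def _canonicalize(url: str) -> str:
    """Strip tracking params: .../comm/jobs/view/<id>/?<tracking> -> /jobs/view/<id>/."""
    m = re.search(r"/jobs/view/(\d+)", url)
    return f"https://www.linkedin.com/jobs/view/{m.group(1)}/" if m else url

def parse_linkedin_alert(body: str) -> list[dict]:
    jobs = []
    for block in re.split(r"\n-{10,}\n", body):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        lines = [l for l in lines if not _SOCIAL_PROOF.search(l)]
        url_line = next((l for l in lines if l.startswith("View job:")), None)
        if url_line is None or len(lines) < 4:
            continue  # malformed block; skip rather than guess
        jobs.append({
            "title": lines[0],
            "company": lines[1],
            "location": lines[2],
            "url": _canonicalize(url_line.removeprefix("View job:").strip()),
        })
    return jobs
```

Skipping malformed blocks (rather than raising) keeps one bad card from losing the rest of the digest.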
Detection in `_scan_unmatched_leads`: if `from_addr` contains `jobalerts-noreply@linkedin.com`, skip the LLM path and call `parse_linkedin_alert` instead.
Each parsed card → `insert_job()` + `submit_task(db, "scrape_url", job_id)`.
The email itself is not stored as an email lead — it's a batch-import trigger.
## Home Page URL Import
New section in `app/Home.py` between Email Sync and Danger Zone.
Two tabs:
- **Paste URLs** — `st.text_area`, one URL per line
- **Upload CSV** — `st.file_uploader`, auto-detects the first column whose values start with `http`

Both routes call a shared `_queue_url_imports(db_path, urls)` helper that:
- Filters out URLs already in the DB (dedup by URL)
- Calls `insert_job({title="Importing…", source="manual", url=url, ...})`
- Calls `submit_task(db, "scrape_url", job_id)` per new job
- Shows `st.success(f"Queued N job(s)")`
A `@st.fragment(run_every=3)` status block below the form polls active `scrape_url` tasks and shows per-job status (⏳ / ✅ / ❌ title - company).
## Search Settings (already applied)
`config/search_profiles.yaml`:
- `hours_old: 120 → 240` (cover LinkedIn's algo-sorted alerts)
- `results_per_board: 50 → 75`
- Added title: Customer Engagement Manager
## Out of Scope
- Scraping all 551 historical LinkedIn alert emails (run email sync going forward)
- Deduplication against Notion (URL dedup in SQLite is sufficient)
- Authentication-required boards (Indeed Easy Apply, etc.)