chore: move internal plans to circuitforge-plans repo
Some checks are pending
CI / test (push) Waiting to run
All docs/plans/ files migrated to pyr0ball/circuitforge-plans. Keeping docs/ for future user-facing documentation.
This commit is contained in:
parent
18efae71e1
commit
5f5319d8bf
37 changed files with 0 additions and 23835 deletions
0
docs/.gitkeep
Normal file
@ -1,201 +0,0 @@
# Job Seeker Platform — Design Document

**Date:** 2026-02-20
**Status:** Approved
**Candidate:** Alex Rivera

---
## Overview

A monorepo project at `/devl/job-seeker/` that integrates three FOSS tools into a
cohesive job-search pipeline: automated discovery (JobSpy), resume-to-listing keyword
matching (Resume Matcher), and automated application submission (AIHawk). Job listings
and interactive documents are tracked in Notion; source documents live in
`/Library/Documents/JobSearch/`.

---
## Project Structure

```
/devl/job-seeker/
├── config/
│   ├── search_profiles.yaml   # JobSpy queries (titles, locations, boards)
│   ├── llm.yaml               # LLM router: backends + fallback order
│   └── notion.yaml            # Notion DB IDs and field mappings
├── aihawk/                    # git clone — Auto_Jobs_Applier_AIHawk
├── resume_matcher/            # git clone — Resume-Matcher
├── scripts/
│   ├── discover.py            # JobSpy → deduplicate → push to Notion
│   ├── match.py               # Notion job URL → Resume Matcher → write score back
│   └── llm_router.py          # LLM abstraction layer with priority fallback chain
├── docs/plans/                # Design and implementation docs (no resume files)
├── environment.yml            # conda env spec (env name: job-seeker)
└── .gitignore
```

**Document storage rule:** Resumes, cover letters, and any interactive documents live
in `/Library/Documents/JobSearch/` or Notion — never committed to this repo.

---
## Architecture

### Data Flow

```
JobSpy (LinkedIn / Indeed / Glassdoor / ZipRecruiter)
  └─▶ discover.py
        ├─ deduplicate by URL against existing Notion records
        └─▶ Notion DB (Status: "New")

Notion DB (daily review — decide what to pursue)
  └─▶ match.py <notion-page-url>
        ├─ fetch job description from listing URL
        ├─ run Resume Matcher vs. /Library/Documents/JobSearch/Alex_Rivera_Resume_02-19-2025.pdf
        └─▶ write Match Score + Keyword Gaps back to Notion page

AIHawk (when ready to apply)
  ├─ reads config pointing to same resume + personal_info.yaml
  ├─ llm_router.py → best available LLM backend
  ├─ submits LinkedIn Easy Apply
  └─▶ Notion status → "Applied"
```

---
## Notion Database Schema

| Field        | Type     | Notes                                                    |
|--------------|----------|----------------------------------------------------------|
| Job Title    | Title    | Primary identifier                                       |
| Company      | Text     |                                                          |
| Location     | Text     |                                                          |
| Remote       | Checkbox |                                                          |
| URL          | URL      | Deduplication key                                        |
| Source       | Select   | LinkedIn / Indeed / Glassdoor / ZipRecruiter             |
| Status       | Select   | New → Reviewing → Applied → Interview → Offer → Rejected |
| Match Score  | Number   | 0–100, written by match.py                               |
| Keyword Gaps | Text     | Comma-separated missing keywords from Resume Matcher     |
| Salary       | Text     | If listed                                                |
| Date Found   | Date     | Set at discovery time                                    |
| Notes        | Text     | Manual field                                             |
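For reference, the schema maps onto Notion API page-property payloads roughly as follows. This is a hedged sketch: the property shapes follow the Notion API's pages endpoint, and the `job` dict keys are assumptions about what discover.py produces.

```python
def notion_properties(job: dict) -> dict:
    """Build the Notion page-properties payload for one scraped job.

    Property shapes follow the Notion API's pages endpoint; the `job` keys
    (title, company, ...) are hypothetical names for discover.py's output.
    """
    return {
        "Job Title": {"title": [{"text": {"content": job["title"]}}]},
        "Company": {"rich_text": [{"text": {"content": job["company"]}}]},
        "Location": {"rich_text": [{"text": {"content": job["location"]}}]},
        "Remote": {"checkbox": bool(job["is_remote"])},
        "URL": {"url": job["url"]},
        "Source": {"select": {"name": job["source"]}},
        "Status": {"select": {"name": "New"}},  # every discovered job starts as "New"
        "Date Found": {"date": {"start": job["date_found"]}},
    }
```

Match Score, Keyword Gaps, and Notes are written later (by match.py or manually), so they are omitted from the discovery-time payload.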

---
## LLM Router (`scripts/llm_router.py`)

Single `complete(prompt, system=None)` interface. On each call, the router
health-checks backends in the configured order and uses the first one that
responds. It falls back silently on connection errors, timeouts, and 5xx
responses, and logs which backend served the request.

All backends except Anthropic use the `openai` Python package (OpenAI-compatible
endpoints); Anthropic uses the `anthropic` package.

### `config/llm.yaml`

```yaml
fallback_order:
  - claude_code      # port 3009 — Claude via local pipeline (highest quality)
  - ollama           # port 11434 — local, always-on
  - vllm             # port 8000 — start when needed
  - github_copilot   # port 3010 — Copilot via gh token
  - anthropic        # cloud fallback, burns API credits

backends:
  claude_code:
    type: openai_compat
    base_url: http://localhost:3009/v1
    model: claude-code-terminal
    api_key: "any"

  ollama:
    type: openai_compat
    base_url: http://localhost:11434/v1
    model: llama3.2
    api_key: "ollama"

  vllm:
    type: openai_compat
    base_url: http://localhost:8000/v1
    model: __auto__
    api_key: ""

  github_copilot:
    type: openai_compat
    base_url: http://localhost:3010/v1
    model: gpt-4o
    api_key: "any"

  anthropic:
    type: anthropic
    model: claude-sonnet-4-6
    api_key_env: ANTHROPIC_API_KEY
```
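The fallback behaviour described above reduces to a small loop. This is a hedged sketch of the router's logic only, not the real module: backends are injected as `(name, callable)` pairs rather than built from `config/llm.yaml`, and each callable is assumed to raise on connection error, timeout, or 5xx.

```python
import logging


def complete(prompt, system=None, backends=None):
    """Try each backend in configured order; return the first successful result.

    `backends` is an ordered list of (name, call) pairs, where `call` raises
    on failure. Errors are collected silently; the serving backend is logged.
    """
    errors = {}
    for name, call in backends or []:
        try:
            result = call(prompt, system)
            logging.info("llm_router: served by %s", name)
            return result
        except Exception as exc:  # connection error / timeout / 5xx → fall back
            errors[name] = str(exc)
    raise RuntimeError(f"all LLM backends failed: {errors}")
```

A backend that is down simply contributes an entry to `errors` and the chain moves on; only when every backend fails does the caller see an exception.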

---
## Job Search Profile

### `config/search_profiles.yaml` (initial)

```yaml
profiles:
  - name: cs_leadership
    titles:
      - "Customer Success Manager"
      - "Director of Customer Success"
      - "VP Customer Success"
      - "Head of Customer Success"
      - "Technical Account Manager"
      - "Revenue Operations Manager"
      - "Customer Experience Lead"
    locations:
      - "Remote"
      - "San Francisco Bay Area, CA"
    boards:
      - linkedin
      - indeed
      - glassdoor
      - zip_recruiter
    results_per_board: 25
    remote_only: false   # remote preferred but Bay Area in-person ok
    hours_old: 72        # listings posted in last 3 days
```
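A profile like this expands into one scrape call per (title, location) pair. The sketch below builds keyword dicts intended for `jobspy.scrape_jobs(**kw)`; the parameter names follow python-jobspy's documented interface but should be verified against the installed version, and the helper name is hypothetical.

```python
def scrape_kwargs(profile: dict) -> list[dict]:
    """Expand one search profile into per-(title, location) JobSpy call kwargs.

    Sketch only: key names (site_name, search_term, results_wanted, hours_old)
    mirror python-jobspy's scrape_jobs parameters as documented, and should be
    checked against the installed version before use.
    """
    return [
        {
            "site_name": profile["boards"],
            "search_term": title,
            "location": location,
            "results_wanted": profile["results_per_board"],
            "hours_old": profile["hours_old"],
        }
        for title in profile["titles"]
        for location in profile["locations"]
    ]
```

With the initial profile above (7 titles × 2 locations), discover.py would issue 14 scrape calls per run, each capped at 25 results per board.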

---
## Conda Environment

New dedicated env `job-seeker` (not base). Core packages:

- `python-jobspy` — job scraping
- `notion-client` — Notion API
- `openai` — OpenAI-compatible calls (Ollama, vLLM, Copilot, Claude pipeline)
- `anthropic` — Anthropic API fallback
- `pyyaml` — config parsing
- `pandas` — CSV handling and dedup
- Resume Matcher dependencies (sentence-transformers, streamlit — installed from clone)

The Resume Matcher Streamlit UI runs on port **8501** (confirmed free).

---
## Port Map

| Port  | Service                    | Status / How to start                        |
|-------|----------------------------|----------------------------------------------|
| 3009  | Claude Code OpenAI wrapper | Start via manage.sh in Post Fight Processing |
| 3010  | GitHub Copilot wrapper     | Start via manage-copilot.sh                  |
| 11434 | Ollama                     | Running                                      |
| 8000  | vLLM                       | Start when needed                            |
| 8501  | Resume Matcher (Streamlit) | Start when needed                            |

---
## Out of Scope (this phase)

- Scheduled/cron automation (run discover.py manually for now)
- Email/SMS alerts for new listings
- ATS resume rebuild (separate task)
- Applications to non-LinkedIn platforms via AIHawk
File diff suppressed because it is too large
@ -1,148 +0,0 @@
# Job Seeker Platform — Web UI Design

**Date:** 2026-02-20
**Status:** Approved
## Overview

A Streamlit multi-page web UI that gives Alex (and her partner) a friendly interface to review scraped job listings, curate them before they hit Notion, edit search/LLM/Notion settings, and fill out her AIHawk application profile. Designed to be usable by anyone — no technical knowledge required.

---
## Architecture & Data Flow

```
discover.py → SQLite staging.db (status: pending)
        ↓
   Streamlit UI
   review / approve / reject
        ↓
"Sync N approved jobs" button
        ↓
Notion DB (status: synced)
```

`discover.py` is modified to write to SQLite instead of directly to Notion.
A new `sync.py` handles the approved → Notion push.
`db.py` provides shared SQLite helpers used by both scripts and UI pages.
### SQLite Schema (`staging.db`, gitignored)

```sql
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    company TEXT,
    url TEXT UNIQUE,
    source TEXT,
    location TEXT,
    is_remote INTEGER,
    salary TEXT,
    description TEXT,
    match_score REAL,
    keyword_gaps TEXT,
    date_found TEXT,
    status TEXT DEFAULT 'pending',  -- pending / approved / rejected / synced
    notion_page_id TEXT
);
```
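With `url` declared UNIQUE, deduplication can lean on `INSERT OR IGNORE`: re-scraped listings with a known URL are silently skipped. A minimal sketch of the staging insert (the helper name is hypothetical, not necessarily what db.py exposes):

```python
import sqlite3


def stage_jobs(conn: sqlite3.Connection, jobs: list[dict]) -> int:
    """Insert scraped jobs into staging, skipping URLs already present.

    Relies on the UNIQUE constraint on jobs.url: INSERT OR IGNORE makes the
    URL the dedup key, matching the schema above. Returns how many rows were
    actually inserted. (Sketch of a likely db.py helper, not the real code.)
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (title, company, url, source) "
        "VALUES (:title, :company, :url, :source)",
        jobs,
    )
    conn.commit()
    return conn.total_changes - before
```

Counting via `total_changes` (rather than `cursor.rowcount`) gives an unambiguous insert count when some rows are ignored.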

---
## Pages

### Home (Dashboard)

- Stat cards: Pending / Approved / Rejected / Synced counts
- "Run Discovery" button — runs `discover.py` as a subprocess and streams output
- "Sync N approved jobs → Notion" button — visible only when the approved count > 0
- Recent activity list (last 10 jobs found)

### Job Review

- Filterable table/card view of pending jobs
- Filters: source (LinkedIn/Indeed/etc.), remote-only toggle, minimum match-score slider
- Checkboxes for batch selection
- "Approve Selected" / "Reject Selected" buttons
- Rejected jobs hidden by default, togglable
- Match score shown as a colored badge (green ≥ 70, amber 40–69, red < 40)
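The badge thresholds reduce to a small pure function (the name `score_badge` is illustrative, not from the codebase):

```python
def score_badge(score: float) -> str:
    """Map a 0-100 match score to a badge color per the thresholds above."""
    if score >= 70:
        return "green"
    if score >= 40:
        return "amber"
    return "red"
```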

### Settings

Three tabs:

**Search** — edit `config/search_profiles.yaml`:
- Job titles (add/remove tags)
- Locations (add/remove)
- Boards checkboxes
- Hours-old slider
- Results-per-board slider

**LLM Backends** — edit `config/llm.yaml`:
- Fallback order (drag or up/down arrows)
- Per-backend: URL, model name, enabled toggle
- "Test connection" button per backend

**Notion** — edit `config/notion.yaml`:
- Token field (masked, show/hide toggle)
- Database ID
- "Test connection" button

### Resume Editor

Sectioned form over `aihawk/data_folder/plain_text_resume.yaml`:
- **Personal Info** — name, email, phone, LinkedIn, city, zip
- **Education** — list of entries, add/remove buttons
- **Experience** — list of entries, add/remove buttons
- **Skills & Interests** — tag-style inputs
- **Preferences** — salary range, notice period, remote/relocation toggles
- **Self-Identification** — gender, pronouns, veteran, disability, ethnicity (with "prefer not to say" options)
- **Legal** — work authorization checkboxes

`FILL_IN` fields are highlighted in amber with a "Needs your attention" note.
The Save button writes back to YAML. No raw YAML is shown by default.

---
## Theme & Styling

Central theme at `app/.streamlit/config.toml`:
- Dark base, teal/green accent color (job search = growth)
- Consistent font (Inter or system sans-serif)
- Responsive column layouts — usable on tablet/mobile
- No jargon — "Run Discovery" not "Execute scrape", "Sync to Notion" not "Push records"

---
## File Layout

```
app/
├── .streamlit/
│   └── config.toml          # central theme
├── Home.py                  # dashboard
└── pages/
    ├── 1_Job_Review.py
    ├── 2_Settings.py
    └── 3_Resume_Editor.py
scripts/
├── db.py                    # new: SQLite helpers
├── sync.py                  # new: approved → Notion push
├── discover.py              # modified: write to SQLite, not Notion
├── match.py                 # unchanged
└── llm_router.py            # unchanged
```

Run: `conda run -n job-seeker streamlit run app/Home.py`

---
## New Dependencies

None — `streamlit` is already installed via the resume_matcher dependencies, and
`sqlite3` is in the Python standard library.

---
## Out of Scope

- Real-time collaboration
- Mobile native app
- Cover letter editor (handled separately via LoRA fine-tune task)
- AIHawk trigger from UI (run manually for now)
File diff suppressed because it is too large
@ -1,100 +0,0 @@
# Background Task Processing — Design

**Date:** 2026-02-21
**Status:** Approved
## Problem

Cover letter generation (`4_Apply.py`) and company research (`6_Interview_Prep.py`) call LLM scripts synchronously inside `st.spinner()`. If the user navigates away during generation, Streamlit abandons the in-progress call and the result is lost. Both results are already persisted to SQLite on completion, so if the task kept running in the background, the result would be available on return.
## Solution Overview

Python threading plus a SQLite task table. When the user clicks Generate, a daemon thread is spawned immediately and the task is recorded in a new `background_tasks` table. The thread writes results to the existing tables (`jobs.cover_letter`, `company_research`) and marks itself completed or failed. All pages share a sidebar indicator that auto-refreshes while tasks are active; individual pages show task-level status inline.
## SQLite Schema

New table `background_tasks`, added in `scripts/db.py`:

```sql
CREATE TABLE IF NOT EXISTS background_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_type TEXT NOT NULL,               -- "cover_letter" | "company_research"
    job_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued', -- queued | running | completed | failed
    error TEXT,
    created_at DATETIME DEFAULT (datetime('now')),
    started_at DATETIME,
    finished_at DATETIME
)
```
## Deduplication Rule

Before inserting a new task, check for an existing `queued` or `running` row with the same `(task_type, job_id)`. If one exists, reject the submission and return the existing task's id. Different task types for the same job (e.g. cover letter + research) may run concurrently, as may different jobs of the same type.
## Components

### `scripts/task_runner.py` (new)

- `submit_task(db, task_type, job_id) -> int` — dedup check, insert row, spawn daemon thread, return task id
- `_run_task(db, task_id, task_type, job_id)` — thread body: mark running, call generator, save result, mark completed/failed
- `get_active_tasks(db) -> list[dict]` — all queued/running rows with job title + company joined
- `get_task_for_job(db, task_type, job_id) -> dict | None` — latest task row for a specific job + type
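The submit/dedup/spawn flow can be sketched as follows. To keep the sketch self-contained, the DB helpers and generator functions are injected as parameters; the real task_runner.py imports them from `scripts.db` and the generator modules instead, so treat the signature as illustrative.

```python
import threading


def submit_task(db, task_type, job_id, *, insert_task, update_task_status, generators):
    """Sketch of submit_task: dedup, record, spawn daemon thread, return id.

    `insert_task`/`update_task_status` stand in for the scripts.db helpers;
    `generators` maps task_type to a callable that does the actual LLM work
    and persists its own result. Injected here only to keep the sketch runnable.
    """
    task_id, is_new = insert_task(db, task_type, job_id)
    if not is_new:
        return task_id  # identical task already queued/running → reject, reuse id

    def _run():
        update_task_status(db, task_id, "running")
        try:
            generators[task_type](db, job_id)        # writes result to its table
            update_task_status(db, task_id, "completed")
        except Exception as exc:
            update_task_status(db, task_id, "failed", str(exc))

    threading.Thread(target=_run, daemon=True).start()
    return task_id
```

Because the thread is a daemon, it dies with the Streamlit server, but any result it has already written to SQLite survives, which matches the recovery model described above.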

### `scripts/db.py` (modified)

- Add `init_background_tasks(conn)`, called inside `init_db()`
- Add `insert_task`, `update_task_status`, `get_active_tasks`, `get_task_for_job` helpers

### `app/app.py` (modified)

- After `st.navigation()`, call `get_active_tasks()` and render the sidebar indicator
- Use `st.fragment` with `time.sleep(3)` + `st.rerun(scope="fragment")` to poll while tasks are active
- Sidebar shows a `⏳ N task(s) running` count plus a per-task line (type + company name)
- Fragment polling stops when the active task count reaches zero

### `app/pages/4_Apply.py` (modified)

- Generate button calls `submit_task(db, "cover_letter", job_id)` instead of running inline
- If a task is `queued`/`running` for the selected job, disable the button and show an inline status fragment (polls every 3s)
- On `completed`, load the cover letter from the `jobs` row (already saved by the thread)
- On `failed`, show the error message and re-enable the button

### `app/pages/6_Interview_Prep.py` (modified)

- Generate/Refresh buttons call `submit_task(db, "company_research", job_id)` instead of running inline
- Same inline status fragment pattern as the Apply page
## Data Flow

```
User clicks Generate
  → submit_task(db, type, job_id)
      → dedup check (reject if already queued/running for same type+job)
      → INSERT background_tasks row (status=queued)
      → spawn daemon thread
      → return task_id
  → page shows inline "⏳ Queued…" fragment

Thread runs
  → UPDATE status=running, started_at=now
  → call generate_cover_letter.generate() OR research_company()
  → write result to jobs.cover_letter OR company_research table
  → UPDATE status=completed, finished_at=now
     (on exception: UPDATE status=failed, error=str(e))

Sidebar fragment (every 3s while active tasks > 0)
  → get_active_tasks() → render count + list
  → st.rerun(scope="fragment")

Page fragment (every 3s while a task for this job is running)
  → get_task_for_job() → render status
  → on completed: st.rerun() (full rerun to reload cover letter / research)
```

## What Is Not Changed

- `generate_cover_letter.generate()` and `research_company()` are called unchanged from the thread
- `update_cover_letter()` and `save_research()` DB helpers are reused unchanged
- No new Python packages are required
- No separate worker process — daemon threads die with the Streamlit server, but results already written to SQLite survive
@ -1,933 +0,0 @@
# Background Task Processing Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Replace synchronous LLM calls in the Apply and Interview Prep pages with background threads so cover letter and research generation survive page navigation.

**Architecture:** A new `background_tasks` SQLite table tracks task state. `scripts/task_runner.py` spawns daemon threads that call existing generator functions and write results via existing DB helpers. The Streamlit sidebar polls active tasks every 3s via `@st.fragment(run_every=3)`; individual pages show per-job status with the same pattern.

**Tech Stack:** Python `threading` (stdlib), SQLite, Streamlit `st.fragment` (≥1.33 — already installed)

---
## Task 1: Add background_tasks table and DB helpers

**Files:**
- Modify: `scripts/db.py`
- Test: `tests/test_db.py`

### Step 1: Write the failing tests

Add to `tests/test_db.py`:
```python
# ── background_tasks tests ────────────────────────────────────────────────────

def test_init_db_creates_background_tasks_table(tmp_path):
    """init_db creates a background_tasks table."""
    from scripts.db import init_db
    db_path = tmp_path / "test.db"
    init_db(db_path)
    import sqlite3
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='background_tasks'"
    )
    assert cur.fetchone() is not None
    conn.close()


def test_insert_task_returns_id_and_true(tmp_path):
    """insert_task returns (task_id, True) for a new task."""
    from scripts.db import init_db, insert_job, insert_task
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    task_id, is_new = insert_task(db_path, "cover_letter", job_id)
    assert isinstance(task_id, int) and task_id > 0
    assert is_new is True


def test_insert_task_deduplicates_active_task(tmp_path):
    """insert_task returns (existing_id, False) if a queued/running task already exists."""
    from scripts.db import init_db, insert_job, insert_task
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    first_id, _ = insert_task(db_path, "cover_letter", job_id)
    second_id, is_new = insert_task(db_path, "cover_letter", job_id)
    assert second_id == first_id
    assert is_new is False


def test_insert_task_allows_different_types_same_job(tmp_path):
    """insert_task allows cover_letter and company_research for the same job concurrently."""
    from scripts.db import init_db, insert_job, insert_task
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    _, cl_new = insert_task(db_path, "cover_letter", job_id)
    _, res_new = insert_task(db_path, "company_research", job_id)
    assert cl_new is True
    assert res_new is True


def test_update_task_status_running(tmp_path):
    """update_task_status('running') sets started_at."""
    from scripts.db import init_db, insert_job, insert_task, update_task_status
    import sqlite3
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    task_id, _ = insert_task(db_path, "cover_letter", job_id)
    update_task_status(db_path, task_id, "running")
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT status, started_at FROM background_tasks WHERE id=?", (task_id,)).fetchone()
    conn.close()
    assert row[0] == "running"
    assert row[1] is not None


def test_update_task_status_completed(tmp_path):
    """update_task_status('completed') sets finished_at."""
    from scripts.db import init_db, insert_job, insert_task, update_task_status
    import sqlite3
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    task_id, _ = insert_task(db_path, "cover_letter", job_id)
    update_task_status(db_path, task_id, "completed")
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT status, finished_at FROM background_tasks WHERE id=?", (task_id,)).fetchone()
    conn.close()
    assert row[0] == "completed"
    assert row[1] is not None


def test_update_task_status_failed_stores_error(tmp_path):
    """update_task_status('failed') stores error message and sets finished_at."""
    from scripts.db import init_db, insert_job, insert_task, update_task_status
    import sqlite3
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    task_id, _ = insert_task(db_path, "cover_letter", job_id)
    update_task_status(db_path, task_id, "failed", error="LLM timeout")
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT status, error, finished_at FROM background_tasks WHERE id=?", (task_id,)).fetchone()
    conn.close()
    assert row[0] == "failed"
    assert row[1] == "LLM timeout"
    assert row[2] is not None


def test_get_active_tasks_returns_only_active(tmp_path):
    """get_active_tasks returns only queued/running tasks with job info joined."""
    from scripts.db import init_db, insert_job, insert_task, update_task_status, get_active_tasks
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    active_id, _ = insert_task(db_path, "cover_letter", job_id)
    done_id, _ = insert_task(db_path, "company_research", job_id)
    update_task_status(db_path, done_id, "completed")

    tasks = get_active_tasks(db_path)
    assert len(tasks) == 1
    assert tasks[0]["id"] == active_id
    assert tasks[0]["company"] == "Acme"
    assert tasks[0]["title"] == "CSM"


def test_get_task_for_job_returns_latest(tmp_path):
    """get_task_for_job returns the most recent task for the given type+job."""
    from scripts.db import init_db, insert_job, insert_task, update_task_status, get_task_for_job
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    first_id, _ = insert_task(db_path, "cover_letter", job_id)
    update_task_status(db_path, first_id, "completed")
    second_id, _ = insert_task(db_path, "cover_letter", job_id)  # allowed since first is done

    task = get_task_for_job(db_path, "cover_letter", job_id)
    assert task is not None
    assert task["id"] == second_id


def test_get_task_for_job_returns_none_when_absent(tmp_path):
    """get_task_for_job returns None when no task exists for that job+type."""
    from scripts.db import init_db, insert_job, get_task_for_job
    db_path = tmp_path / "test.db"
    init_db(db_path)
    job_id = insert_job(db_path, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "", "date_found": "2026-02-20",
    })
    assert get_task_for_job(db_path, "cover_letter", job_id) is None
```
### Step 2: Run tests to verify they fail

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py -v -k "background_tasks or insert_task or update_task_status or get_active_tasks or get_task_for_job"
```

Expected: FAIL with `ImportError: cannot import name 'insert_task'`

### Step 3: Implement in scripts/db.py

Add the DDL constant after `CREATE_COMPANY_RESEARCH`:
```python
CREATE_BACKGROUND_TASKS = """
CREATE TABLE IF NOT EXISTS background_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_type TEXT NOT NULL,
    job_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',
    error TEXT,
    created_at DATETIME DEFAULT (datetime('now')),
    started_at DATETIME,
    finished_at DATETIME
)
"""
```

Add `conn.execute(CREATE_BACKGROUND_TASKS)` inside `init_db()`, after the existing three `conn.execute()` calls:

```python
def init_db(db_path: Path = DEFAULT_DB) -> None:
    """Create tables if they don't exist, then run migrations."""
    conn = sqlite3.connect(db_path)
    conn.execute(CREATE_JOBS)
    conn.execute(CREATE_JOB_CONTACTS)
    conn.execute(CREATE_COMPANY_RESEARCH)
    conn.execute(CREATE_BACKGROUND_TASKS)  # ← add this line
    conn.commit()
    conn.close()
    _migrate_db(db_path)
```

Add the four helper functions at the end of `scripts/db.py`:

```python
# ── Background task helpers ───────────────────────────────────────────────────

def insert_task(db_path: Path = DEFAULT_DB, task_type: str = "",
                job_id: Optional[int] = None) -> tuple[int, bool]:
    """Insert a new background task.

    Returns (task_id, True) if inserted, or (existing_id, False) if a
    queued/running task for the same (task_type, job_id) already exists.
    """
    conn = sqlite3.connect(db_path)
    existing = conn.execute(
        "SELECT id FROM background_tasks WHERE task_type=? AND job_id=? AND status IN ('queued','running')",
        (task_type, job_id),
    ).fetchone()
    if existing:
        conn.close()
        return existing[0], False
    cur = conn.execute(
        "INSERT INTO background_tasks (task_type, job_id, status) VALUES (?, ?, 'queued')",
        (task_type, job_id),
    )
    task_id = cur.lastrowid
    conn.commit()
    conn.close()
    return task_id, True


def update_task_status(db_path: Path = DEFAULT_DB, task_id: Optional[int] = None,
                       status: str = "", error: Optional[str] = None) -> None:
    """Update a task's status and set the appropriate timestamp."""
    now = datetime.now().isoformat()[:16]
    conn = sqlite3.connect(db_path)
    if status == "running":
        conn.execute(
            "UPDATE background_tasks SET status=?, started_at=? WHERE id=?",
            (status, now, task_id),
        )
    elif status in ("completed", "failed"):
        conn.execute(
            "UPDATE background_tasks SET status=?, finished_at=?, error=? WHERE id=?",
            (status, now, error, task_id),
        )
    else:
        conn.execute("UPDATE background_tasks SET status=? WHERE id=?", (status, task_id))
    conn.commit()
    conn.close()


def get_active_tasks(db_path: Path = DEFAULT_DB) -> list[dict]:
    """Return all queued/running tasks with job title and company joined in."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("""
        SELECT bt.*, j.title, j.company
        FROM background_tasks bt
        LEFT JOIN jobs j ON j.id = bt.job_id
        WHERE bt.status IN ('queued', 'running')
        ORDER BY bt.created_at ASC
    """).fetchall()
    conn.close()
    return [dict(r) for r in rows]


def get_task_for_job(db_path: Path = DEFAULT_DB, task_type: str = "",
                     job_id: Optional[int] = None) -> Optional[dict]:
    """Return the most recent task row for a (task_type, job_id) pair, or None."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        """SELECT * FROM background_tasks
           WHERE task_type=? AND job_id=?
           ORDER BY id DESC LIMIT 1""",
        (task_type, job_id),
    ).fetchone()
    conn.close()
    return dict(row) if row else None
```
### Step 4: Run tests to verify they pass
|
||||
|
||||
```bash
|
||||
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py -v -k "background_tasks or insert_task or update_task_status or get_active_tasks or get_task_for_job"
|
||||
```
|
||||
|
||||
Expected: all new tests PASS, no regressions
|
||||
|
||||
### Step 5: Run full test suite
|
||||
|
||||
```bash
|
||||
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
|
||||
```
|
||||
|
||||
Expected: all tests PASS
|
||||
|
||||
### Step 6: Commit
|
||||
|
||||
```bash
|
||||
git add scripts/db.py tests/test_db.py
|
||||
git commit -m "feat: add background_tasks table and DB helpers"
|
||||
```
|
||||
|
||||
---

## Task 2: Create scripts/task_runner.py

**Files:**
- Create: `scripts/task_runner.py`
- Test: `tests/test_task_runner.py`

### Step 1: Write the failing tests

Create `tests/test_task_runner.py`:

```python
import threading
import time
import pytest
from pathlib import Path
from unittest.mock import patch, MagicMock
import sqlite3


def _make_db(tmp_path):
    from scripts.db import init_db, insert_job
    db = tmp_path / "test.db"
    init_db(db)
    job_id = insert_job(db, {
        "title": "CSM", "company": "Acme", "url": "https://ex.com/1",
        "source": "linkedin", "location": "Remote", "is_remote": True,
        "salary": "", "description": "Great role.", "date_found": "2026-02-20",
    })
    return db, job_id


def test_submit_task_returns_id_and_true(tmp_path):
    """submit_task returns (task_id, True) and spawns a thread."""
    db, job_id = _make_db(tmp_path)
    with patch("scripts.task_runner._run_task"):  # don't actually call LLM
        from scripts.task_runner import submit_task
        task_id, is_new = submit_task(db, "cover_letter", job_id)
        assert isinstance(task_id, int) and task_id > 0
        assert is_new is True


def test_submit_task_deduplicates(tmp_path):
    """submit_task returns (existing_id, False) for a duplicate in-flight task."""
    db, job_id = _make_db(tmp_path)
    with patch("scripts.task_runner._run_task"):
        from scripts.task_runner import submit_task
        first_id, _ = submit_task(db, "cover_letter", job_id)
        second_id, is_new = submit_task(db, "cover_letter", job_id)
        assert second_id == first_id
        assert is_new is False


def test_run_task_cover_letter_success(tmp_path):
    """_run_task marks running→completed and saves cover letter to DB."""
    db, job_id = _make_db(tmp_path)
    from scripts.db import insert_task, get_task_for_job
    task_id, _ = insert_task(db, "cover_letter", job_id)

    with patch("scripts.generate_cover_letter.generate", return_value="Dear Hiring Manager,\nGreat fit!"):
        from scripts.task_runner import _run_task
        _run_task(db, task_id, "cover_letter", job_id)

    task = get_task_for_job(db, "cover_letter", job_id)
    assert task["status"] == "completed"
    assert task["error"] is None

    conn = sqlite3.connect(db)
    row = conn.execute("SELECT cover_letter FROM jobs WHERE id=?", (job_id,)).fetchone()
    conn.close()
    assert row[0] == "Dear Hiring Manager,\nGreat fit!"


def test_run_task_company_research_success(tmp_path):
    """_run_task marks running→completed and saves research to DB."""
    db, job_id = _make_db(tmp_path)
    from scripts.db import insert_task, get_task_for_job, get_research

    task_id, _ = insert_task(db, "company_research", job_id)
    fake_result = {
        "raw_output": "raw", "company_brief": "brief",
        "ceo_brief": "ceo", "talking_points": "points",
    }
    with patch("scripts.company_research.research_company", return_value=fake_result):
        from scripts.task_runner import _run_task
        _run_task(db, task_id, "company_research", job_id)

    task = get_task_for_job(db, "company_research", job_id)
    assert task["status"] == "completed"

    research = get_research(db, job_id=job_id)
    assert research["company_brief"] == "brief"


def test_run_task_marks_failed_on_exception(tmp_path):
    """_run_task marks status=failed and stores error when generator raises."""
    db, job_id = _make_db(tmp_path)
    from scripts.db import insert_task, get_task_for_job
    task_id, _ = insert_task(db, "cover_letter", job_id)

    with patch("scripts.generate_cover_letter.generate", side_effect=RuntimeError("LLM timeout")):
        from scripts.task_runner import _run_task
        _run_task(db, task_id, "cover_letter", job_id)

    task = get_task_for_job(db, "cover_letter", job_id)
    assert task["status"] == "failed"
    assert "LLM timeout" in task["error"]


def test_submit_task_actually_completes(tmp_path):
    """Integration: submit_task spawns a thread that completes asynchronously."""
    db, job_id = _make_db(tmp_path)
    from scripts.db import get_task_for_job

    with patch("scripts.generate_cover_letter.generate", return_value="Cover letter text"):
        from scripts.task_runner import submit_task
        task_id, _ = submit_task(db, "cover_letter", job_id)
        # Wait for thread to complete (max 5s)
        for _ in range(50):
            task = get_task_for_job(db, "cover_letter", job_id)
            if task and task["status"] in ("completed", "failed"):
                break
            time.sleep(0.1)

    task = get_task_for_job(db, "cover_letter", job_id)
    assert task["status"] == "completed"
```

### Step 2: Run tests to verify they fail

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_task_runner.py -v
```

Expected: FAIL with `ModuleNotFoundError: No module named 'scripts.task_runner'`

### Step 3: Implement scripts/task_runner.py

Create `scripts/task_runner.py`:

```python
# scripts/task_runner.py
"""
Background task runner for LLM generation tasks.

Submitting a task inserts a row in background_tasks and spawns a daemon thread.
The thread calls the appropriate generator, writes results to existing tables,
and marks the task completed or failed.

Deduplication: only one queued/running task per (task_type, job_id) is allowed.
Different task types for the same job run concurrently (e.g. cover letter + research).
"""
import sqlite3
import threading
from pathlib import Path
from typing import Optional

from scripts.db import (
    DEFAULT_DB,
    insert_task,
    update_task_status,
    update_cover_letter,
    save_research,
)


def submit_task(db_path: Path = DEFAULT_DB, task_type: str = "",
                job_id: Optional[int] = None) -> tuple[int, bool]:
    """Submit a background LLM task.

    Returns (task_id, True) if a new task was queued and a thread spawned.
    Returns (existing_id, False) if an identical task is already in-flight.
    """
    task_id, is_new = insert_task(db_path, task_type, job_id)
    if is_new:
        t = threading.Thread(
            target=_run_task,
            args=(db_path, task_id, task_type, job_id),
            daemon=True,
        )
        t.start()
    return task_id, is_new


def _run_task(db_path: Path, task_id: int, task_type: str, job_id: int) -> None:
    """Thread body: run the generator and persist the result."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    row = conn.execute("SELECT * FROM jobs WHERE id=?", (job_id,)).fetchone()
    conn.close()
    if row is None:
        update_task_status(db_path, task_id, "failed", error=f"Job {job_id} not found")
        return

    job = dict(row)
    update_task_status(db_path, task_id, "running")

    try:
        if task_type == "cover_letter":
            from scripts.generate_cover_letter import generate
            result = generate(
                job.get("title", ""),
                job.get("company", ""),
                job.get("description", ""),
            )
            update_cover_letter(db_path, job_id, result)

        elif task_type == "company_research":
            from scripts.company_research import research_company
            result = research_company(job)
            save_research(db_path, job_id=job_id, **result)

        else:
            raise ValueError(f"Unknown task_type: {task_type!r}")

        update_task_status(db_path, task_id, "completed")

    except Exception as exc:
        update_task_status(db_path, task_id, "failed", error=str(exc))
```

### Step 4: Run tests to verify they pass

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_task_runner.py -v
```

Expected: all tests PASS

### Step 5: Run full test suite

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```

Expected: all tests PASS

### Step 6: Commit

```bash
git add scripts/task_runner.py tests/test_task_runner.py
git commit -m "feat: add task_runner — background thread executor for LLM tasks"
```

---

## Task 3: Add sidebar task indicator to app/app.py

**Files:**
- Modify: `app/app.py`

No new tests needed — this is pure UI wiring.

### Step 1: Replace the contents of app/app.py

Current file is 33 lines. Replace entirely with:

```python
# app/app.py
"""
Streamlit entry point — uses st.navigation() to control the sidebar.
Main workflow pages are listed at the top; Settings is separated into
a "System" section so it doesn't crowd the navigation.

Run: streamlit run app/app.py
     bash scripts/manage-ui.sh start
"""
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

import streamlit as st
from scripts.db import DEFAULT_DB, init_db, get_active_tasks

st.set_page_config(
    page_title="Job Seeker",
    page_icon="💼",
    layout="wide",
)

init_db(DEFAULT_DB)

# ── Background task sidebar indicator ─────────────────────────────────────────
@st.fragment(run_every=3)
def _task_sidebar() -> None:
    tasks = get_active_tasks(DEFAULT_DB)
    if not tasks:
        return
    with st.sidebar:
        st.divider()
        st.markdown(f"**⏳ {len(tasks)} task(s) running**")
        for t in tasks:
            icon = "⏳" if t["status"] == "running" else "🕐"
            label = "Cover letter" if t["task_type"] == "cover_letter" else "Research"
            st.caption(f"{icon} {label} — {t.get('company') or 'unknown'}")

_task_sidebar()

# ── Navigation ─────────────────────────────────────────────────────────────────
pages = {
    "": [
        st.Page("Home.py", title="Home", icon="🏠"),
        st.Page("pages/1_Job_Review.py", title="Job Review", icon="📋"),
        st.Page("pages/4_Apply.py", title="Apply Workspace", icon="🚀"),
        st.Page("pages/5_Interviews.py", title="Interviews", icon="🎯"),
        st.Page("pages/6_Interview_Prep.py", title="Interview Prep", icon="📞"),
    ],
    "System": [
        st.Page("pages/2_Settings.py", title="Settings", icon="⚙️"),
    ],
}

pg = st.navigation(pages)
pg.run()
```

### Step 2: Smoke-test by running the UI

```bash
bash /devl/job-seeker/scripts/manage-ui.sh restart
```

Navigate to http://localhost:8501 and confirm the app loads without error. The sidebar task indicator does not appear when no tasks are running (correct).

### Step 3: Commit

```bash
git add app/app.py
git commit -m "feat: sidebar background task indicator with 3s auto-refresh"
```

---

## Task 4: Update 4_Apply.py to use background generation

**Files:**
- Modify: `app/pages/4_Apply.py`

No new unit tests — covered by existing test suite for DB layer. Smoke-test in browser.

### Step 1: Add imports at the top of 4_Apply.py

After the existing imports block (after `from scripts.db import ...`), add:

```python
from scripts.db import get_task_for_job
from scripts.task_runner import submit_task
```

So the full import block becomes:

```python
from scripts.db import (
    DEFAULT_DB, init_db, get_jobs_by_status,
    update_cover_letter, mark_applied,
    get_task_for_job,
)
from scripts.task_runner import submit_task
```

### Step 2: Replace the Generate button section

Find this block (around line 174–185):

```python
if st.button("✨ Generate / Regenerate", use_container_width=True):
    with st.spinner("Generating via LLM…"):
        try:
            from scripts.generate_cover_letter import generate as _gen
            st.session_state[_cl_key] = _gen(
                job.get("title", ""),
                job.get("company", ""),
                job.get("description", ""),
            )
            st.rerun()
        except Exception as e:
            st.error(f"Generation failed: {e}")
```

Replace with:

```python
_cl_task = get_task_for_job(DEFAULT_DB, "cover_letter", selected_id)
_cl_running = _cl_task and _cl_task["status"] in ("queued", "running")

if st.button("✨ Generate / Regenerate", use_container_width=True, disabled=bool(_cl_running)):
    submit_task(DEFAULT_DB, "cover_letter", selected_id)
    st.rerun()

if _cl_running:
    @st.fragment(run_every=3)
    def _cl_status_fragment():
        t = get_task_for_job(DEFAULT_DB, "cover_letter", selected_id)
        if t and t["status"] in ("queued", "running"):
            lbl = "Queued…" if t["status"] == "queued" else "Generating via LLM…"
            st.info(f"⏳ {lbl}")
        else:
            st.rerun()  # full page rerun — reloads cover letter from DB
    _cl_status_fragment()
elif _cl_task and _cl_task["status"] == "failed":
    st.error(f"Generation failed: {_cl_task.get('error', 'unknown error')}")
```

Also update the session-state initialiser just below (line 171–172) so it loads from DB after background completion. The existing code already does this correctly:

```python
if _cl_key not in st.session_state:
    st.session_state[_cl_key] = job.get("cover_letter") or ""
```

This is fine — `job` is fetched fresh on each full-page rerun, so when the background thread writes to `jobs.cover_letter`, the next full rerun picks it up.

### Step 3: Smoke-test in browser

1. Navigate to Apply Workspace
2. Select an approved job
3. Click "Generate / Regenerate"
4. Navigate away to Home
5. Navigate back to Apply Workspace for the same job
6. Observe: button is disabled and "⏳ Generating via LLM…" shows while running; cover letter appears when done

### Step 4: Commit

```bash
git add app/pages/4_Apply.py
git commit -m "feat: cover letter generation runs in background, survives navigation"
```

---

## Task 5: Update 6_Interview_Prep.py to use background research

**Files:**
- Modify: `app/pages/6_Interview_Prep.py`

### Step 1: Add imports at the top of 6_Interview_Prep.py

After the existing `from scripts.db import (...)` block, add:

```python
from scripts.db import get_task_for_job
from scripts.task_runner import submit_task
```

So the full import block becomes:

```python
from scripts.db import (
    DEFAULT_DB, init_db,
    get_interview_jobs, get_contacts, get_research,
    save_research, get_task_for_job,
)
from scripts.task_runner import submit_task
```

### Step 2: Replace the "no research yet" generate button block

Find this block (around line 99–111):

```python
if not research:
    st.warning("No research brief yet for this job.")
    if st.button("🔬 Generate research brief", type="primary", use_container_width=True):
        with st.spinner("Generating… this may take 30–60 seconds"):
            try:
                from scripts.company_research import research_company
                result = research_company(job)
                save_research(DEFAULT_DB, job_id=selected_id, **result)
                st.success("Done!")
                st.rerun()
            except Exception as e:
                st.error(f"Error: {e}")
    st.stop()
else:
```

Replace with:

```python
_res_task = get_task_for_job(DEFAULT_DB, "company_research", selected_id)
_res_running = _res_task and _res_task["status"] in ("queued", "running")

if not research:
    if not _res_running:
        st.warning("No research brief yet for this job.")
        if _res_task and _res_task["status"] == "failed":
            st.error(f"Last attempt failed: {_res_task.get('error', '')}")
        if st.button("🔬 Generate research brief", type="primary", use_container_width=True):
            submit_task(DEFAULT_DB, "company_research", selected_id)
            st.rerun()

    if _res_running:
        @st.fragment(run_every=3)
        def _res_status_initial():
            t = get_task_for_job(DEFAULT_DB, "company_research", selected_id)
            if t and t["status"] in ("queued", "running"):
                lbl = "Queued…" if t["status"] == "queued" else "Generating… this may take 30–60 seconds"
                st.info(f"⏳ {lbl}")
            else:
                st.rerun()
        _res_status_initial()

    st.stop()
else:
```

### Step 3: Replace the "refresh" button block

Find this block (around line 113–124):

```python
generated_at = research.get("generated_at", "")
col_ts, col_btn = st.columns([3, 1])
col_ts.caption(f"Research generated: {generated_at}")
if col_btn.button("🔄 Refresh", use_container_width=True):
    with st.spinner("Refreshing…"):
        try:
            from scripts.company_research import research_company
            result = research_company(job)
            save_research(DEFAULT_DB, job_id=selected_id, **result)
            st.rerun()
        except Exception as e:
            st.error(f"Error: {e}")
```

Replace with:

```python
generated_at = research.get("generated_at", "")
col_ts, col_btn = st.columns([3, 1])
col_ts.caption(f"Research generated: {generated_at}")
if col_btn.button("🔄 Refresh", use_container_width=True, disabled=bool(_res_running)):
    submit_task(DEFAULT_DB, "company_research", selected_id)
    st.rerun()

if _res_running:
    @st.fragment(run_every=3)
    def _res_status_refresh():
        t = get_task_for_job(DEFAULT_DB, "company_research", selected_id)
        if t and t["status"] in ("queued", "running"):
            lbl = "Queued…" if t["status"] == "queued" else "Refreshing research…"
            st.info(f"⏳ {lbl}")
        else:
            st.rerun()
    _res_status_refresh()
elif _res_task and _res_task["status"] == "failed":
    st.error(f"Refresh failed: {_res_task.get('error', '')}")
```

### Step 4: Smoke-test in browser

1. Move a job to Phone Screen on the Interviews page
2. Navigate to Interview Prep, select that job
3. Click "Generate research brief"
4. Navigate away to Home
5. Navigate back — observe "⏳ Generating…" inline indicator
6. Wait for completion — research sections populate automatically

### Step 5: Run full test suite one final time

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```

Expected: all tests PASS

### Step 6: Commit

```bash
git add app/pages/6_Interview_Prep.py
git commit -m "feat: company research generation runs in background, survives navigation"
```

---

## Summary of Changes

| File | Change |
|------|--------|
| `scripts/db.py` | Add `CREATE_BACKGROUND_TASKS`, `init_db` call, 4 new helpers |
| `scripts/task_runner.py` | New file — `submit_task` + `_run_task` thread body |
| `app/app.py` | Add `_task_sidebar` fragment with 3s auto-refresh |
| `app/pages/4_Apply.py` | Generate button → `submit_task`; inline status fragment |
| `app/pages/6_Interview_Prep.py` | Generate/Refresh buttons → `submit_task`; inline status fragments |
| `tests/test_db.py` | 9 new tests for background_tasks helpers |
| `tests/test_task_runner.py` | New file — 6 tests for task_runner |

---

# Email Handling Design

**Date:** 2026-02-21
**Status:** Approved

## Problem

IMAP sync already pulls emails for active pipeline jobs, but two gaps exist:
1. Inbound emails suggesting a stage change (e.g. "let's schedule a call") produce no signal — the recruiter's message just sits in the email log.
2. Recruiter outreach to email addresses not yet in the pipeline is invisible — those leads never enter Job Review.

## Goals

- Surface stage-change suggestions inline on the Interviews kanban card (suggest-only, never auto-advance).
- Capture recruiter leads from unmatched inbound email and surface them in Job Review.
- Make email sync a background task triggerable from the UI (Home page + Interviews sidebar).

## Data Model

**No new tables.** Two columns added to `job_contacts`:

```sql
ALTER TABLE job_contacts ADD COLUMN stage_signal TEXT;
ALTER TABLE job_contacts ADD COLUMN suggestion_dismissed INTEGER DEFAULT 0;
```

- `stage_signal` — one of: `interview_scheduled`, `offer_received`, `rejected`, `positive_response`, `neutral` (or NULL if not yet classified).
- `suggestion_dismissed` — 1 when the user clicks Dismiss; prevents the banner re-appearing.

Email leads reuse the existing `jobs` table with `source = 'email'` and `status = 'pending'`. No new columns needed.

## Components

### 1. Stage Signal Classification (`scripts/imap_sync.py`)

After saving each **inbound** contact row, call `phi3:mini` via Ollama to classify the email into one of the five labels. Store the result in `stage_signal`. If classification fails, default to `NULL` (no suggestion shown).

**Model:** `phi3:mini` via `LLMRouter.complete(model_override="phi3:mini", fallback_order=["ollama_research"])`.
Benchmarked at 100% accuracy / 3.0 s per email on a 12-case test suite. The runner-up, Qwen2.5-3B, was not benchmarked, so phi3:mini is the safe choice.

### 2. Recruiter Lead Extraction (`scripts/imap_sync.py`)

A second pass after per-job sync: scan INBOX broadly for recruitment-keyword emails that don't match any known pipeline company. For each unmatched email, call **Nemotron 1.5B** (already in use for company research) to extract `{company, title}`. If extraction returns a company name not already in the DB, insert a new job row `source='email', status='pending'`.

**Dedup:** checked by `message_id` against all known contacts (cross-job), plus `url` uniqueness on the jobs table (the email lead URL is set to a synthetic `email://<from_domain>/<message_id>` value).
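A hypothetical helper showing how that synthetic URL could be built; `lead_url` and its argument names are illustrative, and only the `email://<from_domain>/<message_id>` scheme comes from the design:

```python
from email.utils import parseaddr
from urllib.parse import quote


def lead_url(from_header: str, message_id: str) -> str:
    """Build the synthetic email:// URL used for jobs-table dedup (sketch)."""
    _, addr = parseaddr(from_header)            # "Sarah <sarah@acme.com>" -> sarah@acme.com
    domain = addr.rsplit("@", 1)[-1].lower() or "unknown"
    # percent-encode the Message-ID so it is safe inside a URL path segment
    return f"email://{domain}/{quote(message_id.strip('<>'), safe='')}"
```

Because `jobs.url` is unique, re-processing the same email produces the same URL and the insert is rejected.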

### 3. Background Task (`scripts/task_runner.py`)

New task type: `email_sync` with `job_id = 0`.
`submit_task(db, "email_sync", 0)` → daemon thread → `sync_all()` → returns summary via task `error` field.

Deduplication: only one `email_sync` can be queued/running at a time (existing insert_task logic handles this).
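The thread body for the new task type might look like this sketch; `run_email_sync`, the injected callables, and the summary dict shape are assumptions, not existing code:

```python
# Sketch only — update_task_status and sync_all are injected so the flow
# can be shown (and tested) without the real DB or IMAP layer.
def run_email_sync(db_path, task_id, update_task_status, sync_all):
    update_task_status(db_path, task_id, "running")
    try:
        summary = sync_all()  # assumed to return e.g. {"synced": 12, "leads": 2}
        # per the design above, the human-readable summary rides in the error field
        update_task_status(db_path, task_id, "completed",
                           error=f"{summary.get('synced', 0)} emails synced")
    except Exception as exc:
        update_task_status(db_path, task_id, "failed", error=str(exc))
```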

### 4. UI — Sync Button (Home + Interviews)

**Home.py:** New "Sync Emails" section alongside Find Jobs / Score / Notion sync.
**5_Interviews.py:** Existing sync button already present in sidebar; convert from synchronous `sync_all()` call to `submit_task()` + fragment polling.

### 5. UI — Email Leads (Job Review)

When `show_status == "pending"`, prepend email leads (`source = 'email'`) at the top of the list with a distinct `📧 Email Lead` badge. Actions are identical to scraped pending jobs (Approve / Reject).

### 6. UI — Stage Suggestion Banner (Interviews Kanban)

Inside `_render_card()`, before the advance/reject buttons, check for unseen stage signals:

```
💡 Email suggests: interview_scheduled
From: sarah@company.com · "Let's book a call"
[→ Move to Phone Screen] [Dismiss]
```

- "Move" calls `advance_to_stage()` + `submit_task("company_research")` then reruns.
- "Dismiss" calls `dismiss_stage_signal(contact_id)` then reruns.
- Only the most recent undismissed signal is shown per card.

## Error Handling

| Failure | Behaviour |
|---------|-----------|
| IMAP connection fails | Error stored in task `error` field; shown as warning in UI after sync |
| Classifier call fails | `stage_signal` left NULL; no suggestion shown; sync continues |
| Lead extractor fails | Email skipped; appended to `result["errors"]`; sync continues |
| Duplicate `email_sync` task | `insert_task` returns existing id; no new thread spawned |
| LLM extraction returns no company | Email silently skipped (not a lead) |

## Out of Scope

- Auto-advancing pipeline stage (suggest only).
- Sending email replies from the app (draft helper already exists).
- OAuth / token-refresh IMAP (config/email.yaml credentials only).

---

# Research Workflow Redesign

**Date:** 2026-02-22
**Status:** Approved

## Problem

The current `company_research.py` produces shallow output:
- Resume context is a hardcoded 2-sentence blurb — talking points aren't grounded in Alex's actual experience
- Search coverage is limited: CEO, HQ, LinkedIn, one generic news query
- Output has 4 sections; new data categories (tech stack, funding, culture, competitors) have nowhere to go
- No skills/keyword config to drive experience matching against the JD

## Approach: Query Expansion + Parallel JSON Searches + Single LLM Pass

Run all searches (companyScraper sequential plus the new parallel SearXNG JSON queries), aggregate the results into a structured context block, pre-select resume experiences by keyword score, and make a single LLM call that produces all expanded sections.

---

## Design

### 1. Search Pipeline

**Phase 1 — companyScraper (unchanged, sequential)**
- CEO name, HQ address, LinkedIn URL

**Phase 1b — Parallel SearXNG JSON queries (new/expanded)**

Six queries run concurrently via daemon threads:

| Intent | Query pattern |
|---|---|
| Recent news/press | `"{company}" news 2025 2026` |
| Funding & investors | `"{company}" funding round investors Series valuation` |
| Tech stack | `"{company}" tech stack engineering technology platform` |
| Competitors | `"{company}" competitors alternatives vs market` |
| Culture / Glassdoor | `"{company}" glassdoor culture reviews employees` |
| CEO press (if found) | `"{ceo}" "{company}"` |

Each returns 3–4 deduplicated snippets (title + content + URL), labeled by type.
Results are best-effort — any failed query is silently skipped.
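The fan-out can be sketched as below; `search` stands in for the SearXNG JSON client (an assumption, as the client itself isn't shown here), and the best-effort rule becomes a bare `except` that drops the label:

```python
import threading


def run_parallel_searches(queries: dict[str, str], search) -> dict[str, list]:
    """Run labeled search queries concurrently; skip any that fail (sketch)."""
    results: dict[str, list] = {}
    lock = threading.Lock()

    def worker(label: str, query: str) -> None:
        try:
            snippets = search(query)  # expected: list of {title, content, url}
        except Exception:
            return                    # best-effort: failed query silently skipped
        with lock:
            results[label] = snippets

    threads = [threading.Thread(target=worker, args=item, daemon=True)
               for item in queries.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=20)            # don't hang research on one slow query
    return results
```

Daemon threads plus a join timeout mean one unresponsive search engine can delay, but never block, the rest of the research run.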

---

### 2. Resume Matching

**`config/resume_keywords.yaml`** — three categories, tag-managed via Settings UI:

```yaml
skills:
  - Customer Success
  - Technical Account Management
  - Revenue Operations
  - Salesforce
  - Gainsight
  - data analysis
  - stakeholder management

domains:
  - B2B SaaS
  - enterprise software
  - security / compliance
  - post-sale lifecycle

keywords:
  - QBR
  - churn reduction
  - NRR / ARR
  - onboarding
  - renewal
  - executive sponsorship
  - VOC
```

**Matching logic:**
1. Case-insensitive substring check of all keywords against JD text → `matched_keywords` list
2. Score each experience entry: count of matched keywords appearing in position title + responsibility bullets
3. Top 2 by score → included in prompt as full detail (position, company, period, all bullets)
4. Remaining entries → condensed one-liners ("Founder @ M3 Consulting, 2023–present")
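The four steps above reduce to plain substring logic; this is a sketch, and field names like `position` and `bullets` are illustrative rather than the actual resume schema:

```python
def match_keywords(jd_text: str, keywords: list[str]) -> list[str]:
    """Step 1: case-insensitive substring check against the JD."""
    jd = jd_text.lower()
    return [k for k in keywords if k.lower() in jd]


def score_experience(entry: dict, matched: list[str]) -> int:
    """Step 2: count matched keywords in position title + bullets."""
    text = " ".join([entry.get("position", "")] + entry.get("bullets", [])).lower()
    return sum(1 for k in matched if k.lower() in text)


def select_experiences(entries: list[dict], matched: list[str], top_n: int = 2):
    """Steps 3-4: top-N entries get full detail, the rest become one-liners."""
    ranked = sorted(entries, key=lambda e: score_experience(e, matched), reverse=True)
    return ranked[:top_n], ranked[top_n:]
```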

**UpGuard NDA rule** (explicit in prompt): reference as "enterprise security vendor" in general; only name UpGuard directly if the role has a strong security/compliance focus.

---

### 3. LLM Context Block Structure

```
## Role Context
{title} at {company}

## Job Description
{JD text, up to 2500 chars}

## Alex's Matched Experience
[Top 2 scored experience entries — full detail]

Also in Alex's background: [remaining entries as one-liners]

## Matched Skills & Keywords
Skills matching this JD: {matched_keywords joined}

## Live Company Data
- CEO: {name}
- HQ: {location}
- LinkedIn: {url}

## News & Press
[snippets]

## Funding & Investors
[snippets]

## Tech Stack
[snippets]

## Competitors
[snippets]

## Culture & Employee Signals
[snippets]
```

---

### 4. Output Sections (7, up from 4)

| Section header | Purpose |
|---|---|
| `## Company Overview` | What they do, business model, size/stage, market position |
| `## Leadership & Culture` | CEO background, leadership team, philosophy |
| `## Tech Stack & Product` | What they build, relevant technology, product direction |
| `## Funding & Market Position` | Stage, investors, recent rounds, competitor landscape |
| `## Recent Developments` | News, launches, pivots, exec moves |
| `## Red Flags & Watch-outs` | Culture issues, layoffs, exec departures, financial stress |
| `## Talking Points for Alex` | 5 role-matched, resume-grounded, UpGuard-aware talking points ready to speak aloud |

The talking-points prompt instructs the LLM to cite the specific matched experience by name, reference matched skills, apply the UpGuard NDA rule, and frame each point as a ready-to-speak sentence.

---

### 5. DB Schema Changes

Add columns to `company_research` table:

```sql
ALTER TABLE company_research ADD COLUMN tech_brief TEXT;
ALTER TABLE company_research ADD COLUMN funding_brief TEXT;
ALTER TABLE company_research ADD COLUMN competitors_brief TEXT;
ALTER TABLE company_research ADD COLUMN red_flags TEXT;
```

Existing columns (`company_brief`, `ceo_brief`, `talking_points`, `raw_output`) unchanged.
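Since `init_db` runs on every app start, the ALTERs need to be idempotent; one common SQLite pattern is to introspect the table first (a sketch of how `scripts/db.py` could handle it, not the existing migration code):

```python
import sqlite3

NEW_COLUMNS = ["tech_brief", "funding_brief", "competitors_brief", "red_flags"]


def migrate_company_research(conn: sqlite3.Connection) -> None:
    """Add the new columns if missing; safe to run on every init_db call."""
    # PRAGMA table_info returns one row per column; index 1 is the column name
    existing = {row[1] for row in conn.execute("PRAGMA table_info(company_research)")}
    for col in NEW_COLUMNS:
        if col not in existing:
            conn.execute(f"ALTER TABLE company_research ADD COLUMN {col} TEXT")
    conn.commit()
```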
|
||||
|
||||
---
|
||||
|
||||
### 6. Settings UI — Skills & Keywords Tab

New tab in `app/pages/2_Settings.py`:

- One expander or subheader per category (Skills, Domains, Keywords)
- Tag chips rendered with `st.pills` or columns of `st.badge`-style buttons with ×
- Inline text input + Add button per category
- Each add/remove saves immediately to `config/resume_keywords.yaml`

---
### 7. Interview Prep UI Changes

`app/pages/6_Interview_Prep.py` — render new sections alongside existing ones:

- Tech Stack & Product (new panel)
- Funding & Market Position (new panel)
- Red Flags & Watch-outs (new panel, visually distinct — e.g. orange/amber)
- Talking Points promoted to top (most useful during a live call)

---
## Files Affected

| File | Change |
|---|---|
| `scripts/company_research.py` | Parallel search queries, resume matching, expanded prompt + sections |
| `scripts/db.py` | Add 4 new columns to `company_research`; update `save_research` / `get_research` |
| `config/resume_keywords.yaml` | New file |
| `config/resume_keywords.yaml.example` | New committed template |
| `app/pages/2_Settings.py` | New Skills & Keywords tab |
| `app/pages/6_Interview_Prep.py` | Render new sections |
| `tests/test_db.py` | Tests for new columns |
| `tests/test_company_research.py` | New test file for matching logic + section parsing |
@@ -1,869 +0,0 @@
# Research Workflow Redesign — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Expand company research to gather richer web data (funding, tech stack, competitors, culture/Glassdoor, news), match Alex's resume experience against the JD, and produce a 7-section brief with role-grounded talking points.

**Architecture:** Parallel SearXNG JSON queries (6 types) feed a structured context block alongside tiered resume experience (top-2 scored full, rest condensed) from `config/resume_keywords.yaml`. A single LLM call produces 7 output sections stored in expanded DB columns.

**Tech Stack:** Python threading, requests (SearXNG JSON API at `http://localhost:8888/search?format=json`), PyYAML, SQLite ALTER TABLE migrations, Streamlit `st.pills` / column chips.

**Design doc:** `docs/plans/2026-02-22-research-workflow-design.md`

**Run tests:** `/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v`
**Python:** `conda run -n job-seeker python <script>`

---
### Task 1: DB migration — add 4 new columns to `company_research`

The project uses a migrations-list + `_migrate_db()` pattern (see `_CONTACT_MIGRATIONS` in `scripts/db.py:81-107`). Add the new columns there so existing DBs are upgraded automatically on `init_db()`.

**Files:**
- Modify: `scripts/db.py`
- Modify: `tests/test_db.py`
**Step 1: Write the failing tests**

Add to `tests/test_db.py`:

```python
def test_company_research_has_new_columns(tmp_path):
    db = tmp_path / "test.db"
    init_db(db)
    conn = sqlite3.connect(db)
    cols = [r[1] for r in conn.execute("PRAGMA table_info(company_research)").fetchall()]
    conn.close()
    assert "tech_brief" in cols
    assert "funding_brief" in cols
    assert "competitors_brief" in cols
    assert "red_flags" in cols


def test_save_and_get_research_new_fields(tmp_path):
    db = tmp_path / "test.db"
    init_db(db)
    # Insert a job first
    conn = sqlite3.connect(db)
    conn.execute("INSERT INTO jobs (title, company) VALUES ('TAM', 'Acme')")
    job_id = conn.execute("SELECT last_insert_rowid()").fetchone()[0]
    conn.commit()
    conn.close()

    save_research(db, job_id=job_id,
                  company_brief="overview", ceo_brief="ceo",
                  talking_points="points", raw_output="raw",
                  tech_brief="tech stack", funding_brief="series B",
                  competitors_brief="vs competitors", red_flags="none")
    r = get_research(db, job_id=job_id)
    assert r["tech_brief"] == "tech stack"
    assert r["funding_brief"] == "series B"
    assert r["competitors_brief"] == "vs competitors"
    assert r["red_flags"] == "none"
```
**Step 2: Run to confirm failure**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py::test_company_research_has_new_columns tests/test_db.py::test_save_and_get_research_new_fields -v
```

Expected: FAIL — columns and parameters don't exist yet.
**Step 3: Add `_RESEARCH_MIGRATIONS` and wire into `_migrate_db`**

In `scripts/db.py`, after `_CONTACT_MIGRATIONS` (line ~53), add:

```python
_RESEARCH_MIGRATIONS = [
    ("tech_brief", "TEXT"),
    ("funding_brief", "TEXT"),
    ("competitors_brief", "TEXT"),
    ("red_flags", "TEXT"),
]
```

In `_migrate_db()`, after the `_CONTACT_MIGRATIONS` loop, add:

```python
for col, coltype in _RESEARCH_MIGRATIONS:
    try:
        conn.execute(f"ALTER TABLE company_research ADD COLUMN {col} {coltype}")
    except sqlite3.OperationalError:
        pass
```
**Step 4: Update `save_research` signature and SQL**

Replace the existing `save_research` function:

```python
def save_research(db_path: Path = DEFAULT_DB, job_id: int | None = None,
                  company_brief: str = "", ceo_brief: str = "",
                  talking_points: str = "", raw_output: str = "",
                  tech_brief: str = "", funding_brief: str = "",
                  competitors_brief: str = "", red_flags: str = "") -> None:
    """Insert or replace a company research record for a job."""
    now = datetime.now().isoformat()[:16]
    conn = sqlite3.connect(db_path)
    conn.execute(
        """INSERT INTO company_research
               (job_id, generated_at, company_brief, ceo_brief, talking_points,
                raw_output, tech_brief, funding_brief, competitors_brief, red_flags)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
           ON CONFLICT(job_id) DO UPDATE SET
               generated_at = excluded.generated_at,
               company_brief = excluded.company_brief,
               ceo_brief = excluded.ceo_brief,
               talking_points = excluded.talking_points,
               raw_output = excluded.raw_output,
               tech_brief = excluded.tech_brief,
               funding_brief = excluded.funding_brief,
               competitors_brief = excluded.competitors_brief,
               red_flags = excluded.red_flags""",
        (job_id, now, company_brief, ceo_brief, talking_points, raw_output,
         tech_brief, funding_brief, competitors_brief, red_flags),
    )
    conn.commit()
    conn.close()
```

(`get_research` uses `SELECT *` so it picks up new columns automatically — no change needed.)
**Step 5: Run tests**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py -v
```

Expected: all pass.
**Step 6: Commit**

```bash
git add scripts/db.py tests/test_db.py
git commit -m "feat: add tech_brief, funding_brief, competitors_brief, red_flags to company_research"
```

---
### Task 2: Create `config/resume_keywords.yaml` and example

**Files:**
- Create: `config/resume_keywords.yaml`
- Create: `config/resume_keywords.yaml.example`

**Step 1: Create `config/resume_keywords.yaml`**

```yaml
skills:
  - Customer Success
  - Technical Account Management
  - Revenue Operations
  - Salesforce
  - Gainsight
  - data analysis
  - stakeholder management
  - project management
  - onboarding
  - renewal management

domains:
  - B2B SaaS
  - enterprise software
  - security / compliance
  - post-sale lifecycle
  - SaaS metrics

keywords:
  - QBR
  - churn reduction
  - NRR
  - ARR
  - MRR
  - executive sponsorship
  - VOC
  - health score
  - escalation management
  - cross-functional
  - product feedback loop
  - customer advocacy
```

**Step 2: Copy to `.example`**

```bash
cp config/resume_keywords.yaml config/resume_keywords.yaml.example
```

**Step 3: Add to `.gitignore` if personal, or commit both**

`resume_keywords.yaml` contains Alex's personal keywords — commit both (no secrets).

**Step 4: Commit**

```bash
git add config/resume_keywords.yaml config/resume_keywords.yaml.example
git commit -m "feat: add resume_keywords.yaml for research experience matching"
```

---
### Task 3: Resume matching logic in `company_research.py`

Load the resume YAML and keywords config, score experience entries against the JD, and return a tiered context string.

**Files:**
- Modify: `scripts/company_research.py`
- Create: `tests/test_company_research.py`
**Step 1: Write failing tests**

Create `tests/test_company_research.py`:

```python
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

from scripts.company_research import _score_experiences, _build_resume_context


RESUME_YAML = {
    "experience_details": [
        {
            "position": "Lead Technical Account Manager",
            "company": "UpGuard",
            "employment_period": "10/2022 - 05/2023",
            "key_responsibilities": [
                {"r1": "Managed enterprise security accounts worth $2M ARR"},
                {"r2": "Led QBR cadence with C-suite stakeholders"},
            ],
        },
        {
            "position": "Founder and Principal Consultant",
            "company": "M3 Consulting Services",
            "employment_period": "07/2023 - Present",
            "key_responsibilities": [
                {"r1": "Revenue operations consulting for SaaS clients"},
                {"r2": "Built customer success frameworks"},
            ],
        },
        {
            "position": "Customer Success Manager",
            "company": "Generic Co",
            "employment_period": "01/2020 - 09/2022",
            "key_responsibilities": [
                {"r1": "Managed SMB portfolio"},
            ],
        },
    ]
}

KEYWORDS = ["ARR", "QBR", "enterprise", "security", "stakeholder"]
JD = "Looking for a TAM with enterprise ARR experience and QBR facilitation skills."


def test_score_experiences_returns_sorted():
    scored = _score_experiences(RESUME_YAML["experience_details"], KEYWORDS, JD)
    # UpGuard should score highest (ARR, QBR, and enterprise appear in both
    # its bullets and the JD; "stakeholder" and "security" miss the JD)
    assert scored[0]["company"] == "UpGuard"


def test_build_resume_context_top2_full_rest_condensed():
    ctx = _build_resume_context(RESUME_YAML, KEYWORDS, JD)
    # Full detail for top 2
    assert "Lead Technical Account Manager" in ctx
    assert "Managed enterprise security accounts" in ctx
    # Condensed for rest
    assert "Also in Alex" in ctx
    assert "Generic Co" in ctx
    # UpGuard NDA note present
    assert "NDA" in ctx or "enterprise security vendor" in ctx
```
**Step 2: Run to confirm failure**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_company_research.py -v
```

Expected: FAIL — functions don't exist.
**Step 3: Implement `_score_experiences` and `_build_resume_context`**

Add to `scripts/company_research.py`, after the `_parse_sections` function. Note the NDA gate: a raw match-score threshold would reveal UpGuard for any strongly matching JD, so the reveal is gated on explicit security/compliance terms in the JD instead.

```python
_RESUME_YAML = Path(__file__).parent.parent / "aihawk" / "data_folder" / "plain_text_resume.yaml"
_KEYWORDS_YAML = Path(__file__).parent.parent / "config" / "resume_keywords.yaml"

# Companies where Alex has an NDA — reference the engagement but not specifics
# unless the JD is clearly security/compliance-focused.
_NDA_COMPANIES = {"upguard"}


def _score_experiences(experiences: list[dict], keywords: list[str], jd: str) -> list[dict]:
    """
    Score each experience entry by how many keywords appear both in its text
    and in the JD. Returns experiences sorted descending by score, with a
    'score' key added.
    """
    jd_lower = jd.lower()
    scored = []
    for exp in experiences:
        text = " ".join([
            exp.get("position", ""),
            exp.get("company", ""),
            " ".join(
                v
                for resp in exp.get("key_responsibilities", [])
                for v in resp.values()
            ),
        ]).lower()
        score = sum(1 for kw in keywords if kw.lower() in text and kw.lower() in jd_lower)
        scored.append({**exp, "score": score})
    return sorted(scored, key=lambda x: x["score"], reverse=True)


def _build_resume_context(resume: dict, keywords: list[str], jd: str) -> str:
    """
    Build the resume section of the LLM context block.
    Top 2 scored experiences included in full detail; rest as one-liners.
    Applies the UpGuard NDA rule: reference as 'enterprise security vendor'
    unless the JD is clearly security/compliance-focused.
    """
    experiences = resume.get("experience_details", [])
    if not experiences:
        return ""

    scored = _score_experiences(experiences, keywords, jd)
    top2 = scored[:2]
    rest = scored[2:]

    # Gate the NDA reveal on explicit security/compliance terms in the JD,
    # not on the overall match score.
    jd_lower = jd.lower()
    security_focused = any(t in jd_lower for t in ("security", "compliance", "grc"))

    def _exp_label(exp: dict) -> str:
        company = exp.get("company", "")
        if company.lower() in _NDA_COMPANIES and not security_focused:
            company = "enterprise security vendor (NDA)"
        return f"{exp.get('position', '')} @ {company} ({exp.get('employment_period', '')})"

    def _exp_bullets(exp: dict) -> str:
        bullets = []
        for resp in exp.get("key_responsibilities", []):
            bullets.extend(resp.values())
        return "\n".join(f"  - {b}" for b in bullets)

    lines = ["## Alex's Matched Experience"]
    for exp in top2:
        lines.append(f"\n**{_exp_label(exp)}** (match score: {exp['score']})")
        lines.append(_exp_bullets(exp))

    if rest:
        condensed = ", ".join(_exp_label(e) for e in rest)
        lines.append(f"\nAlso in Alex's background: {condensed}")

    return "\n".join(lines)


def _load_resume_and_keywords() -> tuple[dict, list[str]]:
    """Load resume YAML and keywords config. Returns (resume_dict, all_keywords)."""
    import yaml as _yaml

    resume = {}
    if _RESUME_YAML.exists():
        resume = _yaml.safe_load(_RESUME_YAML.read_text()) or {}

    keywords: list[str] = []
    if _KEYWORDS_YAML.exists():
        kw_cfg = _yaml.safe_load(_KEYWORDS_YAML.read_text()) or {}
        for lst in kw_cfg.values():
            if isinstance(lst, list):
                keywords.extend(lst)

    return resume, keywords
```
**Step 4: Run tests**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_company_research.py -v
```

Expected: all pass.

**Step 5: Commit**

```bash
git add scripts/company_research.py tests/test_company_research.py
git commit -m "feat: add resume experience matching and tiered context builder"
```

---
### Task 4: Parallel search queries (Phase 1b expansion)

Replace the current single-threaded news fetch with 6 parallel SearXNG queries. Each runs in its own daemon thread and writes to a shared results dict.

**Files:**
- Modify: `scripts/company_research.py`
**Step 1: Replace `_fetch_recent_news` with `_fetch_search_data`**

Remove the existing `_fetch_recent_news` function and replace with:

```python
_SEARCH_QUERIES = {
    "news": '"{company}" news 2025 2026',
    "funding": '"{company}" funding round investors Series valuation',
    "tech": '"{company}" tech stack engineering technology platform',
    "competitors": '"{company}" competitors alternatives vs market',
    "culture": '"{company}" glassdoor culture reviews employees',
    "ceo_press": '"{ceo}" "{company}"',  # only used if ceo is known
}


def _run_search_query(query: str, results: dict, key: str) -> None:
    """Thread target: run one SearXNG JSON query, store up to 4 snippets in results[key]."""
    import requests

    snippets: list[str] = []
    seen: set[str] = set()
    try:
        resp = requests.get(
            "http://localhost:8888/search",
            params={"q": query, "format": "json", "language": "en-US"},
            timeout=12,
        )
        if resp.status_code != 200:
            return
        for r in resp.json().get("results", [])[:4]:
            url = r.get("url", "")
            if url in seen:
                continue
            seen.add(url)
            title = r.get("title", "").strip()
            content = r.get("content", "").strip()
            if title or content:
                snippets.append(f"- **{title}**\n  {content}\n  <{url}>")
    except Exception:
        pass
    results[key] = "\n\n".join(snippets)


def _fetch_search_data(company: str, ceo: str = "") -> dict[str, str]:
    """
    Run all search queries in parallel threads.
    Returns dict keyed by search type (news, funding, tech, competitors, culture, ceo_press).
    Failed or skipped queries leave their key empty or absent — callers should use .get().
    """
    import threading

    results: dict[str, str] = {}
    threads = []

    for key, pattern in _SEARCH_QUERIES.items():
        if key == "ceo_press" and (not ceo or ceo.lower() in ("not found", "")):
            continue
        query = pattern.format(company=company, ceo=ceo)
        t = threading.Thread(
            target=_run_search_query,
            args=(query, results, key),
            daemon=True,
        )
        threads.append(t)
        t.start()

    for t in threads:
        t.join(timeout=15)  # don't block the task indefinitely

    return results
```
**Step 2: Update Phase 1b in `research_company()` to call `_fetch_search_data`**

Replace the Phase 1b block (note `except Exception`, not `BaseException` — the latter would swallow `KeyboardInterrupt`):

```python
# ── Phase 1b: parallel search queries ────────────────────────────────────
search_data: dict[str, str] = {}
if use_scraper and _searxng_running():
    try:
        ceo_name = (live_data.get("ceo") or "") if live_data else ""
        search_data = _fetch_search_data(company, ceo=ceo_name)
    except Exception:
        pass  # best-effort; never fail the whole task
```
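The guard above also calls `_searxng_running()`, which the plan assumes already exists elsewhere in the script. If it needs (re)writing, a stdlib-only liveness probe could look like this (the endpoint, port, and timeout are assumptions carried over from the Tech Stack note):

```python
from urllib.request import urlopen
from urllib.error import URLError

def searxng_running_sketch(base_url: str = "http://localhost:8888") -> bool:
    # Cheap liveness probe: a 200 from the search endpoint counts as
    # "running"; connection errors mean the SearXNG container is down.
    try:
        with urlopen(f"{base_url}/search?q=ping&format=json", timeout=3) as resp:
            return resp.status == 200
    except (URLError, OSError, ValueError):
        return False
```

Keeping the probe's timeout short matters: it runs on every research task, even when SearXNG is stopped.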
**Step 3: Build per-section notes for the prompt**

After the Phase 1b block, add:

```python
def _section_note(key: str, label: str) -> str:
    text = search_data.get(key, "").strip()
    return f"\n\n## {label} (live web search)\n\n{text}" if text else ""

news_note        = _section_note("news", "News & Press")
funding_note     = _section_note("funding", "Funding & Investors")
tech_note        = _section_note("tech", "Tech Stack")
competitors_note = _section_note("competitors", "Competitors")
culture_note     = _section_note("culture", "Culture & Employee Signals")
ceo_press_note   = _section_note("ceo_press", "CEO in the News")
```
**Step 4: No automated test (threading + network) — manual smoke test**

```bash
conda run -n job-seeker python scripts/company_research.py --job-id <any_valid_id>
```

Verify log output shows 6 search threads completing within ~15s total.
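Step 4 skips automated coverage because the real workers hit the network, but the fan-out/join shape itself is deterministic and could be exercised with an injected worker. A sketch of that pattern (function names here are hypothetical, not part of the plan's test suite):

```python
import threading

def fan_out(queries: dict[str, str], worker) -> dict[str, str]:
    # Same shape as _fetch_search_data: one daemon thread per query, all
    # writing distinct keys into a shared dict, then a bounded join.
    results: dict[str, str] = {}
    threads = []
    for key, query in queries.items():
        t = threading.Thread(target=worker, args=(query, results, key), daemon=True)
        threads.append(t)
        t.start()
    for t in threads:
        t.join(timeout=15)
    return results

def fake_worker(query: str, results: dict, key: str) -> None:
    # Stands in for _run_search_query without any network access.
    results[key] = f"snippets for {query}"
```

Writes to distinct dict keys from separate threads are safe in CPython, which is what lets the workers share one `results` dict without a lock.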
**Step 5: Commit**

```bash
git add scripts/company_research.py
git commit -m "feat: parallel SearXNG search queries (funding, tech, competitors, culture, news)"
```

---
### Task 5: Expanded LLM prompt and section parsing

Wire resume context + all search data into the prompt, update section headers, update the `_parse_sections` mapping, and update the `research_company()` return dict.
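The plan never shows `_parse_sections` itself; for orientation, a minimal sketch of the kind of header-splitting parser it implies (the repo's actual implementation may differ):

```python
import re

def parse_sections_sketch(raw: str) -> dict[str, str]:
    # Split LLM markdown output on level-2 headers; keys are the header
    # titles, values are the body text up to the next header.
    sections: dict[str, str] = {}
    current = None
    body: list[str] = []
    for line in raw.splitlines():
        m = re.match(r"^##\s+(.*)", line)
        if m:
            if current is not None:
                sections[current] = "\n".join(body).strip()
            current = m.group(1).strip()
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        sections[current] = "\n".join(body).strip()
    return sections
```

As long as the prompt pins the seven headers verbatim, a parser of this shape needs no changes beyond the `sections.get(...)` keys used in the return dict.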
**Files:**
- Modify: `scripts/company_research.py`
**Step 1: Load resume in `research_company()` and build context**

At the top of `research_company()`, after `jd_excerpt`, add:

```python
resume, keywords = _load_resume_and_keywords()
matched_keywords = [kw for kw in keywords if kw.lower() in jd_excerpt.lower()]
resume_context = _build_resume_context(resume, keywords, jd_excerpt)
keywords_note = (
    f"\n\n## Matched Skills & Keywords\nSkills matching this JD: {', '.join(matched_keywords)}"
    if matched_keywords else ""
)
```
**Step 2: Replace the Phase 2 LLM prompt**

Replace the existing `prompt = f"""..."""` block with:

```python
prompt = f"""You are preparing Alex Rivera for a job interview.

Role: **{title}** at **{company}**

## Job Description
{jd_excerpt}
{resume_context}{keywords_note}

## Live Company Data (SearXNG)
{scrape_note.strip() or "_(scrape unavailable)_"}
{news_note}{funding_note}{tech_note}{competitors_note}{culture_note}{ceo_press_note}

---

Produce a structured research brief using **exactly** these seven markdown section headers
(include all seven even if a section has limited data — say so honestly):

## Company Overview
What {company} does, core product/service, business model, size/stage (startup / scale-up / enterprise), market positioning.

## Leadership & Culture
CEO background and leadership style, key execs, mission/values statements, Glassdoor themes.

## Tech Stack & Product
Technologies, platforms, and product direction relevant to the {title} role.

## Funding & Market Position
Funding stage, key investors, recent rounds, burn/growth signals, competitor landscape.

## Recent Developments
News, launches, acquisitions, exec moves, pivots, or press from the past 12–18 months.
Draw on the live snippets above; if none are available, note what is publicly known.

## Red Flags & Watch-outs
Culture issues, layoffs, exec departures, financial stress, or Glassdoor concerns worth knowing before the call.
If nothing notable, write "No significant red flags identified."

## Talking Points for Alex
Five specific talking points for the phone screen. Each must:
- Reference a concrete experience from Alex's matched background by name
  (UpGuard NDA rule: say "enterprise security vendor" unless the role has a clear security focus)
- Connect to a specific signal from the JD or company context above
- Be 1–2 sentences, ready to speak aloud
- Never give generic advice

---
⚠️ This brief combines live web data and LLM training knowledge. Verify key facts before the call.
"""
```
**Step 3: Update the return dict**

Replace the existing return block:

```python
return {
    "raw_output": raw,
    "company_brief": sections.get("Company Overview", ""),
    "ceo_brief": sections.get("Leadership & Culture", ""),
    "tech_brief": sections.get("Tech Stack & Product", ""),
    "funding_brief": sections.get("Funding & Market Position", ""),
    "competitors_brief": sections.get("Funding & Market Position", ""),  # same section covers competitors
    "red_flags": sections.get("Red Flags & Watch-outs", ""),
    "talking_points": sections.get("Talking Points for Alex", ""),
}
```

Note: `competitors_brief` pulls from the Funding & Market Position section (which includes competitors). Recent Developments lives only in `raw_output` — no separate column needed.
**Step 4: Manual smoke test**

```bash
conda run -n job-seeker python scripts/company_research.py --job-id <valid_id>
```

Verify all 7 sections appear in output and `save_research` receives all fields.

**Step 5: Commit**

```bash
git add scripts/company_research.py
git commit -m "feat: expanded research prompt with resume context, 7 output sections"
```

---
### Task 6: Interview Prep UI — render new sections

**Files:**
- Modify: `app/pages/6_Interview_Prep.py`

**Step 1: Replace the left-panel section rendering**

Find the existing section block (after `st.divider()` at line ~145) and replace with:
```python
# ── Talking Points (top — most useful during a live call) ─────────────────
st.subheader("🎯 Talking Points")
tp = (research.get("talking_points") or "").strip()
if tp:
    st.markdown(tp)
else:
    st.caption("_No talking points extracted — try regenerating._")

st.divider()

# ── Company brief ─────────────────────────────────────────────────────────
st.subheader("🏢 Company Overview")
st.markdown(research.get("company_brief") or "_—_")

st.divider()

# ── Leadership & culture ──────────────────────────────────────────────────
st.subheader("👤 Leadership & Culture")
st.markdown(research.get("ceo_brief") or "_—_")

st.divider()

# ── Tech Stack ────────────────────────────────────────────────────────────
# New columns are NULL for rows saved before the migration, so guard with
# `or ""` before stripping.
tech = (research.get("tech_brief") or "").strip()
if tech:
    st.subheader("⚙️ Tech Stack & Product")
    st.markdown(tech)
    st.divider()

# ── Funding & Market ──────────────────────────────────────────────────────
funding = (research.get("funding_brief") or "").strip()
if funding:
    st.subheader("💰 Funding & Market Position")
    st.markdown(funding)
    st.divider()

# ── Red Flags ─────────────────────────────────────────────────────────────
red = (research.get("red_flags") or "").strip()
if red and "no significant red flags" not in red.lower():
    st.subheader("⚠️ Red Flags & Watch-outs")
    st.warning(red)
    st.divider()

# ── Practice Q&A ──────────────────────────────────────────────────────────
with st.expander("🎤 Practice Q&A (pre-call prep)", expanded=False):
    # ... existing Q&A code unchanged ...
```

Note: The existing Practice Q&A expander code stays exactly as-is inside the expander — only move/restructure the section headers above it.
**Step 2: Restart Streamlit and visually verify**

```bash
bash scripts/manage-ui.sh restart
```

Navigate to Interview Prep → verify new sections appear, Red Flags renders in an amber warning box, and the Tech/Funding sections only show when populated.

**Step 3: Commit**

```bash
git add app/pages/6_Interview_Prep.py
git commit -m "feat: render tech, funding, red flags sections in Interview Prep"
```

---
### Task 7: Settings UI — Skills & Keywords tab

**Files:**
- Modify: `app/pages/2_Settings.py`

**Step 1: Add `KEYWORDS_CFG` path constant**

After the existing config path constants (line ~19), add:

```python
KEYWORDS_CFG = CONFIG_DIR / "resume_keywords.yaml"
```
**Step 2: Add the tab to the tab bar**

Change:
```python
tab_search, tab_llm, tab_notion, tab_services, tab_resume, tab_email = st.tabs(
    ["🔎 Search", "🤖 LLM Backends", "📚 Notion", "🔌 Services", "📝 Resume Profile", "📧 Email"]
)
```
To:
```python
tab_search, tab_llm, tab_notion, tab_services, tab_resume, tab_email, tab_skills = st.tabs(
    ["🔎 Search", "🤖 LLM Backends", "📚 Notion", "🔌 Services", "📝 Resume Profile", "📧 Email", "🏷️ Skills"]
)
```
**Step 3: Add the Skills & Keywords tab body**

Append at the end of the file:

```python
# ── Skills & Keywords tab ─────────────────────────────────────────────────────
with tab_skills:
    st.subheader("🏷️ Skills & Keywords")
    st.caption(
        "These are matched against job descriptions to select Alex's most relevant "
        "experience and highlight keyword overlap in the research brief."
    )

    if not KEYWORDS_CFG.exists():
        st.warning("resume_keywords.yaml not found — create it at config/resume_keywords.yaml")
        st.stop()

    kw_data = load_yaml(KEYWORDS_CFG)

    changed = False
    for category in ["skills", "domains", "keywords"]:
        st.markdown(f"**{category.title()}**")
        tags: list[str] = kw_data.get(category, [])

        # Render existing tags as removable chips
        cols = st.columns(min(len(tags), 6) or 1)
        to_remove = None
        for i, tag in enumerate(tags):
            with cols[i % 6]:
                if st.button(f"× {tag}", key=f"rm_{category}_{i}", use_container_width=True):
                    to_remove = tag
        if to_remove:
            tags.remove(to_remove)
            kw_data[category] = tags
            changed = True

        # Add new tag
        new_col, btn_col = st.columns([4, 1])
        new_tag = new_col.text_input(
            "Add", key=f"new_{category}", label_visibility="collapsed",
            placeholder=f"Add {category[:-1] if category.endswith('s') else category}…"
        )
        if btn_col.button("+ Add", key=f"add_{category}"):
            tag = new_tag.strip()
            if tag and tag not in tags:
                tags.append(tag)
                kw_data[category] = tags
                changed = True

        st.markdown("---")

    if changed:
        save_yaml(KEYWORDS_CFG, kw_data)
        st.success("Saved.")
        st.rerun()
```
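The tab body calls `load_yaml` and `save_yaml`, which the Settings page presumably already defines for its other tabs. If not, a minimal PyYAML pair would do (names and error handling here are assumptions, not the page's confirmed helpers):

```python
from pathlib import Path
import yaml

def load_yaml(path: Path) -> dict:
    # Missing or empty files come back as {} so callers can .get() safely.
    if not path.exists():
        return {}
    return yaml.safe_load(path.read_text()) or {}

def save_yaml(path: Path, data: dict) -> None:
    # sort_keys=False preserves the category order from the template file.
    path.write_text(yaml.dump(data, sort_keys=False, allow_unicode=True))
```

`allow_unicode=True` matters here because tags may contain non-ASCII characters (e.g. the × chips are ASCII-safe, but pasted keywords may not be).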
**Step 4: Restart and verify**

```bash
bash scripts/manage-ui.sh restart
```

Navigate to Settings → Skills tab. Verify:
- Tags render as `× tag` buttons; clicking one removes it immediately
- Text input + Add button appends a new tag
- Changes persist to `config/resume_keywords.yaml`

**Step 5: Commit**

```bash
git add app/pages/2_Settings.py
git commit -m "feat: add Skills & Keywords tag editor to Settings"
```

---
### Task 8: Run full test suite + final smoke test

**Step 1: Full test suite**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```

Expected: all existing + new tests pass.

**Step 2: End-to-end smoke test**

With SearXNG running (`docker compose up -d` in `/Library/Development/scrapers/SearXNG/`):

```bash
conda run -n job-seeker python scripts/company_research.py --job-id <valid_id>
```

Verify:
- 6 search threads complete
- All 7 sections present in output
- Talking points reference real experience entries (not a generic blurb)
- `get_research()` returns all new fields populated

**Step 3: Final commit if any cleanup needed**

```bash
git add -p  # stage only intentional changes
git commit -m "chore: research workflow final cleanup"
```
@@ -1,176 +0,0 @@
# Survey Assistant — Design Doc

**Date:** 2026-02-23
**Status:** Approved

---

## Goal

Add a real-time Survey Assistant to the job application pipeline that helps the user answer culture-fit and values surveys during the application process. Supports timed surveys via screenshot ingestion and text paste, with a quick ("just give me the answer") or detailed ("explain each option") mode toggle.

---

## Pipeline Stage

A new `survey` stage is inserted between `applied` and `phone_screen`:

```
pending → approved → applied → survey → phone_screen → interviewing → offer → hired
```

- Promotion to `survey` is triggered manually (banner prompt) or automatically when the email classifier detects a `survey_received` signal.
- Jobs can skip `survey` entirely — it is not required.
- `survey_at` timestamp column added to the `jobs` table.

---

## Email Classifier

`classify_stage_signal` in `scripts/imap_sync.py` gains a sixth label: `survey_received`.

When detected:
- The Interviews page shows the existing stage-suggestion banner style: "Survey email received — move to Survey stage?"
- A one-click promote button moves the job to `survey` and records `survey_at`.

---

## Kanban Consolidation (Interviews Page)

### Change A — Pre-kanban section
`applied` and `survey` jobs appear above the kanban columns in a pre-pipeline section, not as their own columns. Visual differentiation: `survey` jobs show a badge/chip.

### Change B — Offer + Hired merged
`offer` and `hired` are combined into one column. `hired` jobs are visually differentiated (e.g. green highlight or checkmark icon) rather than occupying a separate column.

**Result:** Kanban columns are `phone_screen | interviewing | offer/hired` (3 columns), with applied/survey as a pre-section above.

---

## Survey Assistant Page (`app/pages/7_Survey.py`)

### Layout

**Left panel — Input**
- Job selector dropdown (defaults to `survey`-stage jobs, allows any job)
- Survey name field (optional label, e.g. "Culture Fit Round 1")
- Mode toggle: **Quick** / **Detailed** (persisted in session state)
- Two input tabs:
  - **Paste Text** — textarea for pasted survey content
  - **Screenshot** — `streamlit-paste-button` (clipboard paste) + file uploader side by side; either method populates an image preview
- Analyze button

**Right panel — Output**
- **Quick mode:** numbered list, each item is bold option letter + one-line rationale,
  e.g. `**B** — most aligns with a collaborative, team-first culture`
- **Detailed mode:** each question expanded — option-by-option breakdown, recommendation, brief "why"
- "Save to Job" button — persists Q&A to `survey_responses`; shows reported score field before saving

**Below both panels — History**
- Accordion: prior saved survey responses for the selected job, newest first
- Shows survey name, mode, reported score, timestamp, and LLM output summary

---

## Data Model

### `survey_responses` table (new)

```sql
CREATE TABLE survey_responses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id INTEGER NOT NULL REFERENCES jobs(id),
    survey_name TEXT,      -- e.g. "Culture Fit Round 1"
    received_at DATETIME,  -- when the survey email arrived (if known)
    source TEXT,           -- 'text_paste' | 'screenshot'
    raw_input TEXT,        -- pasted text content, or NULL for screenshots
    image_path TEXT,       -- path to saved screenshot, or NULL
    mode TEXT,             -- 'quick' | 'detailed'
    llm_output TEXT,       -- full LLM response
    reported_score TEXT,   -- optional score shown by the survey app
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

Screenshots saved to `data/survey_screenshots/<job_id>/<timestamp>.png` (directory gitignored). Stored by path, not BLOB.

Multiple rows per job are allowed (multiple survey rounds).
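A minimal sketch of the CRUD helpers `scripts/db.py` would gain for this table — the helper names (`save_survey_response`, `get_survey_responses`) are assumptions for illustration, and the `REFERENCES` clause is dropped here so the snippet runs standalone:

```python
import sqlite3

# Schema mirrors the design above, minus the jobs(id) foreign key so this
# sketch is self-contained.
SCHEMA = """
CREATE TABLE IF NOT EXISTS survey_responses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id INTEGER NOT NULL,
    survey_name TEXT,
    received_at DATETIME,
    source TEXT,
    raw_input TEXT,
    image_path TEXT,
    mode TEXT,
    llm_output TEXT,
    reported_score TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
)
"""


def save_survey_response(conn, job_id, survey_name, source, mode,
                         llm_output, reported_score=None):
    """Insert one survey round for a job; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO survey_responses "
        "(job_id, survey_name, source, mode, llm_output, reported_score) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (job_id, survey_name, source, mode, llm_output, reported_score),
    )
    conn.commit()
    return cur.lastrowid


def get_survey_responses(conn, job_id):
    """All rounds for a job, newest first — matches the History accordion."""
    rows = conn.execute(
        "SELECT * FROM survey_responses WHERE job_id = ? "
        "ORDER BY created_at DESC, id DESC",
        (job_id,),
    )
    return [dict(r) for r in rows]


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(SCHEMA)
save_survey_response(conn, 1, "Culture Fit Round 1", "text_paste",
                     "quick", "**B** — team-first")
```

Multiple rows per job fall out naturally — each survey round is one `INSERT`.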
### `jobs` table addition
- `survey_at DATETIME` — timestamp when the job entered the `survey` stage

---

## Vision Service (`scripts/vision_service/`)

A dedicated, optional FastAPI microservice for image-based survey analysis. Independent of thoth.

### Model
- **Primary:** `moondream2` (~1.5GB VRAM at 4-bit quantization)
- **Reserve:** `Qwen2.5-VL-3B` if moondream2 accuracy proves insufficient

### Architecture
- Separate conda env: `job-seeker-vision` (torch + transformers + FastAPI + moondream2)
- Port: **8002** (avoids conflict with vLLM on 8000 and thoth on 8001)
- Model loaded lazily on first request, stays resident (no reload between calls)
- GPU loaded on first inference request; 4-bit quantization keeps the VRAM footprint at ~1.5GB

### Endpoints
```
POST /analyze
  Body: { "prompt": str, "image_base64": str }
  Returns: { "text": str }

GET /health
  Returns: { "status": "ok"|"loading", "model": str, "gpu": bool }
```
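A hypothetical client sketch for the `/analyze` endpoint above, stdlib-only so it stands alone (the repo's own code would more likely use `requests`); the host/port follow the architecture notes:

```python
import base64
import json
import urllib.request


def analyze_image(path: str, prompt: str,
                  base_url: str = "http://localhost:8002") -> str:
    """POST a screenshot to the vision service and return its text answer."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    body = json.dumps({"prompt": prompt, "image_base64": image_b64}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: first call may trigger lazy model load on the service
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["text"]
```

Error handling (service down, `"loading"` status from `/health`) is elided; the Survey page would hide the Screenshot tab in that case.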
### Management
`scripts/manage-vision.sh start|stop|restart|status|logs` — same pattern as `manage-ui.sh`.

### Optional install
- If the vision service is not running, the Screenshot tab on the Survey page is hidden
- A note in its place explains how to enable it: "Install vision service — see docs/vision-service.md"
- Text Paste mode is always available regardless of vision service status

---

## LLM Router Changes (`scripts/llm_router.py`)

`LLMRouter.complete()` gains an optional `images` parameter:

```python
def complete(self, prompt: str, images: list[str] | None = None) -> str:
    # images: list of base64-encoded PNG/JPG strings
```

- Backends that don't support images are skipped when `images` is provided
- Survey analysis fallback order: `vision_service → claude_code`
- `vision_service` backend entry added to `config/llm.yaml` (enabled: false by default — optional install)
---

## Generalized Version Notes

- Vision service is an **optional feature** in the generalized app
- `config/llm.yaml` ships with `vision_service.enabled: false`
- `scripts/manage-vision.sh` and `scripts/vision_service/` included but documented as optional
- Survey page renders in degraded (text-only) mode if the vision service is absent
- Install instructions in `docs/vision-service.md` (to be written during implementation)

---

## Files Affected

| File | Change |
|------|--------|
| `app/pages/7_Survey.py` | New page |
| `app/pages/5_Interviews.py` | Kanban consolidation (A+B), survey banner |
| `scripts/imap_sync.py` | Add `survey_received` classifier label |
| `scripts/db.py` | `survey_responses` table, `survey_at` column, CRUD helpers |
| `scripts/llm_router.py` | `images=` parameter, skip non-vision backends |
| `scripts/vision_service/main.py` | New FastAPI vision service |
| `scripts/vision_service/environment.yml` | New conda env spec |
| `scripts/manage-vision.sh` | New management script |
| `config/llm.yaml` | Add `vision_service` backend entry (enabled: false) |
| `config/llm.yaml.example` | Same |
File diff suppressed because it is too large
@@ -1,174 +0,0 @@
# Design: Craigslist Custom Board Scraper

**Date:** 2026-02-24
**Status:** Approved

---

## Overview

Add a Craigslist scraper to `scripts/custom_boards/craigslist.py` following the existing
adzuna/theladders pattern. Craigslist is regional (one subdomain per metro), has no native
remote filter, and exposes an RSS feed that gives clean structured data without Playwright.

Discovery uses RSS for speed and reliability. The full job description is populated by the
existing `scrape_url` background task. Company name and salary — not present in Craigslist
listings as structured fields — are extracted from the description body by the existing
`enrich_descriptions` LLM pipeline after the posting is fetched.

---

## Files

| Action | File |
|---|---|
| Create | `scripts/custom_boards/craigslist.py` |
| Create | `config/craigslist.yaml` (gitignored) |
| Create | `config/craigslist.yaml.example` |
| Create | `tests/test_craigslist.py` |
| Modify | `scripts/discover.py` — add to `CUSTOM_SCRAPERS` registry |
| Modify | `scripts/enrich_descriptions.py` — add company/salary extraction for craigslist source |
| Modify | `config/search_profiles.yaml` — add `craigslist` to `custom_boards` on relevant profiles |
| Modify | `.gitignore` — add `config/craigslist.yaml` |

---

## Config (`config/craigslist.yaml`)

Gitignored. `.example` committed alongside it.

```yaml
# Craigslist metro subdomains to search.
# Full list at: https://www.craigslist.org/about/sites
metros:
  - sfbay
  - newyork
  - chicago
  - losangeles
  - seattle
  - austin

# Maps search profile location strings to a single metro subdomain.
# Locations not listed here are skipped silently.
location_map:
  "San Francisco Bay Area, CA": sfbay
  "New York, NY": newyork
  "Chicago, IL": chicago
  "Los Angeles, CA": losangeles
  "Seattle, WA": seattle
  "Austin, TX": austin

# Craigslist job category. Defaults to 'jjj' (general jobs) if omitted.
# Other useful values: csr (customer service), mar (marketing), sof (software)
# category: jjj
```

---

## Scraper Architecture

### RSS URL pattern
```
https://{metro}.craigslist.org/search/{category}?query={title}&format=rss&sort=date
```

Default category: `jjj`. Overridable via the `category` key in config.
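Assembling a feed URL from the pattern above is a one-liner; `quote_plus` handles the spaces in multi-word titles:

```python
from urllib.parse import quote_plus


def rss_url(metro: str, query: str, category: str = "jjj") -> str:
    """Build a Craigslist RSS search URL from the pattern above."""
    return (f"https://{metro}.craigslist.org/search/{category}"
            f"?query={quote_plus(query)}&format=rss&sort=date")


print(rss_url("sfbay", "Customer Success Manager"))
# → https://sfbay.craigslist.org/search/jjj?query=Customer+Success+Manager&format=rss&sort=date
```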
### `scrape(profile, location, results_wanted)` flow

1. Load `config/craigslist.yaml` — return `[]` with a printed warning if missing or malformed
2. Determine metros to search:
   - `location.lower() == "remote"` → all configured metros (Craigslist has no native remote filter)
   - Any other string → `location_map.get(location)` → single metro; skip silently if not mapped
3. For each metro × each title in `profile["titles"]`:
   - Fetch RSS via `requests.get` with a standard User-Agent header
   - Parse with `xml.etree.ElementTree` (stdlib — no extra deps)
   - Filter `<item>` entries by `<pubDate>` against `profile["hours_old"]`
   - Extract title, URL, and description snippet from each item
   - `time.sleep(0.5)` between fetches (polite pacing; easy to make configurable later)
4. Dedup by URL within the run via a `seen_urls` set
5. Stop when `results_wanted` is reached
6. Return list of job dicts

### Return dict shape

```python
{
    "title": "<RSS item title, cleaned>",
    "company": "",       # not in Craigslist — filled by LLM enrichment
    "url": "<item link>",
    "source": "craigslist",
    "location": "<metro> (Craigslist)",
    "is_remote": True,   # if remote search, else False
    "salary": "",        # not reliably structured — filled by LLM enrichment
    "description": "",   # scrape_url background task fills this in
}
```

### Error handling

- Missing config → `[]` + printed warning, never raises
- `requests.RequestException` → skip that metro/title, print warning, continue
- Malformed RSS XML → skip that response, print warning, continue
- HTTP non-200 → skip, print status code

---

## LLM Enrichment for company/salary

Craigslist postings frequently include company name and salary in the body text, but not as
structured fields. After `scrape_url` populates `description`, the `enrich_descriptions`
task handles extraction.

**Trigger condition:** `source == "craigslist"` AND `company == ""` AND `description != ""`

**Prompt addition:** Extend the existing enrichment prompt to also extract:
- Company name (if present in the posting body)
- Salary or compensation range (if mentioned)

Results written back via `update_job_fields`. If the LLM cannot extract a company name,
the field stays blank — this is expected and acceptable for Craigslist.
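A sketch of how the trigger condition might be checked inside `enrich_descriptions` — the function name is hypothetical, and the job dict keys follow the scraper's return shape above:

```python
def needs_craigslist_enrichment(job: dict) -> bool:
    """True when a craigslist job has a fetched description but no
    company yet — the condition under which the LLM extraction runs."""
    return (
        job.get("source") == "craigslist"
        and not job.get("company")
        and bool(job.get("description"))
    )


# Description fetched, company still blank → enrich
assert needs_craigslist_enrichment(
    {"source": "craigslist", "company": "", "description": "Acme Corp. $120k."})
# Description not fetched yet → wait for scrape_url
assert not needs_craigslist_enrichment(
    {"source": "craigslist", "company": "", "description": ""})
# Other boards already carry structured company fields → skip
assert not needs_craigslist_enrichment(
    {"source": "adzuna", "company": "Acme", "description": "..."})
```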
---

## discover.py Integration

One-line addition to the `CUSTOM_SCRAPERS` registry:

```python
from scripts.custom_boards import craigslist as _craigslist

CUSTOM_SCRAPERS: dict[str, object] = {
    "adzuna": _adzuna.scrape,
    "theladders": _theladders.scrape,
    "craigslist": _craigslist.scrape,  # new
}
```

Add `craigslist` to `custom_boards` in `config/search_profiles.yaml` for relevant profiles.

---

## Tests (`tests/test_craigslist.py`)

All tests use mocked `requests.get` with fixture RSS XML — no network calls.

| Test | Asserts |
|---|---|
| `test_scrape_returns_empty_on_missing_config` | Missing yaml → `[]`, no raise |
| `test_scrape_remote_hits_all_metros` | `location="Remote"` → one fetch per configured metro |
| `test_scrape_location_map_resolves` | `"San Francisco Bay Area, CA"` → `sfbay` only |
| `test_scrape_location_not_in_map_returns_empty` | Unknown location → `[]`, no raise |
| `test_hours_old_filter` | Items older than `hours_old` are excluded |
| `test_dedup_within_run` | Same URL appearing in two metros only returned once |
| `test_http_error_graceful` | `RequestException` → `[]`, no raise |
| `test_results_wanted_cap` | Never returns more than `results_wanted` |

---

## Out of Scope

- Playwright-based scraping (RSS is sufficient; Playwright adds a dep for no gain)
- Craigslist subcategory multi-search per profile (config `category` override is sufficient)
- Salary/company extraction directly in the scraper (LLM enrichment is the right layer)
- Windows support (deferred globally)
@@ -1,728 +0,0 @@
# Craigslist Scraper Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add a Craigslist RSS-based job scraper to `scripts/custom_boards/craigslist.py`, wired into the existing discovery pipeline, with LLM extraction of company name and salary from the fetched posting body.

**Architecture:** RSS fetch per metro × title → `scrape_url` background task fills description → new `enrich_craigslist` task type extracts company/salary via LLM. Config-driven metro list in `config/craigslist.yaml`. Integrates via the existing `CUSTOM_SCRAPERS` registry in `discover.py`.

**Tech Stack:** Python 3.11, `requests`, `xml.etree.ElementTree` (stdlib), `PyYAML`, `email.utils.parsedate_to_datetime` (stdlib), existing `llm_router.py`

**Test runner:** `/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v`

---

## Task 1: Config files + .gitignore

**Files:**
- Create: `config/craigslist.yaml.example`
- Create: `config/craigslist.yaml`
- Modify: `.gitignore`

**Step 1: Create `config/craigslist.yaml.example`**

```yaml
# Craigslist metro subdomains to search.
# Copy to config/craigslist.yaml and adjust for your markets.
# Full subdomain list: https://www.craigslist.org/about/sites
metros:
  - sfbay
  - newyork
  - chicago
  - losangeles
  - seattle
  - austin

# Maps search profile location strings → Craigslist metro subdomain.
# Locations not listed here are silently skipped.
location_map:
  "San Francisco Bay Area, CA": sfbay
  "New York, NY": newyork
  "Chicago, IL": chicago
  "Los Angeles, CA": losangeles
  "Seattle, WA": seattle
  "Austin, TX": austin

# Craigslist job category. Defaults to 'jjj' (general jobs) if omitted.
# Other options: csr (customer service), mar (marketing), sof (software/qa/dba)
# category: jjj
```

**Step 2: Create `config/craigslist.yaml`** (personal config — gitignored)

Copy `.example` as-is (Alex targets sfbay + remote, so this default is correct).

**Step 3: Add to `.gitignore`**

Add `config/craigslist.yaml` after the existing `config/adzuna.yaml` line:

```
config/adzuna.yaml
config/craigslist.yaml
```

**Step 4: Commit**

```bash
git add config/craigslist.yaml.example .gitignore
git commit -m "feat: add craigslist config template and gitignore entry"
```

---
## Task 2: Core scraper tests (write failing first)

**Files:**
- Create: `tests/test_craigslist.py`

**Step 1: Create `tests/test_craigslist.py` with all fixtures and tests**

```python
"""Tests for Craigslist RSS scraper."""
from datetime import datetime, timezone, timedelta
from email.utils import format_datetime
from unittest.mock import patch, MagicMock
import xml.etree.ElementTree as ET

import pytest
import requests


# ── RSS fixture helpers ────────────────────────────────────────────────────────

def _make_rss(items: list[dict]) -> bytes:
    """Build minimal Craigslist-style RSS XML from a list of item dicts."""
    channel = ET.Element("channel")
    for item_data in items:
        item = ET.SubElement(channel, "item")
        for tag, value in item_data.items():
            el = ET.SubElement(item, tag)
            el.text = value
    rss = ET.Element("rss")
    rss.append(channel)
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


def _pubdate(hours_ago: float = 1.0) -> str:
    """Return an RFC 2822 pubDate string for N hours ago."""
    dt = datetime.now(tz=timezone.utc) - timedelta(hours=hours_ago)
    return format_datetime(dt)


def _mock_resp(content: bytes, status_code: int = 200) -> MagicMock:
    mock = MagicMock()
    mock.status_code = status_code
    mock.content = content
    mock.raise_for_status = MagicMock()
    if status_code >= 400:
        mock.raise_for_status.side_effect = requests.HTTPError(f"HTTP {status_code}")
    return mock


# ── Fixtures ──────────────────────────────────────────────────────────────────

_SAMPLE_RSS = _make_rss([{
    "title": "Customer Success Manager",
    "link": "https://sfbay.craigslist.org/jjj/d/csm-role/1234567890.html",
    "description": "Great CSM role at Acme Corp. Salary $120k.",
    "pubDate": _pubdate(1),
}])

_TWO_ITEM_RSS = _make_rss([
    {
        "title": "Customer Success Manager",
        "link": "https://sfbay.craigslist.org/jjj/d/csm-role/1111111111.html",
        "description": "CSM role 1.",
        "pubDate": _pubdate(1),
    },
    {
        "title": "Account Manager",
        "link": "https://sfbay.craigslist.org/jjj/d/am-role/2222222222.html",
        "description": "AM role.",
        "pubDate": _pubdate(2),
    },
])

_OLD_ITEM_RSS = _make_rss([{
    "title": "Old Job",
    "link": "https://sfbay.craigslist.org/jjj/d/old-job/9999999999.html",
    "description": "Very old posting.",
    "pubDate": _pubdate(hours_ago=500),
}])

_TWO_METRO_CONFIG = {
    "metros": ["sfbay", "newyork"],
    "location_map": {
        "San Francisco Bay Area, CA": "sfbay",
        "New York, NY": "newyork",
    },
    "category": "jjj",
}

_SINGLE_METRO_CONFIG = {
    "metros": ["sfbay"],
    "location_map": {"San Francisco Bay Area, CA": "sfbay"},
}

_PROFILE = {"titles": ["Customer Success Manager"], "hours_old": 240}


# ── Tests ─────────────────────────────────────────────────────────────────────

def test_scrape_returns_empty_on_missing_config(tmp_path):
    """Missing craigslist.yaml → returns [] without raising."""
    with patch("scripts.custom_boards.craigslist._CONFIG_PATH",
               tmp_path / "craigslist.yaml"):
        import importlib
        import scripts.custom_boards.craigslist as cl
        importlib.reload(cl)
        result = cl.scrape(_PROFILE, "San Francisco Bay Area, CA")
        assert result == []


def test_scrape_remote_hits_all_metros():
    """location='Remote' triggers one RSS fetch per configured metro."""
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_TWO_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   return_value=_mock_resp(_SAMPLE_RSS)) as mock_get:
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "Remote")

    assert mock_get.call_count == 2
    fetched_urls = [call.args[0] for call in mock_get.call_args_list]
    assert any("sfbay" in u for u in fetched_urls)
    assert any("newyork" in u for u in fetched_urls)
    assert all(r["is_remote"] for r in result)


def test_scrape_location_map_resolves():
    """Known location string maps to exactly one metro."""
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_TWO_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   return_value=_mock_resp(_SAMPLE_RSS)) as mock_get:
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "San Francisco Bay Area, CA")

    assert mock_get.call_count == 1
    assert "sfbay" in mock_get.call_args.args[0]
    assert len(result) == 1
    assert result[0]["is_remote"] is False


def test_scrape_location_not_in_map_returns_empty():
    """Location not in location_map → [] without raising."""
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_SINGLE_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get") as mock_get:
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "Portland, OR")

    assert result == []
    mock_get.assert_not_called()


def test_hours_old_filter():
    """Items older than hours_old are excluded."""
    profile = {"titles": ["Customer Success Manager"], "hours_old": 48}
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_SINGLE_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   return_value=_mock_resp(_OLD_ITEM_RSS)):
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(profile, "San Francisco Bay Area, CA")

    assert result == []


def test_dedup_within_run():
    """Same URL from two different metros is only returned once."""
    same_url_rss = _make_rss([{
        "title": "CSM Role",
        "link": "https://sfbay.craigslist.org/jjj/d/csm/1234.html",
        "description": "Same job.",
        "pubDate": _pubdate(1),
    }])
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_TWO_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   return_value=_mock_resp(same_url_rss)):
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "Remote")

    urls = [r["url"] for r in result]
    assert len(urls) == len(set(urls))


def test_http_error_graceful():
    """HTTP error → [] without raising."""
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_SINGLE_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   side_effect=requests.RequestException("timeout")):
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "San Francisco Bay Area, CA")

    assert result == []


def test_results_wanted_cap():
    """Never returns more than results_wanted items."""
    with patch("scripts.custom_boards.craigslist._load_config",
               return_value=_TWO_METRO_CONFIG):
        with patch("scripts.custom_boards.craigslist.requests.get",
                   return_value=_mock_resp(_TWO_ITEM_RSS)):
            from scripts.custom_boards import craigslist
            result = craigslist.scrape(_PROFILE, "Remote", results_wanted=1)

    assert len(result) <= 1
```

**Step 2: Run tests to verify they all fail**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_craigslist.py -v
```

Expected: `ModuleNotFoundError: No module named 'scripts.custom_boards.craigslist'`

---
## Task 3: Implement `scripts/custom_boards/craigslist.py`
|
||||
|
||||
**Files:**
|
||||
- Create: `scripts/custom_boards/craigslist.py`
|
||||
|
||||
**Step 1: Create the scraper**
|
||||
|
||||
```python
|
||||
"""Craigslist job scraper — RSS-based.
|
||||
|
||||
Uses Craigslist's native RSS feed endpoint for discovery.
|
||||
Full job description is populated by the scrape_url background task.
|
||||
Company name and salary (not structured in Craigslist listings) are
|
||||
extracted from the description body by the enrich_craigslist task.
|
||||
|
||||
Config: config/craigslist.yaml (gitignored — metro list + location map)
|
||||
config/craigslist.yaml.example (committed template)
|
||||
|
||||
Returns a list of dicts compatible with scripts.db.insert_job().
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import time
|
||||
import xml.etree.ElementTree as ET
|
||||
from datetime import datetime, timezone
|
||||
from email.utils import parsedate_to_datetime
|
||||
from pathlib import Path
|
||||
from urllib.parse import quote_plus
|
||||
|
||||
import requests
|
||||
import yaml
|
||||
|
||||
_CONFIG_PATH = Path(__file__).parent.parent.parent / "config" / "craigslist.yaml"
|
||||
_DEFAULT_CATEGORY = "jjj"
|
||||
_HEADERS = {
|
||||
"User-Agent": (
|
||||
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
|
||||
)
|
||||
}
|
||||
_TIMEOUT = 15
|
||||
_SLEEP = 0.5 # seconds between requests — easy to make configurable later
|
||||
|
||||
|
||||
def _load_config() -> dict:
|
||||
if not _CONFIG_PATH.exists():
|
||||
raise FileNotFoundError(
|
||||
f"Craigslist config not found: {_CONFIG_PATH}\n"
|
||||
"Copy config/craigslist.yaml.example → config/craigslist.yaml "
|
||||
"and configure your target metros."
|
||||
)
|
||||
cfg = yaml.safe_load(_CONFIG_PATH.read_text()) or {}
|
||||
if not cfg.get("metros"):
|
||||
raise ValueError(
|
||||
"config/craigslist.yaml must contain at least one entry under 'metros'."
|
||||
)
|
||||
return cfg
|
||||
|
||||
|
||||
def _rss_url(metro: str, category: str, query: str) -> str:
|
||||
return (
|
||||
f"https://{metro}.craigslist.org/search/{category}"
|
||||
f"?query={quote_plus(query)}&format=rss&sort=date"
|
||||
)
|
||||
|
||||
|
||||
def _parse_pubdate(pubdate_str: str) -> datetime | None:
|
||||
"""Parse an RSS pubDate string to a timezone-aware datetime."""
|
||||
try:
|
||||
return parsedate_to_datetime(pubdate_str)
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def _fetch_rss(url: str) -> list[dict]:
|
||||
"""Fetch and parse a Craigslist RSS feed. Returns list of raw item dicts."""
|
||||
resp = requests.get(url, headers=_HEADERS, timeout=_TIMEOUT)
|
||||
resp.raise_for_status()
|
||||
try:
|
||||
root = ET.fromstring(resp.content)
|
||||
except ET.ParseError as exc:
|
||||
raise ValueError(f"Malformed RSS XML: {exc}") from exc
|
||||
|
||||
items = []
|
||||
for item in root.findall(".//item"):
|
||||
def _text(tag: str, _item=item) -> str:
|
||||
el = _item.find(tag)
|
||||
return (el.text or "").strip() if el is not None else ""
|
||||
|
||||
items.append({
|
||||
"title": _text("title"),
|
||||
"link": _text("link"),
|
||||
"description": _text("description"),
|
||||
            "pubDate": _text("pubDate"),
        })
    return items


def scrape(profile: dict, location: str, results_wanted: int = 50) -> list[dict]:
    """Fetch jobs from Craigslist RSS for a single location.

    Args:
        profile: Search profile dict from search_profiles.yaml.
        location: Location string (e.g. "Remote" or "San Francisco Bay Area, CA").
        results_wanted: Maximum results to return across all metros and titles.

    Returns:
        List of job dicts with keys: title, company, url, source, location,
        is_remote, salary, description.
        company/salary are empty — filled later by enrich_craigslist task.
    """
    try:
        cfg = _load_config()
    except (FileNotFoundError, ValueError) as exc:
        print(f" [craigslist] Skipped — {exc}")
        return []

    metros_all: list[str] = cfg.get("metros", [])
    location_map: dict[str, str] = cfg.get("location_map", {})
    category: str = cfg.get("category") or _DEFAULT_CATEGORY

    is_remote_search = location.lower() == "remote"
    if is_remote_search:
        metros = metros_all
    else:
        metro = location_map.get(location)
        if not metro:
            print(f" [craigslist] No metro mapping for '{location}' — skipping")
            return []
        metros = [metro]

    titles: list[str] = profile.get("titles", [])
    hours_old: int = profile.get("hours_old", 240)
    cutoff = datetime.now(tz=timezone.utc).timestamp() - (hours_old * 3600)

    seen_urls: set[str] = set()
    results: list[dict] = []

    for metro in metros:
        if len(results) >= results_wanted:
            break

        for title in titles:
            if len(results) >= results_wanted:
                break

            url = _rss_url(metro, category, title)
            try:
                items = _fetch_rss(url)
            except requests.RequestException as exc:
                print(f" [craigslist] HTTP error ({metro}/{title}): {exc}")
                time.sleep(_SLEEP)
                continue
            except ValueError as exc:
                print(f" [craigslist] Parse error ({metro}/{title}): {exc}")
                time.sleep(_SLEEP)
                continue

            for item in items:
                if len(results) >= results_wanted:
                    break

                item_url = item.get("link", "")
                if not item_url or item_url in seen_urls:
                    continue

                pub = _parse_pubdate(item.get("pubDate", ""))
                if pub and pub.timestamp() < cutoff:
                    continue

                seen_urls.add(item_url)
                results.append({
                    "title": item.get("title", ""),
                    "company": "",
                    "url": item_url,
                    "source": "craigslist",
                    "location": f"{metro} (Craigslist)",
                    "is_remote": is_remote_search,
                    "salary": "",
                    "description": "",
                })

        time.sleep(_SLEEP)

    return results[:results_wanted]
```

**Step 2: Run tests**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_craigslist.py -v
```

Expected: all 8 PASS

**Step 3: Run full test suite to check for regressions**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```

Expected: all existing tests still PASS

**Step 4: Commit**

```bash
git add scripts/custom_boards/craigslist.py tests/test_craigslist.py
git commit -m "feat: add Craigslist RSS scraper to custom_boards"
```

---

## Task 4: Wire into discover.py + search_profiles.yaml

**Files:**
- Modify: `scripts/discover.py:20-32`
- Modify: `config/search_profiles.yaml`

**Step 1: Add to `CUSTOM_SCRAPERS` registry in `discover.py`**

Find this block (around line 20):

```python
from scripts.custom_boards import adzuna as _adzuna
from scripts.custom_boards import theladders as _theladders
```

Replace with:

```python
from scripts.custom_boards import adzuna as _adzuna
from scripts.custom_boards import theladders as _theladders
from scripts.custom_boards import craigslist as _craigslist
```

Find:

```python
CUSTOM_SCRAPERS: dict[str, object] = {
    "adzuna": _adzuna.scrape,
    "theladders": _theladders.scrape,
}
```

Replace with:

```python
CUSTOM_SCRAPERS: dict[str, object] = {
    "adzuna": _adzuna.scrape,
    "theladders": _theladders.scrape,
    "craigslist": _craigslist.scrape,
}
```

**Step 2: Add `craigslist` to relevant profiles in `config/search_profiles.yaml`**

For each profile that has `custom_boards:`, add `- craigslist`. Example — the `cs_leadership` profile currently has:

```yaml
custom_boards:
  - adzuna
  - theladders
```

Change to:

```yaml
custom_boards:
  - adzuna
  - theladders
  - craigslist
```

Repeat for all profiles where Craigslist makes sense (all of them — remote + SF Bay Area are both mapped).

**Step 3: Verify discover.py imports cleanly**

```bash
conda run -n job-seeker python -c "from scripts.discover import CUSTOM_SCRAPERS; print(list(CUSTOM_SCRAPERS.keys()))"
```

Expected: `['adzuna', 'theladders', 'craigslist']`

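The dispatch side of the registry isn't shown in this plan, but the intent is that `discover.py` iterates each profile's `custom_boards` list and calls the matching `scrape` callable. A minimal sketch with dummy scrapers standing in for the real modules (`run_custom_boards` is a hypothetical name, not a function from `discover.py`):

```python
# Dummy scrapers stand in for the real adzuna/craigslist modules.
CUSTOM_SCRAPERS = {
    "adzuna": lambda profile, location, results_wanted=50: [{"source": "adzuna"}],
    "craigslist": lambda profile, location, results_wanted=50: [{"source": "craigslist"}],
}


def run_custom_boards(profile: dict, location: str) -> list[dict]:
    """Call each scraper named in the profile's custom_boards list, skipping unknowns."""
    jobs: list[dict] = []
    for board in profile.get("custom_boards", []):
        scrape = CUSTOM_SCRAPERS.get(board)
        if scrape is None:
            print(f" [discover] unknown board '{board}' — skipping")
            continue
        jobs.extend(scrape(profile, location))
    return jobs


profile = {"custom_boards": ["adzuna", "craigslist", "nope"]}
print([j["source"] for j in run_custom_boards(profile, "Remote")])  # → ['adzuna', 'craigslist']
```

Skipping unknown board names (rather than raising) means a typo in one YAML profile degrades gracefully instead of killing the whole discovery run.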
**Step 4: Commit**

```bash
git add scripts/discover.py config/search_profiles.yaml
git commit -m "feat: register craigslist scraper in discover.py and search profiles"
```

---

## Task 5: LLM enrichment — extract company + salary for Craigslist jobs

**Files:**
- Modify: `scripts/enrich_descriptions.py`
- Modify: `scripts/task_runner.py`

**Step 1: Read `scripts/task_runner.py`** to understand the `scrape_url` completion handler before editing.

**Step 2: Add `enrich_craigslist_fields()` to `enrich_descriptions.py`**

Add this function after `enrich_all_descriptions` (before `if __name__ == "__main__"`):

```python
def enrich_craigslist_fields(
    db_path: Path = DEFAULT_DB,
    job_id: int | None = None,
) -> dict:
    """
    Use LLM to extract company name and salary from a Craigslist job description.

    Called after scrape_url populates the description for a craigslist job.
    Only runs when: source='craigslist', company='', description non-empty.

    Returns dict with keys 'company' and/or 'salary' (may be empty strings).
    """
    import sqlite3 as _sq

    conn = _sq.connect(db_path)
    conn.row_factory = _sq.Row
    row = conn.execute(
        "SELECT id, description, company, source FROM jobs WHERE id=?", (job_id,)
    ).fetchone()
    conn.close()

    if not row:
        return {}
    if row["source"] != "craigslist":
        return {}
    if row["company"]:  # already populated
        return {}
    if not (row["description"] or "").strip():
        return {}

    sys.path.insert(0, str(Path(__file__).parent.parent))
    from scripts.llm_router import LLMRouter

    prompt = (
        "Extract the following from this job posting. "
        "Return JSON only, no commentary.\n\n"
        '{"company": "<company name or empty string>", '
        '"salary": "<salary/compensation or empty string>"}\n\n'
        f"Posting:\n{row['description'][:3000]}"
    )

    try:
        router = LLMRouter()
        raw = router.complete(prompt)
    except Exception as exc:
        print(f"[enrich_craigslist] LLM error for job {job_id}: {exc}")
        return {}

    import json
    import re

    try:
        # Strip markdown code fences if present
        clean = re.sub(r"```(?:json)?|```", "", raw).strip()
        fields = json.loads(clean)
    except (json.JSONDecodeError, ValueError):
        print(f"[enrich_craigslist] Could not parse LLM response for job {job_id}: {raw!r}")
        return {}

    extracted = {
        k: (fields.get(k) or "").strip()
        for k in ("company", "salary")
        if (fields.get(k) or "").strip()
    }

    if extracted:
        from scripts.db import update_job_fields
        update_job_fields(db_path, job_id, extracted)
        print(f"[enrich_craigslist] job {job_id}: "
              f"company={extracted.get('company', '—')} "
              f"salary={extracted.get('salary', '—')}")

    return extracted
```
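
The fence-stripping-plus-parse step can be exercised in isolation. A minimal sketch using the same regex as the function above (the `parse_llm_json` name is a stand-in for illustration, not a helper in the codebase):

```python
import json
import re


def parse_llm_json(raw: str) -> dict:
    """Strip optional markdown code fences and parse the remaining JSON payload."""
    clean = re.sub(r"```(?:json)?|```", "", raw).strip()
    return json.loads(clean)


demo = '```json\n{"company": "Acme Corp", "salary": "$120k"}\n```'
print(parse_llm_json(demo)["company"])  # → Acme Corp
```

This matters because many chat-tuned models wrap JSON in ```` ```json ```` fences even when told "JSON only"; parsing the raw response directly would fail on those replies.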

Also add `import sys` to the top of `enrich_descriptions.py` if not already present.

**Step 3: Add `enrich_craigslist` task type to `task_runner.py`**

In `_run_task`, add a new `elif` branch. Find the block that handles `scrape_url` and add after it:

```python
elif task_type == "enrich_craigslist":
    from scripts.enrich_descriptions import enrich_craigslist_fields

    extracted = enrich_craigslist_fields(db_path, job_id)
    company = extracted.get("company", "")
    msg = f"company={company}" if company else "no company found"
    update_task_status(db_path, task_id, "completed", error=msg)
    return
```

**Step 4: Auto-submit `enrich_craigslist` after `scrape_url` for Craigslist jobs**

Still in `task_runner.py`, find the `scrape_url` completion handler. After the `update_task_status` call for `scrape_url`, add:

```python
# Auto-enrich company/salary for Craigslist jobs
import sqlite3 as _sq

_conn = _sq.connect(db_path)
_conn.row_factory = _sq.Row
_job = _conn.execute(
    "SELECT source, company FROM jobs WHERE id=?", (job_id,)
).fetchone()
_conn.close()
if _job and _job["source"] == "craigslist" and not _job["company"]:
    submit_task(db_path, "enrich_craigslist", job_id)
```

**Step 5: Smoke test — run a discovery cycle and check a craigslist job**

```bash
conda run -n job-seeker python -c "
from scripts.custom_boards.craigslist import scrape
jobs = scrape({'titles': ['Customer Success Manager'], 'hours_old': 48}, 'San Francisco Bay Area, CA', results_wanted=3)
for j in jobs:
    print(j['title'], '|', j['url'])
"
```

Expected: 0–3 job dicts printed (may be 0 if no recent postings — that's fine).

**Step 6: Commit**

```bash
git add scripts/enrich_descriptions.py scripts/task_runner.py
git commit -m "feat: add enrich_craigslist task for LLM company/salary extraction"
```

---

## Final: push to remote

```bash
git push origin main
```

@@ -1,291 +0,0 @@

# Expanded First-Run Wizard — Design

**Date:** 2026-02-24
**Status:** Approved

---

## Goal

Replace the current 5-step surface-level wizard with a comprehensive onboarding flow that covers resume upload/parsing/building, guided config walkthroughs, LLM-assisted generation for key sections, and tier-based feature gating — while enforcing a minimum viable setup before the user can access the main app.

---

## Architecture

`0_Setup.py` becomes a thin orchestrator. All step logic moves into a new `app/wizard/` package. Resume parsing moves into `scripts/resume_parser.py`.

```
app/
├── app.py                  # gate: user.yaml exists AND wizard_complete: true
├── wizard/
│   ├── tiers.py            # tier definitions, feature gates, can_use() helper
│   ├── step_hardware.py    # Step 1: GPU detection → profile recommendation
│   ├── step_tier.py        # Step 2: free/paid/premium + dev_tier_override
│   ├── step_identity.py    # Step 3: name/email/phone/linkedin/career_summary
│   ├── step_resume.py      # Step 4: upload→parse OR guided form builder
│   ├── step_inference.py   # Step 5: LLM backend config + API keys
│   ├── step_search.py      # Step 6: job titles, locations, boards, keywords
│   └── step_integrations.py  # Step 7: optional cloud/calendar/notification services
└── pages/
    └── 0_Setup.py          # imports steps, drives progress state
scripts/
├── resume_parser.py        # PDF/DOCX text extraction → LLM structuring
└── integrations/
    ├── __init__.py         # registry: {name: IntegrationBase subclass}
    ├── base.py             # IntegrationBase: connect(), test(), sync(), fields()
    ├── notion.py
    ├── google_drive.py
    ├── google_sheets.py
    ├── airtable.py
    ├── dropbox.py
    ├── onedrive.py
    ├── mega.py
    ├── nextcloud.py
    ├── google_calendar.py
    ├── apple_calendar.py   # CalDAV
    ├── slack.py
    ├── discord.py          # webhook only
    └── home_assistant.py
config/
└── integrations/           # one gitignored yaml per connected service
    ├── notion.yaml.example
    ├── google_drive.yaml.example
    └── ...
```

---

## Gate Logic

`app.py` gate changes from a single existence check to:

```python
if not UserProfile.exists(_USER_YAML):
    show_wizard()
elif not _profile.wizard_complete:
    show_wizard()  # resumes at last incomplete mandatory step
```

`wizard_complete: false` is written to `user.yaml` at the start of Step 3 (identity). It is only flipped to `true` when all mandatory steps pass validation on the final Finish action.

---

## Mandatory Steps

The wizard cannot be exited until all six mandatory steps pass validation.

| Step | File | Minimum to pass |
|------|------|----------------|
| 1. Hardware | `step_hardware.py` | Profile selected (auto-detected default accepted) |
| 2. Tier | `step_tier.py` | Tier selected (free is valid) |
| 3. Identity | `step_identity.py` | name + email + career_summary non-empty |
| 4. Resume | `step_resume.py` | At least one work experience entry |
| 5. Inference | `step_inference.py` | At least one working LLM endpoint confirmed |
| 6. Search | `step_search.py` | At least one job title + one location |

Each mandatory step's module exports `validate(data: dict) -> list[str]` — an errors list; empty = pass. These are pure functions, fully testable without Streamlit.

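As a concrete illustration, the identity step's validator (Step 3) might look like the sketch below. The required field names follow the table above; the function body itself is an assumption about how the real module would be written:

```python
def validate(data: dict) -> list[str]:
    """Identity step: name, email, and career_summary must all be non-empty."""
    errors: list[str] = []
    for field in ("name", "email", "career_summary"):
        if not str(data.get(field, "")).strip():
            errors.append(f"{field} is required")
    return errors


print(validate({"name": "Jo", "email": "jo@example.com", "career_summary": "..."}))  # → []
```

Because the return value is a plain list of messages, the Streamlit layer only has to render `st.error` per entry and enable "Next" when the list is empty.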
---

## Tier System

### `app/wizard/tiers.py`

```python
TIERS = ["free", "paid", "premium"]

FEATURES = {
    # Wizard LLM generation
    "llm_career_summary": "paid",
    "llm_expand_bullets": "paid",
    "llm_suggest_skills": "paid",
    "llm_voice_guidelines": "premium",
    "llm_job_titles": "paid",
    "llm_keywords_blocklist": "paid",
    "llm_mission_notes": "paid",

    # App features
    "company_research": "paid",
    "interview_prep": "paid",
    "email_classifier": "paid",
    "survey_assistant": "paid",
    "model_fine_tuning": "premium",
    "shared_cover_writer_model": "paid",
    "multi_user": "premium",
    "search_profiles_limit": {"free": 1, "paid": 5, "premium": None},

    # Integrations
    "notion_sync": "paid",
    "google_sheets_sync": "paid",
    "airtable_sync": "paid",
    "google_calendar_sync": "paid",
    "apple_calendar_sync": "paid",
    "slack_notifications": "paid",
}
# Free-tier integrations: google_drive, dropbox, onedrive, mega,
# nextcloud, discord, home_assistant
```
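
The `can_use()` helper mentioned in the file tree is not shown above. A minimal sketch, assuming string-valued entries in `FEATURES` (the dict-valued `search_profiles_limit` entry would need its own lookup path) and treating unknown features as free:

```python
TIERS = ["free", "paid", "premium"]
FEATURES = {
    "llm_career_summary": "paid",
    "model_fine_tuning": "premium",
}


def can_use(feature: str, tier: str) -> bool:
    """True when `tier` meets or exceeds the feature's minimum tier."""
    required = FEATURES.get(feature, "free")
    return TIERS.index(tier) >= TIERS.index(required)


print(can_use("llm_career_summary", "premium"))  # → True
print(can_use("model_fine_tuning", "paid"))      # → False
```

Comparing list indices keeps the gate ordinal: adding a tier between paid and premium is a one-line change to `TIERS` with no per-feature edits.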

### Storage in `user.yaml`

```yaml
tier: free                   # free | paid | premium
dev_tier_override: premium   # overrides tier locally — for testing only
```

### Dev override UI

Settings → Developer tab (visible when `dev_tier_override` is set or `DEV_MODE=true` in `.env`). Single selectbox to switch tier instantly — page reruns, all gates re-evaluate, no restart needed. Also exposes a "Reset wizard" button that sets `wizard_complete: false` to re-enter the wizard without deleting existing config.

### Gated UI behaviour

Paid/premium features show a muted `tier_label()` badge (`🔒 Paid` / `⭐ Premium`) and a disabled state rather than being hidden entirely — free users see what they're missing. Clicking a locked `✨` button opens an upsell tooltip, not an error.

---

## Resume Handling (Step 4)

### Fast path — upload

1. PDF → `pdfminer.six` extracts raw text
2. DOCX → `python-docx` extracts paragraphs
3. Raw text → LLM structures into `plain_text_resume.yaml` fields via background task
4. Populated form rendered for review/correction

### Fallback — guided form builder

Walks through `plain_text_resume.yaml` section by section:
- Personal info (pre-filled from Step 3)
- Work experience (add/remove entries)
- Education
- Skills
- Achievements (optional)

Both paths converge on the same review form before saving. `career_summary` from the resume is fed back to populate Step 3 if not already set.

### Outputs

- `aihawk/data_folder/plain_text_resume.yaml`
- `career_summary` written back to `user.yaml`

---

## LLM Generation Map

All `✨` actions submit a background task via `task_runner.py` using task type `wizard_generate` with a `section` parameter. The wizard step polls via `@st.fragment(run_every=3)` and shows inline status stages. Results land in `session_state` keyed by section and auto-populate the field on completion.

**Status stages for all wizard generation tasks:**
`Queued → Analyzing → Generating → Done`

| Step | Action | Tier | Input | Output |
|------|--------|------|-------|--------|
| Identity | ✨ Generate career summary | Paid | Resume text | `career_summary` in user.yaml |
| Resume | ✨ Expand bullet points | Paid | Rough responsibility notes | Polished STAR-format bullets |
| Resume | ✨ Suggest skills | Paid | Experience descriptions | Skills list additions |
| Resume | ✨ Infer voice guidelines | Premium | Resume + uploaded cover letters | Voice/tone hints in user.yaml |
| Search | ✨ Suggest job titles | Paid | Resume + current titles | Additional title suggestions |
| Search | ✨ Suggest keywords | Paid | Resume + titles | `resume_keywords.yaml` additions |
| Search | ✨ Suggest blocklist | Paid | Resume + titles | `blocklist.yaml` additions |
| My Profile (post-wizard) | ✨ Suggest mission notes | Paid | Resume + LinkedIn URL | `mission_preferences` notes |

---

## Optional Steps — Home Banners

After wizard completion, dismissible banners on the Home page surface remaining setup. Dismissed state stored as `dismissed_banners: [...]` in `user.yaml`.

| Banner | Links to |
|--------|---------|
| Connect a cloud service | Settings → Integrations |
| Set up email sync | Settings → Email |
| Set up email labels | Settings → Email (label guide) |
| Tune your mission preferences | Settings → My Profile |
| Configure keywords & blocklist | Settings → Search |
| Upload cover letter corpus | Settings → Fine-Tune |
| Configure LinkedIn Easy Apply | Settings → AIHawk |
| Set up company research | Settings → Services (SearXNG) |
| Build a target company list | Settings → Search |
| Set up notifications | Settings → Integrations |
| Tune a model | Settings → Fine-Tune |
| Review training data | Settings → Fine-Tune |
| Set up calendar sync | Settings → Integrations |

---

## Integrations Architecture

The registry pattern means adding a new integration requires one file in `scripts/integrations/` and one `.yaml.example` in `config/integrations/` — the wizard and Settings tab auto-discover it.

```python
class IntegrationBase:
    name: str
    label: str
    tier: str

    def connect(self, config: dict) -> bool: ...
    def test(self) -> bool: ...
    def sync(self, jobs: list[dict]) -> int: ...
    def fields(self) -> list[dict]: ...  # form field definitions for wizard card
```

Integration configs written to `config/integrations/<name>.yaml` only after a successful `test()` — never on partial input.

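One way the `{name: subclass}` registry in `__init__.py` could be built is from `IntegrationBase.__subclasses__()` after the integration modules are imported. This is a sketch under that assumption (a real version would first import every module in the package, e.g. via `pkgutil`); the dummy subclasses exist only to demonstrate discovery:

```python
class IntegrationBase:
    name = ""
    label = ""
    tier = "free"


def discover() -> dict:
    """Build {name: class} from every IntegrationBase subclass currently imported."""
    return {cls.name: cls for cls in IntegrationBase.__subclasses__() if cls.name}


class Discord(IntegrationBase):
    name, label, tier = "discord", "Discord", "free"


class Notion(IntegrationBase):
    name, label, tier = "notion", "Notion", "paid"


print(sorted(discover()))  # → ['discord', 'notion']
```

With this shape, the wizard card list and the Settings tab can both render from `discover()` and gate each entry on its `tier` attribute, so no UI code changes when an integration is added.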
### v1 Integration List

| Integration | Purpose | Tier |
|-------------|---------|------|
| Notion | Job tracking DB sync | Paid |
| Notion Calendar | Covered by Notion integration | Paid |
| Google Sheets | Simpler tracker alternative | Paid |
| Airtable | Alternative tracker | Paid |
| Google Drive | Resume/cover letter storage | Free |
| Dropbox | Document storage | Free |
| OneDrive | Document storage | Free |
| MEGA | Document storage (privacy-first, cross-platform) | Free |
| Nextcloud | Self-hosted document storage | Free |
| Google Calendar | Write interview dates | Paid |
| Apple Calendar | Write interview dates (CalDAV) | Paid |
| Slack | Stage change notifications | Paid |
| Discord | Stage change notifications (webhook) | Free |
| Home Assistant | Notifications + automations (self-hosted) | Free |

---

## Data Flow

```
Wizard step        → Written to
──────────────────────────────────────────────────────────────
Hardware           → user.yaml (inference_profile)
Tier               → user.yaml (tier, dev_tier_override)
Identity           → user.yaml (name, email, phone, linkedin,
                     career_summary, wizard_complete: false)
Resume (upload)    → aihawk/data_folder/plain_text_resume.yaml
Resume (builder)   → aihawk/data_folder/plain_text_resume.yaml
Inference          → user.yaml (services block)
                     .env (ANTHROPIC_API_KEY, OPENAI_COMPAT_URL/KEY)
Search             → config/search_profiles.yaml
                     config/resume_keywords.yaml
                     config/blocklist.yaml
Finish             → user.yaml (wizard_complete: true)
                     config/llm.yaml (via apply_service_urls())
Integrations       → config/integrations/<name>.yaml (per service,
                     only after successful test())
Background tasks   → staging.db background_tasks table
LLM results        → session_state[section] → field → user saves step
```

**Key rules:**
- Each mandatory step writes immediately on "Next" — partial progress survives crash or browser close
- `apply_service_urls()` called once at Finish, not per-step
- Integration configs never written on partial input — only after `test()` passes

---

## Testing

- **Tier switching:** Settings → Developer tab selectbox — instant rerun, no restart
- **Wizard re-entry:** Settings → Developer "Reset wizard" button sets `wizard_complete: false`
- **Unit tests:** `validate(data) -> list[str]` on each step module — pure functions, no Streamlit
- **Integration tests:** `tests/test_wizard_flow.py` — full step sequence with mock LLM router and mock file writes
- **`DEV_MODE=true`** in `.env` makes Developer tab always visible regardless of `dev_tier_override`

@@ -1,108 +0,0 @@

# Session Handoff — Generalization Implementation

**Date:** 2026-02-24
**For:** Next Claude session implementing the public fork

---

## Current State

The personal version (`/devl/job-seeker/`) is **complete and working** on `main`.

### What was completed in the 2026-02-24 session

- Survey Assistant page (`app/pages/7_Survey.py`) — text paste + screenshot via moondream2
- Vision Service (`scripts/vision_service/`) — FastAPI on port 8002, `job-seeker-vision` conda env
- LLM Router `images=` parameter — vision-aware routing
- `survey_responses` table + `survey_at` column in SQLite
- Kanban consolidation — applied+survey as pre-kanban section; offer+hired merged column
- `survey_received` email classifier label
- Forgejo remote: https://git.opensourcesolarpunk.com/pyr0ball/job-seeker.git

### Remote repo

```
git remote: https://git.opensourcesolarpunk.com/pyr0ball/job-seeker.git
branch:     main (up to date as of 2026-02-24)
```

---


## What to Implement Next

Follow the plan at `docs/plans/2026-02-24-job-seeker-app-generalize.md`.
The design doc is at `docs/plans/2026-02-24-generalize-design.md`.

**Target directory:** `/Library/Development/devl/job-seeker-app/` (new repo, no shared history)

**CRITICAL:** Do NOT start implementing the public fork until explicitly asked. The user confirmed this.

---

## Complete List of Hardcoded Personal References

Everything that must be extracted into `config/user.yaml` via a `UserProfile` class:

| File | Hardcoded value | Generalized as |
|------|----------------|----------------|
| `company_research.py` | `"Alex Rivera"` in prompts | `profile.name` |
| `company_research.py` | `_NDA_COMPANIES = {"upguard"}` | `profile.nda_companies` |
| `company_research.py` | `_SCRAPER_DIR = Path("/Library/...")` | bundled in Docker image |
| `generate_cover_letter.py` | `SYSTEM_CONTEXT` with Alex's bio | `profile.career_summary` |
| `generate_cover_letter.py` | `LETTERS_DIR = Path("/Library/...")` | `profile.docs_dir` |
| `4_Apply.py` | contact block (name/email/phone) | `profile.*` |
| `4_Apply.py` | `DOCS_DIR = Path("/Library/...")` | `profile.docs_dir` |
| `5_Interviews.py` | email assistant persona "Alex Rivera is a Customer Success..." | `profile.name + profile.career_summary` |
| `6_Interview_Prep.py` | `"Alex"` in interviewer prompts | `profile.name` |
| `7_Survey.py` | `_SURVEY_SYSTEM` — "The candidate values collaborative teamwork..." | `profile.career_summary` or survey persona field |
| `scripts/vision_service/main.py` | `model_id = "vikhyatk/moondream2"`, `revision = "2025-01-09"` | `config/llm.yaml` vision_service block |
| `match.py` | `RESUME_PATH = Path("/Library/...Alex_Rivera_Resume...")` | configurable in Settings |
| `Home.py` | `"Alex's Job Search"` | `f"{profile.name}'s Job Search"` |
| `finetune_local.py` | all `/Library/` paths + `"alex-cover-writer"` | `profile.*` |
| `2_Settings.py` | `PFP_DIR`, host service paths (manage-services.sh etc.) | removed / compose-driven |
| `config/llm.yaml` | hard-coded `base_url` values | auto-generated from `user.yaml` |

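The `UserProfile` wrapper itself can be tiny. A minimal sketch of the attribute-access pattern the table assumes (the real class would load its dict with `yaml.safe_load` from `config/user.yaml`; here the parsed dict is passed in directly to keep the example self-contained):

```python
class UserProfile:
    """Sketch: wraps the parsed user.yaml dict with attribute access."""

    def __init__(self, data: dict):
        self._data = data

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so _data is safe.
        try:
            return self._data[key]
        except KeyError:
            raise AttributeError(key)


profile = UserProfile({"name": "Jane Doe", "nda_companies": ["ExampleCo"]})
print(profile.name)  # → Jane Doe
```

Attribute access (`profile.name` rather than `profile["name"]`) keeps every call site in the table above a mechanical one-line substitution for the hardcoded literal it replaces.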
---

## New Components to Dockerize

### Vision Service

- Currently: `job-seeker-vision` conda env, port 8002, `manage-vision.sh`
- In public fork: separate container in `single-gpu` / `dual-gpu` profiles only
- In `remote` / `cpu` profiles: vision falls back to cloud backends
- Model configurable via env var in container (default: moondream2)

### CompanyScraper

- Currently: `/Library/Development/scrapers/companyScraper.py` (external path)
- In public fork: bundled directly in the app image at a fixed internal path

---

## Key Architectural Decisions (from design doc)

1. **`UserProfile` class** wraps `config/user.yaml` — imported everywhere personal data is used
2. **Four Docker Compose profiles:** `remote`, `cpu`, `single-gpu`, `dual-gpu`
3. **First-run wizard** gates the app until `config/user.yaml` exists (5-step flow)
4. **No shared git history** with personal repo — fresh `git init` in target dir
5. **`.env` file** generated by wizard (never hand-edited), gitignored, contains resolved paths
6. **`config/llm.yaml` base URLs** are derived values auto-generated from `user.yaml` services block
7. **Claude Code Wrapper + Copilot Wrapper** removed from Services tab entirely

---

## Files/Paths in Personal Repo to Reference

- Entry point: `app/app.py`
- All pages: `app/pages/`
- DB helpers: `scripts/db.py` (single source of truth for schema)
- LLM router: `scripts/llm_router.py`
- Config: `config/llm.yaml`, `config/search_profiles.yaml`
- Vision service: `scripts/vision_service/` (FastAPI + environment.yml)
- Test suite: `tests/`

---

## Skill to Use

When starting the generalization session:

1. Load `superpowers:executing-plans` skill
2. Reference `docs/plans/2026-02-24-job-seeker-app-generalize.md` as the plan
3. Work task-by-task with review checkpoints

@@ -1,276 +0,0 @@

# Design: Generalizing Job Seeker for Public Use

**Date:** 2026-02-24
**Status:** Approved
**Target directory:** `/Library/Development/devl/job-seeker-app/`

---

## Overview

Fork the personal job-seeker app into a fully generalized version suitable for any job seeker.
The personal version (`/devl/job-seeker/`) is preserved as-is on `main`.
The public version is a separate local directory with a fresh git repo — no shared history.

Core goals:
- Extract every hard-coded personal reference into a `config/user.yaml` profile
- Docker Compose stack with profiles covering all GPU/inference configurations
- First-run wizard that gates the app until the user is configured
- Optional fine-tune wizard in Settings for users with a cover letter corpus and a GPU

---

## Architecture

The app runs via `docker compose` with four named profiles:

| Profile | Containers | Use case |
|---|---|---|
| `remote` | app + searxng | No GPU; all LLM calls go to external APIs |
| `cpu` | app + ollama + searxng | No GPU; local models run on CPU (slow) |
| `single-gpu` | app + ollama + searxng | One GPU shared for cover letters + research |
| `dual-gpu` | app + ollama + vllm + searxng | GPU 0 = Ollama, GPU 1 = vLLM |

**SearXNG always runs** regardless of profile — it's lightweight and useful in every mode.

**Vision Service** runs as a separate container only in `single-gpu` and `dual-gpu` profiles.
In `remote` profile, vision falls back to `claude_code` / `anthropic` backends.
In `cpu` profile, vision falls back to cloud backends (moondream2 on CPU is impractically slow).

SQLite lives in a bind-mounted `./data/` directory on the host. No separate DB container.

CompanyScraper (`companyScraper.py`) is bundled directly into the app image — no external
path dependency on the host.

The Claude Code Wrapper and GitHub Copilot Wrapper service entries are removed from the
Services tab entirely. Users bring their own OpenAI-compatible endpoints via `config/llm.yaml`.

---

## User Profile (`config/user.yaml`)

Single source of truth for all personal data. Checked at startup — if absent, the first-run
wizard is shown before any other page is accessible.

```yaml
# Identity — drives all LLM personas, PDF headers, UI labels
name: ""
email: ""
phone: ""
linkedin: ""
career_summary: ""            # paragraph injected into cover letter system prompt

# Sensitive employers — masked in research briefs
nda_companies: []             # e.g. ["UpGuard"] → "enterprise security vendor (NDA)"

# Local file paths
docs_dir: "~/Documents/JobSearch"      # cover letter PDFs + corpus
ollama_models_dir: "~/models/ollama"   # maps to OLLAMA_MODELS in container
vllm_models_dir: "~/models/vllm"       # mounted into vllm container

# Active hardware profile
inference_profile: "remote"   # remote | cpu | single-gpu | dual-gpu

# Service connection config
services:
  streamlit_port: 8501

  ollama_host: localhost
  ollama_port: 11434
  ollama_ssl: false
  ollama_ssl_verify: true     # set false for self-signed certs

  vllm_host: localhost
  vllm_port: 8000
  vllm_ssl: false
  vllm_ssl_verify: true

  searxng_host: localhost
  searxng_port: 8888
  searxng_ssl: false
  searxng_ssl_verify: true
```

All service base URLs in `config/llm.yaml` are **derived values** — auto-generated from the
`services` block whenever the user saves their profile. Users never hand-edit URLs.

Health checks in the Services tab switch from raw TCP socket checks to
`requests.get(url, verify=ssl_verify)` so they work against HTTPS endpoints and self-signed certs.

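The URL derivation can be sketched in a few lines. The `service_base_url` name here is hypothetical — `apply_service_urls()` (the function named in the wizard design) would presumably call something like it once per service when regenerating `config/llm.yaml`:

```python
def service_base_url(services: dict, name: str) -> str:
    """Derive a service base URL from the user.yaml services block."""
    scheme = "https" if services.get(f"{name}_ssl") else "http"
    host = services.get(f"{name}_host", "localhost")
    port = services.get(f"{name}_port")
    return f"{scheme}://{host}:{port}"


services = {"ollama_host": "localhost", "ollama_port": 11434, "ollama_ssl": False}
print(service_base_url(services, "ollama"))  # → http://localhost:11434
```

Deriving URLs from the flat `<name>_host/_port/_ssl` keys keeps the wizard forms simple (three fields per service) while guaranteeing `llm.yaml` can never drift out of sync with the profile.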
---

## First-Run Wizard

A dedicated Streamlit page shown instead of the normal navigation when `config/user.yaml` is absent.
Five steps with a progress bar; every step writes to a staging dict, which is committed to disk only on
the final step.

### Step 1 — Hardware Detection
- Auto-detect CUDA GPUs via `nvidia-smi` or `torch.cuda.device_count()`
- Check NVIDIA Container Toolkit availability (`docker info | grep nvidia`)
- Suggest a profile based on findings; the user can override
- Warn if the suggested profile requires a toolkit that is not installed, with a link to the docs
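The suggestion logic behind Step 1 can be reduced to a small pure function. A sketch under the assumption that GPU count and toolkit availability are detected separately; the names are illustrative, not the wizard's actual API:

```python
import shutil
import subprocess


def detect_gpu_count() -> int:
    """Count CUDA GPUs via nvidia-smi; returns 0 when the tool is absent or fails."""
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"], capture_output=True, text=True, check=True
        ).stdout
        return len([line for line in out.splitlines() if line.strip()])
    except (subprocess.CalledProcessError, OSError):
        return 0


def suggest_profile(gpu_count: int, toolkit_ok: bool) -> str:
    """Map detected hardware to one of: remote | cpu | single-gpu | dual-gpu."""
    if gpu_count >= 2 and toolkit_ok:
        return "dual-gpu"
    if gpu_count == 1 and toolkit_ok:
        return "single-gpu"
    return "cpu"  # the wizard still lets the user pick "remote" instead
```

Keeping `suggest_profile` pure makes the override-and-warn behaviour trivial to unit test.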

### Step 2 — Identity
- Name, email, phone, LinkedIn URL
- Career summary (multi-line text area): used as the LLM cover letter persona
- Example placeholder text drawn from the resume profile YAML if AIHawk is present

### Step 3 — Sensitive Employers
- Optional; skip button prominent
- Chip-based add/remove (same UI as the Skills tab)
- Explanation: "Employers listed here will appear as 'previous employer (NDA)' in research briefs"

### Step 4 — Inference & API Keys
- Shows only the fields relevant to the selected profile
- `remote`: Anthropic API key, optional OpenAI-compatible endpoint URL + key
- `cpu` / `single-gpu` / `dual-gpu`: Ollama model name for cover letters, vLLM model path
- Port/host/SSL fields for each active service (collapsed under "Advanced" by default)

### Step 5 — Notion (Optional)
- Integration token + database ID
- Test connection button
- Skip button prominent; can be configured later in Settings

**On completion:** writes `config/user.yaml` and `config/notion.yaml` (if provided),
auto-generates the `config/llm.yaml` base URLs from the service config, and redirects to Home.

---

## Settings Changes

### New: My Profile tab
Editable form for all `user.yaml` fields post-setup. Saving regenerates the `config/llm.yaml`
base URLs automatically. Replaces the scattered "Alex's" references in existing tab captions.

### Updated: Services tab
- Reads port/host from `profile.services.*` instead of hard-coded values
- Start/stop commands switch to `docker compose --profile <profile> up/stop <service>`
- Health checks use `requests.get` with SSL support
- Claude Code Wrapper and Copilot Wrapper entries removed
- vLLM model dir reads from `profile.vllm_models_dir`
- SearXNG Docker cwd replaced with a compose command (no host path needed)

### New: Fine-Tune Wizard tab (optional, GPU only)
Shown only when `inference_profile` is `single-gpu` or `dual-gpu`.

1. **Upload corpus** — drag-and-drop cover letters (PDF, DOCX, TXT)
2. **Preview pairs** — shows extracted (job description snippet → cover letter) training pairs;
   the user can remove bad examples
3. **Configure & train** — base model selector (defaults to the currently loaded Ollama model),
   epochs slider; runs `finetune_local.py` as a background task
4. **Register** — on completion, runs `ollama create <username>-cover-writer -f Modelfile` and
   updates `config/llm.yaml` to use the new model

Skipped entirely in the `remote` and `cpu` profiles, with a clear explanation.

---

## Code Changes — Hard-Coded Reference Extraction

A `UserProfile` class (a thin wrapper around `config/user.yaml`) is imported wherever
personal data is currently hard-coded.

| Location | Current | Generalized |
|---|---|---|
| `company_research.py` prompts | `"Alex Rivera"` | `profile.name` |
| `company_research.py` | `_NDA_COMPANIES = {"upguard"}` | `profile.nda_companies` |
| `company_research.py` | `_SCRAPER_DIR = Path("/Library/...")` | bundled in container |
| `generate_cover_letter.py` | `SYSTEM_CONTEXT` with Alex's bio | `profile.career_summary` |
| `generate_cover_letter.py` | `LETTERS_DIR = Path("/Library/...")` | `profile.docs_dir` |
| `generate_cover_letter.py` | `_MISSION_SIGNALS` / `_MISSION_NOTES` (hardcoded) | `profile.mission_industries` list; First-Run Wizard step |
| `4_Apply.py` | contact block with name/email/phone | `profile.*` |
| `4_Apply.py` | `DOCS_DIR = Path("/Library/...")` | `profile.docs_dir` |
| `5_Interviews.py` email assistant | `"Alex Rivera is a Customer Success..."` | `profile.name + profile.career_summary` |
| `6_Interview_Prep.py` | `"Alex"` in interviewer prompts | `profile.name` |
| `7_Survey.py` `_SURVEY_SYSTEM` | "The candidate values collaborative teamwork, clear communication, growth, and impact." | `profile.career_summary` or user-editable survey persona field |
| `scripts/vision_service/main.py` | `model_id = "vikhyatk/moondream2"`, `revision = "2025-01-09"` | configurable in `config/llm.yaml` vision_service block |
| `match.py` | `RESUME_PATH = Path("/Library/...Alex_Rivera_Resume...")` | configurable in Settings |
| `Home.py` | `"Alex's Job Search"` | `f"{profile.name}'s Job Search"` |
| `finetune_local.py` | all `/Library/` paths + `"alex-cover-writer"` | `profile.*` |
| `2_Settings.py` | `PFP_DIR`, hard-coded service paths | removed / compose-driven |
| `config/llm.yaml` | hard-coded `base_url` values | auto-generated from `user.yaml` |
| `config/search_profiles.yaml` | `mission_tags` on profiles (implicit) | `profile.mission_industries` drives profile generation in wizard |
| `config/adzuna.yaml` | per-user API credentials | First-Run Wizard step → `config/adzuna.yaml` (gitignored) |

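A minimal sketch of what that wrapper could look like. The attribute names follow the `user.yaml` keys above; the loading and NDA-masking shown are one plausible implementation, not the actual class:

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Thin read-only view over the parsed config/user.yaml dict."""
    name: str = ""
    career_summary: str = ""
    nda_companies: list = field(default_factory=list)
    docs_dir: str = "~/Documents/JobSearch"

    @classmethod
    def from_dict(cls, data: dict) -> "UserProfile":
        # Unknown keys in user.yaml are ignored rather than raising.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def mask_company(self, text: str) -> str:
        """Replace NDA employers in research briefs with a neutral phrase."""
        for company in self.nda_companies:
            text = text.replace(company, "previous employer (NDA)")
        return text
```

In practice `from_dict` would be fed the result of `yaml.safe_load` on `config/user.yaml`; keeping parsing outside the class makes it trivial to test.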
### New fields needed in `config/user.yaml` (generalization)

```yaml
# Mission-aligned industries — drives cover letter Para 3 and research accessibility section
# Options: music, animal_welfare, education (extensible)
mission_industries: []

# Accessibility priority — adds Inclusion & Accessibility section to every research brief.
# This is for the candidate's personal decision-making; never disclosed in applications.
accessibility_priority: true

# Custom board API credentials
custom_boards:
  adzuna:
    app_id: ""
    app_key: ""
  # theladders: no credentials needed (curl_cffi scraper)
```

The First-Run Wizard gains a **Step 2b — Personal Preferences** screen (between Identity and Sensitive Employers):
- Checkboxes for preferred industries (Music, Animal Welfare, Education, Other...)
- "Other" opens a free-text field to add custom industry signals
- Accessibility priority toggle (on by default, explaining what it does: "Adds an accessibility assessment to every company research brief so you can evaluate companies on your own terms. This information stays private — it's never sent to employers.")
- Custom board credentials (Adzuna app ID/key) with a "Test" button

---

## Docker Compose Structure

```
compose.yml          # all services + profiles
.env                 # generated by wizard (resolved paths, ports)
Dockerfile           # app image (Streamlit + companyScraper bundled)
docker/
  searxng/
    settings.yml     # pre-configured for JSON format output
  ollama/
    entrypoint.sh    # pulls default model on first start if none present
```

GPU passthrough uses `deploy.resources.reservations.devices` (NVIDIA Container Toolkit).
The wizard warns and links to the install docs if the toolkit is missing when a GPU profile is selected.

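For reference, the passthrough stanza in `compose.yml` would look roughly like this. It is the standard Compose GPU reservation syntax; the service name and model mount are assumptions, not the project's actual compose file:

```yaml
services:
  vllm:                            # service name is illustrative
    image: vllm/vllm-openai:latest
    volumes:
      - ${VLLM_MODELS_DIR}:/models # resolved path from the generated .env
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all           # or 1 for the single-gpu profile
              capabilities: [gpu]
```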
The `.env` file is generated (never hand-edited) and gitignored. It contains resolved
absolute paths for volume mounts (tilde-expanded from `user.yaml`) and port numbers.

---

## Out of Scope (this version)

- conda + local install path (future track)
- Multi-user / auth (single-user app)
- PostgreSQL migration (SQLite is sufficient)
- Windows support
- AIHawk LinkedIn Easy Apply generalization (too tightly coupled to personal config)

---

## Backlog — Custom Job Source Scrapers

These boards are not supported by JobSpy and would need custom scrapers plugged into `scripts/discover.py`:

| Priority | Site | Notes |
|----------|------|-------|
| 1 | [Adzuna](https://www.adzuna.com) | Free public API (api.adzuna.com) — cleanest integration path |
| 2 | [The Ladders](https://www.theladders.com) | Focuses on $100K+ roles — good signal-to-noise for senior CS/ops positions |
| 3 | Craigslist | HTML scrape, highly inconsistent by region; likely needs its own dedicated ingestion queue separate from the main discovery run |
| — | Monster.com | Low priority — requires session/auth, likely needs Playwright; skip until the others are done |

**Integration pattern:** Each custom source should return the same `pd.DataFrame` schema as JobSpy (`title`, `company`, `job_url`, `location`, `is_remote`, `description`, `site`) so `run_discovery` can consume it without changes. Cleanest as a separate `scripts/custom_boards/` module.

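A skeleton for such an adapter. The function name and input shape are illustrative (`raw_rows` stands in for whatever a board's API returns); the point is that every adapter normalizes to the same columns JobSpy emits:

```python
import pandas as pd

# The shared schema every custom board adapter must return.
JOBSPY_COLUMNS = ["title", "company", "job_url", "location", "is_remote", "description", "site"]


def fetch_board_stub(site_name: str, raw_rows: list[dict]) -> pd.DataFrame:
    """Normalize a board's raw results into the JobSpy-compatible frame.

    Missing fields are filled with empty values so run_discovery can
    concatenate frames from different boards without special cases.
    """
    df = pd.DataFrame(raw_rows)
    for col in JOBSPY_COLUMNS:
        if col not in df.columns:
            df[col] = "" if col != "is_remote" else False
    df["site"] = site_name
    return df[JOBSPY_COLUMNS]
```

A real Adzuna adapter would build `raw_rows` from the `api.adzuna.com` search endpoint and map its field names into this schema before returning.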
**LLM-guided profile setup wizard** (for the generic build): a first-run wizard that walks a new user through their work history and desired search terms, auto-generating `plain_text_resume.yaml` and `search_profiles.yaml`. See the First-Run Wizard section above for the hardware/identity/inference steps; this extends Step 2 with a career interview flow.

---

## Migration from Personal Version

No automated migration. The personal version stays in its own repo. If the user wants to
carry over their `staging.db`, `config/*.yaml`, or cover letter corpus, they copy them manually.
The wizard's field defaults can be pre-populated from the personal version's config files
if detected at a well-known path — but this is a nice-to-have, not a requirement.

---

# Design: Job Ingestion Improvements

**Date:** 2026-02-24
**Status:** Approved

---

## Overview

Three improvements to how jobs enter the pipeline:

1. **Auto-parse LinkedIn Job Alert emails** — digest emails from `jobalerts-noreply@linkedin.com`
   contain multiple structured job cards in plain text. They are currently ingested as a single
   confusing email lead. Instead, parse each card into a separate pending job and scrape it via a
   background task.

2. **`scrape_url` background task** — a new task type that takes a job record's URL, fetches
   the full listing (title, company, description, salary, location), and updates the job row.
   Shared by both the LinkedIn alert parser and the manual URL import feature.

3. **Add Job(s) by URL on the Home page** — paste one URL per line, or upload a CSV with a URL
   column. Each URL is inserted as a pending job and queued for background scraping.

---

## `scrape_url` Worker (`scripts/scrape_url.py`)

Single public function: `scrape_job_url(db_path, job_id) -> dict`

Board detection from the URL hostname:

| URL pattern | Board | Scrape method |
|---|---|---|
| `linkedin.com/jobs/view/<id>/` | LinkedIn | LinkedIn guest jobs API (`/jobs-guest/jobs/api/jobPosting/<id>`) |
| `indeed.com/viewjob?jk=<key>` | Indeed | requests + BeautifulSoup HTML parse |
| `glassdoor.com/...` | Glassdoor | JobSpy internal scraper (same as `enrich_descriptions.py`) |
| anything else | generic | requests + JSON-LD → og: tags fallback |

On success: `UPDATE jobs SET title, company, description, salary, location, is_remote WHERE id=?`
On failure: the job remains pending with its URL intact — the user can still approve/reject it.

Requires a new `update_job_fields(db_path, job_id, fields: dict)` helper in `db.py`.

---

## LinkedIn Alert Parser (`imap_sync.py`)

New function: `parse_linkedin_alert(body: str) -> list[dict]`

The plain-text body has a reliable block structure:
```
<Title>
<Company>
<Location>
[optional social proof lines like "2 school alumni"]
View job: https://www.linkedin.com/comm/jobs/view/<ID>/?<tracking>

---------------------------------------------------------

<next job block...>
```

Parser steps:
1. Split on lines of 10+ dashes
2. For each block: filter out social-proof lines (alumni, "Apply with", "actively hiring", etc.)
3. Extract: title (line 1), company (line 2), location (line 3), URL (the line starting with "View job:")
4. Canonicalize the URL: strip tracking params → `https://www.linkedin.com/jobs/view/<id>/`

Detection in `_scan_unmatched_leads`: if `from_addr` contains
`jobalerts-noreply@linkedin.com`, skip the LLM path and call `parse_linkedin_alert` instead.
Each parsed card → `insert_job()` + `submit_task(db, "scrape_url", job_id)`.
The email itself is not stored as an email lead — it's a batch import trigger.

---

## Home Page URL Import

A new section on `app/Home.py` between Email Sync and the Danger Zone.

Two tabs:
- **Paste URLs** — `st.text_area`, one URL per line
- **Upload CSV** — `st.file_uploader`, auto-detects the first column whose values start with `http`

Both routes call a shared `_queue_url_imports(db_path, urls)` helper that:
1. Filters out URLs already in the DB (dedup by URL)
2. Calls `insert_job({title="Importing…", source="manual", url=url, ...})`
3. Calls `submit_task(db, "scrape_url", job_id)` per new job
4. Shows `st.success(f"Queued N job(s)")`

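The dedup step is the only non-trivial part of that helper. A pure sketch, with `existing_urls` standing in for the result of a `SELECT url FROM jobs` query (the function name is illustrative):

```python
def filter_new_urls(candidates: list[str], existing_urls: set[str]) -> list[str]:
    """Return candidate URLs not already in the DB, deduped and order-preserving."""
    seen = set(existing_urls)
    fresh = []
    for url in candidates:
        url = url.strip()
        # Skip blanks, non-URLs, DB duplicates, and repeats within the paste itself.
        if not url.startswith("http") or url in seen:
            continue
        seen.add(url)
        fresh.append(url)
    return fresh
```

Keeping this pure lets both the paste and CSV tabs share it and makes the dedup behaviour testable without a database.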
A `@st.fragment(run_every=3)` status block below the form polls active `scrape_url` tasks
and shows per-job status (⏳ / ✅ / ❌ title - company).

---

## Search Settings (already applied)

`config/search_profiles.yaml`:
- `hours_old: 120 → 240` (to cover LinkedIn's algorithm-sorted alerts)
- `results_per_board: 50 → 75`
- Added title: `Customer Engagement Manager`

---

## Out of Scope

- Scraping all 551 historical LinkedIn alert emails (run email sync going forward)
- Deduplication against Notion (URL dedup in SQLite is sufficient)
- Authentication-required boards (Indeed Easy Apply, etc.)

---

# Job Ingestion Improvements — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Auto-parse LinkedIn Job Alert digest emails into multiple pending jobs, add a `scrape_url` background task that fills in job details from a URL, and add a Home page widget for manual URL/CSV import.

**Architecture:** New `scripts/scrape_url.py` worker + `update_job_fields` DB helper → `scrape_url` task type in `task_runner.py` → consumed by both the LinkedIn alert parser in `imap_sync.py` and the new Home page URL import section.

**Tech Stack:** Python 3.12, Streamlit, SQLite, requests, BeautifulSoup4, JobSpy (internal scrapers), existing imap_sync patterns

**Reference:** Design doc at `docs/plans/2026-02-24-job-ingestion-design.md`

---

## Task 1: DB helper — `update_job_fields`

**Files:**
- Modify: `scripts/db.py`
- Test: `tests/test_db.py`

**Step 1: Write the failing test**

Add to `tests/test_db.py`:

```python
def test_update_job_fields(tmp_path):
    from scripts.db import init_db, insert_job, update_job_fields
    db = tmp_path / "test.db"
    init_db(db)
    job_id = insert_job(db, {
        "title": "Importing…", "company": "", "url": "https://example.com/job/1",
        "source": "manual", "location": "", "description": "", "date_found": "2026-02-24",
    })
    update_job_fields(db, job_id, {
        "title": "Customer Success Manager",
        "company": "Acme Corp",
        "location": "San Francisco, CA",
        "description": "Great role.",
        "salary": "$120k",
        "is_remote": 1,
    })
    import sqlite3
    conn = sqlite3.connect(db)
    conn.row_factory = sqlite3.Row  # needed so dict(row) works below
    row = dict(conn.execute("SELECT * FROM jobs WHERE id=?", (job_id,)).fetchone())
    conn.close()
    assert row["title"] == "Customer Success Manager"
    assert row["company"] == "Acme Corp"
    assert row["description"] == "Great role."
    assert row["is_remote"] == 1


def test_update_job_fields_ignores_unknown_columns(tmp_path):
    from scripts.db import init_db, insert_job, update_job_fields
    db = tmp_path / "test.db"
    init_db(db)
    job_id = insert_job(db, {
        "title": "Importing…", "company": "", "url": "https://example.com/job/2",
        "source": "manual", "location": "", "description": "", "date_found": "2026-02-24",
    })
    # Should not raise even with an unknown column
    update_job_fields(db, job_id, {"title": "Real Title", "nonexistent_col": "ignored"})
    import sqlite3
    conn = sqlite3.connect(db)
    conn.row_factory = sqlite3.Row
    row = dict(conn.execute("SELECT * FROM jobs WHERE id=?", (job_id,)).fetchone())
    conn.close()
    assert row["title"] == "Real Title"
```

**Step 2: Run the tests to verify they fail**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py::test_update_job_fields tests/test_db.py::test_update_job_fields_ignores_unknown_columns -v
```
Expected: FAIL — `ImportError: cannot import name 'update_job_fields'`

**Step 3: Implement `update_job_fields` in `scripts/db.py`**

Add after `update_cover_letter`:

```python
_UPDATABLE_JOB_COLS = {
    "title", "company", "url", "source", "location", "is_remote",
    "salary", "description", "match_score", "keyword_gaps",
}


def update_job_fields(db_path: Path = DEFAULT_DB, job_id: int = None,
                      fields: dict = None) -> None:
    """Update arbitrary job columns. Unknown keys are silently ignored."""
    if not job_id or not fields:
        return
    safe = {k: v for k, v in fields.items() if k in _UPDATABLE_JOB_COLS}
    if not safe:
        return
    conn = sqlite3.connect(db_path)
    sets = ", ".join(f"{col} = ?" for col in safe)
    conn.execute(
        f"UPDATE jobs SET {sets} WHERE id = ?",
        (*safe.values(), job_id),
    )
    conn.commit()
    conn.close()
```

**Step 4: Run the tests to verify they pass**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_db.py::test_update_job_fields tests/test_db.py::test_update_job_fields_ignores_unknown_columns -v
```
Expected: PASS

**Step 5: Commit**

```bash
git add scripts/db.py tests/test_db.py
git commit -m "feat: add update_job_fields helper to db.py"
```

---

## Task 2: `scripts/scrape_url.py` + `task_runner.py` integration

**Files:**
- Create: `scripts/scrape_url.py`
- Modify: `scripts/task_runner.py`
- Test: `tests/test_scrape_url.py`

**Step 1: Write the failing tests**

Create `tests/test_scrape_url.py`:

```python
"""Tests for URL-based job scraping."""
from unittest.mock import patch, MagicMock


def _make_db(tmp_path, url="https://www.linkedin.com/jobs/view/99999/"):
    from scripts.db import init_db, insert_job
    db = tmp_path / "test.db"
    init_db(db)
    job_id = insert_job(db, {
        "title": "Importing…", "company": "", "url": url,
        "source": "manual", "location": "", "description": "", "date_found": "2026-02-24",
    })
    return db, job_id


def test_canonicalize_url_linkedin():
    from scripts.scrape_url import canonicalize_url
    messy = (
        "https://www.linkedin.com/jobs/view/4376518925/"
        "?trk=eml-email_job_alert&refId=abc%3D%3D&trackingId=xyz"
    )
    assert canonicalize_url(messy) == "https://www.linkedin.com/jobs/view/4376518925/"


def test_canonicalize_url_linkedin_comm():
    from scripts.scrape_url import canonicalize_url
    comm = "https://www.linkedin.com/comm/jobs/view/4376518925/?trackingId=abc"
    assert canonicalize_url(comm) == "https://www.linkedin.com/jobs/view/4376518925/"


def test_canonicalize_url_generic_strips_utm():
    from scripts.scrape_url import canonicalize_url
    url = "https://jobs.example.com/post/42?utm_source=linkedin&utm_medium=email&jk=real_param"
    result = canonicalize_url(url)
    assert "utm_source" not in result
    assert "real_param" in result


def test_detect_board_linkedin():
    from scripts.scrape_url import _detect_board
    assert _detect_board("https://www.linkedin.com/jobs/view/12345/") == "linkedin"
    assert _detect_board("https://linkedin.com/jobs/view/12345/?tracking=abc") == "linkedin"


def test_detect_board_indeed():
    from scripts.scrape_url import _detect_board
    assert _detect_board("https://www.indeed.com/viewjob?jk=abc123") == "indeed"


def test_detect_board_glassdoor():
    from scripts.scrape_url import _detect_board
    assert _detect_board("https://www.glassdoor.com/job-listing/foo-bar-123.htm") == "glassdoor"


def test_detect_board_generic():
    from scripts.scrape_url import _detect_board
    assert _detect_board("https://jobs.example.com/posting/42") == "generic"


def test_extract_linkedin_job_id():
    from scripts.scrape_url import _extract_linkedin_job_id
    assert _extract_linkedin_job_id("https://www.linkedin.com/jobs/view/4376518925/") == "4376518925"
    assert _extract_linkedin_job_id("https://www.linkedin.com/comm/jobs/view/4376518925/?tracking=x") == "4376518925"
    assert _extract_linkedin_job_id("https://example.com/no-id") is None


def test_scrape_linkedin_updates_job(tmp_path):
    db, job_id = _make_db(tmp_path)

    linkedin_html = """<html><head></head><body>
<h2 class="top-card-layout__title">Customer Success Manager</h2>
<a class="topcard__org-name-link">Acme Corp</a>
<span class="topcard__flavor--bullet">San Francisco, CA</span>
<div class="show-more-less-html__markup">Exciting CSM role with great benefits.</div>
</body></html>"""

    mock_resp = MagicMock()
    mock_resp.text = linkedin_html
    mock_resp.raise_for_status = MagicMock()

    with patch("scripts.scrape_url.requests.get", return_value=mock_resp):
        from scripts.scrape_url import scrape_job_url
        result = scrape_job_url(db, job_id)

    assert result.get("title") == "Customer Success Manager"
    assert result.get("company") == "Acme Corp"
    assert "CSM role" in result.get("description", "")

    import sqlite3
    conn = sqlite3.connect(db)
    conn.row_factory = sqlite3.Row  # needed so dict(row) works below
    row = dict(conn.execute("SELECT * FROM jobs WHERE id=?", (job_id,)).fetchone())
    conn.close()
    assert row["title"] == "Customer Success Manager"
    assert row["company"] == "Acme Corp"


def test_scrape_url_generic_json_ld(tmp_path):
    db, job_id = _make_db(tmp_path, url="https://jobs.example.com/post/42")

    json_ld_html = """<html><head>
<script type="application/ld+json">
{"@type": "JobPosting", "title": "TAM Role", "description": "Tech account mgmt.",
 "hiringOrganization": {"name": "TechCo"},
 "jobLocation": {"address": {"addressLocality": "Austin, TX"}}}
</script>
</head><body></body></html>"""

    mock_resp = MagicMock()
    mock_resp.text = json_ld_html
    mock_resp.raise_for_status = MagicMock()

    with patch("scripts.scrape_url.requests.get", return_value=mock_resp):
        from scripts.scrape_url import scrape_job_url
        result = scrape_job_url(db, job_id)

    assert result.get("title") == "TAM Role"
    assert result.get("company") == "TechCo"


def test_scrape_url_graceful_on_http_error(tmp_path):
    db, job_id = _make_db(tmp_path)
    import requests as req

    with patch("scripts.scrape_url.requests.get", side_effect=req.RequestException("timeout")):
        from scripts.scrape_url import scrape_job_url
        result = scrape_job_url(db, job_id)

    # Should return an empty dict and not raise; the job row still exists
    assert isinstance(result, dict)
    import sqlite3
    conn = sqlite3.connect(db)
    row = conn.execute("SELECT id FROM jobs WHERE id=?", (job_id,)).fetchone()
    conn.close()
    assert row is not None
```

**Step 2: Run the tests to verify they fail**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_scrape_url.py -v
```
Expected: FAIL — `ModuleNotFoundError: No module named 'scripts.scrape_url'`

**Step 3: Implement `scripts/scrape_url.py`**

```python
|
||||
# scripts/scrape_url.py
|
||||
"""
|
||||
Scrape a job listing from its URL and update the job record.
|
||||
|
||||
Supports:
|
||||
- LinkedIn (guest jobs API — no auth required)
|
||||
- Indeed (HTML parse)
|
||||
- Glassdoor (JobSpy internal scraper, same as enrich_descriptions.py)
|
||||
- Generic (JSON-LD → og:tags fallback)
|
||||
|
||||
Usage (background task — called by task_runner):
|
||||
from scripts.scrape_url import scrape_job_url
|
||||
scrape_job_url(db_path, job_id)
|
||||
"""
|
||||
import json
|
||||
import re
|
||||
import sqlite3
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||
|
||||
from scripts.db import DEFAULT_DB, update_job_fields
|
||||
|
||||
_HEADERS = {
|
||||
"User-Agent": (
|
||||
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
|
||||
)
|
||||
}
|
||||
_TIMEOUT = 12
|
||||
|
||||
|
||||
def _detect_board(url: str) -> str:
|
||||
"""Return 'linkedin', 'indeed', 'glassdoor', or 'generic'."""
|
||||
url_lower = url.lower()
|
||||
if "linkedin.com" in url_lower:
|
||||
return "linkedin"
|
||||
if "indeed.com" in url_lower:
|
||||
return "indeed"
|
||||
if "glassdoor.com" in url_lower:
|
||||
return "glassdoor"
|
||||
return "generic"
|
||||
|
||||
|
||||
def _extract_linkedin_job_id(url: str) -> Optional[str]:
|
||||
"""Extract numeric job ID from a LinkedIn job URL."""
|
||||
m = re.search(r"/jobs/view/(\d+)", url)
|
||||
return m.group(1) if m else None
|
||||
|
||||
|
||||
def canonicalize_url(url: str) -> str:
|
||||
"""
|
||||
Strip tracking parameters from a job URL and return a clean canonical form.
|
||||
|
||||
LinkedIn: https://www.linkedin.com/jobs/view/<id>/?trk=... → https://www.linkedin.com/jobs/view/<id>/
|
||||
Indeed: strips utm_* and other tracking params
|
||||
Others: strips utm_source/utm_medium/utm_campaign/trk/refId/trackingId
|
||||
"""
|
||||
url = url.strip()
|
||||
if "linkedin.com" in url.lower():
|
||||
job_id = _extract_linkedin_job_id(url)
|
||||
if job_id:
|
||||
return f"https://www.linkedin.com/jobs/view/{job_id}/"
|
||||
# For other boards: strip common tracking params
|
||||
from urllib.parse import urlparse, urlencode, parse_qsl
|
||||
_STRIP_PARAMS = {
|
||||
"utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
|
||||
"trk", "trkEmail", "refId", "trackingId", "lipi", "midToken", "midSig",
|
||||
"eid", "otpToken", "ssid", "fmid",
|
||||
}
|
||||
parsed = urlparse(url)
|
||||
clean_qs = urlencode([(k, v) for k, v in parse_qsl(parsed.query) if k not in _STRIP_PARAMS])
|
||||
return parsed._replace(query=clean_qs).geturl()
|
||||
|
||||
|
||||
def _scrape_linkedin(url: str) -> dict:
|
||||
"""Fetch via LinkedIn guest jobs API (no auth required)."""
|
||||
job_id = _extract_linkedin_job_id(url)
|
||||
if not job_id:
|
||||
return {}
|
||||
api_url = f"https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{job_id}"
|
||||
resp = requests.get(api_url, headers=_HEADERS, timeout=_TIMEOUT)
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.text, "html.parser")
|
||||
|
||||
def _text(selector, **kwargs):
|
||||
tag = soup.find(selector, **kwargs)
|
||||
return tag.get_text(strip=True) if tag else ""
|
||||
|
||||
title = _text("h2", class_="top-card-layout__title")
|
||||
company = _text("a", class_="topcard__org-name-link") or _text("span", class_="topcard__org-name-link")
|
||||
location = _text("span", class_="topcard__flavor--bullet")
|
||||
desc_div = soup.find("div", class_="show-more-less-html__markup")
|
||||
description = desc_div.get_text(separator="\n", strip=True) if desc_div else ""
|
||||
|
||||
return {k: v for k, v in {
|
||||
"title": title,
|
||||
"company": company,
|
||||
"location": location,
|
||||
"description": description,
|
||||
"source": "linkedin",
|
||||
}.items() if v}
|
||||
|
||||
|
||||
def _scrape_indeed(url: str) -> dict:
|
||||
"""Scrape an Indeed job page."""
|
||||
resp = requests.get(url, headers=_HEADERS, timeout=_TIMEOUT)
|
||||
resp.raise_for_status()
|
||||
return _parse_json_ld_or_og(resp.text) or {}
|
||||
|
||||
|
||||
def _scrape_glassdoor(url: str) -> dict:
|
||||
"""Re-use JobSpy's Glassdoor scraper for description fetch."""
|
||||
m = re.search(r"jl=(\d+)", url)
|
||||
if not m:
|
||||
return {}
|
||||
try:
|
||||
from jobspy.glassdoor import Glassdoor
|
||||
from jobspy.glassdoor.constant import fallback_token, headers
|
||||
from jobspy.model import ScraperInput, Site
|
||||
from jobspy.util import create_session
|
||||
|
||||
scraper = Glassdoor()
|
||||
scraper.base_url = "https://www.glassdoor.com/"
|
||||
scraper.session = create_session(has_retry=True)
|
||||
token = scraper._get_csrf_token()
|
||||
headers["gd-csrf-token"] = token if token else fallback_token
|
||||
scraper.scraper_input = ScraperInput(site_type=[Site.GLASSDOOR])
|
||||
description = scraper._fetch_job_description(int(m.group(1)))
|
||||
return {"description": description} if description else {}
|
||||
except Exception:
|
||||
return {}
|
||||
|
||||
|
||||
def _parse_json_ld_or_og(html: str) -> dict:
    """Extract job fields from JSON-LD structured data, then og: meta tags."""
    soup = BeautifulSoup(html, "html.parser")

    # Try JSON-LD first
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
            if isinstance(data, list):
                data = next((d for d in data if d.get("@type") == "JobPosting"), {})
            if data.get("@type") == "JobPosting":
                org = data.get("hiringOrganization") or {}
                loc = data.get("jobLocation") or {}
                if isinstance(loc, list):
                    loc = loc[0] if loc else {}
                addr = loc.get("address") or {}
                location = (
                    addr.get("addressLocality", "") or
                    addr.get("addressRegion", "") or
                    addr.get("addressCountry", "")
                )
                return {k: v for k, v in {
                    "title": data.get("title", ""),
                    "company": org.get("name", ""),
                    "location": location,
                    "description": data.get("description", ""),
                    "salary": str(data.get("baseSalary", "")) if data.get("baseSalary") else "",
                }.items() if v}
        except Exception:
            continue

    # Fall back to og: meta tags
    def _meta(prop):
        tag = soup.find("meta", property=prop) or soup.find("meta", attrs={"name": prop})
        return tag.get("content", "") if tag else ""

    title_tag = soup.find("title")
    title = _meta("og:title") or (title_tag.get_text(strip=True) if title_tag else "")
    description = _meta("og:description")
    return {k: v for k, v in {"title": title, "description": description}.items() if v}


def _scrape_generic(url: str) -> dict:
    resp = requests.get(url, headers=_HEADERS, timeout=_TIMEOUT)
    resp.raise_for_status()
    return _parse_json_ld_or_og(resp.text) or {}


def scrape_job_url(db_path: Path = DEFAULT_DB, job_id: int | None = None) -> dict:
    """
    Fetch the job listing at the stored URL and update the job record.

    Returns the dict of fields that were scraped (may be empty on failure).
    Does not raise — failures are logged and the job row is left as-is.
    """
    if not job_id:
        return {}

    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    row = conn.execute("SELECT url FROM jobs WHERE id=?", (job_id,)).fetchone()
    conn.close()
    if not row:
        return {}

    url = row["url"] or ""
    if not url.startswith("http"):
        return {}

    board = _detect_board(url)
    try:
        if board == "linkedin":
            fields = _scrape_linkedin(url)
        elif board == "indeed":
            fields = _scrape_indeed(url)
        elif board == "glassdoor":
            fields = _scrape_glassdoor(url)
        else:
            fields = _scrape_generic(url)
    except requests.RequestException as exc:
        print(f"[scrape_url] HTTP error for job {job_id} ({url}): {exc}")
        return {}
    except Exception as exc:
        print(f"[scrape_url] Error scraping job {job_id} ({url}): {exc}")
        return {}

    if fields:
        # Never overwrite the stored URL with a scraped value
        fields.pop("url", None)
        update_job_fields(db_path, job_id, fields)
        print(f"[scrape_url] job {job_id}: scraped '{fields.get('title', '?')}' @ {fields.get('company', '?')}")

    return fields
```

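Some boards wrap the posting in a JSON-LD array rather than a bare object; the normalization rule used by `_parse_json_ld_or_og` can be exercised in isolation (stdlib only, with a hypothetical payload, no BeautifulSoup needed):

```python
import json

# Hypothetical JSON-LD payload: some boards emit a bare JobPosting object,
# others a list that mixes JobPosting with other entities.
raw = json.dumps([
    {"@type": "Organization", "name": "Reflow"},
    {"@type": "JobPosting", "title": "Customer Success Manager",
     "hiringOrganization": {"name": "Reflow"}},
])

data = json.loads(raw)
if isinstance(data, list):
    # Same selection rule as _parse_json_ld_or_og: first JobPosting entry, else {}
    data = next((d for d in data if d.get("@type") == "JobPosting"), {})

print(data["title"])   # → Customer Success Manager
```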
**Step 4: Add `scrape_url` task type to `scripts/task_runner.py`**

In `_run_task`, add a new `elif` branch after `enrich_descriptions` and before the final `else`:

```python
elif task_type == "scrape_url":
    from scripts.scrape_url import scrape_job_url
    fields = scrape_job_url(db_path, job_id)
    title = fields.get("title") or job.get("url", "?")
    company = fields.get("company", "")
    msg = f"{title}" + (f" @ {company}" if company else "")
    update_task_status(db_path, task_id, "completed", error=msg)
    return
```

**Step 5: Run all tests**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_scrape_url.py -v
```

Expected: all PASS

**Step 6: Commit**

```bash
git add scripts/scrape_url.py scripts/task_runner.py tests/test_scrape_url.py
git commit -m "feat: add scrape_url background task for URL-based job import"
```

---

## Task 3: LinkedIn Job Alert email parser

**Files:**
- Modify: `scripts/imap_sync.py`
- Test: `tests/test_imap_sync.py`

**Step 1: Write the failing tests**

Add to `tests/test_imap_sync.py`:

```python
def test_parse_linkedin_alert_extracts_jobs():
    from scripts.imap_sync import parse_linkedin_alert
    body = """\
Your job alert for customer success manager in United States
New jobs match your preferences.
Manage alerts: https://www.linkedin.com/comm/jobs/alerts?...

Customer Success Manager
Reflow
California, United States
View job: https://www.linkedin.com/comm/jobs/view/4376518925/?trackingId=abc%3D%3D&refId=xyz

---------------------------------------------------------

Customer Engagement Manager
Bitwarden
United States

2 school alumni
Apply with resume & profile
View job: https://www.linkedin.com/comm/jobs/view/4359824983/?trackingId=def%3D%3D

---------------------------------------------------------

"""
    jobs = parse_linkedin_alert(body)
    assert len(jobs) == 2
    assert jobs[0]["title"] == "Customer Success Manager"
    assert jobs[0]["company"] == "Reflow"
    assert jobs[0]["location"] == "California, United States"
    assert jobs[0]["url"] == "https://www.linkedin.com/jobs/view/4376518925/"
    assert jobs[1]["title"] == "Customer Engagement Manager"
    assert jobs[1]["company"] == "Bitwarden"
    assert jobs[1]["url"] == "https://www.linkedin.com/jobs/view/4359824983/"


def test_parse_linkedin_alert_skips_blocks_without_view_job():
    from scripts.imap_sync import parse_linkedin_alert
    body = """\
Customer Success Manager
Some Company
United States

---------------------------------------------------------

Valid Job Title
Valid Company
Remote
View job: https://www.linkedin.com/comm/jobs/view/1111111/?x=y

---------------------------------------------------------
"""
    jobs = parse_linkedin_alert(body)
    assert len(jobs) == 1
    assert jobs[0]["title"] == "Valid Job Title"


def test_parse_linkedin_alert_empty_body():
    from scripts.imap_sync import parse_linkedin_alert
    assert parse_linkedin_alert("") == []
    assert parse_linkedin_alert("No jobs here.") == []
```

**Step 2: Run tests to verify they fail**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_imap_sync.py::test_parse_linkedin_alert_extracts_jobs tests/test_imap_sync.py::test_parse_linkedin_alert_skips_blocks_without_view_job tests/test_imap_sync.py::test_parse_linkedin_alert_empty_body -v
```

Expected: FAIL — `ImportError: cannot import name 'parse_linkedin_alert'`

**Step 3: Implement `parse_linkedin_alert` in `scripts/imap_sync.py`**

Add after the existing `_has_todo_keyword` function (around line 391):

```python
_LINKEDIN_ALERT_SENDER = "jobalerts-noreply@linkedin.com"

# Social-proof / nav lines to skip when parsing alert blocks
_ALERT_SKIP_PHRASES = {
    "alumni", "apply with", "actively hiring", "manage alerts",
    "view all jobs", "your job alert", "new jobs match",
    "unsubscribe", "linkedin corporation",
}


def parse_linkedin_alert(body: str) -> list[dict]:
    """
    Parse the plain-text body of a LinkedIn Job Alert digest email.

    Returns a list of dicts: {title, company, location, url}.
    URL is canonicalized to https://www.linkedin.com/jobs/view/<id>/
    (tracking parameters stripped).
    """
    jobs = []
    # Split on separator lines (10+ dashes)
    blocks = re.split(r"\n\s*-{10,}\s*\n", body)
    for block in blocks:
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]

        # Find "View job:" URL
        url = None
        for line in lines:
            m = re.search(r"View job:\s*(https?://\S+)", line, re.IGNORECASE)
            if m:
                raw_url = m.group(1)
                job_id_m = re.search(r"/jobs/view/(\d+)", raw_url)
                if job_id_m:
                    url = f"https://www.linkedin.com/jobs/view/{job_id_m.group(1)}/"
                break
        if not url:
            continue

        # Filter noise lines
        content = [
            ln for ln in lines
            if not any(p in ln.lower() for p in _ALERT_SKIP_PHRASES)
            and not ln.lower().startswith("view job:")
            and not ln.startswith("http")
        ]
        if len(content) < 2:
            continue

        jobs.append({
            "title": content[0],
            "company": content[1],
            "location": content[2] if len(content) > 2 else "",
            "url": url,
        })
    return jobs
```

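As a quick sanity check, the canonicalization regex above can be run standalone against one of the tracking URLs from the test fixture:

```python
import re

# One of the tracking URLs from the test fixture above
raw_url = "https://www.linkedin.com/comm/jobs/view/4376518925/?trackingId=abc%3D%3D&refId=xyz"

# Same extraction rule as parse_linkedin_alert: pull the numeric job id,
# then rebuild a tracking-free canonical URL.
m = re.search(r"/jobs/view/(\d+)", raw_url)
canonical = f"https://www.linkedin.com/jobs/view/{m.group(1)}/" if m else None
print(canonical)   # → https://www.linkedin.com/jobs/view/4376518925/
```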
**Step 4: Wire the parser into `_scan_unmatched_leads`**

In `_scan_unmatched_leads`, inside the `for uid in all_uids:` loop, add a detection block immediately after the `if mid in known_message_ids: continue` check (before the existing `_has_recruitment_keyword` check):

```python
# ── LinkedIn Job Alert digest — parse each card individually ──────
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": "linkedin",
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            from scripts.task_runner import submit_task
            submit_task(db_path, "scrape_url", job_id)
            existing_urls.add(card["url"])
            new_leads += 1
            print(f"[imap] LinkedIn alert → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue  # skip normal LLM extraction path
```

**Step 5: Run all imap_sync tests**

```bash
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_imap_sync.py -v
```

Expected: all PASS (including the 3 new tests)

**Step 6: Commit**

```bash
git add scripts/imap_sync.py tests/test_imap_sync.py
git commit -m "feat: auto-parse LinkedIn Job Alert digest emails into pending jobs"
```

---

## Task 4: Home page — Add Job(s) by URL

**Files:**
- Modify: `app/Home.py`

No unit tests — this is pure Streamlit UI. Verify manually by pasting a URL and checking the DB.

**Step 1: Add `_queue_url_imports` helper and the new section to `app/Home.py`**

Add to the imports at the top (after the existing `from scripts.db import ...` line):

```python
from scripts.db import DEFAULT_DB, init_db, get_job_counts, purge_jobs, purge_email_data, \
    kill_stuck_tasks, get_task_for_job, get_active_tasks, insert_job, get_existing_urls
```

Add this helper function before the Streamlit layout code (after the `init_db` call at the top):

```python
def _queue_url_imports(db_path: Path, urls: list[str]) -> int:
    """Insert each URL as a pending manual job and queue a scrape_url task.
    Returns count of newly queued jobs."""
    from datetime import datetime
    from scripts.scrape_url import canonicalize_url
    from scripts.task_runner import submit_task
    existing = get_existing_urls(db_path)
    queued = 0
    for url in urls:
        url = canonicalize_url(url.strip())
        if not url.startswith("http"):
            continue
        if url in existing:
            continue
        job_id = insert_job(db_path, {
            "title": "Importing…",
            "company": "",
            "url": url,
            "source": "manual",
            "location": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
            queued += 1
    return queued
```

Add a new section between the Email Sync divider and the Danger Zone expander. Replace:

```python
st.divider()

# ── Danger zone: purge + re-scrape ────────────────────────────────────────────
```

with:

```python
st.divider()

# ── Add Jobs by URL ───────────────────────────────────────────────────────────
add_left, add_right = st.columns([3, 1])
with add_left:
    st.subheader("Add Jobs by URL")
    st.caption("Paste job listing URLs to import and scrape in the background. "
               "Supports LinkedIn, Indeed, Glassdoor, and most job boards.")

url_tab, csv_tab = st.tabs(["Paste URLs", "Upload CSV"])

with url_tab:
    url_text = st.text_area(
        "urls",
        placeholder="https://www.linkedin.com/jobs/view/1234567/\nhttps://www.indeed.com/viewjob?jk=abc",
        height=100,
        label_visibility="collapsed",
    )
    if st.button("📥 Add Jobs", key="add_urls_btn", use_container_width=True,
                 disabled=not (url_text or "").strip()):
        _urls = [u.strip() for u in url_text.strip().splitlines() if u.strip().startswith("http")]
        if _urls:
            _n = _queue_url_imports(DEFAULT_DB, _urls)
            if _n:
                st.success(f"Queued {_n} job{'s' if _n != 1 else ''} for import. Check Job Review shortly.")
            else:
                st.info("All URLs already in the database.")
            st.rerun()

with csv_tab:
    csv_file = st.file_uploader("CSV with a URL column", type=["csv"],
                                label_visibility="collapsed")
    if csv_file:
        import csv as _csv
        import io as _io
        reader = _csv.DictReader(_io.StringIO(csv_file.read().decode("utf-8", errors="replace")))
        _csv_urls = []
        for row in reader:
            for val in row.values():
                if val and val.strip().startswith("http"):
                    _csv_urls.append(val.strip())
                    break
        if _csv_urls:
            st.caption(f"Found {len(_csv_urls)} URL(s) in CSV.")
            if st.button("📥 Import CSV Jobs", key="add_csv_btn", use_container_width=True):
                _n = _queue_url_imports(DEFAULT_DB, _csv_urls)
                st.success(f"Queued {_n} job{'s' if _n != 1 else ''} for import.")
                st.rerun()
        else:
            st.warning("No URLs found — CSV must have a column whose values start with http.")

# Active scrape_url tasks status
@st.fragment(run_every=3)
def _scrape_status():
    import sqlite3 as _sq
    conn = _sq.connect(DEFAULT_DB)
    conn.row_factory = _sq.Row
    rows = conn.execute(
        """SELECT bt.status, bt.error, j.title, j.company, j.url
           FROM background_tasks bt
           JOIN jobs j ON j.id = bt.job_id
           WHERE bt.task_type = 'scrape_url'
             AND bt.updated_at >= datetime('now', '-5 minutes')
           ORDER BY bt.updated_at DESC LIMIT 20"""
    ).fetchall()
    conn.close()
    if not rows:
        return
    st.caption("Recent URL imports:")
    for r in rows:
        if r["status"] == "running":
            st.info(f"⏳ Scraping {r['url']}")
        elif r["status"] == "completed":
            label = r["title"] + (f" @ {r['company']}" if r["company"] else "")
            st.success(f"✅ {label}")
        elif r["status"] == "failed":
            st.error(f"❌ {r['url']} — {r['error'] or 'scrape failed'}")

_scrape_status()

st.divider()

# ── Danger zone: purge + re-scrape ────────────────────────────────────────────
```

**Step 2: Check `background_tasks` schema has an `updated_at` column**

The status fragment queries `bt.updated_at`. Verify it exists:

```bash
conda run -n job-seeker python -c "
import sqlite3
from scripts.db import DEFAULT_DB, init_db
init_db(DEFAULT_DB)
conn = sqlite3.connect(DEFAULT_DB)
print(conn.execute('PRAGMA table_info(background_tasks)').fetchall())
"
```

If `updated_at` is missing, add a migration in `scripts/db.py`'s `_migrate_db` function. Note that SQLite rejects `ALTER TABLE ... ADD COLUMN` with a non-constant default like `datetime('now')`, so add the column without a default and backfill:

```python
try:
    conn.execute("ALTER TABLE background_tasks ADD COLUMN updated_at TEXT")
    conn.execute("UPDATE background_tasks SET updated_at = datetime('now') WHERE updated_at IS NULL")
except sqlite3.OperationalError:
    pass
```

And update `update_task_status` in `db.py` to set `updated_at = datetime('now')` on every status change:

```python
def update_task_status(db_path, task_id, status, error=None):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "UPDATE background_tasks SET status=?, error=?, updated_at=datetime('now') WHERE id=?",
        (status, error, task_id),
    )
    conn.commit()
    conn.close()
```

**Step 3: Restart the UI and manually verify**

```bash
bash /devl/job-seeker/scripts/manage-ui.sh restart
```

Test:
1. Paste `https://www.linkedin.com/jobs/view/4376518925/` into the text area
2. Click "📥 Add Jobs" — should show "Queued 1 job for import"
3. Go to Job Review → a pending job should appear (Reflow / Customer Success Manager, once scraping completes)

**Step 4: Commit**

```bash
git add app/Home.py
git commit -m "feat: add 'Add Jobs by URL' section to Home page with background scraping"
```

---

## Final: push to remote

```bash
git push origin main
```

# Job Seeker Platform — Monetization Business Plan

**Date:** 2026-02-24
**Status:** Draft — pre-VC pitch
**Author:** Brainstorming session

---

## 1. Product Overview

An automated job discovery, resume matching, and application pipeline platform. Built originally as a personal tool for a single job seeker; the architecture is already generalized — user identity, preferences, and data are fully parameterized via onboarding, not hardcoded.

### Core pipeline
```
Job Discovery (multi-board) → Resume Matching → Job Review UI
    → Apply Workspace (cover letter + PDF)
    → Interviews Kanban (phone_screen → offer → hired)
    → Notion Sync
```

### Key feature surface
- Multi-board job discovery (LinkedIn, Indeed, Glassdoor, ZipRecruiter, Google, Adzuna, The Ladders)
- LinkedIn Alert email ingestion + email classifier (interview requests, rejections, surveys)
- Resume keyword matching + match scoring
- AI cover letter generation (local model, shared hosted model, or cloud LLM)
- Company research briefs (web scrape + LLM synthesis)
- Interview prep + practice Q&A
- Culture-fit survey assistant with vision/screenshot support
- Application pipeline kanban with stage tracking
- Notion sync for external tracking
- Mission alignment + accessibility preferences (personal decision-making only)
- Per-user fine-tuned cover letter model (trained on the user's own writing corpus)

---

## 2. Target Market

### Primary: Individual job seekers (B2C)
- Actively searching, technically comfortable, value privacy
- Frustrated by manual tracking (spreadsheets, Notion boards)
- Want AI-assisted applications without giving their data to a third party
- Typical job search duration: 3–6 months → average subscription length ~4.5 months

### Secondary: Career coaches (B2B, seat-based)
- Manage 10–20 active clients simultaneously
- High willingness to pay for tools that make their service more efficient
- **20× revenue multiplier** vs. solo users (base + per-seat pricing)

### Tertiary: Outplacement firms / staffing agencies (B2B enterprise)
- Future expansion; validates product-market fit at the coach tier first

---

## 3. Distribution Model

### Starting point: Local-first (self-hosted)

Users run the application on their own machine via Docker Compose or a native installer. All job data, resume data, and preferences stay local. AI features are optional and configurable — users can use their own LLM backends or subscribe for hosted AI.

**Why local-first:**
- Zero infrastructure cost per free user
- Strong privacy story (no job search data on your servers)
- Reversible — easy to add a hosted SaaS path later without a rewrite
- Aligns with the open core licensing model

### Future path: Cloud Edition (SaaS)

Same codebase deployed as a hosted service. Users sign up at a URL; no install required. Unlocked when revenue and user feedback validate the market.

**Architecture readiness:** The config layer, per-user data isolation, and SQLite-per-user design already support multi-tenancy with minimal refactoring. SaaS is a deployment mode, not a rewrite.

---

## 4. Licensing Strategy

### Open Core

| Component | License | Rationale |
|---|---|---|
| Job discovery pipeline | MIT | Community maintains scrapers (boards break constantly) |
| SQLite schema + `db.py` | MIT | Interoperability, trust |
| Application pipeline state machine | MIT | Core value is visible, auditable |
| Streamlit UI shell | MIT | Community contributions, forks welcome |
| AI cover letter generation | BSL 1.1 | Proprietary prompt engineering + model routing |
| Company research synthesis | BSL 1.1 | LLM orchestration is the moat |
| Interview prep + practice Q&A | BSL 1.1 | Premium feature |
| Survey assistant (vision) | BSL 1.1 | Premium feature |
| Email classifier | BSL 1.1 | Premium feature |
| Notion sync | BSL 1.1 | Integration layer |
| Team / multi-user features | Proprietary | Future enterprise feature |
| Analytics dashboard | Proprietary | Future feature |
| Fine-tuned model weights | Proprietary | Per-user, not redistributable |

**Business Source License (BSL 1.1):** Code is visible and auditable on GitHub. Free for personal, non-commercial self-hosting. Commercial use or SaaS re-hosting requires a paid license. Converts to MIT after 4 years. Used by HashiCorp (Vault, Terraform), MariaDB, and others — well understood by the VC community.

**Why this works here:** The value is not in the code. A competitor could clone the repo and still not have: the fine-tuned model, the user's corpus, the orchestration prompts, or the UX polish. The moat is the system, not any individual file.

---

## 5. Tier Structure

### Free — $0/mo
Self-hosted, local-only. Genuinely useful as a privacy-respecting job tracker.

| Feature | Included |
|---|---|
| Multi-board job discovery | ✓ |
| Custom board scrapers (Adzuna, The Ladders) | ✓ |
| LinkedIn Alert email ingestion | ✓ |
| Add jobs by URL | ✓ |
| Resume keyword matching | ✓ |
| Cover letter generation (local Ollama only) | ✓ |
| Application pipeline kanban | ✓ |
| Mission alignment + accessibility preferences | ✓ |
| Search profiles | 1 |
| AI backend | User's local Ollama |
| Support | Community (GitHub Discussions) |

**Purpose:** Acquisition engine. GitHub stars = distribution. Users who get a job on the free tier refer friends.

---

### Paid — $12/mo
For job seekers who want quality AI output without GPU setup or API key management.

Includes everything in Free, plus:

| Feature | Included |
|---|---|
| Shared hosted fine-tuned cover letter model | ✓ |
| Claude API (BYOK — bring your own key) | ✓ |
| Company research briefs | ✓ |
| Interview prep + practice Q&A | ✓ |
| Survey assistant (vision/screenshot) | ✓ |
| Search criteria LLM suggestions | ✓ |
| Email classifier | ✓ |
| Notion sync | ✓ |
| Search profiles | 5 |
| Support | Email |

**Purpose:** Primary revenue tier. High margin, low support burden. Targets the individual job seeker who wants "it just works."

---

### Premium — $29/mo
For power users and career coaches who want best-in-class output and personal model training.

Includes everything in Paid, plus:

| Feature | Included |
|---|---|
| Claude Sonnet (your hosted key, 150 ops/mo included) | ✓ |
| Per-user fine-tuned model (trained on their corpus) | ✓ (one-time onboarding) |
| Corpus re-training | ✓ (quarterly) |
| Search profiles | Unlimited |
| Multi-user / coach mode | ✓ (+$15/seat) |
| Shared job pool across seats | ✓ |
| Priority support + onboarding call | ✓ |

**Purpose:** Highest-LTV tier. Coach accounts at 3+ seats generate $59–$239/mo each. The fine-tuned personal model is a high-perceived-value differentiator that costs ~$0.50 to produce.

---

## 6. AI Inference — Claude API Cost Model

Pricing basis: Haiku 4.5 = $0.80/MTok in · $4/MTok out | Sonnet 4.6 = $3/MTok in · $15/MTok out

### Per-operation costs

| Operation | Tokens In | Tokens Out | Haiku | Sonnet |
|---|---|---|---|---|
| Cover letter generation | ~2,400 | ~400 | $0.0035 | $0.013 |
| Company research brief | ~3,000 | ~800 | $0.0056 | $0.021 |
| Survey Q&A (5 questions) | ~3,000 | ~1,500 | $0.0084 | $0.031 |
| Job description enrichment | ~800 | ~300 | $0.0018 | $0.007 |
| Search criteria suggestion | ~400 | ~200 | $0.0010 | $0.004 |

### Monthly inference cost per active user
Assumptions: 12 cover letters, 3 research briefs, 2 surveys, 40 enrichments, 2 search suggestions

| Backend mix | Cost/user/mo |
|---|---|
| Haiku only (paid tier) | ~$0.15 |
| Sonnet only | ~$0.57 |
| Mixed: Sonnet for CL + research, Haiku for rest (premium tier) | ~$0.31 |

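The table values fall straight out of the token counts and per-MTok prices; a quick sketch that reproduces the Haiku column and the Haiku-only monthly figure:

```python
HAIKU = (0.80, 4.00)    # $/MTok (input, output), prices as stated above
SONNET = (3.00, 15.00)  # swap in for the Sonnet column

def op_cost(tok_in: int, tok_out: int, price: tuple) -> float:
    """Dollar cost of one operation given token counts and per-MTok prices."""
    pin, pout = price
    return tok_in / 1e6 * pin + tok_out / 1e6 * pout

# Per-operation: cover letter on Haiku
print(round(op_cost(2_400, 400, HAIKU), 4))   # → 0.0035

# Monthly, Haiku-only mix: 12 cover letters, 3 briefs, 2 surveys,
# 40 enrichments, 2 search suggestions
ops = [(12, 2_400, 400), (3, 3_000, 800), (2, 3_000, 1_500),
       (40, 800, 300), (2, 400, 200)]
monthly = sum(n * op_cost(i, o, HAIKU) for n, i, o in ops)
print(round(monthly, 2))                      # → 0.15
```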
### Per-user fine-tuning cost (premium, one-time)

| Provider | Cost |
|---|---|
| User's local GPU | $0 |
| RunPod A100 (~20 min) | $0.25–$0.40 |
| Together AI / Replicate | $0.50–$0.75 |
| Quarterly re-train | Same as above |

**Amortized over 12 months:** ~$0.04–$0.06/user/mo

---

## 7. Full Infrastructure Cost Model

Local-first architecture means most compute runs on the user's machine. Your infra is limited to: AI inference API calls, shared model serving, fine-tune jobs, the license/auth server, and storage for model artifacts.

### Monthly infrastructure at 100K users
(4% paid conversion = 4,000 paid; 20% of paid premium = 800 premium)

| Cost center | Detail | Monthly cost |
|---|---|---|
| Claude API inference (paid tier, Haiku) | 4,000 users × $0.15 | $600 |
| Claude API inference (premium tier, mixed) | 800 users × $0.31 | $248 |
| Shared model serving (Together AI, 3B model) | 48,000 requests/mo | $27 |
| Per-user fine-tune jobs | 800 users / 12mo × $0.50 | $33 |
| App hosting (license server, auth API, DB) | VPS + PostgreSQL | $200 |
| Model artifact storage (800 × 1.5GB on S3) | 1.2TB | $28 |
| **Total** | | **$1,136/mo** |

---

## 8. Revenue Model & Unit Economics

### Monthly revenue at scale

| Total users | Paid (4%) | Premium (20% of paid) | Revenue/mo | Infra/mo | **Gross margin** |
|---|---|---|---|---|---|
| 10,000 | 400 | 80 | $7,120 | $196 | **97.2%** |
| 100,000 | 4,000 | 800 | $88,250 | $1,136 | **98.7%** |

### Blended ARPU
- Across all users (including free): **~$0.71/user/mo**
- Across paying users only: **~$17.30/user/mo**
- Coach account (3 seats avg): **~$74/mo**

### LTV per user segment
- Paid individual (4.5mo avg job search): **~$54**
- Premium individual (4.5mo avg): **~$130**
- Coach account (ongoing, low churn): **$74/mo × 18mo estimated = ~$1,330**
- **Note:** Success churn is real — users leave when they get a job. The re-subscription rate on the next job search partially offsets this.

### ARR projections

| Scale | ARR |
|---|---|
| 10K users | **~$85K** |
| 100K users | **~$1.06M** |
| 1M users | **~$10.6M** |

To reach $10M ARR: ~1M total users **or** meaningful coach/enterprise penetration at lower user counts.

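The 10K-user row and its ARR line reconcile from the tier prices alone (treating premium subscribers as additional to the paid count, which is how the $7,120 figure works out; the 100K row includes revenue beyond these two subscription line items, likely coach seats):

```python
PAID, PREMIUM = 12, 29          # tier prices, $/mo

users = 10_000
paid = int(users * 0.04)        # 400 paid subscribers (4% conversion)
premium = int(paid * 0.20)      # 80 premium subscribers, counted in addition
monthly = paid * PAID + premium * PREMIUM
annual = monthly * 12
print(monthly, annual)          # → 7120 85440  (matches $7,120/mo and ~$85K ARR)
```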
---

## 9. VC Pitch Angles

### The thesis
> "GitHub is our distribution channel. Local-first is our privacy moat. Coaches are our revenue engine."

### Key metrics to hit before Series A
- 10K GitHub stars (validates the distribution thesis)
- 500 paying users (validates willingness to pay)
- 20 coach accounts (validates the B2B multiplier)
- 97%+ gross margin (already proven in the model)

### Competitive differentiation
1. **Privacy-first** — job search data never leaves your machine on the free/paid tiers
2. **Fine-tuned personal model** — no other tool trains a cover letter model on your specific writing voice
3. **Full pipeline** — discovery through hired, not just one step (most competitors are point solutions)
4. **Open core** — the community maintains job board scrapers, which break constantly; competitors pay engineers for this
5. **LLM-agnostic** — works with Ollama, Claude, GPT, vLLM; users aren't locked to one provider

### Risks to address
- **Success churn** — mitigated by re-subscription on the next job search, coach accounts (persistent), and a potential pivot to ongoing career management
- **Job board scraping fragility** — mitigated by open core (community patches), multiple board sources, email ingestion fallback
- **LLM cost spikes** — mitigated by Haiku-first routing, local model fallback, user BYOK option
- **Copying by incumbents** — LinkedIn and Indeed have distribution but no privacy story; a fine-tuned personal model is hard to replicate at their scale

---

## 10. Roadmap

### Phase 1 — Local-first launch (now)
- Docker Compose installer + setup wizard
- License key server (simple, hosted)
- Paid tier: shared model endpoint + Notion sync + email classifier
- Premium tier: fine-tune pipeline + Claude API routing
- Open core GitHub repo (MIT core, BSL premium)

### Phase 2 — Coach tier validation (3–6 months post-launch)
- Multi-user mode with seat management
- Coach dashboard: shared job pool, per-candidate pipeline view
- Billing portal (Stripe)
- Outplacement firm pilot

### Phase 3 — Cloud Edition (6–12 months, revenue-funded or post-seed)
- Hosted SaaS version at a URL (no install)
- Same codebase, cloud deployment mode
- Converts local-first users who want convenience
- Enables mobile access

### Phase 4 — Enterprise (post-Series A)
- SSO / SAML
- Admin dashboard + analytics
- API for ATS integrations
- Custom fine-tuned models for an outplacement firm's brand voice

---

## 11. Competitive Landscape

### Direct competitors

| Product | Price | Pipeline | AI CL | Privacy | Fine-tune | Open Source |
|---|---|---|---|---|---|---|
| **Job Seeker Platform** | Free–$29 | Full (discovery→hired) | Personal fine-tune | Local-first | Per-user | Core (MIT) |
| Teal | Free/$29 | Partial (tracker + resume) | Generic AI | Cloud | No | No |
| Jobscan | $49.95 | Resume scan only | No | Cloud | No | No |
| Huntr | Free/$30 | Tracker only | No | Cloud | No | No |
| Rezi | $29 | Resume/CL only | Generic AI | Cloud | No | No |
| Kickresume | $19 | Resume/CL only | Generic AI | Cloud | No | No |
| LinkedIn Premium | $40 | Job search only | No | Cloud (them) | No | No |
| AIHawk | Free | LinkedIn Easy Apply | No | Local | No | Yes (MIT) |
| Simplify | Free | Auto-fill only | No | Extension | No | No |

### Competitive analysis

**Teal** ($29/mo) is the closest feature competitor — job tracker + resume builder + AI cover letters. Key gaps: cloud-only (privacy risk), no discovery automation, generic AI (not fine-tuned to your voice), no interview prep, no email classifier. Their paid tier costs the same as our premium and delivers substantially less.

**Jobscan** ($49.95/mo) is the premium ATS-optimization tool. Single-purpose: no pipeline, no cover letters. Overpriced for what it does. Users often run it alongside a tracker — this platform replaces both.

**AIHawk** (open source) automates LinkedIn Easy Apply but has no pipeline, no AI beyond form filling, no cover letter generation, no tracking. It's a macro, not a platform. We already integrate with it as a downstream action, so we're complementary, not competitive, at the free tier.

**LinkedIn Premium** ($40/mo) has distribution but actively works against user privacy and owns the candidate relationship. Users are the product. Our privacy story is a direct counter-positioning.

### The whitespace

No competitor offers all three of: **full pipeline automation + privacy-first local storage + personalized fine-tuned AI**. Every existing tool is either a point solution (just resume, just tracker, just auto-apply) or cloud-based SaaS that monetizes user data. The combination is the moat.

### Indirect competition

- **Spreadsheets + Notion templates** — free, flexible, no AI. The baseline we replace for free users.
- **Recruiting agencies** — human-assisted job search; we're a complement, not a replacement.
- **Career coaches** — we sell *to* them, not against them.

---

## 12. Go-to-Market Strategy
|
||||
|
||||
### Phase 1: Developer + privacy community launch
|
||||
|
||||
**Channel:** GitHub → Hacker News → Reddit
|
||||
|
||||
The open core model makes GitHub the primary distribution channel. A compelling README, one-command Docker install, and a working free tier are the launch. Target communities:
|
||||
|
||||
- Hacker News "Show HN" — privacy-first self-hosted tools get strong traction
|
||||
- r/cscareerquestions (1.2M members) — active job seekers, technically literate
|
||||
- r/selfhosted (2.8M members) — prime audience for local-first tools
|
||||
- r/ExperiencedDevs, r/remotework — secondary seeding
|
||||
|
||||
**Goal:** 1,000 GitHub stars and 100 free installs in first 30 days.
|
||||
|
||||
**Content hook:** "I built a private job search AI that runs entirely on your machine — no data leaves your computer." Privacy angle resonates deeply post-2024 data breach fatigue.
|
||||
|
||||
### Phase 2: Career coaching channel
|
||||
|
||||
**Channel:** LinkedIn → direct outreach → coach partnerships
|
||||
|
||||
Career coaches are the highest-LTV customer and the most efficient channel to reach many job seekers at once. One coach onboarded = 10–20 active users.
|
||||
|
||||
Tactics:
|
||||
- Identify coaches on LinkedIn who post about job search tools
|
||||
- Offer white-glove onboarding + 60-day free trial of coach seats
|
||||
- Co-create content: "How I run 15 client job searches simultaneously"
|
||||
- Referral program: coach gets 1 free seat per paid client referral
|
||||
|
||||
**Goal:** 20 coach accounts within 90 days of paid tier launch.
|
||||
|
||||
### Phase 3: Content + SEO (SaaS phase)
|
||||
|
||||
Once the hosted Cloud Edition exists, invest in organic content:
|
||||
|
||||
- "Best job tracker apps 2027" (comparison content — we win on privacy + AI)
|
||||
- "How to write a cover letter that sounds like you, not ChatGPT"
|
||||
- "Job search automation without giving LinkedIn your data"
|
||||
- Tutorial videos: full setup walkthrough, fine-tuning demo
|
||||
|
||||
**Goal:** 10K organic monthly visitors driving 2–5% free tier signups.
|
||||
|
||||
### Phase 4: Outplacement firm partnerships (enterprise)
|
||||
|
||||
Target HR consultancies and outplacement firms (Challenger, Gray & Christmas; Right Management; Lee Hecht Harrison). These firms place thousands of candidates per year and pay per-seat enterprise licenses.
|
||||
|
||||
**Goal:** 3 enterprise pilots within 12 months of coach tier validation.
|
||||
|
||||
### Pricing strategy by channel
|
||||
|
||||
| Channel | Entry offer | Conversion lever |
|
||||
|---|---|---|
|
||||
| GitHub / OSS | Free forever | Upgrade friction: GPU setup, no shared model |
|
||||
| Direct / ProductHunt | Free 30-day paid trial | AI quality gap is immediately visible |
|
||||
| Coach outreach | Free 60-day coach trial | Efficiency gain across client base |
|
||||
| Enterprise | Pilot with 10 seats | ROI vs. current manual process |
|
||||
|
||||
### Key metrics by phase
|
||||
|
||||
| Phase | Primary metric | Target |
|
||||
|---|---|---|
|
||||
| Launch | GitHub stars | 1K in 30 days |
|
||||
| Paid validation | Paying users | 500 in 90 days |
|
||||
| Coach validation | Coach accounts | 20 in 90 days |
|
||||
| SaaS launch | Cloud signups | 10K in 6 months |
|
||||
| Enterprise | ARR from enterprise | $100K in 12 months |
|
||||
|
||||
---
|
||||
|
||||
## 13. Pricing Sensitivity Analysis
|
||||
|
||||
### Paid tier sensitivity ($8 / $12 / $15 / $20)
|
||||
|
||||
Assumption: 100K total users, 4% base conversion, gross infra cost $1,136/mo
|
||||
|
||||
| Price | Conversion assumption | Paying users | Revenue/mo | Gross margin |
|
||||
|---|---|---|---|---|
|
||||
| $8 | 5.5% (price-elastic) | 5,500 | $44,000 | 97.4% |
|
||||
| **$12** | **4.0% (base)** | **4,000** | **$48,000** | **97.6%** |
|
||||
| $15 | 3.2% (slight drop) | 3,200 | $48,000 | 97.6% |
|
||||
| $20 | 2.5% (meaningful drop) | 2,500 | $50,000 | 97.7% |
|
||||
|
||||
**Finding:** Revenue is relatively flat between $12 and $20 because conversion drops offset the price increase. $12 is the sweet spot — maximizes paying user count (more data, more referrals, more upgrade candidates) without sacrificing revenue. Going below $10 requires meaningfully higher conversion to justify.
|
||||
|
||||
### Premium tier sensitivity ($19 / $29 / $39 / $49)
|
||||
|
||||
Assumption: 800 base premium users (20% of 4,000 paid), conversion adjusts with price
|
||||
|
||||
| Price | Conversion from paid | Premium users | Revenue/mo | Fine-tune cost | Net/mo |
|
||||
|---|---|---|---|---|---|
|
||||
| $19 | 25% | 1,000 | $19,000 | $42 | $18,958 |
|
||||
| **$29** | **20%** | **800** | **$23,200** | **$33** | **$23,167** |
|
||||
| $39 | 15% | 600 | $23,400 | $25 | $23,375 |
|
||||
| $49 | 10% | 400 | $19,600 | $17 | $19,583 |
|
||||
|
||||
**Finding:** $29–$39 is the revenue-maximizing range. $29 wins on user volume (more fine-tune data, stronger coach acquisition funnel). $39 wins marginally on revenue but shrinks the premium base significantly. Recommend $29 at launch with the option to test $34–$39 once the fine-tuned model quality is demonstrated.
|
||||
|
||||
### Coach seat sensitivity ($10 / $15 / $20 per seat)
|
||||
|
||||
Assumption: 50 coach accounts, 3 seats avg, base $29 already captured above
|
||||
|
||||
| Seat price | Seat revenue/mo | Total coach revenue/mo |
|
||||
|---|---|---|
|
||||
| $10 | $1,500 | $1,500 |
|
||||
| **$15** | **$2,250** | **$2,250** |
|
||||
| $20 | $3,000 | $3,000 |
|
||||
|
||||
**Finding:** Seat pricing is relatively inelastic for coaches — $15–$20 is well within their cost of tools per client. $15 is conservative and easy to raise. $20 is defensible once coach ROI is documented. Consider $15 at launch, $20 after first 20 coach accounts are active.
|
||||
|
||||
### Blended revenue at optimized pricing (100K users)
|
||||
|
||||
| Component | Users | Price | Revenue/mo |
|
||||
|---|---|---|---|
|
||||
| Paid tier | 4,000 | $12 | $48,000 |
|
||||
| Premium individual | 720 | $29 | $20,880 |
|
||||
| Premium coach base | 80 | $29 | $2,320 |
|
||||
| Coach seats (80 accounts × 3 avg) | 240 seats | $15 | $3,600 |
|
||||
| **Total** | | | **$74,800/mo** |
|
||||
| Infrastructure | | | -$1,136/mo |
|
||||
| **Net** | | | **$73,664/mo (~$884K ARR)** |
|
||||
|
||||
### Sensitivity to conversion rate (at $12/$29 pricing, 100K users)
|
||||
|
||||
| Free→Paid conversion | Paid→Premium conversion | Revenue/mo | ARR |
|
||||
|---|---|---|---|
|
||||
| 2% | 15% | $30,720 | $369K |
|
||||
| 3% | 18% | $47,664 | $572K |
|
||||
| **4%** | **20%** | **$65,600** | **$787K** |
|
||||
| 5% | 22% | $84,480 | $1.01M |
|
||||
| 6% | 25% | $104,400 | $1.25M |
|
||||
|
||||
**Key insight:** Conversion rate is the highest-leverage variable. Going from 4% → 5% free-to-paid conversion adds $228K ARR at 100K users. Investment in onboarding quality and the free-tier value proposition has outsized return vs. price adjustments.
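The blended-revenue table is easy to sanity-check in code. A minimal sketch (the function name and defaults are illustrative, mirroring the table above, not part of the repo):

```python
def blended_net(paid_users=4_000, premium_users=800,
                coach_accounts=80, seats_per_coach=3,
                paid_price=12, premium_price=29, seat_price=15,
                infra_cost=1_136):
    """Monthly net revenue, reproducing the blended-revenue table.

    premium_users includes the 80 coach base subscriptions;
    coach seats are billed on top at seat_price each.
    """
    seat_revenue = coach_accounts * seats_per_coach * seat_price
    gross = (paid_users * paid_price
             + premium_users * premium_price
             + seat_revenue)
    return gross - infra_cost

print(blended_net())  # → 73664, i.e. ~$884K ARR
```

Varying `paid_users` and `premium_users` reproduces the conversion-rate sensitivity rows the same way.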
# CircuitForge License Server — Design Document

**Date:** 2026-02-25
**Status:** Approved — ready for implementation

---

## Goal

Build a self-hosted licensing server for Circuit Forge LLC products. v1 serves Peregrine; the schema is multi-product from day one. The server enforces free / paid / premium / ultra tier gates with offline-capable JWT validation, a 30-day refresh cycle, a 7-day grace period, seat tracking, usage telemetry, and a content violation flagging foundation.

## Architecture

```
┌────────────────────────────────────────────────┐
│ circuitforge-license (Heimdall:8600)           │
│ FastAPI + SQLite + RS256 JWT                   │
│                                                │
│ Public API (/v1/…):                            │
│   POST /v1/activate   → issue JWT              │
│   POST /v1/refresh    → renew JWT              │
│   POST /v1/deactivate → free a seat            │
│   POST /v1/usage      → record usage event     │
│   POST /v1/flag       → report violation       │
│                                                │
│ Admin API (/admin/…, bearer token):            │
│   POST/GET /admin/keys     → CRUD keys         │
│   DELETE /admin/keys/{id}  → revoke            │
│   GET /admin/activations   → audit             │
│   GET /admin/usage         → telemetry         │
│   GET/PATCH /admin/flags   → flag review       │
└────────────────────────────────────────────────┘
        ↑ HTTPS via Caddy (license.circuitforge.com)

┌────────────────────────────────────────────────┐
│ Peregrine (user's machine)                     │
│ scripts/license.py                             │
│                                                │
│ activate(key) → POST /v1/activate              │
│   writes config/license.json                   │
│ verify_local() → validates JWT offline         │
│   using embedded public key                    │
│ refresh_if_needed() → called on app startup    │
│ effective_tier() → tier string for can_use()   │
│ report_usage(…) → fire-and-forget telemetry    │
│ report_flag(…) → fire-and-forget violation     │
└────────────────────────────────────────────────┘
```

**Key properties:**

- Peregrine verifies tier **offline** on every check — RS256 public key embedded at build time
- Network required only at activation and 30-day refresh
- Revoked keys stop working at the next refresh cycle (≤30-day lag — acceptable for v1)
- `config/license.json` is gitignored; missing = free tier

---

## Crypto: RS256 (asymmetric JWT)

- **Private key** — lives only on the license server (`keys/private.pem`, gitignored)
- **Public key** — committed to both the license server repo and Peregrine (`scripts/license_public_key.pem`)
- Peregrine can verify JWT authenticity without ever knowing the private key
- A stolen JWT cannot be forged without the private key
- Revocation: the server refuses refresh; the old JWT remains valid until it expires and the grace period lapses

**Key generation (one-time, on Heimdall):**
```bash
openssl genrsa -out keys/private.pem 2048
openssl rsa -in keys/private.pem -pubout -out keys/public.pem
# copy keys/public.pem → peregrine/scripts/license_public_key.pem
```
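The sign/verify round trip is worth seeing end to end. A minimal sketch assuming PyJWT plus the `cryptography` package; the keypair is generated in memory here so the example is self-contained, whereas the server would load `keys/private.pem`:

```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# In production the keypair comes from keys/private.pem / keys/public.pem.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
private_pem = key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.PKCS8,
    serialization.NoEncryption(),
)
public_pem = key.public_key().public_bytes(
    serialization.Encoding.PEM,
    serialization.PublicFormat.SubjectPublicKeyInfo,
)

# Server side: sign the tier claims with the private key.
now = datetime.now(timezone.utc)
token = jwt.encode(
    {"sub": "CFG-PRNG-A1B2-C3D4-E5F6", "product": "peregrine",
     "tier": "paid", "iat": now, "exp": now + timedelta(days=30)},
    private_pem,
    algorithm="RS256",
)

# Client side: verify with the embedded public key only.
claims = jwt.decode(token, public_pem, algorithms=["RS256"])
assert claims["tier"] == "paid"
```

Note that `jwt.decode` rejects expired tokens by default, which is exactly the behavior the grace-period logic below has to layer on top of.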
---

## Database Schema

```sql
CREATE TABLE license_keys (
    id             TEXT PRIMARY KEY,        -- UUID
    key_display    TEXT UNIQUE NOT NULL,    -- CFG-PRNG-XXXX-XXXX-XXXX
    product        TEXT NOT NULL,           -- peregrine | falcon | osprey | …
    tier           TEXT NOT NULL,           -- paid | premium | ultra
    seats          INTEGER DEFAULT 1,
    valid_until    TEXT,                    -- ISO date or NULL (perpetual)
    revoked        INTEGER DEFAULT 0,
    customer_email TEXT,                    -- proper field, not buried in notes
    source         TEXT DEFAULT 'manual',   -- manual | beta | promo | stripe
    trial          INTEGER DEFAULT 0,       -- 1 = time-limited trial key
    notes          TEXT,
    created_at     TEXT NOT NULL
);

CREATE TABLE activations (
    id             TEXT PRIMARY KEY,
    key_id         TEXT NOT NULL REFERENCES license_keys(id),
    machine_id     TEXT NOT NULL,           -- sha256(hostname + MAC)
    app_version    TEXT,                    -- Peregrine version at last refresh
    platform       TEXT,                    -- linux | macos | windows | docker
    activated_at   TEXT NOT NULL,
    last_refresh   TEXT NOT NULL,
    deactivated_at TEXT                     -- NULL = still active
);

CREATE TABLE usage_events (
    id         TEXT PRIMARY KEY,
    key_id     TEXT NOT NULL REFERENCES license_keys(id),
    machine_id TEXT NOT NULL,
    product    TEXT NOT NULL,
    event_type TEXT NOT NULL,               -- cover_letter_generated |
                                            -- company_research | email_sync |
                                            -- interview_prep | survey | etc.
    metadata   TEXT,                        -- JSON blob for context
    created_at TEXT NOT NULL
);

CREATE TABLE flags (
    id           TEXT PRIMARY KEY,
    key_id       TEXT NOT NULL REFERENCES license_keys(id),
    machine_id   TEXT,
    product      TEXT NOT NULL,
    flag_type    TEXT NOT NULL,             -- content_violation | tos_violation |
                                            -- abuse | manual
    details      TEXT,                      -- JSON: prompt snippet, output excerpt
    status       TEXT DEFAULT 'open',       -- open | reviewed | dismissed | actioned
    created_at   TEXT NOT NULL,
    reviewed_at  TEXT,
    action_taken TEXT                       -- none | warned | revoked
);

CREATE TABLE audit_log (
    id          TEXT PRIMARY KEY,
    entity_type TEXT NOT NULL,              -- key | activation | flag
    entity_id   TEXT NOT NULL,
    action      TEXT NOT NULL,              -- created | revoked | activated |
                                            -- deactivated | flag_actioned
    actor       TEXT,                       -- admin identifier (future multi-admin)
    details     TEXT,                       -- JSON
    created_at  TEXT NOT NULL
);
```

**Flags scope (v1):** The schema and the `POST /v1/flag` endpoint capture data. There is no admin enforcement UI in v1 — query the DB directly. Build the review UI in v2 when there's data to act on.

---

## JWT Payload

```json
{
  "sub": "CFG-PRNG-A1B2-C3D4-E5F6",
  "product": "peregrine",
  "tier": "paid",
  "seats": 2,
  "machine": "a3f9c2…",
  "notice": "Version 1.1 available — see circuitforge.com/update",
  "iat": 1740000000,
  "exp": 1742592000
}
```

`notice` is optional — set via a server config value and included in refresh responses so Peregrine can surface it as a banner. No DB table needed.

---

## Key Format

`CFG-PRNG-A1B2-C3D4-E5F6`

- `CFG` — Circuit Forge
- `PRNG` / `FLCN` / `OSPY` / … — 4-char product code
- Three random 4-char alphanumeric segments
- Human-readable, easy to copy/paste into a support email
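Generating keys in this format needs nothing beyond the stdlib. A sketch (the helper name and product-code map are illustrative; the server may well restrict the alphabet further to avoid lookalike characters):

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits
PRODUCT_CODES = {"peregrine": "PRNG", "falcon": "FLCN", "osprey": "OSPY"}

def generate_key(product: str) -> str:
    """Return a key like CFG-PRNG-A1B2-C3D4-E5F6."""
    segments = ["".join(secrets.choice(ALPHABET) for _ in range(4))
                for _ in range(3)]
    return "-".join(["CFG", PRODUCT_CODES[product], *segments])

print(generate_key("peregrine"))
```

`secrets` rather than `random` matters here: the three segments are the only unguessable part of the key.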
---

## Endpoint Reference

| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| POST | `/v1/activate` | none | Issue JWT for key + machine |
| POST | `/v1/refresh` | JWT bearer | Renew JWT before expiry |
| POST | `/v1/deactivate` | JWT bearer | Free a seat |
| POST | `/v1/usage` | JWT bearer | Record usage event (fire-and-forget) |
| POST | `/v1/flag` | JWT bearer | Report content/ToS violation |
| POST | `/admin/keys` | admin token | Create a new key |
| GET | `/admin/keys` | admin token | List all keys + activation counts |
| DELETE | `/admin/keys/{id}` | admin token | Revoke a key |
| GET | `/admin/activations` | admin token | Full activation audit |
| GET | `/admin/usage` | admin token | Usage breakdown per key/product/event |
| GET | `/admin/flags` | admin token | List flags (open by default) |
| PATCH | `/admin/flags/{id}` | admin token | Update flag status + action |
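The seat check behind `/v1/activate` reduces to one query against the `activations` table (active = `deactivated_at IS NULL`). A stdlib sketch of that check; the function name is illustrative and the demo table is a simplified subset of the schema above:

```python
import sqlite3

def can_activate(db: sqlite3.Connection, key_id: str, seats: int) -> bool:
    """True if the key still has a free seat."""
    (active,) = db.execute(
        "SELECT COUNT(*) FROM activations "
        "WHERE key_id = ? AND deactivated_at IS NULL",
        (key_id,),
    ).fetchone()
    return active < seats

# Self-contained demo with a trimmed-down activations table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE activations (key_id TEXT, deactivated_at TEXT)")
db.execute("INSERT INTO activations VALUES ('k1', NULL)")
print(can_activate(db, "k1", seats=2))  # → True: one of two seats used
print(can_activate(db, "k1", seats=1))  # → False: seat limit reached
```

Deactivation then just sets `deactivated_at`, which frees the seat without losing the audit trail.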
---

## Peregrine Client (`scripts/license.py`)

**Public API:**
```python
def activate(key: str) -> dict            # POST /v1/activate, writes license.json
def verify_local() -> dict | None         # validates JWT offline; None = free tier
def refresh_if_needed() -> None           # silent; called on app startup
def effective_tier() -> str               # "free" | "paid" | "premium" | "ultra"
def report_usage(event_type: str,         # fire-and-forget; failures silently dropped
                 metadata: dict | None = None) -> None
def report_flag(flag_type: str,           # fire-and-forget
                details: dict) -> None
```

**`effective_tier()` decision tree:**
```
license.json missing or unreadable  → "free"
JWT signature invalid               → "free"
JWT product != "peregrine"          → "free"
JWT not expired                     → tier from payload
JWT expired, within grace period    → tier from payload + show banner
JWT expired, grace period expired   → "free" + show banner
```
**`config/license.json` (gitignored):**
```json
{
  "jwt": "eyJ…",
  "key_display": "CFG-PRNG-A1B2-C3D4-E5F6",
  "tier": "paid",
  "valid_until": "2026-03-27",
  "machine_id": "a3f9c2…",
  "last_refresh": "2026-02-25T12:00:00Z",
  "grace_until": null
}
```

**Integration point in `tiers.py`:**
```python
def effective_tier(profile) -> str:
    from scripts.license import effective_tier as _license_tier
    if profile.dev_tier_override:  # dev override still works in dev mode
        return profile.dev_tier_override
    return _license_tier()
```

**Settings License tab** (new tab in `app/pages/2_Settings.py`):

- Text input: enter license key → calls `activate()` → shows result
- If active: tier badge, key display string, expiry date, seat count
- Grace period: amber banner with days remaining
- "Deactivate this machine" button → `/v1/deactivate`, deletes `license.json`

---

## Deployment

**Repo:** `git.opensourcesolarpunk.com/pyr0ball/circuitforge-license` (private)

**Repo layout:**
```
circuitforge-license/
├── app/
│   ├── main.py          # FastAPI app
│   ├── db.py            # SQLite helpers, schema init
│   ├── models.py        # Pydantic models
│   ├── crypto.py        # RSA sign/verify helpers
│   └── routes/
│       ├── public.py    # /v1/* endpoints
│       └── admin.py     # /admin/* endpoints
├── data/                # SQLite DB (named volume)
├── keys/
│   ├── private.pem      # gitignored
│   └── public.pem       # committed
├── scripts/
│   └── issue-key.sh     # curl wrapper for key issuance
├── tests/
├── Dockerfile
├── docker-compose.yml
├── .env.example
└── requirements.txt
```

**`docker-compose.yml` (on Heimdall):**
```yaml
services:
  license:
    build: .
    restart: unless-stopped
    ports:
      - "127.0.0.1:8600:8600"
    volumes:
      - license_data:/app/data
      - ./keys:/app/keys:ro
    env_file: .env

volumes:
  license_data:
```

**`.env` (gitignored):**
```
ADMIN_TOKEN=<long random string>
JWT_PRIVATE_KEY_PATH=/app/keys/private.pem
JWT_PUBLIC_KEY_PATH=/app/keys/public.pem
JWT_EXPIRY_DAYS=30
GRACE_PERIOD_DAYS=7
```

**Caddy block (add to Heimdall Caddyfile):**
```caddy
license.circuitforge.com {
    reverse_proxy localhost:8600
}
```

---

## Admin Workflow (v1)

All operations via `curl` or `scripts/issue-key.sh`:

```bash
# Issue a key
./scripts/issue-key.sh --product peregrine --tier paid --seats 2 \
  --email user@example.com --notes "Beta — manual payment 2026-02-25"
# → CFG-PRNG-A1B2-C3D4-E5F6 (email to customer)

# List all keys
curl https://license.circuitforge.com/admin/keys \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Revoke a key
curl -X DELETE https://license.circuitforge.com/admin/keys/{id} \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

---

## Testing Strategy

**License server:**

- pytest with in-memory SQLite and a generated test keypair
- All endpoints tested: activate, refresh, deactivate, usage, flag, admin CRUD
- Seat limit enforcement, expiry, and revocation all unit tested

**Peregrine client:**

- `verify_local()` tested with a pre-signed test JWT using the test keypair
- `activate()` / `refresh()` tested with `httpx` mocks
- `effective_tier()` tested across all states: valid, expired, grace, revoked, missing
**Integration smoke test:**
```bash
docker compose up -d
# create test key via admin API
# call /v1/activate with test key
# verify JWT signature with public key
# verify /v1/refresh extends expiry
```

---

## Decisions Log

| Decision | Rationale |
|----------|-----------|
| RS256 over HS256 | Public key embeddable in client; private key never leaves server |
| SQLite over Postgres | Matches Peregrine's SQLite-first philosophy; trivially backupable |
| 30-day JWT lifetime | Standard SaaS pattern; invisible to users in normal operation |
| 7-day grace period | Covers travel, network outages, server maintenance |
| Flags v1: capture only | No volume to justify a review UI yet; add in v2 |
| No payment integration | Manual issuance until customer volume justifies automation |
| Multi-product schema | Adding a column now beats migrating a live DB later |
| Separate repo | License server is infrastructure, not part of Peregrine's BSL scope |
# Peregrine — Dual-GPU / Dual-Inference Design

**Date:** 2026-02-26
**Status:** Approved — ready for implementation
**Scope:** Peregrine (reference impl; patterns propagate to future products)

---

## Goal

Replace the fixed `dual-gpu` profile (Ollama + vLLM hardwired to GPU 0 + GPU 1) with a `DUAL_GPU_MODE` env var that selects which inference stack occupies GPU 1. Simultaneously, add a first-run download size warning to preflight so users know what they're in for before Docker starts pulling images and models.

---

## Modes

| `DUAL_GPU_MODE` | GPU 0 | GPU 1 | Research backend |
|-----------------|-------|-------|-----------------|
| `ollama` (default) | ollama + vision | ollama_research | `ollama_research` |
| `vllm` | ollama + vision | vllm | `vllm_research` |
| `mixed` | ollama + vision | ollama_research + vllm (VRAM-split) | `vllm_research` → `ollama_research` fallback |

`mixed` requires sufficient VRAM on GPU 1. Preflight warns (rather than blocks) when GPU 1 has < 12 GB free before starting in mixed mode.

Cover letters always use `ollama` on GPU 0. Research uses whichever GPU 1 backend is reachable. The LLM router's `_is_reachable()` check handles this transparently — the fallback chain simply skips services that aren't running.

---

## Compose Profile Architecture

Docker Compose profiles are used to gate which services start per mode. `DUAL_GPU_MODE` is read by the Makefile and passed as a second `--profile` flag.

### Service → profile mapping

| Service | Profiles |
|---------|---------|
| `ollama` | `cpu`, `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `vision` | `single-gpu`, `dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed` |
| `ollama_research` | `dual-gpu-ollama`, `dual-gpu-mixed` |
| `vllm` | `dual-gpu-vllm`, `dual-gpu-mixed` |
| `finetune` | `finetune` |

User-facing profiles remain: `remote`, `cpu`, `single-gpu`, `dual-gpu`. Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected by the Makefile and never typed by the user.

---

## File Changes

### `compose.yml`

**`ollama`** — add all dual-gpu sub-profiles to `profiles`:
```yaml
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**`vision`** — same pattern:
```yaml
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**`vllm`** — change from `[dual-gpu]` to:
```yaml
profiles: [dual-gpu-vllm, dual-gpu-mixed]
```

**`ollama_research`** — new service:
```yaml
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama  # shared — no double download
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
```

### `compose.gpu.yml`

Add an `ollama_research` block (GPU 1). `vllm` stays on GPU 1 as-is:
```yaml
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
```

### `compose.podman-gpu.yml`

Same addition for Podman CDI:
```yaml
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
```

### `Makefile`

Two additions after the existing `COMPOSE` detection:

```makefile
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 || echo ollama)

# GPU overlay: matches single-gpu, dual-gpu (findstring gpu already covers these)
# Sub-profile injection for dual-gpu modes:
ifeq ($(PROFILE),dual-gpu)
COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```

Update the `manage.sh` usage block to document the `dual-gpu` profile with a `DUAL_GPU_MODE` note:
```
dual-gpu    Ollama + Vision on GPU 0; GPU 1 mode set by DUAL_GPU_MODE
            DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1
            DUAL_GPU_MODE=vllm              vllm on GPU 1
            DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split; see preflight warning)
```

### `scripts/preflight.py`

**1. `_SERVICES` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
```

**2. `_LLM_BACKENDS` — add entries for both new backends:**
```python
"ollama_research": [("ollama_research", "/v1")],
# vllm_research is an alias for vllm's port — preflight updates base_url for both:
"vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
```

**3. `_DOCKER_INTERNAL` — add `ollama_research`:**
```python
"ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
```

**4. `recommend_profile()` — unchanged** (still returns `"dual-gpu"` for 2 GPUs). Write `DUAL_GPU_MODE=ollama` to `.env` when first setting up a 2-GPU system.

**5. Mixed-mode VRAM warning** — after the GPU resource section, before the closing line:
```python
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
if dual_gpu_mode == "mixed" and len(gpus) >= 2:
    if gpus[1]["vram_free_gb"] < 12:
        print(f"║ ⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {gpus[1]['vram_free_gb']:.1f} GB free")
        print("║   Running ollama_research + vllm together may cause OOM.")
        print("║   Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm instead.")
```

**6. Download size warning** — profile-aware block added just before the closing `╚` line:

```
║ Download sizes (first-run estimates)
║   Docker images
║     ollama/ollama       ~800 MB   (shared by ollama + ollama_research)
║     searxng/searxng     ~300 MB
║     app (Python build)  ~1.5 GB
║     vision service      ~3.0 GB   [single-gpu and above]
║     vllm/vllm-openai    ~10.0 GB  [vllm / mixed mode only]
║
║   Model weights (lazy-loaded on first use)
║     llama3.2:3b         ~2.0 GB   → OLLAMA_MODELS_DIR
║     moondream2          ~1.8 GB   → vision container cache [single-gpu+]
║   Note: ollama + ollama_research share the same model dir — no double download
║
║ ⚠ Total first-run: ~X GB (models persist between restarts)
```

The total is summed at runtime based on the active profile + `DUAL_GPU_MODE`.

Size table (used by the warning calculator):

| Component | Size | Condition |
|-----------|------|-----------|
| `ollama/ollama` image | 800 MB | cpu, single-gpu, dual-gpu |
| `searxng/searxng` image | 300 MB | always |
| app image | 1,500 MB | always |
| vision service image | 3,000 MB | single-gpu, dual-gpu |
| `vllm/vllm-openai` image | 10,000 MB | vllm or mixed mode |
| llama3.2:3b weights | 2,000 MB | cpu, single-gpu, dual-gpu |
| moondream2 weights | 1,800 MB | single-gpu, dual-gpu |
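The runtime total is a straightforward sum over that table. A sketch of the calculator, with each condition column expressed as a predicate over `(profile, dual_gpu_mode)`; the table data is from above, the function name is illustrative:

```python
# (component, size in MB, applies-to predicate over (profile, mode))
SIZES = [
    ("ollama/ollama image",    800,   lambda p, m: p != "remote"),
    ("searxng/searxng image",  300,   lambda p, m: True),
    ("app image",              1500,  lambda p, m: True),
    ("vision image",           3000,  lambda p, m: p in ("single-gpu", "dual-gpu")),
    ("vllm/vllm-openai image", 10000, lambda p, m: p == "dual-gpu" and m in ("vllm", "mixed")),
    ("llama3.2:3b weights",    2000,  lambda p, m: p != "remote"),
    ("moondream2 weights",     1800,  lambda p, m: p in ("single-gpu", "dual-gpu")),
]

def first_run_mb(profile: str, dual_gpu_mode: str = "ollama") -> int:
    """Estimated first-run download for a profile/mode combination."""
    return sum(mb for _, mb, cond in SIZES if cond(profile, dual_gpu_mode))

print(first_run_mb("single-gpu"))         # → 9400
print(first_run_mb("dual-gpu", "mixed"))  # → 19400
```

The ~9 GB jump between `ollama` and `vllm`/`mixed` modes is exactly why the warning is profile-aware rather than a static banner.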
### `config/llm.yaml`

**Add a `vllm_research` backend:**
```yaml
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1  # same port as vllm; preflight keeps in sync
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
```

**Update `research_fallback_order`:**
```yaml
research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic
```

`vllm` stays in the main `fallback_order` (cover letters). `vllm_research` is the explicit research alias for the same service — a different config key on the same port, which makes the routing intent readable in the YAML.

---

## Downstream Compatibility

The LLM router requires no changes. `_is_reachable()` already skips backends that aren't responding. When `DUAL_GPU_MODE=ollama`, `vllm_research` is unreachable and skipped; `ollama_research` is up and used. When `DUAL_GPU_MODE=vllm`, the reverse. `mixed` mode makes both reachable; `vllm_research` wins as the higher-priority entry.
|
||||
|
||||
Preflight's `update_llm_yaml()` keeps `base_url` values correct for both adopted (external)
|
||||
and Docker-internal routing automatically, since `vllm_research` is registered under the
|
||||
`"vllm"` key in `_LLM_BACKENDS`.
|
||||
|
||||
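The reachability-gated fallback can be sketched as follows (a minimal illustration, not the actual router code; `pick_backend` and the `reachable` map are hypothetical stand-ins):

```python
# Hedged sketch of the router's priority-fallback behavior: walk the
# research_fallback_order and take the first backend that responds.
def pick_backend(order, reachable):
    """Return the first backend name in `order` marked reachable, else None."""
    for name in order:
        if reachable.get(name, False):
            return name
    return None

order = ["claude_code", "vllm_research", "ollama_research", "github_copilot", "anthropic"]

# DUAL_GPU_MODE=ollama: vllm_research is down, ollama_research answers
print(pick_backend(order, {"ollama_research": True}))  # ollama_research
# DUAL_GPU_MODE=mixed: both answer; vllm_research wins on priority
print(pick_backend(order, {"ollama_research": True, "vllm_research": True}))  # vllm_research
```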
---

## Future Considerations

- **Triple-GPU / 3+ service configs:** When a third product is active, extract this pattern
  into `circuitforge-core` as a reusable inference topology manager.
- **Dual vLLM:** Two vLLM instances (e.g., different model sizes per task) follow the same
  pattern — add `vllm_research` as a separate compose service on its own port.
- **VRAM-aware model selection:** Preflight could suggest smaller models when VRAM is tight
  in mixed mode (e.g., swap llama3.2:3b → llama3.2:1b for the research instance).
- **Queue optimizer (1-GPU / CPU):** When only one inference backend is available and a batch
  of tasks is queued, group by task type (all cover letters first, then all research briefs)
  to avoid repeated model context switches. Tracked separately.

@@ -1,811 +0,0 @@
# Dual-GPU / Dual-Inference Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add `DUAL_GPU_MODE=ollama|vllm|mixed` env var that gates which inference service occupies GPU 1 on dual-GPU systems, plus a first-run download size warning in preflight.

**Architecture:** Sub-profiles (`dual-gpu-ollama`, `dual-gpu-vllm`, `dual-gpu-mixed`) are injected alongside `--profile dual-gpu` by the Makefile based on `DUAL_GPU_MODE`. The LLM router requires zero changes — `_is_reachable()` naturally skips backends that aren't running. Preflight gains `ollama_research` as a tracked service and emits a size warning block.

**Tech Stack:** Docker Compose profiles, Python (preflight.py), YAML (llm.yaml, compose files), bash (Makefile, manage.sh)

**Design doc:** `docs/plans/2026-02-26-dual-gpu-design.md`

**Test runner:** `conda run -n job-seeker python -m pytest tests/ -v`

---
### Task 1: Update `config/llm.yaml`

**Files:**
- Modify: `config/llm.yaml`

**Step 1: Add `vllm_research` backend and update `research_fallback_order`**

Open `config/llm.yaml`. After the `vllm:` block, add:

```yaml
vllm_research:
  api_key: ''
  base_url: http://host.docker.internal:8000/v1
  enabled: true
  model: __auto__
  supports_images: false
  type: openai_compat
```

Replace the `research_fallback_order:` section with:

```yaml
research_fallback_order:
  - claude_code
  - vllm_research
  - ollama_research
  - github_copilot
  - anthropic
```

**Step 2: Verify YAML parses cleanly**

```bash
conda run -n job-seeker python -c "import yaml; yaml.safe_load(open('config/llm.yaml'))"
```

Expected: no output (no error).

**Step 3: Run existing llm config test**

```bash
conda run -n job-seeker python -m pytest tests/test_llm_router.py::test_config_loads -v
```

Expected: PASS

**Step 4: Commit**

```bash
git add config/llm.yaml
git commit -m "feat: add vllm_research backend and update research_fallback_order"
```

---
### Task 2: Write failing tests for preflight changes

**Files:**
- Create: `tests/test_preflight.py`

No existing test file for preflight. Write all tests upfront — they fail until Tasks 3–5 implement the code.

**Step 1: Create `tests/test_preflight.py`**

```python
"""Tests for scripts/preflight.py additions: dual-GPU service table, size warning, VRAM check."""
import pytest
from pathlib import Path
from unittest.mock import patch
import yaml
import tempfile
import os


# ── Service table ──────────────────────────────────────────────────────────────

def test_ollama_research_in_services():
    """ollama_research must be in _SERVICES at port 11435."""
    from scripts.preflight import _SERVICES
    assert "ollama_research" in _SERVICES
    _, default_port, env_var, docker_owned, adoptable = _SERVICES["ollama_research"]
    assert default_port == 11435
    assert env_var == "OLLAMA_RESEARCH_PORT"
    assert docker_owned is True
    assert adoptable is True


def test_ollama_research_in_llm_backends():
    """ollama_research must be a standalone key in _LLM_BACKENDS (not nested under ollama)."""
    from scripts.preflight import _LLM_BACKENDS
    assert "ollama_research" in _LLM_BACKENDS
    # Should map to the ollama_research llm backend
    backend_names = [name for name, _ in _LLM_BACKENDS["ollama_research"]]
    assert "ollama_research" in backend_names


def test_vllm_research_in_llm_backends():
    """vllm_research must be registered under vllm in _LLM_BACKENDS."""
    from scripts.preflight import _LLM_BACKENDS
    assert "vllm" in _LLM_BACKENDS
    backend_names = [name for name, _ in _LLM_BACKENDS["vllm"]]
    assert "vllm_research" in backend_names


def test_ollama_research_in_docker_internal():
    """ollama_research must map to internal port 11434 (Ollama's container port)."""
    from scripts.preflight import _DOCKER_INTERNAL
    assert "ollama_research" in _DOCKER_INTERNAL
    hostname, port = _DOCKER_INTERNAL["ollama_research"]
    assert hostname == "ollama_research"
    assert port == 11434  # container-internal port is always 11434


def test_ollama_not_mapped_to_ollama_research_backend():
    """ollama service key must only update the ollama llm backend, not ollama_research."""
    from scripts.preflight import _LLM_BACKENDS
    ollama_backend_names = [name for name, _ in _LLM_BACKENDS.get("ollama", [])]
    assert "ollama_research" not in ollama_backend_names


# ── Download size warning ──────────────────────────────────────────────────────

def test_download_size_remote_profile():
    """Remote profile: only searxng + app, no ollama, no vision, no vllm."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("remote", "ollama")
    assert "searxng" in sizes
    assert "app" in sizes
    assert "ollama" not in sizes
    assert "vision_image" not in sizes
    assert "vllm_image" not in sizes


def test_download_size_cpu_profile():
    """CPU profile: adds ollama image + llama3.2:3b weights."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("cpu", "ollama")
    assert "ollama" in sizes
    assert "llama3_2_3b" in sizes
    assert "vision_image" not in sizes


def test_download_size_single_gpu_profile():
    """Single-GPU: adds vision image + moondream2 weights."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("single-gpu", "ollama")
    assert "vision_image" in sizes
    assert "moondream2" in sizes
    assert "vllm_image" not in sizes


def test_download_size_dual_gpu_ollama_mode():
    """dual-gpu + ollama mode: no vllm image."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("dual-gpu", "ollama")
    assert "vllm_image" not in sizes


def test_download_size_dual_gpu_vllm_mode():
    """dual-gpu + vllm mode: adds ~10 GB vllm image."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("dual-gpu", "vllm")
    assert "vllm_image" in sizes
    assert sizes["vllm_image"] >= 9000  # at least 9 GB


def test_download_size_dual_gpu_mixed_mode():
    """dual-gpu + mixed mode: also includes vllm image."""
    from scripts.preflight import _download_size_mb
    sizes = _download_size_mb("dual-gpu", "mixed")
    assert "vllm_image" in sizes


# ── Mixed-mode VRAM warning ────────────────────────────────────────────────────

def test_mixed_mode_vram_warning_triggered():
    """Should return a warning string when GPU 1 has < 12 GB free in mixed mode."""
    from scripts.preflight import _mixed_mode_vram_warning
    gpus = [
        {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
        {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 8.0},  # tight
    ]
    warning = _mixed_mode_vram_warning(gpus, "mixed")
    assert warning is not None
    assert "8.0" in warning or "GPU 1" in warning


def test_mixed_mode_vram_warning_not_triggered_with_headroom():
    """Should return None when GPU 1 has >= 12 GB free."""
    from scripts.preflight import _mixed_mode_vram_warning
    gpus = [
        {"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
        {"name": "RTX 4090", "vram_total_gb": 24.0, "vram_free_gb": 18.0},  # plenty
    ]
    warning = _mixed_mode_vram_warning(gpus, "mixed")
    assert warning is None


def test_mixed_mode_vram_warning_not_triggered_for_other_modes():
    """Warning only applies in mixed mode."""
    from scripts.preflight import _mixed_mode_vram_warning
    gpus = [
        {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 20.0},
        {"name": "RTX 3090", "vram_total_gb": 24.0, "vram_free_gb": 6.0},
    ]
    assert _mixed_mode_vram_warning(gpus, "ollama") is None
    assert _mixed_mode_vram_warning(gpus, "vllm") is None


# ── update_llm_yaml with ollama_research ──────────────────────────────────────

def test_update_llm_yaml_sets_ollama_research_url_docker_internal():
    """ollama_research backend URL must be set to ollama_research:11434 when Docker-owned."""
    from scripts.preflight import update_llm_yaml

    llm_cfg = {
        "backends": {
            "ollama": {"base_url": "http://old", "type": "openai_compat"},
            "ollama_research": {"base_url": "http://old", "type": "openai_compat"},
            "vllm": {"base_url": "http://old", "type": "openai_compat"},
            "vllm_research": {"base_url": "http://old", "type": "openai_compat"},
            "vision_service": {"base_url": "http://old", "type": "vision_service"},
        }
    }

    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        yaml.dump(llm_cfg, f)
        tmp_path = Path(f.name)

    ports = {
        "ollama": {"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"},
        "ollama_research": {"resolved": 11435, "external": False, "env_var": "OLLAMA_RESEARCH_PORT"},
        "vllm": {"resolved": 8000, "external": False, "env_var": "VLLM_PORT"},
        "vision": {"resolved": 8002, "external": False, "env_var": "VISION_PORT"},
    }

    try:
        # Patch LLM_YAML to point at our temp file
        with patch("scripts.preflight.LLM_YAML", tmp_path):
            update_llm_yaml(ports)

        result = yaml.safe_load(tmp_path.read_text())
        # Docker-internal: use service name + container port
        assert result["backends"]["ollama_research"]["base_url"] == "http://ollama_research:11434/v1"
        # vllm_research must match vllm's URL
        assert result["backends"]["vllm_research"]["base_url"] == result["backends"]["vllm"]["base_url"]
    finally:
        tmp_path.unlink()


def test_update_llm_yaml_sets_ollama_research_url_external():
    """When ollama_research is external (adopted), URL uses host.docker.internal:11435."""
    from scripts.preflight import update_llm_yaml

    llm_cfg = {
        "backends": {
            "ollama": {"base_url": "http://old", "type": "openai_compat"},
            "ollama_research": {"base_url": "http://old", "type": "openai_compat"},
        }
    }

    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        yaml.dump(llm_cfg, f)
        tmp_path = Path(f.name)

    ports = {
        "ollama": {"resolved": 11434, "external": False, "env_var": "OLLAMA_PORT"},
        "ollama_research": {"resolved": 11435, "external": True, "env_var": "OLLAMA_RESEARCH_PORT"},
    }

    try:
        with patch("scripts.preflight.LLM_YAML", tmp_path):
            update_llm_yaml(ports)
        result = yaml.safe_load(tmp_path.read_text())
        assert result["backends"]["ollama_research"]["base_url"] == "http://host.docker.internal:11435/v1"
    finally:
        tmp_path.unlink()
```

**Step 2: Run tests to confirm they all fail**

```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -v 2>&1 | head -50
```

Expected: all FAIL with `ImportError` or `AssertionError` — that's correct.

**Step 3: Commit failing tests**

```bash
git add tests/test_preflight.py
git commit -m "test: add failing tests for dual-gpu preflight additions"
```

---
### Task 3: `preflight.py` — service table additions

**Files:**
- Modify: `scripts/preflight.py:46-67` (`_SERVICES`, `_LLM_BACKENDS`, `_DOCKER_INTERNAL`)

**Step 1: Update `_SERVICES`**

Find the `_SERVICES` dict (currently ends at the `"ollama"` entry). Add `ollama_research` as a new entry:

```python
_SERVICES: dict[str, tuple[str, int, str, bool, bool]] = {
    "streamlit": ("streamlit_port", 8501, "STREAMLIT_PORT", True, False),
    "searxng": ("searxng_port", 8888, "SEARXNG_PORT", True, True),
    "vllm": ("vllm_port", 8000, "VLLM_PORT", True, True),
    "vision": ("vision_port", 8002, "VISION_PORT", True, True),
    "ollama": ("ollama_port", 11434, "OLLAMA_PORT", True, True),
    "ollama_research": ("ollama_research_port", 11435, "OLLAMA_RESEARCH_PORT", True, True),
}
```

**Step 2: Update `_LLM_BACKENDS`**

Replace the existing dict:

```python
_LLM_BACKENDS: dict[str, list[tuple[str, str]]] = {
    "ollama": [("ollama", "/v1")],
    "ollama_research": [("ollama_research", "/v1")],
    "vllm": [("vllm", "/v1"), ("vllm_research", "/v1")],
    "vision": [("vision_service", "")],
}
```

**Step 3: Update `_DOCKER_INTERNAL`**

Add `ollama_research` entry:

```python
_DOCKER_INTERNAL: dict[str, tuple[str, int]] = {
    "ollama": ("ollama", 11434),
    "ollama_research": ("ollama_research", 11434),  # container-internal port is always 11434
    "vllm": ("vllm", 8000),
    "vision": ("vision", 8002),
    "searxng": ("searxng", 8080),
}
```

**Step 4: Run service table tests**

```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py::test_ollama_research_in_services tests/test_preflight.py::test_ollama_research_in_llm_backends tests/test_preflight.py::test_vllm_research_in_llm_backends tests/test_preflight.py::test_ollama_research_in_docker_internal tests/test_preflight.py::test_ollama_not_mapped_to_ollama_research_backend tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_docker_internal tests/test_preflight.py::test_update_llm_yaml_sets_ollama_research_url_external -v
```

Expected: all PASS

**Step 5: Commit**

```bash
git add scripts/preflight.py
git commit -m "feat: add ollama_research to preflight service table and LLM backend map"
```

---
### Task 4: `preflight.py` — `_download_size_mb()` pure function

**Files:**
- Modify: `scripts/preflight.py` (add new function after `calc_cpu_offload_gb`)

**Step 1: Add the function**

After `calc_cpu_offload_gb()`, add:

```python
def _download_size_mb(profile: str, dual_gpu_mode: str = "ollama") -> dict[str, int]:
    """
    Return estimated first-run download sizes in MB, keyed by component name.
    Profile-aware: only includes components that will actually be pulled.
    """
    sizes: dict[str, int] = {
        "searxng": 300,
        "app": 1500,
    }
    if profile in ("cpu", "single-gpu", "dual-gpu"):
        sizes["ollama"] = 800
        sizes["llama3_2_3b"] = 2000
    if profile in ("single-gpu", "dual-gpu"):
        sizes["vision_image"] = 3000
        sizes["moondream2"] = 1800
    if profile == "dual-gpu" and dual_gpu_mode in ("vllm", "mixed"):
        sizes["vllm_image"] = 10000
    return sizes
```
**Step 2: Run download size tests**

```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "download_size" -v
```

Expected: all PASS

**Step 3: Commit**

```bash
git add scripts/preflight.py
git commit -m "feat: add _download_size_mb() pure function for preflight size warning"
```

---
### Task 5: `preflight.py` — VRAM warning, size report block, DUAL_GPU_MODE default

**Files:**
- Modify: `scripts/preflight.py` (three additions to `main()` and a new helper)

**Step 1: Add `_mixed_mode_vram_warning()` after `_download_size_mb()`**

```python
def _mixed_mode_vram_warning(gpus: list[dict], dual_gpu_mode: str) -> str | None:
    """
    Return a warning string if GPU 1 likely lacks VRAM for mixed mode, else None.
    Only relevant when dual_gpu_mode == 'mixed' and at least 2 GPUs are present.
    """
    if dual_gpu_mode != "mixed" or len(gpus) < 2:
        return None
    free = gpus[1]["vram_free_gb"]
    if free < 12:
        return (
            f"⚠ DUAL_GPU_MODE=mixed: GPU 1 has only {free:.1f} GB free — "
            f"running ollama_research + vllm together may cause OOM. "
            f"Consider DUAL_GPU_MODE=ollama or DUAL_GPU_MODE=vllm."
        )
    return None
```
**Step 2: Run VRAM warning tests**

```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -k "vram" -v
```

Expected: all PASS

**Step 3: Wire size warning into `main()` report block**

In `main()`, find the closing `print("╚═...═╝")` line. Add the size warning block just before it:

```python
# ── Download size warning ──────────────────────────────────────────────
dual_gpu_mode = os.environ.get("DUAL_GPU_MODE", "ollama")
sizes = _download_size_mb(profile, dual_gpu_mode)
total_mb = sum(sizes.values())
print("║")
print("║ Download sizes (first-run estimates)")
print("║   Docker images")
print(f"║     app (Python build)   ~{sizes.get('app', 0):,} MB")
if "searxng" in sizes:
    print(f"║     searxng/searxng      ~{sizes['searxng']:,} MB")
if "ollama" in sizes:
    shared_note = " (shared by ollama + ollama_research)" if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed") else ""
    print(f"║     ollama/ollama        ~{sizes['ollama']:,} MB{shared_note}")
if "vision_image" in sizes:
    print(f"║     vision service       ~{sizes['vision_image']:,} MB (torch + moondream)")
if "vllm_image" in sizes:
    print(f"║     vllm/vllm-openai     ~{sizes['vllm_image']:,} MB")
print("║   Model weights (lazy-loaded on first use)")
if "llama3_2_3b" in sizes:
    print(f"║     llama3.2:3b          ~{sizes['llama3_2_3b']:,} MB → OLLAMA_MODELS_DIR")
if "moondream2" in sizes:
    print(f"║     moondream2           ~{sizes['moondream2']:,} MB → vision container cache")
if profile == "dual-gpu" and dual_gpu_mode in ("ollama", "mixed"):
    print("║   Note: ollama + ollama_research share model dir — no double download")
print(f"║ ⚠ Total first-run: ~{total_mb / 1024:.1f} GB (models persist between restarts)")

# ── Mixed-mode VRAM warning ────────────────────────────────────────────
vram_warn = _mixed_mode_vram_warning(gpus, dual_gpu_mode)
if vram_warn:
    print("║")
    print(f"║ {vram_warn}")
```

**Step 4: Wire `DUAL_GPU_MODE` default into the `write_env()` block in `main()`**

In `main()`, find the `if not args.check_only:` block. After `env_updates["PEREGRINE_GPU_NAMES"]`, add:

```python
# Write DUAL_GPU_MODE default for new 2-GPU setups (don't override user's choice)
if len(gpus) >= 2:
    existing_env: dict[str, str] = {}
    if ENV_FILE.exists():
        for line in ENV_FILE.read_text().splitlines():
            if "=" in line and not line.startswith("#"):
                k, _, v = line.partition("=")
                existing_env[k.strip()] = v.strip()
    if "DUAL_GPU_MODE" not in existing_env:
        env_updates["DUAL_GPU_MODE"] = "ollama"
```

**Step 5: Add `import os` if not already present at the top of the file**

Check lines 1–30 of `scripts/preflight.py`. `import os` is already present inside `get_cpu_cores()` as a local import — move it to the top-level imports block:

```python
import os  # add alongside existing stdlib imports
```

And remove the local `import os` inside `get_cpu_cores()`.

**Step 6: Run all preflight tests**

```bash
conda run -n job-seeker python -m pytest tests/test_preflight.py -v
```

Expected: all PASS

**Step 7: Smoke-check the preflight report output**

```bash
conda run -n job-seeker python scripts/preflight.py --check-only
```

Expected: the report includes the `Download sizes` block near the bottom.

**Step 8: Commit**

```bash
git add scripts/preflight.py
git commit -m "feat: add DUAL_GPU_MODE default, VRAM warning, and download size report to preflight"
```

---
### Task 6: `compose.yml` — `ollama_research` service + profile updates

**Files:**
- Modify: `compose.yml`

**Step 1: Update the `ollama` profiles line**

Find:
```yaml
profiles: [cpu, single-gpu, dual-gpu]
```
Replace with:
```yaml
profiles: [cpu, single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**Step 2: Update the `vision` profiles line**

Find:
```yaml
profiles: [single-gpu, dual-gpu]
```
Replace with:
```yaml
profiles: [single-gpu, dual-gpu-ollama, dual-gpu-vllm, dual-gpu-mixed]
```

**Step 3: Update the `vllm` profiles line**

Find:
```yaml
profiles: [dual-gpu]
```
Replace with:
```yaml
profiles: [dual-gpu-vllm, dual-gpu-mixed]
```

**Step 4: Add the `ollama_research` service**

After the closing lines of the `ollama` service block, add:

```yaml
ollama_research:
  image: ollama/ollama:latest
  ports:
    - "${OLLAMA_RESEARCH_PORT:-11435}:11434"
  volumes:
    - ${OLLAMA_MODELS_DIR:-~/models/ollama}:/root/.ollama
    - ./docker/ollama/entrypoint.sh:/entrypoint.sh
  environment:
    - OLLAMA_MODELS=/root/.ollama
    - DEFAULT_OLLAMA_MODEL=${OLLAMA_RESEARCH_MODEL:-llama3.2:3b}
  entrypoint: ["/bin/bash", "/entrypoint.sh"]
  profiles: [dual-gpu-ollama, dual-gpu-mixed]
  restart: unless-stopped
```

**Step 5: Validate compose YAML**

```bash
docker compose -f compose.yml config --quiet
```

Expected: no errors.

**Step 6: Commit**

```bash
git add compose.yml
git commit -m "feat: add ollama_research service and update profiles for dual-gpu sub-profiles"
```

---
### Task 7: GPU overlay files — `compose.gpu.yml` and `compose.podman-gpu.yml`

**Files:**
- Modify: `compose.gpu.yml`
- Modify: `compose.podman-gpu.yml`

**Step 1: Add `ollama_research` to `compose.gpu.yml`**

After the `ollama:` block, add:

```yaml
ollama_research:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]
            capabilities: [gpu]
```

**Step 2: Add `ollama_research` to `compose.podman-gpu.yml`**

After the `ollama:` block, add:

```yaml
ollama_research:
  devices:
    - nvidia.com/gpu=1
  deploy:
    resources:
      reservations:
        devices: []
```

**Step 3: Validate both files**

```bash
docker compose -f compose.yml -f compose.gpu.yml config --quiet
```

Expected: no errors.

**Step 4: Commit**

```bash
git add compose.gpu.yml compose.podman-gpu.yml
git commit -m "feat: assign ollama_research to GPU 1 in Docker and Podman GPU overlays"
```

---
### Task 8: `Makefile` + `manage.sh` — `DUAL_GPU_MODE` injection and help text

**Files:**
- Modify: `Makefile`
- Modify: `manage.sh`

**Step 1: Update `Makefile`**

After the `COMPOSE_OVERRIDE` variable, add `DUAL_GPU_MODE` reading. Note the trailing `grep .`: without it the `|| echo ollama` fallback never fires, because a pipeline's exit status is `cut`'s, which succeeds even on empty input.

```makefile
DUAL_GPU_MODE ?= $(shell grep -m1 '^DUAL_GPU_MODE=' .env 2>/dev/null | cut -d= -f2 | grep . || echo ollama)
```
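The fallback behavior of that one-liner can be demonstrated in plain shell (illustrative only; `/tmp/demo.env` is a throwaway file, not part of the project):

```shell
# Key absent: grep finds nothing, cut emits empty, `grep .` fails on the
# empty result, so the `|| echo ollama` default fires.
printf 'OTHER=1\n' > /tmp/demo.env
MODE=$(grep -m1 '^DUAL_GPU_MODE=' /tmp/demo.env 2>/dev/null | cut -d= -f2 | grep . || echo ollama)
echo "$MODE"   # ollama

# Key present: the value passes straight through.
printf 'DUAL_GPU_MODE=vllm\n' > /tmp/demo.env
MODE=$(grep -m1 '^DUAL_GPU_MODE=' /tmp/demo.env 2>/dev/null | cut -d= -f2 | grep . || echo ollama)
echo "$MODE"   # vllm
```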
In the GPU overlay block, find:
```makefile
else
  ifneq (,$(findstring gpu,$(PROFILE)))
    COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
  endif
endif
```

Replace it with the same block followed by the sub-profile injection:
```makefile
else
  ifneq (,$(findstring gpu,$(PROFILE)))
    COMPOSE_FILES := -f compose.yml $(COMPOSE_OVERRIDE) -f compose.gpu.yml
  endif
endif
ifeq ($(PROFILE),dual-gpu)
  COMPOSE_FILES += --profile dual-gpu-$(DUAL_GPU_MODE)
endif
```

**Step 2: Update `manage.sh` — profiles help block**

Find the profiles section in `usage()`:
```bash
echo "  dual-gpu    Ollama + Vision + vLLM on GPU 0+1"
```

Replace with:
```bash
echo "  dual-gpu    Ollama + Vision on GPU 0; GPU 1 set by DUAL_GPU_MODE"
echo "              DUAL_GPU_MODE=ollama (default)  ollama_research on GPU 1"
echo "              DUAL_GPU_MODE=vllm              vllm on GPU 1"
echo "              DUAL_GPU_MODE=mixed             both on GPU 1 (VRAM-split)"
```

**Step 3: Verify the Makefile parses**

```bash
make help
```

Expected: the help table prints cleanly, no make errors.

**Step 4: Verify manage.sh help**

```bash
./manage.sh help
```

Expected: the new dual-gpu description appears in the profiles section.

**Step 5: Commit**

```bash
git add Makefile manage.sh
git commit -m "feat: inject DUAL_GPU_MODE sub-profile in Makefile; update manage.sh help"
```

---
### Task 9: Integration smoke test

**Goal:** Verify the full chain works for `DUAL_GPU_MODE=ollama` without actually starting Docker (dry-run compose config check).

**Step 1: Write `DUAL_GPU_MODE=ollama` to `.env` temporarily**

```bash
echo "DUAL_GPU_MODE=ollama" >> .env
```

**Step 2: Dry-run compose config for dual-gpu + dual-gpu-ollama**

```bash
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-ollama config 2>&1 | grep -E "^  [a-z]|image:|ports:"
```

Expected output includes:
- `ollama:` service with port 11434
- `ollama_research:` service with port 11435
- `vision:` service
- `searxng:` service
- **No** `vllm:` service

**Step 3: Dry-run for `DUAL_GPU_MODE=vllm`**

```bash
docker compose -f compose.yml -f compose.gpu.yml --profile dual-gpu --profile dual-gpu-vllm config 2>&1 | grep -E "^  [a-z]|image:|ports:"
```

Expected:
- `ollama:` service (port 11434)
- `vllm:` service (port 8000)
- **No** `ollama_research:` service

**Step 4: Run full test suite**

```bash
conda run -n job-seeker python -m pytest tests/ -v
```

Expected: all existing tests PASS, all new preflight tests PASS.

**Step 5: Clean up the `.env` test entry**

```bash
# Remove the test DUAL_GPU_MODE line (preflight will re-write it correctly on next run)
sed -i '/^DUAL_GPU_MODE=/d' .env
```

**Step 6: Final commit**

```bash
git add .env  # in case preflight rewrote it during testing
git commit -m "feat: dual-gpu DUAL_GPU_MODE complete — ollama/vllm/mixed GPU 1 selection"
```
@@ -1,132 +0,0 @@

# Email Classifier Benchmark — Design

**Date:** 2026-02-26
**Status:** Approved

## Problem

The current `classify_stage_signal()` in `scripts/imap_sync.py` uses `llama3.1:8b` via
Ollama for 6-label email classification. This is slow, requires a running Ollama instance,
and accuracy is unverified against alternatives. This design establishes a benchmark harness
to evaluate HuggingFace-native classifiers as potential replacements.

## Labels

```
interview_scheduled   offer_received    rejected
positive_response     survey_received   neutral
```

## Approach: Standalone Benchmark Script (Approach B)

Two new files; nothing in `imap_sync.py` changes until a winner is chosen.

```
scripts/
  benchmark_classifier.py   — CLI entry point
  classifier_adapters.py    — adapter classes (reusable by imap_sync later)

data/
  email_eval.jsonl          — labeled ground truth (gitignored — contains email content)
  email_eval.jsonl.example  — committed example with fake emails

scripts/classifier_service/
  environment.yml           — new conda env: job-seeker-classifiers
```

## Adapter Pattern

```
ClassifierAdapter (ABC)
  .classify(subject, body) → str   # one of the 6 labels
  .name → str
  .model_id → str
  .load() / .unload()              # explicit lifecycle

ZeroShotAdapter(ClassifierAdapter)
  # uses transformers pipeline("zero-shot-classification")
  # candidate_labels = list of 6 labels
  # works for: DeBERTa, BART-MNLI, BGE-M3-ZeroShot, XLM-RoBERTa

GLiClassAdapter(ClassifierAdapter)
  # uses gliclass library (pip install gliclass)
  # GLiClassModel + ZeroShotClassificationPipeline
  # works for: gliclass-instruct-large-v1.0

RerankerAdapter(ClassifierAdapter)
  # uses FlagEmbedding reranker.compute_score()
  # scores (email_text, label_description) pairs; highest = predicted label
  # works for: bge-reranker-v2-m3
```
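The adapter contract sketched above can be written out as a minimal skeleton (illustrative only: the interface names match the outline, while `DummyAdapter` and its keyword heuristic are hypothetical stand-ins, not one of the planned adapters):

```python
# Hedged sketch of the ClassifierAdapter contract. Only the interface comes
# from the design; DummyAdapter is a hypothetical stand-in for illustration.
from abc import ABC, abstractmethod

LABELS = [
    "interview_scheduled", "offer_received", "rejected",
    "positive_response", "survey_received", "neutral",
]

class ClassifierAdapter(ABC):
    name: str
    model_id: str

    def load(self) -> None: ...    # acquire model resources
    def unload(self) -> None: ...  # release VRAM/RAM between benchmark runs

    @abstractmethod
    def classify(self, subject: str, body: str) -> str:
        """Return one of the 6 labels."""

class DummyAdapter(ClassifierAdapter):
    name = "dummy"
    model_id = "none"

    def classify(self, subject: str, body: str) -> str:
        # Trivial keyword heuristic, just to exercise the interface
        return "rejected" if "regret" in body.lower() else "neutral"

print(DummyAdapter().classify("Your application", "We regret to inform you..."))  # rejected
```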
## Model Registry

| Short name | Model | Params | Adapter | Default |
|------------|-------|--------|---------|---------|
| `deberta-zeroshot` | MoritzLaurer/DeBERTa-v3-large-zeroshot-v2.0 | 400M | ZeroShot | ✅ |
| `deberta-small` | cross-encoder/nli-deberta-v3-small | 100M | ZeroShot | ✅ |
| `gliclass-large` | knowledgator/gliclass-instruct-large-v1.0 | 400M | GLiClass | ✅ |
| `bart-mnli` | facebook/bart-large-mnli | 400M | ZeroShot | ✅ |
| `bge-m3-zeroshot` | MoritzLaurer/bge-m3-zeroshot-v2.0 | 600M | ZeroShot | ✅ |
| `bge-reranker` | BAAI/bge-reranker-v2-m3 | 600M | Reranker | ❌ (`--include-slow`) |
| `deberta-xlarge` | microsoft/deberta-xlarge-mnli | 750M | ZeroShot | ❌ (`--include-slow`) |
| `mdeberta-mnli` | MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 300M | ZeroShot | ❌ (`--include-slow`) |
| `xlm-roberta-anli` | vicgalle/xlm-roberta-large-xnli-anli | 600M | ZeroShot | ❌ (`--include-slow`) |

## CLI Modes

### `--compare` (live IMAP, visual table)

Extends the pattern of `test_email_classify.py`. Pulls emails via IMAP and shows a table:

```
Subject | Phrase | llama3 | deberta-zs | deberta-sm | gliclass | bart | bge-m3
```

- Phrase-filter column shows BLOCK/pass (same gate as production)
- `llama3` column = current production baseline
- HF model columns follow

### `--eval` (ground-truth evaluation)

Reads `data/email_eval.jsonl`, runs all models, reports per-label and aggregate metrics:

- Per-label: precision, recall, F1
- Aggregate: macro-F1, accuracy
- Latency: ms/email per model

JSONL format:

```jsonl
{"subject": "Interview invitation", "body": "We'd like to schedule...", "label": "interview_scheduled"}
{"subject": "Your application", "body": "We regret to inform you...", "label": "rejected"}
```

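The per-label and aggregate metrics above reduce to simple counting. A minimal stdlib sketch (function names are assumptions, not the repo's API):

```python
from collections import defaultdict

def per_label_metrics(pairs):
    """pairs: list of (gold, predicted) labels → {label: (precision, recall, f1)}."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted label got a false positive
            fn[gold] += 1  # the gold label was missed
    out = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        r = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        out[label] = (p, r, f1)
    return out

def macro_f1(pairs):
    """Unweighted mean of per-label F1 — the headline aggregate in the report."""
    scores = per_label_metrics(pairs)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

def accuracy(pairs):
    return sum(g == p for g, p in pairs) / len(pairs)
```

Macro-F1 is chosen over accuracy as the headline number because the label distribution is skewed (most emails are `neutral` or `rejected`), and macro averaging weights rare labels like `offer_received` equally.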
### `--list-models`

Prints the registry with sizes, adapter types, and default/slow flags.

## Conda Environment

New env `job-seeker-classifiers` — isolated from `job-seeker` (no torch there).

Key deps:

- `torch` (CUDA-enabled)
- `transformers`
- `gliclass`
- `FlagEmbedding` (for bge-reranker only)
- `sentence-transformers` (optional, for future embedding-based approaches)

## GPU

Auto-select (`device="cuda"` when available, CPU fallback). No GPU pinning — models
load one at a time, so VRAM pressure is sequential, not cumulative.

## Error Handling

- Model load failures: skip that column, print a warning, continue
- Classification errors: show `ERR` in the cell, continue
- IMAP failures: propagate (same as the existing harness)
- Missing eval file: clear error message pointing to `data/email_eval.jsonl.example`

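The "missing eval file" behaviour can be sketched as a small loader. This is an illustrative sketch; the function name and validation rules are assumptions:

```python
import json
from pathlib import Path

def load_eval_set(path: str = "data/email_eval.jsonl") -> list[dict]:
    """Load the labeled eval set, failing loudly with a pointer to the example file."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{p} not found — copy data/email_eval.jsonl.example and label real emails."
        )
    rows = [json.loads(line) for line in p.read_text(encoding="utf-8").splitlines()
            if line.strip()]
    # Every row must carry the three fields the benchmark consumes.
    bad = [i for i, r in enumerate(rows) if not {"subject", "body", "label"} <= r.keys()]
    if bad:
        raise ValueError(f"rows missing required fields (subject/body/label): {bad}")
    return rows
```

Failing at load time, rather than mid-benchmark, keeps a half-finished eval run from wasting model downloads.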
## What Does Not Change (Yet)

- `scripts/imap_sync.py` — production classifier unchanged
- `scripts/llm_router.py` — unchanged
- `staging.db` schema — unchanged

After benchmark results are reviewed, a separate PR will wire the winning model
into `classify_stage_signal()` as an opt-in backend in `llm_router.py`.

---

# Public Mirror Strategy — Design

**Date:** 2026-03-02
**Scope:** Peregrine (initial); pattern applies to all future CircuitForge products
**Status:** Approved — ready for implementation planning

---

## Summary

Publish Peregrine to GitHub and Codeberg as push-mirrored community hubs. Full BSL 1.1
codebase, no MIT carve-outs. Git hooks enforcing safety + commit format are committed to the
repo so every clone gets them automatically. Issue templates and a CONTRIBUTING.md make
the project approachable for external contributors. FossHub is added when a Windows
installer exists.

---

## License

**Whole repo: BSL 1.1.** No MIT exception — including `scrapers/`. The original rationale
for making scrapers MIT (community maintenance) is equally served by BSL 1.1: contributors
can fix broken scrapers, submit PRs, and run the tool at home for free. Making scrapers MIT
would allow competitors to lift CF-authored scraper code into a competing commercial product
without a license, which is not in CircuitForge's interest.

The `LICENSE` file at the repo root covers the full codebase. No `LICENSE-MIT` file is
needed. CONTRIBUTING.md explains what BSL means practically for contributors.

BSL converts to MIT after 4 years per the standard BSL 1.1 terms.

---

## Mirror Sync

Forgejo has built-in **push mirror** support (Settings → Mirror → Push mirrors). Every push
to the primary Forgejo repo auto-replicates within seconds — no CI/CD overhead, no cron job.

Two mirrors:

- `github.com/CircuitForge/peregrine`
- `codeberg.org/CircuitForge/peregrine`

Both under the `CircuitForge` org (consistent branding; not the personal `pyr0ball` account).
GitHub and Codeberg orgs are to be created if not already present.

---

## README Canonical-Source Banner

A prominent notice near the top of the README:

```
> **Primary development** happens at [git.opensourcesolarpunk.com](https://git.opensourcesolarpunk.com/pyr0ball/peregrine).
> GitHub and Codeberg are push mirrors. Issues and PRs are welcome on either platform.
```

---

## CONTRIBUTING.md

Sections:

1. **License** — BSL 1.1 overview. What it means: self-hosting for personal non-commercial
   use is free; commercial SaaS use requires a paid license; converts to MIT after 4 years.
   Link to the full `LICENSE`.

2. **CLA** — One-sentence acknowledgment in bold:
   *"By submitting a pull request you agree that your contribution is licensed under the
   project's BSL 1.1 terms."* No separate CLA file or signature process — the PR template
   repeats this as a checkbox.

3. **Dev setup** — Docker path (recommended) and conda path, pointing to
   `docs/getting-started/installation.md`.

4. **PR process** — GitHub and Codeberg PRs are reviewed and cherry-picked to Forgejo;
   Forgejo is the canonical merge target. Contributors do not need a Forgejo account.

5. **Commit format** — `type: description` (or `type(scope): description`). Valid types:
   `feat fix docs chore test refactor perf ci build`. Hooks enforce this — if your commit
   is rejected, the hook message tells you exactly why.

6. **Issue guidance** — link to templates; note that security issues go to
   `security@circuitforge.tech`, not GitHub Issues.

---

## Git Hooks (`.githooks/`)

Committed to the repo. Activated by `setup.sh` via:

```sh
git config core.hooksPath .githooks
```

`setup.sh` already runs on first clone; hook activation is added there so no contributor
has to think about it.

### `pre-commit`

Blocks the commit if any staged file matches:

**Exact path blocklist:**

- `config/user.yaml`
- `config/server.yaml`
- `config/llm.yaml`
- `config/notion.yaml`
- `config/adzuna.yaml`
- `config/label_tool.yaml`
- `.env`
- `demo/data/*.db`
- `data/*.db`
- `data/*.jsonl`

**Content scan** (regex on staged diff):

- `sk-[A-Za-z0-9]{20,}` — OpenAI-style keys
- `Bearer [A-Za-z0-9\-_]{20,}` — generic bearer tokens
- `api_key:\s*["\']?[A-Za-z0-9\-_]{16,}` — YAML key fields with values

On match: prints the offending file/pattern, aborts with a clear message and a hint to use
`git restore --staged <file>` or add the file to `.gitignore`.

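The content scan is three compiled regexes over the staged diff. A minimal sketch (the hook itself would be shell; this Python version is illustrative and uses exactly the patterns listed above):

```python
import re

# The three patterns from the design, compiled once.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                    # OpenAI-style keys
    re.compile(r"Bearer [A-Za-z0-9\-_]{20,}"),             # generic bearer tokens
    re.compile(r"api_key:\s*[\"']?[A-Za-z0-9\-_]{16,}"),   # YAML key fields with values
]

def scan_diff(diff_text: str) -> list[str]:
    """Return the patterns that match the staged diff; an empty list means safe."""
    return [pat.pattern for pat in SECRET_PATTERNS if pat.search(diff_text)]
```

In the real hook, `diff_text` would come from `git diff --cached`, and a non-empty result aborts the commit with the matching pattern printed.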
### `commit-msg`

Reads `$1` (the commit message temp file). Rejects if:

- The message is empty or whitespace-only
- The first line does not match `^(feat|fix|docs|chore|test|refactor|perf|ci|build)(\(.+\))?: .+`

On rejection: prints the required format and lists the valid types. Does not touch the
message (no auto-rewriting).

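The acceptance rule reduces to one regex plus an emptiness check. A sketch of the validation logic (the hook itself would be shell; the function name here is an assumption):

```python
import re

# Exactly the pattern from the design above.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|chore|test|refactor|perf|ci|build)(\(.+\))?: .+"
)

def check_commit_msg(msg: str) -> bool:
    """True if the message is non-empty and its first line matches the format."""
    if not msg.strip():
        return False
    return bool(COMMIT_RE.match(msg.splitlines()[0]))
```

Note the regex requires a space after the colon, so `feat:missing space` is rejected while `feat(scope): description` passes.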
---

## Issue Templates

Location: `.github/ISSUE_TEMPLATE/` (GitHub) and `.gitea/ISSUE_TEMPLATE/` (Codeberg/Forgejo).

### Bug Report (`bug_report.md`)

Fields:

- Peregrine version (output of `./manage.sh status`)
- OS and runtime (Docker / conda-direct)
- Steps to reproduce
- Expected behaviour
- Actual behaviour (with log snippets)
- Relevant config (redact keys)

### Feature Request (`feature_request.md`)

Fields:

- Problem statement ("I want to do X but currently...")
- Proposed solution
- Alternatives considered
- Which tier this might belong to (free / paid / premium / ultra)
- Willingness to contribute a PR

### PR Template (`.github/pull_request_template.md`)

Fields:

- Summary of changes
- Related issue(s)
- Type of change (feat / fix / docs / ...)
- Testing done
- **CLA checkbox:** `[ ] I agree my contribution is licensed under the project's BSL 1.1 terms.`

### Security (`SECURITY.md`)

Single page: do not open a GitHub Issue for security vulnerabilities. Email
`security@circuitforge.tech`. Response target: 72 hours.

---

## GitHub-Specific Extras

**CI (GitHub Actions)** — `.github/workflows/ci.yml`:

- Trigger: push and PR to `main`
- Steps: checkout → set up Python 3.11 → install deps from `requirements.txt` →
  `pytest tests/ -v`
- Free for public repos; gives contributors a green checkmark without needing local conda

**Repo topics:** `job-search`, `ai-assistant`, `privacy`, `streamlit`, `python`,
`open-core`, `neurodivergent`, `accessibility`, `bsl`

**Releases:** Mirror Forgejo tags. Release notes are auto-generated from conventional
commit subjects grouped by type.

---

## FossHub (Future — Windows RC prerequisite)

When a signed Windows installer (`.msi` or `.exe`) is ready:

1. Submit via the FossHub publisher portal (`https://www.fosshub.com/contribute.html`)
2. Requirements: a stable versioned release, no bundled software, no adware
3. FossHub gives a trusted, antivirus-clean download URL — important for an app running on
   users' personal machines
4. Link the FossHub download from the README and from the `circuitforge.tech` downloads
   section

No action is needed until a Windows RC exists.

---

## File Map

```
peregrine/
├── .githooks/
│   ├── pre-commit                  # sensitive file + key pattern blocker
│   └── commit-msg                  # conventional commit format enforcer
├── .github/
│   ├── workflows/
│   │   └── ci.yml                  # pytest on push/PR
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   └── feature_request.md
│   └── pull_request_template.md
├── .gitea/
│   └── ISSUE_TEMPLATE/             # mirrors .github/ISSUE_TEMPLATE/ for Forgejo/Codeberg
├── CONTRIBUTING.md
└── SECURITY.md
```

---

## Out of Scope

- Forgejo mirror configuration (done via the Forgejo web UI, not committed to the repo)
- GitHub/Codeberg org creation (manual one-time step)
- Windows installer build pipeline (separate future effort)
- `circuitforge-core` extraction (deferred until second product)

---

# Feedback Button — Design

**Date:** 2026-03-03
**Status:** Approved
**Product:** Peregrine (`PRNG`)

---

## Overview

A floating feedback button visible on every Peregrine page lets beta testers file
Forgejo issues directly from the UI. It supports optional attachment of diagnostic data
(logs, recent listings) and screenshots — all with explicit per-item user consent and
PII masking before anything leaves the app.

The backend is intentionally decoupled from Streamlit so it can be wrapped in a
FastAPI route when Peregrine moves to a proper Vue/Nuxt frontend.

---

## Goals

- Zero-friction bug reporting for beta testers
- Privacy-first: nothing is sent without explicit consent + PII preview
- Future-proof: backend callable from Streamlit now, FastAPI/Vue later
- GitHub support as a config option once public mirrors are active

---

## Architecture

### Files

| File | Role |
|---|---|
| `scripts/feedback_api.py` | Pure Python backend — no Streamlit imports |
| `app/feedback.py` | Thin Streamlit UI shell — floating button + dialog |
| `app/components/screenshot_capture.py` | Custom Streamlit component using `html2canvas` |
| `app/app.py` | One-line addition: inject feedback button in sidebar block |
| `.env` / `.env.example` | Add `FORGEJO_API_TOKEN`, `FORGEJO_REPO` |

### Config additions (`.env`)

```
FORGEJO_API_TOKEN=...
FORGEJO_REPO=pyr0ball/peregrine
# GITHUB_TOKEN= # future — filled in when the public mirror is active
# GITHUB_REPO= # future
```

---

## Backend (`scripts/feedback_api.py`)

Pure Python. No Streamlit dependency. All functions return plain dicts or bytes.

### Functions

| Function | Signature | Purpose |
|---|---|---|
| `collect_context` | `(page: str) → dict` | Page name, app version (git describe), tier, LLM backend, OS, timestamp |
| `collect_logs` | `(n: int = 100) → str` | Tail of `.streamlit.log`; `mask_pii()` applied before return |
| `collect_listings` | `(n: int = 5) → list[dict]` | Recent jobs from DB — `title`, `company`, `url` only |
| `mask_pii` | `(text: str) → str` | Regex: emails → `[email redacted]`, phones → `[phone redacted]` |
| `build_issue_body` | `(form, context, attachments) → str` | Assembles final markdown issue body |
| `create_forgejo_issue` | `(title, body, labels) → dict` | POST to Forgejo API; returns `{number, url}` |
| `upload_attachment` | `(issue_number, image_bytes, filename) → str` | POST screenshot to issue assets; returns attachment URL |
| `screenshot_page` | `(port: int) → bytes` | Server-side Playwright fallback screenshot; returns PNG bytes |

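The table fixes only signatures. A minimal sketch of `build_issue_body` under stated assumptions: the keys inside `form`, `context`, and `attachments` (`description`, `repro_steps`, `logs`) are hypothetical, chosen for illustration, and may not match the repo's field names:

```python
def build_issue_body(form: dict, context: dict, attachments: dict) -> str:
    """Assemble the markdown issue body. Field names here are illustrative."""
    lines = [form.get("description", ""), ""]
    if form.get("repro_steps"):  # present only for Bug-type reports
        lines += ["## Reproduction steps", form["repro_steps"], ""]
    lines += ["## Context"] + [f"- **{k}:** {v}" for k, v in context.items()]
    if attachments.get("logs"):  # already masked by mask_pii() upstream
        lines += ["", "## Logs (masked)", "```", attachments["logs"], "```"]
    return "\n".join(lines)
```

Keeping this a pure string assembler means the same body can be previewed in the consent dialog and reused verbatim in the copy-to-clipboard error fallback.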
### Issue creation — two-step

1. `create_forgejo_issue()` → issue number
2. `upload_attachment(issue_number, ...)` → attachment auto-linked by Forgejo

### Labels

Always applied: `beta-feedback`, `needs-triage`
Type-based: `bug` / `feature-request` / `question`

### Future multi-destination

`feedback_api.py` checks both `FORGEJO_API_TOKEN` and `GITHUB_TOKEN` (when present)
and files to whichever destinations are configured. No structural changes are needed when
GitHub support is added.

---

## UI Flow (`app/feedback.py`)

### Floating button

A real Streamlit button inside a keyed container. CSS injected via
`st.markdown(unsafe_allow_html=True)` applies `position: fixed; bottom: 2rem;
right: 2rem; z-index: 9999` to the container. Hidden entirely when `IS_DEMO=true`.

### Dialog — Step 1: Form

- **Type selector:** Bug / Feature Request / Other
- **Title:** short text input
- **Description:** free-text area
- **Reproduction steps:** appears only when Bug is selected (adaptive)

### Dialog — Step 2: Consent + Attachments

```
┌─ Include diagnostic data? ─────────────────────────────┐
│ [toggle]                                               │
│  └─ if on → expandable preview of exactly what's sent  │
│     (logs tailed + masked, listings title/company/url) │
├─ Screenshot ───────────────────────────────────────────┤
│ [📸 Capture current view] → inline thumbnail preview   │
│ [📎 Upload screenshot]    → inline thumbnail preview   │
├─ Attribution ──────────────────────────────────────────┤
│ [ ] Include my name & email (shown from user.yaml)     │
└────────────────────────────────────────────────────────┘
[Submit]
```

### Post-submit

- Success: "Issue filed → [view on Forgejo]" with a clickable link
- Error: friendly message + copy-to-clipboard fallback (issue body as text)

---

## Screenshot Component (`app/components/screenshot_capture.py`)

Uses `st.components.v1.html()` with `html2canvas` loaded from CDN (no build step).
On capture, JS renders the visible viewport to a canvas, encodes it as a base64 PNG, and
returns it to Python via the component value.

Server-side Playwright (`screenshot_page()`) is the fallback when the JS component
can't return data (e.g., cross-origin iframe restrictions). It screenshots
`localhost:<port>` from the server — capturing layout/UI state but not user session
state.

Both paths return `bytes`. The UI shows an inline thumbnail so the user can review it
before submitting.

---

## Privacy & PII Rules

| Data | Included? | Condition |
|---|---|---|
| App logs | Optional | User toggles on + sees masked preview |
| Job listings | Optional (title/company/url only) | User toggles on |
| Cover letters / notes | Never | — |
| Resume content | Never | — |
| Name + email | Optional | User checks attribution checkbox |
| Screenshots | Optional | User captures or uploads |

`mask_pii()` is applied to all text before it appears in the preview and before
submission. Users see exactly what will be sent.

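The masking described in the table is two regex substitutions. A sketch, assuming the replacement strings from the functions table above; the exact email and phone patterns are assumptions (the phone matcher here is a loose North-American-style one) and the real ones in `feedback_api.py` may differ:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Assumed phone pattern: optional country code, then 3-3-4 digit groups
# with spaces, dots, or dashes as separators.
PHONE_RE = re.compile(r"(\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def mask_pii(text: str) -> str:
    """Replace emails and phone numbers with fixed redaction markers."""
    text = EMAIL_RE.sub("[email redacted]", text)
    return PHONE_RE.sub("[phone redacted]", text)
```

Because the same function runs before the preview and before submission, the preview the user approves is byte-for-byte what gets sent.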
---

## Future: FastAPI wrapper

When Peregrine moves to Vue/Nuxt:

```python
# server.py (FastAPI)
from scripts.feedback_api import build_issue_body, create_forgejo_issue, upload_attachment

@app.post("/api/feedback")
async def submit_feedback(payload: FeedbackPayload):
    body = build_issue_body(payload.form, payload.context, payload.attachments)
    result = create_forgejo_issue(payload.title, body, payload.labels)
    if payload.screenshot:
        upload_attachment(result["number"], payload.screenshot, "screenshot.png")
    return {"url": result["url"]}
```

The Streamlit layer is replaced by a Vue `<FeedbackButton>` component that POSTs
to this endpoint. Backend unchanged.

---

## Out of Scope

- Rate limiting (beta testers are trusted; add later if abused)
- Issue deduplication
- In-app issue status tracking
- Video / screen recording

---

# Digest Email Parsers — Design

**Date:** 2026-03-05
**Products:** Peregrine (primary), Avocet (bucket)
**Status:** Design approved, ready for implementation planning

---

## Problem

Peregrine's `imap_sync.py` can extract leads from digest emails, but only for LinkedIn — the
parser is hardcoded inline with no extension point. Adzuna and The Ladders digest emails are
unhandled. Additionally, any digest email from an unknown sender is silently dropped, with no
way to collect samples for building new parsers.

---

## Solution Overview

Two complementary changes:

1. **`peregrine/scripts/digest_parsers.py`** — a standalone parser module with a sender
   registry and dispatcher. `imap_sync.py` calls a single function; the registry handles
   dispatch. The LinkedIn parser moves here; Adzuna and Ladders parsers are built against
   real IMAP samples.

2. **Avocet digest bucket** — when a user labels an email as `digest` in the Avocet label
   UI, the email is appended to `data/digest_samples.jsonl`. This file is the corpus for
   building and testing new parsers for senders not yet in the registry.

---

## Architecture

### Production path (Peregrine)

```
imap_sync._scan_unmatched_leads()
   │
   ├─ parse_digest(from_addr, body)
   │    │
   │    ├─ None  → unknown sender → fall through to LLM extraction (unchanged)
   │    ├─ []    → known sender, nothing found → skip
   │    └─ [...] → jobs found → insert_job() + submit_task("scrape_url")
   │
   └─ continue (digest email consumed; does not reach LLM path)
```

### Sample collection path (Avocet)

```
Avocet label UI
   │
   └─ label == "digest"
        │
        └─ append to data/digest_samples.jsonl
             │
             └─ used as reference for building new parsers
```

---

## Module: `peregrine/scripts/digest_parsers.py`

### Parser interface

Each parser function:

```python
def parse_<source>(body: str) -> list[dict]
```

Returns zero or more job dicts:

```python
{
    "title": str,      # job title
    "company": str,    # company name
    "location": str,   # location string (may be empty)
    "url": str,        # canonical URL, tracking params stripped
    "source": str,     # "linkedin" | "adzuna" | "theladders"
}
```

### Dispatcher

```python
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {
    "jobalerts@linkedin.com": ("linkedin", parse_linkedin),
    "noreply@adzuna.com": ("adzuna", parse_adzuna),
    "noreply@theladders.com": ("theladders", parse_theladders),
}

def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """
    Dispatch to the appropriate parser based on sender address.

    Returns:
        None        — no parser matched (not a known digest sender)
        []          — parser matched, no extractable jobs found
        [dict, ...] — one dict per job card extracted
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None
```

Sender matching is a substring check, tolerant of display-name wrappers
(`"LinkedIn <jobalerts@linkedin.com>"` matches correctly).

### Parsers

**`parse_linkedin`** — moved verbatim from `imap_sync.parse_linkedin_alert()`, renamed.
No behavior change.

**`parse_adzuna`** — built against real Adzuna digest email bodies pulled from the
configured IMAP account during implementation. Expected format: job blocks separated
by consistent delimiters with title, company, location, and a trackable URL per block.

**`parse_theladders`** — same approach. The Ladders already has a web scraper in
`scripts/custom_boards/theladders.py`; URL canonicalization patterns from there apply here.

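The URL canonicalization shared by these parsers can be sketched with stdlib `urllib`. The query/fragment stripping and the LinkedIn `/comm/` rewrite follow the expectations stated in the test plan below (canonical `jobs/view/<id>/` form with `refId`/`trackingId` removed); treating all query params as tracking noise is an assumption:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_url(url: str) -> str:
    """Drop query params and fragments; normalize LinkedIn's /comm/ email prefix."""
    parts = urlsplit(url)
    # LinkedIn email links use /comm/jobs/view/<id>/; the web form omits /comm/.
    path = parts.path.replace("/comm/jobs/view/", "/jobs/view/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Canonical URLs are what make the within-digest and cross-digest dedup checks reliable: two tracking variants of the same posting collapse to one key.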
---

## Changes to `imap_sync.py`

Replace the LinkedIn-specific block in `_scan_unmatched_leads()` (~lines 561–585):

**Before:**

```python
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        # ... LinkedIn-specific insert ...
    known_message_ids.add(mid)
    continue
```

**After:**

```python
from scripts.digest_parsers import parse_digest  # top of file

cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": card["source"],
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
            existing_urls.add(card["url"])
            new_leads += 1
            print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue
```

`parse_digest` returning `None` falls through to the existing LLM extraction path — all
non-digest recruitment emails are completely unaffected.

---

## Avocet: Digest Bucket

### File

`avocet/data/digest_samples.jsonl` — gitignored. An `.example` entry is committed.

The schema matches the existing label queue (JSONL on-disk schema):

```json
{"subject": "...", "body": "...", "from_addr": "...", "date": "...", "account": "..."}
```

### Trigger

In `app/label_tool.py` and `app/api.py`: when a `digest` label is applied, append the
email to `digest_samples.jsonl` alongside the normal write to `email_score.jsonl`.

No Peregrine dependency — if the file path doesn't exist, the `data/` directory is created
automatically. Avocet remains fully standalone.

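The bucket write is a one-function append. A sketch of what both label paths would call; the function name is an assumption, but the behaviour (JSON-lines append, `data/` auto-created on first write) is exactly what the trigger and tests below specify:

```python
import json
from pathlib import Path

def append_digest_sample(email: dict,
                         path: str = "data/digest_samples.jsonl") -> None:
    """Append one digest-labeled email to the sample corpus."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # first write creates data/
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(email) + "\n")
```

Append-only JSONL keeps the write safe to call from both the Streamlit label tool and the API endpoint without any locking or schema migration.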
### Usage

When a new digest sender appears in the wild:

1. Label representative emails as `digest` in Avocet → samples land in `digest_samples.jsonl`
2. Inspect the samples, write `parse_<source>(body)` in `digest_parsers.py`
3. Add the sender string to `DIGEST_PARSERS`
4. Add a fixture test in `peregrine/tests/test_digest_parsers.py`

---

## Testing

### `peregrine/tests/test_digest_parsers.py`

- Fixture bodies sourced from real IMAP samples (anonymized company names / URLs acceptable)
- Each parser: valid body → expected cards returned
- Each parser: empty / malformed body → `[]`, no exception
- Dispatcher: known sender → correct parser invoked
- Dispatcher: unknown sender → `None`
- URL canonicalization: tracking params stripped, canonical form asserted
- Dedup within a digest: the same URL appearing twice in one email → one card

### `avocet/tests/test_digest_bucket.py`

- `digest` label → row appended to `digest_samples.jsonl`
- Any other label → `digest_samples.jsonl` not touched
- First write creates the `data/` directory if absent

---

## Files Changed / Created

| File | Change |
|------|--------|
| `peregrine/scripts/digest_parsers.py` | **New** — parser module |
| `peregrine/scripts/imap_sync.py` | Replace inline LinkedIn block with `parse_digest()` call |
| `peregrine/tests/test_digest_parsers.py` | **New** — parser unit tests |
| `avocet/app/label_tool.py` | Append to `digest_samples.jsonl` on `digest` label |
| `avocet/app/api.py` | Same — digest bucket write in the label endpoint |
| `avocet/tests/test_digest_bucket.py` | **New** — bucket write tests |
| `avocet/data/digest_samples.jsonl.example` | **New** — committed sample for reference |

---

## Out of Scope

- Avocet → Peregrine direct import trigger (deferred; the bucket is sufficient for now)
- `background_tasks` integration for digest re-processing (not needed with the bucket approach)
- HTML digest parsing (all three senders send plain-text alerts; revisit if needed)

---

# Digest Email Parsers Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Extract job listings from LinkedIn, Adzuna, and The Ladders digest emails into Peregrine leads, with an Avocet bucket that collects digest samples for future parser development.

**Architecture:** New `peregrine/scripts/digest_parsers.py` exposes a `parse_digest(from_addr, body)` dispatcher backed by a sender registry. `imap_sync.py` replaces its inline LinkedIn block with one dispatcher call. Avocet's two label paths (`label_tool.py` + `api.py`) append digest-labeled emails to `data/digest_samples.jsonl`. Adzuna and Ladders parsers are built from real IMAP samples fetched in Task 2.

**Tech Stack:** Python stdlib only — `re`, `json`, `pathlib`. No new dependencies.

---

### Task 1: Create `digest_parsers.py` with dispatcher + LinkedIn parser

**Files:**

- Create: `peregrine/scripts/digest_parsers.py`
- Create: `peregrine/tests/test_digest_parsers.py`

**Context:**

`parse_linkedin_alert()` currently lives inline in `imap_sync.py`. We move it here (renamed
`parse_linkedin`) and wrap it in a dispatcher. All other parsers plug into the same registry.

Run all tests with:

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v
```

---

**Step 1: Write the failing tests**

Create `peregrine/tests/test_digest_parsers.py`:

```python
"""Tests for digest email parser registry."""
import pytest

from scripts.digest_parsers import parse_digest, parse_linkedin

# ── LinkedIn fixture ──────────────────────────────────────────────────────────
# Mirrors the plain-text format LinkedIn Job Alert emails actually send.
# Each job block is separated by a line of 10+ dashes.
LINKEDIN_BODY = """\
Software Engineer
Acme Corp
San Francisco, CA

View job: https://www.linkedin.com/comm/jobs/view/1111111111/?refId=abc&trackingId=xyz

--------------------------------------------------
Senior Developer
Widget Inc
Remote

View job: https://www.linkedin.com/comm/jobs/view/2222222222/?refId=def
"""

LINKEDIN_BODY_EMPTY = "No jobs matched your alert this week."

LINKEDIN_BODY_NO_URL = """\
Software Engineer
Acme Corp
San Francisco, CA

--------------------------------------------------
"""


def test_dispatcher_linkedin_sender():
    cards = parse_digest("LinkedIn <jobalerts@linkedin.com>", LINKEDIN_BODY)
    assert cards is not None
    assert len(cards) == 2


def test_dispatcher_unknown_sender_returns_none():
    result = parse_digest("noreply@randomboard.com", LINKEDIN_BODY)
    assert result is None


def test_dispatcher_case_insensitive_sender():
    cards = parse_digest("JOBALERTS@LINKEDIN.COM", LINKEDIN_BODY)
    assert cards is not None


def test_parse_linkedin_returns_correct_fields():
    cards = parse_linkedin(LINKEDIN_BODY)
    assert cards[0]["title"] == "Software Engineer"
    assert cards[0]["company"] == "Acme Corp"
    assert cards[0]["location"] == "San Francisco, CA"
    assert cards[0]["source"] == "linkedin"


def test_parse_linkedin_url_canonicalized():
    """Tracking params stripped; canonical jobs/view/<id>/ form."""
    cards = parse_linkedin(LINKEDIN_BODY)
    assert cards[0]["url"] == "https://www.linkedin.com/jobs/view/1111111111/"
    assert "refId" not in cards[0]["url"]
    assert "trackingId" not in cards[0]["url"]


def test_parse_linkedin_empty_body_returns_empty_list():
    assert parse_linkedin(LINKEDIN_BODY_EMPTY) == []


def test_parse_linkedin_block_without_url_skipped():
    cards = parse_linkedin(LINKEDIN_BODY_NO_URL)
    assert cards == []
```

**Step 2: Run tests to verify they fail**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v
```
Expected: `ImportError: cannot import name 'parse_digest'`

---

**Step 3: Write `digest_parsers.py`**

Create `peregrine/scripts/digest_parsers.py`:

```python
"""Digest email parser registry for Peregrine.

Each parser extracts job listings from a known digest sender's plain-text body.
New parsers are added by decorating with @_register(sender_substring, source_name).

Usage:
    from scripts.digest_parsers import parse_digest

    cards = parse_digest(from_addr, body)
    # None → unknown sender (fall through to LLM path)
    # [] → known sender, nothing extractable
    # [...] → list of {title, company, location, url, source} dicts
"""
from __future__ import annotations

import re
from typing import Callable

# ── Registry ──────────────────────────────────────────────────────────────────

# Maps sender substring (lowercased) → (source_name, parse_fn)
DIGEST_PARSERS: dict[str, tuple[str, Callable[[str], list[dict]]]] = {}


def _register(sender: str, source: str):
    """Decorator to register a parser for a given sender substring."""
    def decorator(fn: Callable[[str], list[dict]]):
        DIGEST_PARSERS[sender.lower()] = (source, fn)
        return fn
    return decorator


def parse_digest(from_addr: str, body: str) -> list[dict] | None:
    """Dispatch to the appropriate parser based on sender address.

    Returns:
        None — no parser matched (caller should use LLM fallback)
        [] — known sender, no extractable jobs
        [dict, ...] — one dict per job card with keys:
            title, company, location, url, source
    """
    addr = from_addr.lower()
    for sender, (source, parse_fn) in DIGEST_PARSERS.items():
        if sender in addr:
            return parse_fn(body)
    return None


# ── Shared helpers ─────────────────────────────────────────────────────────────

_LINKEDIN_SKIP_PHRASES = {
    "promoted", "easily apply", "apply now", "job alert",
    "unsubscribe", "linkedin corporation",
}


# ── LinkedIn Job Alert ─────────────────────────────────────────────────────────

@_register("jobalerts@linkedin.com", "linkedin")
def parse_linkedin(body: str) -> list[dict]:
    """Parse LinkedIn Job Alert digest email body.

    Blocks are separated by lines of 10+ dashes. Each block contains:
        Line 0: job title
        Line 1: company
        Line 2: location (optional)
        'View job: <url>' → canonicalized to /jobs/view/<id>/
    """
    jobs = []
    blocks = re.split(r"\n\s*-{10,}\s*\n", body)
    for block in blocks:
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]

        url = None
        for line in lines:
            m = re.search(r"View job:\s*(https?://\S+)", line, re.IGNORECASE)
            if m:
                raw_url = m.group(1)
                job_id_m = re.search(r"/jobs/view/(\d+)", raw_url)
                if job_id_m:
                    url = f"https://www.linkedin.com/jobs/view/{job_id_m.group(1)}/"
                break
        if not url:
            continue

        content = [
            ln for ln in lines
            if not any(p in ln.lower() for p in _LINKEDIN_SKIP_PHRASES)
            and not ln.lower().startswith("view job:")
            and not ln.startswith("http")
        ]
        if len(content) < 2:
            continue

        jobs.append({
            "title": content[0],
            "company": content[1],
            "location": content[2] if len(content) > 2 else "",
            "url": url,
            "source": "linkedin",
        })
    return jobs


# ── Adzuna Job Alert ───────────────────────────────────────────────────────────

@_register("noreply@adzuna.com", "adzuna")
def parse_adzuna(body: str) -> list[dict]:
    """Parse Adzuna job alert digest email body.

    TODO: implement after reviewing samples in avocet/data/digest_samples.jsonl
    See Task 3 in docs/plans/2026-03-05-digest-parsers-plan.md
    """
    return []


# ── The Ladders Job Alert ──────────────────────────────────────────────────────

@_register("noreply@theladders.com", "theladders")
def parse_theladders(body: str) -> list[dict]:
    """Parse The Ladders job alert digest email body.

    TODO: implement after reviewing samples in avocet/data/digest_samples.jsonl
    See Task 4 in docs/plans/2026-03-05-digest-parsers-plan.md
    """
    return []
```
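
The decorator-registry mechanism above can be exercised in isolation. This self-contained miniature (toy parser and sender, not the real module) demonstrates the three-way return contract the docstring describes:

```python
from typing import Callable

REGISTRY: dict[str, tuple[str, Callable[[str], list[dict]]]] = {}

def register(sender: str, source: str):
    """Same shape as _register: store (source, fn) under the lowercased sender."""
    def decorator(fn):
        REGISTRY[sender.lower()] = (source, fn)
        return fn
    return decorator

@register("jobalerts@example.com", "example")
def parse_example(body: str) -> list[dict]:
    # Toy parser: one card per non-empty line
    return [{"title": ln, "source": "example"} for ln in body.splitlines() if ln.strip()]

def dispatch(from_addr: str, body: str):
    addr = from_addr.lower()
    for sender, (source, fn) in REGISTRY.items():
        if sender in addr:
            return fn(body)
    return None  # unknown sender

print(dispatch("Example <jobalerts@example.com>", "Engineer\nDeveloper"))  # two cards
print(dispatch("jobalerts@example.com", ""))   # [] — known sender, nothing extractable
print(dispatch("noreply@other.com", "hi"))     # None — unknown sender
```

The substring match on the lowercased `From:` header is what makes the dispatcher tolerant of display names and casing.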

**Step 4: Run tests to verify they pass**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v
```
Expected: all 7 tests PASS

**Step 5: Commit**

```bash
git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: digest parser registry + LinkedIn parser (moved from imap_sync)"
```

---

### Task 2: Fetch digest samples from IMAP

**Files:**
- Create: `avocet/scripts/fetch_digest_samples.py`

**Context:**
We need real Adzuna and Ladders email bodies to write parsers against. This one-off script
searches the configured IMAP account by sender domain and writes results to
`data/digest_samples.jsonl`. Run it once; the output file feeds Tasks 3 and 4.

---

**Step 1: Create the fetch script**

Create `avocet/scripts/fetch_digest_samples.py`:

```python
#!/usr/bin/env python3
"""Fetch digest email samples from IMAP into data/digest_samples.jsonl.

Searches for emails from known digest sender domains, deduplicates against
any existing samples, and appends new ones.

Usage:
    conda run -n job-seeker python scripts/fetch_digest_samples.py

Reads config/label_tool.yaml for IMAP credentials (first account used).
"""
from __future__ import annotations

import imaplib
import json
import sys
from pathlib import Path

import yaml

ROOT = Path(__file__).parent.parent
CONFIG = ROOT / "config" / "label_tool.yaml"
OUTPUT = ROOT / "data" / "digest_samples.jsonl"

# Sender domains to search — add new ones here as needed
DIGEST_SENDERS = [
    "adzuna.com",
    "theladders.com",
    "jobalerts@linkedin.com",
]

# Import shared helpers from avocet
sys.path.insert(0, str(ROOT))
from app.imap_fetch import _decode_str, _extract_body, entry_key  # noqa: E402


def _load_existing_keys() -> set[str]:
    if not OUTPUT.exists():
        return set()
    keys = set()
    for line in OUTPUT.read_text().splitlines():
        try:
            keys.add(entry_key(json.loads(line)))
        except Exception:
            pass
    return keys


def main() -> None:
    cfg = yaml.safe_load(CONFIG.read_text())
    accounts = cfg.get("accounts", [])
    if not accounts:
        print("No accounts configured in config/label_tool.yaml")
        sys.exit(1)

    acc = accounts[0]
    host = acc.get("host", "imap.gmail.com")
    port = int(acc.get("port", 993))
    use_ssl = acc.get("use_ssl", True)
    username = acc["username"]
    password = acc["password"]
    folder = acc.get("folder", "INBOX")
    days_back = int(acc.get("days_back", 90))

    from datetime import datetime, timedelta
    import email as _email_lib

    since = (datetime.now() - timedelta(days=days_back)).strftime("%d-%b-%Y")

    conn = (imaplib.IMAP4_SSL if use_ssl else imaplib.IMAP4)(host, port)
    conn.login(username, password)
    conn.select(folder, readonly=True)

    known_keys = _load_existing_keys()
    found: list[dict] = []
    seen_uids: dict[bytes, None] = {}

    for sender in DIGEST_SENDERS:
        try:
            _, data = conn.search(None, f'(FROM "{sender}" SINCE "{since}")')
            for uid in (data[0] or b"").split():
                seen_uids[uid] = None
        except Exception as exc:
            print(f"  search error for {sender!r}: {exc}")

    print(f"Found {len(seen_uids)} candidate UIDs across {len(DIGEST_SENDERS)} senders")

    for uid in seen_uids:
        try:
            _, raw_data = conn.fetch(uid, "(RFC822)")
            if not raw_data or not raw_data[0]:
                continue
            msg = _email_lib.message_from_bytes(raw_data[0][1])
            entry = {
                "subject": _decode_str(msg.get("Subject", "")),
                "body": _extract_body(msg)[:2000],  # larger cap for parser dev
                "from_addr": _decode_str(msg.get("From", "")),
                "date": _decode_str(msg.get("Date", "")),
                "account": acc.get("name", username),
            }
            k = entry_key(entry)
            if k not in known_keys:
                known_keys.add(k)
                found.append(entry)
        except Exception as exc:
            print(f"  fetch error uid {uid}: {exc}")

    conn.logout()

    if not found:
        print("No new digest samples found.")
        return

    OUTPUT.parent.mkdir(exist_ok=True)
    with OUTPUT.open("a", encoding="utf-8") as f:
        for entry in found:
            f.write(json.dumps(entry) + "\n")

    print(f"Wrote {len(found)} new samples to {OUTPUT}")


if __name__ == "__main__":
    main()
```
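
The dedupe-then-append core of the script can be reduced to a few lines. This is a self-contained sketch with a hypothetical `entry_key` (the real helper lives in `app.imap_fetch` and may hash differently):

```python
import json
import tempfile
from pathlib import Path

def entry_key(entry: dict) -> str:
    # Hypothetical stand-in for app.imap_fetch.entry_key
    return f"{entry.get('subject', '')}|{entry.get('date', '')}"

def append_new(path: Path, candidates: list[dict]) -> int:
    """Append only entries whose key is not already on disk; return count written."""
    known = set()
    if path.exists():
        for line in path.read_text().splitlines():
            known.add(entry_key(json.loads(line)))
    written = 0
    with path.open("a", encoding="utf-8") as f:
        for entry in candidates:
            k = entry_key(entry)
            if k not in known:
                known.add(k)
                f.write(json.dumps(entry) + "\n")
                written += 1
    return written

out = Path(tempfile.mkdtemp()) / "samples.jsonl"
batch = [{"subject": "10 jobs", "date": "Mon"}, {"subject": "10 jobs", "date": "Mon"}]
print(append_new(out, batch))  # 1 — duplicate within the batch is skipped
print(append_new(out, batch))  # 0 — already on disk
```

Adding the key to `known` inside the write loop is what also catches duplicates within a single fetch, not just against the existing file.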

**Step 2: Run the fetch script**

```
cd /Library/Development/CircuitForge/avocet
conda run -n job-seeker python scripts/fetch_digest_samples.py
```

Expected output: `Wrote N new samples to data/digest_samples.jsonl`

**Step 3: Inspect the samples**

```
# View first few entries — look at from_addr and body for Adzuna and Ladders format
conda run -n job-seeker python -c "
import json
from pathlib import Path
for line in Path('data/digest_samples.jsonl').read_text().splitlines()[:10]:
    e = json.loads(line)
    print('FROM:', e['from_addr'])
    print('SUBJECT:', e['subject'])
    print('BODY[:500]:', e['body'][:500])
    print('---')
"
```

Note down:
- The exact sender addresses for Adzuna and Ladders (update `DIGEST_PARSERS` in `digest_parsers.py` if different from `noreply@adzuna.com` / `noreply@theladders.com`)
- The structure of each job block in the body (separator lines, field order, URL format)

**Step 4: Commit**

```bash
cd /Library/Development/CircuitForge/avocet
git add scripts/fetch_digest_samples.py
git commit -m "feat: fetch_digest_samples script for building new parsers"
```

---

### Task 3: Build and test Adzuna parser

**Files:**
- Modify: `peregrine/scripts/digest_parsers.py` — implement `parse_adzuna`
- Modify: `peregrine/tests/test_digest_parsers.py` — add Adzuna fixtures + tests

**Context:**
After running Task 2, you have real Adzuna email bodies in `avocet/data/digest_samples.jsonl`.
Inspect them (see Task 2 Step 3), identify the structure, then write the test fixture from
a real sample before implementing the parser.

---

**Step 1: Write a failing Adzuna test**

Inspect a real Adzuna sample from `data/digest_samples.jsonl` and identify:
- How job blocks are separated (blank lines? dashes? headers?)
- Field order (title first? company first?)
- Where the job URL appears and what format it uses
- Any noise lines to filter (unsubscribe, promo text, etc.)

Add to `peregrine/tests/test_digest_parsers.py`:

```python
from scripts.digest_parsers import parse_adzuna

# Replace ADZUNA_BODY with a real excerpt from avocet/data/digest_samples.jsonl
# Copy 2-3 job blocks verbatim; replace real company names with "Test Co" etc. if desired
ADZUNA_BODY = """
<paste real Adzuna body excerpt here — 2-3 job blocks>
"""

def test_dispatcher_adzuna_sender():
    # Update sender string if real sender differs from noreply@adzuna.com
    cards = parse_digest("noreply@adzuna.com", ADZUNA_BODY)
    assert cards is not None
    assert len(cards) >= 1

def test_parse_adzuna_fields():
    cards = parse_adzuna(ADZUNA_BODY)
    assert cards[0]["title"]  # non-empty
    assert cards[0]["company"]  # non-empty
    assert cards[0]["url"].startswith("http")
    assert cards[0]["source"] == "adzuna"

def test_parse_adzuna_url_no_tracking():
    """Adzuna URLs often contain tracking params — strip them."""
    cards = parse_adzuna(ADZUNA_BODY)
    # Adjust assertion to match actual URL format once you've seen real samples
    for card in cards:
        assert "utm_" not in card["url"]

def test_parse_adzuna_empty_body():
    assert parse_adzuna("No jobs this week.") == []
```

**Step 2: Run tests to verify they fail**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py::test_parse_adzuna_fields -v
```
Expected: FAIL (stub returns `[]`)

**Step 3: Implement `parse_adzuna` in `digest_parsers.py`**

Replace the stub body of `parse_adzuna` based on the actual email structure you observed.
Pattern to follow (adapt field positions to match Adzuna's actual format):

```python
@_register("noreply@adzuna.com", "adzuna")  # update sender if needed
def parse_adzuna(body: str) -> list[dict]:
    jobs = []
    # Split on whatever delimiter Adzuna uses between blocks
    # e.g.: blocks = re.split(r"\n\s*\n{2,}", body)  # double blank line
    # For each block, extract title, company, location, url
    # Strip tracking params from URL: re.sub(r"\?.*", "", url) or parse with urllib
    return jobs
```

If the real Adzuna sender differs from `noreply@adzuna.com`, update the `@_register`
decorator argument — the `DIGEST_PARSERS` key is set by the decorator, so changing
the decorator argument is the only edit needed.
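
One robust way to strip tracking parameters without guessing at Adzuna's exact URL shape is to drop the query string and fragment wholesale via `urllib` (illustrative helper; tighten it once real sample URLs are in hand, since some boards encode the job id in the query):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_tracking(url: str) -> str:
    # Keep scheme/host/path; drop query string and fragment entirely
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_tracking("https://www.adzuna.com/details/123?utm_source=alert&utm_medium=email"))
# → https://www.adzuna.com/details/123
```

This is safer than `re.sub(r"\?.*", "", url)` when the URL also carries a `#fragment`, and it normalizes consistently for the `existing_urls` dedup in `imap_sync`.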

**Step 4: Run all digest tests**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v
```
Expected: all tests PASS

**Step 5: Commit**

```bash
cd /Library/Development/CircuitForge/peregrine
git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: Adzuna digest email parser"
```

---

### Task 4: Build and test The Ladders parser

**Files:**
- Modify: `peregrine/scripts/digest_parsers.py` — implement `parse_theladders`
- Modify: `peregrine/tests/test_digest_parsers.py` — add Ladders fixtures + tests

**Context:**
Same approach as Task 3. The Ladders already has a web scraper in
`scripts/custom_boards/theladders.py` — check it for URL patterns that may apply here.

---

**Step 1: Write failing Ladders tests**

Inspect a real Ladders sample from `avocet/data/digest_samples.jsonl`. Add to test file:

```python
from scripts.digest_parsers import parse_theladders

# Replace with real Ladders body excerpt
LADDERS_BODY = """
<paste real Ladders body excerpt here — 2-3 job blocks>
"""

def test_dispatcher_ladders_sender():
    cards = parse_digest("noreply@theladders.com", LADDERS_BODY)
    assert cards is not None
    assert len(cards) >= 1

def test_parse_theladders_fields():
    cards = parse_theladders(LADDERS_BODY)
    assert cards[0]["title"]
    assert cards[0]["company"]
    assert cards[0]["url"].startswith("http")
    assert cards[0]["source"] == "theladders"

def test_parse_theladders_empty_body():
    assert parse_theladders("No new jobs.") == []
```

**Step 2: Run tests to verify they fail**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py::test_parse_theladders_fields -v
```
Expected: FAIL

**Step 3: Implement `parse_theladders`**

Replace the stub. The Ladders URLs often use redirect wrappers — canonicalize to the
`theladders.com/job/<id>` form if possible, otherwise just strip tracking params.

**Step 4: Run all digest tests**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_parsers.py -v
```
Expected: all tests PASS

**Step 5: Commit**

```bash
git add scripts/digest_parsers.py tests/test_digest_parsers.py
git commit -m "feat: The Ladders digest email parser"
```

---

### Task 5: Update `imap_sync.py` to use the dispatcher

**Files:**
- Modify: `peregrine/scripts/imap_sync.py`

**Context:**
The LinkedIn-specific block in `_scan_unmatched_leads()` (search for
`_LINKEDIN_ALERT_SENDER`) gets replaced with a generic `parse_digest()` call.
The existing behavior is preserved — only the dispatch mechanism changes.

---

**Step 1: Add the import**

At the top of `imap_sync.py`, alongside other local imports, add:

```python
from scripts.digest_parsers import parse_digest
```

**Step 2: Find the LinkedIn-specific block**

Search for `_LINKEDIN_ALERT_SENDER` in `imap_sync.py`. The block looks like:

```python
if _LINKEDIN_ALERT_SENDER in parsed["from_addr"].lower():
    cards = parse_linkedin_alert(parsed["body"])
    for card in cards:
        ...
    known_message_ids.add(mid)
    continue
```

**Step 3: Replace with the generic dispatcher**

```python
# ── Digest email — dispatch to parser registry ────────────────────────
cards = parse_digest(parsed["from_addr"], parsed["body"])
if cards is not None:
    for card in cards:
        if card["url"] in existing_urls:
            continue
        job_id = insert_job(db_path, {
            "title": card["title"],
            "company": card["company"],
            "url": card["url"],
            "source": card["source"],
            "location": card["location"],
            "is_remote": 0,
            "salary": "",
            "description": "",
            "date_found": datetime.now().isoformat()[:10],
        })
        if job_id:
            submit_task(db_path, "scrape_url", job_id)
        existing_urls.add(card["url"])
        new_leads += 1
        print(f"[imap] digest ({card['source']}) → {card['company']} — {card['title']}")
    known_message_ids.add(mid)
    continue
```

**Step 4: Remove the now-unused `parse_linkedin_alert` import/definition**

`parse_linkedin_alert` was defined in `imap_sync.py`. It's now `parse_linkedin` in
`digest_parsers.py`. Delete the old function from `imap_sync.py`. Also remove the
`_LINKEDIN_ALERT_SENDER` constant if it's no longer referenced.

**Step 5: Run the full test suite**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```
Expected: all existing tests still pass; no regressions

**Step 6: Commit**

```bash
git add scripts/imap_sync.py
git commit -m "refactor: imap_sync uses digest_parsers dispatcher; remove inline LinkedIn parser"
```

---

### Task 6: Avocet digest bucket

**Files:**
- Modify: `avocet/app/label_tool.py`
- Modify: `avocet/app/api.py`
- Create: `avocet/tests/test_digest_bucket.py`
- Create: `avocet/data/digest_samples.jsonl.example`

**Context:**
When either label path (`_do_label` in the Streamlit UI or `POST /api/label` in the FastAPI
app) assigns the `digest` label, the full email record is appended to
`data/digest_samples.jsonl`. This is the sample corpus for building future parsers.

---

**Step 1: Write failing tests**

Create `avocet/tests/test_digest_bucket.py`:

```python
"""Tests for digest sample bucket write behavior."""
import json
import pytest
from pathlib import Path


# ── Helpers ───────────────────────────────────────────────────────────────────

def _read_bucket(tmp_path: Path) -> list[dict]:
    bucket = tmp_path / "data" / "digest_samples.jsonl"
    if not bucket.exists():
        return []
    return [json.loads(line) for line in bucket.read_text().splitlines() if line.strip()]


SAMPLE_ENTRY = {
    "subject": "10 new jobs for you",
    "body": "Software Engineer\nAcme Corp\nRemote\nView job: https://example.com/123",
    "from_addr": "noreply@adzuna.com",
    "date": "Mon, 03 Mar 2026 09:00:00 +0000",
    "account": "test@example.com",
}


# ── api.py bucket tests ───────────────────────────────────────────────────────

def test_api_digest_label_writes_to_bucket(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 1
    assert rows[0]["from_addr"] == "noreply@adzuna.com"


def test_api_append_writes_when_called_directly(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    # Callers gate on label == "digest" before calling this helper, so the
    # helper itself always writes one record when invoked directly.
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 1


def test_api_digest_creates_data_dir(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "nonexistent" / "data"
    assert not data_dir.exists()
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    assert data_dir.exists()


def test_api_digest_appends_multiple(tmp_path):
    from app.api import _append_digest_sample
    data_dir = tmp_path / "data"
    _append_digest_sample(SAMPLE_ENTRY, data_dir=data_dir)
    _append_digest_sample({**SAMPLE_ENTRY, "subject": "5 more jobs"}, data_dir=data_dir)
    rows = _read_bucket(tmp_path)
    assert len(rows) == 2
```

**Step 2: Run tests to verify they fail**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/test_digest_bucket.py -v
```
Expected: `ImportError: cannot import name '_append_digest_sample'`

---

**Step 3: Add `_append_digest_sample` to `api.py`**

In `avocet/app/api.py`, add this helper (near the top, after the imports and `_DATA_DIR`
constant):

```python
_DIGEST_SAMPLES_FILE = _DATA_DIR / "digest_samples.jsonl"


def _append_digest_sample(entry: dict, data_dir: Path | None = None) -> None:
    """Append a digest-labeled email to the sample corpus."""
    target_dir = data_dir if data_dir is not None else _DATA_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    bucket = target_dir / "digest_samples.jsonl"
    record = {
        "subject": entry.get("subject", ""),
        "body": entry.get("body", ""),
        "from_addr": entry.get("from_addr", entry.get("from", "")),
        "date": entry.get("date", ""),
        "account": entry.get("account", entry.get("source", "")),
    }
    with bucket.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Then in `post_label()` (around line 127, after `_append_jsonl(_score_file(), record)`):

```python
if req.label == "digest":
    _append_digest_sample(match)
```

**Step 4: Add the same write to `label_tool.py`**

In `avocet/app/label_tool.py`, add a module-level constant after `_SCORE_FILE`:

```python
_DIGEST_SAMPLES_FILE = _ROOT / "data" / "digest_samples.jsonl"
```

In `_do_label()` (around line 728, after `_append_jsonl(_SCORE_FILE, row)`):

```python
if label == "digest":
    _append_jsonl(
        _DIGEST_SAMPLES_FILE,
        {
            "subject": entry.get("subject", ""),
            "body": (entry.get("body", ""))[:2000],
            "from_addr": entry.get("from_addr", ""),
            "date": entry.get("date", ""),
            "account": entry.get("account", ""),
        },
    )
```

(`_append_jsonl` already exists in label_tool.py at line ~396 — reuse it.)

**Step 5: Create the example file**

Create `avocet/data/digest_samples.jsonl.example`:

```json
{"subject": "10 new Software Engineer jobs for you", "body": "Software Engineer\nAcme Corp\nSan Francisco, CA\n\nView job: https://www.linkedin.com/jobs/view/1234567890/\n", "from_addr": "LinkedIn <jobalerts@linkedin.com>", "date": "Mon, 03 Mar 2026 09:00:00 +0000", "account": "example@gmail.com"}
```

**Step 6: Update `.gitignore` in avocet**

Verify `data/digest_samples.jsonl` is gitignored. Open `avocet/.gitignore` — it should
already have `data/*.jsonl`. If not, add:

```
data/digest_samples.jsonl
```

**Step 7: Run all avocet tests**

```
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v
```
Expected: all tests PASS

**Step 8: Commit**

```bash
cd /Library/Development/CircuitForge/avocet
git add app/api.py app/label_tool.py tests/test_digest_bucket.py data/digest_samples.jsonl.example
git commit -m "feat: digest sample bucket — write digest-labeled emails to digest_samples.jsonl"
```

---

## Summary

| Task | Repo | Commit message |
|------|------|----------------|
| 1 | peregrine | `feat: digest parser registry + LinkedIn parser (moved from imap_sync)` |
| 2 | avocet | `feat: fetch_digest_samples script for building new parsers` |
| 3 | peregrine | `feat: Adzuna digest email parser` |
| 4 | peregrine | `feat: The Ladders digest email parser` |
| 5 | peregrine | `refactor: imap_sync uses digest_parsers dispatcher; remove inline LinkedIn parser` |
| 6 | avocet | `feat: digest sample bucket — write digest-labeled emails to digest_samples.jsonl` |

Tasks 1, 2, and 6 are independent and can be done in any order.
Tasks 3 and 4 depend on Task 2 (samples needed before implementing parsers).
Task 5 depends on Tasks 1, 3, and 4 (all parsers should be ready before switching imap_sync).

---

# CircuitForge Hooks — Secret & PII Scanning Design

**Date:** 2026-03-07
**Scope:** All CircuitForge repos (Peregrine first; others on public release)
**Status:** Approved, ready for implementation

## Problem

A live Forgejo API token was committed in `docs/plans/2026-03-03-feedback-button-plan.md`
and required emergency history scrubbing via `git-filter-repo`. Root causes:

1. `core.hooksPath` was never configured — the existing `.githooks/pre-commit` ran on zero commits
2. The token format (`FORGEJO_API_TOKEN=<hex>`) matched none of the hook's three regexes
3. No pre-push safety net existed

## Solution

Centralised hook repo (`circuitforge-hooks`) shared across all products.
Each repo activates it with one command. The heavy lifting is delegated to
`gitleaks` — an actively-maintained binary with 150+ built-in secret patterns,
native Forgejo/Gitea token detection, and a clean allowlist system.

## Repository Structure

```
/Library/Development/CircuitForge/circuitforge-hooks/
├── hooks/
│   ├── pre-commit       # gitleaks --staged scan (fast, every commit)
│   ├── commit-msg       # conventional commits enforcement
│   └── pre-push         # gitleaks full-branch scan (safety net)
├── gitleaks.toml        # shared base config
├── install.sh           # wires core.hooksPath in the calling repo
├── tests/
│   └── test_hooks.sh    # migrated + extended from Peregrine
└── README.md
```

Forgejo remote: `git.opensourcesolarpunk.com/pyr0ball/circuitforge-hooks`

## Hook Behaviour

### pre-commit
- Runs `gitleaks protect --staged` — scans only the staged diff
- Sub-second on typical commits
- Blocks commit and prints redacted match on failure
- Merges per-repo `.gitleaks.toml` allowlist if present

### pre-push
- Runs `gitleaks git` — scans full branch history not yet on remote
- Catches anything committed with `--no-verify` or before hooks were wired
- Same config resolution as pre-commit

### commit-msg
- Enforces conventional commits format (`type(scope): subject`)
- Migrated unchanged from `peregrine/.githooks/commit-msg`
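
The `type(scope): subject` rule can be approximated with a single regex check (illustrative; not necessarily the exact pattern the migrated hook uses, and the hook itself is shell):

```python
import re

# Conventional commits: type, optional (scope), optional breaking-change "!",
# then ": " and a non-empty subject. Only the first line is checked.
_CC_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([a-z0-9\-]+\))?!?: \S.*"
)

def is_conventional(msg: str) -> bool:
    return bool(_CC_RE.match(msg.splitlines()[0]))

print(is_conventional("feat: digest parser registry"))        # True
print(is_conventional("feat(imap): dispatch digest emails"))  # True
print(is_conventional("updated stuff"))                       # False
```

Anchoring with `^` and requiring a non-empty subject after the colon catches the two most common violations: missing type prefix and `feat:` with nothing after it.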

## gitleaks Config

### Shared base (`circuitforge-hooks/gitleaks.toml`)

```toml
title = "CircuitForge secret + PII scanner"

[extend]
useDefault = true    # inherit all 150+ built-in rules

[[rules]]
id = "cf-generic-env-token"
description = "Generic KEY=<token> in env-style assignment"
regex = '''(?i)(token|secret|key|password|passwd|pwd|api_key)\s*[=:]\s*['\"]?[A-Za-z0-9\-_]{20,}['\"]?'''
[rules.allowlist]
regexes = ['api_key:\s*ollama', 'api_key:\s*any']

[[rules]]
id = "cf-phone-number"
description = "US phone number in source or config"
regex = '''\b(\+1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}\b'''
[rules.allowlist]
regexes = ['555-\d{4}', '555\.\d{4}', '5550', '1234567890', '0000000000']

[[rules]]
id = "cf-personal-email"
description = "Personal email address in source/config (not .example files)"
regex = '''[a-zA-Z0-9._%+\-]+@(gmail|yahoo|icloud|hotmail|outlook|proton)\.(com|me)'''
[rules.allowlist]
paths = ['.*\.example$', '.*test.*', '.*docs/.*']

[allowlist]
description = "CircuitForge global allowlist"
paths = [
    '.*\.example$',
    'docs/reference/.*',
    'gitleaks\.toml$',
]
regexes = [
    'sk-abcdefghijklmnopqrstuvwxyz',
    'your-forgejo-api-token-here',
]
```
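The PII rules can be sanity-checked outside gitleaks. A sketch using Python's `re` module (gitleaks compiles these with Go's regexp engine, but these particular patterns behave the same in both; note this simplification checks allowlist regexes against the whole line, whereas gitleaks applies them to the matched secret):

```python
import re

# cf-phone-number rule from the shared config above.
PHONE_RE = re.compile(r"\b(\+1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}\b")
# Allowlist patterns that suppress a match (555 test numbers, obvious dummies).
ALLOW = [re.compile(p) for p in
         (r"555-\d{4}", r"555\.\d{4}", r"5550", r"1234567890", r"0000000000")]

def flags_phone(text: str) -> bool:
    """True when the phone rule fires and no allowlist pattern covers it."""
    if not PHONE_RE.search(text):
        return False
    return not any(a.search(text) for a in ALLOW)
```

This is handy when tuning a new rule: confirm it fires on the bad case and stays quiet on the allowlisted ones before wiring it into the hook.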

### Per-repo override (e.g. `peregrine/.gitleaks.toml`)

```toml
[extend]
path = "/Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml"

[allowlist]
regexes = [
    '\d{10}\.html',   # Craigslist listing IDs (10-digit, look like phone numbers)
]
```

## Activation Per Repo

Each repo's `setup.sh` or `manage.sh` calls:

```bash
bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh
```

`install.sh` does exactly one thing:

```bash
git config core.hooksPath /Library/Development/CircuitForge/circuitforge-hooks/hooks
```

For Heimdall live deploys (`/devl/<repo>/`), the same line goes in the deploy
script / post-receive hook.

## Migration from Peregrine

- `peregrine/.githooks/pre-commit` → replaced by gitleaks wrapper
- `peregrine/.githooks/commit-msg` → copied verbatim to hooks repo
- `peregrine/tests/test_hooks.sh` → migrated and extended in hooks repo
- `peregrine/.githooks/` directory → kept temporarily, then removed after cutover

## Rollout Order

1. `circuitforge-hooks` repo — create, implement, test
2. `peregrine` — activate (highest priority, already public)
3. `circuitforge-license` (heimdall) — activate before any public release
4. All subsequent repos — activate as part of their public-release checklist

## Testing

`tests/test_hooks.sh` covers:

- Staged file with live-format token → blocked
- Staged file with phone number → blocked
- Staged file with personal email in source → blocked
- `.example` file with placeholders → allowed
- Craigslist URL with 10-digit ID → allowed (Peregrine allowlist)
- Valid conventional commit message → accepted
- Non-conventional commit message → rejected

## What This Does Not Cover

- Scanning existing history on new repos (run `gitleaks git` manually before
  making any repo public — add to the public-release checklist)
- CI/server-side enforcement (future: Forgejo Actions job on push to main)
- Binary files or encrypted secrets at rest

@@ -1,705 +0,0 @@
# CircuitForge Hooks Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Create the `circuitforge-hooks` repo with gitleaks-based secret/PII scanning, activate it in Peregrine, and retire the old hand-rolled `.githooks/pre-commit`.

**Architecture:** A standalone git repo holds three hook scripts (pre-commit, commit-msg, pre-push) and a shared `gitleaks.toml`. Each product repo activates it with `git config core.hooksPath`. Per-repo `.gitleaks.toml` files extend the base config with repo-specific allowlists.

**Tech Stack:** gitleaks (Go binary, apt install), bash, TOML config

---

### Task 1: Install gitleaks

**Files:**
- None — binary install only

**Step 1: Install gitleaks**

```bash
sudo apt-get install -y gitleaks
```

If not in apt (older Ubuntu), use the GitHub release:

```bash
GITLEAKS_VERSION=$(curl -s https://api.github.com/repos/gitleaks/gitleaks/releases/latest | python3 -c "import sys,json; print(json.load(sys.stdin)['tag_name'])")
curl -sSfL "https://github.com/gitleaks/gitleaks/releases/download/${GITLEAKS_VERSION}/gitleaks_${GITLEAKS_VERSION#v}_linux_x64.tar.gz" | sudo tar -xz -C /usr/local/bin gitleaks
```

**Step 2: Verify**

```bash
gitleaks version
```

Expected: prints version string e.g. `v8.x.x`

---

### Task 2: Create repo and write gitleaks.toml

**Files:**
- Create: `/Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml`

**Step 1: Scaffold repo**

```bash
mkdir -p /Library/Development/CircuitForge/circuitforge-hooks/hooks
mkdir -p /Library/Development/CircuitForge/circuitforge-hooks/tests
cd /Library/Development/CircuitForge/circuitforge-hooks
git init
```

**Step 2: Write gitleaks.toml**

Create `/Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml`:

```toml
title = "CircuitForge secret + PII scanner"

[extend]
useDefault = true    # inherit all 150+ built-in gitleaks rules

# ── CircuitForge-specific secret patterns ────────────────────────────────────

[[rules]]
id = "cf-generic-env-token"
description = "Generic KEY=<token> in env-style assignment — catches FORGEJO_API_TOKEN=hex etc."
regex = '''(?i)(token|secret|key|password|passwd|pwd|api_key)\s*[=:]\s*['"]?[A-Za-z0-9\-_]{20,}['"]?'''
[rules.allowlist]
regexes = [
    'api_key:\s*ollama',
    'api_key:\s*any',
    'your-[a-z\-]+-here',
    'replace-with-',
    'xxxx',
]

# ── PII patterns ──────────────────────────────────────────────────────────────

[[rules]]
id = "cf-phone-number"
description = "US phone number committed in source or config"
regex = '''\b(\+1[\s\-.]?)?\(?\d{3}\)?[\s\-.]?\d{3}[\s\-.]?\d{4}\b'''
[rules.allowlist]
regexes = [
    '555-\d{4}',
    '555\.\d{4}',
    '5550\d{4}',
    '^1234567890$',
    '0000000000',
    '1111111111',
    '2222222222',
    '9999999999',
]

[[rules]]
id = "cf-personal-email"
description = "Personal webmail address committed in source or config (not .example files)"
regex = '''[a-zA-Z0-9._%+\-]+@(gmail|yahoo|icloud|hotmail|outlook|proton)\.(com|me)'''
[rules.allowlist]
paths = [
    '.*\.example$',
    '.*test.*',
    '.*docs/.*',
    '.*\.md$',
]

# ── Global allowlist ──────────────────────────────────────────────────────────

[allowlist]
description = "CircuitForge global allowlist"
paths = [
    '.*\.example$',
    'docs/reference/.*',
    'gitleaks\.toml$',
]
regexes = [
    'sk-abcdefghijklmnopqrstuvwxyz',
    'your-forgejo-api-token-here',
    'your-[a-z\-]+-here',
]
```

**Step 3: Smoke-test config syntax**

```bash
cd /Library/Development/CircuitForge/circuitforge-hooks
gitleaks detect --config gitleaks.toml --no-git --source . 2>&1 | head -5
```

Expected: no "invalid config" errors. (May report findings in the config itself — that's fine.)

**Step 4: Commit**

```bash
cd /Library/Development/CircuitForge/circuitforge-hooks
git add gitleaks.toml
git commit -m "feat: add shared gitleaks config with CF secret + PII rules"
```

---

### Task 3: Write hook scripts

**Files:**
- Create: `hooks/pre-commit`
- Create: `hooks/commit-msg`
- Create: `hooks/pre-push`

**Step 1: Write hooks/pre-commit**

```bash
#!/usr/bin/env bash
# pre-commit — scan staged diff for secrets + PII via gitleaks
set -euo pipefail

HOOKS_REPO="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
BASE_CONFIG="$HOOKS_REPO/gitleaks.toml"
REPO_ROOT="$(git rev-parse --show-toplevel)"
REPO_CONFIG="$REPO_ROOT/.gitleaks.toml"

if ! command -v gitleaks &>/dev/null; then
    echo "ERROR: gitleaks not found. Install with: sudo apt-get install gitleaks"
    echo "       or: https://github.com/gitleaks/gitleaks#installing"
    exit 1
fi

CONFIG_ARG="--config=$BASE_CONFIG"
[[ -f "$REPO_CONFIG" ]] && CONFIG_ARG="--config=$REPO_CONFIG"

if ! gitleaks protect --staged $CONFIG_ARG --redact 2>&1; then
    echo ""
    echo "Commit blocked: secrets or PII detected in staged changes."
    echo "Review above, remove the sensitive value, then re-stage and retry."
    echo "If this is a false positive, add an allowlist entry to .gitleaks.toml"
    exit 1
fi
```

**Step 2: Write hooks/commit-msg**

Adapted from Peregrine (see note below):

```bash
#!/usr/bin/env bash
# commit-msg — enforces conventional commit format
set -euo pipefail

RED='\033[0;31m'; YELLOW='\033[1;33m'; NC='\033[0m'

VALID_TYPES="feat|fix|docs|chore|test|refactor|perf|ci|build|security"
MSG_FILE="$1"
MSG=$(head -1 "$MSG_FILE")

if [[ -z "${MSG// }" ]]; then
    echo -e "${RED}Commit rejected:${NC} Commit message is empty."
    exit 1
fi

if ! echo "$MSG" | grep -qE "^($VALID_TYPES)(\(.+\))?: .+"; then
    echo -e "${RED}Commit rejected:${NC} Message does not follow conventional commit format."
    echo ""
    echo -e "  Required:    ${YELLOW}type: description${NC} or ${YELLOW}type(scope): description${NC}"
    echo -e "  Valid types: ${YELLOW}$VALID_TYPES${NC}"
    echo ""
    echo -e "  Your message: ${YELLOW}$MSG${NC}"
    echo ""
    echo -e "  Examples:"
    echo -e "    ${YELLOW}feat: add cover letter refinement${NC}"
    echo -e "    ${YELLOW}fix(wizard): handle missing user.yaml gracefully${NC}"
    echo -e "    ${YELLOW}security: rotate leaked API token${NC}"
    exit 1
fi
exit 0
```

Note: added `security` to VALID_TYPES vs the Peregrine original.

**Step 3: Write hooks/pre-push**

```bash
#!/usr/bin/env bash
# pre-push — scan full branch history not yet on remote
# Safety net: catches anything committed with --no-verify or before hooks were wired
set -euo pipefail

HOOKS_REPO="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
BASE_CONFIG="$HOOKS_REPO/gitleaks.toml"
REPO_ROOT="$(git rev-parse --show-toplevel)"
REPO_CONFIG="$REPO_ROOT/.gitleaks.toml"

if ! command -v gitleaks &>/dev/null; then
    echo "ERROR: gitleaks not found. Install with: sudo apt-get install gitleaks"
    exit 1
fi

CONFIG_ARG="--config=$BASE_CONFIG"
[[ -f "$REPO_CONFIG" ]] && CONFIG_ARG="--config=$REPO_CONFIG"

if ! gitleaks git $CONFIG_ARG --redact 2>&1; then
    echo ""
    echo "Push blocked: secrets or PII found in branch history."
    echo "Use git-filter-repo to scrub, then force-push."
    echo "See: https://github.com/newren/git-filter-repo"
    exit 1
fi
```

**Step 4: Make hooks executable**

```bash
chmod +x hooks/pre-commit hooks/commit-msg hooks/pre-push
```

**Step 5: Commit**

```bash
cd /Library/Development/CircuitForge/circuitforge-hooks
git add hooks/
git commit -m "feat: add pre-commit, commit-msg, and pre-push hook scripts"
```

---

### Task 4: Write install.sh

**Files:**
- Create: `install.sh`

**Step 1: Write install.sh**

```bash
#!/usr/bin/env bash
# install.sh — wire circuitforge-hooks into the calling git repo
# Usage: bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh
set -euo pipefail

HOOKS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/hooks" && pwd)"

if ! git rev-parse --git-dir &>/dev/null; then
    echo "ERROR: not inside a git repo. Run from your product repo root."
    exit 1
fi

git config core.hooksPath "$HOOKS_DIR"
echo "CircuitForge hooks installed."
echo "  core.hooksPath → $HOOKS_DIR"
echo ""
echo "Verify gitleaks is available: gitleaks version"
```

**Step 2: Make executable**

```bash
chmod +x install.sh
```

**Step 3: Commit**

```bash
git add install.sh
git commit -m "feat: add install.sh for one-command hook activation"
```

---

### Task 5: Write tests

**Files:**
- Create: `tests/test_hooks.sh`

**Step 1: Write tests/test_hooks.sh**

```bash
#!/usr/bin/env bash
# tests/test_hooks.sh — integration tests for circuitforge-hooks
# Requires: gitleaks installed, bash 4+
set -euo pipefail

HOOKS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)/hooks"
PASS_COUNT=0
FAIL_COUNT=0

pass() { echo "  PASS: $1"; PASS_COUNT=$((PASS_COUNT + 1)); }
fail() { echo "  FAIL: $1"; FAIL_COUNT=$((FAIL_COUNT + 1)); }

# Create a temp git repo for realistic staged-content tests
setup_temp_repo() {
    local dir
    dir=$(mktemp -d)
    git init "$dir" -q
    git -C "$dir" config user.email "test@example.com"
    git -C "$dir" config user.name "Test"
    git -C "$dir" config core.hooksPath "$HOOKS_DIR"
    echo "$dir"
}

run_pre_commit_in() {
    # Helper (currently unused; the tests below inline the same pattern):
    # stage $content as $file in $repo, run the pre-commit hook from the
    # repo root, and always append EXIT:<code> so callers can grep it
    # even under set -e.
    local repo="$1" file="$2" content="$3"
    echo "$content" > "$repo/$file"
    git -C "$repo" add "$file"
    local result
    result=$( (cd "$repo" && bash "$HOOKS_DIR/pre-commit") 2>&1; echo "EXIT:$?" )
    echo "$result"
}

echo ""
echo "=== pre-commit hook tests ==="

# Test 1: blocks live-format Forgejo token
echo "Test 1: blocks FORGEJO_API_TOKEN=<hex>"
REPO=$(setup_temp_repo)
echo 'FORGEJO_API_TOKEN=4ea4353b88d6388e8fafab9eb36662226f3a06b0' > "$REPO/test.env"
git -C "$REPO" add test.env
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:1"; then pass "blocked FORGEJO_API_TOKEN"; else fail "should have blocked FORGEJO_API_TOKEN"; fi
rm -rf "$REPO"

# Test 2: blocks OpenAI-style sk- key
echo "Test 2: blocks sk-<key> pattern"
REPO=$(setup_temp_repo)
echo 'api_key = "sk-abcXYZ1234567890abcXYZ1234567890"' > "$REPO/config.py"
git -C "$REPO" add config.py
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:1"; then pass "blocked sk- key"; else fail "should have blocked sk- key"; fi
rm -rf "$REPO"

# Test 3: blocks US phone number
echo "Test 3: blocks US phone number"
REPO=$(setup_temp_repo)
echo 'phone: "5107643155"' > "$REPO/config.yaml"
git -C "$REPO" add config.yaml
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:1"; then pass "blocked phone number"; else fail "should have blocked phone number"; fi
rm -rf "$REPO"

# Test 4: blocks personal email in source
echo "Test 4: blocks personal gmail address in .py file"
REPO=$(setup_temp_repo)
echo 'DEFAULT_EMAIL = "someone@gmail.com"' > "$REPO/app.py"
git -C "$REPO" add app.py
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:1"; then pass "blocked personal email"; else fail "should have blocked personal email"; fi
rm -rf "$REPO"

# Test 5: allows .example file with placeholders
echo "Test 5: allows .example file with placeholder values"
REPO=$(setup_temp_repo)
echo 'FORGEJO_API_TOKEN=your-forgejo-api-token-here' > "$REPO/config.env.example"
git -C "$REPO" add config.env.example
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:0"; then pass "allowed .example placeholder"; else fail "should have allowed .example file"; fi
rm -rf "$REPO"

# Test 6: allows ollama api_key placeholder
echo "Test 6: allows api_key: ollama (known safe placeholder)"
REPO=$(setup_temp_repo)
printf 'backends:\n  - api_key: ollama\n' > "$REPO/llm.yaml"
git -C "$REPO" add llm.yaml
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:0"; then pass "allowed ollama api_key"; else fail "should have allowed ollama api_key"; fi
rm -rf "$REPO"

# Test 7: allows safe source file
echo "Test 7: allows normal Python import"
REPO=$(setup_temp_repo)
echo 'import streamlit as st' > "$REPO/app.py"
git -C "$REPO" add app.py
RESULT=$(cd "$REPO" && bash "$HOOKS_DIR/pre-commit" 2>&1; echo "EXIT:$?")
if echo "$RESULT" | grep -q "EXIT:0"; then pass "allowed safe file"; else fail "should have allowed safe file"; fi
rm -rf "$REPO"

echo ""
echo "=== commit-msg hook tests ==="

tmpfile=$(mktemp)

echo "Test 8: accepts feat: message"
echo "feat: add gitleaks scanning" > "$tmpfile"
if bash "$HOOKS_DIR/commit-msg" "$tmpfile" &>/dev/null; then pass "accepted feat:"; else fail "rejected valid feat:"; fi

echo "Test 9: accepts security: message (new type)"
echo "security: rotate leaked API token" > "$tmpfile"
if bash "$HOOKS_DIR/commit-msg" "$tmpfile" &>/dev/null; then pass "accepted security:"; else fail "rejected valid security:"; fi

echo "Test 10: accepts fix(scope): message"
echo "fix(wizard): handle missing user.yaml" > "$tmpfile"
if bash "$HOOKS_DIR/commit-msg" "$tmpfile" &>/dev/null; then pass "accepted fix(scope):"; else fail "rejected valid fix(scope):"; fi

echo "Test 11: rejects non-conventional message"
echo "updated the thing" > "$tmpfile"
if bash "$HOOKS_DIR/commit-msg" "$tmpfile" &>/dev/null; then fail "should have rejected"; else pass "rejected non-conventional"; fi

echo "Test 12: rejects empty message"
echo "" > "$tmpfile"
if bash "$HOOKS_DIR/commit-msg" "$tmpfile" &>/dev/null; then fail "should have rejected empty"; else pass "rejected empty message"; fi

rm -f "$tmpfile"

echo ""
echo "=== Results ==="
echo "  Passed: $PASS_COUNT"
echo "  Failed: $FAIL_COUNT"
[[ $FAIL_COUNT -eq 0 ]] && echo "All tests passed." || { echo "FAILURES detected."; exit 1; }
```

**Step 2: Make executable**

```bash
chmod +x tests/test_hooks.sh
```

**Step 3: Run tests (expect failures — hooks not yet fully wired)**

```bash
cd /Library/Development/CircuitForge/circuitforge-hooks
bash tests/test_hooks.sh
```

Expected: Tests 1-4 should PASS (gitleaks catches real secrets), Tests 5-7 may fail if allowlists need tuning — note any failures for the next step.

**Step 4: Tune allowlists in gitleaks.toml if any false positives**

If Test 5 (`.example` file) or Test 6 (ollama) fail, add the relevant pattern to the `[allowlist]` or `[rules.allowlist]` sections in `gitleaks.toml` and re-run until all 12 pass.

**Step 5: Commit**

```bash
git add tests/
git commit -m "test: add integration tests for pre-commit and commit-msg hooks"
```

---

### Task 6: Write README and push to Forgejo

**Files:**
- Create: `README.md`

**Step 1: Write README.md**

````markdown
# circuitforge-hooks

Centralised git hooks for all CircuitForge repos.

## What it does

- **pre-commit** — scans staged changes for secrets and PII via gitleaks
- **commit-msg** — enforces conventional commit format
- **pre-push** — scans full branch history as a safety net before push

## Install

From any CircuitForge product repo root:

```bash
bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh
```

On Heimdall live deploys (`/devl/<repo>/`), add the same line to the deploy script.

## Per-repo allowlists

Create `.gitleaks.toml` at the repo root to extend the base config:

```toml
[extend]
path = "/Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml"

[allowlist]
regexes = [
    '\d{10}\.html',   # example: Craigslist listing IDs
]
```

## Testing

```bash
bash tests/test_hooks.sh
```

## Requirements

- `gitleaks` binary: `sudo apt-get install gitleaks`
- bash 4+

## Adding a new rule

Edit `gitleaks.toml`. Follow the pattern of the existing `[[rules]]` blocks.
Add tests to `tests/test_hooks.sh` covering both the blocked and allowed cases.
````

**Step 2: Create Forgejo repo and push**

```bash
# Create repo on Forgejo (FORGEJO_API_TOKEN supplied via the environment — never hardcoded)
curl -s -X POST "https://git.opensourcesolarpunk.com/api/v1/user/repos" \
  -H "Authorization: token ${FORGEJO_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "circuitforge-hooks",
    "description": "Centralised git hooks for CircuitForge repos — gitleaks secret + PII scanning",
    "private": false,
    "auto_init": false
  }' | python3 -c "import json,sys; r=json.load(sys.stdin); print('Created:', r.get('html_url', r))"

# Add remote and push
cd /Library/Development/CircuitForge/circuitforge-hooks
git add README.md
git commit -m "docs: add README with install and usage instructions"
git remote add origin https://git.opensourcesolarpunk.com/pyr0ball/circuitforge-hooks.git
git push -u origin main
```

---

### Task 7: Activate in Peregrine

**Files:**
- Create: `peregrine/.gitleaks.toml`
- Modify: `peregrine/manage.sh` (add install.sh call)
- Delete: `peregrine/.githooks/pre-commit` (replaced by gitleaks wrapper)

**Step 1: Write peregrine/.gitleaks.toml**

```toml
# peregrine/.gitleaks.toml — per-repo allowlists extending the shared base config
[extend]
path = "/Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml"

[allowlist]
description = "Peregrine-specific allowlists"
regexes = [
    '\d{10}\.html',       # Craigslist listing IDs (10-digit paths, look like phone numbers)
    '\d{10}\/',           # LinkedIn job IDs in URLs
    'localhost:\d{4,5}',  # port numbers that could trip phone pattern
]
```

**Step 2: Activate hooks in Peregrine**

```bash
cd /Library/Development/CircuitForge/peregrine
bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh
```

Expected output:
```
CircuitForge hooks installed.
  core.hooksPath → /Library/Development/CircuitForge/circuitforge-hooks/hooks
```

Verify:
```bash
git config core.hooksPath
```
Expected: prints the absolute path to `circuitforge-hooks/hooks`

**Step 3: Add install.sh call to manage.sh**

In `peregrine/manage.sh`, find the section that runs setup/preflight (near the top of the `start` command handling). Add after the existing setup checks:

```bash
# Wire CircuitForge hooks (idempotent — safe to run every time)
if [[ -f "/Library/Development/CircuitForge/circuitforge-hooks/install.sh" ]]; then
    bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh --quiet 2>/dev/null || true
fi
```

Also add a `--quiet` flag to `install.sh` to suppress output when called from manage.sh:

In `circuitforge-hooks/install.sh`, modify to accept `--quiet`:
```bash
QUIET=false
[[ "${1:-}" == "--quiet" ]] && QUIET=true

git config core.hooksPath "$HOOKS_DIR"
if [[ "$QUIET" == "false" ]]; then
    echo "CircuitForge hooks installed."
    echo "  core.hooksPath → $HOOKS_DIR"
fi
```

**Step 4: Retire old .githooks/pre-commit**

The old hook used hand-rolled regexes and is now superseded. Remove it:

```bash
cd /Library/Development/CircuitForge/peregrine
rm .githooks/pre-commit
```

Keep `.githooks/commit-msg` until the new one is verified working, then remove it in a follow-up.

**Step 5: Smoke-test — try to commit a fake secret**

```bash
cd /Library/Development/CircuitForge/peregrine
echo 'BAD_TOKEN=abc123def456ghi789jkl012mno345pqr' > test-secret.py
git add test-secret.py
git commit -m "test: this should be blocked" 2>&1
```

Expected: commit blocked with gitleaks output. Clean up:
```bash
git restore --staged test-secret.py && rm test-secret.py
```

**Step 6: Commit Peregrine changes**

```bash
cd /Library/Development/CircuitForge/peregrine
git add .gitleaks.toml manage.sh
git rm .githooks/pre-commit
git commit -m "chore: activate circuitforge-hooks, add .gitleaks.toml, retire old pre-commit"
```

**Step 7: Push Peregrine**

```bash
git push origin main
```

---

### Task 8: Run full test suite and verify

**Step 1: Run the hooks test suite**

```bash
bash /Library/Development/CircuitForge/circuitforge-hooks/tests/test_hooks.sh
```

Expected: `Passed: 12`, `Failed: 0`, `All tests passed.`

**Step 2: Run Peregrine tests to confirm nothing broken**

```bash
cd /Library/Development/CircuitForge/peregrine
/devl/miniconda3/envs/job-seeker/bin/pytest tests/ --tb=short -q 2>&1 | tail -10
```

Expected: all existing tests still pass.

**Step 3: Push hooks repo final state**

```bash
cd /Library/Development/CircuitForge/circuitforge-hooks
git push origin main
```

---

## Public-release checklist (for all future repos)

Add this to any repo's pre-public checklist:

```
[ ] Run: gitleaks git --config /Library/Development/CircuitForge/circuitforge-hooks/gitleaks.toml
      (manual full-history scan — the pre-push hook only scans commits not yet on the remote)
[ ] Run: bash /Library/Development/CircuitForge/circuitforge-hooks/install.sh
[ ] Add .gitleaks.toml with repo-specific allowlists
[ ] Verify: git config core.hooksPath
[ ] Make repo public on Forgejo
```

@@ -1,106 +0,0 @@
# Email Sync — Testing Checklist

Generated from audit of `scripts/imap_sync.py`.

## Bugs fixed (2026-02-23)

- [x] Gmail label with spaces not quoted for IMAP SELECT → `_quote_folder()` added
- [x] `_quote_folder` didn't escape internal double-quotes → RFC 3501 escaping added
- [x] `signal is None` in `_scan_unmatched_leads` allowed classifier failures through → now skips
- [x] Email with no Message-ID re-inserted on every sync → `_parse_message` returns `None` when ID missing
- [x] `todo_attached` missing from early-return dict in `sync_all` → added
- [x] Body phrase check truncated at 800 chars (rejection footers missed) → bumped to 1500
- [x] `_DONT_FORGET_VARIANTS` missing left single quotation mark `\u2018` → added

---

## Unit tests — phrase filter

- [x] `_has_rejection_or_ats_signal` — rejection phrase at char 1501 (boundary)
- [x] `_has_rejection_or_ats_signal` — right single quote `\u2019` in "don't forget"
- [x] `_has_rejection_or_ats_signal` — left single quote `\u2018` in "don't forget"
- [x] `_has_rejection_or_ats_signal` — ATS subject phrase only checked against subject, not body
- [x] `_has_rejection_or_ats_signal` — spam subject prefix `@` match
- [x] `_has_rejection_or_ats_signal` — `"UNFORTUNATELY"` (uppercase → lowercased correctly)
- [x] `_has_rejection_or_ats_signal` — phrase in body quoted thread (beyond 1500 chars) is not blocked
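The behaviour these checks pin down implies roughly this shape for the filter (a hypothetical reconstruction, not the actual `imap_sync.py` source; the phrase lists here are abbreviated stand-ins):

```python
# Hypothetical reconstruction of the body-phrase filter described above.
# Curly-apostrophe variants (U+2018/U+2019) are listed explicitly so
# "don't forget" is caught however the mail client encodes it.
BODY_SCAN_LIMIT = 1500  # rejection footers sat past the old 800-char cutoff

REJECTION_PHRASES = ["unfortunately", "we will not be moving forward"]
DONT_FORGET_VARIANTS = ["don't forget", "don\u2019t forget", "don\u2018t forget"]

def has_rejection_signal(body: str) -> bool:
    """Check only the first BODY_SCAN_LIMIT chars, lowercased, so phrases
    buried in a long quoted thread do not block a genuine lead."""
    head = body[:BODY_SCAN_LIMIT].lower()
    return any(p in head for p in REJECTION_PHRASES + DONT_FORGET_VARIANTS)
```

The deliberate trade-off: a rejection phrase beyond the 1500-char window passes the filter, which the integration tests below mark as acceptable.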

## Unit tests — folder quoting

- [x] `_quote_folder("TO DO JOBS")` → `'"TO DO JOBS"'`
- [x] `_quote_folder("INBOX")` → `"INBOX"` (no spaces, no quotes added)
- [x] `_quote_folder('My "Jobs"')` → `'"My \\"Jobs\\""'`
- [x] `_search_folder` — folder doesn't exist → returns `[]`, no exception
- [x] `_search_folder` — special folder `"[Gmail]/All Mail"` (brackets + slash)

## Unit tests — message-ID dedup

- [x] `_get_existing_message_ids` — NULL message_id in DB excluded from set
- [x] `_get_existing_message_ids` — empty string `""` excluded from set
- [x] `_get_existing_message_ids` — job with no contacts returns empty set
- [x] `_parse_message` — email with no Message-ID header returns `None`
- [x] `_parse_message` — email with RFC2047-encoded subject decodes correctly
- [x] No email is inserted twice across two sync runs (integration)
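The NULL/empty exclusion is the crux of the dedup fix; a sketch of the set-building step (hypothetical row shape, one `message_id` column per row):

```python
def existing_message_ids(rows) -> set:
    """Collect known Message-IDs, dropping NULL and empty values.

    A NULL or "" ID must never enter the set: it would compare equal
    to another ID-less email and silently suppress its insertion,
    which is why _parse_message now returns None for such emails
    instead of letting them reach this comparison at all.
    """
    return {mid for (mid,) in rows if mid}
```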

## Unit tests — classifier & signal

- [x] `classify_stage_signal` — returns one of 5 labels or `None`
- [x] `classify_stage_signal` — returns `None` on LLM error
- [x] `classify_stage_signal` — returns `"neutral"` when no label matched in LLM output
- [x] `classify_stage_signal` — strips `<think>…</think>` blocks
- [x] `_scan_unmatched_leads` — skips when `signal is None`
- [x] `_scan_unmatched_leads` — skips when `signal == "rejected"`
- [x] `_scan_unmatched_leads` — proceeds when `signal == "neutral"`
- [x] `extract_lead_info` — returns `(None, None)` on bad JSON
- [x] `extract_lead_info` — returns `(None, None)` on LLM error
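The `<think>` stripping step can be done with one non-greedy regex; a sketch (assumed implementation, matching the behaviour the checklist verifies):

```python
import re

# Reasoning models often emit a <think>…</think> preamble before the
# actual answer; strip it (non-greedy, across newlines) before looking
# for one of the stage labels in the output.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_blocks(raw: str) -> str:
    return _THINK_RE.sub("", raw).strip()
```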

## Integration tests — TODO label scan

- [x] `_scan_todo_label` — `todo_label` empty string → returns 0
- [x] `_scan_todo_label` — `todo_label` missing from config → returns 0
- [x] `_scan_todo_label` — folder doesn't exist on IMAP server → returns 0, no crash
- [x] `_scan_todo_label` — email matches company + action keyword → contact attached
- [x] `_scan_todo_label` — email matches company but no action keyword → skipped
- [x] `_scan_todo_label` — email matches no company term → skipped
- [x] `_scan_todo_label` — duplicate message-ID → not re-inserted
- [x] `_scan_todo_label` — stage_signal set when classifier returns non-neutral
- [x] `_scan_todo_label` — body fallback (company only in body[:300]) → still matches
- [x] `_scan_todo_label` — email handled by `sync_job_emails` first not re-added by label scan
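The matching rule these items describe (company term in subject or the first 300 chars of the body, AND an action keyword present) can be sketched as follows; this is a reconstruction from the checklist, and the real matching logic may differ:

```python
def matches_todo_email(subject: str, body: str,
                       company_terms, action_keywords) -> bool:
    """An email is attached to a job only when a company term appears
    (subject first, with body[:300] as fallback) AND an action keyword
    is present; a company mention alone is not enough."""
    company_scope = (subject + " " + body[:300]).lower()
    has_company = any(t.lower() in company_scope for t in company_terms)
    action_scope = (subject + " " + body).lower()
    has_action = any(k.lower() in action_scope for k in action_keywords)
    return has_company and has_action
```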
|
||||
|
||||
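The matching rules exercised above (company term in subject/sender, action keyword required, `body[:300]` fallback) amount to something like the sketch below; the field names and keyword list are placeholders, not the real config:

```python
def matches_todo(msg, company, action_keywords):
    """Company term in subject/sender (or head of body), plus an action keyword."""
    haystack = (msg["subject"] + " " + msg["sender"]).lower()
    body_head = msg["body"][:300].lower()  # body fallback window
    if company.lower() not in haystack and company.lower() not in body_head:
        return False  # no company term anywhere → skipped
    # Company matched; still require an action keyword to attach a contact
    return any(k in haystack or k in body_head for k in action_keywords)

msg = {"subject": "Next steps", "sender": "jane@acme.com",
       "body": "Hi, Acme Corp would like to schedule a call."}
print(matches_todo(msg, "acme", ["schedule", "interview"]))  # True
```

Both failure modes in the checklist fall out of the same function: company with no keyword returns `False` at the last line, and no company term returns `False` at the guard.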
## Integration tests — unmatched leads

- [x] `_scan_unmatched_leads` — genuine lead inserted with synthetic URL `email://domain/hash`
- [x] `_scan_unmatched_leads` — same email not re-inserted on second sync run
- [x] `_scan_unmatched_leads` — duplicate synthetic URL skipped
- [x] `_scan_unmatched_leads` — `extract_lead_info` returns `(None, None)` → no insertion
- [x] `_scan_unmatched_leads` — rejection phrase in body → blocked before LLM
- [x] `_scan_unmatched_leads` — rejection phrase in quoted thread > 1500 chars → passes filter (acceptable)

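The synthetic `email://domain/hash` URL checked above only needs to be deterministic, so the second sync run and the duplicate-URL test both hit the same key. One possible construction, with the hash algorithm and digest length as assumptions:

```python
import hashlib

def synthetic_url(sender, message_id):
    """Stable email:// URL for a lead that has no job-posting link."""
    # "Jane <jane@acme.com>" and "jane@acme.com" both resolve to acme.com
    domain = sender.rsplit("@", 1)[-1].rstrip(">").lower()
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:12]
    return f"email://{domain}/{digest}"
```

Because the digest is derived from the Message-ID, re-running the sync regenerates the identical URL, and the existing duplicate-URL skip handles the rest.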
## Integration tests — full sync

- [x] `sync_all` with no active jobs → returns dict with all 6 keys incl. `todo_attached: 0`
- [x] `sync_all` return dict shape identical on all code paths
- [x] `sync_all` with `job_ids` filter → only syncs those jobs
- [x] `sync_all` `dry_run=True` → no DB writes
- [x] `sync_all` `on_stage` callback fires: "connecting", "job N/M", "scanning todo label", "scanning leads"
- [x] `sync_all` IMAP connection error → caught, returned in `errors` list
- [x] `sync_all` per-job exception → other jobs still sync
- [x] `sync_all` per-job exception → other jobs still sync

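The shape and isolation invariants above can be illustrated with a stripped-down `sync_all`; every summary key except `todo_attached` and `errors` is hypothetical, and the real function does far more per job:

```python
def sync_all(jobs, sync_one, on_stage=None, dry_run=False):
    """Per-job isolation: one failure lands in errors, the rest still sync."""
    # Single summary literal → identical dict shape on every code path,
    # including the no-active-jobs case (loop body never runs)
    summary = {"jobs": 0, "emails_attached": 0, "leads_created": 0,
               "todo_attached": 0, "skipped": 0, "errors": []}
    for i, job in enumerate(jobs, 1):
        if on_stage:
            on_stage(f"job {i}/{len(jobs)}")
        try:
            if not dry_run:
                sync_one(job)  # all DB writes happen inside here
            summary["jobs"] += 1
        except Exception as exc:
            summary["errors"].append(f"{job}: {exc}")  # isolate the failure
    return summary
```

Building the summary once and only mutating it is what makes the "identical shape on all code paths" test trivially true.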
## Config / UI

- [x] Settings UI field for `todo_label` (currently YAML-only)
- [x] Warn in sync summary when `todo_label` folder not found on server
- [x] Clear error message when `config/email.yaml` is missing
- [x] `test_email_classify.py --verbose` shows correct blocking phrase for each BLOCK

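For the missing-config check, a fail-fast guard like the sketch below (path and wording assumed) is enough to turn a bare traceback into an actionable message:

```python
from pathlib import Path

def require_email_config(path="config/email.yaml"):
    """Fail fast with an actionable message instead of a bare traceback."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{path} not found; email sync is disabled. "
            "Copy the example config next to it and fill in your IMAP settings."
        )
    return p
```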
## Backlog — Known issues

- [x] **The Ladders emails confuse the classifier** — promotional/job alert emails from `@theladders.com` are matching the recruitment keyword filter and being treated as leads. Fix: add a sender-based skip rule in `_scan_unmatched_leads` for known job board senders (similar to how LinkedIn Alert emails are short-circuited before the LLM classifier). Senders to exclude: `@theladders.com`, and audit for others (Glassdoor alerts, Indeed digest, ZipRecruiter, etc.).

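The fix proposed above can be sketched as a sender-domain denylist check that runs before the LLM classifier; the domain list is a starting point to audit, not exhaustive:

```python
# Known job board alert/digest domains; subdomains (mail.theladders.com) also match
JOB_BOARD_DOMAINS = ("theladders.com", "linkedin.com", "glassdoor.com",
                     "indeed.com", "ziprecruiter.com")

def is_job_board_sender(sender):
    """True for alert/digest senders that should never become leads."""
    # Handle both "alerts@theladders.com" and "The Ladders <no-reply@...>"
    addr = sender.lower().rstrip(">").rsplit("<", 1)[-1]
    domain = addr.rsplit("@", 1)[-1]
    return any(domain == d or domain.endswith("." + d)
               for d in JOB_BOARD_DOMAINS)
```

Matching on the domain rather than the full address keeps the rule robust to the per-campaign sender names these services rotate through.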
---

## Performance & edge cases

- [x] Email with 10 000-char body → truncated to 4000 chars, no crash
- [x] Email with binary attachment → `_parse_message` returns valid dict, no crash
- [x] Email with multiple `text/plain` MIME parts → first part taken
- [x] `get_all_message_ids` with 100 000 rows → completes in < 1s

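The multipart and truncation checks above can be reproduced against stdlib `email` objects; `first_text_plain` is an illustrative stand-in for the real `_parse_message` body extraction, with the 4000-char cap taken from the checklist:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

MAX_BODY = 4000  # truncation cap exercised by the checks above

def first_text_plain(msg, limit=MAX_BODY):
    """Return the first text/plain part's payload, decoded and truncated."""
    for part in msg.walk():  # walk() yields the message itself when not multipart
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            # errors="replace" keeps binary/garbled payloads from crashing
            return payload.decode(charset, errors="replace")[:limit]
    return ""  # no text/plain part at all

msg = MIMEMultipart()
msg.attach(MIMEText("first part"))
msg.attach(MIMEText("second part"))
print(first_text_plain(msg))  # first part
```

Skipping non-`text/plain` parts is also what makes the binary-attachment test a no-op: attachments are simply never selected as the body.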