pyr0ball f11a38eb0b chore: seed Peregrine from personal job-seeker (pre-generalization)

App: Peregrine
Company: Circuit Forge LLC
Source: github.com/pyr0ball/job-seeker (personal fork, not linked)

2026-02-24 18:25:39 -08:00

13 KiB


Job Seeker Platform — Claude Context

Project

Automated job discovery + resume matching + application pipeline for Alex Rivera.

Full pipeline:

JobSpy → discover.py → SQLite (staging.db) → match.py → Job Review UI
→ Apply Workspace (cover letter + PDF) → Interviews kanban
→ phone_screen → interviewing → offer → hired
         ↓
      Notion DB (synced via sync.py)

Environment

Python env: conda run -n job-seeker <cmd> — always use this, never bare python
Run tests: /devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v (use direct binary — conda run pytest can spawn runaway processes)
Run discovery: conda run -n job-seeker python scripts/discover.py
Recreate env: conda env create -f environment.yml
pytest.ini scopes test collection to tests/ only — never widen this

⚠️ AIHawk env isolation — CRITICAL

NEVER pip install -r aihawk/requirements.txt into the job-seeker env
AIHawk pulls torch + CUDA (~7GB) which causes OOM during test runs
AIHawk must run in its own env: conda create -n aihawk-env python=3.12
job-seeker env must stay lightweight (no torch, no sentence-transformers, no CUDA)

Web UI (Streamlit)

Run: bash scripts/manage-ui.sh start → http://localhost:8501
Manage: start | stop | restart | status | logs
Direct binary: /devl/miniconda3/envs/job-seeker/bin/streamlit run app/app.py
Entry point: app/app.py (uses st.navigation() — do NOT run app/Home.py directly)
staging.db is gitignored — SQLite staging layer between discovery and Notion

Pages

| Page | File | Purpose |
|---|---|---|
| Home | `app/Home.py` | Dashboard, discovery trigger, danger-zone purge |
| Job Review | `app/pages/1_Job_Review.py` | Batch approve/reject with sorting |
| Settings | `app/pages/2_Settings.py` | LLM backends, search profiles, Notion, services |
| Resume Profile | Settings → Resume Profile tab | Edit AIHawk YAML profile (was standalone `3_Resume_Editor.py`) |
| Apply Workspace | `app/pages/4_Apply.py` | Cover letter gen + PDF export + mark applied + reject listing |
| Interviews | `app/pages/5_Interviews.py` | Kanban: phone_screen→interviewing→offer→hired |
| Interview Prep | `app/pages/6_Interview_Prep.py` | Live reference sheet during calls + Practice Q&A |
| Survey Assistant | `app/pages/7_Survey.py` | Culture-fit survey help: text paste + screenshot (moondream2) |

Job Status Pipeline

pending → approved/rejected          (Job Review)
approved → applied                   (Apply Workspace — mark applied)
approved → rejected                  (Apply Workspace — reject listing button)
applied → survey                     (Interviews — "📋 Survey" button; pre-kanban section)
applied → phone_screen               (Interviews — triggers company research)
survey → phone_screen                (Interviews — after survey completed)
phone_screen → interviewing
interviewing → offer
offer → hired
any stage → rejected (rejection_stage captured for analytics)
applied/approved → synced            (sync.py → Notion)
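
The transitions listed above can be read as a small state machine. The following is a minimal sketch of that reading, not the project's actual code: the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, and it assumes that `hired`, `rejected`, and `synced` are terminal and that "any stage → rejected" means any non-terminal stage.

```python
# Allowed status transitions, transcribed from the pipeline list above.
# Assumption: "hired", "rejected" and "synced" have no outgoing transitions.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"approved", "rejected"},
    "approved": {"applied", "rejected", "synced"},
    "applied": {"survey", "phone_screen", "rejected", "synced"},
    "survey": {"phone_screen", "rejected"},
    "phone_screen": {"interviewing", "rejected"},
    "interviewing": {"offer", "rejected"},
    "offer": {"hired", "rejected"},
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the pipeline allows moving from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

A guard like this could sit in front of any status update so that a UI bug cannot move a job from `pending` straight to `hired`.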

SQLite Schema (`staging.db`)

`jobs` table key columns

Standard: id, title, company, url, source, location, is_remote, salary, description
Scores: match_score, keyword_gaps
Dates: date_found, applied_at, survey_at, phone_screen_at, interviewing_at, offer_at, hired_at
Interview: interview_date, rejection_stage
Content: cover_letter, notion_page_id
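
For orientation, the key columns above could map onto a table definition like the one below. This is a sketch only: the column names come from the list above, while the column types, the `status` column and its default, and the helper name `init_db` are assumptions rather than the real schema in `scripts/db.py`.

```python
import sqlite3

# Column names follow the list above; every type here is an assumption.
JOBS_DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    title           TEXT,
    company         TEXT,
    url             TEXT,
    source          TEXT,
    location        TEXT,
    is_remote       INTEGER,
    salary          TEXT,
    description     TEXT,
    match_score     REAL,
    keyword_gaps    TEXT,
    status          TEXT DEFAULT 'pending',  -- assumed: implied by the status pipeline
    date_found      TEXT,
    applied_at      TEXT,
    survey_at       TEXT,
    phone_screen_at TEXT,
    interviewing_at TEXT,
    offer_at        TEXT,
    hired_at        TEXT,
    interview_date  TEXT,
    rejection_stage TEXT,
    cover_letter    TEXT,
    notion_page_id  TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the jobs table exists."""
    conn = sqlite3.connect(path)
    conn.execute(JOBS_DDL)
    conn.commit()
    return conn
```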

Additional tables

job_contacts — email thread log per job (direction, subject, from/to, body, received_at)
company_research — LLM-generated brief per job (company_brief, ceo_brief, talking_points, raw_output, accessibility_brief)
background_tasks — async LLM task queue (task_type, job_id, status: queued/running/completed/failed)
survey_responses — per-job Q&A pairs (survey_name, received_at, source, raw_input, image_path, mode, llm_output, reported_score)

Scripts

| Script | Purpose |
|---|---|
| `scripts/discover.py` | JobSpy + custom board scrape → SQLite insert |
| `scripts/custom_boards/adzuna.py` | Adzuna Jobs API (app_id + app_key in config/adzuna.yaml) |
| `scripts/custom_boards/theladders.py` | The Ladders scraper via curl_cffi + `__NEXT_DATA__` SSR parse |
| `scripts/match.py` | Resume keyword matching → match_score |
| `scripts/sync.py` | Push approved/applied jobs to Notion |
| `scripts/llm_router.py` | LLM fallback chain (reads config/llm.yaml) |
| `scripts/generate_cover_letter.py` | Cover letter via LLM; detects mission-aligned companies (music/animal welfare/education) and injects Para 3 hint |
| `scripts/company_research.py` | Pre-interview brief via LLM + optional SearXNG scrape; includes Inclusion & Accessibility section |
| `scripts/prepare_training_data.py` | Extract cover letter JSONL for fine-tuning |
| `scripts/finetune_local.py` | Unsloth QLoRA fine-tune on local GPU |
| `scripts/db.py` | All SQLite helpers (single source of truth) |
| `scripts/task_runner.py` | Background thread executor; `submit_task(db, type, job_id)` dispatches daemon threads for LLM jobs |
| `scripts/vision_service/main.py` | FastAPI moondream2 inference on port 8002; `manage-vision.sh` lifecycle |

LLM Router

Config: config/llm.yaml
Cover letter fallback order: claude_code → ollama (alex-cover-writer:latest) → vllm → copilot → anthropic
Research fallback order: claude_code → vllm (__auto__, ouroboros) → ollama_research (llama3.1:8b) → ...
alex-cover-writer:latest is cover-letter only — it doesn't follow structured markdown prompts for research
LLMRouter.complete() accepts fallback_order= override for per-task routing
LLMRouter.complete() accepts images: list[str] (base64) — vision backends only; non-vision backends skipped when images present
Vision fallback order config key: vision_fallback_order: [vision_service, claude_code, anthropic]
vision_service backend type: POST to /analyze; skipped automatically when no images provided
Claude Code wrapper: /Library/Documents/Post Fight Processing/server-openai-wrapper-v2.js
Copilot wrapper: /Library/Documents/Post Fight Processing/manage-copilot.sh start
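
The fallback behaviour described above (try backends in order, skip non-vision backends when images are present, skip the vision service when there are none) can be sketched roughly as follows. This is an illustration under assumptions, not the real `LLMRouter`: the `Backend` dataclass, its flags, and this free-standing `complete()` signature are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Backend:
    """Hypothetical backend record: a callable plus image-capability flags."""
    call: Callable[[str, Optional[list[str]]], str]
    supports_images: bool = False  # vision-capable backend
    requires_images: bool = False  # e.g. a vision-only service


def complete(
    prompt: str,
    backends: dict[str, Backend],
    fallback_order: list[str],
    images: Optional[list[str]] = None,
) -> str:
    """Return the first successful completion, walking fallback_order."""
    errors: list[str] = []
    for name in fallback_order:
        backend = backends.get(name)
        if backend is None:
            continue  # backend not configured
        if images and not backend.supports_images:
            continue  # non-vision backends are skipped when images are present
        if not images and backend.requires_images:
            continue  # vision-only backends are skipped without images
        try:
            return backend.call(prompt, images)
        except Exception as exc:  # record the failure, fall through to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed or were skipped: " + "; ".join(errors))
```

Passing a different `fallback_order` per call is what allows cover letters and research to use different chains against the same set of backends.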

Fine-Tuned Model

Model: alex-cover-writer:latest registered in Ollama
Base: unsloth/Llama-3.2-3B-Instruct (QLoRA, rank 16, 10 epochs)
Training data: 62 cover letters from /Library/Documents/JobSearch/
JSONL: /Library/Documents/JobSearch/training_data/cover_letters.jsonl
Adapter: /Library/Documents/JobSearch/training_data/finetune_output/adapter/
Merged: /Library/Documents/JobSearch/training_data/gguf/alex-cover-writer/
Re-train: conda run -n ogma python scripts/finetune_local.py (uses ogma env with unsloth + trl; pin to GPU 0 with CUDA_VISIBLE_DEVICES=0)

Background Tasks

Cover letter gen and company research run as daemon threads via scripts/task_runner.py
Tasks survive page navigation; results written to existing tables when done
On server restart, app.py startup marks any stuck running/queued rows as failed
Dedup: only one queued/running task per (task_type, job_id) at a time
Sidebar indicator (app/app.py) polls every 3s via @st.fragment(run_every=3)
⚠️ Streamlit fragment + sidebar: use with st.sidebar: _fragment() — sidebar context must WRAP the call, not be inside the fragment body
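
The dedup rule and the restart cleanup described above could look roughly like this. It is a sketch under assumptions: the table layout is reduced to the columns named in this document, and the helper names and the `worker` parameter are hypothetical (the real `submit_task(db, type, job_id)` in `scripts/task_runner.py` has a different signature).

```python
import sqlite3
import threading
from typing import Callable, Optional

ACTIVE = ("queued", "running")


def init_tasks(conn: sqlite3.Connection) -> None:
    """Create a reduced background_tasks table (columns assumed)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS background_tasks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "task_type TEXT, job_id INTEGER, status TEXT)"
    )
    conn.commit()


def reset_stuck_tasks(conn: sqlite3.Connection) -> int:
    """On startup, mark leftover queued/running rows as failed."""
    cur = conn.execute(
        "UPDATE background_tasks SET status = 'failed' WHERE status IN (?, ?)", ACTIVE
    )
    conn.commit()
    return cur.rowcount


def submit_task(
    conn: sqlite3.Connection,
    task_type: str,
    job_id: int,
    worker: Callable[[int], None],
) -> Optional[int]:
    """Queue a task unless one is already active for (task_type, job_id)."""
    active = conn.execute(
        "SELECT id FROM background_tasks "
        "WHERE task_type = ? AND job_id = ? AND status IN (?, ?)",
        (task_type, job_id, *ACTIVE),
    ).fetchone()
    if active:
        return None  # dedup: an equivalent task is already queued or running
    cur = conn.execute(
        "INSERT INTO background_tasks (task_type, job_id, status) VALUES (?, ?, 'queued')",
        (task_type, job_id),
    )
    conn.commit()
    task_id = cur.lastrowid
    # Daemon thread: it does not block interpreter shutdown.
    threading.Thread(target=worker, args=(task_id,), daemon=True).start()
    return task_id
```

In this sketch the worker receives only the task id; a real worker would open its own SQLite connection, since connections are not shared across threads by default.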

Vision Service

Script: scripts/vision_service/main.py (FastAPI, port 8002)
Model: vikhyatk/moondream2 revision 2025-01-09 — lazy-loaded on first /analyze (~1.8GB download)
GPU: 4-bit quantization when CUDA available (~1.5GB VRAM); CPU fallback
Conda env: job-seeker-vision — separate from job-seeker (torch + transformers live here)
Create env: conda env create -f scripts/vision_service/environment.yml
Manage: bash scripts/manage-vision.sh start|stop|restart|status|logs
Survey page degrades gracefully to text-only when vision service is down
⚠️ Never install vision deps (torch, bitsandbytes, transformers) into the job-seeker env

Company Research

Script: scripts/company_research.py
Auto-triggered when a job moves to phone_screen in the Interviews kanban
Three-phase: (1) SearXNG company scrape → (1b) SearXNG news snippets → (2) LLM synthesis
SearXNG scraper: /Library/Development/scrapers/companyScraper.py
SearXNG Docker: run docker compose up -d from /Library/Development/scrapers/SearXNG/ (port 8888)
beautifulsoup4 and fake-useragent are installed in job-seeker env (required for scraper)
News search hits /search?format=json — JSON format must be enabled in searxng-config/settings.yml
⚠️ settings.yml owned by UID 977 (container user) — use docker cp to update, not direct writes
⚠️ settings.yml requires use_default_settings: true at the top or SearXNG fails schema validation
companyScraper calls sys.exit() on missing deps — use except BaseException not except Exception
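
The `except BaseException` note above follows from Python's exception hierarchy: `sys.exit()` raises `SystemExit`, which does not inherit from `Exception`. A small self-contained illustration, with `flaky_scraper` as a hypothetical stand-in for the real scraper:

```python
import sys


def flaky_scraper() -> str:
    """Stand-in for a module that calls sys.exit() when a dependency is missing."""
    sys.exit("missing dependency")


def safe_scrape() -> str:
    """Run the scraper without letting its sys.exit() kill the caller."""
    try:
        return flaky_scraper()
    except BaseException as exc:  # SystemExit is not a subclass of Exception
        if isinstance(exc, KeyboardInterrupt):
            raise  # still let Ctrl+C through
        return f"scraper unavailable: {exc}"
```

With `except Exception` instead, the `SystemExit` would propagate and terminate the calling thread or process.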

Email Classifier Labels

Six labels: interview_request, rejection, offer, follow_up, survey_received, other

survey_received — links or requests to complete a culture-fit survey/assessment

Services (managed via Settings → Services tab)

| Service | Port | Notes |
|---|---|---|
| Streamlit UI | 8501 | `bash scripts/manage-ui.sh start` |
| Ollama | 11434 | `sudo systemctl start ollama` |
| Claude Code Wrapper | 3009 | `manage-services.sh start` in Post Fight Processing |
| GitHub Copilot Wrapper | 3010 | `manage-copilot.sh start` in Post Fight Processing |
| vLLM Server | 8000 | Manual start only |
| SearXNG | 8888 | `docker compose up -d` in scrapers/SearXNG/ |
| Vision Service | 8002 | `bash scripts/manage-vision.sh start` (moondream2 survey screenshot analysis) |

Notion

DB: "Tracking Job Applications" (ID: 1bd75cff-7708-8007-8c00-f1de36620a0a)
config/notion.yaml is gitignored (live token); .example is committed
Field names are non-obvious — always read from field_map in config/notion.yaml
"Salary" = Notion title property (unusual — it's the page title field)
"Job Source" = multi_select type
"Role Link" = URL field
"Status of Application" = status field; new listings use "Application Submitted"
Sync pushes approved + applied jobs; marks them synced after
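
To make the property types above concrete, here is a sketch of how a job row might be turned into a Notion `properties` payload. The payload shapes follow the Notion API's property value formats. The `field_map` keys (`salary`, `source`, `url`, `status`) and the function name are assumptions, since the real keys live in `config/notion.yaml`.

```python
def build_properties(job: dict, field_map: dict) -> dict:
    """Map a job row to Notion page properties using names from field_map."""
    return {
        # "Salary" is the title property, so it takes a rich-text list.
        field_map["salary"]: {"title": [{"text": {"content": job.get("salary") or ""}}]},
        # "Job Source" is a multi_select, so it takes a list of named options.
        field_map["source"]: {"multi_select": [{"name": job["source"]}]},
        # "Role Link" is a plain URL property.
        field_map["url"]: {"url": job["url"]},
        # New listings start in the "Application Submitted" status.
        field_map["status"]: {"status": {"name": "Application Submitted"}},
    }
```

Reading the property names from `field_map` rather than hardcoding them keeps the sync code working if a Notion column is renamed.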

Key Config Files

config/notion.yaml — gitignored, has token + field_map
config/notion.yaml.example — committed template
config/search_profiles.yaml — titles, locations, boards, custom_boards, exclude_keywords, mission_tags (per profile)
config/llm.yaml — LLM backend priority chain + enabled flags
config/tokens.yaml — gitignored, stores HF token (chmod 600)
config/adzuna.yaml — gitignored, Adzuna API app_id + app_key
config/adzuna.yaml.example — committed template

Custom Job Board Scrapers

scripts/custom_boards/adzuna.py — Adzuna Jobs API; credentials in config/adzuna.yaml
scripts/custom_boards/theladders.py — The Ladders SSR scraper; needs curl_cffi installed
Scrapers registered in CUSTOM_SCRAPERS dict in discover.py
Activated per-profile via custom_boards: [adzuna, theladders] in search_profiles.yaml
enrich_all_descriptions() in enrich_descriptions.py covers all sources (not just Glassdoor)
Home page "Fill Missing Descriptions" button dispatches enrich_descriptions task
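
The registry pattern described above might look like the following sketch. Only the `CUSTOM_SCRAPERS` name and the `custom_boards` profile key come from this document; the scraper function names, their signature, the stub results, and `run_custom_boards` are hypothetical.

```python
from typing import Callable

Scraper = Callable[[dict], list[dict]]


def scrape_adzuna(profile: dict) -> list[dict]:
    """Placeholder: a real scraper would call the Adzuna Jobs API."""
    return [{"source": "adzuna", "title": "stub listing"}]


def scrape_theladders(profile: dict) -> list[dict]:
    """Placeholder: a real scraper would parse The Ladders SSR payload."""
    return [{"source": "theladders", "title": "stub listing"}]


# Registry: board name (as used in search_profiles.yaml) -> scraper callable.
CUSTOM_SCRAPERS: dict[str, Scraper] = {
    "adzuna": scrape_adzuna,
    "theladders": scrape_theladders,
}


def run_custom_boards(profile: dict) -> list[dict]:
    """Run only the scrapers the profile opts into via custom_boards."""
    jobs: list[dict] = []
    for board in profile.get("custom_boards", []):
        scraper = CUSTOM_SCRAPERS.get(board)
        if scraper is None:
            continue  # unknown board name: skip rather than crash discovery
        jobs.extend(scraper(profile))
    return jobs
```

Adding a new board then means writing one function and adding one registry entry, with no change to the discovery loop.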

Mission Alignment & Accessibility

Preferred industries: music, animal welfare, children's education (hardcoded in generate_cover_letter.py)
detect_mission_alignment(company, description) injects a Para 3 hint into cover letters for aligned companies
Every company research brief includes an "Inclusion & Accessibility" section (the 8th section)
Accessibility search query in _SEARCH_QUERIES hits SearXNG for ADA/ERG/disability signals
accessibility_brief column in company_research table; shown in Interview Prep under ♿ section
This info is for personal decision-making ONLY — never disclosed in applications
When the app is generalized, these become profile.mission_industries + profile.accessibility_priority in user.yaml
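
A keyword-based reading of `detect_mission_alignment(company, description)` could be sketched as below. The function name and arguments come from this document, but the keyword lists and the return value are assumptions: this version returns the matched industry name, whereas the real function injects a Para 3 hint into the cover letter prompt.

```python
from typing import Optional

# Hypothetical keyword lists for the three preferred industries.
MISSION_KEYWORDS: dict[str, tuple[str, ...]] = {
    "music": ("music", "record label", "concert"),
    "animal welfare": ("animal welfare", "animal shelter", "veterinary"),
    "education": ("education", "school", "k-12"),
}


def detect_mission_alignment(company: str, description: str) -> Optional[str]:
    """Return the first preferred industry whose keywords appear, else None."""
    text = f"{company} {description}".lower()
    for industry, keywords in MISSION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return industry
    return None
```

Plain substring matching is crude and will produce false positives; it is used here only to keep the sketch short.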

Document Rule

Resumes and cover letters live in /Library/Documents/JobSearch/ or Notion — never committed to this repo.

AIHawk (LinkedIn Easy Apply)

Cloned to aihawk/ (gitignored)
Config: aihawk/data_folder/plain_text_resume.yaml — search FILL_IN for gaps
Self-ID: non-binary, pronouns any, no disability/drug-test disclosure
Run: conda run -n aihawk-env python aihawk/main.py (AIHawk's own env, per the isolation rule above; never the job-seeker env)
Playwright: conda run -n job-seeker python -m playwright install chromium

Git Remote

Forgejo self-hosted at https://git.opensourcesolarpunk.com (username: pyr0ball)
git remote add origin https://git.opensourcesolarpunk.com/pyr0ball/job-seeker.git

Subagents

Use general-purpose subagent type (not Bash) when tasks require file writes.
