Compare commits
4 commits
48e7748b43...d9f2b452e8
| Author | SHA1 | Date | |
|---|---|---|---|
| | d9f2b452e8 | | |
| | fedb558b1e | | |
| | 15c2a1d4ef | | |
| | d5cf02096b | | |
15 changed files with 908 additions and 745 deletions
5
.gitignore
vendored
@@ -19,6 +19,7 @@ unsloth_compiled_cache/
 data/survey_screenshots/*
 !data/survey_screenshots/.gitkeep
 config/user.yaml
+config/plain_text_resume.yaml
 config/.backup-*
 config/integrations/*.yaml
 !config/integrations/*.yaml.example
@@ -30,3 +31,7 @@ scrapers/raw_scrapes/
 
 compose.override.yml
 config/license.json
+config/user.yaml.working
+
+# Claude context files — kept out of version control
+CLAUDE.md
212
CLAUDE.md
@@ -1,212 +0,0 @@
-# Job Seeker Platform — Claude Context
-
-## Project
-Automated job discovery + resume matching + application pipeline for Meghan McCann.
-
-Full pipeline:
-```
-JobSpy → discover.py → SQLite (staging.db) → match.py → Job Review UI
-    → Apply Workspace (cover letter + PDF) → Interviews kanban
-    → phone_screen → interviewing → offer → hired
-        ↓
-    Notion DB (synced via sync.py)
-```
-
-## Environment
-- Python env: `conda run -n job-seeker <cmd>` — always use this, never bare python
-- Run tests: `/devl/miniconda3/envs/job-seeker/bin/pytest tests/ -v`
-  (use direct binary — `conda run pytest` can spawn runaway processes)
-- Run discovery: `conda run -n job-seeker python scripts/discover.py`
-- Recreate env: `conda env create -f environment.yml`
-- pytest.ini scopes test collection to `tests/` only — never widen this
-
-## ⚠️ AIHawk env isolation — CRITICAL
-- NEVER `pip install -r aihawk/requirements.txt` into the job-seeker env
-- AIHawk pulls torch + CUDA (~7GB) which causes OOM during test runs
-- AIHawk must run in its own env: `conda create -n aihawk-env python=3.12`
-- job-seeker env must stay lightweight (no torch, no sentence-transformers, no CUDA)
-
-## Web UI (Streamlit)
-- Run: `bash scripts/manage-ui.sh start` → http://localhost:8501
-- Manage: `start | stop | restart | status | logs`
-- Direct binary: `/devl/miniconda3/envs/job-seeker/bin/streamlit run app/app.py`
-- Entry point: `app/app.py` (uses `st.navigation()` — do NOT run `app/Home.py` directly)
-- `staging.db` is gitignored — SQLite staging layer between discovery and Notion
-
-### Pages
-| Page | File | Purpose |
-|------|------|---------|
-| Home | `app/Home.py` | Dashboard, discovery trigger, danger-zone purge |
-| Job Review | `app/pages/1_Job_Review.py` | Batch approve/reject with sorting |
-| Settings | `app/pages/2_Settings.py` | LLM backends, search profiles, Notion, services |
-| Resume Profile | Settings → Resume Profile tab | Edit AIHawk YAML profile (was standalone `3_Resume_Editor.py`) |
-| Apply Workspace | `app/pages/4_Apply.py` | Cover letter gen + PDF export + mark applied + reject listing |
-| Interviews | `app/pages/5_Interviews.py` | Kanban: phone_screen→interviewing→offer→hired |
-| Interview Prep | `app/pages/6_Interview_Prep.py` | Live reference sheet during calls + Practice Q&A |
-| Survey Assistant | `app/pages/7_Survey.py` | Culture-fit survey help: text paste + screenshot (moondream2) |
-
-## Job Status Pipeline
-```
-pending → approved/rejected (Job Review)
-approved → applied (Apply Workspace — mark applied)
-approved → rejected (Apply Workspace — reject listing button)
-applied → survey (Interviews — "📋 Survey" button; pre-kanban section)
-applied → phone_screen (Interviews — triggers company research)
-survey → phone_screen (Interviews — after survey completed)
-phone_screen → interviewing
-interviewing → offer
-offer → hired
-any stage → rejected (rejection_stage captured for analytics)
-applied/approved → synced (sync.py → Notion)
-```
-
-## SQLite Schema (`staging.db`)
-### `jobs` table key columns
-- Standard: `id, title, company, url, source, location, is_remote, salary, description`
-- Scores: `match_score, keyword_gaps`
-- Dates: `date_found, applied_at, survey_at, phone_screen_at, interviewing_at, offer_at, hired_at`
-- Interview: `interview_date, rejection_stage`
-- Content: `cover_letter, notion_page_id`
-
-### Additional tables
-- `job_contacts` — email thread log per job (direction, subject, from/to, body, received_at)
-- `company_research` — LLM-generated brief per job (company_brief, ceo_brief, talking_points, raw_output, accessibility_brief)
-- `background_tasks` — async LLM task queue (task_type, job_id, status: queued/running/completed/failed)
-- `survey_responses` — per-job Q&A pairs (survey_name, received_at, source, raw_input, image_path, mode, llm_output, reported_score)
-
-## Scripts
-| Script | Purpose |
-|--------|---------|
-| `scripts/discover.py` | JobSpy + custom board scrape → SQLite insert |
-| `scripts/custom_boards/adzuna.py` | Adzuna Jobs API (app_id + app_key in config/adzuna.yaml) |
-| `scripts/custom_boards/theladders.py` | The Ladders scraper via curl_cffi + __NEXT_DATA__ SSR parse |
-| `scripts/match.py` | Resume keyword matching → match_score |
-| `scripts/sync.py` | Push approved/applied jobs to Notion |
-| `scripts/llm_router.py` | LLM fallback chain (reads config/llm.yaml) |
-| `scripts/generate_cover_letter.py` | Cover letter via LLM; detects mission-aligned companies (music/animal welfare/education) and injects Para 3 hint |
-| `scripts/company_research.py` | Pre-interview brief via LLM + optional SearXNG scrape; includes Inclusion & Accessibility section |
-| `scripts/prepare_training_data.py` | Extract cover letter JSONL for fine-tuning |
-| `scripts/finetune_local.py` | Unsloth QLoRA fine-tune on local GPU |
-| `scripts/db.py` | All SQLite helpers (single source of truth) |
-| `scripts/task_runner.py` | Background thread executor — `submit_task(db, type, job_id)` dispatches daemon threads for LLM jobs |
-| `scripts/vision_service/main.py` | FastAPI moondream2 inference on port 8002; `manage-vision.sh` lifecycle |
-
-## LLM Router
-- Config: `config/llm.yaml`
-- Cover letter fallback order: `claude_code → ollama (meghan-cover-writer:latest) → vllm → copilot → anthropic`
-- Research fallback order: `claude_code → vllm (__auto__, ouroboros) → ollama_research (llama3.1:8b) → ...`
-- `meghan-cover-writer:latest` is cover-letter only — it doesn't follow structured markdown prompts for research
-- `LLMRouter.complete()` accepts `fallback_order=` override for per-task routing
-- `LLMRouter.complete()` accepts `images: list[str]` (base64) — vision backends only; non-vision backends skipped when images present
-- Vision fallback order config key: `vision_fallback_order: [vision_service, claude_code, anthropic]`
-- `vision_service` backend type: POST to `/analyze`; skipped automatically when no images provided
-- Claude Code wrapper: `/Library/Documents/Post Fight Processing/server-openai-wrapper-v2.js`
-- Copilot wrapper: `/Library/Documents/Post Fight Processing/manage-copilot.sh start`
-
-## Fine-Tuned Model
-- Model: `meghan-cover-writer:latest` registered in Ollama
-- Base: `unsloth/Llama-3.2-3B-Instruct` (QLoRA, rank 16, 10 epochs)
-- Training data: 62 cover letters from `/Library/Documents/JobSearch/`
-- JSONL: `/Library/Documents/JobSearch/training_data/cover_letters.jsonl`
-- Adapter: `/Library/Documents/JobSearch/training_data/finetune_output/adapter/`
-- Merged: `/Library/Documents/JobSearch/training_data/gguf/meghan-cover-writer/`
-- Re-train: `conda run -n ogma python scripts/finetune_local.py`
-  (uses `ogma` env with unsloth + trl; pin to GPU 0 with `CUDA_VISIBLE_DEVICES=0`)
-
-## Background Tasks
-- Cover letter gen and company research run as daemon threads via `scripts/task_runner.py`
-- Tasks survive page navigation; results written to existing tables when done
-- On server restart, `app.py` startup clears any stuck `running`/`queued` rows to `failed`
-- Dedup: only one queued/running task per `(task_type, job_id)` at a time
-- Sidebar indicator (`app/app.py`) polls every 3s via `@st.fragment(run_every=3)`
-- ⚠️ Streamlit fragment + sidebar: use `with st.sidebar: _fragment()` — sidebar context must WRAP the call, not be inside the fragment body
-
-## Vision Service
-- Script: `scripts/vision_service/main.py` (FastAPI, port 8002)
-- Model: `vikhyatk/moondream2` revision `2025-01-09` — lazy-loaded on first `/analyze` (~1.8GB download)
-- GPU: 4-bit quantization when CUDA available (~1.5GB VRAM); CPU fallback
-- Conda env: `job-seeker-vision` — separate from job-seeker (torch + transformers live here)
-- Create env: `conda env create -f scripts/vision_service/environment.yml`
-- Manage: `bash scripts/manage-vision.sh start|stop|restart|status|logs`
-- Survey page degrades gracefully to text-only when vision service is down
-- ⚠️ Never install vision deps (torch, bitsandbytes, transformers) into the job-seeker env
-
-## Company Research
-- Script: `scripts/company_research.py`
-- Auto-triggered when a job moves to `phone_screen` in the Interviews kanban
-- Three-phase: (1) SearXNG company scrape → (1b) SearXNG news snippets → (2) LLM synthesis
-- SearXNG scraper: `/Library/Development/scrapers/companyScraper.py`
-- SearXNG Docker: run `docker compose up -d` from `/Library/Development/scrapers/SearXNG/` (port 8888)
-- `beautifulsoup4` and `fake-useragent` are installed in job-seeker env (required for scraper)
-- News search hits `/search?format=json` — JSON format must be enabled in `searxng-config/settings.yml`
-- ⚠️ `settings.yml` owned by UID 977 (container user) — use `docker cp` to update, not direct writes
-- ⚠️ `settings.yml` requires `use_default_settings: true` at the top or SearXNG fails schema validation
-- `companyScraper` calls `sys.exit()` on missing deps — use `except BaseException` not `except Exception`
-
-## Email Classifier Labels
-Six labels: `interview_request`, `rejection`, `offer`, `follow_up`, `survey_received`, `other`
-- `survey_received` — links or requests to complete a culture-fit survey/assessment
-
-## Services (managed via Settings → Services tab)
-| Service | Port | Notes |
-|---------|------|-------|
-| Streamlit UI | 8501 | `bash scripts/manage-ui.sh start` |
-| Ollama | 11434 | `sudo systemctl start ollama` |
-| Claude Code Wrapper | 3009 | `manage-services.sh start` in Post Fight Processing |
-| GitHub Copilot Wrapper | 3010 | `manage-copilot.sh start` in Post Fight Processing |
-| vLLM Server | 8000 | Manual start only |
-| SearXNG | 8888 | `docker compose up -d` in scrapers/SearXNG/ |
-| Vision Service | 8002 | `bash scripts/manage-vision.sh start` — moondream2 survey screenshot analysis |
-
-## Notion
-- DB: "Tracking Job Applications" (ID: `1bd75cff-7708-8007-8c00-f1de36620a0a`)
-- `config/notion.yaml` is gitignored (live token); `.example` is committed
-- Field names are non-obvious — always read from `field_map` in `config/notion.yaml`
-- "Salary" = Notion title property (unusual — it's the page title field)
-- "Job Source" = `multi_select` type
-- "Role Link" = URL field
-- "Status of Application" = status field; new listings use "Application Submitted"
-- Sync pushes `approved` + `applied` jobs; marks them `synced` after
-
-## Key Config Files
-- `config/notion.yaml` — gitignored, has token + field_map
-- `config/notion.yaml.example` — committed template
-- `config/search_profiles.yaml` — titles, locations, boards, custom_boards, exclude_keywords, mission_tags (per profile)
-- `config/llm.yaml` — LLM backend priority chain + enabled flags
-- `config/tokens.yaml` — gitignored, stores HF token (chmod 600)
-- `config/adzuna.yaml` — gitignored, Adzuna API app_id + app_key
-- `config/adzuna.yaml.example` — committed template
-
-## Custom Job Board Scrapers
-- `scripts/custom_boards/adzuna.py` — Adzuna Jobs API; credentials in `config/adzuna.yaml`
-- `scripts/custom_boards/theladders.py` — The Ladders SSR scraper; needs `curl_cffi` installed
-- Scrapers registered in `CUSTOM_SCRAPERS` dict in `discover.py`
-- Activated per-profile via `custom_boards: [adzuna, theladders]` in `search_profiles.yaml`
-- `enrich_all_descriptions()` in `enrich_descriptions.py` covers all sources (not just Glassdoor)
-- Home page "Fill Missing Descriptions" button dispatches `enrich_descriptions` task
-
-## Mission Alignment & Accessibility
-- Preferred industries: music, animal welfare, children's education (hardcoded in `generate_cover_letter.py`)
-- `detect_mission_alignment(company, description)` injects a Para 3 hint into cover letters for aligned companies
-- Company research includes an "Inclusion & Accessibility" section (8th section of the brief) in every brief
-- Accessibility search query in `_SEARCH_QUERIES` hits SearXNG for ADA/ERG/disability signals
-- `accessibility_brief` column in `company_research` table; shown in Interview Prep under ♿ section
-- This info is for personal decision-making ONLY — never disclosed in applications
-- In generalization: these become `profile.mission_industries` + `profile.accessibility_priority` in `user.yaml`
-
-## Document Rule
-Resumes and cover letters live in `/Library/Documents/JobSearch/` or Notion — never committed to this repo.
-
-## AIHawk (LinkedIn Easy Apply)
-- Cloned to `aihawk/` (gitignored)
-- Config: `aihawk/data_folder/plain_text_resume.yaml` — search FILL_IN for gaps
-- Self-ID: non-binary, pronouns any, no disability/drug-test disclosure
-- Run: `conda run -n job-seeker python aihawk/main.py`
-- Playwright: `conda run -n job-seeker python -m playwright install chromium`
-
-## Git Remote
-- Forgejo self-hosted at https://git.opensourcesolarpunk.com (username: pyr0ball)
-- `git remote add origin https://git.opensourcesolarpunk.com/pyr0ball/job-seeker.git`
-
-## Subagents
-Use `general-purpose` subagent type (not `Bash`) when tasks require file writes.
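The transitions listed under "Job Status Pipeline" in the removed CLAUDE.md can be read as a small state machine. The sketch below is one possible encoding, not the project's code: `TRANSITIONS` and `can_transition` are hypothetical names, and treating `hired`, `rejected`, and `synced` as terminal states is an assumption made for illustration.

```python
from __future__ import annotations

# Allowed status changes, transcribed from the "Job Status Pipeline" list.
# "any stage -> rejected" is expanded to every non-terminal status (assumption).
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"approved", "rejected"},
    "approved": {"applied", "rejected", "synced"},
    "applied": {"survey", "phone_screen", "rejected", "synced"},
    "survey": {"phone_screen", "rejected"},
    "phone_screen": {"interviewing", "rejected"},
    "interviewing": {"offer", "rejected"},
    "offer": {"hired", "rejected"},
}


def can_transition(current: str, new: str) -> bool:
    """Return True when moving a job from `current` to `new` is listed above."""
    return new in TRANSITIONS.get(current, set())
```

A guard like this could sit in front of any status update so that, for example, a job cannot jump from `pending` straight to `hired`.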
@@ -405,7 +405,7 @@ elif step == 4:
     if errs:
         st.error("\n".join(errs))
     else:
-        resume_yaml_path = _ROOT / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+        resume_yaml_path = _ROOT / "config" / "plain_text_resume.yaml"
         resume_yaml_path.parent.mkdir(parents=True, exist_ok=True)
         resume_data = {**parsed, "experience": experience} if parsed else {"experience": experience}
         resume_yaml_path.write_text(
File diff suppressed because it is too large
@@ -28,7 +28,7 @@ from scripts.db import (
 from scripts.task_runner import submit_task
 
 DOCS_DIR = _profile.docs_dir if _profile else Path.home() / "Documents" / "JobSearch"
-RESUME_YAML = Path(__file__).parent.parent.parent / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+RESUME_YAML = Path(__file__).parent.parent.parent / "config" / "plain_text_resume.yaml"
 
 st.title("🚀 Apply Workspace")
 
@@ -1,4 +1,15 @@
 profiles:
+- boards:
+  - linkedin
+  - indeed
+  - glassdoor
+  - zip_recruiter
+  job_titles:
+  - Customer Service Specialist
+  locations:
+  - San Francisco CA
+  name: default
+  remote_only: false
 - boards:
   - linkedin
   - indeed
193
config/skills_suggestions.yaml
Normal file
@@ -0,0 +1,193 @@
+# skills_suggestions.yaml — Bundled tag suggestions for the Skills & Keywords UI.
+# Shown as searchable options in the multiselect. Users can add custom tags beyond these.
+# Future: community aggregate (paid tier) will supplement this list from anonymised installs.
+
+skills:
+  # ── Customer Success & Account Management ──
+  - Customer Success
+  - Technical Account Management
+  - Account Management
+  - Customer Onboarding
+  - Renewal Management
+  - Churn Prevention
+  - Expansion Revenue
+  - Executive Relationship Management
+  - Escalation Management
+  - QBR Facilitation
+  - Customer Advocacy
+  - Voice of the Customer
+  - Customer Health Scoring
+  - Success Planning
+  - Customer Education
+  - Implementation Management
+  # ── Revenue & Operations ──
+  - Revenue Operations
+  - Sales Operations
+  - Pipeline Management
+  - Forecasting
+  - Contract Negotiation
+  - Upsell & Cross-sell
+  - ARR / MRR Management
+  - NRR Optimization
+  - Quota Attainment
+  # ── Leadership & Management ──
+  - Team Leadership
+  - People Management
+  - Cross-functional Collaboration
+  - Change Management
+  - Stakeholder Management
+  - Executive Presentation
+  - Strategic Planning
+  - OKR Setting
+  - Hiring & Recruiting
+  - Coaching & Mentoring
+  - Performance Management
+  # ── Project & Program Management ──
+  - Project Management
+  - Program Management
+  - Agile / Scrum
+  - Kanban
+  - Risk Management
+  - Resource Planning
+  - Process Improvement
+  - SOP Development
+  # ── Technical Skills ──
+  - SQL
+  - Python
+  - Data Analysis
+  - Tableau
+  - Looker
+  - Power BI
+  - Excel / Google Sheets
+  - REST APIs
+  - Salesforce
+  - HubSpot
+  - Gainsight
+  - Totango
+  - ChurnZero
+  - Zendesk
+  - Intercom
+  - Jira
+  - Confluence
+  - Notion
+  - Slack
+  - Zoom
+  # ── Communications & Writing ──
+  - Executive Communication
+  - Technical Writing
+  - Proposal Writing
+  - Presentation Skills
+  - Public Speaking
+  - Stakeholder Communication
+  # ── Compliance & Security ──
+  - Compliance
+  - Risk Assessment
+  - SOC 2
+  - ISO 27001
+  - GDPR
+  - Security Awareness
+  - Vendor Management
+
+domains:
+  # ── Software & Tech ──
+  - B2B SaaS
+  - Enterprise Software
+  - Cloud Infrastructure
+  - Developer Tools
+  - Cybersecurity
+  - Data & Analytics
+  - AI / ML Platform
+  - FinTech
+  - InsurTech
+  - LegalTech
+  - HR Tech
+  - MarTech
+  - AdTech
+  - DevOps / Platform Engineering
+  - Open Source
+  # ── Industry Verticals ──
+  - Healthcare / HealthTech
+  - Education / EdTech
+  - Non-profit / Social Impact
+  - Government / GovTech
+  - E-commerce / Retail
+  - Manufacturing
+  - Financial Services
+  - Media & Entertainment
+  - Music Industry
+  - Logistics & Supply Chain
+  - Real Estate / PropTech
+  - Energy / CleanTech
+  - Hospitality & Travel
+  # ── Market Segments ──
+  - Enterprise
+  - Mid-Market
+  - SMB / SME
+  - Startup
+  - Fortune 500
+  - Public Sector
+  - International / Global
+  # ── Business Models ──
+  - Subscription / SaaS
+  - Marketplace
+  - Usage-based Pricing
+  - Professional Services
+  - Self-serve / PLG
+
+keywords:
+  # ── CS Metrics & Outcomes ──
+  - NPS
+  - CSAT
+  - CES
+  - Churn Rate
+  - Net Revenue Retention
+  - Gross Revenue Retention
+  - Logo Retention
+  - Time-to-Value
+  - Product Adoption
+  - Feature Utilisation
+  - Health Score
+  - Customer Lifetime Value
+  # ── Sales & Growth ──
+  - ARR
+  - MRR
+  - GRR
+  - NRR
+  - Expansion ARR
+  - Pipeline Coverage
+  - Win Rate
+  - Average Contract Value
+  - Land & Expand
+  - Multi-threading
+  # ── Process & Delivery ──
+  - Onboarding
+  - Implementation
+  - Knowledge Transfer
+  - Escalation
+  - SLA
+  - Root Cause Analysis
+  - Post-mortem
+  - Runbook
+  - Playbook Development
+  - Feedback Loop
+  - Product Roadmap Input
+  # ── Team & Culture ──
+  - Cross-functional
+  - Distributed Team
+  - Remote-first
+  - High-growth
+  - Fast-paced
+  - Autonomous
+  - Data-driven
+  - Customer-centric
+  - Empathetic Leadership
+  - Inclusive Culture
+  # ── Job-seeker Keywords ──
+  - Strategic
+  - Proactive
+  - Hands-on
+  - Scalable Processes
+  - Operational Excellence
+  - Business Impact
+  - Executive Visibility
+  - Player-Coach
@@ -28,7 +28,7 @@ dependencies:
   - fake-useragent # company scraper rotation
 
   # ── LLM / AI backends ─────────────────────────────────────────────────────
-  - openai>=1.0 # used for OpenAI-compat backends (ollama, vllm, wrappers)
+  - openai>=1.55.0,<2.0.0 # >=1.55 required for httpx 0.28 compat; <2.0 for langchain-openai
   - anthropic>=0.80 # direct Anthropic API fallback
   - ollama # Python client for Ollama management
   - langchain>=0.2
@@ -54,6 +54,9 @@ dependencies:
   - pyyaml>=6.0
   - python-dotenv
 
+  # ── Auth / licensing ──────────────────────────────────────────────────────
+  - PyJWT>=2.8
+
   # ── Utilities ─────────────────────────────────────────────────────────────
   - sqlalchemy
   - tqdm
@@ -22,7 +22,7 @@ curl_cffi
 fake-useragent
 
 # ── LLM / AI backends ─────────────────────────────────────────────────────
-openai>=1.0
+openai>=1.55.0,<2.0.0 # >=1.55 required for httpx 0.28 compat; <2.0 for langchain-openai
 anthropic>=0.80
 ollama
 langchain>=0.2
@@ -51,6 +51,9 @@ json-repair
 pyyaml>=6.0
 python-dotenv
 
+# ── Auth / licensing ──────────────────────────────────────────────────────
+PyJWT>=2.8
+
 # ── Utilities ─────────────────────────────────────────────────────────────
 sqlalchemy
 tqdm
@@ -193,7 +193,7 @@ def _parse_sections(text: str) -> dict[str, str]:
     return sections
 
 
-_RESUME_YAML = Path(__file__).parent.parent / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+_RESUME_YAML = Path(__file__).parent.parent / "config" / "plain_text_resume.yaml"
 _KEYWORDS_YAML = Path(__file__).parent.parent / "config" / "resume_keywords.yaml"
 
 
@@ -26,11 +26,19 @@ LETTERS_DIR = _profile.docs_dir if _profile else Path.home() / "Documents" / "Jo
 LETTER_GLOB = "*Cover Letter*.md"
 
 # Background injected into every prompt so the model has the candidate's facts
-SYSTEM_CONTEXT = (
-    f"You are writing cover letters for {_profile.name}. {_profile.career_summary}"
-    if _profile else
-    "You are a professional cover letter writer. Write in first person."
-)
+def _build_system_context() -> str:
+    if not _profile:
+        return "You are a professional cover letter writer. Write in first person."
+    parts = [f"You are writing cover letters for {_profile.name}. {_profile.career_summary}"]
+    if _profile.candidate_voice:
+        parts.append(
+            f"Voice and personality: {_profile.candidate_voice} "
+            "Write in a way that reflects these authentic traits — not as a checklist, "
+            "but as a natural expression of who this person is."
+        )
+    return " ".join(parts)
+
+SYSTEM_CONTEXT = _build_system_context()
 
 
 # ── Mission-alignment detection ───────────────────────────────────────────────
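The new `_build_system_context()` reads the module-level `_profile`, which makes it awkward to exercise in isolation. A minimal sketch of the same assembly logic with the profile passed in as an argument follows; the `Profile` dataclass and the `build_system_context` name are hypothetical stand-ins, not the project's `UserProfile` API, and the voice sentence is shortened.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical stand-in exposing only the three fields the prompt uses."""
    name: str
    career_summary: str
    candidate_voice: str = ""


def build_system_context(profile: Profile | None) -> str:
    """Assemble the system prompt; the voice sentence is added only when set."""
    if profile is None:
        return "You are a professional cover letter writer. Write in first person."
    parts = [f"You are writing cover letters for {profile.name}. {profile.career_summary}"]
    if profile.candidate_voice:
        parts.append(f"Voice and personality: {profile.candidate_voice}")
    return " ".join(parts)


plain = build_system_context(Profile("Alex", "Ten years in customer success."))
voiced = build_system_context(
    Profile("Alex", "Ten years in customer success.", "Warm and direct.")
)
```

With an empty `candidate_voice` the prompt matches the wording of the old `SYSTEM_CONTEXT`, so profiles that never set the new field should see no change in behaviour.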
@@ -58,6 +66,13 @@ _MISSION_SIGNALS: dict[str, list[str]] = {
         "instructure", "canvas lms", "clever", "district", "teacher",
         "k-12", "k12", "grade", "pedagogy",
     ],
+    "social_impact": [
+        "nonprofit", "non-profit", "501(c)", "social impact", "mission-driven",
+        "public benefit", "community", "underserved", "equity", "justice",
+        "humanitarian", "advocacy", "charity", "foundation", "ngo",
+        "social good", "civic", "public health", "mental health", "food security",
+        "housing", "homelessness", "poverty", "workforce development",
+    ],
 }
 
 _candidate = _profile.name if _profile else "the candidate"
@@ -79,6 +94,11 @@ _MISSION_DEFAULTS: dict[str, str] = {
         f"{_candidate}'s values. Para 3 should reflect this authentic connection specifically "
         "and warmly."
     ),
+    "social_impact": (
+        f"This organization is mission-driven / social impact focused — exactly the kind of "
+        f"cause {_candidate} cares deeply about. Para 3 should warmly reflect their genuine "
+        "desire to apply their skills to work that makes a real difference in people's lives."
+    ),
 }
 
 
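The diff adds a `social_impact` entry to `_MISSION_SIGNALS` but does not show how the table is consumed. The sketch below assumes plain case-insensitive substring matching over the company name and description. That is a guess at the approach, not the actual `detect_mission_alignment` implementation, and the `education` key and the trimmed keyword lists are illustrative only.

```python
from __future__ import annotations

# Abbreviated, illustrative copy of the signal table; the category key
# "education" is an assumption (the diff only shows the tail of that list).
MISSION_SIGNALS: dict[str, list[str]] = {
    "education": ["k-12", "k12", "teacher", "pedagogy"],
    "social_impact": ["nonprofit", "non-profit", "501(c)", "mission-driven", "ngo"],
}


def detect_mission(company: str, description: str) -> str | None:
    """Return the first category with a keyword found in the text, else None."""
    text = f"{company} {description}".lower()
    for category, signals in MISSION_SIGNALS.items():
        if any(signal in text for signal in signals):
            return category
    return None


hit = detect_mission("Acme Relief", "We are a 501(c)(3) nonprofit.")
miss = detect_mission("Acme Pay", "B2B payments infrastructure.")
overmatch = detect_mission("Acme Data", "Experience with MongoDB required.")
```

If the real matcher also uses substring tests, short signals deserve a second look: here `ngo` matches inside "MongoDB", and broad words from the new list such as `community` or `foundation` would match many ordinary postings. Word-boundary matching is one way to narrow that.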
@@ -84,9 +84,9 @@ def _extract_career_summary(source: Path) -> str:
 
 def _extract_personal_info(source: Path) -> dict:
     """Extract personal info from aihawk resume yaml."""
-    resume = source / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+    resume = source / "config" / "plain_text_resume.yaml"
     if not resume.exists():
-        resume = source / "config" / "plain_text_resume.yaml"
+        resume = source / "aihawk" / "data_folder" / "plain_text_resume.yaml"
     if not resume.exists():
         return {}
     data = _load_yaml(resume)
@@ -197,8 +197,10 @@ def _copy_configs(source: Path, dest: Path, apply: bool) -> None:
 
 def _copy_aihawk_resume(source: Path, dest: Path, apply: bool) -> None:
     print("\n── Copying AIHawk resume profile")
-    src = source / "aihawk" / "data_folder" / "plain_text_resume.yaml"
-    dst = dest / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+    src = source / "config" / "plain_text_resume.yaml"
+    if not src.exists():
+        src = source / "aihawk" / "data_folder" / "plain_text_resume.yaml"
+    dst = dest / "config" / "plain_text_resume.yaml"
     _copy_file(src, dst, apply)
 
 
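Both migration helpers now apply the same rule: prefer `config/plain_text_resume.yaml` and fall back to the legacy `aihawk/data_folder/` copy. A minimal sketch of that lookup as a single helper is shown below; `resolve_resume` and `demo` are hypothetical names used only for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

RESUME_NAME = "plain_text_resume.yaml"


def resolve_resume(root: Path) -> Path | None:
    """Return the first existing resume path, preferring the new config/ location."""
    candidates = [
        root / "config" / RESUME_NAME,
        root / "aihawk" / "data_folder" / RESUME_NAME,
    ]
    return next((p for p in candidates if p.exists()), None)


def demo() -> tuple[str | None, str | None]:
    """Resolve against a legacy-only tree, then again once config/ exists."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        legacy = root / "aihawk" / "data_folder" / RESUME_NAME
        legacy.parent.mkdir(parents=True)
        legacy.write_text("name: example\n")
        before = resolve_resume(root)

        current = root / "config" / RESUME_NAME
        current.parent.mkdir()
        current.write_text("name: example\n")
        after = resolve_resume(root)
        return (
            before.parent.name if before else None,
            after.parent.name if after else None,
        )
```

Keeping the candidate order in one list means a future third location is a one-line change rather than another nested `if not ...exists()`.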
@@ -92,6 +92,18 @@ def _find_column_split(page) -> float | None:
     return split_x if split_x and best_gap > page.width * 0.03 else None
 
 
+_CID_BULLETS = {127, 149, 183}  # common bullet CIDs across ATS-reembedded fonts
+
+def _clean_cid(text: str) -> str:
+    """Replace (cid:NNN) glyph references emitted by pdfplumber when a PDF font
+    lacks a ToUnicode map. Known bullet CIDs become '•'; everything else is
+    stripped so downstream section parsing sees clean text."""
+    def _replace(m: re.Match) -> str:
+        n = int(m.group(1))
+        return "•" if n in _CID_BULLETS else ""
+    return re.sub(r"\(cid:(\d+)\)", _replace, text)
+
+
 def extract_text_from_pdf(file_bytes: bytes) -> str:
     """Extract text from PDF, handling two-column layouts via gutter detection.
 
@@ -116,12 +128,12 @@ def extract_text_from_pdf(file_bytes: bytes) -> str:
                 pages.append("\n".join(filter(None, [header_text, left_text, right_text])))
                 continue
             pages.append(page.extract_text() or "")
-    return "\n".join(pages)
+    return _clean_cid("\n".join(pages))
 
 
 def extract_text_from_docx(file_bytes: bytes) -> str:
     doc = Document(io.BytesIO(file_bytes))
-    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
+    return _clean_cid("\n".join(p.text for p in doc.paragraphs if p.text.strip()))
 
 
 def extract_text_from_odt(file_bytes: bytes) -> str:
@@ -139,7 +151,7 @@ def extract_text_from_odt(file_bytes: bytes) -> str:
             text = "".join(elem.itertext()).strip()
             if text:
                 lines.append(text)
-    return "\n".join(lines)
+    return _clean_cid("\n".join(lines))
 
 
 # ── Section splitter ──────────────────────────────────────────────────────────
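To see what the CID clean-up does to extracted text, here is a standalone copy of the substitution run against a made-up sample line. The regex and bullet set are taken from the diff; the sample string and the public names `clean_cid` and `CID_BULLETS` are illustrative.

```python
import re

CID_BULLETS = {127, 149, 183}  # same set as in the diff


def clean_cid(text: str) -> str:
    """Map known bullet CIDs to a bullet character and strip every other CID."""
    def _replace(m: re.Match) -> str:
        return "•" if int(m.group(1)) in CID_BULLETS else ""
    return re.sub(r"\(cid:(\d+)\)", _replace, text)


sample = "(cid:127) Led onboarding(cid:3)for 40 accounts"
cleaned = clean_cid(sample)
print(cleaned)  # • Led onboardingfor 40 accounts
```

The second reference in the sample shows a side effect worth knowing about: when an unmapped CID happened to stand for a space, stripping it joins the neighbouring words. Whether that occurs in practice depends on the fonts in the uploaded resumes.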
67
scripts/skills_utils.py
Normal file
@@ -0,0 +1,67 @@
+"""
+skills_utils.py — Content filter and suggestion loader for the skills tagging system.
+
+load_suggestions(category) → list[str]    bundled suggestions for a category
+filter_tag(tag) → str | None              cleaned tag, or None if rejected
+"""
+from __future__ import annotations
+import re
+from pathlib import Path
+
+_SUGGESTIONS_FILE = Path(__file__).parent.parent / "config" / "skills_suggestions.yaml"
+
+# ── Content filter ─────────────────────────────────────────────────────────────
+# Tags must be short, human-readable skill/domain labels. No URLs, no abuse.
+
+_BLOCKED = {
+    # profanity placeholder — extend as needed
+    "fuck", "shit", "ass", "bitch", "cunt", "dick", "bastard", "damn",
+}
+
+_URL_RE = re.compile(r"https?://|www\.|\.com\b|\.net\b|\.org\b", re.I)
+_ALLOWED_CHARS = re.compile(r"^[\w\s\-\.\+\#\/\&\(\)]+$", re.UNICODE)
+
+
+def filter_tag(raw: str) -> str | None:
+    """Return a cleaned tag string, or None if the tag should be rejected.
+
+    Rejection criteria:
+    - Blank after stripping
+    - Too short (< 2 chars) or too long (> 60 chars)
+    - Contains a URL pattern
+    - Contains disallowed characters
+    - Matches a blocked term (case-insensitive, whole-word)
+    - Repeated character run (e.g. 'aaaaa')
+    """
+    tag = " ".join(raw.strip().split())  # normalise whitespace
+    if not tag or len(tag) < 2:
+        return None
+    if len(tag) > 60:
+        return None
+    if _URL_RE.search(tag):
+        return None
+    if not _ALLOWED_CHARS.match(tag):
+        return None
+    lower = tag.lower()
+    for blocked in _BLOCKED:
+        if re.search(rf"\b{re.escape(blocked)}\b", lower):
+            return None
+    if re.search(r"(.)\1{4,}", lower):  # 5+ repeated chars
+        return None
+    return tag
+
+
+# ── Suggestion loader ──────────────────────────────────────────────────────────
+
+def load_suggestions(category: str) -> list[str]:
+    """Return the bundled suggestion list for a category ('skills'|'domains'|'keywords').
+    Returns an empty list if the file is missing or the category is not found.
+    """
+    if not _SUGGESTIONS_FILE.exists():
+        return []
+    try:
+        import yaml
+        data = yaml.safe_load(_SUGGESTIONS_FILE.read_text()) or {}
+        return list(data.get(category, []))
+    except Exception:
+        return []
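The two regexes in `filter_tag` decide most accept/reject outcomes for real-world skill names, so it helps to probe them directly. The sketch below copies `_URL_RE` and `_ALLOWED_CHARS` from the new file and checks a handful of tags; the helper name `passes_patterns` and the sample tags are illustrative, and the length, blocked-word, and repeated-character checks are deliberately left out.

```python
import re

# Patterns copied from skills_utils.py in the diff.
URL_RE = re.compile(r"https?://|www\.|\.com\b|\.net\b|\.org\b", re.I)
ALLOWED_CHARS = re.compile(r"^[\w\s\-\.\+\#\/\&\(\)]+$", re.UNICODE)


def passes_patterns(tag: str) -> bool:
    """True when a tag clears the URL check and the allowed-character check."""
    return not URL_RE.search(tag) and bool(ALLOWED_CHARS.match(tag))


samples = ["C++", "C#", "CI/CD", "Node.js", "R&D", "ASP.NET", "example.com", "SOC 2: Type II"]
results = {tag: passes_patterns(tag) for tag in samples}
```

Under these patterns `C++`, `C#`, `CI/CD`, `Node.js`, and `R&D` pass, while `example.com` is rejected as intended. `ASP.NET` is also rejected, because `\.net\b` is matched case-insensitively, and `SOC 2: Type II` fails on the colon. If tags like those should be allowed, the URL pattern or character class would need adjusting.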
@@ -15,6 +15,7 @@ _DEFAULTS = {
     "phone": "",
     "linkedin": "",
     "career_summary": "",
+    "candidate_voice": "",
     "nda_companies": [],
     "docs_dir": "~/Documents/JobSearch",
     "ollama_models_dir": "~/models/ollama",
@@ -61,6 +62,7 @@ class UserProfile:
         self.phone: str = data["phone"]
         self.linkedin: str = data["linkedin"]
         self.career_summary: str = data["career_summary"]
+        self.candidate_voice: str = data.get("candidate_voice", "")
         self.nda_companies: list[str] = [c.lower() for c in data["nda_companies"]]
         self.docs_dir: Path = Path(data["docs_dir"]).expanduser().resolve()
         self.ollama_models_dir: Path = Path(data["ollama_models_dir"]).expanduser().resolve()
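The new key is covered twice: once in `_DEFAULTS` and once through `data.get("candidate_voice", "")`. The sketch below shows why a defaults overlay alone lets an older `user.yaml` without the key still load. It assumes the loader merges defaults under the file contents, which the diff does not show; `merge_with_defaults` and the trimmed `DEFAULTS` dict are hypothetical.

```python
from __future__ import annotations

# Trimmed, illustrative subset of the defaults table.
DEFAULTS: dict = {"career_summary": "", "candidate_voice": "", "nda_companies": []}


def merge_with_defaults(loaded: dict | None) -> dict:
    """Overlay values read from user.yaml on top of the defaults (shallow merge)."""
    return {**DEFAULTS, **(loaded or {})}


old_config = {"career_summary": "Ten years in customer success."}  # predates candidate_voice
merged = merge_with_defaults(old_config)
```

One caveat with a shallow merge like this: list defaults such as `nda_companies` are shared between the defaults table and every merged result, so mutating one in place would affect the others. Copying mutable defaults avoids that.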