Imitation pipeline: resume career summary generation #30

Open
opened 2026-04-10 22:38:45 -07:00 by pyr0ball · 0 comments
Owner

Context: Peregrine's resume parser uses regex as its primary extraction path but falls back to an LLM to generate a career summary when no summary section is detected in the document. This is a small, bounded generation task well-suited to a dedicated fine-tuned 1B-3B model.

What Peregrine uses this for:
After `parse_resume()` runs regex-based section extraction, `structure_resume()` checks whether a `career_summary` field was populated. If not (common for resumes that lack an explicit Summary or Objective section), it calls `_llm_career_summary()`, which sends the first 1500 characters of the raw resume text to the LLM and requests a 2-3 sentence professional career summary.
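The fallback path above can be sketched as follows. Only the function names come from `resume_parser.py`; the internal structure and the stubbed return value are illustrative assumptions.

```python
def _llm_career_summary(raw_text: str) -> str:
    """Stand-in for the real LLM call (the actual function routes
    through the LLMRouter); stubbed here so the sketch runs."""
    return "Seasoned engineer with broad full-stack experience."


def structure_resume(raw_text: str, sections: dict) -> dict:
    """Assemble structured fields; fall back to the LLM only when
    regex extraction left career_summary empty."""
    structured = dict(sections)
    if not structured.get("career_summary"):
        # A failed LLM call returns "" and the field is left blank.
        structured["career_summary"] = _llm_career_summary(raw_text)
    return structured
```

The key property is that the LLM is never consulted when regex extraction already found a summary, which keeps the generation task rare and bounded.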

Input/output schema:

  • Input: plain-text prompt: `"Write a 2-3 sentence professional career summary for this candidate based on their resume. Return only the summary text, no labels.\n\nResume:\n{raw_text[:1500]}"`
  • Output: 2-3 sentence plain-text career summary; no special formatting expected; any failure returns an empty string and the field is left blank
  • No system prompt is set for this call; uses the default `LLMRouter` fallback chain
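A minimal sketch of the prompt construction described in the schema. The wording and 1500-character window are taken from the issue; the helper name is hypothetical.

```python
def build_summary_prompt(raw_text: str) -> str:
    """Build the career-summary prompt; only the first 1500
    characters of the raw resume text are included."""
    return (
        "Write a 2-3 sentence professional career summary for this "
        "candidate based on their resume. Return only the summary text, "
        "no labels.\n\nResume:\n" + raw_text[:1500]
    )
```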

Current model/fallback chain:
Default `LLMRouter()` — no task-specific override; uses `fallback_order` from `config/llm.yaml` (typically `claude_code → ollama → vllm → copilot → anthropic`).
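For reference, a plausible shape for the relevant fragment of `config/llm.yaml`. Only the provider names and their order come from the issue; the surrounding key structure is an assumption.

```yaml
# Hypothetical config/llm.yaml fragment -- only the fallback_order
# values are from the issue; key layout is assumed.
fallback_order:
  - claude_code
  - ollama
  - vllm
  - copilot
  - anthropic
```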

Recommended model domain:
Extraction + summarization, 1B-3B. The task is highly constrained: the input is structured resume text and the output is a fixed-format 2-3 sentence summary. This is the simplest LLM task in the Peregrine pipeline and the lowest bar for a fine-tuned replacement — a 1B model fine-tuned on 500-1000 examples should match or exceed the base model's quality.

Can Avocet produce training data for it?
Yes — straightforwardly. Any resume text + human-written or reviewed career summary is a valid training pair. Peregrine users who have an existing summary section provide implicit ground truth (resume text in → summary section out). Labeling effort is low: present the generated summary alongside the resume excerpt and ask for a thumbs up/down with optional edit.

Suggested data collection approach:

  • Silver labels: where `parse_resume()` successfully extracts a summary section, that (resume text, summary) pair is a free training example requiring no human review
  • For LLM-generated summaries: build a lightweight Avocet card that shows the resume excerpt and generated summary; labeler edits or approves
  • Volume target is low — 500-1000 pairs should suffice for a 1B-3B fine-tune on this constrained task
  • Cross-product opportunity: any CF product that ingests resumes (Falcon for benefits applications, etc.) can contribute to the same dataset
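The silver-label step above can be sketched as a small harvesting pass. Function names, the dict field layout, and the JSONL output format are illustrative assumptions; the 1500-character input window mirrors the production prompt.

```python
import json


def harvest_silver_pairs(parsed_resumes):
    """Collect (resume excerpt, summary) training pairs wherever
    regex extraction already found a summary section -- these
    need no human review."""
    pairs = []
    for doc in parsed_resumes:
        summary = (doc.get("career_summary") or "").strip()
        if summary:
            pairs.append({
                # Mirror the 1500-char window the production prompt uses.
                "input": doc["raw_text"][:1500],
                "output": summary,
            })
    return pairs


def write_jsonl(pairs, path):
    """Write one JSON object per line -- a common fine-tune format."""
    with open(path, "w", encoding="utf-8") as fh:
        for pair in pairs:
            fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Resumes whose summary field is empty fall through to the human-reviewed Avocet card path rather than entering the silver set.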

Related: Peregrine `scripts/resume_parser.py` (`_llm_career_summary`, `structure_resume`)

pyr0ball added the enhancement label 2026-04-10 22:38:45 -07:00
Reference: Circuit-Forge/avocet#30