Imitation pipeline: company research synthesis #29

Open
opened 2026-04-10 22:38:27 -07:00 by pyr0ball · 0 comments
Owner

Context: Peregrine auto-generates a structured pre-interview research brief whenever a job moves to `phone_screen`. This is a long-context synthesis task that currently requires a capable 7B+ model. A fine-tuned smaller model could handle the structured-output portion reliably.

What Peregrine uses this for:
Given a job row (company, title, JD excerpt up to 1500 chars), plus optionally live-scraped data (CEO name, HQ, LinkedIn) and SearXNG search result snippets (news, funding, tech stack, competitors, culture, accessibility signals, CEO press), the model produces a structured markdown brief with 7-9 named sections. The exact set varies by user config: the brief always includes Company Overview, Leadership & Culture, Tech Stack & Product, Funding & Market Position, Recent Developments, Red Flags & Watch-outs, and Talking Points, and optionally adds Inclusion & Accessibility and LGBTQIA+ Inclusion sections. The Talking Points section must reference specific candidate experience entries by name and connect each to a JD signal.

Input/output schema:

  • Input: candidate name + career summary + role + company + JD excerpt + matched resume experience block (top 2 scored entries in full, rest as one-liners) + matched keyword list + live scrape block (CEO/HQ/LinkedIn if available) + up to 6 categories of SearXNG search snippets (4 results each, title + content + URL)
  • Output: structured markdown with `## SectionName` headers; 7-9 sections depending on user flags; `<think>` blocks stripped in post-processing
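The post-processing contract above can be sketched as a validator: strip `<think>` blocks, collect `## ` headers, and check the 7-9 section range plus the always-required set. The section names come from this issue; the function itself is a hypothetical sketch, not Peregrine's actual implementation.

```python
import re

# The seven sections every brief must contain, per the issue description.
REQUIRED_SECTIONS = {
    "Company Overview", "Leadership & Culture", "Tech Stack & Product",
    "Funding & Market Position", "Recent Developments",
    "Red Flags & Watch-outs", "Talking Points",
}

def validate_brief(raw_output: str) -> tuple[bool, list[str]]:
    """Strip <think> blocks, then check that the brief has 7-9 '## ' sections
    including every required one. Returns (ok, section_names)."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    sections = re.findall(r"^## (.+?)\s*$", cleaned, flags=re.MULTILINE)
    ok = 7 <= len(sections) <= 9 and REQUIRED_SECTIONS.issubset(sections)
    return ok, sections
```

A check like this is also a cheap reward signal for fine-tuning: structural compliance can be verified automatically, leaving human labelers to judge groundedness and relevance.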

Current model/fallback chain:
`research_fallback_order` from `config/llm.yaml`: `claude_code` → `vllm` (`__auto__`, `ouroboros`) → `ollama_research` (`llama3.1:8b`) → ...
Note: `alex-cover-writer:latest` is explicitly excluded from this chain — it does not follow structured markdown prompts.
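A fallback chain like this is typically a first-success walk over the configured order. The sketch below is hypothetical (the backend call interface and error handling are assumptions, not Peregrine's code); only the backend names and the exclusion rule come from the issue.

```python
# Walk a fallback order: try each named backend, skip excluded models,
# return the first non-empty result. Backend names mirror config/llm.yaml.

def run_with_fallback(prompt, backends, order,
                      excluded=("alex-cover-writer:latest",)):
    """`backends` maps backend name -> callable(prompt) -> str.
    Tries each entry of `order` in turn; raises if all fail."""
    errors = {}
    for name in order:
        if name in excluded:
            continue  # e.g. models known not to follow structured prompts
        backend = backends.get(name)
        if backend is None:
            continue  # configured but unavailable in this environment
        try:
            result = backend(prompt)
            if result:
                return name, result
        except Exception as exc:  # a real implementation would log this
            errors[name] = exc
    raise RuntimeError(f"all research backends failed: {errors}")
```

One design note: keeping the exclusion list explicit (rather than just omitting the model from the order) documents *why* a model is absent, which matters once the chain is auto-generated from config.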

Recommended model domain:
Long-context synthesis + structured output, 7B-14B. The task has two separable sub-skills: (1) section classification and formatting discipline, and (2) factual grounding against the provided snippets. A model fine-tuned for structured-output compliance could dramatically reduce the minimum capable size.
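Sub-skill (2), factual grounding, can be approximated cheaply before any human labeling. The heuristic below is purely illustrative (the stopword list, tokenizer, and overlap threshold are all arbitrary assumptions): score what fraction of a section's sentences share content words with the provided snippets.

```python
import re

# Minimal stopword list; a real pipeline would use a proper one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with"}

def content_words(text):
    """Lowercased alphanumeric tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOPWORDS}

def groundedness(section_text, snippets, min_overlap=2):
    """Fraction of sentences sharing >= min_overlap content words with
    the pooled snippet vocabulary. Rough proxy, not a factuality check."""
    if not snippets:
        return 0.0
    snippet_vocab = set().union(*(content_words(s) for s in snippets))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", section_text) if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(1 for s in sentences
                   if len(content_words(s) & snippet_vocab) >= min_overlap)
    return grounded / len(sentences)
```

A score like this could pre-rank briefs for review, so labeler time concentrates on the ambiguous middle rather than obviously grounded or obviously hallucinated sections.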

Can Avocet produce training data for it?
Yes, with some extension. The full input prompt and raw output are already stored in Peregrine's `company_research` table (`raw_output` column). A new Avocet card type could present the brief to a labeler with section-by-section quality ratings (groundedness, relevance, talking-point specificity).

Suggested data collection approach:

  • Export existing `company_research.raw_output` rows from `staging.db` as silver labels; pair each with the reconstructed input prompt (available from the matching `jobs` row)
  • Build an Avocet review card that shows brief sections individually and collects section-level quality ratings
  • Flag Talking Points entries specifically — this is the hardest section and the highest-value training signal (requires reasoning about experience relevance)
  • Preference pairs for DPO: present two brief variants (e.g., with vs. without live scrape data) and ask labeler which is more useful
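The first bullet can be sketched as a small export script. Everything schema-specific here is an assumption except the table and column names from the issue: the `job_id` join key and the prompt-reconstruction callable are hypothetical stand-ins for whatever Peregrine actually stores.

```python
import json
import sqlite3

def export_silver_labels(db_path, out_path, reconstruct_prompt):
    """Dump (prompt, completion) pairs from company_research to JSONL.
    `reconstruct_prompt(job_id)` is assumed to rebuild the input prompt
    from the matching jobs row. Returns the number of rows exported."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT job_id, raw_output FROM company_research"
    ).fetchall()
    with open(out_path, "w") as f:
        for row in rows:
            record = {
                "prompt": reconstruct_prompt(row["job_id"]),
                "completion": row["raw_output"],
            }
            f.write(json.dumps(record) + "\n")
    con.close()
    return len(rows)
```

For the DPO bullet, the same JSONL shape extends naturally to `{"prompt": ..., "chosen": ..., "rejected": ...}` once a labeler has picked the more useful of two brief variants.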

Related: Peregrine `scripts/company_research.py`; `company_research` table in `staging.db`

pyr0ball added the enhancement label 2026-04-10 22:38:27 -07:00
Reference: Circuit-Forge/avocet#29