Imitation pipeline: company research synthesis #29
Context:
Peregrine auto-generates a structured pre-interview research brief whenever a job moves to `phone_screen`. This is a long-context synthesis task that currently requires a capable 7B+ model. A fine-tuned smaller model could handle the structured-output portion reliably.

What Peregrine uses this for:
Given a job row (company, title, JD excerpt up to 1500 chars) plus optionally live-scraped data (CEO name, HQ, LinkedIn) and SearXNG search result snippets (news, funding, tech stack, competitors, culture, accessibility signals, CEO press), the model produces a structured markdown brief with exactly 7-9 named sections. Sections vary by user config: always includes Company Overview, Leadership & Culture, Tech Stack & Product, Funding & Market Position, Recent Developments, Red Flags & Watch-outs, and Talking Points; optionally adds Inclusion & Accessibility and LGBTQIA+ Inclusion sections. The Talking Points section must reference specific candidate experience entries by name and connect each to a JD signal.
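The section rules above can be sketched as a small helper. This is a hypothetical illustration, not Peregrine's actual code: the function and flag names are assumptions, only the section names come from this issue.

```python
# Sketch: assemble the 7-9 section list from user config flags.
# Section names are from the issue; everything else is illustrative.

BASE_SECTIONS = [
    "Company Overview",
    "Leadership & Culture",
    "Tech Stack & Product",
    "Funding & Market Position",
    "Recent Developments",
    "Red Flags & Watch-outs",
    "Talking Points",
]

def build_section_list(include_accessibility=False, include_lgbtqia=False):
    """Return the ordered section names for one brief (7-9 total)."""
    extras = []
    if include_accessibility:
        extras.append("Inclusion & Accessibility")
    if include_lgbtqia:
        extras.append("LGBTQIA+ Inclusion")
    # Optional sections slot in before Talking Points so it stays last.
    return BASE_SECTIONS[:-1] + extras + BASE_SECTIONS[-1:]
```

With both flags off this yields the 7 always-on sections; with both on, the full 9.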
Input/output schema:
`## SectionName` headers; 7-9 sections depending on user flags; `<think>` blocks stripped in post-processing.

Current model/fallback chain:
`research_fallback_order` from `config/llm.yaml`: claude_code → vllm (`__auto__`, ouroboros) → ollama_research (llama3.1:8b) → ...

Note:
`alex-cover-writer:latest` is explicitly excluded from this chain because it does not follow structured markdown prompts.

Recommended model domain:
Long-context synthesis + structured output, 7B-14B. The task has two separable sub-skills: (1) section classification and formatting discipline, and (2) factual grounding against the provided snippets. A model fine-tuned for structured-output compliance could dramatically reduce the minimum capable size.
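Sub-skill (1), formatting discipline, is mechanically checkable against the schema above (`## SectionName` headers, 7-9 sections, `<think>` blocks stripped). A minimal sketch, assuming that schema; the function names are assumptions, not Peregrine's post-processing code:

```python
import re

def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks, as the post-processing step does."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

def check_formatting(brief: str, expected_sections: list[str]) -> bool:
    """Verify the brief has exactly the expected '## SectionName'
    headers, in order, and that the count falls in the 7-9 range."""
    cleaned = strip_think_blocks(brief)
    headers = [m.group(1).strip()
               for m in re.finditer(r"^## (.+)$", cleaned, re.MULTILINE)]
    return headers == expected_sections and 7 <= len(headers) <= 9
```

A check like this could gate silver labels before they ever reach a labeler, leaving only sub-skill (2), factual grounding, for human rating.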
Can Avocet produce training data for it?
Yes, with some extension. The full input prompt and raw output are already stored in Peregrine's `company_research` table (`raw_output` column). A new Avocet card type could present the brief to a labeler with section-by-section quality ratings (groundedness, relevance, talking-point specificity).

Suggested data collection approach:
`company_research.raw_output` rows from `staging.db` as silver labels; pair each with the reconstructed input prompt (available from the matching `jobs` row).

Related: Peregrine `scripts/company_research.py`; `company_research` table in `staging.db`