feat: writing style benchmark harness for local text-gen models #36

Open
opened 2026-04-22 07:06:45 -07:00 by pyr0ball · 0 comments
Owner

Summary

Build a benchmark harness that runs all available local text-gen models against a writing style evaluation corpus and scores them for voice match. Output is a ranked model table to inform fine-tune base selection.

Steps

1. Corpus collection

  • Gather 50-100 writing samples from existing sources: Reddit/Lemmy comments, Magpie drafts, technical writing, CLAUDE.md notes
  • Store as data/voice_corpus/ -- plain text files, one per sample
  • Tag by type: social_reply, technical, narrative

2. Prompt design

  • System: "You are a writing assistant. Match the voice, tone, and style of the samples below exactly."
  • User: samples + thread context + "Write a reply to this thread:"
  • Run each model at temp 0.7, top_p 0.9, 300 token max

3. Models to benchmark

  • All models registered in cf-orch (query /api/v1/models for current inventory)
  • Focus: 7B-13B range (fast enough for interactive use)

4. Scoring

  • Automated signals: avg sentence length delta, vocabulary overlap, em-dash frequency (lower = better), filler phrase detection ("delve", "certainly", "I apologize")
  • Human eval grid: print top 3 outputs per model for side-by-side review
  • Output: benchmark_results/voice_YYYY-MM-DD.md ranked table

Acceptance criteria

  • CLI: python scripts/benchmark_voice.py --models all --samples data/voice_corpus/
  • Produces ranked markdown table with sample outputs
  • Top model clearly identified for fine-tune ticket
## Summary Build a benchmark harness that runs all available local text-gen models against a writing style evaluation corpus and scores them for voice match. Output is a ranked model table to inform fine-tune base selection. ## Steps ### 1. Corpus collection - Gather 50-100 writing samples from existing sources: Reddit/Lemmy comments, Magpie drafts, technical writing, CLAUDE.md notes - Store as `data/voice_corpus/` -- plain text files, one per sample - Tag by type: social_reply, technical, narrative ### 2. Prompt design - System: "You are a writing assistant. Match the voice, tone, and style of the samples below exactly." - User: samples + thread context + "Write a reply to this thread:" - Run each model at temp 0.7, top_p 0.9, 300 token max ### 3. Models to benchmark - All models registered in cf-orch (query `/api/v1/models` for current inventory) - Focus: 7B-13B range (fast enough for interactive use) ### 4. Scoring - Automated signals: avg sentence length delta, vocabulary overlap, em-dash frequency (lower = better), filler phrase detection ("delve", "certainly", "I apologize") - Human eval grid: print top 3 outputs per model for side-by-side review - Output: `benchmark_results/voice_YYYY-MM-DD.md` ranked table ## Acceptance criteria - CLI: `python scripts/benchmark_voice.py --models all --samples data/voice_corpus/` - Produces ranked markdown table with sample outputs - Top model clearly identified for fine-tune ticket
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Circuit-Forge/avocet#36
No description provided.