feat: writing style benchmark harness for local text-gen models #36

New issue

Open

opened 2026-04-22 07:06:45 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-04-22 07:06:45 -07:00

Owner

Summary

Build a benchmark harness that runs all available local text-gen models against a writing style evaluation corpus and scores them for voice match. Output is a ranked model table to inform fine-tune base selection.

Steps

1. Corpus collection

Gather 50-100 writing samples from existing sources: Reddit/Lemmy comments, Magpie drafts, technical writing, CLAUDE.md notes
Store as data/voice_corpus/ -- plain text files, one per sample
Tag by type: social_reply, technical, narrative

2. Prompt design

System: "You are a writing assistant. Match the voice, tone, and style of the samples below exactly."
User: samples + thread context + "Write a reply to this thread:"
Run each model at temp 0.7, top_p 0.9, 300 token max

3. Models to benchmark

All models registered in cf-orch (query /api/v1/models for current inventory)
Focus: 7B-13B range (fast enough for interactive use)

4. Scoring

Automated signals: avg sentence length delta, vocabulary overlap, em-dash frequency (lower = better), filler phrase detection ("delve", "certainly", "I apologize")
Human eval grid: print top 3 outputs per model for side-by-side review
Output: benchmark_results/voice_YYYY-MM-DD.md ranked table

Acceptance criteria

CLI: python scripts/benchmark_voice.py --models all --samples data/voice_corpus/
Produces ranked markdown table with sample outputs
Top model clearly identified for fine-tune ticket

## Summary Build a benchmark harness that runs all available local text-gen models against a writing style evaluation corpus and scores them for voice match. Output is a ranked model table to inform fine-tune base selection. ## Steps ### 1. Corpus collection - Gather 50-100 writing samples from existing sources: Reddit/Lemmy comments, Magpie drafts, technical writing, CLAUDE.md notes - Store as `data/voice_corpus/` -- plain text files, one per sample - Tag by type: social_reply, technical, narrative ### 2. Prompt design - System: "You are a writing assistant. Match the voice, tone, and style of the samples below exactly." - User: samples + thread context + "Write a reply to this thread:" - Run each model at temp 0.7, top_p 0.9, 300 token max ### 3. Models to benchmark - All models registered in cf-orch (query `/api/v1/models` for current inventory) - Focus: 7B-13B range (fast enough for interactive use) ### 4. Scoring - Automated signals: avg sentence length delta, vocabulary overlap, em-dash frequency (lower = better), filler phrase detection ("delve", "certainly", "I apologize") - Human eval grid: print top 3 outputs per model for side-by-side review - Output: `benchmark_results/voice_YYYY-MM-DD.md` ranked table ## Acceptance criteria - CLI: `python scripts/benchmark_voice.py --models all --samples data/voice_corpus/` - Produces ranked markdown table with sample outputs - Top model clearly identified for fine-tune ticket