[New Feature] Email classifier benchmark harness — evaluate adapter models against labeled gold set #6

Open
opened 2026-03-04 09:31:51 -08:00 by pyr0ball · 0 comments
Owner

Overview

Design doc: docs/plans/2026-02-26-email-classifier-benchmark-design.md
Plan doc: docs/plans/2026-02-26-email-classifier-benchmark-plan.md

Implement a benchmark harness that:

  1. Loads a labeled gold set (JSONL with ground-truth labels)
  2. Runs each configured classifier adapter (rule-based, zero-shot, fine-tuned) against the gold set
  3. Reports per-label precision/recall/F1 and macro averages
  4. Saves results to data/benchmark_results/ with timestamp
  5. Supports ./manage.sh benchmark and ./manage.sh compare CLI commands

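The metrics in steps 2–3 can be sketched as a small scoring helper. This is an illustrative sketch only, not the actual `scripts/benchmark.py` implementation; the function name `per_label_metrics` and the report dict shape are assumptions:

```python
from collections import Counter

def per_label_metrics(gold, predicted, labels):
    """Per-label precision/recall/F1 plus macro averages.

    gold, predicted: parallel lists of label strings (one entry per email).
    labels: the full label set to report on.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # true label was missed
    report = {}
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[label] = {"precision": prec, "recall": rec, "f1": f1}
    # Macro average: unweighted mean over labels.
    report["macro"] = {
        k: sum(report[l][k] for l in labels) / len(labels)
        for k in ("precision", "recall", "f1")
    }
    return report
```

Each adapter's predictions over the gold set would be fed through this once, yielding one report per adapter for the results table.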
Key Files

  • scripts/benchmark.py — main harness entry point
  • scripts/classifier_adapters.py — adapter interface (already started)
  • tests/test_benchmark.py — TDD tests

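Since `scripts/classifier_adapters.py` is only started, the adapter interface below is a hypothetical sketch of the shape the harness could program against; the class names, the `classify` signature, and the `"other"` fallback label are all assumptions, not the repo's actual API:

```python
from abc import ABC, abstractmethod

class ClassifierAdapter(ABC):
    """Common interface for rule-based, zero-shot, and fine-tuned adapters (sketch)."""

    name: str = "base"

    @abstractmethod
    def classify(self, email_text: str) -> str:
        """Return a single predicted label for one email."""

class RuleBasedAdapter(ClassifierAdapter):
    """Toy rule-based adapter: first matching substring wins."""

    name = "rule-based"

    def __init__(self, rules):
        # rules: ordered list of (lowercase substring, label) pairs
        self.rules = rules

    def classify(self, email_text: str) -> str:
        text = email_text.lower()
        for needle, label in self.rules:
            if needle in text:
                return label
        return "other"  # assumed fallback label
```

A uniform `classify` method lets the benchmark loop treat all three adapter types identically when iterating over the gold set.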
Acceptance Criteria

  • ./manage.sh benchmark runs all adapters and prints a results table
  • ./manage.sh compare diffs two benchmark result files
  • ./manage.sh score prints current model scores
  • Results saved as JSON with metadata (date, model version, dataset size)
  • All tests pass
pyr0ball added the
enhancement
label 2026-03-14 16:44:46 -07:00
pyr0ball added this to the Alpha — Label Tool milestone 2026-03-14 16:44:49 -07:00
pyr0ball modified the milestone from Alpha — Label Tool to Beta — Benchmark Harness 2026-03-14 16:45:03 -07:00
pyr0ball added this to the The Menagerie project 2026-03-14 16:45:06 -07:00
pyr0ball self-assigned this 2026-03-14 16:45:07 -07:00
Reference: Circuit-Forge/avocet#6