[New Feature] Email classifier benchmark harness — evaluate adapter models against labeled gold set #6
Reference: Circuit-Forge/avocet#6
Overview
Design doc: `docs/plans/2026-02-26-email-classifier-benchmark-design.md`
Plan doc: `docs/plans/2026-02-26-email-classifier-benchmark-plan.md`
Implement a benchmark harness that:
- saves results to `data/benchmark_results/` with a timestamp
- exposes `./manage.sh benchmark` and `./manage.sh compare` CLI commands
Key Files
- `scripts/benchmark.py`: main harness entry point
- `scripts/classifier_adapters.py`: adapter interface (already started)
- `tests/test_benchmark.py`: TDD tests
Acceptance Criteria
- `./manage.sh benchmark` runs all adapters and prints a results table
- `./manage.sh compare` diffs two benchmark result files
- `./manage.sh score` prints current model scores
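To make the acceptance criteria concrete, here is a minimal sketch of what the harness core might look like. Everything in it is an assumption for illustration: the function names (`run_benchmark`, `save_results`, `compare_results`, `format_table`), the adapter shape (a plain callable from email text to label), accuracy as the only metric, and the JSON file layout are hypothetical and are not taken from the design doc or from `scripts/classifier_adapters.py`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes: an adapter maps email text to a predicted label,
# and the gold set is a list of (email_text, expected_label) pairs.
Adapter = Callable[[str], str]
GoldSet = List[Tuple[str, str]]


def run_benchmark(adapters: Dict[str, Adapter], gold: GoldSet) -> Dict[str, float]:
    """Return accuracy per adapter over the gold set (assumed metric)."""
    scores: Dict[str, float] = {}
    for name, classify in adapters.items():
        correct = sum(1 for text, label in gold if classify(text) == label)
        scores[name] = correct / len(gold) if gold else 0.0
    return scores


def save_results(scores: Dict[str, float], out_dir: Path) -> Path:
    """Write scores to a timestamped JSON file (assumed naming scheme)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"benchmark_{stamp}.json"
    path.write_text(json.dumps({"timestamp": stamp, "scores": scores}, indent=2))
    return path


def compare_results(old_path: Path, new_path: Path) -> Dict[str, float]:
    """Return new-minus-old accuracy for adapters present in both files."""
    old = json.loads(old_path.read_text())["scores"]
    new = json.loads(new_path.read_text())["scores"]
    return {name: new[name] - old[name] for name in sorted(old.keys() & new.keys())}


def format_table(scores: Dict[str, float]) -> str:
    """Render a plain-text results table, best adapter first."""
    width = max([len("adapter")] + [len(name) for name in scores])
    lines = [f"{'adapter':<{width}}  accuracy"]
    for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
        lines.append(f"{name:<{width}}  {acc:.3f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy gold set and toy adapters, for demonstration only.
    gold = [("your invoice is attached", "billing"), ("lunch tomorrow?", "other")]
    adapters = {
        "keyword": lambda text: "billing" if "invoice" in text else "other",
        "constant": lambda text: "other",
    }
    scores = run_benchmark(adapters, gold)
    print(format_table(scores))
    with tempfile.TemporaryDirectory() as tmp:
        print("saved to", save_results(scores, Path(tmp)).name)
```

In this sketch `./manage.sh benchmark` would correspond to `run_benchmark` followed by `format_table` and `save_results`, and `./manage.sh compare` to `compare_results` on two saved files. The second-resolution timestamp means two runs within the same second would overwrite each other, so a real implementation may want a finer-grained or unique file name.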