Backend (app/api.py):
- GET /api/benchmark/models — returns installed models grouped by adapter
type (ZeroShotAdapter, RerankerAdapter, GenerationAdapter, Unknown);
reads _MODELS_DIR via app.models so test overrides are respected
- GET /api/benchmark/run — add model_names query param (comma-separated);
when set, passes --models <names...> to benchmark_classifier.py
- GET /api/stats — add benchmark_results field from benchmark_results.json
Frontend:
- BenchmarkView: collapsible Model Selection panel with per-category
checkboxes, select-all per category (supports indeterminate state),
collapsed summary badge ("All models (N)" or "N of M selected");
model_names only sent when a strict subset is selected
- StatsView: Benchmark Results table (accuracy, macro_f1, weighted_f1)
with best-model highlighting per metric; hidden when no results exist
Adds optional failure_category to SubmitRequest and candidate records so
reviewers can classify why a model response was wrong, not just what to do
with it. Enables the fine-tune harness to filter training data by failure
type (e.g. exclude scoring artifacts, train only on genuine wrong answers).
Taxonomy: scoring_artifact | style_violation | partial_answer |
wrong_answer | format_error | hallucination
- app/sft.py: FailureCategory Literal type; SubmitRequest.failure_category;
stored on candidate record in POST /submit correct branch
- tests/test_sft.py: 3 new tests (stores value, null round-trip, 422 on invalid)
- stores/sft.ts: SftFailureCategory type exported; SftQueueItem + SftLastAction
updated; setLastAction accepts optional category param
- SftCard.vue: chip-group selector shown during correct/discard/flag flow;
two-step confirm for discard/flag reveals chips before emitting; category
forwarded in all emit payloads
- CorrectionsView.vue: handleCorrect/Discard/Flag accept and forward category
to POST /api/sft/submit body and store.setLastAction
- SftCard.test.ts: 11 new tests covering chip visibility, selection,
single-active enforcement, pending-action flow, emit payloads, cancel
- manage.sh: dev command starts uvicorn --reload on :8503 and Vite dev
server (auto-port from 5173); kills API on EXIT/INT/TERM trap
- manage.sh: ENV_UI defaults to 'cf' env (overridable via AVOCET_ENV)
- vite.config.ts: add server.proxy to forward /api to :8503 so Vite
dev server can reach the backend without CORS issues
- sft.py GET /config: use `or {}` guard so `sft: ~` (null YAML) doesn't
return None instead of the default empty config
- CorrectionsView: convert handleCorrect/Discard/Flag and handleUndo from
optimistic to pessimistic — queue mutation only happens after server
confirms; failures leave item in queue so user can retry cleanly
- SettingsView: call loadSftConfig() on mount so saved bench_results_dir
is populated instead of always starting empty
- Add /corrections route to Vue router (lazy-loaded CorrectionsView)
- Add Corrections nav item (✍️) to AppSidebar after Benchmark
- Add cf-orch Integration section to SettingsView with bench_results_dir
field, run scanner, and per-run import table
- Add GET /api/sft/config and POST /api/sft/config endpoints to app/sft.py
- Add v-if guard on failure-reason <p> so null renders no element (not literal "null")
- Clarify mid-quality test description: score is 0.4 to <0.7 (exclusive upper bound)
- Add test: renders nothing for failure_reason when null (+1 → 14 SftCard tests)
Replace CSS keyframe dismiss classes and inline cardStyle/deltaX/deltaY
with useCardAnimation composable — pickup/setDragPosition/snapBack/animateDismiss
are now called from pointer event handlers and a dismissType watcher.
TDD: 8 tests written first (red), then composable implemented (green).
Adapts to Anime.js v4 API: 2-arg animate(), object-param spring(),
utils.set() for instant drag-position updates without cache desync.
useLabelKeyboard now accepts labels as Label[] | (() => Label[]).
The keymap is rebuilt on every keypress from the getter result instead of
being captured once at construction time — so keys 1–9 now fire correctly
after the async /api/config/labels fetch completes.
LabelView passes () => labels.value so the reactive ref is read lazily.
New test: 'evaluates labels getter on each keypress' covers the async-load
scenario (empty list → no match; push a label → key fires).
Two bugs fixed:
1. Blank white page after vue SPA rebuild: browsers cached old index.html
referencing old asset hashes. Assets are deleted on rebuild, causing
404s for JS/CSS -> blank page. Fix: serve index.html with
Cache-Control: no-cache so browsers always fetch fresh HTML.
Hashed assets (/assets/chunk-abc123.js) remain cacheable forever.
2. Queue draining to empty on skip/discard: handleSkip and handleDiscard
never refilled the local queue buffer. After enough skips, store.current
went null and the empty state showed (blank-looking). Fix: both handlers
now call fetchBatch() when queue drops below 3, matching handleLabel.
Also: sync classifier_adapters LABELS to match current 10-label schema
(new_lead + hired, remove unrelated).
48 Python tests pass, 48 frontend tests pass.
Implements Task 13: LabelView.vue wires together the label store, API
fetch, card stack, bucket grid, keyboard shortcuts, haptics, motion
preference, and three easter egg badges (on-a-roll, speed round, fifty
deep). App.vue updated to mount LabelView and restore hacker-mode theme
on load. 3 new LabelView tests; all 48 tests pass, build clean.