feat: Corrections tab — SFT candidate import, review, and JSONL export #15

Merged
pyr0ball merged 99 commits from feat/sft-corrections into main 2026-04-08 22:19:01 -07:00
Owner

Summary

Implements avocet#14 "Import benchmark SFT candidates for labeling". Adds a full Corrections workflow to the Avocet SPA that pulls SFT candidates from cf-orch benchmark runs, surfaces them as reviewable cards, collects human corrections, and exports approved records as SFT-ready JSONL. Also retires the old Streamlit app.

Backend (app/sft.py + scripts/sft_import.py)

  • GET /api/sft/runs — discover importable benchmark result runs from cf-orch
  • POST /api/sft/import — import a run with JSONL deduplication on id field (streaming _read_existing_ids for memory efficiency)
  • GET /api/sft/queue — paginated queue of needs_review candidates
  • POST /api/sft/submit — approve (correct), discard, or flag a candidate; validated with Literal["correct","discard","flag"]
  • POST /api/sft/undo — restore the last submitted item to needs_review
  • GET /api/sft/export — NDJSON streaming export of approved records
  • GET /api/sft/stats — counts by status using shared _is_exportable() predicate
  • GET /api/sft/config + POST /api/sft/config — read/write bench_results_dir with atomic file write (tmp + rename); null-safe YAML section handling
  • Module-level globals as testability seams (set_sft_data_dir / set_sft_config_dir)
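
The import deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code in `app/sft.py`: the function names, signatures, and the choice to append to a single JSONL file are assumptions. Only the ideas of deduplicating on `id` and reading existing ids line by line come from the description.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def read_existing_ids(path: Path) -> Iterator[str]:
    """Yield the id of each record already on disk, one line at a time."""
    if not path.exists():
        return
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "id" in record:
                yield record["id"]


def import_candidates(records: Iterable[dict], dest: Path) -> int:
    """Append records whose id is not yet in dest; return how many were added.

    Hypothetical helper: only the set of ids is held in memory,
    never the full records that are already stored.
    """
    seen = set(read_existing_ids(dest))
    added = 0
    with dest.open("a", encoding="utf-8") as fh:
        for record in records:
            if record["id"] in seen:
                continue  # duplicate of a stored or earlier record
            fh.write(json.dumps(record) + "\n")
            seen.add(record["id"])
            added += 1
    return added
```

Re-importing the same run is then a no-op, because every id is already in `seen`.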

Frontend

  • stores/sft.ts — Pinia store: queue, current (computed), lastAction, removeCurrentFromQueue, restoreItem
  • SftCorrectionArea.vue — inline correction textarea with expose/reset, accessible aria-describedby guard
  • SftCard.vue — quality chip (low/mid/ok), collapsible prompt, action buttons, failure_reason null guard
  • useSftKeyboard.ts — keyboard shortcuts (c/d/f/Escape) with input/textarea focus guards
  • CorrectionsView.vue — main review page with pessimistic submit/undo (rollback-safe), undo toast, stats sidebar
  • SettingsView.vue — SFT Integration section with bench_results_dir input, run picker table, import UI; loads saved config on mount
  • Router and AppSidebar wired for new Corrections tab

Retired

  • Streamlit app (streamlit_app.py, streamlit_requirements.txt, run_streamlit.sh) removed

Tests

  • 134 tests total passing (25 new test_sft.py + 7 new test_sft_import.py)

Test plan

  • Start avocet dev server, navigate to Corrections tab
  • In Settings → SFT Integration: set bench_results_dir and verify it persists across page reload
  • Import a benchmark run and verify candidates appear in the queue
  • Approve a candidate with correction text, verify it exports via GET /api/sft/export
  • Discard a candidate, verify undo restores it
  • Flag a candidate, verify it appears in stats as model_rejected
  • Keyboard shortcuts (c/d/f/Escape) work on the review card
  • Confirm Streamlit references are gone from manage.sh and docs
pyr0ball added 98 commits 2026-04-08 18:54:21 -07:00
- useApiFetch: typed fetch wrapper with network/http error discrimination
- useMotion: reactive localStorage override for rich-animation toggle, respects OS prefers-reduced-motion
- useHaptics: label/discard/skip/undo vibration patterns, gated on rich mode
- useKonamiCode + useHackerMode: 10-key Konami sequence → hacker theme, persisted in localStorage
- test-setup.ts: jsdom matchMedia stub so useMotion imports cleanly in Vitest
- smoke.test.ts: import smoke tests for all 4 composables (12 tests, all passing)
Implements Task 13: LabelView.vue wires together the label store, API
fetch, card stack, bucket grid, keyboard shortcuts, haptics, motion
preference, and three easter egg badges (on-a-roll, speed round, fifty
deep). App.vue updated to mount LabelView and restore hacker-mode theme
on load. 3 new LabelView tests; all 48 tests pass, build clean.
- Add _item_id() (content hash) + _normalize() to map legacy JSONL fields
  (from_addr/account/no-id) to Vue schema (from/source/id)
- All mutating endpoints now look up by _normalize(x)[id] — handles both
  stored-id (test fixtures) and content-hash (real data) transparently
- Change uvicorn bind from 127.0.0.1 to 0.0.0.0 so LAN clients can connect
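
A rough sketch of the normalization idea in this commit, under stated assumptions: the fields fed into the hash (`subject` and `body`) and the use of MD5 are guesses, and the function names drop the leading underscore to mark them as hypothetical. Only the field mapping (`from_addr` to `from`, `account` to `source`, missing `id` to a content hash) comes from the commit message.

```python
import hashlib


def item_id(item: dict) -> str:
    """Derive a stable id from content (the hashed fields are an assumption)."""
    key = f"{item.get('subject', '')}\n{item.get('body', '')}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def normalize(item: dict) -> dict:
    """Return a copy that uses the new field names.

    A stored id is kept as-is; rows without one get a content hash.
    """
    out = {k: v for k, v in item.items() if k not in ("from_addr", "account")}
    out["from"] = item.get("from", item.get("from_addr", ""))
    out["source"] = item.get("source", item.get("account", ""))
    out["id"] = item.get("id") or item_id(item)
    return out
```

Because lookups go through `normalize(x)["id"]`, rows with a stored id and rows that only have content resolve through the same code path.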
Two bugs fixed:

1. Blank white page after Vue SPA rebuild: browsers cached old index.html
   referencing old asset hashes. Assets are deleted on rebuild, causing
   404s for JS/CSS -> blank page. Fix: serve index.html with
   Cache-Control: no-cache so browsers always fetch fresh HTML.
   Hashed assets (/assets/chunk-abc123.js) remain cacheable forever.

2. Queue draining to empty on skip/discard: handleSkip and handleDiscard
   never refilled the local queue buffer. After enough skips, store.current
   went null and the empty state showed (blank-looking). Fix: both handlers
   now call fetchBatch() when queue drops below 3, matching handleLabel.

Also: sync classifier_adapters LABELS to match current 10-label schema
(add new_lead and hired, remove unrelated).

48 Python tests pass, 48 frontend tests pass.
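
The caching rule from the first fix can be sketched as a small header-selection function. This is an illustration only: the function name, the `/assets/` prefix check, and the exact long-lived header value are assumptions. The commit states only that `index.html` is served with `Cache-Control: no-cache` and that hashed assets stay cacheable.

```python
def cache_headers(path: str) -> dict[str, str]:
    """Pick a Cache-Control header for a path served by the SPA host."""
    if path.startswith("/assets/"):
        # Content-hashed filenames change on every rebuild, so a cached
        # copy can never be stale. The header value is an assumption.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # index.html must be revalidated so it never points at deleted hashes.
    return {"Cache-Control": "no-cache"}
```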
useLabelKeyboard now accepts labels as Label[] | (() => Label[]).
The keymap is rebuilt on every keypress from the getter result instead of
being captured once at construction time — so keys 1–9 now fire correctly
after the async /api/config/labels fetch completes.

LabelView passes () => labels.value so the reactive ref is read lazily.

New test: 'evaluates labels getter on each keypress' covers the async-load
scenario (empty list → no match; push a label → key fires).
TDD: 8 tests written first (red), then composable implemented (green).
Adapts to Anime.js v4 API: 2-arg animate(), object-param spring(),
utils.set() for instant drag-position updates without cache desync.
Replace CSS keyframe dismiss classes and inline cardStyle/deltaX/deltaY
with useCardAnimation composable — pickup/setDragPosition/snapBack/animateDismiss
are now called from pointer event handlers and a dismissType watcher.
Implements load_and_prepare_data (JSONL ingestion with class filtering),
compute_class_weights (inverse-frequency, div-by-zero safe), compute_metrics_for_trainer
(macro F1 + accuracy), and WeightedTrainer.compute_loss (**kwargs-safe for
Transformers 4.38+ num_items_in_batch). All 12 tests pass.
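
For reference, inverse-frequency class weighting with a zero-count guard might look like the sketch below. The name `inverse_frequency_weights`, the integer-label input, the "balanced" normalization `total / (n_classes * count)`, and the choice of weight 0.0 for absent classes are all assumptions, not the actual `compute_class_weights` implementation.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[int], n_classes: int) -> list[float]:
    """Weight each class by total / (n_classes * count).

    Classes that never appear get weight 0.0 instead of dividing by zero.
    """
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for cls in range(n_classes):
        count = counts.get(cls, 0)
        weights.append(total / (n_classes * count) if count else 0.0)
    return weights
```

Rare classes receive larger weights, so their errors contribute more to a weighted loss.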
- load_and_prepare_data() now accepts Path | list[Path]; single-Path callers unchanged
- Dedup by MD5(subject + body[:100]); last file/row wins (lets later runs correct labels)
- Prints summary line when duplicates are dropped
- Added _EmailDataset (TorchDataset wrapper), run_finetune(), and argparse CLI
- run_finetune() saves model + tokenizer + training_info.json with score_files provenance
- Stratified split guard: val set size clamped to at least n_classes (handles tiny example data)
- 3 new unit tests (merge, last-write-wins dedup, single-Path compat) + 1 integration test
- All 16 tests pass (15 unit + 1 integration)
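
The last-write-wins merge could be sketched as below. The key follows the commit message (MD5 over `subject` plus the first 100 characters of `body`). The function names, the in-memory list-of-lists input, and the returned duplicate count are assumptions made for illustration.

```python
import hashlib


def dedup_key(row: dict) -> str:
    """Hash subject plus the first 100 characters of body."""
    text = row.get("subject", "") + row.get("body", "")[:100]
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def merge_rows(files: list[list[dict]]) -> tuple[list[dict], int]:
    """Merge rows from several files; return (rows, duplicates_dropped).

    Later files and rows overwrite earlier ones with the same key,
    so a later labeling run can correct an earlier label.
    """
    merged: dict[str, dict] = {}
    total = 0
    for rows in files:
        for row in rows:
            total += 1
            merged[dedup_key(row)] = row
    return list(merged.values()), total - len(merged)
```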
- deberta-small: batch_size 16→8 + grad_accum 1→2 (same effective batch),
  gradient_checkpointing=True (fp16 stays off: DeBERTa v3 disentangled
  attention overflows fp16 at the gather step)
- api: _best_cuda_device() picks highest free-VRAM GPU via nvidia-smi;
  sets CUDA_VISIBLE_DEVICES in subprocess env to prevent DataParallel
  replication across both GPUs; adds PYTORCH_ALLOC_CONF=expandable_segments
- SSE log now reports which GPU was selected
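
A sketch of the GPU selection idea, assuming `nvidia-smi` is queried with its CSV output options. The helper names, the 5-second timeout, and the fallback to `None` when the tool is missing are assumptions, and the real `_best_cuda_device()` may differ.

```python
import os
import subprocess
from typing import Optional

QUERY = ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"]


def parse_free_vram(output: str) -> Optional[int]:
    """Return the index of the GPU with the most free memory, or None."""
    best_index, best_free = None, -1
    for line in output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # skip malformed or non-numeric rows
        index, free = int(parts[0]), int(parts[1])
        if free > best_free:
            best_index, best_free = index, free
    return best_index


def best_cuda_device() -> Optional[int]:
    """Query nvidia-smi; return None if it is unavailable or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True,
                                timeout=5, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_free_vram(result.stdout)


def child_env(index: int) -> dict[str, str]:
    """Environment for the training subprocess, pinned to a single GPU."""
    return {**os.environ, "CUDA_VISIBLE_DEVICES": str(index)}
```

Pinning the child process to one device means the training script sees only that GPU, which is what prevents multi-GPU replication.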
Reentrant gradient checkpointing (the default) conflicts with Accelerate's
gradient accumulation context manager -- causes 'backward through graph a
second time' on the first training step. use_reentrant=False uses the
non-reentrant autograd hook path which is compatible with Accelerate >= 0.27.
Adds POST /api/benchmark/cancel and POST /api/finetune/cancel endpoints
that terminate the running subprocess (kill on 3s timeout), and updates
the run generators to emit a cancelled SSE event instead of error when
the job was intentionally stopped.
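
The terminate-then-kill pattern described here can be sketched with the standard library alone. The function name and return value are assumptions; the 3-second grace period comes from the commit message.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 3.0) -> int:
    """Ask proc to exit; force-kill it if still alive after the grace period."""
    if proc.poll() is not None:
        return proc.returncode  # already finished
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored the polite request, so escalate.
        proc.kill()
        return proc.wait()
```

Catching only `subprocess.TimeoutExpired` keeps unrelated failures visible instead of swallowing them.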
- except clause in cancel_benchmark/cancel_finetune narrowed from Exception
  to _subprocess.TimeoutExpired (C1)
- _cancelled_jobs.discard() called after registering new proc to prevent
  a stale flag from a prior run masking errors (I2)
- local `import subprocess` removed from run_benchmark and
  run_finetune_endpoint; all Popen calls updated to _subprocess.Popen (I1)
- test patch targets updated from subprocess.Popen to app.api._subprocess.Popen;
  cancelled-event tests updated to set flag in proc.wait() side-effect so
  the discard-on-new-run logic is exercised correctly
- Delete app/label_tool.py (Streamlit UI retired; Vue SPA is sole UI)
- Extract _strip_html and _extract_body into app/utils.py (stdlib-only, reusable)
- Update tests/test_label_tool.py import to app.utils
- Rename start-api/stop-api/restart-api/open-api → start/stop/restart/open in manage.sh
- Remove STREAMLIT variable and all Streamlit-specific case blocks from manage.sh
- Update manage.sh usage section to reflect Vue+FastAPI-only commands
- Add data/sft_candidates.jsonl and data/sft_approved.jsonl to .gitignore
- Add sft.bench_results_dir key to config/label_tool.yaml.example
Rename _strip_html/_extract_body to strip_html/extract_body (public API).
Remove duplicate _TextExtractor, strip_html, and _extract_body from
imap_fetch.py; import from app.utils instead. Update test_label_tool.py
to use the new public names.
- Add v-if guard on failure-reason <p> so null renders no element (not literal "null")
- Clarify mid-quality test description: score is 0.4 to <0.7 (exclusive upper bound)
- Add test: renders nothing for failure_reason when null (+1 → 14 SftCard tests)
- Add /corrections route to Vue router (lazy-loaded CorrectionsView)
- Add Corrections nav item (✍️) to AppSidebar after Benchmark
- Add cf-orch Integration section to SettingsView with bench_results_dir
  field, run scanner, and per-run import table
- Add GET /api/sft/config and POST /api/sft/config endpoints to app/sft.py
- sft.py GET /config: use `or {}` guard so `sft: ~` (null YAML) doesn't
  return None instead of the default empty config
- CorrectionsView: convert handleCorrect/Discard/Flag and handleUndo from
  optimistic to pessimistic — queue mutation only happens after server
  confirms; failures leave item in queue so user can retry cleanly
- SettingsView: call loadSftConfig() on mount so saved bench_results_dir
  is populated instead of always starting empty
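
The config handling in this PR (the `or {}` guard for a null `sft:` section, plus the tmp-and-rename write mentioned in the summary) might look like the sketch below. It operates on an already-parsed dict and on plain text so that it stays standard-library only. The function names and signatures are assumptions, not the code in `app/sft.py`.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def sft_section(config: Optional[dict]) -> dict:
    """Return the sft section, treating a missing or null section as empty.

    YAML such as `sft: ~` parses to {"sft": None}, so .get("sft", {})
    alone would still return None.
    """
    return (config or {}).get("sft") or {}


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename over path."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name,
                                    suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp_name, path)  # readers see the old or new file, never half
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise
```

Creating the temp file next to the target keeps both on the same filesystem, which the rename needs in order to be atomic.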
pyr0ball added 1 commit 2026-04-08 19:43:42 -07:00
- manage.sh: dev command starts uvicorn --reload on :8503 and Vite dev
  server (auto-port from 5173); kills API on EXIT/INT/TERM trap
- manage.sh: ENV_UI defaults to 'cf' env (overridable via AVOCET_ENV)
- vite.config.ts: add server.proxy to forward /api to :8503 so Vite
  dev server can reach the backend without CORS issues
pyr0ball merged commit c5eaacc767 into main 2026-04-08 22:19:01 -07:00
Reference: Circuit-Forge/avocet#15