avocet

Author	SHA1	Message	Date
pyr0ball	49ec85706c	Merge pull request 'feat: benchmark model picker, category grouping, stats benchmark results' (#20) from feat/benchmark-model-picker into main	2026-04-08 23:07:10 -07:00
pyr0ball	478a47f6e0	Merge pull request 'feat: HuggingFace model management tab' (#19) from feat/hf-model-queue into main	2026-04-08 23:06:54 -07:00
pyr0ball	7c304ebc45	feat: benchmark model picker, category grouping, stats benchmark results Backend (app/api.py): - GET /api/benchmark/models — returns installed models grouped by adapter type (ZeroShotAdapter, RerankerAdapter, GenerationAdapter, Unknown); reads _MODELS_DIR via app.models so test overrides are respected - GET /api/benchmark/run — add model_names query param (comma-separated); when set, passes --models <names...> to benchmark_classifier.py - GET /api/stats — add benchmark_results field from benchmark_results.json Frontend: - BenchmarkView: collapsible Model Selection panel with per-category checkboxes, select-all per category (supports indeterminate state), collapsed summary badge ("All models (N)" or "N of M selected"); model_names only sent when a strict subset is selected - StatsView: Benchmark Results table (accuracy, macro_f1, weighted_f1) with best-model highlighting per metric; hidden when no results exist	2026-04-08 23:03:56 -07:00
pyr0ball	b6b3d2c390	feat: HuggingFace model management tab - New /api/models router: HF lookup, approval queue (JSONL persistence), SSE download progress via snapshot_download(), installed model listing, path-traversal-safe DELETE - pipeline_tag → adapter type mapping (zero-shot-classification, sentence-similarity, text-generation) - 27 tests covering all endpoints, duplicate detection, path traversal - ModelsView.vue: HF lookup + add, approval queue, live download progress bars via SSE, installed model table with delete - Sidebar entry (🤗 Models) between Benchmark and Corrections	2026-04-08 22:32:35 -07:00
pyr0ball	a7cb3ae62a	Merge pull request 'feat: SFT failure_category — classify why a model response was wrong' (#17) from feat/sft-failure-category into main	2026-04-08 22:19:20 -07:00
pyr0ball	c5eaacc767	Merge pull request 'feat: Corrections tab — SFT candidate import, review, and JSONL export' (#15) from feat/sft-corrections into main	2026-04-08 22:19:01 -07:00
pyr0ball	9633d9a535	feat: add failure_category field to SFT corrections (#16) Adds optional failure_category to SubmitRequest and candidate records so reviewers can classify why a model response was wrong, not just what to do with it. Enables the fine-tune harness to filter training data by failure type (e.g. exclude scoring artifacts, train only on genuine wrong answers). Taxonomy: scoring_artifact | style_violation | partial_answer | wrong_answer | format_error | hallucination - app/sft.py: FailureCategory Literal type; SubmitRequest.failure_category; stored on candidate record in POST /submit correct branch - tests/test_sft.py: 3 new tests (stores value, null round-trip, 422 on invalid) - stores/sft.ts: SftFailureCategory type exported; SftQueueItem + SftLastAction updated; setLastAction accepts optional category param - SftCard.vue: chip-group selector shown during correct/discard/flag flow; two-step confirm for discard/flag reveals chips before emitting; category forwarded in all emit payloads - CorrectionsView.vue: handleCorrect/Discard/Flag accept and forward category to POST /api/sft/submit body and store.setLastAction - SftCard.test.ts: 11 new tests covering chip visibility, selection, single-active enforcement, pending-action flow, emit payloads, cancel	2026-04-08 22:10:26 -07:00
pyr0ball	f17aae3bd2	feat: add dev command for hot-reload (uvicorn --reload + Vite HMR) - manage.sh: dev command starts uvicorn --reload on :8503 and Vite dev server (auto-port from 5173); kills API on EXIT/INT/TERM trap - manage.sh: ENV_UI defaults to 'cf' env (overridable via AVOCET_ENV) - vite.config.ts: add server.proxy to forward /api to :8503 so Vite dev server can reach the backend without CORS issues	2026-04-08 19:43:40 -07:00
pyr0ball	09e334359f	fix: pessimistic submit/undo, config null-safe, load config on mount - sft.py GET /config: use `or {}` guard so `sft: ~` (null YAML) doesn't return None instead of the default empty config - CorrectionsView: convert handleCorrect/Discard/Flag and handleUndo from optimistic to pessimistic — queue mutation only happens after server confirms; failures leave item in queue so user can retry cleanly - SettingsView: call loadSftConfig() on mount so saved bench_results_dir is populated instead of always starting empty	2026-04-08 18:49:38 -07:00
pyr0ball	353d0a47a0	feat: Corrections tab — router, sidebar, settings, SFT config endpoints - Add /corrections route to Vue router (lazy-loaded CorrectionsView) - Add Corrections nav item (✍️) to AppSidebar after Benchmark - Add cf-orch Integration section to SettingsView with bench_results_dir field, run scanner, and per-run import table - Add GET /api/sft/config and POST /api/sft/config endpoints to app/sft.py	2026-04-08 18:29:22 -07:00
pyr0ball	e63d77127b	feat: CorrectionsView and useSftKeyboard composable	2026-04-08 15:26:13 -07:00
pyr0ball	03e5f9f9b4	fix: guard null failure_reason render, fix mid-quality test description - Add v-if guard on failure-reason <p> so null renders no element (not literal "null") - Clarify mid-quality test description: score is 0.4 to <0.7 (exclusive upper bound) - Add test: renders nothing for failure_reason when null (+1 → 14 SftCard tests)	2026-04-08 15:23:19 -07:00
pyr0ball	e16ea95dcc	fix: guard aria-describedby from rendering undefined string	2026-04-08 15:22:12 -07:00
pyr0ball	8873920b83	feat: SftCard — quality chip, prompt collapsible, action buttons, correction area slot	2026-04-08 15:19:37 -07:00
pyr0ball	2d939b77f9	feat: SftCorrectionArea — inline correction text area component	2026-04-08 15:16:45 -07:00
pyr0ball	137a9dbb8e	fix: nullable failure_reason, factory fixture for sft store tests	2026-04-08 15:14:29 -07:00
pyr0ball	9c11916d81	feat: useSftStore — SftQueueItem type and Pinia store	2026-04-08 15:11:17 -07:00
pyr0ball	b6d45c746c	fix: shared _is_exportable predicate, return type annotations on export/stats	2026-04-08 15:07:24 -07:00
pyr0ball	07807f0d05	feat: sft router — /export and /stats endpoints	2026-04-08 14:46:08 -07:00
pyr0ball	4ad2907ae8	fix: use Literal type for SubmitRequest.action field	2026-04-08 14:33:38 -07:00
pyr0ball	f19cab60f7	feat: sft router — /queue, /submit, /undo endpoints	2026-04-08 14:22:06 -07:00
pyr0ball	b330e84111	fix: sft router — yaml error handling, none filter, shared jsonl utils, fixture restore	2026-04-08 14:07:09 -07:00
pyr0ball	597ffc7324	feat: sft router skeleton — /api/sft/runs and /api/sft/import	2026-04-08 13:54:58 -07:00
pyr0ball	cfde474454	fix: log on malformed json in _read_jsonl, use streaming id dedup	2026-04-08 07:37:22 -07:00
pyr0ball	bbfae1a622	fix: log warning when sft record is missing id field	2026-04-08 07:30:46 -07:00
pyr0ball	03dac57fd9	feat: sft_import.py — run discovery and JSONL deduplication	2026-04-08 07:13:37 -07:00
pyr0ball	25880e377d	refactor: consolidate HTML extraction into app/utils.py Rename _strip_html/_extract_body to strip_html/extract_body (public API). Remove duplicate _TextExtractor, strip_html, and _extract_body from imap_fetch.py; import from app.utils instead. Update test_label_tool.py to use the new public names.	2026-04-08 06:52:15 -07:00
pyr0ball	ae0ac19505	chore: retire Streamlit app, scaffold sft branch - Delete app/label_tool.py (Streamlit UI retired; Vue SPA is sole UI) - Extract _strip_html and _extract_body into app/utils.py (stdlib-only, reusable) - Update tests/test_label_tool.py import to app.utils - Rename start-api/stop-api/restart-api/open-api → start/stop/restart/open in manage.sh - Remove STREAMLIT variable and all Streamlit-specific case blocks from manage.sh - Update manage.sh usage section to reflect Vue+FastAPI-only commands - Add data/sft_candidates.jsonl and data/sft_approved.jsonl to .gitignore - Add sft.bench_results_dir key to config/label_tool.yaml.example	2026-04-08 06:18:12 -07:00
pyr0ball	cfc09b4731	chore: gitignore CLAUDE.md and docs/superpowers (BSL 1.1 compliance)	2026-03-27 01:04:18 -07:00
pyr0ball	de2a2935b9	chore: gitignore CLAUDE.md and docs/superpowers (BSL 1.1 compliance)	2026-03-27 01:00:30 -07:00
pyr0ball	0d252da2a0	feat(avocet): add cancel buttons for benchmark and fine-tune runs	2026-03-15 18:15:35 -07:00
pyr0ball	e38a28dcc3	fix(avocet): narrow cancel except clause, clear stale cancel flags on new run - except clause in cancel_benchmark/cancel_finetune narrowed from Exception to _subprocess.TimeoutExpired (C1) - _cancelled_jobs.discard() called after registering new proc to prevent a stale flag from a prior run masking errors (I2) - local `import subprocess` removed from run_benchmark and run_finetune_endpoint; all Popen calls updated to _subprocess.Popen (I1) - test patch targets updated from subprocess.Popen to app.api._subprocess.Popen; cancelled-event tests updated to set flag in proc.wait() side-effect so the discard-on-new-run logic is exercised correctly	2026-03-15 18:13:01 -07:00
pyr0ball	0ab49609c0	feat(avocet): add cancel endpoints for benchmark and finetune jobs Adds POST /api/benchmark/cancel and POST /api/finetune/cancel endpoints that terminate the running subprocess (kill on 3s timeout), and updates the run generators to emit a cancelled SSE event instead of error when the job was intentionally stopped.	2026-03-15 18:09:20 -07:00
pyr0ball	db44c9323e	fix(avocet): use_reentrant=False for gradient checkpointing Reentrant gradient checkpointing (the default) conflicts with Accelerate's gradient accumulation context manager -- causes 'backward through graph a second time' on the first training step. use_reentrant=False uses the non-reentrant autograd hook path which is compatible with Accelerate >= 0.27.	2026-03-15 17:23:40 -07:00
pyr0ball	cbc382cc88	fix(avocet): reduce deberta-small VRAM + auto-select freest GPU for training - deberta-small: batch_size 16→8 + grad_accum 1→2 (same effective batch), gradient_checkpointing=True (fp16 stays off: DeBERTa v3 disentangled attention overflows fp16 at the gather step) - api: _best_cuda_device() picks highest free-VRAM GPU via nvidia-smi; sets CUDA_VISIBLE_DEVICES in subprocess env to prevent DataParallel replication across both GPUs; adds PYTORCH_ALLOC_CONF=expandable_segments - SSE log now reports which GPU was selected	2026-03-15 17:09:06 -07:00
pyr0ball	ed818dc341	feat(avocet): add restart-api command to manage.sh	2026-03-15 17:04:00 -07:00
pyr0ball	5d68b0706f	fix(avocet): use startsWith for error class in ft-log (consistent with benchmark log)	2026-03-15 16:14:47 -07:00
pyr0ball	65548f4ddb	feat(avocet): add fine-tune section and trained models badge row to BenchmarkView	2026-03-15 16:09:51 -07:00
pyr0ball	dd352f07cd	fix(avocet): _MODELS_DIR overridable in tests; sanitize score paths against path traversal	2026-03-15 16:07:27 -07:00
pyr0ball	903624a4b8	feat(avocet): add /api/finetune/status and /api/finetune/run endpoints	2026-03-15 16:04:34 -07:00
pyr0ball	48e02f2ed6	fix(avocet): move TorchDataset import to top; split sample_count into total+train	2026-03-15 16:02:43 -07:00
pyr0ball	939ce06f45	feat(avocet): run_finetune, CLI, multi-score-file merge with last-write-wins dedup - load_and_prepare_data() now accepts Path | list[Path]; single-Path callers unchanged - Dedup by MD5(subject + body[:100]); last file/row wins (lets later runs correct labels) - Prints summary line when duplicates are dropped - Added _EmailDataset (TorchDataset wrapper), run_finetune(), and argparse CLI - run_finetune() saves model + tokenizer + training_info.json with score_files provenance - Stratified split guard: val set size clamped to at least n_classes (handles tiny example data) - 3 new unit tests (merge, last-write-wins dedup, single-Path compat) + 1 integration test - All 16 tests pass (15 unit + 1 integration)	2026-03-15 15:52:41 -07:00
pyr0ball	4e70e79b26	fix(avocet): tighten body truncation test to exact 400-char assertion	2026-03-15 15:44:19 -07:00
pyr0ball	de5794611b	feat(avocet): add finetune data pipeline, class weights, WeightedTrainer Implements load_and_prepare_data (JSONL ingestion with class filtering), compute_class_weights (inverse-frequency, div-by-zero safe), compute_metrics_for_trainer (macro F1 + accuracy), and WeightedTrainer.compute_loss (**kwargs-safe for Transformers 4.38+ num_items_in_batch). All 12 tests pass.	2026-03-15 15:38:45 -07:00
pyr0ball	d1a36bfd63	fix(avocet): guard discover_finetuned_models against malformed/incomplete training_info.json	2026-03-15 15:18:13 -07:00
pyr0ball	df37a8e16d	feat(avocet): auto-discover fine-tuned models in benchmark harness	2026-03-15 11:59:13 -07:00
pyr0ball	179cb67e1c	fix(avocet): FineTunedAdapter GPU device routing + precise body truncation test	2026-03-15 10:56:47 -07:00
pyr0ball	dc321de59f	feat(avocet): add FineTunedAdapter for local checkpoint inference	2026-03-15 10:54:38 -07:00
pyr0ball	f4a654933d	chore(avocet): add scikit-learn to classifier env	2026-03-15 09:44:04 -07:00
pyr0ball	a53f3a7341	feat(avocet): benchmark UI, label fixes, BenchmarkView with charts and SSE run	2026-03-15 09:39:37 -07:00


116 commits