fix: 5 pre-existing test failures on main (models isolation, cforch return type, finetune GPU) #56

Closed

opened 2026-05-05 14:35:59 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-05-05 14:35:59 -07:00

Owner

Pre-existing failures identified during avocet#55 review

All 5 failures exist on main and predate the embedding k-NN work. Zero regressions were introduced by #55.

1-3. `test_models.py` — test isolation gap (3 failures)

Tests: `test_installed_empty`, `test_installed_detects_downloaded_model`, `test_installed_detects_finetuned_model`

Root cause: `list_installed()` in `app/models.py:896-898` scans both `_MODELS_DIR` and `_CF_TEXT_MODELS_DIR`. The `reset_models_globals` fixture redirects `_MODELS_DIR` to tmp, but there is no setter for `_CF_TEXT_MODELS_DIR`, so it keeps pointing at `/Library/Assets/LLM/cf-text/models` (15 real models on the dev machine). All three tests assert exact results (`== []` or `== 1`), so they fail as soon as those real models leak into the scan.

Fix: Add `set_cf_text_models_dir(path: Path) -> None` to `app/models.py`, mirroring the existing `set_models_dir()`. Update the `reset_models_globals` fixture to redirect both dirs, pointing `_CF_TEXT_MODELS_DIR` at a nonexistent tmp subpath so the scan skips it cleanly.

Effort: ~5 lines (1 setter + 2-line fixture update).
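A minimal sketch of the isolation fix, assuming the scan simply lists subdirectories and skips a directory that does not exist. The module body, the `redirect_model_dirs` helper, and the `cf-text-missing` subpath are illustrative stand-ins, not the actual contents of `app/models.py` or the fixture:

```python
from pathlib import Path

# Stand-ins for the module-level globals in app/models.py (assumed layout).
_MODELS_DIR = Path("models")
_CF_TEXT_MODELS_DIR = Path("/Library/Assets/LLM/cf-text/models")


def set_models_dir(path: Path) -> None:
    """Existing setter, reproduced here so the sketch is self-contained."""
    global _MODELS_DIR
    _MODELS_DIR = path


def set_cf_text_models_dir(path: Path) -> None:
    """Proposed setter: mirrors set_models_dir() for the cf-text directory."""
    global _CF_TEXT_MODELS_DIR
    _CF_TEXT_MODELS_DIR = path


def list_installed() -> list:
    """Simplified stand-in for the real scan: subdirectory names in both dirs."""
    found = []
    for root in (_MODELS_DIR, _CF_TEXT_MODELS_DIR):
        if not root.is_dir():
            continue  # a nonexistent directory contributes nothing
        found.extend(sorted(p.name for p in root.iterdir() if p.is_dir()))
    return found


def redirect_model_dirs(tmp_path: Path) -> None:
    """Hypothetical helper showing what reset_models_globals could do."""
    models = tmp_path / "models"
    models.mkdir()
    set_models_dir(models)
    # Deliberately never created, so the scan skips it.
    set_cf_text_models_dir(tmp_path / "cf-text-missing")
```

With both globals redirected, an exact-count assertion such as `list_installed() == []` no longer depends on what is installed on the machine running the tests.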

4. `test_cforch.py::test_results_returns_latest_summary`

Root cause: `get_results()` in `app/cforch.py:498` is annotated `-> list`, but the contents of `summary.json` parse to a dict. FastAPI validates the response against the annotation and raises `ResponseValidationError: Input should be a valid list`.

The endpoint name, docstring, and test all indicate that the intent is to return a single summary dict, so the `-> list` annotation is what is wrong.

Fix: Change `def get_results() -> list:` to `def get_results() -> dict:`.

Effort: Trivial (1 word).
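The failure mode can be reproduced without FastAPI. The sketch below is a stdlib-only imitation of return-annotation validation, not FastAPI's actual mechanism; `check_return`, `get_results_wrong`, and the `SUMMARY_JSON` contents are hypothetical:

```python
import json
from typing import get_type_hints

# Hypothetical summary.json contents; the real file's keys are not shown in this issue.
SUMMARY_JSON = '{"runs": 3, "passed": 3}'


def check_return(func):
    """Call func and check the result against its annotated return type.

    A rough imitation of validating a response against the annotation.
    """
    expected = get_type_hints(func)["return"]
    result = func()
    if not isinstance(result, expected):
        raise TypeError(f"Input should be a valid {expected.__name__}")
    return result


def get_results_wrong() -> list:
    # Annotation says list, but the parsed JSON object is a dict.
    return json.loads(SUMMARY_JSON)


def get_results() -> dict:
    # Annotation now matches what summary.json actually contains.
    return json.loads(SUMMARY_JSON)
```

Calling `check_return(get_results_wrong)` raises, while `check_return(get_results)` returns the parsed summary. The data never changes; only the annotation does.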

5. `test_finetune.py::test_integration_finetune_on_example_data`

Root cause: `torch.OutOfMemoryError: CUDA out of memory`. The GPU had only 10.5 MiB free while another process held 5.72 GiB of VRAM. The test passes when the GPU is idle and fails when cf-orch has a model loaded.

There is also a secondary correctness issue visible in the output:

classifier.bias | MISMATCH | ckpt: torch.Size([3]) vs model: torch.Size([2])

A checkpoint trained on 3 labels is being reloaded into a 2-label model. The OOM masks this during the current run, but it would surface as a classification error on a less-loaded machine.

Fix (environmental): Mark the test with `@pytest.mark.slow` or `@pytest.mark.gpu` and exclude that marker from the default `pytest` run. Run it explicitly, and only when the GPU is idle.
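One way to wire this up is an opt-in flag that skips `gpu`-marked tests by default, following the usual pytest pattern of skipping according to a command-line option. This is a sketch: the `--run-gpu` flag name is an assumption, and in a real project the three hooks would live in `conftest.py` while the marked test stays in `test_finetune.py`:

```python
import pytest


# --- conftest.py (hooks) ---
def pytest_addoption(parser):
    # Hypothetical opt-in flag; the name is not from the project.
    parser.addoption("--run-gpu", action="store_true", default=False,
                     help="run tests marked gpu")


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "gpu: needs an idle CUDA device")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-gpu"):
        return  # explicit opt-in: leave gpu tests runnable
    skip_gpu = pytest.mark.skip(reason="needs --run-gpu and an idle GPU")
    for item in items:
        if "gpu" in item.keywords:
            item.add_marker(skip_gpu)


# --- test_finetune.py (marked test) ---
@pytest.mark.gpu
def test_integration_finetune_on_example_data():
    ...  # existing test body unchanged
```

A plain `pytest` run then reports the test as skipped, and `pytest --run-gpu` runs it. Deselecting with `-m "not gpu"` in the default options would be an alternative with the same effect on the pass/fail count.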

Fix (correctness): Investigate the label count mismatch in the fine-tune checkpoint reload path. The checkpoint and the model must agree on `num_labels` before weights are loaded.
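While the root cause is being investigated, a guard along these lines could make the mismatch fail loudly before any weights are loaded. It is a torch-free sketch: `check_num_labels` is a hypothetical helper, and it takes a mapping of parameter names to shapes as plain tuples (`torch.Size` is a tuple subclass, so real tensor shapes would fit the same check):

```python
def check_num_labels(ckpt_shapes: dict, model_num_labels: int,
                     key: str = "classifier.bias") -> None:
    """Raise if the checkpoint's classifier head disagrees with the model.

    ckpt_shapes maps parameter names to shapes; the first dimension of the
    classifier bias is taken as the number of labels the checkpoint was
    trained with.
    """
    ckpt_num_labels = ckpt_shapes[key][0]
    if ckpt_num_labels != model_num_labels:
        raise ValueError(
            f"{key}: checkpoint has {ckpt_num_labels} labels, "
            f"model expects {model_num_labels}"
        )


# Shapes taken from the MISMATCH line in the test output.
try:
    check_num_labels({"classifier.bias": (3,)}, model_num_labels=2)
except ValueError as exc:
    print(exc)  # classifier.bias: checkpoint has 3 labels, model expects 2
```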

Effort: Skip marker is trivial; label mismatch needs investigation.

Labels

bug / test / good first issue (items 1-4 are mechanical fixes)
blocked:gpu (item 5 environmental OOM)
