# LLM Queue Optimizer — Design Spec

**Date:** 2026-03-14
**Branch:** `feature/llm-queue-optimizer`
**Closes:** [#2](https://git.opensourcesolarpunk.com/Circuit-Forge/peregrine/issues/2)
**Author:** pyr0ball

---

## Problem

On single-GPU and CPU-only systems, the background task runner spawns a daemon thread for every task immediately on submission. When a user approves N jobs at once, N threads race to load their respective LLM models simultaneously, causing repeated model swaps and significant latency overhead.

The root issue is that `submit_task()` is a spawn-per-task model with no scheduling layer. SQLite's `background_tasks` table is a status log, not a consumed work queue.

Additionally, on restart all `queued` tasks are cleared to `failed`, discarding pending work.

---

## Goals

- Eliminate unnecessary model switching by batching LLM tasks by type
- Allow concurrent model execution when VRAM permits multiple models simultaneously
- Preserve FIFO ordering within each task type
- Survive process restarts — `queued` tasks resume after restart
- Apply to all tiers (no tier gating)
- Keep non-LLM tasks (discovery, email sync, scrape, enrich) unaffected — they continue to spawn free threads

---

## Non-Goals

- Changing the LLM router fallback chain
- Adding new task types
- Tier gating on the scheduler
- Persistent task history in memory

---

## Architecture

### Task Classification

```python
LLM_TASK_TYPES = {"cover_letter", "company_research", "wizard_generate"}
```

All other task types (`discovery`, `email_sync`, `scrape_url`, `enrich_descriptions`, `enrich_craigslist`, `prepare_training`) bypass the scheduler and spawn free threads, unchanged from the current implementation.

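The routing guard itself is tiny. A minimal sketch, with `enqueue` and `spawn` standing in for the scheduler call and the existing thread-spawn path (both hypothetical names here):

```python
LLM_TASK_TYPES = {"cover_letter", "company_research", "wizard_generate"}

def route_task(task_type: str, enqueue, spawn) -> str:
    """Send LLM work to the scheduler; everything else spawns as before."""
    if task_type in LLM_TASK_TYPES:
        enqueue(task_type)
        return "scheduled"
    spawn(task_type)
    return "spawned"
```
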
### Component Overview

```
submit_task() ──→ TaskScheduler.enqueue(task_id, task_type, job_id, params)
                      │
                      ├── LLM task? ──→ per-type deque ──→ Scheduler loop
                      │                                        │
                      └── Non-LLM task? ──→ spawn thread (unchanged)
                                                               │
                      ┌────────────────────────────────────────┘
                      ▼
               Scheduling cycle
    (wakes on enqueue or batch completion)
                      │
      Clean up finished batches, release VRAM
                      │
      Sort eligible types by queue depth (desc)
                      │
               For each type:
      reserved_vram + budget[type] ≤ available_vram?
            │ yes                │ no
            ▼                    ▼
      Start batch worker    skip (wait for slot)
      (serial: one task at a time)
            │
      Batch worker signals done → scheduler re-evaluates
```

### New File: `scripts/task_scheduler.py`

**State:**

| Attribute | Type | Purpose |
|---|---|---|
| `_queues` | `dict[str, deque[TaskSpec]]` | Per-type pending task deques |
| `_active` | `dict[str, Thread]` | Currently running batch worker per type |
| `_reserved_vram` | `float` | Sum of VRAM budgets for active batches |
| `_available_vram` | `float` | Total VRAM from `get_gpus()`; 999.0 on CPU-only |
| `_lock` | `threading.Lock` | Protects all mutable scheduler state |
| `_wake` | `threading.Event` | Pulsed on enqueue or batch completion |
| `_stop` | `threading.Event` | Set by `shutdown()` to terminate the loop |

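A sketch of the constructor wiring this state together (VRAM discovery and the DB re-enqueue pass are elided; the constructor signature is an assumption):

```python
import threading
from collections import namedtuple

# One pending queue entry; mirrors the enqueue() arguments.
TaskSpec = namedtuple("TaskSpec", ["id", "job_id", "params"])

class TaskScheduler:
    def __init__(self, available_vram: float, budgets: dict):
        self._queues = {}                  # task_type -> deque[TaskSpec]
        self._active = {}                  # task_type -> running worker Thread
        self._reserved_vram = 0.0          # sum of budgets for active batches
        self._available_vram = available_vram  # total VRAM; 999.0 on CPU-only
        self._budgets = budgets
        self._lock = threading.Lock()      # guards all mutable state above
        self._wake = threading.Event()     # pulsed on enqueue / batch completion
        self._stop = threading.Event()     # set by shutdown()
```
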
**Scheduler loop:**

```python
while not _stop.is_set():
    _wake.wait(timeout=30)
    _wake.clear()

    with _lock:
        # Release finished batches (workers normally release themselves
        # in their finally block; this is the defensive backstop)
        for t, thread in list(_active.items()):
            if not thread.is_alive():
                _reserved_vram -= _budgets.get(t, DEFAULT_VRAM_BUDGETS.get(t, 0))
                del _active[t]

        # Start new batches where VRAM allows
        candidates = sorted(
            [t for t in _queues if _queues[t] and t not in _active],
            key=lambda t: len(_queues[t]),
            reverse=True,
        )
        for task_type in candidates:
            budget = _budgets.get(task_type, DEFAULT_VRAM_BUDGETS.get(task_type, 0))
            if _reserved_vram + budget <= _available_vram:
                thread = Thread(target=_batch_worker, args=(task_type,), daemon=True)
                _active[task_type] = thread
                _reserved_vram += budget
                thread.start()
```


**Batch worker:**

```python
def _batch_worker(task_type: str) -> None:
    try:
        while True:
            with _lock:
                if not _queues[task_type]:
                    break
                task = _queues[task_type].popleft()
            # Run outside the lock so enqueue() stays responsive
            _run_task(db_path, task.id, task_type, task.job_id, task.params)
    finally:
        with _lock:
            _active.pop(task_type, None)
            # Release the same budget the scheduler reserved (with the
            # same default fallback, so no VRAM is leaked)
            _reserved_vram -= _budgets.get(task_type, DEFAULT_VRAM_BUDGETS.get(task_type, 0))
            _wake.set()
```


Tasks arriving mid-batch for an already-active type are appended to the deque and picked up naturally by the running batch worker — no re-scheduling needed.

**Singleton access:**

```python
_scheduler: TaskScheduler | None = None

def get_scheduler(db_path: Path) -> TaskScheduler:
    global _scheduler
    if _scheduler is None:
        _scheduler = TaskScheduler(db_path)
        _scheduler.start()
    return _scheduler

def reset_scheduler() -> None:
    """Tear down and clear the singleton. Test teardown only."""
    global _scheduler
    if _scheduler:
        _scheduler.shutdown()
    _scheduler = None
```

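One caveat: as sketched, two threads hitting `get_scheduler()` on first use could each construct a scheduler. If that race matters, a module-level lock closes the window — a sketch with a generic `factory` standing in for `TaskScheduler(db_path)`:

```python
import threading

_scheduler = None
_scheduler_lock = threading.Lock()

def get_scheduler_locked(factory):
    """Create-once accessor; the lock prevents duplicate construction."""
    global _scheduler
    with _scheduler_lock:
        if _scheduler is None:
            _scheduler = factory()
    return _scheduler
```
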

### VRAM Budget Configuration

Declared in `config/llm.yaml` under a `scheduler:` key:

```yaml
scheduler:
  vram_budgets:
    cover_letter: 2.5      # alex-cover-writer:latest (~2GB GGUF + headroom)
    company_research: 5.0  # llama3.1:8b or vllm model
    wizard_generate: 2.5   # same model family as cover_letter
  max_queue_depth: 500
```

Defaults (used when the key is absent — backwards compatible with existing installs):

```python
DEFAULT_VRAM_BUDGETS = {
    "cover_letter": 2.5,
    "company_research": 5.0,
    "wizard_generate": 2.5,
}
```

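Merging a user-supplied `scheduler:` section over these defaults can be a plain dict overlay (YAML parsing elided; `cfg` stands in for the parsed `config/llm.yaml`, and `load_budgets` is an illustrative name):

```python
DEFAULT_VRAM_BUDGETS = {
    "cover_letter": 2.5,
    "company_research": 5.0,
    "wizard_generate": 2.5,
}

def load_budgets(cfg: dict) -> dict:
    # User-specified budgets override defaults; missing keys fall back,
    # so existing installs without a scheduler: section keep working.
    user = (cfg.get("scheduler") or {}).get("vram_budgets") or {}
    return {**DEFAULT_VRAM_BUDGETS, **user}
```
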
`_available_vram` is read from `preflight.get_gpus()` at scheduler startup (sum across all GPUs). CPU-only systems get `_available_vram = 999.0`, allowing all type batches to run concurrently — preserving existing behavior on CPU installs.

### Memory Safety

- **Batch worker `finally` block** — always releases `_reserved_vram` and fires `_wake`, even if `_run_task()` raises. Prevents permanently wedged VRAM reservations.
- **Scheduler loop reaps dead threads** — the `thread.is_alive()` check catches any worker that exits without firing `_wake` (defense in depth).
- **Max queue depth** — `enqueue()` rejects tasks past `max_queue_depth` with a logged warning. Prevents unbounded memory growth under pathological conditions.
- **No in-memory history** — completed/failed state lives exclusively in SQLite. Deques hold only pending `TaskSpec` namedtuples. Memory footprint is `O(pending tasks)`.
- **`reset_scheduler()`** — explicit teardown for test isolation. Sets the `_stop` event, joins the scheduler thread (with timeout), and clears the module-level reference.

---

## Changes to Existing Files

### `scripts/task_runner.py`

`submit_task()` becomes a thin shim:

```python
def submit_task(db_path, task_type, job_id=None, params=None):
    task_id, is_new = insert_task(db_path, task_type, job_id or 0, params=params)
    if is_new:
        from scripts.task_scheduler import get_scheduler
        get_scheduler(db_path).enqueue(task_id, task_type, job_id or 0, params)
    return task_id, is_new
```

`_run_task()` and all task handler branches remain unchanged.

### `scripts/db.py`

Add a `reset_running_tasks()` helper (alongside the existing `kill_stuck_tasks()`):

```python
def reset_running_tasks(db_path: Path = DEFAULT_DB) -> int:
    """On restart: mark in-flight tasks failed. Queued tasks are left for the scheduler."""
    conn = sqlite3.connect(db_path)
    count = conn.execute(
        "UPDATE background_tasks SET status='failed', error='Interrupted by restart',"
        " finished_at=datetime('now') WHERE status='running'"
    ).rowcount
    conn.commit()
    conn.close()
    return count
```

### `app/app.py`

Replace the `kill_stuck_tasks()` call with `reset_running_tasks()` on startup:

```python
# Before
kill_stuck_tasks(db_path)

# After — queued tasks survive for the scheduler to resume
reset_running_tasks(db_path)
# Scheduler reads surviving 'queued' rows during get_scheduler() startup
```

### `config/llm.yaml.example`

Add a `scheduler:` section documenting the VRAM budget keys.

---

## Data Model

No schema changes. The existing `background_tasks` table supports all scheduler needs:

| Column | Scheduler use |
|---|---|
| `task_type` | Queue routing |
| `status` | `queued` → pending; `running` → active; `completed`/`failed` → done |
| `created_at` | FIFO ordering within type |
| `params` | Passed through to `_run_task()` unchanged |

---

## Durability

On startup, `TaskScheduler.__init__()` queries:

```sql
SELECT id, task_type, job_id, params
FROM background_tasks
WHERE status = 'queued'
ORDER BY created_at ASC
```

LLM tasks are pushed onto their respective deques. Non-LLM `queued` rows (which, under the current model, would have been discarded on restart) are re-spawned as free threads.

`running` rows are reset to `failed` by `reset_running_tasks()` before the scheduler starts — their results are unknown, so those tasks must be re-submitted by the user.

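That recovery pass can be sketched as follows (in-memory SQLite for illustration; `enqueue` and `spawn` are stand-ins for the scheduler call and the free-thread path):

```python
import sqlite3

LLM_TASK_TYPES = {"cover_letter", "company_research", "wizard_generate"}

def recover_queued(conn: sqlite3.Connection, enqueue, spawn) -> int:
    # Surviving 'queued' rows, oldest first, so FIFO order is preserved.
    rows = conn.execute(
        "SELECT id, task_type, job_id, params FROM background_tasks"
        " WHERE status='queued' ORDER BY created_at ASC"
    ).fetchall()
    for task_id, task_type, job_id, params in rows:
        if task_type in LLM_TASK_TYPES:
            enqueue(task_id, task_type, job_id, params)  # back onto its deque
        else:
            spawn(task_id, task_type, job_id, params)    # free thread, as before
    return len(rows)
```
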
---

## Testing (`tests/test_task_scheduler.py`)

| Test | What it verifies |
|---|---|
| `test_llm_tasks_batch_by_type` | N cover_letter + M research enqueued; all cover_letters execute before any research when VRAM only fits one model |
| `test_fifo_within_type` | Arrival order preserved within a type batch |
| `test_concurrent_batches_when_vram_allows` | Two type batches start simultaneously when `available_vram` fits both budgets |
| `test_new_tasks_picked_up_mid_batch` | Task enqueued while a batch is active is consumed by the running worker |
| `test_worker_crash_releases_vram` | `_run_task` raises; `_reserved_vram` returns to 0; scheduler continues |
| `test_non_llm_tasks_bypass_scheduler` | `discovery`, `email_sync`, etc. spawn free threads; scheduler deques untouched |
| `test_durability_on_startup` | DB has existing `queued` rows; scheduler re-enqueues them on init |
| `test_running_rows_reset_on_startup` | `running` rows → `failed` via `reset_running_tasks()`; `queued` rows untouched |
| `test_max_queue_depth` | Enqueue past the limit logs a warning and does not crash |
| `test_reset_scheduler_cleans_up` | `reset_scheduler()` stops the loop thread; no lingering threads |

All tests mock `_run_task` to avoid real LLM calls. `reset_scheduler()` is called in teardown for isolation.

---

## Files Touched

| File | Change |
|---|---|
| `scripts/task_scheduler.py` | **New** — ~160 lines |
| `scripts/task_runner.py` | `submit_task()` shim — ~8 lines changed |
| `scripts/db.py` | `reset_running_tasks()` — ~10 lines added |
| `app/app.py` | Startup: `kill_stuck_tasks` → `reset_running_tasks` |
| `config/llm.yaml.example` | Add `scheduler:` section |
| `tests/test_task_scheduler.py` | **New** — ~200 lines |