voice benchmark: parallel model scoring to fan out across cluster nodes #39

Open

opened 2026-04-22 10:44:31 -07:00 by pyr0ball · 0 comments

Problem

The voice benchmark (`scripts/benchmark_voice.py run --cforch`) runs models sequentially: allocate, score all 6 prompts, release, repeat. Even with 3 nodes (4 GPUs) online (Heimdall, Navi, Strahl), only one GPU is ever busy at a time.

Goal

Fan out model scoring across available cluster nodes in parallel so all GPUs are utilized simultaneously.

Proposed approach

- Launch N concurrent workers (e.g. via `concurrent.futures.ThreadPoolExecutor` or `asyncio.gather`)
- Each worker: allocate cf-text for one model on any available node, score all prompts, release
- Workers run concurrently; the coordinator picks the best available node for each allocation
- Collect results and merge them into the same report format
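
The steps above could be sketched roughly as follows, using `concurrent.futures.ThreadPoolExecutor`. This is only an illustration: `allocate`, `release`, `score_prompt`, and the `RELEASED` list are hypothetical stand-ins, not the real cf-text or coordinator API, and the actual benchmark code may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor

RELEASED = []  # hypothetical bookkeeping, only here to make releases visible


def allocate(model):
    """Hypothetical stand-in: ask the coordinator for a lease on any node."""
    return {"model": model}


def release(lease):
    """Hypothetical stand-in: hand the lease back to the coordinator."""
    RELEASED.append(lease["model"])


def score_prompt(lease, prompt):
    """Hypothetical stand-in: score one prompt against the leased model."""
    return float(len(prompt))


def score_model(model, prompts):
    """One worker: allocate, score every prompt, always release."""
    lease = allocate(model)
    try:
        return [score_prompt(lease, p) for p in prompts]
    finally:
        # Runs even if scoring raises, so the lease is never leaked.
        release(lease)


def run_parallel(models, prompts, max_workers):
    """Score all models concurrently, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        scores = list(pool.map(lambda m: score_model(m, prompts), models))
    return dict(zip(models, scores))
```

Threads are probably sufficient here if the workers mostly wait on remote inference calls; an `asyncio.gather` variant would have the same shape, provided the allocation and scoring calls are awaitable.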

Constraints

- Must still respect `--max-vram` filtering before queuing
- Worker count should be bounded (suggest: `min(len(models), num_online_gpus)` or a `--parallel N` flag)
- Results order in the report should match the ranked catalog order, not arrival order
- `try/finally` lease release must be preserved per worker
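
A minimal sketch of how the filtering, worker bound, and report ordering might fit together. The catalog shape (a ranked list of `(name, vram_gb)` pairs), the GB unit for `--max-vram`, and the function names `plan_run` and `run_ordered` are assumptions made for illustration; `score_model` is whatever per-model worker the benchmark ends up using, including its own `try/finally` release.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def plan_run(catalog, max_vram_gb, num_online_gpus, parallel=None):
    """Filter the ranked catalog by VRAM, then pick a bounded worker count.

    catalog: list of (model_name, vram_gb) in ranked order (assumed shape).
    parallel: optional override, as a --parallel N flag might supply.
    """
    queued = [name for name, vram in catalog if vram <= max_vram_gb]
    limit = parallel if parallel is not None else num_online_gpus
    # ThreadPoolExecutor rejects max_workers < 1, so clamp to at least 1.
    workers = max(1, min(len(queued), limit))
    return queued, workers


def run_ordered(queued, workers, score_model):
    """Run workers concurrently, then report in catalog order."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(score_model, name): name for name in queued}
        for fut in as_completed(futures):  # arrival order
            results[futures[fut]] = fut.result()
    # Re-sequence by the ranked catalog order, not arrival order.
    return [(name, results[name]) for name in queued]
```

Keying results by model name and re-sequencing at the end keeps the report stable regardless of which node finishes first.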

Context

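Cluster currently: Heimdall (2x RTX 4000 8 GB), Navi (RTX 4000 8 GB), Strahl (RTX 2060 6 GB). With 4 GPUs available, an 8-model run could complete in roughly 2x the single-model time instead of 8x: two rounds of four models, assuming each model takes about the same time and fits on any of the GPUs.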
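
The ~2x figure can be checked with a quick back-of-the-envelope estimate. The sketch below is an illustration only: it assumes every model fits on every GPU and ignores allocation overhead, and `rounds` and `makespan` are hypothetical helper names, not part of the benchmark.

```python
import heapq
import math


def rounds(num_models, num_gpus):
    """Number of waves needed if every model takes the same time."""
    return math.ceil(num_models / num_gpus)


def makespan(durations, num_gpus):
    """Greedy estimate: each model goes to whichever GPU frees up first."""
    free_at = [0.0] * num_gpus  # time at which each GPU becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest-free GPU
        heapq.heappush(free_at, start + d)
    return max(free_at)
```

For 8 equal-length models on 4 GPUs, `rounds(8, 4)` gives 2 and `makespan([1.0] * 8, 4)` gives 2.0, against 8.0 on a single GPU. In practice the 6 GB card on Strahl may not fit every model, so the real speedup could be lower.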
