fix(video): enforce PCI_BUS_ID order + force CUDA_VISIBLE_DEVICES assignment

CUDA defaults to FASTEST_FIRST device ordering, which can differ from
nvidia-smi's PCI bus order on multi-GPU nodes. On Muninn, the RTX 3090
is cuda:0 and the Quadro RTX 4000 is cuda:1, the opposite of nvidia-smi.

Two fixes:
1. Set CUDA_DEVICE_ORDER=PCI_BUS_ID so --gpu-id always matches nvidia-smi
   and the muninn.yaml profile GPU index assignments.
2. Use direct assignment (os.environ[...] = ...) instead of setdefault —
   setdefault silently no-ops if CUDA_VISIBLE_DEVICES is already present
   in the environment (conda activation, prior run, system default).
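
The difference between the two calls can be reproduced without a GPU. The sketch below is illustrative only: `pin_gpu` is a hypothetical helper name (not part of this commit), and a plain dict stands in for `os.environ`, which exposes the same `setdefault` and item-assignment behaviour.

```python
def pin_gpu(env, gpu_id):
    """Hypothetical helper: force PCI bus ordering and select exactly one GPU.

    `env` is any mutable mapping (os.environ in real use, a dict here).
    """
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # overwrites any inherited value
    return env


# Simulate an environment that already carries a stale value,
# e.g. left behind by conda activation or a prior run.
stale = {"CUDA_VISIBLE_DEVICES": "0"}

# Old behaviour: setdefault does nothing because the key already exists.
old_way = dict(stale)
old_way.setdefault("CUDA_VISIBLE_DEVICES", "1")

# New behaviour: direct assignment always wins.
new_way = pin_gpu(dict(stale), 1)

print(old_way["CUDA_VISIBLE_DEVICES"])  # 0 (requested GPU 1 was ignored)
print(new_way["CUDA_VISIBLE_DEVICES"])  # 1
```

In real use the mapping would be `os.environ`, and the assignment has to happen before the process initializes CUDA, since both variables are read at that point.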
pyr0ball 2026-05-26 15:07:30 -07:00
parent 9f7fb45071
commit c2ac55259d

@@ -171,10 +171,12 @@ if __name__ == "__main__":
     )
     args = _parse_args()
-    # cf-orch sets CUDA_VISIBLE_DEVICES before spawning; only set it here when
-    # running the service manually (--gpu-id flag) without cf-orch.
+    # Pin GPU selection unconditionally — --gpu-id is authoritative.
+    # Force PCI_BUS_ID ordering so --gpu-id matches nvidia-smi (not CUDA's
+    # default FASTEST_FIRST, which can swap indices on multi-GPU nodes).
     if args.device == "cuda" and not args.mock:
-        os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(args.gpu_id))
+        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
     mock = args.mock or args.model == "mock"
     device = "cpu" if mock else args.device