fix(video): enforce PCI_BUS_ID order + force CUDA_VISIBLE_DEVICES assignment
CUDA defaults to FASTEST_FIRST device ordering, which does not match nvidia-smi's PCI bus order on multi-GPU nodes. On Muninn, the RTX 3090 is cuda:0 and the Quadro RTX 4000 is cuda:1, the opposite of nvidia-smi.

Two fixes:

1. Set CUDA_DEVICE_ORDER=PCI_BUS_ID so --gpu-id always matches nvidia-smi and the muninn.yaml profile GPU index assignments.
2. Use direct assignment (os.environ[...] = ...) instead of setdefault. setdefault silently no-ops if CUDA_VISIBLE_DEVICES is already present in the environment (conda activation, prior run, system default).
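The setdefault pitfall described above can be reproduced without a GPU. This is a minimal sketch: a plain dict stands in for os.environ (which exposes the same mapping methods), and the inherited value "0" and the requested index 1 are hypothetical.

```python
# Plain dict standing in for os.environ; the inherited value is hypothetical.
env = {"CUDA_VISIBLE_DEVICES": "0"}

requested_gpu = 1  # hypothetical --gpu-id value

# setdefault writes only when the key is absent, so the inherited "0" survives.
env.setdefault("CUDA_VISIBLE_DEVICES", str(requested_gpu))
after_setdefault = env["CUDA_VISIBLE_DEVICES"]

# Direct assignment always overwrites the existing value.
env["CUDA_VISIBLE_DEVICES"] = str(requested_gpu)
after_assignment = env["CUDA_VISIBLE_DEVICES"]

print(after_setdefault, after_assignment)
```

With a pre-populated variable, the setdefault call leaves the old index in place, which is why the requested GPU was ignored.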
parent 9f7fb45071
commit c2ac55259d
1 changed file with 5 additions and 3 deletions
@@ -171,10 +171,12 @@ if __name__ == "__main__":
     )
     args = _parse_args()
 
-    # cf-orch sets CUDA_VISIBLE_DEVICES before spawning; only set it here when
-    # running the service manually (--gpu-id flag) without cf-orch.
+    # Pin GPU selection unconditionally — --gpu-id is authoritative.
+    # Force PCI_BUS_ID ordering so --gpu-id matches nvidia-smi (not CUDA's
+    # default FASTEST_FIRST, which can swap indices on multi-GPU nodes).
     if args.device == "cuda" and not args.mock:
-        os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(args.gpu_id))
+        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
 
     mock = args.mock or args.model == "mock"
     device = "cpu" if mock else args.device
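The pinning logic in this change could also be factored into a small helper so it can be exercised without a GPU. This is only a sketch: pin_cuda_device and its env parameter are hypothetical names that do not appear in the commit.

```python
import os
from collections.abc import MutableMapping


def pin_cuda_device(gpu_id: int, env: MutableMapping[str, str] = os.environ) -> None:
    """Hypothetical helper: force PCI bus ordering and expose a single GPU."""
    # Enumerate devices in PCI bus order, the same order nvidia-smi reports.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Overwrite any inherited value so the requested index always wins.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)


# Usage with a stand-in environment holding a stale inherited value.
fake_env = {"CUDA_VISIBLE_DEVICES": "0"}
pin_cuda_device(1, fake_env)
print(fake_env)
```

Both variables are read when the CUDA runtime initializes, so a helper like this has to run before any CUDA-using library creates a context. Once CUDA_VISIBLE_DEVICES names a single GPU, that GPU appears inside the process as cuda:0 regardless of its physical index.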