
AI Discussions

brandonbiggs
Journeyman III

Running AI Model on GPUs

Hi, I'm running a language model on an MI250X. If I isolate the software to a single GPU with the environment variable `CUDA_VISIBLE_DEVICES=0`, the model loads and I can see it using GPU memory (24%). But when I try to run inference, I get an error:

Memory access fault by GPU node-9 (Agent handle: 0x9354ed0) on address 0x7f94b2200000. Reason: Unknown.

If I change `CUDA_VISIBLE_DEVICES=0` to `CUDA_VISIBLE_DEVICES=0,1`, the model still loads on the same GPU and still uses only one GPU's worth of memory, but inference now runs without the error. The model is small enough to fit easily in the GPU's memory, so it doesn't seem like an OOM issue. Does anyone know why this might be happening?
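For reference, the isolation is set up roughly like this (a minimal sketch; the post doesn't name the framework, so the commented-out PyTorch lines and the `model.pt` path are assumptions, not my actual code):

```python
import os

# Device visibility must be set before the GPU runtime initializes,
# so it is exported before the ML framework is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"      # the failing case
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # the working case

# Hypothetical framework usage (assumption: a PyTorch-based model):
# import torch
# model = torch.load("model.pt", map_location="cuda:0")
# model.eval()

# The process should now see only the devices listed above.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```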
