brandonbiggs
Journeyman III

Isolating MI250x Accelerator/GPU with environment variable

Hi, I have a server with 8 (4x2) MI250x GPUs, and I'm trying to restrict my code to a single GPU using an environment variable. On Nvidia GPUs I can do this with `export CUDA_VISIBLE_DEVICES=1`. Is there an equivalent AMD environment variable? Some search results suggested `export HIP_VISIBLE_DEVICES=1` or `export ROCR_VISIBLE_DEVICES=1`. I tried both, and neither seems to use the GPU I specified. This is my first AMD GPU server, so I apologize for my inexperience and for comparing it to Nvidia devices.
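For reference, a minimal sketch of scoping one of these variables to a single process instead of exporting it for the whole shell. It assumes the variable only needs to be present in the environment of the process that initializes the GPU runtime. The helper names `env_with_visible_device` and `run_with_device` are hypothetical, and the child command is a stand-in that only echoes the variable and never touches a GPU. It does not address the memory access fault described in this post.

```python
import os
import subprocess
import sys


def env_with_visible_device(index, var="HIP_VISIBLE_DEVICES"):
    """Return a copy of the current environment with `var` set to `index`."""
    env = os.environ.copy()
    env[var] = str(index)
    return env


def run_with_device(cmd, index, var="HIP_VISIBLE_DEVICES"):
    """Run `cmd` in a child process whose environment carries the device index."""
    return subprocess.run(
        cmd,
        env=env_with_visible_device(index, var),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the real workload: it just prints the variable it received.
    child = [sys.executable, "-c",
             "import os; print(os.environ['HIP_VISIBLE_DEVICES'])"]
    print(run_with_device(child, 1).stdout.strip())
```

Passing `var="ROCR_VISIBLE_DEVICES"` or `var="CUDA_VISIBLE_DEVICES"` would set one of the other variables mentioned above in the same way.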

If I don't export any of these variables, my code runs. If I export any of the three (`CUDA_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES`, or `ROCR_VISIBLE_DEVICES`), I get the following error:

Memory access fault by GPU node-4 (Agent handle: 0x9539910) on address 0x7feb35e00000. Reason: Unknown.

This also made me curious about how AMD GPUs are identified. When I run `rocm-smi` I can see all 8 GPUs, but the number I'm exporting doesn't seem to line up with the GPU I'm expecting (i.e. `export HIP_VISIBLE_DEVICES=1` does not line up with Device 0 or Node 0).

========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs            Temp    Power  Partitions          SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
              (DID, GUID)    (Edge)  (Avg)  (Mem, Compute, ID)
====================================================================================================================
0       4     0x740c, 11743  38.0°C  96.0W  N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  560.0W  24%    0%
1       5     0x740c, 61477  35.0°C  N/A    N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  0.0W    2%     0%
2       2     0x740c, 58606  30.0°C  98.0W  N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  560.0W  5%     0%
3       3     0x740c, 36660  31.0°C  N/A    N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  0.0W    2%     0%
4       8     0x740c, 28341  37.0°C  95.0W  N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  560.0W  1%     0%
5       9     0x740c, 30122  35.0°C  N/A    N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  0.0W    2%     0%
6       6     0x740c, 9668   36.0°C  84.0W  N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  560.0W  2%     0%
7       7     0x740c, 3258   30.0°C  N/A    N/A, N/A, 0         800Mhz  1600Mhz  0%   auto  0.0W    2%     0%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

brandonbiggs
Journeyman III

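I also did some testing, and the indices don't line up with the `rocm-smi` numbering the way I expected. For each setting, this is the Device and Node (as numbered by `rocm-smi`) that ended up being used: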

                              Device  Node
export CUDA_VISIBLE_DEVICES=0 2       2
export CUDA_VISIBLE_DEVICES=1 3       3
export CUDA_VISIBLE_DEVICES=2 0       4
export CUDA_VISIBLE_DEVICES=3 1       5
export CUDA_VISIBLE_DEVICES=4 6       6
export CUDA_VISIBLE_DEVICES=5 7       7
export CUDA_VISIBLE_DEVICES=6 4       8
export CUDA_VISIBLE_DEVICES=7 5       9
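One pattern does show up in these numbers: index i always lands on Node i + 2, so the indices appear to follow ascending Node ID rather than the `rocm-smi` Device order. The sketch below checks that against the data in this thread. This is only an observation from this one machine, not a documented guarantee, and the function names `index_by_node_order` and `visible_index_for_device` are hypothetical helpers.

```python
# (Device, Node) pairs copied from the rocm-smi output in this thread.
ROCM_SMI = [(0, 4), (1, 5), (2, 2), (3, 3), (4, 8), (5, 9), (6, 6), (7, 7)]

# Observed mapping from the table above: variable value -> (Device, Node).
OBSERVED = {
    0: (2, 2), 1: (3, 3), 2: (0, 4), 3: (1, 5),
    4: (6, 6), 5: (7, 7), 6: (4, 8), 7: (5, 9),
}


def index_by_node_order(rows):
    """Number the GPUs 0..N-1 in ascending Node ID order (an assumed rule)."""
    by_node = sorted(rows, key=lambda row: row[1])
    return dict(enumerate(by_node))


def visible_index_for_device(device, mapping):
    """Find the variable value that selects a given rocm-smi Device."""
    for index, (dev, _node) in mapping.items():
        if dev == device:
            return index
    raise KeyError(device)


if __name__ == "__main__":
    derived = index_by_node_order(ROCM_SMI)
    # Does "sort by Node ID" reproduce the observed table?
    print(derived == OBSERVED)
    # Which value would select rocm-smi Device 0 under that rule?
    print(visible_index_for_device(0, derived))
```

Under that assumed rule, selecting `rocm-smi` Device 0 on this server would mean exporting the value 2, which matches the third row of the table.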