
jpsollie
Adept II

headless GPU on amdgpu-pro never returns

I have a problem with the OpenCL driver.

I have a headless Linux system with 3 GPU cards in it:

- AMD Cape Verde

- AMD Fury

- Matrox mgag200

I run the latest Linux kernel, 4.10-rc8.

I modified the DRM calls in the amdgpu-pro kernel code so that it is compatible with this kernel.

The driver now boots fine (see dmesg). It has some issues with the Fury display, but that display is never used, so I don't care.

Now, my question:

I removed all mesa/ocl-icd libraries, but no matter what I try, the call to clGetPlatformIDs never returns.

- When I run clinfo with no drivers loaded, the result is the same as with the amdgpu driver loaded, but I can still press Ctrl+C.

- When the drivers are loaded, the program cannot be interrupted, and the CPU load goes up to 6.0, while top reports that clinfo is not even running. The only way to stop this is to force a reboot with "echo b > /proc/sysrq-trigger".

using gdb, I discovered that the problem is with clGetPlatformIDs: when calling this function, it never returns
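
A hang like this is easier to test repeatedly if the probe runs in a child process with a timeout, so the calling shell does not get stuck with it. The following is only a minimal sketch: the helper name `probe` and the timeout values are illustrative, and it cannot rescue a child that is stuck in uninterruptible sleep inside the kernel, which is what the driver-loaded case above appears to be.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # SIGKILL has no effect on a task blocked in uninterruptible sleep,
        # so only wait briefly for the child to go away and then give up.
        proc.kill()
        try:
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return None


# Harmless self-check that does not need OpenCL: a child that exits at once.
quick = probe([sys.executable, "-c", "pass"], timeout_s=30.0)
```

Calling `probe(["clinfo"])` would then return `None` if clinfo hangs, instead of blocking the terminal.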

I am using the 16.60 drivers

I know I am using an unsupported configuration, but does anyone have an idea how to get this setup working?

thanks

9 Replies
jpsollie
Adept II

small update:

With the DRM drivers not loaded, I discovered that the CPU load is still at 100% (very likely a kernel bug introduced by me) and that clinfo still hangs (which points to a misconfiguration of the libraries). Although I found this out, I have no idea what I'm doing wrong. Does anyone have any suggestions?

Thx

jpsollie
Adept II

update2:

Backtrace (bt) of clinfo in gdb while it hangs:

#0  0x00007ffff70a0e3f in __pthread_once_slow () from /lib64/libpthread.so.0

#1  0x00007ffff74c8e58 in clGetExtensionFunctionAddress () from /usr/lib64/libOpenCL.so.1

#2  0x00007ffff74c775b in ?? () from /usr/lib64/libOpenCL.so.1

#3  0x00007ffff74c9647 in ?? () from /usr/lib64/libOpenCL.so.1

#4  0x00007ffff70a0e81 in __pthread_once_slow () from /lib64/libpthread.so.0

#5  0x00007ffff74c7d31 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1

#6  0x000000000040f687 in ?? ()

#7  0x0000000000407c12 in ?? ()

#8  0x00007ffff6d1e4f0 in __libc_start_main () from /lib64/libc.so.6

#9  0x000000000040e741 in ?? ()

Please also note that clinfo works fine with Mesa, even with the amdgpu-pro stack loaded.

So, where's the error? I have no idea what's going on, but if I happen to find a solution, I will post it here.
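
One thing the backtrace does show is `__pthread_once_slow` twice on the same stack, both times entered from `libOpenCL.so.1`. A possible explanation (a guess, not a confirmed diagnosis) is that the ICD loader ends up loading itself as a vendor driver and re-enters its own one-time initialisation. The sketch below checks for that. It assumes the usual Linux loader convention of `.icd` files in `/etc/OpenCL/vendors` whose first non-empty line names the vendor library, and the function name is hypothetical.

```python
import os


def find_self_referencing_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return (icd_file, library) pairs whose library is the ICD loader itself."""
    suspicious = []
    if not os.path.isdir(vendors_dir):
        return suspicious
    for name in sorted(os.listdir(vendors_dir)):
        if not name.endswith(".icd"):
            continue
        path = os.path.join(vendors_dir, name)
        with open(path, encoding="utf-8", errors="replace") as fh:
            lines = fh.read().splitlines()
        # Assumption: the first non-empty line is the vendor library to load.
        library = next((ln.strip() for ln in lines if ln.strip()), "")
        if os.path.basename(library).startswith("libOpenCL.so"):
            suspicious.append((path, library))
    return suspicious
```

An empty result only rules out this one misconfiguration. It says nothing about the driver itself.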

jpsollie
Adept II

Okay, I mostly solved it. Steps I took:

- modify the OpenCL ICD file so it loads libamdocl64.so, not libOpenCL.so

- export LD_LIBRARY_PATH as: /usr/lib64/OpenCL/vendors/amdgpu-pro/:/opt/amdgpu-pro/lib/x86_64-linux-gnu/:/opt/amdgpu-pro/lib/xorg/modules/drivers/:/usr/lib64/:/lib64/:/usr/lib32

- export GPU_FORCE_64BIT_PTR=1

- add -cl-std=1.1 to the clBuildProgram options
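
The two environment steps can be collected in a small wrapper so they apply only to the OpenCL program and not to the whole session. This is a sketch under the assumption that the paths are exactly the ones listed above. The function name `amdgpu_pro_env` is made up for illustration, and the ICD file edit and the build option still have to be done separately (the OpenCL specification spells the option `-cl-std=CL1.1`).

```python
import os

# Library directories copied verbatim from the LD_LIBRARY_PATH step above.
AMDGPU_PRO_LIB_DIRS = [
    "/usr/lib64/OpenCL/vendors/amdgpu-pro/",
    "/opt/amdgpu-pro/lib/x86_64-linux-gnu/",
    "/opt/amdgpu-pro/lib/xorg/modules/drivers/",
    "/usr/lib64/",
    "/lib64/",
    "/usr/lib32",
]


def amdgpu_pro_env(base_env=None):
    """Return a copy of the environment with the two exports applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_LIBRARY_PATH"] = ":".join(AMDGPU_PRO_LIB_DIRS)
    env["GPU_FORCE_64BIT_PTR"] = "1"
    return env
```

It could be used as `subprocess.run(["clinfo"], env=amdgpu_pro_env())`, which leaves the parent shell's environment untouched.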

Now the program detects my 3 AMD devices and clinfo works fine. However, we're not there yet:

The Cape Verde device is unable to execute a kernel. When I submit one, I get the following error in dmesg:

[  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c

[  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1

[  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C

[  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)

... and the program never finishes any kernel (not even on the CPU).
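
The two register values in the dmesg output can be cross-checked against the human-readable "VM fault" line. The bit positions used below are an assumption inferred from the values in this one log (they reproduce 0x0c, vmid 5, read, and client id 119), not taken from a register specification. The 4 KiB page size is likewise assumed.

```python
def decode_vm_fault(addr_reg, status_reg, page_size=4096):
    """Split the fault registers into the fields printed by the VM fault line."""
    return {
        "page": addr_reg,                           # page number, as logged
        "byte_address": addr_reg * page_size,       # assumes 4 KiB GPU pages
        "protections": status_reg & 0xFF,           # assumed bits 0-7
        "client_id": (status_reg >> 12) & 0xFF,     # assumed bits 12-19
        "is_write": bool((status_reg >> 24) & 0x1), # assumed bit 24
        "vmid": (status_reg >> 25) & 0xF,           # assumed bits 25-28
    }


# Register values copied from the dmesg lines above.
fault = decode_vm_fault(0x001032B1, 0x0A07700C)
```

With these assumptions the decoded fields agree with the log: page 1061553, protections 0x0c, vmid 5, a read access, and client id 119.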

this clearly looks like a firmware bug.  Where should I report it?


You may report the issue to one of the support forums below, as applicable:

Drivers & Software

Graphics

Regards,


dipak, sorry for disturbing you, but you seem to be the only one able to help:

I reported this at the driver forum. I got no answer; the only Linux activity there comes from me. All the other posts are win32/64-related issues or people trying to find an answer to their own problem, and I seem to be the only one trying to help.

BUT, I isolated the problem:

The problem is a malfunctioning clCreateCommandQueue call:

the analyze_device() function works perfectly on the Cape Verde, but clCreateCommandQueue causes IO page faults reported by AMD-Vi (the IOMMU).

would it help to use a different version of LLVM? can I report this to AMD directly?


Could you please share the link of the post (at driver forum) so that I could forward it to the concerned team?

Regards,


Thanks.

dipak
Big Boss

Hi,

As I've come to know, the configuration used here (see "amdgpu-pro linux 4.10 GPU fault on cape verde") is not supported at this moment. Gentoo Linux is not a distro supported by amdgpu-pro 16.60, and 16.60 does not support kernel 4.10 either.

Regards,
