With the DRM drivers not loaded, I found that the CPU load is still at 100% (very likely a kernel bug I introduced myself) and clinfo still hangs (which points to a misconfiguration of the libraries). I got this far, but I have no idea what I'm doing wrong. Any suggestions?
Backtrace of clinfo in gdb while it hangs:
#0 0x00007ffff70a0e3f in __pthread_once_slow () from /lib64/libpthread.so.0
#1 0x00007ffff74c8e58 in clGetExtensionFunctionAddress () from /usr/lib64/libOpenCL.so.1
#2 0x00007ffff74c775b in ?? () from /usr/lib64/libOpenCL.so.1
#3 0x00007ffff74c9647 in ?? () from /usr/lib64/libOpenCL.so.1
#4 0x00007ffff70a0e81 in __pthread_once_slow () from /lib64/libpthread.so.0
#5 0x00007ffff74c7d31 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#6 0x000000000040f687 in ?? ()
#7 0x0000000000407c12 in ?? ()
#8 0x00007ffff6d1e4f0 in __libc_start_main () from /lib64/libc.so.6
#9 0x000000000040e741 in ?? ()
Please also note that clinfo with Mesa works fine, even with the amdgpu_pro stack loaded.
So, where's the error? I have no idea what's going on, but if I stumble on a solution I will post it here.
Okay, I mostly solved it. Steps I took:
- modify the OpenCL ICD file so it loads libamdocl64.so, not libOpenCL.so
- export LD_LIBRARY_PATH as: /usr/lib64/OpenCL/vendors/amdgpu-pro/:/opt/amdgpu-pro/lib/x86_64-linux-gnu/:/opt/amdgpu-pro/lib/xorg/modules/drivers/:/usr/lib64/:/lib64/:/usr/lib32
- add -cl-std=CL1.1 to the clBuildProgram options
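The first step above (an ICD file that names libOpenCL.so instead of the vendor library) can be checked mechanically. The following is only a minimal sketch: the function name `find_loader_references` is made up for illustration, and the directory `/etc/OpenCL/vendors` in the usage line is the usual default for the ICD loader on Linux, which may not match a Gentoo setup like the one in this thread. An entry that points the loader back at itself could plausibly explain the nested `__pthread_once_slow` frames in the backtrace, but that is a guess, not something confirmed here.

```python
from pathlib import Path

# Assumption: names that belong to the ICD loader itself, not to a vendor driver.
LOADER_NAMES = ("libOpenCL.so",)


def find_loader_references(vendor_dir):
    """Return (icd_file_name, entry) pairs whose entry names the ICD loader itself."""
    suspicious = []
    for icd in sorted(Path(vendor_dir).glob("*.icd")):
        for line in icd.read_text().splitlines():
            entry = line.strip()
            if not entry:
                continue
            # Compare on the file name only, so absolute paths are handled too.
            if Path(entry).name.startswith(LOADER_NAMES):
                suspicious.append((icd.name, entry))
    return suspicious
```

Calling `find_loader_references("/etc/OpenCL/vendors")` returns an empty list when every `.icd` file names a vendor library such as libamdocl64.so.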
Now the program detects my 3 AMD devices and clinfo works fine. However, we're not there yet:
the Cape Verde device is unable to execute a kernel. When I submit one, I get the following error in dmesg:
[ 349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c
[ 349.362700] amdgpu 0000:41:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001032B1
[ 349.362702] amdgpu 0000:41:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[ 349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)
... and the program never finishes any kernel (not even on the CPU).
This clearly looks like a firmware bug. Where should I report it?
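For anyone comparing fault reports, the "VM fault" line can be picked apart with a few lines of Python. This is only a sketch: the helper name `decode_vm_fault` is made up, the 4 KiB page size is an assumption, and calling the first field "protections" is my reading of the message, not something stated in this thread. It does show that the decimal page number in the last line is the same value as VM_CONTEXT1_PROTECTION_FAULT_ADDR (0x001032B1).

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB GPU VM pages

FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), (read|write)"
)


def decode_vm_fault(line):
    """Extract the fields of an amdgpu 'VM fault' dmesg line, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group(3))
    return {
        "protections": int(m.group(1), 16),  # assumed meaning of the first field
        "vmid": int(m.group(2)),
        "page": page,
        "page_hex": hex(page),
        "byte_address": page << PAGE_SHIFT,  # only valid under the 4 KiB assumption
        "access": m.group(4),
    }


line = ("[ 349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) "
        "at page 1061553, read from '' (0x00000000) (119)")
info = decode_vm_fault(line)
print(info["page_hex"])  # 0x1032b1, matching the FAULT_ADDR register above
```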
dipak, sorry to bother you, but you seem to be the only one who is able to help:
I reported this at the driver forum and got no answer. The only Linux activity there is my own; all the other threads are win32/64-related issues or people looking for an answer to their own problem, and I seem to be the only one trying to help.
BUT, I isolated the problem:
the culprit is a malfunctioning clCreateCommandQueue call:
the analyze_device() function works perfectly on the Cape Verde, but clCreateCommandQueue causes IO page faults at AMD-Vi.
Would it help to use a different version of LLVM? Can I report this to AMD directly?
Could you please share the link to the post (on the driver forum) so that I can forward it to the concerned team?
From what I've learned, the configuration used in "amdgpu-pro linux 4.10 GPU fault on cape verde" is not supported at the moment. Gentoo Linux is not a distro supported by amdgpu-pro 16.60, and 16.60 does not support kernel 4.10 either.