9 Replies Latest reply on Mar 15, 2017 4:06 AM by dipak

    headless GPU on amdgpu-pro never returns

    jpsollie

      I have a problem with the opencl driver:

      I have a headless linux system, with 3 gpu cards on it:

      -amd cape verde

      -amd fury

      -matrox mgag200.

      I run the latest linux 4.10-rc8 kernel

      I modified the DRM calls in the amdgpu-pro kernel module code so it would be compatible with this kernel

      the driver now boots fine (see dmesg), it has some issues with the fury display, but that display is never used, so I don't care.

       

      now, my question:

      I removed all mesa/ocl-icd libraries, but no matter what I try, the call to clGetPlatformIDs never returns

      - when executing clinfo with no drivers loaded, the result is the same as with the amdgpu driver loaded, but I can still press Ctrl+C.

        When the drivers are loaded, there is no way to interrupt the program, and the CPU load goes up to 6.0, while top reports clinfo is not even running.  The only way to stop this is to "echo b > /proc/sysrq-trigger" (which forces an immediate reboot).

      using gdb, I discovered that the problem is with clGetPlatformIDs: when calling this function, it never returns
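A small wrapper can tell a hang apart from a normal exit without tying up the terminal. This is only an illustrative sketch, not something from the original post: the helper name `probe` and the timeout values are made up, and a process stuck inside the kernel (as described above with the drivers loaded) may simply ignore the kill.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd and report whether it exits or hangs (hypothetical helper)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
        return f"exited with code {rc}"
    except subprocess.TimeoutExpired:
        # SIGKILL works for a userspace hang; a task blocked in the
        # kernel may not react to it at all.
        proc.kill()
        try:
            proc.wait(timeout=1.0)
            return "hung (killed)"
        except subprocess.TimeoutExpired:
            return "hung (unkillable)"


# Example use on the affected machine (assumes clinfo is on PATH):
#   print(probe(["clinfo"], timeout_s=15.0))
```

A result of "hung (unkillable)" would match the case where Ctrl+C no longer works.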

      I am using the 16.60 drivers

      I know I am using an unsupported configuration, but does anyone have an idea how to get this setup working?

       

      thanks

        • Re: headless GPU on amdgpu-pro never returns
          jpsollie

          small update:

          with the DRM drivers not loaded, I discovered that the CPU load is still at 100% (which is very likely a kernel bug introduced by me), and clinfo still hangs (which points to a misconfiguration of the libs).  Though I was able to discover this, I have no idea what I'm doing wrong.  Does anyone have any suggestions?

          Thx

          • Re: headless GPU on amdgpu-pro never returns
            jpsollie

            update2:

            bt of clinfo in gdb while it hangs:

            #0  0x00007ffff70a0e3f in __pthread_once_slow () from /lib64/libpthread.so.0

            #1  0x00007ffff74c8e58 in clGetExtensionFunctionAddress () from /usr/lib64/libOpenCL.so.1

            #2  0x00007ffff74c775b in ?? () from /usr/lib64/libOpenCL.so.1

            #3  0x00007ffff74c9647 in ?? () from /usr/lib64/libOpenCL.so.1

            #4  0x00007ffff70a0e81 in __pthread_once_slow () from /lib64/libpthread.so.0

            #5  0x00007ffff74c7d31 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1

            #6  0x000000000040f687 in ?? ()

            #7  0x0000000000407c12 in ?? ()

            #8  0x00007ffff6d1e4f0 in __libc_start_main () from /lib64/libc.so.6

            #9  0x000000000040e741 in ?? ()

             

            please also note that clinfo with the mesa OpenCL stack works fine, even with the amdgpu-pro stack loaded.

             

            so, where's the error? I have no idea what's going on, but if I stumble on a solution I will post it here
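One detail in the backtrace above stands out: `__pthread_once_slow` appears twice, at frames #0 and #4, so the one-time initialization started by clGetPlatformIDs seems to have reached another `pthread_once` before finishing. Whether both calls use the same once-control is not visible in the trace, so treat that reading as a guess. The sketch below just parses the frames and flags repeated function names; `parse_frames` and `repeated_functions` are hypothetical helper names.

```python
import re
from collections import Counter

# Backtrace copied from the post above.
BACKTRACE = """\
#0  0x00007ffff70a0e3f in __pthread_once_slow () from /lib64/libpthread.so.0
#1  0x00007ffff74c8e58 in clGetExtensionFunctionAddress () from /usr/lib64/libOpenCL.so.1
#2  0x00007ffff74c775b in ?? () from /usr/lib64/libOpenCL.so.1
#3  0x00007ffff74c9647 in ?? () from /usr/lib64/libOpenCL.so.1
#4  0x00007ffff70a0e81 in __pthread_once_slow () from /lib64/libpthread.so.0
#5  0x00007ffff74c7d31 in clGetPlatformIDs () from /usr/lib64/libOpenCL.so.1
#6  0x000000000040f687 in ?? ()
#7  0x0000000000407c12 in ?? ()
#8  0x00007ffff6d1e4f0 in __libc_start_main () from /lib64/libc.so.6
#9  0x000000000040e741 in ?? ()
"""

FRAME_RE = re.compile(
    r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(\)(?: from (\S+))?", re.MULTILINE
)


def parse_frames(text):
    """Return (frame_number, function, library) tuples; library may be ''."""
    return [(int(n), func, lib) for n, func, lib in FRAME_RE.findall(text)]


def repeated_functions(frames):
    """Names of resolved functions that occur in more than one frame."""
    counts = Counter(func for _, func, _ in frames if func != "??")
    return sorted(name for name, count in counts.items() if count > 1)


frames = parse_frames(BACKTRACE)
print(repeated_functions(frames))
```

Every resolved frame between the two `pthread_once` calls lives in /usr/lib64/libOpenCL.so.1, which suggests looking at how that loader library is configured.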

            • Re: headless GPU on amdgpu-pro never returns
              jpsollie

              okay, I mostly solved it.  steps I took:

              - modify the OpenCL icd file so it loads libamdocl64.so, not libOpenCL.so

              - set LD_LIBRARY_PATH to: /usr/lib64/OpenCL/vendors/amdgpu-pro/:/opt/amdgpu-pro/lib/x86_64-linux-gnu/:/opt/amdgpu-pro/lib/xorg/modules/drivers/:/usr/lib64/:/lib64/:/usr/lib32

              - export GPU_FORCE_64BIT_PTR=1

              - add -cl-std=1.1 to the clBuildProgram options
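The steps above can be sketched as a small helper that builds the environment for a child process and sanity-checks an ICD file's contents. This is a rough sketch only: the function names are hypothetical, the directory list is copied from the post, and the check assumes the usual ICD convention that the file holds the name or path of the vendor library.

```python
import os

# Library directories exactly as listed in the post.
AMDGPU_PRO_LIB_DIRS = [
    "/usr/lib64/OpenCL/vendors/amdgpu-pro/",
    "/opt/amdgpu-pro/lib/x86_64-linux-gnu/",
    "/opt/amdgpu-pro/lib/xorg/modules/drivers/",
    "/usr/lib64/",
    "/lib64/",
    "/usr/lib32",
]

# Build option as written in the post. Note that the OpenCL specification
# documents this option in the form -cl-std=CL1.1.
BUILD_OPTIONS = "-cl-std=1.1"


def amdgpu_pro_env(base=None):
    """Copy of the environment with the variables from the steps above set."""
    env = dict(os.environ if base is None else base)
    env["LD_LIBRARY_PATH"] = ":".join(AMDGPU_PRO_LIB_DIRS)
    env["GPU_FORCE_64BIT_PTR"] = "1"
    return env


def icd_points_at_loader(icd_text):
    """True if an ICD file names the generic loader instead of a vendor lib."""
    return os.path.basename(icd_text.strip()).startswith("libOpenCL.so")


print(icd_points_at_loader("libOpenCL.so"))    # the original, hanging setup
print(icd_points_at_loader("libamdocl64.so"))  # after the first step
```

The returned dictionary can be passed as `env=` to `subprocess.run` to start clinfo or the OpenCL program with these settings, without changing the login shell.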

              now, the program detects my 3 AMD devices, clinfo works fine.  However, we're not there yet:

              the cape verde device is unable to execute a kernel.  when I submit it, I get the following error in dmesg:

              dmesg

              [  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c

              [  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1

              [  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C

              [  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)

              ... and the program never finishes any kernel (not even on the CPU device).
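For anyone reporting this, it may help to decode the fault lines first. The sketch below pulls the fields out of the dmesg text above and checks that the decimal page number on the last line is the same value as the VM_CONTEXT1_PROTECTION_FAULT_ADDR register. The byte offset assumes 4 KiB GPU pages, which the thread does not state, and `decode_vm_fault` is a hypothetical helper name.

```python
import re

# dmesg lines copied from the post above.
DMESG = """\
[  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c
[  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1
[  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)
"""

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
FAULT_RE = re.compile(
    r"VM fault \(0x([0-9a-f]+), vmid (\d+)\) at page (\d+), (read|write)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def decode_vm_fault(text):
    """Extract the fault address register, VMID, page and access type."""
    addr_reg = int(ADDR_RE.search(text).group(1), 16)
    fault = FAULT_RE.search(text)
    page = int(fault.group(3))
    return {
        "fault_addr_reg": addr_reg,
        "vmid": int(fault.group(2)),
        "page": page,
        "access": fault.group(4),
        "byte_offset": page * PAGE_SIZE,
    }


info = decode_vm_fault(DMESG)
print(hex(info["page"]), info["vmid"], info["access"], hex(info["byte_offset"]))
```

So the fault is a read by VMID 5 at GPU virtual page 0x1032B1, which is the kind of detail a bug report on the kernel driver side would need.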

               

              this clearly looks like a firmware bug.  Where should I report it?

              • Re: headless GPU on amdgpu-pro never returns
                dipak

                Hi,

                As I've come to know, the config used here (amdgpu-pro on Linux 4.10, with the GPU fault on Cape Verde) is not supported at this moment. Gentoo Linux is not a distro supported by amdgpu-pro 16.60, and 16.60 also does not support kernel 4.10.

                 

                Regards,