12 Replies Latest reply on Jan 29, 2013 1:13 AM by himanshu.gautam

    clinfo segmentation fault

    yurtesen

      On some machines clinfo crashes with a segmentation fault. The machine below has an Intel processor and an Nvidia GPU. I am not sure why clinfo tries to become root and load fglrx, or why it segfaults. Any ideas? It is not a big deal, but clinfo was a nice tool for a quick look at the system's devices.

      $ clinfo

      Setting of real/effective user Id to 0/0 failed

      FATAL: Module fglrx not found.

      Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

      No protocol specified

      Number of platforms:                             3

        Platform Profile:                              FULL_PROFILE

        Platform Version:                              OpenCL 1.1 CUDA 4.2.1

        Platform Name:                                 NVIDIA CUDA

        Platform Vendor:                               NVIDIA Corporation

        Platform Extensions:                           cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll

        Platform Profile:                              FULL_PROFILE

        Platform Version:                              OpenCL 1.2 LINUX

        Platform Name:                                 Intel(R) OpenCL

        Platform Vendor:                               Intel(R) Corporation

        Platform Extensions:                           cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread

        Platform Profile:                              FULL_PROFILE

        Platform Version:                              OpenCL 1.2 AMD-APP (1113.2)

        Platform Name:                                 AMD Accelerated Parallel Processing

        Platform Vendor:                               Advanced Micro Devices, Inc.

        Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

       

       

        Platform Name:                                 NVIDIA CUDA

      Number of devices:                               1

        Device Type:                                   CL_DEVICE_TYPE_GPU

        Device ID:                                     4318

        Max compute units:                             8

        Max work items dimensions:                     3

          Max work items[0]:                           1024

          Max work items[1]:                           1024

          Max work items[2]:                           64

        Max work group size:                           1024

        Preferred vector width char:                   1

        Preferred vector width short:                  1

        Preferred vector width int:                    1

        Preferred vector width long:                   1

        Preferred vector width float:                  1

        Preferred vector width double:                 1

        Native vector width char:                      1

        Native vector width short:                     1

        Native vector width int:                       1

        Native vector width long:                      1

        Native vector width float:                     1

        Native vector width double:                    1

        Max clock frequency:                           1058Mhz

        Address bits:                                  32

        Max memory allocation:                         536690688

        Image support:                                 Yes

        Max number of images read arguments:           256

        Max number of images write arguments:          16

        Max image 2D width:                            32768

        Max image 2D height:                           32768

        Max image 3D width:                            4096

        Max image 3D height:                           4096

        Max image 3D depth:                            4096

        Max samplers within kernel:                    32

        Max size of kernel argument:                   4352

        Alignment (bits) of base address:              4096

        Minimum alignment (bytes) for any datatype:    128

        Single precision floating point capability

          Denorms:                                     Yes

          Quiet NaNs:                                  Yes

          Round to nearest even:                       Yes

          Round to zero:                               Yes

          Round to +ve and infinity:                   Yes

          IEEE754-2008 fused multiply-add:             Yes

        Cache type:                                    Read/Write

        Cache line size:                               128

        Cache size:                                    131072

        Global memory size:                            2146762752

        Constant buffer size:                          65536

        Max number of constant args:                   9

        Local memory type:                             Scratchpad

        Local memory size:                             49152

      Segmentation fault (core dumped)

      $

        • Re: clinfo segmentation fault
          himanshu.gautam

          The error messages that you see toward the beginning come from the "clGetPlatformIDs" call. This issue has been reported to AMD and will be fixed in a future release.

           

          The segmentation fault is reported while listing the CUDA device information, so I can't tell for sure who the culprit is.

          Note that ICD enables multiple platforms to co-exist with one another by cascading a single call amongst multiple platforms. So the call is getting cascaded among all platforms.

           

          Can you run clinfo under GDB, take a stack trace and post it here? That might give some insight into what is failing.

            • Re: clinfo segmentation fault
              yurtesen

              Do you need this?

              Program received signal SIGSEGV, Segmentation fault.

              0x0000000000000000 in ?? ()

              (gdb) backtrace

              #0  0x0000000000000000 in ?? ()

              #1  0x000000000040c5a7 in cl::Device::Device(cl::Device const&) ()

              #2  0x0000000000405c8f in T.1902 ()

              #3  0x0000000000407ded in main ()

              (gdb)

              I also ran valgrind

              ==1900== Jump to the invalid address stated on the next line

              ==1900==    at 0x0: ???

              ==1900==    by 0x40C5A6: cl::Device::Device(cl::Device const&) (in /usr/bin/clinfo)

              ==1900==    by 0x405C8E: T.1902 (in /usr/bin/clinfo)

              ==1900==    by 0x407DEC: main (in /usr/bin/clinfo)

              ==1900==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

              ==1900==

              ==1900==

              ==1900== Process terminating with default action of signal 11 (SIGSEGV)

              ==1900==  Bad permissions for mapped region at address 0x0

              ==1900==    at 0x0: ???

              ==1900==    by 0x40C5A6: cl::Device::Device(cl::Device const&) (in /usr/bin/clinfo)

              ==1900==    by 0x405C8E: T.1902 (in /usr/bin/clinfo)

              ==1900==    by 0x407DEC: main (in /usr/bin/clinfo)

              Thanks!

              Evren

                • Re: clinfo segmentation fault
                  himanshu.gautam

                  Hmm... Well, I was hoping to see some clues, but I can't make out anything except some NULL pointer related stuff. Actually, it is a jump into a NULL pointer.

                  This reminds me of the ICD, wherein the OpenCL implementations obtain the addresses of various APIs and make a CALL to the address returned.

                  To debug this further:

                  1. Which OpenCL implementation does clinfo link to dynamically?

                       You can run "ldd" on clinfo. That will show what clinfo is linked against.

                       Depending on how your PATH and LD_LIBRARY_PATH are set, this can resolve to either AMD's OpenCL or NVIDIA's OpenCL.

                       If it is pointing to NVIDIA's OpenCL, please change your PATH or LD_.... so that AMD's OpenCL is linked dynamically.

                   

                  At the moment, I don't have a system at hand, and I really don't know whether clinfo is statically linked or not.

                  I will check this on Monday.

                  Meanwhile, can you check and post your findings, if you find time? Thanks.

                    • Re: clinfo segmentation fault
                      yurtesen

                      I think it has nothing to do with the OpenCL library. It appears clinfo is simply trying to use a wrong pointer when it accesses the next device. It looks like a programming error in clinfo. It seems to get confused when there are multiple platforms (a guess).

                       

                      I also found out that the 'clinfo' from APP SDK 2.7 does not cause segmentation faults. I tested this by extracting only the clinfo binary from APP SDK 2.7 and running it on the problem system. So the problem was introduced in clinfo itself, in APP SDK 2.8.

                       

                      1. I tried different libOpenCL paths...

                       

                      Causes segmentation fault:

                      $ ldd /usr/bin/clinfo

                              linux-vdso.so.1 =>  (0x00007fff5862e000)

                              libOpenCL.so.1 => /opt/AMDAPP/lib/x86_64/libOpenCL.so.1 (0x00007f570f80d000)

                              libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f570f5d2000)

                              libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f570f2d5000)

                              libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f570f0d1000)

                              libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f570eebb000)

                              libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f570eafb000)

                              /lib64/ld-linux-x86-64.so.2 (0x00007f570fa15000)

                       

                      Changed LD_LIBRARY_PATH and still causes segmentation fault

                      $ ldd /usr/bin/clinfo

                              linux-vdso.so.1 =>  (0x00007fff573ff000)

                              libOpenCL.so.1 => /opt/intel/opencl-1.2-3.0.56860/lib64/libOpenCL.so.1 (0x00007f4967ebd000)

                              libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4967c81000)

                              libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4967985000)

                              libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4967781000)

                              libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f496756b000)

                              libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f49671ab000)

                              /lib64/ld-linux-x86-64.so.2 (0x00007f49680c6000)

                              libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00007f4966fa0000)

                              libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4966c9d000)

                      Nvidia's lib causes a different error:

                      $ ldd /usr/bin/clinfo

                      /usr/bin/clinfo: /usr/lib/libOpenCL.so.1: no version information available (required by /usr/bin/clinfo)

                      /usr/bin/clinfo: /usr/lib/libOpenCL.so.1: no version information available (required by /usr/bin/clinfo)

                              linux-vdso.so.1 =>  (0x00007fff36bff000)

                              libOpenCL.so.1 => /usr/lib/libOpenCL.so.1 (0x00007fed1de33000)

                              libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fed1dbf8000)

                              libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fed1d8fb000)

                              libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fed1d6f7000)

                              libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fed1d4e1000)

                              libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fed1d121000)

                              /lib64/ld-linux-x86-64.so.2 (0x00007fed1e03a000)

                      The error is

                      clinfo: relocation error: clinfo: symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference
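A quick way to see whether a particular libOpenCL exports a given entry point is to probe it with `dlopen`/`dlsym`. The following is only a minimal sketch under stated assumptions: it assumes a Linux system with `<dlfcn.h>` (older glibc versions may need linking with `-ldl`), the helper name `exportsSymbol` is made up for this illustration, and plain `dlsym` checks only the symbol name, not the `OPENCL_1.2` symbol version that the relocation error mentions.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns true if `library` can be loaded and exports
// a symbol named `symbol`. It does not check symbol versions.
bool exportsSymbol(const std::string& library, const std::string& symbol) {
    // RTLD_LAZY: resolve functions only when they are first used.
    void* handle = dlopen(library.c_str(), RTLD_LAZY | RTLD_LOCAL);
    if (handle == nullptr) {
        return false;  // library not found or not loadable
    }
    dlerror();  // clear any stale error state before the lookup
    void* address = dlsym(handle, symbol.c_str());
    bool found = (address != nullptr);
    dlclose(handle);
    return found;
}

// Example use (paths are the ones from the ldd output in this post):
//   exportsSymbol("/usr/lib/libOpenCL.so.1", "clRetainDevice");
//   exportsSymbol("/opt/AMDAPP/lib/x86_64/libOpenCL.so.1", "clRetainDevice");
```

If the first call returns false and the second returns true, that would be consistent with the relocation error: the binary expects a 1.2 entry point that the older library does not provide.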

                        • Re: clinfo segmentation fault
                          himanshu.gautam

                          Thanks for your time on this. I have an NVIDIA machine running 64-bit Linux. Let me try installing APP SDK 2.8 and check out clinfo.

                          I will do this on Monday and pass the info on to the AMD engineering team, if needed.

                           

                          Meanwhile, Thanks a lot for your time and Thanks for reporting the issue.

                            • Re: clinfo segmentation fault
                              himanshu.gautam

                              I was able to reproduce the issue on an Intel CPU + NVIDIA GPU machine.

                               

                              I installed 2.8 (without any graphics driver - because none is needed - so CPU runtime will be used) and it seg-faulted while listing NVIDIA's implementation.

                               

                              2.7 works fine on this machine.

                               

                              Yurtesen, thanks for reporting this issue and thanks a lot for your time on this. I will pass this on to the AMD engineering team.

                               

                              Also, I found a few things about the 2.8 clinfo.

                              clinfo2.7 is 41096 bytes in size.

                              clinfo2.8 is 581608 bytes in size.

                              This suggests that something has been statically linked into the 2.8 binary, which the ldd output corroborates:

                              libstdc++.so.6 is no longer listed in 2.8's ldd output,

                              which means that a C++ library has been statically linked this time.

                              I am using g++ version 4.6.3.

                              I am not sure what has been statically linked into clinfo, or whether it is compatible.

                               

                              Anyway, I will pass the message on to the AMD engineering team. Thanks for your time again.

                                • Re: clinfo segmentation fault
                                  yurtesen

                                  Thanks for your interest in this. I hope it will be fixed in the next SDK.

                                    • Re: clinfo segmentation fault
                                      LeeHowes

                                      This issue is fixed. It's a bug in cl.hpp, caused by the addition of APIs to the OpenCL ICD with no clean degradation. If you build cl.hpp against a 1.2 SDK, it will include 1.2 features, which include clRetainDevice. If you then run it on a 1.1 device, the ICD does not report an error when it sees an empty entry in the jump table; it just tries to call it and fails.

                                       

                                      We fixed cl.hpp to do a version check on construction of a device object and carry a flag with it to avoid calling those functions if they are going to cause a problem.
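The idea can be illustrated with a minimal, self-contained sketch. It does not use the real OpenCL headers or the actual cl.hpp code: `DispatchTable`, `DeviceHandle`, `countingRetain` and all field names are hypothetical stand-ins, assumed here only to show a version check done once at construction, with a flag carried along so that a missing (null) retain entry is never called.

```cpp
namespace sketch {

// Hypothetical stand-in for one vendor's ICD jump table.
using RetainFn = int (*)(int);

struct DispatchTable {
    int versionMajor;
    int versionMinor;
    RetainFn retainDevice;  // nullptr on a platform that predates 1.2
};

// Toy "retain" that just bumps a reference count.
inline int countingRetain(int refs) { return refs + 1; }

class DeviceHandle {
public:
    // The version check happens once, at construction.
    explicit DeviceHandle(const DispatchTable* table)
        : table_(table), refCountable_(supportsRetain(*table)), refs_(1) {}

    // The copy carries the flag along and only calls through the table
    // when the entry is known to exist.
    DeviceHandle(const DeviceHandle& other)
        : table_(other.table_),
          refCountable_(other.refCountable_),
          refs_(other.refs_) {
        if (refCountable_) {
            refs_ = table_->retainDevice(refs_);
        }
    }

    bool refCountable() const { return refCountable_; }
    int refs() const { return refs_; }

private:
    static bool supportsRetain(const DispatchTable& t) {
        const bool atLeast12 =
            t.versionMajor > 1 || (t.versionMajor == 1 && t.versionMinor >= 2);
        return atLeast12 && t.retainDevice != nullptr;
    }

    const DispatchTable* table_;
    bool refCountable_;
    int refs_;
};

}  // namespace sketch
```

With a 1.1-style table whose retain entry is null, copying a `DeviceHandle` skips the call instead of jumping to address 0, which is the failure visible in the backtrace earlier in the thread (a crash inside the `cl::Device` copy constructor).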

                                       

                                      I thought the current SDK had the latest cl.hpp, but just in case try grabbing it from khronos.org http://www.khronos.org/registry/cl/api/1.2/cl.hpp

                                       

                                      Note the code for class Wrapper<cl_device_id> in there, which checks if the device is able to be reference counted.
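As a rough illustration of how such a check might derive the version: the platform version strings in the clinfo output at the top of the thread have the form `OpenCL <major>.<minor> <vendor text>`, so the two numbers can be parsed out and compared against 1.2. This is only a sketch; the names `ParsedVersion`, `parseVersion` and `atLeast` are made up for this example and are not taken from cl.hpp.

```cpp
#include <cstdio>
#include <string>

// Hypothetical holder for a parsed "OpenCL <major>.<minor> ..." string.
struct ParsedVersion {
    int majorVer = 0;
    int minorVer = 0;
    bool valid = false;
};

// Parses strings such as "OpenCL 1.1 CUDA 4.2.1".
ParsedVersion parseVersion(const std::string& text) {
    ParsedVersion v;
    int a = 0;
    int b = 0;
    // sscanf returns the number of fields converted; both are required.
    if (std::sscanf(text.c_str(), "OpenCL %d.%d", &a, &b) == 2) {
        v.majorVer = a;
        v.minorVer = b;
        v.valid = true;
    }
    return v;
}

// True if the parsed version is at least wantMajor.wantMinor.
bool atLeast(const ParsedVersion& v, int wantMajor, int wantMinor) {
    if (!v.valid) {
        return false;
    }
    if (v.majorVer != wantMajor) {
        return v.majorVer > wantMajor;
    }
    return v.minorVer >= wantMinor;
}
```

For the three platforms listed in the original post, `atLeast(parseVersion(...), 1, 2)` would be false for the NVIDIA string ("OpenCL 1.1 CUDA 4.2.1") and true for the Intel and AMD strings, which matches where the crash occurs.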

                                       

                                      Lee

                                        • Re: clinfo segmentation fault
                                          yurtesen

                                          I am not compiling clinfo (obviously, since I don't have the sources). Does the version of cl.hpp have any consequence? I am not even sure it could be found at runtime.

                                           

                                          The AMD APP SDK 2.8 seems to ship cl.hpp v1.2, but the latest version is cl.hpp v1.2.1 (I actually have both; it appears Intel's SDK is supplied with v1.2.1).

                                           

                                          Anyway, I am happy to hear that it is fixed. I am simply using the clinfo from AMD APP SDK 2.7 for now as a workaround.

                                           

                                          Also, why is clinfo trying to load fglrx? I am not sure what the point of that would be even if it could load it, because it appears clinfo is not able to find AMD GPU devices if X is not running, so loading the module wouldn't do any good anyway. (The latest driver I tried was Catalyst 13.1.)

                                            • Re: clinfo segmentation fault
                                              LeeHowes

                                              Yes, maybe the problem is simply that clinfo was compiled against the wrong version. Hopefully that will not be a problem in the next release.

                                               

                                              Not sure about the fglrx issue. Maybe it's because clinfo has to check for the GL sharing extensions?

                                                • Re: clinfo segmentation fault
                                                  yurtesen

                                                  LeeHowes wrote:

                                                   

                                                  Not sure about the fglrx issue. Maybe it's because clinfo has to check for the GL sharing extensions?

                                                  Well, I don't know about that, but if it is not loaded, then it is not loaded. OpenCL programs are not able to find the GPU at all if X is not running (so how would it help to look for GL sharing etc. if there is no GPU in sight?). So I imagine loading fglrx wouldn't do any good unless it also set up X and restarted it (but obviously it shouldn't mess with X settings anyway).

                                                   

                                                  I think it is just causing unnecessary and annoying error messages.

                                                  -Setting of real/effective user Id to 0/0 failed

                                                  FATAL: Module fglrx not found.

                                                  Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

                                                  In addition, when Catalyst is installed, fglrx is set to be autoloaded at boot anyway, and the installer simply prints a warning that the machine must be rebooted for everything to be functional. (I am not sure if the installer tries to load fglrx; I never checked that. If not, it should be changed to try to load it.) So, basically, if fglrx is not loaded, it simply means that Catalyst is not installed.

                                                   

                                                  I can see that this is simply a cosmetic problem. I just think it is a bug that people will see errors and misleading recommendations about loading fglrx even on machines with only the CPU component.