yurtesen · ‎01-25-2013

On some machines clinfo crashes with a segmentation fault. The machine below has an Intel processor and an Nvidia GPU. I am not sure why clinfo tries to become root and load fglrx, or why it segfaults. Any ideas? It is not a big deal, but clinfo was a nice tool for a quick look at a system's devices.

$ clinfo
Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly
No protocol specified
Number of platforms:                             3
Platform Profile:                              FULL_PROFILE
Platform Version:                              OpenCL 1.1 CUDA 4.2.1
Platform Name:                                 NVIDIA CUDA
Platform Vendor:                               NVIDIA Corporation
Platform Extensions:                           cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
Platform Profile:                              FULL_PROFILE
Platform Version:                              OpenCL 1.2 LINUX
Platform Name:                                 Intel(R) OpenCL
Platform Vendor:                               Intel(R) Corporation
Platform Extensions:                           cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread
Platform Profile:                              FULL_PROFILE
Platform Version:                              OpenCL 1.2 AMD-APP (1113.2)
Platform Name:                                 AMD Accelerated Parallel Processing
Platform Vendor:                               Advanced Micro Devices, Inc.
Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name:                                 NVIDIA CUDA
Number of devices:                               1
Device Type:                                   CL_DEVICE_TYPE_GPU
Device ID:                                     4318
Max compute units:                             8
Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           64
Max work group size:                           1024
Preferred vector width char:                   1
Preferred vector width short:                  1
Preferred vector width int:                    1
Preferred vector width long:                   1
Preferred vector width float:                  1
Preferred vector width double:                 1
Native vector width char:                      1
Native vector width short:                     1
Native vector width int:                       1
Native vector width long:                      1
Native vector width float:                     1
Native vector width double:                    1
Max clock frequency:                           1058Mhz
Address bits:                                  32
Max memory allocation:                         536690688
Image support:                                 Yes
Max number of images read arguments:           256
Max number of images write arguments:          16
Max image 2D width:                            32768
Max image 2D height:                           32768
Max image 3D width:                            4096
Max image 3D height:                           4096
Max image 3D depth:                            4096
Max samplers within kernel:                    32
Max size of kernel argument:                   4352
Alignment (bits) of base address:              4096
Minimum alignment (bytes) for any datatype:    128
Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
Cache type:                                    Read/Write
Cache line size:                               128
Cache size:                                    131072
Global memory size:                            2146762752
Constant buffer size:                          65536
Max number of constant args:                   9
Local memory type:                             Scratchpad
Local memory size:                             49152
Segmentation fault (core dumped)
$

himanshu_gautam · ‎01-25-2013

The error messages that you see toward the beginning come from the clGetPlatformIDs call. This issue has been reported to AMD and will be fixed in a future release.

The segmentation fault is reported while listing the CUDA device information, so I can't tell for sure who the culprit is.

Note that the ICD lets multiple platforms coexist by cascading a single call among them, so the call is cascaded across all platforms.

Can you run clinfo under GDB, take a stack trace, and post it here? That might give some insight into who is failing.

yurtesen · ‎01-25-2013

Do you need this?

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) backtrace
#0 0x0000000000000000 in ?? ()
#1 0x000000000040c5a7 in cl::Device::Device(cl::Device const&) ()
#2 0x0000000000405c8f in T.1902 ()
#3 0x0000000000407ded in main ()
(gdb)

I also ran valgrind

==1900== Jump to the invalid address stated on the next line
==1900==    at 0x0: ???
==1900==    by 0x40C5A6: cl::Device::Device(cl::Device const&) (in /usr/bin/clinfo)
==1900==    by 0x405C8E: T.1902 (in /usr/bin/clinfo)
==1900==    by 0x407DEC: main (in /usr/bin/clinfo)
==1900== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1900==
==1900==
==1900== Process terminating with default action of signal 11 (SIGSEGV)
==1900== Bad permissions for mapped region at address 0x0
==1900==    at 0x0: ???
==1900==    by 0x40C5A6: cl::Device::Device(cl::Device const&) (in /usr/bin/clinfo)
==1900==    by 0x405C8E: T.1902 (in /usr/bin/clinfo)
==1900==    by 0x407DEC: main (in /usr/bin/clinfo)

Thanks!

Evren

himanshu_gautam · ‎01-26-2013

Hmm. Well, I was hoping to see some clues, but I can't make out anything except some NULL pointer related stuff. Actually, it is a jump to a NULL pointer.

This reminds me of the ICD, wherein OpenCL implementations obtain the addresses of various APIs and make a call to the address returned.
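
A minimal sketch of that failure mode, assuming a simplified table with a single slot. The names DispatchTable, vendorRetain, and safeRetain are hypothetical and are not part of any real ICD loader; the point is only that calling through an unpopulated slot is a jump to address 0, which matches frame #0 in the backtrace.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for an ICD dispatch table: each slot holds the
// address of one API entry point supplied by a vendor's implementation.
typedef int (*RetainFn)(int device);

struct DispatchTable {
    RetainFn retainDevice;  // stays null if the implementation lacks it
};

// Hypothetical vendor entry point used to populate the slot.
static int vendorRetain(int device) { return device + 1; }

// Calls through the table only when the slot is populated.
// Returns -1 instead of jumping to address 0.
int safeRetain(const DispatchTable& table, int device) {
    if (table.retainDevice == nullptr) {
        return -1;
    }
    return table.retainDevice(device);
}
```

Calling table.retainDevice(device) directly on a table whose slot is null is the unguarded version of the same call.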

To debug this further:

1. Which OpenCL implementation does clinfo link to dynamically?

You can run "ldd" on clinfo. That will show what clinfo is linked against.

Depending on your PATH and LD_LIBRARY_PATH, this can link to either AMD's OpenCL or NVIDIA's OpenCL.

If it is pointing to NVIDIA's OpenCL, please change your PATH or LD_.... so that AMD's OpenCL is linked dynamically.

At the moment I don't have a system at hand, and I really don't know whether clinfo is statically linked or not.

I will check this on Monday.

Meanwhile, can you check and post your findings, if you find time? Thanks,

yurtesen · ‎01-26-2013

I think it has nothing to do with the OpenCL library. It appears clinfo is simply using a wrong pointer when it tries to access the next device. It looks like a programming error in clinfo. It seems to get confused when there are multiple platforms (a guess).

I also found out that the clinfo from APP SDK 2.7 does not cause segmentation faults. I tested this by extracting only the clinfo binary from APP SDK 2.7 and running it on the problem system. So the problem was introduced in clinfo itself, in APP SDK 2.8.

1. I tried different libOpenCL paths.

Causes a segmentation fault:

$ ldd /usr/bin/clinfo
        linux-vdso.so.1 => (0x00007fff5862e000)
        libOpenCL.so.1 => /opt/AMDAPP/lib/x86_64/libOpenCL.so.1 (0x00007f570f80d000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f570f5d2000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f570f2d5000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f570f0d1000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f570eebb000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f570eafb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f570fa15000)

Changed LD_LIBRARY_PATH; it still causes a segmentation fault:

$ ldd /usr/bin/clinfo
        linux-vdso.so.1 => (0x00007fff573ff000)
        libOpenCL.so.1 => /opt/intel/opencl-1.2-3.0.56860/lib64/libOpenCL.so.1 (0x00007f4967ebd000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4967c81000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4967985000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4967781000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f496756b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f49671ab000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f49680c6000)
        libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00007f4966fa0000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4966c9d000)

Nvidia's lib causes a different error:

$ ldd /usr/bin/clinfo
/usr/bin/clinfo: /usr/lib/libOpenCL.so.1: no version information available (required by /usr/bin/clinfo)
/usr/bin/clinfo: /usr/lib/libOpenCL.so.1: no version information available (required by /usr/bin/clinfo)
        linux-vdso.so.1 => (0x00007fff36bff000)
        libOpenCL.so.1 => /usr/lib/libOpenCL.so.1 (0x00007fed1de33000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fed1dbf8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fed1d8fb000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fed1d6f7000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fed1d4e1000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fed1d121000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fed1e03a000)

The error is:

clinfo: relocation error: clinfo: symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference

himanshu_gautam · ‎01-26-2013

Thanks for your time on this. I have an NVIDIA machine running 64-bit Linux. Let me try installing APP SDK 2.8 and check out clinfo.

I will do this on Monday and pass the info on to the AMD engineering team, if needed.

Meanwhile, thanks a lot for your time and for reporting the issue.

himanshu_gautam · ‎01-27-2013

I was able to reproduce the issue on an Intel CPU + NVIDIA GPU machine.

I installed 2.8 (without any graphics driver, because none is needed, so the CPU runtime is used) and it segfaulted while listing NVIDIA's implementation.

2.7 works fine on this machine.

Yurtesen, thanks for reporting this issue and thanks a lot for your time on this. I will pass this on to the AMD engineering team.

Also, I found a few things about the 2.8 clinfo.

clinfo 2.7 is 41096 bytes in size.

clinfo 2.8 is 581608 bytes in size.

This suggests that something has been statically linked into 2.8, which is corroborated by ldd.

libstdc++.so.6 is no longer listed in 2.8's ldd output.

That means some C++ library has been statically linked this time.

I am using g++ version 4.6.3.

I am not sure what has been statically linked into clinfo, or whether it is compatible.

Anyway, I will pass the message on to the AMD engineering team. Thanks for your time again.

yurtesen · ‎01-28-2013

Thanks for your interest in this. I hope it will be fixed in the next SDK.

LeeHowes · ‎01-28-2013

This issue is fixed. It's a bug in cl.hpp, caused by the addition of APIs to the OpenCL ICD with no clean degradation. If you build cl.hpp against a 1.2 SDK it will include 1.2 features, which include clRetainDevice. If you then run it on a 1.1 device, the ICD does not report an error when it sees an empty entry in the jump table; it just tries to call it and fails.

We fixed cl.hpp to do a version check on construction of a device object and carry a flag with it to avoid calling those functions if they are going to cause a problem.

I thought the current SDK had the latest cl.hpp, but just in case try grabbing it from khronos.org http://www.khronos.org/registry/cl/api/1.2/cl.hpp

Note the code for class Wrapper<cl_device_id> in there, which checks if the device is able to be reference counted.
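
A rough sketch of such a version check, assuming the platform version string has the "OpenCL <major>.<minor> ..." layout seen in the clinfo output above. The helper names parseClVersion and supportsDeviceRefCounting are hypothetical; this is not the code from cl.hpp, only an illustration of the idea of deciding up front whether clRetainDevice (added in OpenCL 1.2) is safe to call.

```cpp
#include <cassert>
#include <cstdio>

// Packs "OpenCL <major>.<minor> ..." into (major << 16) | minor.
// Returns 0 when the string does not match that layout.
unsigned parseClVersion(const char* versionString) {
    unsigned major = 0, minor = 0;
    if (std::sscanf(versionString, "OpenCL %u.%u", &major, &minor) != 2) {
        return 0;
    }
    return (major << 16) | minor;
}

// Device retain/release entry points exist only from OpenCL 1.2 on,
// so anything older (or unparseable) is treated as not reference counted.
bool supportsDeviceRefCounting(const char* platformVersion) {
    return parseClVersion(platformVersion) >= ((1u << 16) | 2u);
}
```

With the strings from the output at the top of the thread, "OpenCL 1.1 CUDA 4.2.1" is rejected and "OpenCL 1.2 AMD-APP (1113.2)" is accepted, so a wrapper carrying this flag would skip the retain call on the NVIDIA device.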

Lee

yurtesen · ‎01-28-2013

I am not compiling clinfo (obviously, since I don't have the sources). Does the version of cl.hpp have any consequence? I am not even sure clinfo could find it at runtime.

The AMD APP SDK 2.8 seems to have cl.hpp v1.2, but the latest version is cl.hpp v1.2.1. (I actually have both. It appears Intel's SDK is supplied with v1.2.1.)

Anyway, I am happy to hear that it is fixed. I am simply using the clinfo from AMD APP SDK 2.7 for now as a workaround.

Also, why is clinfo trying to load fglrx? I am not sure what the point would be even if it could load it, because clinfo does not appear to find AMD GPU devices unless X is running, so loading the module wouldn't do any good anyway. (The latest driver I tried was Catalyst 13.1.)

LeeHowes · ‎01-28-2013

Yes, maybe the problem is simply that clinfo was compiled against the wrong version. Hopefully that will not be a problem in the next release.

Not sure about the fglrx issue. Maybe it's because clinfo has to check for the GL sharing extensions?

yurtesen · ‎01-28-2013

LeeHowes wrote:
Not sure about the fglrx issue. Maybe it's because clinfo has to check for the GL sharing extensions?

Well, I don't know about that, but if it is not loaded, then it is not loaded. OpenCL programs cannot find the GPU at all if X is not running (so how would looking for GL sharing help if there is no GPU in sight?). So I imagine loading fglrx wouldn't do any good unless it also set up X and restarted it (and obviously it shouldn't mess with X settings anyway).

I think it is just causing unnecessary and annoying error messages:

-Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

In addition, when Catalyst is installed, fglrx is set to be autoloaded at boot anyway, and the installer simply prints a warning that the machine must be rebooted for everything to be functional. (I am not sure whether the installer tries to load fglrx; I never checked. If not, it should be changed to try.) So basically, if fglrx is not loaded, it simply means that Catalyst is not installed.

I can see that this is just a cosmetic problem. I still think it is a bug that people see errors and misleading recommendations about loading fglrx even on machines that only use the CPU component.

himanshu_gautam · ‎01-29-2013

Hi Yurtesen,

As I replied in an earlier post, the ugly error messages come from the clGetPlatformIDs API. This has been reported to AMD and will be fixed.

Thanks,

Best Regards,

clinfo segmentation fault