Hello AMD OpenCL Gurus. I am facing a problem when building and running an OpenCL example. Here are the details of my setup:
a.) I installed the driver with amdgpu-pro-install --opencl=legacy --headless
b.) clinfo gives the following output:
Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.1 AMD-APP (2639.3)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
  Device Type: CL_DEVICE_TYPE_GPU
  Vendor ID: 1002h
  Board name: AMD Radeon (TM) R5 M340
  Device Topology: PCI[ B#1, D#0, F#0 ]
  Max compute units: 5
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 1024
  Max work group size: 256
  Preferred vector width char: 4
  Preferred vector width short: 2
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 4
  Native vector width short: 2
  Native vector width int: 1
  Native vector width long: 1
  Native vector width float: 1
  Native vector width double: 1
  Max clock frequency: 750Mhz
  Address bits: 64
  Max memory allocation: 1596905472
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 8
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 1024
  Alignment (bits) of base address: 2048
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 16384
  Global memory size: 2146349056
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Scratchpad
  Local memory size: 32768
  Max pipe arguments: 0
  Max pipe active reservations: 0
  Max pipe packet size: 0
  Max global variable size: 0
  Max global variable preferred total size: 0
  Max read/write image args: 0
  Max on device events: 0
  Queue on device max size: 0
  Max on device queues: 0
  Queue on device preferred size: 0
  SVM capabilities:
    Coarse grain buffer: No
    Fine grain buffer: No
    Fine grain system: No
    Atomics: No
  Preferred platform atomic alignment: 0
  Preferred global atomic alignment: 0
  Preferred local atomic alignment: 0
  Kernel Preferred work group size multiple: 64
  Error correction support: 0
  Unified memory for Host and Device: 0
  Profiling timer resolution: 1
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue on Host properties:
    Out-of-Order: No
    Profiling : Yes
  Queue on Device properties:
    Out-of-Order: No
    Profiling : No
  Platform ID: 0x7fdcfda149f0
  Name: Hainan
  Vendor: Advanced Micro Devices, Inc.
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 2639.3
  Profile: FULL_PROFILE
  Version: OpenCL 1.2 AMD-APP (2639.3)
  Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
c.) As per the AMD website, this card is supported.
d.) I built the helloworld example from
e.) So far so good. However, when I try to run the example I hit the following problems:
1.) "Failed to create commandQueue for device". Other users have reported this as well, and I followed the fix listed here
2.) Once I enabled those exports, the code behaves erratically. The first time I run it, I get garbage output:
-1.70674e+38 -1.70674e+38 2.34731e-38 2.34731e-38 3.63613e+23 3.63613e+23 -1.18942e-23 -3.04413e-21 -1.17842e+08 -2.31435e-32 7.57767e-16 ...... (omitted the rest of the output)
3.) If I run the code again immediately, I get the correct output:
0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 102 105 108....
4.) If I pause and then run the code again, I get garbage again.
So my question is what is happening here? Is there a problem with the OpenCL driver and can we have a fix? Can anyone from AMD comment on this problem?
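Because the failure is intermittent, it helps to check the result buffer on the host instead of judging the printed numbers by eye. The following is only a minimal sketch. It assumes the helloworld kernel computes result[i] = a[i] + b[i] with a[i] = i and b[i] = 2*i, which is consistent with the correct "0 3 6 9 ..." output above but is not confirmed by the post. The function name first_bad_index is hypothetical.

```c
#include <stddef.h>

/* Returns the index of the first element that differs from the expected
 * value 3*i, or -1 if the whole buffer matches.
 * Assumption (not confirmed by the thread): the kernel computes
 * result[i] = a[i] + b[i] with a[i] = i and b[i] = 2*i. */
long first_bad_index(const float *result, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float expected = 3.0f * (float)i;
        /* Written as !(x == y) so that a NaN also counts as a mismatch. */
        if (!(result[i] == expected))
            return (long)i;
    }
    return -1;
}
```

Calling this right after the buffer has been read back and printing the returned index makes it easy to log which runs produced garbage and where in the buffer the garbage starts.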
It looks like a "Hainan" series card, which is based on 1st generation GCN (SI family). As far as I know, amdgpu-pro currently does not support OpenCL on SI cards.
Radeon™ Software for Linux® 18.20 Release Notes
AMD Radeon™ R5 340 is listed there. Is there some way to replicate this problem at your end?
The card above (R5 M340) appears to be a mobile GPU, whereas the R5 340 is a desktop model. In any case, the R5 340 is also based on GCN 1.0, which is not supported for OpenCL. The release notes list the cards that are supported by the base display driver. Please check the "Base Feature support" section which says:
No, it does not support any OpenCL version.
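To tell whether a given card falls into the unsupported SI (GCN 1.0) family, the device "Name:" field from clinfo can be compared against the SI chip code names. The sketch below is illustrative only. The function is_si_codename is hypothetical, and the exact spellings the driver reports (in particular "Capeverde" written as one word) are an assumption. Only "Hainan" and "Tahiti" are mentioned in this thread.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive string equality for plain ASCII names. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns 1 if device_name is one of the SI (GCN 1.0) code names,
 * 0 otherwise. The spellings in the table are assumptions about what
 * the driver reports in the clinfo "Name:" field. */
int is_si_codename(const char *device_name)
{
    static const char *const si_names[] = {
        "Tahiti", "Pitcairn", "Capeverde", "Oland", "Hainan"
    };
    size_t count = sizeof si_names / sizeof si_names[0];
    for (size_t i = 0; i < count; ++i) {
        if (equals_ignore_case(device_name, si_names[i]))
            return 1;
    }
    return 0;
}
```

For the card in this thread, passing the reported name "Hainan" returns 1, which matches the statement that it belongs to the SI family.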
I see the same problem with a Radeon 280X (Tahiti, SI). It looks like the data transfer from device to host is somehow broken. It appears that enqueuing any kernel (even a do-nothing one) after the real kernel fixes the problem. See this example:
https://github.com/rdemaria/sixtracklib_gsoc18/blob/master/studies/study0/hello_workaround.cpp
I have not tested this solution thoroughly, but it looks like it is working. This would explain why several OpenCL codes do actually work with these drivers and devices.
Needless to say, it is very frustrating to be forced to either use a 4-year-old Linux kernel and drivers or spend ages on questionable workarounds.
I would really appreciate it if some driver developer could shed some light on this issue.
All right. However, I ran a C++ example from the same GitHub page and it works fine! No problems at all. The examples failed for clFFT (C-style) and worked for ViennaCL (C++ style). If that's the case I am not worried, as I code mostly in C++.
opencl-book-samples/src/Chapter_12/VectorAdd at master · bgaster/opencl-book-samples · GitHub
It could be that queue.enqueueMapBuffer is doing the right thing, but I did have to use it with the fglrx driver.
At the moment I don't have the hardware to test if it is the case...
I have some updates for you:
a.) Your code in Github does not work for me!
b.) However, your solution in that code works
AMD SI Cards weird results · Issue #254 · ddemidov/vexcl · GitHub
In order to help other users, can you please provide a C-style API version of your workaround? There are still many libraries that use the C-style OpenCL API, notably clFFT and ViennaCL.
Let me add a reply: A C-style workaround is present here
Workaround for AMD-SI cards · GitHub
Is this the only way to do it? Do I have to create a new "programNull" ?
It seems correct to me. Unfortunately I cannot test the code before mid August because I had a hardware failure... I will then get back to this issue...
No problem about that. One clarification is needed. Suppose I have multiple kernels to enqueue: do I have to enqueue a null kernel after every valid kernel, or can it be done only once, after all valid kernels have been enqueued?
I am not sure. It is possible that only the last one is needed, but I would need to check. I would also try to see whether this enqueueMap... found in the C++ is actually doing the same. I have no idea what the issue really is or why the fix works in this case...
So take a look at this:
AMD SI GPU's - Weird result. · Issue #301 · CNugteren/CLBlast · GitHub
This is even weirder! Maybe that is the real solution.