OpenCL

Meteorhead
Challenger

Wrong device OpenCL C version advertised by the OpenCL runtime in 64-bit only

I have reported this same issue on non-public forums, but just for the record I'll write it up here too, in case others come across the same problem.

For the past few months, the OpenCL runtime shipped with the Windows drivers has reported the supported OpenCL C version as 2.0 for Polaris cards (this is a recent regression: around September or October it was still working properly). However, when a kernel uses any of the built-in work-group reduction functions, clBuildProgram fails with:

There is a call to an undefined label
Error: HSAIL program is not finalized successfully.
Codegen phase failed compilation.
Error: BRIG finalization to ISA failed.

This error only happens when building a 64-bit application, where the ICD Loader loads amdocl64.dll. With amdocl.dll (32-bit builds), the runtime doesn't report this language version.
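For anyone checking this on their own machine: the value in question is the string a device returns for CL_DEVICE_OPENCL_C_VERSION (clinfo prints it as "Device OpenCL C version"). Below is a minimal Python sketch, not taken from the SDK sample, that parses such a string. The helper names are made up for illustration. Keep in mind that in this bug the advertised version is exactly the thing that cannot be trusted, so a host would still need to treat a failed clBuildProgram as the signal to fall back.

```python
import re

# The OpenCL specification defines the string format as
# "OpenCL C <major>.<minor> <vendor-specific text>".
_VERSION_RE = re.compile(r"OpenCL C (\d+)\.(\d+)(?:\s|$)")


def parse_opencl_c_version(text):
    """Return (major, minor) from a CL_DEVICE_OPENCL_C_VERSION string.

    Hypothetical helper name; raises ValueError on an unexpected format.
    """
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected OpenCL C version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def version_guarantees_work_group_builtins(text):
    """True only when the version number alone promises work_group_reduce_*.

    OpenCL C 2.x requires the work-group collective functions. In OpenCL C 3.0
    they are an optional feature, so the version number is not enough there and
    the device features have to be queried instead.
    """
    major, _minor = parse_opencl_c_version(text)
    return major == 2
```

With the strings from a clinfo dump, "OpenCL C 2.0" yields (2, 0) and counts as promising the built-ins, while "OpenCL C 1.2" and "OpenCL C 3.0" do not.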

To reproduce, clone the Khronos OpenCL-SDK (including submodules) and configure it with the following CMake options for a minimal build of the console-only samples:

-D OPENCL_SDK_BUILD_SAMPLES=ON
-D OPENCL_SDK_BUILD_OPENGL_SAMPLES=OFF

The reducecpp sample exhibits the behavior. It exhaustively checks whether these built-ins are available via 1.2-, 2.x-, and 3.0-compatible host API usage.
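The repro steps can be sketched as a small Python script that assembles the git and CMake command lines. Only the two -D options come from the post. The directory names, the Release configuration, and the use of "-A x64" / "-A Win32" (which assumes a Visual Studio generator) to switch between the 64-bit and 32-bit builds are assumptions for illustration.

```python
import subprocess

# Public GitHub location of the Khronos OpenCL-SDK.
SDK_URL = "https://github.com/KhronosGroup/OpenCL-SDK.git"


def clone_command(dest="OpenCL-SDK"):
    # --recursive also fetches the submodules the SDK depends on.
    return ["git", "clone", "--recursive", SDK_URL, dest]


def configure_command(source_dir="OpenCL-SDK", build_dir="build", arch="x64"):
    # arch is passed to -A, which Visual Studio generators understand:
    # "x64" for the 64-bit build, "Win32" for the 32-bit one (assumption).
    return [
        "cmake", "-S", source_dir, "-B", build_dir, "-A", arch,
        "-D", "OPENCL_SDK_BUILD_SAMPLES=ON",
        "-D", "OPENCL_SDK_BUILD_OPENGL_SAMPLES=OFF",
    ]


def build_command(build_dir="build", config="Release"):
    return ["cmake", "--build", build_dir, "--config", config]


def run_all(arch="x64"):
    # Runs clone, configure, and build in order; stops at the first failure.
    for cmd in (clone_command(), configure_command(arch=arch), build_command()):
        subprocess.run(cmd, check=True)
```

Calling run_all("x64") and then repeating the configure and build steps with "Win32" in a separate build directory would give both variants to compare.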

You should post this thread in the AMD Developer's Forum under OpenCL, but you must first get "whitelisted" to open a thread there. You can request that here: https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forum

@dipak 

dipak
Big Boss

Hi @Meteorhead ,

Thank you for reporting it. I'm moving the post to the AMD OpenCL forum.

Can you please share your setup details (GPU device, driver version, etc.) and attach the clinfo output?

Thanks.

The driver is AMD Software 22.3.1 (latest at the time of writing), running on an ASUS GL702ZC laptop (B350 chipset, Ryzen R7 1700, Radeon RX 580 (Polaris, gfx803)).

Because I can't attach either TXT or ZIP, here is the clinfo output. There are three runtimes installed:

  • AMD APP
  • OpenCLOn12
  • Intel CPU Runtime

Number of platforms: 3
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3380.6)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 D3D12 Implementation
Platform Name: OpenCLOn12
Platform Vendor: Microsoft
Platform Extensions: cl_khr_icd cl_khr_extended_versioning cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 3.0 WINDOWS
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Radeon RX 580 Series
Device Topology: PCI[ B#12, D#0, F#0 ]
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1077Mhz
Address bits: 64
Max memory allocation: 3422552064
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 4294967296
Constant buffer size: 3422552064
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3422552064
Max global variable size: 3080296704
Max global variable preferred total size: 4294967296
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FFFFFCC6490
Name: Ellesmere
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3380.6 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3380.6)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_gl_depth_images cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv


Platform Name: OpenCLOn12
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1068234752
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 4272939008
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001E62DF6E9F0
Name: Radeon RX 580 Series
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program


Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1414h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 17144451072
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001E62DF6E9F0
Name: Microsoft Basic Render Driver
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program


Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 8086h
Max compute units: 16
Max work items dimensions: 3
Max work items[0]: 8192
Max work items[1]: 8192
Max work items[2]: 8192
Max work group size: 8192
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 32
Native vector width short: 16
Native vector width int: 8
Native vector width long: 4
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 8572225536
Image support: Yes
Max number of images read arguments: 480
Max number of images write arguments: 480
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 480
Max size of kernel argument: 3840
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 524288
Global memory size: 34288902144
Constant buffer size: 131072
Max number of constant args: 480
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16383
Max pipe packet size: 1024
Max global variable size: 65536
Max global variable preferred total size: 65536
Max read/write image args: 480
Max on device events: 4294967295
Queue on device max size: 4294967295
Max on device queues: 4294967295
Queue on device preferred size: 4294967295
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: Yes
Atomics: Yes
Preferred platform atomic alignment: 64
Preferred global atomic alignment: 64
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 128
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 100
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001E62DF82028
Name: AMD Ryzen 7 1700 Eight-Core Processor
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 3.0
Driver version: 2021.12.9.0.24_005321
Profile: FULL_PROFILE
Version: OpenCL 3.0 (Build 0)
Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffe
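A dump this long is easier to compare when reduced to the fields that matter for this issue. The following is a rough Python sketch, assuming clinfo output in the "Key: value" layout shown above where each device section starts with a "Device Type" line. The function name is hypothetical.

```python
def device_versions(clinfo_text):
    """Return a list of (device name, OpenCL C version) pairs.

    Assumes every device section begins with a "Device Type:" line and
    contains "Name:" and "Device OpenCL C version:" lines, as in the dump
    above. Missing fields come back as None.
    """
    wanted = ("Name", "Device OpenCL C version")
    devices = []
    current = None
    for line in clinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Device Type":
            current = {}          # a new device section starts here
            devices.append(current)
        elif current is not None and key in wanted:
            current[key] = value  # "Platform Name" etc. are ignored
    return [(d.get(wanted[0]), d.get(wanted[1])) for d in devices]
```

Applied to the output above, this should list Ellesmere with "OpenCL C 2.0", the two OpenCLOn12 devices with "OpenCL C 1.2", and the Intel CPU device with "OpenCL C 3.0".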

Thank you for sharing the above information. We will try to reproduce the issue and get back to you.

Thanks.


Hi @Meteorhead,

I tried to reproduce the issue with the attached kernel file using the Radeon GPU Analyzer (RGA) tool (https://gpuopen.com/tools/) and observed a similar compilation error:

Building for ellesmere... failed.
Error (reported by the OpenCL Compiler):
ld.lld: error: undefined hidden symbol: wait_group_events(int, ocl_event AS0*)

So it looks like the line below (in reduce.cl) is causing the error. I'll report it to the OpenCL team.

wait_group_events(1, &read); 

Just to verify, can you please comment out the above line and check whether the kernel builds fine?

Thanks.

Yes, I can confirm that when that line is commented out and I only rely on the subsequent barrier() it compiles and executes in 64-bit as well.
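To repeat this check without editing reduce.cl by hand, the kernel source can be patched in memory before it is passed to the build. This is a minimal Python sketch with a hypothetical helper name. It only handles calls that sit on a single line of their own, and it is meant as a diagnostic: as far as I know the OpenCL C specification expects async copies to be waited on with wait_group_events, so this is not a general fix.

```python
import re


def comment_out_calls(kernel_source, function_name="wait_group_events"):
    """Prefix single-line statements that call function_name with '//'.

    Hypothetical diagnostic helper: only lines that start (after indentation)
    with the call and contain the closing ');' are touched.
    """
    pattern = re.compile(
        r"^(\s*)(" + re.escape(function_name) + r"\s*\(.*\);.*)$"
    )
    patched = []
    for line in kernel_source.splitlines():
        match = pattern.match(line)
        if match:
            # Keep the indentation, comment out the statement itself.
            patched.append(f"{match.group(1)}// {match.group(2)}")
        else:
            patched.append(line)
    result = "\n".join(patched)
    # splitlines() drops a trailing newline, so restore it if there was one.
    return result + "\n" if kernel_source.endswith("\n") else result
```

The patched string can then be handed to the program build in place of the file contents, leaving the original kernel file untouched.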


Thank you for confirming it. I've reported the issue to the OpenCL compiler team.

Thanks.


Hi @Meteorhead ,

We need to open a ticket for the issue. It would be helpful if you could provide the intermediate temporary files (such as IL and ISA code) generated by the compiler.

To dump the intermediate temporary files, please set the below environment variable and then run the executable. 

On Windows: set AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps

For more information about this option, see https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-programming-guide.html#amd-developed-su...
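If the executable is launched from a script rather than from a command prompt, the same variable can be set only for the child process. The sketch below is one way to do that in Python. The helper names are made up, and appending to an already set value with a space is an assumption about how multiple options would be combined.

```python
import os
import subprocess

VAR = "AMD_OCL_BUILD_OPTIONS_APPEND"
OPTION = "-save-temps"


def env_with_save_temps(base_env=None):
    """Return a copy of the environment with -save-temps in VAR.

    The input mapping (or os.environ) is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get(VAR, "").strip()
    if OPTION not in existing.split():
        # Assumption: several options are separated by spaces.
        env[VAR] = (existing + " " + OPTION).strip()
    return env


def run_with_save_temps(exe_path, *args):
    """Run the given executable with the modified environment."""
    return subprocess.run([exe_path, *args], env=env_with_save_temps())
```

For example, run_with_save_temps("reducecpp.exe") would start the sample with the variable set, assuming that file name and that it is in the current directory.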

P.S. Please use the original kernel code that was causing the compilation error, i.e. with the "wait_group_events" call.

Thanks.

 


The temporary files from the compilation can be found in this zip file.


Thanks for providing the intermediate temporary files. 

The temporary files appear to have been generated on Fiji (gfx803). If a different setup was used to dump them, please share the setup details.

Thanks. 

 


Update:

A ticket has been opened to investigate the issue. I'll notify you as soon as I get any update on this.

Thanks.
