OpenCL

jinchuantang · ‎04-10-2023

Dear Dipak,

One of my students was porting some pure C code to an OpenCL kernel and, at a very early stage, ran into a problem on an RX 580 dGPU when calling clBuildProgram. The same code builds without problems on an RX 5700 dGPU and on CPU runtimes (pocl3 and the Intel CPU runtime). The error on the RX 580 is as follows:

i128

Error in has_operand section, at offset 1420:

Address is outside of memory allocated for variable

LLVM ERROR:

Brig container validation has failed in BRIGAsmPrinter.cpp

To reproduce the problem, find the exe and cl files in the bin/Debug folder and run the exe from the terminal/command line.

The files are here: https://pan.baidu.com/s/1bgqnGW2hPgdBE3tGgah52Q

Extraction code: 1234

Thank you very much!

Best wishes,

Jinchuan

dipak · ‎04-17-2023

Hi Jinchuan,

Below are my findings from kernel file "En_kernel_2.txt".

1) For the following typedef (and similar ones), use "unsigned long" or "ulong" instead of "unsigned long long". As stated in section 6.1.4, "Reserved Data Types", of the OpenCL 1.2 spec, "long long" and "unsigned long long" are reserved data types and should not be used by applications as type names.

typedef unsigned long long uint64_t;

2) While trying to build the kernel file with Radeon GPU Analyzer, I observed this compilation error:

"error: incompatible integer to pointer conversion passing '__private uint64_t' (aka '__private unsigned long long') to parameter of type '__private uint64_t *' (aka '__private unsigned long long *'); take the address with & [-Wint-conversion]"

for the following code:

uint64_t tmp1[4] = { 0 }, tmp2 = { 0 }, k = { 0 }, k2 = { 0 },
         num2 = { {2},{0},{0},{0} }, num3 = { {3},{0},{0},{0} };

....

vli_modMult(tmp1, num3, tmp1, Ec->p);//tmp1 = 3*x1^2

Declaring num2 and num3 as below resolved the error:

uint64_t tmp1[4] = { 0 }, tmp2 = { 0 }, k = { 0 }, k2 = { 0 },
         num2[4] = { {2},{0},{0},{0} }, num3[4] = { {3},{0},{0},{0} };
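
The two changes above can be sketched in plain C as follows. This is only a minimal illustration, not the contents of En_kernel_2.txt: the names limb_t, NUM_LIMBS, vli_mul_small and triple_low_limb are hypothetical, and the __OPENCL_VERSION__ guard (a macro predefined by OpenCL C compilers) is used here just so the same lines can be compiled and checked on the host.

```c
/* Inside a kernel, use the built-in 64-bit "ulong" instead of the
   reserved "unsigned long long"; on the host, fall back to a type
   of the same width so this sketch compiles as ordinary C. */
#ifdef __OPENCL_VERSION__
typedef ulong limb_t;               /* OpenCL C built-in 64-bit unsigned type */
#else
typedef unsigned long long limb_t;  /* host-side stand-in of the same width */
#endif

#define NUM_LIMBS 4  /* hypothetical: four 64-bit limbs per big integer */

/* Hypothetical stand-in for a routine such as vli_modMult: it takes
   pointers to limb arrays, so callers must pass arrays, not scalars. */
static void vli_mul_small(limb_t *result, const limb_t *factor, const limb_t *value)
{
    /* Multiplies each limb of value by the single-limb constant in
       factor[0]; carries between limbs are ignored to keep this short. */
    for (int i = 0; i < NUM_LIMBS; ++i)
        result[i] = value[i] * factor[0];
}

/* Returns 3 * x for a small x, using the same call shape as the
   vli_modMult(tmp1, num3, tmp1, ...) line quoted above. */
static limb_t triple_low_limb(limb_t x)
{
    limb_t tmp1[NUM_LIMBS] = { 0 };
    limb_t num3[NUM_LIMBS] = { 3, 0, 0, 0 };  /* an array, so it decays to limb_t* */

    tmp1[0] = x;
    vli_mul_small(tmp1, num3, tmp1);  /* result and value alias, as in the original call */
    return tmp1[0];
}
```

If num3 were declared as a scalar limb_t, the call to vli_mul_small would trigger the same "incompatible integer to pointer conversion" diagnostic reported by Radeon GPU Analyzer.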

Please try the above suggestions (also attached the modified kernel file) and let me know if it resolves the compilation issue.

Thanks.

dipak · ‎04-11-2023

Hi Jinchuan,

Thank you for reporting the above issue. We will look into it.

Could you please provide the setup information like OS, driver version and attach the clinfo output?

Thanks.

jinchuantang · ‎04-11-2023

Dear Dipak,

Please accept my sincere thanks!

The info is as follows:

Adrenalin Edition driver 23.2.2
OS Windows 11 Pro

Microsoft Windows [Version 10.0.22000.1696]
(c) Microsoft Corporation. All rights reserved.

C:\Users\security>clinfo
Number of platforms: 3
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3516.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 D3D12 Implementation
Platform Name: OpenCLOn12
Platform Vendor: Microsoft
Platform Extensions: cl_khr_icd cl_khr_extended_versioning cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 3.0 WINDOWS
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon RX 580 2048SP
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 32
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1284Mhz
Address bits: 64
Max memory allocation: 7073274265
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8589934592
Constant buffer size: 7073274265
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2778306969
Max global variable size: 6365946624
Max global variable preferred total size: 8589934592
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FFF161D2000
Name: Ellesmere
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3516.0 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3516.0)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv

Platform Name: OpenCLOn12
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 8567902208
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020E7C424A30
Name: AMD Radeon RX 580 2048SP
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1414h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 55735267328
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020E7C424A30
Name: Microsoft Basic Render Driver
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event

Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 8086h
Max compute units: 16
Max work items dimensions: 3
Max work items[0]: 8192
Max work items[1]: 8192
Max work items[2]: 8192
Max work group size: 8192
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 32
Native vector width short: 16
Native vector width int: 8
Native vector width long: 4
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 3800Mhz
Address bits: 64
Max memory allocation: 27867633664
Image support: Yes
Max number of images read arguments: 480
Max number of images write arguments: 480
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 480
Max size of kernel argument: 3840
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 262144
Global memory size: 111470534656
Constant buffer size: 131072
Max number of constant args: 480
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16383
Max pipe packet size: 1024
Max global variable size: 65536
Max global variable preferred total size: 65536
Max read/write image args: 480
Max on device events: 4294967295
Queue on device max size: 4294967295
Max on device queues: 4294967295
Queue on device preferred size: 4294967295
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: Yes
Atomics: Yes
Preferred platform atomic alignment: 64
Preferred global atomic alignment: 64
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 128
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 100
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020E7C258B28
Name: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 3.0
Driver version: 2021.12.9.0.24_005321
Profile: FULL_PROFILE
Version: OpenCL 3.0 (Build 0)
Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

dipak · ‎04-11-2023

Hi @jinchuantang,

Thanks for the information.

I was unable to download the repro using the link. Could you please attach the file here (using the "Drag and drop here or browse files to attach" option available above the "Reply/Post" button while posting a reply)?

Thanks.

jinchuantang · ‎04-14-2023

Dear Dipak,

I could not find the option to drag and drop a file here. Instead, here is a link to my SourceForge page, titled "580 clbuildprogram", for your reference.

sourceforge.net/p/octave-ocl-extra/discussion/general/

Best wishes,

Jinchuan

dipak · ‎04-14-2023

Thanks for providing the above link. I was able to download the repro files. I will look into it.

Thanks.

jinchuantang · ‎04-24-2023

Dear Dipak,

Please receive my highest gratitude. Thanks to you, the problem is solved.

Thank you very much!

Best wishes,

Jinchuan

dipak · ‎04-24-2023

Hi Jinchuan,

It's nice to hear that the compilation issue has been resolved.

I would like to suggest one point here. Although OpenCL C is based on C99, it adds some restrictions and extensions. So, when porting pure C code to OpenCL C, it is better to follow the restrictions and modifications described in the spec. Otherwise, depending on the OpenCL compiler, the kernel may produce a compilation error or unexpected results on some platforms/devices.
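
One such difference, sketched here in host-side C: OpenCL C fixes "long" and "ulong" at 64 bits, while the width of "unsigned long" in host C depends on the platform (it is 32 bits on 64-bit Windows). The alias host_ulong and the helper below are hypothetical names for illustration; in real host code the cl_ulong type from the OpenCL headers plays this role.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical alias for the host type that is paired with a kernel's
   64-bit "ulong"; real host code would normally use cl_ulong. */
typedef uint64_t host_ulong;

/* Fail the build early if the paired host type is not 64 bits wide. */
_Static_assert(sizeof(host_ulong) * CHAR_BIT == 64,
               "host type paired with kernel ulong must be 64 bits");

/* Returns 1 if the host's plain "unsigned long" happens to be 64 bits,
   0 otherwise. A 0 here means host buffers declared as "unsigned long"
   would not match the layout of a kernel's "ulong" data. */
static int plain_unsigned_long_is_64_bit(void)
{
    return sizeof(unsigned long) * CHAR_BIT == 64;
}
```

A check like this only covers type widths; other restrictions in the spec still have to be reviewed in the kernel source itself.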

Thanks.

jinchuantang · ‎05-05-2023

Dear Dipak,

Thank you very much for your kind suggestion! You and the AMD community are always very professional and helpful. I will definitely "follow the yellow brick road".

Best wishes,

Jinchuan