OpenCL

jje · ‎04-06-2021

Hi,

I'm evaluating a Radeon Pro W5500 for the next generation of our embedded system. The GPU is however not able to run our OpenCL code. The reason; we are using workgroups sizes greater than 256.

This has not been an issue on any earlier AMD GPUs we have used. E.g. Radeon HD7850, Radeon RX570, Radeon E8860, Radeon Pro WX7100, and others.

We set GPU_MAX_WORKGROUP_SIZE to 1024 which has work so far, but with the W5500 this is no longer enough.

I can specify kernel attribute reqd_work_group_size with a workgroup size greater than 256 (e.g. 517,1,1) and that works fine, but since we run the same kernels with different workgroup sizes this becomes rather cumbersome.

I have tried to set GPU_ENABLE_WAVE32_MODE and GPU_ENABLE_WGP_MODE to 0 but it doesn't seem to make a difference.

I suspect the reduced maximum workgroup size is related to the replacement of CUs with WGP in the RDNA architecture. Can anyone verify that? If it is the case will we have the same problem with all new AMD GPUs?

Is there a workaround so we don't have the rewrite a significant portion of our OpenCL code base?

dipak · ‎04-12-2021

Hi @jje ,

I discussed this topic with the OpenCL compiler team and below is their feedback.

- Use the kernel attribute reqd_work_group_size for workgroup size greater than 256.

- For the best performance, you may go with multiple kernels. If you don’t want that, it might be better if you restrict the workgroup size to 256 since you would likely be losing even more performance if you always use a kernel compiled for workgroup size 1024.

So, in this case, they think the best recommendation is to either limit the workgroup size to 256, or to create and use specialized kernels for larger workgroup sizes.

Thanks.

View solution in original post

fsadough · ‎04-06-2021

On which OS?
Can you provide an OpenCL Info?
https://github.com/marchv/opencl-info

jje · ‎04-07-2021

Hi,

We use Windows 10

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3075.13)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon Pro W5500
Device Topology: PCI[ B#3, D#0, F#0 ]
Max compute units: 11
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1744Mhz
Address bits: 64
Max memory allocation: 7059013632
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7059013632
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2764046336
Max global variable size: 6353112064
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FFA5AEF3FD0
Name: gfx1012
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3075.13 (PAL,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3075.13)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_gl_depth_images cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv

fsadough · ‎04-07-2021

Please be more specific. I need the exact OS version and build. Please provide an AMDZ Report:

AMDZ Report
- Please extract the amdz-v287.zip available from https://we.tl/t-y6giVdBWDp
- Run amdz.exe file as an Administrator
- Select “Save All“ and “txt“ as the output format
- Click on the blue button to save the report
- The .txt file will be saved in the same folder where you extracted the zipped file

jje · ‎04-09-2021

Hi,

Here are the output from AMDZ https://we.tl/t-tNrubUVJgo

Mvh Jakob

fsadough · ‎04-09-2021

Couple of suggestions, meanwhile I will try to repro the issue.

Update your system BIOS
----------BIOS Information--------------
Version: American Megatrends Inc. 1.80 (255.255) 12/26/2019 (16MB)
Update your Windows version
----------OS Information----------------
Name: Windows 10 Enterprise (UEFI)(VBS running)
Version: 10.018363.1(1379).amd64fre.19h1_release.190318-1202
Update your AMD GPU Driver to the latest version (21.Q1.1) from:
https://www.amd.com/en/support/professional-graphics/radeon-pro/radeon-pro-w5000-series/radeon-pro-w...

----------Graphics Information----------
GFX Registry TdrDebugMode=3
Adapter name AMD Radeon Pro W5500
Hardware ID PCI\VEN_1002&DEV_7341&SUBSYS_0B0C1002&REV_00\6&8656465&0&00000008
PCI address BUS: 3, DEV: 0, FUN: 0
Driver info 20.10.27.03-200826a-358464C-RadeonProEnterprise (27.20.11027.3006) 8-26-2020

jje · ‎04-09-2021

Hi @fsadough

I have just updated to the latest AMD driver it didn't make a difference.

Regarding update of BIOS and Windows I don't see it will make much of a difference. If I put another GPU in the computer I have no problems with workgroup sizes above 256. I only see the problem with the W5500.

Mvh Jakob

fsadough · ‎04-09-2021

What do you mean by another GPU?

jje · ‎04-09-2021

Hi,

If install an RX 570 or a HD 7850 or a WX 7100 I have no problem using workgroup size larger than 256

Mvh Jakob

dipak · ‎04-12-2021

Hi @jje ,

I discussed this topic with the OpenCL compiler team and below is their feedback.

- Use the kernel attribute reqd_work_group_size for workgroup size greater than 256.

- For the best performance, you may go with multiple kernels. If you don’t want that, it might be better if you restrict the workgroup size to 256 since you would likely be losing even more performance if you always use a kernel compiled for workgroup size 1024.

So, in this case, they think the best recommendation is to either limit the workgroup size to 256, or to create and use specialized kernels for larger workgroup sizes.

Thanks.

dipak · ‎04-07-2021

Hi @jje ,

I can specify kernel attribute reqd_work_group_size with a workgroup size greater than 256 (e.g. 517,1,1) and that works fine

I think it's the right way to specify a workgroup size greater than 256. Below is a similar discussion where it is recommended to use this kernel attribute for workgroup size greater than 256.

clinfo wrongly reports max work group size 256

I suspect the reduced maximum workgroup size is related to the replacement of CUs with WGP in the RDNA architecture. Can anyone verify that? If it is the case will we have the same problem with all new AMD GPUs?

The section 4.3 (Workgroups) in the RDNA ISA manual says that it supports workgroup size up to 1024 work-items (16 wave64’s or 32 wave32’s).

Thanks.

jje · ‎04-09-2021

Hi @dipak,

I think it's the right way to specify a workgroup size greater than 256.

I'm sure you are right, but since we have a large OpenCL code base (+100.000 lines) that depends on being able to use workgroup sizes greater than 256, and since many of out kernels are used with different workgroup sizes (i.e. we cannot just add the reqd_work_group_size attribute to the affected kernels) it will be a significant rewrite.

Previously setting GPU_MAX_WORKGROUP_SIZE to 1024 have been enough. But apparently not anymore. If the change is not related to the GPU architecture could it then be a driver issue? I have just upgraded to the latest driver and it makes no difference.

Do you know if we can expect the same issue on other newer AMD GPUs or is it limited to W5500? (It is difficult to go out and buy another GPU and test it at the moment 😉 )

Mvh Jakob

dipak · ‎04-09-2021

Hi @jje ,

Without the reqd_work_group_size kernel attribute, the compiler can use a preferred maximum workgroup size for the kernel for a specific device and this value may be different than the CL_DEVICE_MAX_WORK_GROUP_SIZE reported by the openecl runtime. An application can query this kernel specific max. workgroup size by clGetKernelWorkGroupInfo with parameter CL_KERNEL_WORK_GROUP_SIZE. As the spec says about the CL_KERNEL_WORK_GROUP_SIZE:

"As a result and unlike CL_DEVICE_MAX_WORK_GROUP_SIZE this value may vary from one kernel to another as well as one device to another. CL_KERNEL_WORK_GROUP_SIZE will be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE for a given kernel object."

In this case, I think clinfo or runtime reports the max. workgroup size (1024) as set by the environmental variable GPU_MAX_WORKGROUP_SIZE. However, without the kernel attribute, the compiler uses a lower default value which is most likely 256.

It doesn't look to me a driver issue. If possible it's always better to avoid using any undocumented/unofficial environmental variable/settings whose behavior may change in the future.

As I'm aware, below is an old post where a user reported a similar problem where clinfo shows the modified value but clEnqueueNDRangeKernel failed.

how to make GPU_MAX_WORKGROUP_SIZE bigger than 256

Thanks.

OpenCL

Max workgroup size on Radeon Pro W5500