cancel
Showing results for 
Search instead for 
Did you mean: 

OpenCL

lolliedieb
Adept II

Running OpenCL Work Groups with >256 Elements

Hi all,

I am currently re-writing some OpenCL code of mine and would like to split the work of the group to more waves in order to have more waves in flight. The code is a OpenCL 1.2 code (because it needs to be compatible with Nvidia GPUs as well). On Nvidias running more then 256 items is no problem as long as enough resources are available.

I read that it seems to be possible to run a larger work group OpenCL kernel when Null range is used or when compile time requirement work group size and submit time sizes agree.

Unfortunately I experience the following problem:

I submit a kernel that has attribute header

__attribute__((reqd_work_group_size(256, 2, 1)))

and my submit part looks like

cl_int err = queue.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(8388608,2), cl::NDRange(256,2), NULL, NULL);
So I expect 32768 x 1 work groups with 256x2 items each. Interesting the returned error code is CL_SUCCESS (by the way, switching to 128, 4 gives invalid work group size)

I start the kernel with following two lines:

if (get_global_id(0) == 0) printf("Local Sizes: %d %d \n",get_local_size(0),get_local_size(1));

if (get_group_id(0) == 0) printf("Work Item Id: %d %d \n",get_local_id(0), get_local_id(1));

Interestingly I get 256 and 2 return from the first line, but work group 0 only prints me the local id's (x 0), all  the (x 1) are missing and never executed.

My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),
Its a Ubuntu 18.04 system with recent amdgpu-pro 18.30 running. The GPU is a RX 580 4G (plus a AMD A10 7850 iGPU that is also detected).

clinfo reports

Max work item dimensions                    3
  Max work item sizes                         1024x1024x1024
  Max work group size                         256
  Preferred work group size (AMD)             256
  Max work group size (AMD)                   1024
  Preferred work group size multiple          64
  Wavefront width (AMD)                       64

So, the OpenCL standard value and CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD do not agree.
The kernel uses 33 registers (it compiles well in rga and CodeXL) and 21.0k local memory. So with 256 work items per group I can have only 3 waves per SIMD active while 512 would allow 6 - I hope to get better performance by this (at least on Nvidia the step from 256 to 512 helped a lot)


Any advice to get this running? Unfortunately using NULL local range and hope the the compiler just will give 512 work group size is not an option, because there are lot of hard-coded optimizations to the work group size in the code.

Thanks in advance.

0 Likes
13 Replies

Hi, since this involves OpenCL programming, your best bet to get help in your coding is from AMD OpenCL & Vulkan Forum from here: OpenCL

If you have problems posting on the Forum, AMD Moderator Deepak ​can assist you in getting access.

Ah, you may be right. Sry, did not find the other forum before.
Well I will post to the new user forum there. Seems the better place for my request. Thx very much.

0 Likes
dipak
Big Boss

Hi lolliedieb,

I've whitelisted you and I'm moving this post to our OpenCL forum.

And thank you,elstaci

0 Likes

Hey, Thank you for helping me in the past. Surprised you are up so early in the morning

0 Likes
dipak
Big Boss

If "Max work group size" is reported as 256 then that is the max. limit for work-group size (multiplying all the dimensions i.e. X *Y *Z). Though I'm little bit surprised with the "Max work item sizes" values [1024x1024x1024]. Because, as per my understanding, the limit for each dimension should not be greater than "Max work group size". I need to check this with the concerned team. Interestingly, earlier clinfo used to report it as [256x256x256] which is quite expected.

Meanwhile, could you please attach the clinfo output?

My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),

It is expected one. Currently OpenCL runtime under AMDGPU-Pro supports OpenCL 1.2 only, so it recognizes any OpenCL 2.0+ devices as OpenCL 1.2.

Thanks.

0 Likes

Here is the full clinfo output for both GPUs (Linux):

Number of platforms                               1

  Platform Name                                   AMD Accelerated Parallel Processing

  Platform Vendor                                 Advanced Micro Devices, Inc.

  Platform Version                                OpenCL 2.1 AMD-APP (2671.3)

  Platform Profile                                FULL_PROFILE

  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

  Platform Host timer resolution                  1ns

  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing

Number of devices                                 2

  Device Name                                     Spectre

  Device Vendor                                   Advanced Micro Devices, Inc.

  Device Vendor ID                                0x1002

  Device Version                                  OpenCL 1.2 AMD-APP (2671.3)

  Driver Version                                  2671.3

  Device OpenCL C Version                         OpenCL C 1.2

  Device Type                                     GPU

  Device Board Name (AMD)                         AMD Radeon Graphics

  Device Topology (AMD)                           PCI-E, 00:01.0

  Device Profile                                  FULL_PROFILE

  Device Available                                Yes

  Compiler Available                              Yes

  Linker Available                                Yes

  Max compute units                               8

  SIMD per compute unit (AMD)                     4

  SIMD width (AMD)                                16

  SIMD instruction width (AMD)                    1

  Max clock frequency                             720MHz

  Graphics IP (AMD)                               7.1

  Device Partition                                (core)

    Max number of sub-devices                     8

    Supported partition types                     None

  Max work item dimensions                        3

  Max work item sizes                             1024x1024x1024

  Max work group size                             256

  Preferred work group size (AMD)                 256

  Max work group size (AMD)                       1024

  Preferred work group size multiple              64

  Wavefront width (AMD)                           64

  Preferred / native vector sizes                

    char                                                 4 / 4      

    short                                                2 / 2      

    int                                                  1 / 1      

    long                                                 1 / 1      

    half                                                 1 / 1        (n/a)

    float                                                1 / 1      

    double                                               1 / 1        (cl_khr_fp64)

  Half-precision Floating-point support           (n/a)

  Single-precision Floating-point support         (core)

    Denormals                                     No

    Infinity and NANs                             Yes

    Round to nearest                              Yes

    Round to zero                                 Yes

    Round to infinity                             Yes

    IEEE754-2008 fused multiply-add               Yes

    Support is emulated in software               No

    Correctly-rounded divide and sqrt operations  Yes

  Double-precision Floating-point support         (cl_khr_fp64)

    Denormals                                     Yes

    Infinity and NANs                             Yes

    Round to nearest                              Yes

    Round to zero                                 Yes

    Round to infinity                             Yes

    IEEE754-2008 fused multiply-add               Yes

    Support is emulated in software               No

  Address bits                                    64, Little-Endian

  Global memory size                              4625539072 (4.308GiB)

  Global free memory (AMD)                        8017340 (7.646GiB)

  Global memory channels (AMD)                    4

  Global memory banks per channel (AMD)           8

  Global memory bank width (AMD)                  256 bytes

  Error Correction support                        No

  Max memory allocation                           3078068633 (2.867GiB)

  Unified memory for Host and Device              Yes

  Minimum alignment for any data type             128 bytes

  Alignment of base address                       2048 bits (256 bytes)

  Global Memory cache type                        Read/Write

  Global Memory cache size                        16384 (16KiB)

  Global Memory cache line size                   64 bytes

  Image support                                   Yes

    Max number of samplers per kernel             16

    Max size for 1D images from buffer            134217728 pixels

    Max 1D or 2D image array size                 2048 images

    Base address alignment for 2D image buffers   256 bytes

    Pitch alignment for 2D image buffers          256 pixels

    Max 2D image size                             16384x16384 pixels

    Max 3D image size                             2048x2048x2048 pixels

    Max number of read image args                 128

    Max number of write image args                8

  Local memory type                               Local

  Local memory size                               32768 (32KiB)

  Local memory syze per CU (AMD)                  65536 (64KiB)

  Local memory banks (AMD)                        32

  Max number of constant args                     8

  Max constant buffer size                        3078068633 (2.867GiB)

  Preferred constant buffer size (AMD)            16384 (16KiB)

  Max size of kernel argument                     1024

  Queue properties                               

    Out-of-order execution                        No

    Profiling                                     Yes

  Prefer user sync for interop                    Yes

  Profiling timer resolution                      1ns

  Profiling timer offset since Epoch (AMD)        1539692369995556810ns (Tue Oct 16 14:19:29 2018)

  Execution capabilities                         

    Run OpenCL kernels                            Yes

    Run native kernels                            No

    Thread trace supported (AMD)                  Yes

    Number of async queues (AMD)                  2

    Max real-time compute queues (AMD)            0

    Max real-time compute units (AMD)             94

    SPIR versions                                 1.2

  printf() buffer size                            4194304 (4MiB)

  Built-in kernels                               

  Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

  Device Name                                     Ellesmere

  Device Vendor                                   Advanced Micro Devices, Inc.

  Device Vendor ID                                0x1002

  Device Version                                  OpenCL 1.2 AMD-APP (2671.3)

  Driver Version                                  2671.3

  Device OpenCL C Version                         OpenCL C 1.2

  Device Type                                     GPU

  Device Board Name (AMD)                         Radeon RX 580 Series

  Device Topology (AMD)                           PCI-E, 01:00.0

  Device Profile                                  FULL_PROFILE

  Device Available                                Yes

  Compiler Available                              Yes

  Linker Available                                Yes

  Max compute units                               36

  SIMD per compute unit (AMD)                     4

  SIMD width (AMD)                                16

  SIMD instruction width (AMD)                    1

  Max clock frequency                             1366MHz

  Graphics IP (AMD)                               8.0

  Device Partition                                (core)

    Max number of sub-devices                     36

    Supported partition types                     None

  Max work item dimensions                        3

  Max work item sizes                             1024x1024x1024

  Max work group size                             256

  Preferred work group size (AMD)                 256

  Max work group size (AMD)                       1024

  Preferred work group size multiple              64

  Wavefront width (AMD)                           64

  Preferred / native vector sizes                

    char                                                 4 / 4      

    short                                                2 / 2      

    int                                                  1 / 1      

    long                                                 1 / 1      

    half                                                 1 / 1        (cl_khr_fp16)

    float                                                1 / 1      

    double                                               1 / 1        (cl_khr_fp64)

  Half-precision Floating-point support           (cl_khr_fp16)

    Denormals                                     No

    Infinity and NANs                             No

    Round to nearest                              No

    Round to zero                                 No

    Round to infinity                             No

    IEEE754-2008 fused multiply-add               No

    Support is emulated in software               No

  Single-precision Floating-point support         (core)

    Denormals                                     No

    Infinity and NANs                             Yes

    Round to nearest                              Yes

    Round to zero                                 Yes

    Round to infinity                             Yes

    IEEE754-2008 fused multiply-add               Yes

    Support is emulated in software               No

    Correctly-rounded divide and sqrt operations  Yes

  Double-precision Floating-point support         (cl_khr_fp64)

    Denormals                                     Yes

    Infinity and NANs                             Yes

    Round to nearest                              Yes

    Round to zero                                 Yes

    Round to infinity                             Yes

    IEEE754-2008 fused multiply-add               Yes

    Support is emulated in software               No

  Address bits                                    64, Little-Endian

  Global memory size                              3830943744 (3.568GiB)

  Global free memory (AMD)                        3719780 (3.547GiB)

  Global memory channels (AMD)                    8

  Global memory banks per channel (AMD)           16

  Global memory bank width (AMD)                  256 bytes

  Error Correction support                        No

  Max memory allocation                           3074440806 (2.863GiB)

  Unified memory for Host and Device              No

  Minimum alignment for any data type             128 bytes

  Alignment of base address                       2048 bits (256 bytes)

  Global Memory cache type                        Read/Write

  Global Memory cache size                        16384 (16KiB)

  Global Memory cache line size                   64 bytes

  Image support                                   Yes

    Max number of samplers per kernel             16

    Max size for 1D images from buffer            134217728 pixels

    Max 1D or 2D image array size                 2048 images

    Base address alignment for 2D image buffers   256 bytes

    Pitch alignment for 2D image buffers          256 pixels

    Max 2D image size                             16384x16384 pixels

    Max 3D image size                             2048x2048x2048 pixels

    Max number of read image args                 128

    Max number of write image args                8

  Local memory type                               Local

  Local memory size                               32768 (32KiB)

  Local memory syze per CU (AMD)                  65536 (64KiB)

  Local memory banks (AMD)                        32

  Max number of constant args                     8

  Max constant buffer size                        3074440806 (2.863GiB)

  Preferred constant buffer size (AMD)            16384 (16KiB)

  Max size of kernel argument                     1024

  Queue properties                               

    Out-of-order execution                        No

    Profiling                                     Yes

  Prefer user sync for interop                    Yes

  Profiling timer resolution                      1ns

  Profiling timer offset since Epoch (AMD)        1539692369995556810ns (Tue Oct 16 14:19:29 2018)

  Execution capabilities                         

    Run OpenCL kernels                            Yes

    Run native kernels                            No

    Thread trace supported (AMD)                  Yes

    Number of async queues (AMD)                  2

    Max real-time compute queues (AMD)            0

    Max real-time compute units (AMD)             120

    SPIR versions                                 1.2

  printf() buffer size                            4194304 (4MiB)

  Built-in kernels                               

  Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

NULL platform behavior

  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform

  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform

  clCreateContext(NULL, ...) [default]            No platform

  clCreateContext(NULL, ...) [other]              Success [AMD]

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)

    Platform Name                                 AMD Accelerated Parallel Processing

    Device Name                                   Spectre

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (2)

    Platform Name                                 AMD Accelerated Parallel Processing

    Device Name                                   Spectre

    Device Name                                   Ellesmere

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform

  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (2)

    Platform Name                                 AMD Accelerated Parallel Processing

    Device Name                                   Spectre

    Device Name                                   Ellesmere

By the way very interesting:
Without the RX 580 I was not able to install this drivers on the machine (The APUs officially not supported) but now the iGPU runs just fine on this drivers, graphics and compute. I wonder why this artificial restriction then ^^

0 Likes
leyvin
Miniboss

The 1024x1024x1024 is listed because that's the Maximum per Dimension., i.e. 1024x1x1, 1x1024x1, or 1x1x1024... although functionally these are identical.

Yes, this can be a little confusing but listing a 256x256x256 is simply incorrect as that's not the Maximum that each Dimension can be assigned.

1024 is the OpenCL Specification for Maximum Dimension Size.

Now I'd recommend (highly) reading the ISA for GCN 3rd Gen., as it helps understand how and why the OpenCL Limitations exist, but more importantly the best approach to setting Thread Affinity in the OpenCL Kernel.

Note you have 16kB Per CU and 32kB Shared Cache to work with., what's more you have a Maximum 16 Instruction Queue... and each Object contains 32bit Instruction + 32bit Data., so they actually use double the Cache you might think they do.

i.e. 1024 Objects = 16kB *not* 8kB as you might think.

This as a note is why Asynchronous Operation is costless, unlike on NVIDIA; where there is a Memory Access Latency penalty to transfer between Shader Modules.

GCN is also setup to where each Compute Unit has 16 SIMD, 4 Pipelines and 64 Threads (per Cycle) but are packaged in Complexes of 4 Compute Units.

"Global" Cache is only shared between these 4 Units., not the whole GPU... as such while an individual CU might be capable of storing and processing up to 1024 Objects, Shared across 4; you're looking at a maximum of 256 Asynchronously., but as noted because you can only process 64 Per Cycle Per CU; this really shouldn't be an issue as that's the absolute maximum that can be processed without Barriers.

This as a note means that there's 16kB for the Data + 16kB for the Kernel within the Shared (Global) Cache., if you want a bigger Kernel, then you have to have fewer workgroups.

NVIDIA not only has double the available Threads per Shader Module., but also keep in mind that the code you've written isn't necessarily what is actually being used at runtime... an issue I've had with NVIDIA for some time, is that they have a habit of doing Black Box approaches for "Optimisation"., which can often result in unwanted behaviour or sometime frankly incorrect output.

NVIDIA here is the "Special Case" Exception., not the Rule. It's something I'd heavily recommend anyone doing Development always keeps in the back of their mind., and is a main reason why I don't do initial Development with either Intel or NVIDIA Hardware., as it can lead to some false assumptions in terms of Support and Specification, that will result in it being very difficult to then port to other platforms later.

0 Likes

leyvin
Thx for your reply, but I somehow do not really get the connection between your words about GCN ISA and my work group problem.

As far as I know AMD cards are not limited to 256 elements in one work group. For example in DirectCompute and HSA as well the limit is 1024 and also here CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD returns 1024 as maximal extended work group size. My question targets: how to use this?

Because my problem is that for some reason scheduling 512 x 1 is returned as CL_INVALID_WORK_GROUP_SIZE (I do not understand why) while for some reason 256 x 2 is accepted (enqueueNDRangeKernel returns no error) and the kernel itself recognizes local_size(0) as 256 and local_size(1) as 2 ... so that seems to be fine, but local threads 0-255 x 1 are never executed while 0-255 x 0 are working as expected.

I do not think I am running into resource problems here, because the kernel uses
32 vgpr / thread
~27.5 kb local memory / work group
54 sregs / work group 
and has an ISA size of only 4k, so instruction cache should be fine as well.

So right now having two of this groups active at the same time is no problem at all. I would expect if the problem would be on the resource side that then I only got one work group with twice the threads in flight. But this is not the case.

You may be right that stepping from 256 threads / group to 512 does not have an positive effect for my code. But since my code is mainly global memory latency & global atomics bound I already saw a nice improvement stepping from 128 to 192 to 256 elements per work group. Also running at larger sizes may allow me to use same code on Nvidia and AMD as well - I am only one person working on this code but it must run on both and further on both Windows and Linux. Thats holding me back from switching to CUDA + HSA or Vulcan or ROCm because my development effort would be too huge maintaining more then one platform.

0 Likes

Hi,

I have 3 years old experiences making GCN asm kernels on HD7770 (GCN 2nd gen, I guess) and trying to alloc max LDS with the widest groupsize.

workitems: totalStreamsOnCard*4*2

groupsize: 256

lds size/group: 32768 (this is the maximum per group, not 64)

vregs: 128

Important thing is that I use one dimensional GID and if there is more actual work that the above workitemsize, I implement a loop in the kernel. This way I can get workitems that working simultaneously for half a second. With 1D GID you can go much more above that 1K OpenCL 'limit'.

This thing was able to fulfill each CU: allocate all 64K LDS, used up all the VRegs, while letting me synchronize the groups with ds_barrier.

I guess it is not good news... I also tried to find a way to group the most threads as I can, but that was the max (for me at least).

0 Likes

Hi lolliedieb,

As I've come to know, 256 is the default max. work group size so that any OpenCL application should work properly. Whereas, AMD version (i.e.1024) shows max. HW capability and user can use that value by forcing dimensions with __attribute__((reqd_work_group_size(X, Y, Z))). However, the developers must understand the performance impact of such request.

By the way, when I ran the code you mentioned on the first post, I got the expected output on a Windows m/c. Currently, I don't have a Linux setup to test it. I have a suggestion though. Instead of printf, could you please try some other methods (such as write to a buffer) to verify that work items corresponding to get_local_id(1) are executing or not? If you still see the problem, please share the repro. I'll report it to the concerned team.

Thanks.

0 Likes

Well the printf is usually not part of my kernel. When I ran it the first time I only had the usual working code in it and wondered why it executed much faster then I anticipated. Until I realized that only a fraction of my output data was written. The kernels amount of output is a quadratic function in the number of input data and so while researching the reason I found out that only first half of input data was read and the second half - belonging to threads 256-511 was missing.
Thats why I put the printf in to see whats going on.

I will later try to set up a Windows installation on the same machine to compare if there is a difference in behavior between Linux and Windows driver.

0 Likes
lolliedieb
Adept II

Hey guys,
I wanted to reply back a thing recently learned for all googlers:

It turned out that what I was asking for is well possible, but requires a (seems undocumented) environment variable to be set. Namely if I set GPU_MAX_WORKGROUP_SIZE=1024 I can go up to that value, its even shown then in cl_info.

But I wonder why there is this switch but information regarding it is very poor, but its working just fine as long as the kernel has enough resources.

Now new question:
Any chance to activate full usage of 64 kBytes local memory (instead of 32) per work group? The kernel compile well but then given an out of resources run time error if I try to use them ^^

Have a nice day

0 Likes

GCN 1st to 4th Gen only have 32kB Local., GCN 5th Gen has the full 64kB that was planned but scrapped, however the ISA was never updated to reflect this.

I believe it was corrected in the GCN3 (Update) ISA … although only briefly in the accompanying text not the Architecture Diagram. 

Not entirely sure the reason for said reduced Caches over the ISA, but likely a cost-saving measure. 

If you look at the OpenCL (easiest way is via GPU-Z)., it should list 256kB Queue • 16kB Global (GDS) • 32kB Local (LDS); which should be the same from the HD 7770 up to Polaris 30., and technically the Ryzen with Vega Graphics too, as they're Polaris as well. 

RX Vega 56 / 64 / FE and Radeon VII however should list 512kB Queue • 64kB Global (GDS) • 64kB Local (LDS).

Although if you want portability, I'd stick within the bounds of GCN 1st to 4th Gen. 

0 Likes