11 Replies Latest reply on Oct 22, 2018 5:56 AM by lolliedieb

    Running OpenCL Work Groups with >256 Elements

    lolliedieb

      Hi all,

       

      I am currently re-writing some OpenCL code of mine and would like to split the work of the group to more waves in order to have more waves in flight. The code is a OpenCL 1.2 code (because it needs to be compatible with Nvidia GPUs as well). On Nvidias running more then 256 items is no problem as long as enough resources are available.

      I read that it seems to be possible to run a larger work group OpenCL kernel when Null range is used or when compile time requirement work group size and submit time sizes agree.

       

      Unfortunately I experience the following problem:

      I submit a kernel that has attribute header

      __attribute__((reqd_work_group_size(256, 2, 1)))

      and my submit part looks like

      cl_int err = queue.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(8388608,2), cl::NDRange(256,2), NULL, NULL);
      So I expect 32768 x 1 work groups with 256x2 items each. Interesting the returned error code is CL_SUCCESS (by the way, switching to 128, 4 gives invalid work group size)

       

      I start the kernel with following two lines:

      if (get_global_id(0) == 0) printf("Local Sizes: %d %d \n",get_local_size(0),get_local_size(1));

      if (get_group_id(0) == 0) printf("Work Item Id: %d %d \n",get_local_id(0), get_local_id(1));

      Interestingly I get 256 and 2 return from the first line, but work group 0 only prints me the local id's (x 0), all  the (x 1) are missing and never executed.

       

      My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),
      Its a Ubuntu 18.04 system with recent amdgpu-pro 18.30 running. The GPU is a RX 580 4G (plus a AMD A10 7850 iGPU that is also detected).

       

      clinfo reports

      Max work item dimensions                    3
        Max work item sizes                         1024x1024x1024
        Max work group size                         256
        Preferred work group size (AMD)             256
        Max work group size (AMD)                   1024
        Preferred work group size multiple          64
        Wavefront width (AMD)                       64

       

      So, the OpenCL standard value and CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD do not agree.
      The kernel uses 33 registers (it compiles well in rga and CodeXL) and 21.0k local memory. So with 256 work items per group I can have only 3 waves per SIMD active while 512 would allow 6 - I hope to get better performance by this (at least on Nvidia the step from 256 to 512 helped a lot)


      Any advice to get this running? Unfortunately using NULL local range and hope the the compiler just will give 512 work group size is not an option, because there are lot of hard-coded optimizations to the work group size in the code.

       

      Thanks in advance.

        • Re: Running OpenCL Work Groups with >256 Elements
          elstaci

          Hi, since this involves OpenCL programming, your best bet to get help in your coding is from AMD OpenCL & Vulkan Forum from here: OpenCL

           

          If you have problems posting on the Forum, AMD Moderator dipakcan assist you in getting access.

          1 of 1 people found this helpful
          • Re: Running OpenCL Work Groups with >256 Elements
            dipak

            Hi lolliedieb,

            I've whitelisted you and I'm moving this post to our OpenCL forum.

             

            And thank you,elstaci.

            • Re: Running OpenCL Work Groups with >256 Elements
              dipak

              If "Max work group size" is reported as 256 then that is the max. limit for work-group size (multiplying all the dimensions i.e. X *Y *Z). Though I'm little bit surprised with the "Max work item sizes" values [1024x1024x1024]. Because, as per my understanding, the limit for each dimension should not be greater than "Max work group size". I need to check this with the concerned team. Interestingly, earlier clinfo used to report it as [256x256x256] which is quite expected.

              Meanwhile, could you please attach the clinfo output?

               

               

              My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),

              It is expected one. Currently OpenCL runtime under AMDGPU-Pro supports OpenCL 1.2 only, so it recognizes any OpenCL 2.0+ devices as OpenCL 1.2.

               

               

              Thanks.

                • Re: Running OpenCL Work Groups with >256 Elements
                  lolliedieb

                  Here is the full clinfo output for both GPUs (Linux):

                  Number of platforms                               1

                    Platform Name                                   AMD Accelerated Parallel Processing

                    Platform Vendor                                 Advanced Micro Devices, Inc.

                    Platform Version                                OpenCL 2.1 AMD-APP (2671.3)

                    Platform Profile                                FULL_PROFILE

                    Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

                    Platform Host timer resolution                  1ns

                    Platform Extensions function suffix             AMD

                   

                    Platform Name                                   AMD Accelerated Parallel Processing

                  Number of devices                                 2

                    Device Name                                     Spectre

                    Device Vendor                                   Advanced Micro Devices, Inc.

                    Device Vendor ID                                0x1002

                    Device Version                                  OpenCL 1.2 AMD-APP (2671.3)

                    Driver Version                                  2671.3

                    Device OpenCL C Version                         OpenCL C 1.2

                    Device Type                                     GPU

                    Device Board Name (AMD)                         AMD Radeon Graphics

                    Device Topology (AMD)                           PCI-E, 00:01.0

                    Device Profile                                  FULL_PROFILE

                    Device Available                                Yes

                    Compiler Available                              Yes

                    Linker Available                                Yes

                    Max compute units                               8

                    SIMD per compute unit (AMD)                     4

                    SIMD width (AMD)                                16

                    SIMD instruction width (AMD)                    1

                    Max clock frequency                             720MHz

                    Graphics IP (AMD)                               7.1

                    Device Partition                                (core)

                      Max number of sub-devices                     8

                      Supported partition types                     None

                    Max work item dimensions                        3

                    Max work item sizes                             1024x1024x1024

                    Max work group size                             256

                    Preferred work group size (AMD)                 256

                    Max work group size (AMD)                       1024

                    Preferred work group size multiple              64

                    Wavefront width (AMD)                           64

                    Preferred / native vector sizes                

                      char                                                 4 / 4      

                      short                                                2 / 2      

                      int                                                  1 / 1      

                      long                                                 1 / 1      

                      half                                                 1 / 1        (n/a)

                      float                                                1 / 1      

                      double                                               1 / 1        (cl_khr_fp64)

                    Half-precision Floating-point support           (n/a)

                    Single-precision Floating-point support         (core)

                      Denormals                                     No

                      Infinity and NANs                             Yes

                      Round to nearest                              Yes

                      Round to zero                                 Yes

                      Round to infinity                             Yes

                      IEEE754-2008 fused multiply-add               Yes

                      Support is emulated in software               No

                      Correctly-rounded divide and sqrt operations  Yes

                    Double-precision Floating-point support         (cl_khr_fp64)

                      Denormals                                     Yes

                      Infinity and NANs                             Yes

                      Round to nearest                              Yes

                      Round to zero                                 Yes

                      Round to infinity                             Yes

                      IEEE754-2008 fused multiply-add               Yes

                      Support is emulated in software               No

                    Address bits                                    64, Little-Endian

                    Global memory size                              4625539072 (4.308GiB)

                    Global free memory (AMD)                        8017340 (7.646GiB)

                    Global memory channels (AMD)                    4

                    Global memory banks per channel (AMD)           8

                    Global memory bank width (AMD)                  256 bytes

                    Error Correction support                        No

                    Max memory allocation                           3078068633 (2.867GiB)

                    Unified memory for Host and Device              Yes

                    Minimum alignment for any data type             128 bytes

                    Alignment of base address                       2048 bits (256 bytes)

                    Global Memory cache type                        Read/Write

                    Global Memory cache size                        16384 (16KiB)

                    Global Memory cache line size                   64 bytes

                    Image support                                   Yes

                      Max number of samplers per kernel             16

                      Max size for 1D images from buffer            134217728 pixels

                      Max 1D or 2D image array size                 2048 images

                      Base address alignment for 2D image buffers   256 bytes

                      Pitch alignment for 2D image buffers          256 pixels

                      Max 2D image size                             16384x16384 pixels

                      Max 3D image size                             2048x2048x2048 pixels

                      Max number of read image args                 128

                      Max number of write image args                8

                    Local memory type                               Local

                    Local memory size                               32768 (32KiB)

                    Local memory syze per CU (AMD)                  65536 (64KiB)

                    Local memory banks (AMD)                        32

                    Max number of constant args                     8

                    Max constant buffer size                        3078068633 (2.867GiB)

                    Preferred constant buffer size (AMD)            16384 (16KiB)

                    Max size of kernel argument                     1024

                    Queue properties                               

                      Out-of-order execution                        No

                      Profiling                                     Yes

                    Prefer user sync for interop                    Yes

                    Profiling timer resolution                      1ns

                    Profiling timer offset since Epoch (AMD)        1539692369995556810ns (Tue Oct 16 14:19:29 2018)

                    Execution capabilities                         

                      Run OpenCL kernels                            Yes

                      Run native kernels                            No

                      Thread trace supported (AMD)                  Yes

                      Number of async queues (AMD)                  2

                      Max real-time compute queues (AMD)            0

                      Max real-time compute units (AMD)             94

                      SPIR versions                                 1.2

                    printf() buffer size                            4194304 (4MiB)

                    Built-in kernels                               

                    Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

                   

                    Device Name                                     Ellesmere

                    Device Vendor                                   Advanced Micro Devices, Inc.

                    Device Vendor ID                                0x1002

                    Device Version                                  OpenCL 1.2 AMD-APP (2671.3)

                    Driver Version                                  2671.3

                    Device OpenCL C Version                         OpenCL C 1.2

                    Device Type                                     GPU

                    Device Board Name (AMD)                         Radeon RX 580 Series

                    Device Topology (AMD)                           PCI-E, 01:00.0

                    Device Profile                                  FULL_PROFILE

                    Device Available                                Yes

                    Compiler Available                              Yes

                    Linker Available                                Yes

                    Max compute units                               36

                    SIMD per compute unit (AMD)                     4

                    SIMD width (AMD)                                16

                    SIMD instruction width (AMD)                    1

                    Max clock frequency                             1366MHz

                    Graphics IP (AMD)                               8.0

                    Device Partition                                (core)

                      Max number of sub-devices                     36

                      Supported partition types                     None

                    Max work item dimensions                        3

                    Max work item sizes                             1024x1024x1024

                    Max work group size                             256

                    Preferred work group size (AMD)                 256

                    Max work group size (AMD)                       1024

                    Preferred work group size multiple              64

                    Wavefront width (AMD)                           64

                    Preferred / native vector sizes                

                      char                                                 4 / 4      

                      short                                                2 / 2      

                      int                                                  1 / 1      

                      long                                                 1 / 1      

                      half                                                 1 / 1        (cl_khr_fp16)

                      float                                                1 / 1      

                      double                                               1 / 1        (cl_khr_fp64)

                    Half-precision Floating-point support           (cl_khr_fp16)

                      Denormals                                     No

                      Infinity and NANs                             No

                      Round to nearest                              No

                      Round to zero                                 No

                      Round to infinity                             No

                      IEEE754-2008 fused multiply-add               No

                      Support is emulated in software               No

                    Single-precision Floating-point support         (core)

                      Denormals                                     No

                      Infinity and NANs                             Yes

                      Round to nearest                              Yes

                      Round to zero                                 Yes

                      Round to infinity                             Yes

                      IEEE754-2008 fused multiply-add               Yes

                      Support is emulated in software               No

                      Correctly-rounded divide and sqrt operations  Yes

                    Double-precision Floating-point support         (cl_khr_fp64)

                      Denormals                                     Yes

                      Infinity and NANs                             Yes

                      Round to nearest                              Yes

                      Round to zero                                 Yes

                      Round to infinity                             Yes

                      IEEE754-2008 fused multiply-add               Yes

                      Support is emulated in software               No

                    Address bits                                    64, Little-Endian

                    Global memory size                              3830943744 (3.568GiB)

                    Global free memory (AMD)                        3719780 (3.547GiB)

                    Global memory channels (AMD)                    8

                    Global memory banks per channel (AMD)           16

                    Global memory bank width (AMD)                  256 bytes

                    Error Correction support                        No

                    Max memory allocation                           3074440806 (2.863GiB)

                    Unified memory for Host and Device              No

                    Minimum alignment for any data type             128 bytes

                    Alignment of base address                       2048 bits (256 bytes)

                    Global Memory cache type                        Read/Write

                    Global Memory cache size                        16384 (16KiB)

                    Global Memory cache line size                   64 bytes

                    Image support                                   Yes

                      Max number of samplers per kernel             16

                      Max size for 1D images from buffer            134217728 pixels

                      Max 1D or 2D image array size                 2048 images

                      Base address alignment for 2D image buffers   256 bytes

                      Pitch alignment for 2D image buffers          256 pixels

                      Max 2D image size                             16384x16384 pixels

                      Max 3D image size                             2048x2048x2048 pixels

                      Max number of read image args                 128

                      Max number of write image args                8

                    Local memory type                               Local

                    Local memory size                               32768 (32KiB)

                    Local memory syze per CU (AMD)                  65536 (64KiB)

                    Local memory banks (AMD)                        32

                    Max number of constant args                     8

                    Max constant buffer size                        3074440806 (2.863GiB)

                    Preferred constant buffer size (AMD)            16384 (16KiB)

                    Max size of kernel argument                     1024

                    Queue properties                               

                      Out-of-order execution                        No

                      Profiling                                     Yes

                    Prefer user sync for interop                    Yes

                    Profiling timer resolution                      1ns

                    Profiling timer offset since Epoch (AMD)        1539692369995556810ns (Tue Oct 16 14:19:29 2018)

                    Execution capabilities                         

                      Run OpenCL kernels                            Yes

                      Run native kernels                            No

                      Thread trace supported (AMD)                  Yes

                      Number of async queues (AMD)                  2

                      Max real-time compute queues (AMD)            0

                      Max real-time compute units (AMD)             120

                      SPIR versions                                 1.2

                    printf() buffer size                            4194304 (4MiB)

                    Built-in kernels                               

                    Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

                   

                  NULL platform behavior

                    clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform

                    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform

                    clCreateContext(NULL, ...) [default]            No platform

                    clCreateContext(NULL, ...) [other]              Success [AMD]

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)

                      Platform Name                                 AMD Accelerated Parallel Processing

                      Device Name                                   Spectre

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (2)

                      Platform Name                                 AMD Accelerated Parallel Processing

                      Device Name                                   Spectre

                      Device Name                                   Ellesmere

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform

                    clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (2)

                      Platform Name                                 AMD Accelerated Parallel Processing

                      Device Name                                   Spectre

                      Device Name                                   Ellesmere

                   

                  By the way very interesting:
                  Without the RX 580 I was not able to install this drivers on the machine (The APUs officially not supported) but now the iGPU runs just fine on this drivers, graphics and compute. I wonder why this artificial restriction then ^^

                • Re: Running OpenCL Work Groups with >256 Elements
                  leyvin

                  The 1024x1024x1024 is listed because that's the Maximum per Dimension., i.e. 1024x1x1, 1x1024x1, or 1x1x1024... although functionally these are identical.

                  Yes, this can be a little confusing but listing a 256x256x256 is simply incorrect as that's not the Maximum that each Dimension can be assigned.

                   

                  1024 is the OpenCL Specification for Maximum Dimension Size.

                  Now I'd recommend (highly) reading the ISA for GCN 3rd Gen., as it helps understand how and why the OpenCL Limitations exist, but more importantly the best approach to setting Thread Affinity in the OpenCL Kernel.

                   

                  Note you have 16kB Per CU and 32kB Shared Cache to work with., what's more you have a Maximum 16 Instruction Queue... and each Object contains 32bit Instruction + 32bit Data., so they actually use double the Cache you might think they do.

                  i.e. 1024 Objects = 16kB *not* 8kB as you might think.

                  This as a note is why Asynchronous Operation is costless, unlike on NVIDIA; where there is a Memory Access Latency penalty to transfer between Shader Modules.

                   

                  GCN is also setup to where each Compute Unit has 16 SIMD, 4 Pipelines and 64 Threads (per Cycle) but are packaged in Complexes of 4 Compute Units.

                  "Global" Cache is only shared between these 4 Units., not the whole GPU... as such while an individual CU might be capable of storing and processing up to 1024 Objects, Shared across 4; you're looking at a maximum of 256 Asynchronously., but as noted because you can only process 64 Per Cycle Per CU; this really shouldn't be an issue as that's the absolute maximum that can be processed without Barriers.

                   

                  This as a note means that there's 16kB for the Data + 16kB for the Kernel within the Shared (Global) Cache., if you want a bigger Kernel, then you have to have fewer workgroups.

                  NVIDIA not only has double the available Threads per Shader Module., but also keep in mind that the code you've written isn't necessarily what is actually being used at runtime... an issue I've had with NVIDIA for some time, is that they have a habit of doing Black Box approaches for "Optimisation"., which can often result in unwanted behaviour or sometime frankly incorrect output.

                   

                  NVIDIA here is the "Special Case" Exception., not the Rule. It's something I'd heavily recommend anyone doing Development always keeps in the back of their mind., and is a main reason why I don't do initial Development with either Intel or NVIDIA Hardware., as it can lead to some false assumptions in terms of Support and Specification, that will result in it being very difficult to then port to other platforms later.

                    • Re: Running OpenCL Work Groups with >256 Elements
                      lolliedieb

                      leyvin
                      Thx for your reply, but I somehow do not really get the connection between your words about GCN ISA and my work group problem.

                       

                      As far as I know AMD cards are not limited to 256 elements in one work group. For example in DirectCompute and HSA as well the limit is 1024 and also here CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD returns 1024 as maximal extended work group size. My question targets: how to use this?

                       

                      Because my problem is that for some reason scheduling 512 x 1 is returned as CL_INVALID_WORK_GROUP_SIZE (I do not understand why) while for some reason 256 x 2 is accepted (enqueueNDRangeKernel returns no error) and the kernel itself recognizes local_size(0) as 256 and local_size(1) as 2 ... so that seems to be fine, but local threads 0-255 x 1 are never executed while 0-255 x 0 are working as expected.

                       

                      I do not think I am running into resource problems here, because the kernel uses
                      32 vgpr / thread
                      ~27.5 kb local memory / work group
                      54 sregs / work group 
                      and has an ISA size of only 4k, so instruction cache should be fine as well.

                       

                      So right now having two of this groups active at the same time is no problem at all. I would expect if the problem would be on the resource side that then I only got one work group with twice the threads in flight. But this is not the case.

                       

                      You may be right that stepping from 256 threads / group to 512 does not have an positive effect for my code. But since my code is mainly global memory latency & global atomics bound I already saw a nice improvement stepping from 128 to 192 to 256 elements per work group. Also running at larger sizes may allow me to use same code on Nvidia and AMD as well - I am only one person working on this code but it must run on both and further on both Windows and Linux. Thats holding me back from switching to CUDA + HSA or Vulcan or ROCm because my development effort would be too huge maintaining more then one platform.

                        • Re: Running OpenCL Work Groups with >256 Elements
                          realhet

                          Hi,

                           

                          I have 3 years old experiences making GCN asm kernels on HD7770 (GCN 2nd gen, I guess) and trying to alloc max LDS with the widest groupsize.

                           

                          workitems: totalStreamsOnCard*4*2

                          groupsize: 256

                          lds size/group: 32768 (this is the maximum per group, not 64)

                          vregs: 128

                           

                          Important thing is that I use one dimensional GID and if there is more actual work that the above workitemsize, I implement a loop in the kernel. This way I can get workitems that working simultaneously for half a second. With 1D GID you can go much more above that 1K OpenCL 'limit'.

                           

                          This thing was able to fulfill each CU: allocate all 64K LDS, used up all the VRegs, while letting me synchronize the groups with ds_barrier.

                          I guess it is not good news... I also tried to find a way to group the most threads as I can, but that was the max (for me at least).

                          • Re: Running OpenCL Work Groups with >256 Elements
                            dipak

                            Hi lolliedieb,

                            As I've come to know, 256 is the default max. work group size so that any OpenCL application should work properly. Whereas, AMD version (i.e.1024) shows max. HW capability and user can use that value by forcing dimensions with __attribute__((reqd_work_group_size(X, Y, Z))). However, the developers must understand the performance impact of such request.

                             

                            By the way, when I ran the code you mentioned on the first post, I got the expected output on a Windows m/c. Currently, I don't have a Linux setup to test it. I have a suggestion though. Instead of printf, could you please try some other methods (such as write to a buffer) to verify that work items corresponding to get_local_id(1) are executing or not? If you still see the problem, please share the repro. I'll report it to the concerned team.

                             

                            Thanks.

                              • Re: Running OpenCL Work Groups with >256 Elements
                                lolliedieb

                                Well the printf is usually not part of my kernel. When I ran it the first time I only had the usual working code in it and wondered why it executed much faster then I anticipated. Until I realized that only a fraction of my output data was written. The kernels amount of output is a quadratic function in the number of input data and so while researching the reason I found out that only first half of input data was read and the second half - belonging to threads 256-511 was missing.
                                Thats why I put the printf in to see whats going on.

                                 

                                I will later try to set up a Windows installation on the same machine to compare if there is a difference in behavior between Linux and Windows driver.