Hi all,
I am currently rewriting some OpenCL code of mine and would like to split the work of each group across more waves, in order to have more waves in flight. The code targets OpenCL 1.2 (because it needs to be compatible with Nvidia GPUs as well). On Nvidia, running more than 256 items per group is no problem as long as enough resources are available.
I read that it seems to be possible to run a kernel with a larger work group when a NULL local range is used, or when the compile-time work-group-size requirement and the submit-time sizes agree.
Unfortunately I experience the following problem:
I submit a kernel that has the attribute
__attribute__((reqd_work_group_size(256, 2, 1)))
and my submit call looks like
cl_int err = queue.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(8388608,2), cl::NDRange(256,2), NULL, NULL);
So I expect 32768 x 1 work groups with 256 x 2 items each. Interestingly, the returned error code is CL_SUCCESS (by the way, switching to (128, 4) gives CL_INVALID_WORK_GROUP_SIZE).
I start the kernel with the following two lines:
if (get_global_id(0) == 0) printf("Local Sizes: %d %d \n", get_local_size(0), get_local_size(1));
if (get_group_id(0) == 0) printf("Work Item Id: %d %d \n", get_local_id(0), get_local_id(1));
Interestingly, I get 256 and 2 returned from the first line, but work group 0 only prints the local IDs (x, 0); all the (x, 1) entries are missing and never executed.
My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", the GPU as "OpenCL 1.2 AMD-APP (2671.3)" (it is strange that the GPU does not also report OpenCL 2.1 compatibility ...).
It's an Ubuntu 18.04 system running the recent amdgpu-pro 18.30 driver. The GPU is an RX 580 4G (plus an AMD A10-7850 iGPU that is also detected).
clinfo reports
Max work item dimensions                        3
Max work item sizes                             1024x1024x1024
Max work group size                             256
Preferred work group size (AMD)                 256
Max work group size (AMD)                       1024
Preferred work group size multiple              64
Wavefront width (AMD)                           64
So, the OpenCL standard value and CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD do not agree.
The kernel uses 33 registers (it compiles fine in RGA and CodeXL) and 21.0 kB of local memory. So with 256 work items per group I can have only 3 waves per SIMD active, while 512 would allow 6. I hope to get better performance from this (at least on Nvidia, the step from 256 to 512 helped a lot).
Any advice on how to get this running? Unfortunately, using a NULL local range and hoping that the compiler will just choose a work-group size of 512 is not an option, because there are a lot of hard-coded optimizations for the work-group size in the code.
Thanks in advance.
Ah, you may be right. Sorry, I did not find the other forum before.
Well, I will post to the new-user forum there. It seems like the better place for my request. Thanks very much.
Hi lolliedieb,
I've whitelisted you and I'm moving this post to our OpenCL forum.
And thank you, elstaci
Hey, Thank you for helping me in the past. Surprised you are up so early in the morning
If "Max work group size" is reported as 256, then that is the maximum limit for the work-group size (multiplying all the dimensions, i.e. X * Y * Z). Though I'm a little surprised by the "Max work item sizes" values [1024x1024x1024], because, as per my understanding, the limit for each dimension should not be greater than "Max work group size". I need to check this with the concerned team. Interestingly, earlier clinfo used to report it as [256x256x256], which is what I would expect.
Meanwhile, could you please attach the clinfo output?
My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),
That is expected. Currently the OpenCL runtime under AMDGPU-Pro supports OpenCL 1.2 only, so it reports any OpenCL 2.0+ device as OpenCL 1.2.
Thanks.
Here is the full clinfo output for both GPUs (Linux):
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (2671.3)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 2
Device Name Spectre
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2 AMD-APP (2671.3)
Driver Version 2671.3
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Board Name (AMD) AMD Radeon Graphics
Device Topology (AMD) PCI-E, 00:01.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 8
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 720MHz
Graphics IP (AMD) 7.1
Device Partition (core)
Max number of sub-devices 8
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 4625539072 (4.308GiB)
Global free memory (AMD) 8017340 (7.646GiB)
Global memory channels (AMD) 4
Global memory banks per channel (AMD) 8
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 3078068633 (2.867GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 32768 (32KiB)
  Local memory size per CU (AMD)                65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 3078068633 (2.867GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 1539692369995556810ns (Tue Oct 16 14:19:29 2018)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) Yes
Number of async queues (AMD) 2
Max real-time compute queues (AMD) 0
Max real-time compute units (AMD) 94
SPIR versions 1.2
printf() buffer size 4194304 (4MiB)
Built-in kernels
Device Extensions cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
Device Name Ellesmere
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2 AMD-APP (2671.3)
Driver Version 2671.3
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Board Name (AMD) Radeon RX 580 Series
Device Topology (AMD) PCI-E, 01:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 36
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 1366MHz
Graphics IP (AMD) 8.0
Device Partition (core)
Max number of sub-devices 36
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 3830943744 (3.568GiB)
Global free memory (AMD) 3719780 (3.547GiB)
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 16
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 3074440806 (2.863GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 32768 (32KiB)
  Local memory size per CU (AMD)                65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 3074440806 (2.863GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 1539692369995556810ns (Tue Oct 16 14:19:29 2018)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) Yes
Number of async queues (AMD) 2
Max real-time compute queues (AMD) 0
Max real-time compute units (AMD) 120
SPIR versions 1.2
printf() buffer size 4194304 (4MiB)
Built-in kernels
Device Extensions cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [AMD]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name AMD Accelerated Parallel Processing
Device Name Spectre
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (2)
Platform Name AMD Accelerated Parallel Processing
Device Name Spectre
Device Name Ellesmere
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (2)
Platform Name AMD Accelerated Parallel Processing
Device Name Spectre
Device Name Ellesmere
By the way, very interesting:
Without the RX 580 I was not able to install these drivers on this machine (the APUs are officially not supported), but now the iGPU runs just fine on these drivers, both graphics and compute. I wonder why that artificial restriction exists then ^^
The 1024x1024x1024 is listed because that is the maximum per dimension, i.e. 1024x1x1, 1x1024x1 or 1x1x1024... although functionally these are identical.
Yes, this can be a little confusing, but listing 256x256x256 would simply be incorrect, as that is not the maximum each dimension can be assigned.
1024 is the OpenCL specification's maximum dimension size.
Now I'd highly recommend reading the ISA documentation for GCN 3rd Gen, as it helps to understand how and why the OpenCL limitations exist, but more importantly the best approach to setting thread affinity in the OpenCL kernel.
Note you have 16 kB per CU and 32 kB shared cache to work with; what's more, you have a maximum 16-deep instruction queue... and each object contains a 32-bit instruction + 32-bit data, so they actually use double the cache you might think they do.
i.e. 1024 objects = 16 kB, *not* 8 kB as you might think.
This, as a note, is why asynchronous operation is costless, unlike on NVIDIA, where there is a memory-access latency penalty to transfer between shader modules.
GCN is also set up so that each compute unit has 16-wide SIMDs, 4 pipelines and 64 threads (per cycle), but CUs are packaged in complexes of 4 compute units.
The "global" cache is only shared between these 4 units, not the whole GPU... as such, while an individual CU might be capable of storing and processing up to 1024 objects, shared across 4 you're looking at a maximum of 256 asynchronously; but as noted, because you can only process 64 per cycle per CU, this really shouldn't be an issue, as that's the absolute maximum that can be processed without barriers.
This, as a note, means there is 16 kB for the data + 16 kB for the kernel within the shared (global) cache; if you want a bigger kernel, then you have to have fewer work groups.
NVIDIA not only has double the available threads per shader module; also keep in mind that the code you've written isn't necessarily what is actually being used at runtime... an issue I've had with NVIDIA for some time is that they have a habit of taking black-box approaches to "optimisation", which can often result in unwanted behaviour or sometimes frankly incorrect output.
NVIDIA here is the "special case" exception, not the rule. It's something I'd heavily recommend any developer keeps in the back of their mind, and it is a main reason why I don't do initial development on either Intel or NVIDIA hardware, as it can lead to false assumptions about support and specification that make it very difficult to port to other platforms later.
leyvin
Thanks for your reply, but I don't really see the connection between the GCN ISA details and my work-group problem.
As far as I know, AMD cards are not limited to 256 elements in one work group. For example, in DirectCompute and HSA the limit is 1024, and here too CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD returns 1024 as the maximum extended work-group size. My question is: how do I use this?
Because my problem is that, for some reason, scheduling 512 x 1 returns CL_INVALID_WORK_GROUP_SIZE (I do not understand why), while 256 x 2 is accepted (enqueueNDRangeKernel returns no error) and the kernel itself reports get_local_size(0) as 256 and get_local_size(1) as 2 ... so that seems fine, but local threads (0-255, 1) are never executed, while (0-255, 0) work as expected.
I do not think I am running into resource problems here, because the kernel uses
32 VGPRs / thread
~27.5 kB local memory / work group
54 SGPRs / work group
and has an ISA size of only 4 kB, so the instruction cache should be fine as well.
So right now, having two of these groups active at the same time is no problem at all. If the problem were on the resource side, I would expect to get only one work group with twice the threads in flight. But this is not the case.
You may be right that stepping from 256 threads per group to 512 does not have a positive effect for my code. But since my code is mainly bound by global memory latency and global atomics, I already saw a nice improvement stepping from 128 to 192 to 256 elements per work group. Also, running at larger sizes may allow me to use the same code on Nvidia and AMD. I am only one person working on this code, but it must run on both vendors, and on both Windows and Linux. That is holding me back from switching to CUDA + HSA or Vulkan or ROCm, because my development effort would be too large maintaining more than one platform.
Hi,
I have three-year-old experience making GCN asm kernels on an HD 7770 (GCN 2nd gen, I guess), trying to allocate the maximum LDS with the widest group size:
workitems: totalStreamsOnCard*4*2
groupsize: 256
lds size/group: 32768 (this is the maximum per group, not 64K)
vregs: 128
The important thing is that I use a one-dimensional GID, and if there is more actual work than the above work-item count, I implement a loop in the kernel. This way I can get work items that keep working simultaneously for half a second. With a 1D GID you can go far beyond that 1K OpenCL "limit".
This was able to fill each CU: allocate all 64K of LDS and use up all the VRegs, while letting me synchronize the groups with ds_barrier.
I guess it is not good news... I also tried to find a way to group as many threads as I could, but that was the max (for me at least).
Hi lolliedieb,
As I've come to know, 256 is the default maximum work-group size, so that any OpenCL application should work properly. The AMD value (i.e. 1024) shows the maximum hardware capability, and users can reach it by forcing the dimensions with __attribute__((reqd_work_group_size(X, Y, Z))). However, developers must understand the performance impact of such a request.
By the way, when I ran the code you mentioned in the first post, I got the expected output on a Windows machine. Currently I don't have a Linux setup to test it. I have a suggestion though: instead of printf, could you please try another method (such as writing to a buffer) to verify whether the work items corresponding to get_local_id(1) are executing? If you still see the problem, please share a repro and I'll report it to the concerned team.
Thanks.
Well, the printf is usually not part of my kernel. When I ran it the first time it contained only the usual working code, and I wondered why it executed much faster than I anticipated, until I realized that only a fraction of my output data was written. The kernel's amount of output is a quadratic function of the number of input elements, and while researching the reason I found out that only the first half of the input data was read; the second half, belonging to threads 256-511, was missing.
That's why I put the printf in, to see what's going on.
I will later try to set up a Windows installation on the same machine to compare whether the Linux and Windows drivers behave differently.
Hey guys,
I wanted to reply with something I recently learned, for all the googlers out there:
It turned out that what I was asking for is entirely possible, but requires a (seemingly undocumented) environment variable to be set. Namely, if I set GPU_MAX_WORKGROUP_SIZE=1024 I can go up to that value; it is even shown in clinfo then.
I wonder why this switch exists while information about it is so scarce, but it works just fine as long as the kernel has enough resources.
Now new question:
Any chance to activate full usage of the 64 kB of local memory (instead of 32) per work group? The kernel compiles fine but then gives an out-of-resources runtime error if I try to use it ^^
Have a nice day
GCN 1st to 4th Gen only have 32 kB local; GCN 5th Gen has the full 64 kB that was planned but scrapped, although the ISA was never updated to reflect this.
I believe it was corrected in the GCN3 (update) ISA ... although only briefly in the accompanying text, not in the architecture diagram.
I'm not entirely sure of the reason for the reduced caches versus the ISA, but it was likely a cost-saving measure.
If you look at the OpenCL capabilities (easiest via GPU-Z), it should list 256 kB queue • 16 kB global (GDS) • 32 kB local (LDS); this should be the same from the HD 7770 up to Polaris 30, and technically the Ryzen APUs with Vega graphics too, as they're Polaris as well.
RX Vega 56/64/FE and Radeon VII, however, should list 512 kB queue • 64 kB global (GDS) • 64 kB local (LDS).
Although if you want portability, I'd stick within the bounds of GCN 1st to 4th Gen.