cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

mindsporter
Journeyman III

Work group size with barrier

Drastic reduction in max work group size when using kernel with barrier

I am seeing some inconsistent behaviour on the HD5970 when using a barrier inside a kernel and supplying a local work group size greater than 64. Depending on where I place the barrier in the kernel (i.e. depending on the number of LDS writes being "barriered"), either the kernel runs to completion or I get a CL_INVALID_WORK_GROUP_SIZE from clEnqueueNDRangeKernel(). I found this post from last year which mentions a similar issue on the HD4850: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=124649

Does use of a barrier drastically reduce the max work group size? Does this depend on the number of memory writes preceding the barrier? Is this an issue on all GPUs? Are there plans to fix this in a future release? Thanks 🙂

FYI, I am already using __attribute__((reqd_work_group_size(X, Y, Z))) on the kernel.

0 Likes
5 Replies
himanshu_gautam
Grandmaster

mindsporter,

Is it a problem for any kernel or just your kernel. 

please share your code and we can discuss what might be causing the issue.

You can also send a suitable test case a streamdeveloper@amd.com.

0 Likes

Hi Himanshu,

Attached is a test kernel that reproduces the behaviour. The kernel just computes the difference between the first few columns of the input image and the reference image. The images are all 64x64 floats.

Observations:

1) If the work group size is 64 or less (e.g: 4x4, 4x8 or 4x16), the kernel builds and runs fine.

2) If the work group size is more than 64 (e.g: 4x32 or 4x64) then the kernel build fails with the following message:

Program build errors:
Warning:W001angerous Barrier Opt Detected!
Warning:W000:Barrier caused limited groupsize
Error: Requested compile size is bigger than the required workgroup size of 64 elements
Error: Creating kernel preprocess failed!

3) If the work group size is more than 64 (e.g: 4x32 or 4x64) and the reqd_work_group_size attribute is not specified then the kernel builds successfully, but clEnqueueNDRangeKernel() fails with CL_INVALID_WORK_GROUP_SIZE

4) If the barrier is removed, the kernel builds and runs fine for any work group size.

 

#define XRES 64 #define YRES 64 #define FLOAT4_COLUMNS_PER_WORK_GROUP 4 #define FLOAT4_ROWS_PER_WORK_GROUP 64 __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST; __kernel __attribute__((reqd_work_group_size(FLOAT4_COLUMNS_PER_WORK_GROUP, FLOAT4_ROWS_PER_WORK_GROUP, 1))) void preprocess (__read_only image2d_t input_img, __read_only image2d_t reference_img, __write_only image2d_t output_img) { float4 output; float4 input; float4 reference; __local float4 temp_img[FLOAT4_COLUMNS_PER_WORK_GROUP][YRES]; int2 gid = (int2)(get_global_id(0), get_global_id(1)); int2 lid = (int2)(get_local_id(0), get_local_id(1)); int x = gid.x * FLOAT4_COLUMNS_PER_WORK_GROUP; int ystart = gid.y; int ystop = YRES - 1; int ystride = get_global_size(1); for (int y = ystart; y <= ystop; y += ystride) { input = read_imagef(input_img, imageSampler, (int2)(x, y)); reference = read_imagef(reference_img, imageSampler, (int2)(x, y)); output = input - reference; temp_img[lid.x] = output; } barrier(CLK_LOCAL_MEM_FENCE); for (int y = ystart; y <= ystop; y += ystride) { write_imagef(output_img, (int2)(x, y), temp_img[lid.x]); } }

0 Likes

most likely if you do not specify require workgroup size it compile for 64. query local size with CL_KERNEL_COMPILE_WORK_GROUP_SIZE

0 Likes

mindsporter,

I have checked your code in the same condition as you provided in forum. On cypress i get 256 work group size and the code compiles fine giving no errors. What do you get as local work group size by using

 

 

 

status=

clGetKernelWorkGroupInfo(.., ..., CL_KERNEL_WORK_GROUP_SIZE, ...).

 



 





0 Likes

mindsporter,
The problem is that LLVM is doing in invalid optimization on the barrier function and the runtime is detecting it and failing execution for local sizes larger than a wavefront. We should have this fixed in our upcoming release, but please send us a test case as himanshu has asked so that we can verify it is fixed.
0 Likes