5 Replies Latest reply on Dec 13, 2010 2:00 PM by himanshu.gautam

    Work group size with barrier

      Drastic reduction in max work group size when using kernel with barrier

      I am seeing some inconsistent behaviour on the HD5970 when using a barrier inside a kernel and supplying a local work group size greater than 64. Depending on where I place the barrier in the kernel (i.e. depending on the number of LDS writes being "barriered"), either the kernel runs to completion or I get a CL_INVALID_WORK_GROUP_SIZE from clEnqueueNDRangeKernel(). I found this post from last year which mentions a similar issue on the HD4850: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=124649

      Does use of a barrier drastically reduce the max work group size? Does this depend on the number of memory writes preceding the barrier? Is this an issue on all GPUs? Are there plans to fix this in a future release? Thanks :)

      FYI, I am already using __attribute__((reqd_work_group_size(X, Y, Z))) on the kernel.

        • Work group size with barrier


          Is it a problem for any kernel or just your kernel. 

          please share your code and we can discuss what might be causing the issue.

          You can also send a suitable test case a streamdeveloper@amd.com.

            • Work group size with barrier

              Hi Himanshu,

              Attached is a test kernel that reproduces the behaviour. The kernel just computes the difference between the first few columns of the input image and the reference image. The images are all 64x64 floats.


              1) If the work group size is 64 or less (e.g: 4x4, 4x8 or 4x16), the kernel builds and runs fine.

              2) If the work group size is more than 64 (e.g: 4x32 or 4x64) then the kernel build fails with the following message:

              Program build errors:
              Warning:W001angerous Barrier Opt Detected!
              Warning:W000:Barrier caused limited groupsize
              Error: Requested compile size is bigger than the required workgroup size of 64 elements
              Error: Creating kernel preprocess failed!

              3) If the work group size is more than 64 (e.g: 4x32 or 4x64) and the reqd_work_group_size attribute is not specified then the kernel builds successfully, but clEnqueueNDRangeKernel() fails with CL_INVALID_WORK_GROUP_SIZE

              4) If the barrier is removed, the kernel builds and runs fine for any work group size.


              #define XRES 64 #define YRES 64 #define FLOAT4_COLUMNS_PER_WORK_GROUP 4 #define FLOAT4_ROWS_PER_WORK_GROUP 64 __constant sampler_t imageSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST; __kernel __attribute__((reqd_work_group_size(FLOAT4_COLUMNS_PER_WORK_GROUP, FLOAT4_ROWS_PER_WORK_GROUP, 1))) void preprocess (__read_only image2d_t input_img, __read_only image2d_t reference_img, __write_only image2d_t output_img) { float4 output; float4 input; float4 reference; __local float4 temp_img[FLOAT4_COLUMNS_PER_WORK_GROUP][YRES]; int2 gid = (int2)(get_global_id(0), get_global_id(1)); int2 lid = (int2)(get_local_id(0), get_local_id(1)); int x = gid.x * FLOAT4_COLUMNS_PER_WORK_GROUP; int ystart = gid.y; int ystop = YRES - 1; int ystride = get_global_size(1); for (int y = ystart; y <= ystop; y += ystride) { input = read_imagef(input_img, imageSampler, (int2)(x, y)); reference = read_imagef(reference_img, imageSampler, (int2)(x, y)); output = input - reference; temp_img[lid.x][y] = output; } barrier(CLK_LOCAL_MEM_FENCE); for (int y = ystart; y <= ystop; y += ystride) { write_imagef(output_img, (int2)(x, y), temp_img[lid.x][y]); } }

            • Work group size with barrier
              The problem is that LLVM is doing in invalid optimization on the barrier function and the runtime is detecting it and failing execution for local sizes larger than a wavefront. We should have this fixed in our upcoming release, but please send us a test case as himanshu has asked so that we can verify it is fixed.