Work group size with barrier

Discussion created by mindsporter on Oct 21, 2010
Latest reply on Dec 13, 2010 by himanshu.gautam
Drastic reduction in max work group size when using kernel with barrier

I am seeing inconsistent behaviour on the HD5970 when using a barrier inside a kernel with a local work group size greater than 64. Depending on where I place the barrier in the kernel (i.e. on how many LDS writes precede it), either the kernel runs to completion or clEnqueueNDRangeKernel() fails with CL_INVALID_WORK_GROUP_SIZE. I found this post from last year which mentions a similar issue on the HD4850: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=124649

Does use of a barrier drastically reduce the max work group size? Does this depend on the number of memory writes preceding the barrier? Is this an issue on all GPUs? Are there plans to fix this in a future release? Thanks :)

FYI, I am already using __attribute__((reqd_work_group_size(X, Y, Z))) on the kernel.
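For reference, here is a minimal sketch of the pattern I'm describing (kernel name, buffer names, and the work group size of 128 are just illustrative, not my actual code):

```c
/* Illustrative kernel: a run of LDS writes followed by a barrier. */
__kernel __attribute__((reqd_work_group_size(128, 1, 1)))
void reduce_step(__global const float *in, __global float *out)
{
    __local float scratch[128];          /* LDS allocation */
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)]; /* LDS write before the barrier */
    barrier(CLK_LOCAL_MEM_FENCE);        /* every work-item must reach this */

    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < 128; ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}
```

On the host side, I understand the runtime can lower the per-kernel limit below the device maximum, and clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE should report what this particular kernel is actually allowed (variable names here are again illustrative):

```c
size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
/* If max_wg comes back as 64, enqueueing with a local size of 128
   would be expected to fail with CL_INVALID_WORK_GROUP_SIZE. */
```

What surprises me is that the reported limit seems to depend on where the barrier sits, not just on resource usage.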