I am seeing inconsistent behaviour on the HD5970 when using a barrier inside a kernel and supplying a local work-group size greater than 64 work-items. Depending on where I place the barrier in the kernel (i.e. depending on how many LDS writes precede it), either the kernel runs to completion or clEnqueueNDRangeKernel() returns CL_INVALID_WORK_GROUP_SIZE. I found this post from last year which describes a similar issue on the HD4850: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=124649
Does using a barrier drastically reduce the maximum work-group size? Does the limit depend on the number of memory writes preceding the barrier? Is this an issue on all GPUs? Are there plans to fix this in a future release? Thanks :)
FYI, I am already using __attribute__((reqd_work_group_size(X, Y, Z))) on the kernel.
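In case it helps frame the question, here is a minimal sketch of the check I would expect to hold before enqueueing. The helper names are hypothetical, and kernel_max_wg is assumed to be the value reported by clGetKernelWorkGroupInfo() with CL_KERNEL_WORK_GROUP_SIZE. I have not confirmed what that query returns for this kernel on the HD5970.

```c
#include <stddef.h>

/* Total work-items in one work-group: the product of the local sizes. */
size_t work_group_items(const size_t local[3])
{
    return local[0] * local[1] * local[2];
}

/* Returns 1 if the requested local size fits within the per-kernel limit,
 * 0 otherwise.
 *
 * Assumption: kernel_max_wg was obtained on the host with
 *   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
 *                            sizeof(size_t), &kernel_max_wg, NULL);
 * which can be smaller than CL_DEVICE_MAX_WORK_GROUP_SIZE because it
 * accounts for the resources the compiled kernel needs.
 */
int local_size_fits(const size_t local[3], size_t kernel_max_wg)
{
    size_t items = work_group_items(local);
    return items > 0 && items <= kernel_max_wg;
}
```

If the per-kernel limit really does drop to 64 when the barrier is present, a 16 x 16 x 1 local size would fail this check while 8 x 8 x 1 would pass, which would match the error I am getting.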