I'm trying to speed up an OpenCL kernel by using __local memory. Before I used async_work_group_copy, I was able to set global_work_size to 128 and local_work_size to 64. Once I added the async_work_group_copy call, I could no longer set local_work_size to 64: clEnqueueNDRangeKernel returned the error code CL_INVALID_WORK_GROUP_SIZE. The following code demonstrates this.
My device is an ATI HD 4500:
CL_DEVICE_MAX_WORK_ITEM_SIZES: 128 128 128
I couldn't find anything in the documentation explaining why this error occurs. Am I missing something, or doing something wrong?
size_t* globalThreads = new size_t;
*globalThreads = 128;
size_t* localThreads = new size_t;
*localThreads = 64; // 64 -> error, but 32 -> ok
cl_int status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &eventSyncKernel);
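Query the maximum work-group size for this kernel with clGetKernelWorkGroupInfo (CL_KERNEL_WORK_GROUP_SIZE). It can be lower than the device-wide limit.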
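A minimal sketch of how the result of that query could be used, not a definitive implementation. pickLocalSize is a hypothetical helper (it is not part of OpenCL), and the names device and kernelMax in the comment are assumptions about the surrounding host code. The helper clamps the requested local size to the per-kernel limit and then steps down until it divides the global size evenly, since OpenCL 1.x requires that and returns CL_INVALID_WORK_GROUP_SIZE otherwise.

```cpp
#include <cassert>
#include <cstddef>

// The per-kernel limit would come from the OpenCL runtime, roughly like this
// (assumes `kernel` and `device` are valid handles in the host code):
//
//   size_t kernelMax = 0;
//   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
//                            sizeof(kernelMax), &kernelMax, NULL);

// Hypothetical helper: largest local size that is <= requested, <= kernelMax,
// and divides globalSize evenly.
std::size_t pickLocalSize(std::size_t globalSize,
                          std::size_t requested,
                          std::size_t kernelMax)
{
    assert(globalSize > 0 && requested > 0 && kernelMax > 0);

    // Clamp the request to the per-kernel limit and to the global size.
    std::size_t local = requested < kernelMax ? requested : kernelMax;
    if (local > globalSize) {
        local = globalSize;
    }

    // Step down until the global size is a multiple of the local size.
    // This stops at 1 at the latest, because every size is divisible by 1.
    while (globalSize % local != 0) {
        --local;
    }
    return local;
}
```

With the numbers from the question, a kernel limit of 32 would turn the requested 64 into 32, and a limit of 64 would leave it unchanged.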
On 4xxx cards, IIRC, if you use a barrier you can't execute a work-group larger than 64. And IMHO async_work_group_copy must execute a barrier to ensure that the data are loaded correctly.