I have an OpenCL kernel that originally used no local memory at all. Inside the kernel, each work item copies about 128 bytes from global memory into registers (private variables), and those values are then read hundreds of times. This is only a very small amount of memory traffic compared with the much larger volume of global/image memory accesses elsewhere in the kernel. Strangely, when I keep the 128 bytes in local memory instead of registers, I see a performance boost of 33%. Every work item uses the same data from global memory, so sharing it through local memory is fine here. I also tried putting the data in constant memory instead of global memory and reading it directly, without copying it to registers or local memory, but the performance was not good.
Can somebody explain why this could happen? Note that the register usage of each work item is very small, so there should be enough registers to hold these 128 bytes.
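
To make the comparison concrete, the three variants can be sketched roughly as follows. This is a simplified illustration, not the actual kernel: the names (`table`, `in`, `out`, `N_COEFFS`), the `float` element type, the 1-D work-group layout, and the placeholder loop standing in for the real work are all assumptions, and the host code is omitted.

```c
#define N_COEFFS 32  /* 32 floats = 128 bytes; the element type is an assumption */

/* Variant A: each work item copies the table into private variables. */
__kernel void variant_private(__global const float *table,
                              __global const float *in,
                              __global float *out)
{
    float t[N_COEFFS];                      /* private copy, one per work item */
    for (int i = 0; i < N_COEFFS; ++i)
        t[i] = table[i];

    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < N_COEFFS; ++i)      /* placeholder for the real work */
        acc += t[i] * in[gid];
    out[gid] = acc;
}

/* Variant B: the work group shares a single copy in local memory. */
__kernel void variant_local(__global const float *table,
                            __global const float *in,
                            __global float *out)
{
    __local float t[N_COEFFS];              /* one copy per work group */

    /* Work items of the group load the table cooperatively. */
    for (size_t i = get_local_id(0); i < N_COEFFS; i += get_local_size(0))
        t[i] = table[i];
    barrier(CLK_LOCAL_MEM_FENCE);           /* make the copy visible to the whole group */

    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < N_COEFFS; ++i)      /* placeholder for the real work */
        acc += t[i] * in[gid];
    out[gid] = acc;
}

/* Variant C: the table is read directly from constant memory, with no copy. */
__kernel void variant_constant(__constant float *table,
                               __global const float *in,
                               __global float *out)
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < N_COEFFS; ++i)      /* placeholder for the real work */
        acc += table[i] * in[gid];
    out[gid] = acc;
}
```

In this sketch, variant B is the one that runs about 33% faster than variant A, and variant C is the one that performed poorly.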