Is copying data from private memory to local memory an expensive operation? Since I started using local memory in my kernel, the program runs much slower than before.
Code:
__private float4 block[4];
__local float4 local_block[16];
// The stores below are very slow. Why?
local_block[local_id] = block[0];
local_block[local_id + 1] = block[1];
local_block[local_id + 2] = block[2];
local_block[local_id + 3] = block[3];
barrier(CLK_LOCAL_MEM_FENCE);
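
Separately from the speed question, the indexing may be worth checking: with `local_id + k`, neighbouring work-items appear to write to overlapping slots of `local_block`. The sketch below is plain C, not OpenCL, and only simulates the index arithmetic on the host. The helper `count_conflicting_slots`, its parameters, and the `MAX_SLOTS` limit are hypothetical names introduced for illustration. It assumes work-item `i` writes slots `i * stride + k` for `k = 0..per_item-1`; the work-group size is not given in the question, so it is passed in as a parameter.

```c
#include <assert.h>
#include <string.h>

#define MAX_SLOTS 64  /* upper bound on the array size in this sketch (assumption) */

/* Hypothetical helper: simulates num_items work-items, where work-item i
 * writes slots i*stride + k for k = 0..per_item-1 into an array of
 * `capacity` elements.
 * Returns the number of slots written by more than one work-item.
 * Writes that fall outside the array are counted in *out_of_range. */
int count_conflicting_slots(int num_items, int stride, int per_item,
                            int capacity, int *out_of_range)
{
    int writers[MAX_SLOTS];
    int conflicts = 0;

    assert(capacity > 0 && capacity <= MAX_SLOTS);
    memset(writers, 0, sizeof writers);
    *out_of_range = 0;

    for (int i = 0; i < num_items; ++i) {
        for (int k = 0; k < per_item; ++k) {
            int slot = i * stride + k;
            if (slot < 0 || slot >= capacity)
                ++*out_of_range;      /* would write past the local array */
            else
                ++writers[slot];      /* record one more writer for this slot */
        }
    }

    for (int s = 0; s < capacity; ++s)
        if (writers[s] > 1)
            ++conflicts;              /* slot shared by several work-items */

    return conflicts;
}
```

Assuming 4 work-items and the 16-element array from the snippet, a stride of 1 (the `local_id + k` pattern) reports 5 slots with more than one writer, while a stride of 4 (that is, `local_id * 4 + k`) reports none. With 16 work-items and a stride of 1, some writes also land outside the array. This only checks the index pattern; it cannot say whether the overlap is related to the slowdown.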