Which is faster for a large program: organizing the work into workgroups that use local memory, or doing everything in global memory?

Let's say the global index space is a 128 x 128 x 128 cube. There will be about 20 calculations on each of the 128^3 = 2,097,152 values, then 1 stored value per element, and this whole process will be repeated 10,000 times.
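To put rough numbers on the workload, here is a minimal sketch in plain C that just multiplies out the figures above. The function names `cube_elements` and `total_ops` are made up for illustration, and "20 calculations" is treated as exactly 20 operations per element, which is only an approximation.

```c
#include <stdint.h>

/* Number of elements in an n x n x n cube. */
uint64_t cube_elements(uint64_t n) {
    return n * n * n;
}

/* Rough operation count: elements * operations per element * passes.
   Assumes every element costs the same number of operations. */
uint64_t total_ops(uint64_t n, uint64_t ops_per_element, uint64_t passes) {
    return cube_elements(n) * ops_per_element * passes;
}
```

Calling `total_ops(128, 20, 10000)` gives 419,430,400,000 operations in total, and `total_ops(128, 1, 10000)` gives 20,971,520,000 stores, so the arithmetic outweighs the stores by about 20 to 1 under these assumptions.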

Because I'm working with a huge number of calculations, what I need to know is: is it beneficial to organize the cube into workgroups and work-items that use local memory?


Or should I just let OpenCL handle it all in global memory, which loses the fast local-memory speeds but doesn't waste time on thousands of data copies?
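To make the "thousands of data copies" concrete, here is a small sketch in plain C of how the cube would split into workgroups. The 8 x 8 x 4 local size, the 4-byte element size, and the function names are all assumptions picked for illustration, not values from a real device, and the tile size ignores any halo cells a neighbour-based calculation would need.

```c
#include <stddef.h>

/* Workgroups along one dimension.
   Assumes the local size divides the global size evenly. */
size_t groups_per_dim(size_t global, size_t local) {
    return global / local;
}

/* Total workgroups for a 3D index space. */
size_t total_groups(const size_t global[3], const size_t local[3]) {
    return groups_per_dim(global[0], local[0])
         * groups_per_dim(global[1], local[1])
         * groups_per_dim(global[2], local[2]);
}

/* Bytes of local memory needed for one workgroup's tile (no halo). */
size_t tile_bytes(const size_t local[3], size_t elem_size) {
    return local[0] * local[1] * local[2] * elem_size;
}
```

With a global size of {128, 128, 128} and a local size of {8, 8, 4}, `total_groups` gives 8192 workgroups per pass, and `tile_bytes` with a 4-byte element gives 1024 bytes per tile. So each pass would mean 8192 tile loads from global into local memory, and whether that pays off presumably depends on how many times each loaded value is reused by the work-items in its group.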