My work group size is 256.
In the first part of my kernel, each work item writes to a designated area of a local buffer.
Then, all work items transfer the entire local buffer to global memory.
For optimal performance, what is the best setting for:
1) Size of the local buffer, e.g. 1024 bytes, 2048 bytes, etc.
2) Amount of data transferred per work item, e.g. 32 bits, 64 bits, etc.
Achieving optimal performance is not always straightforward, because it depends on several factors. There are guidelines with tips for improving performance in particular scenarios on particular platforms, but you should still profile to see how the kernel actually performs. So I would suggest profiling your application with various settings before making a final decision. AMD's CodeXL tool can be used for this purpose.
Now, coming to your questions.
1) One direct impact of LDS size is that it limits the number of work-groups that can be active on a CU. For more details, see the section "Local Memory (LDS) Size" in AMD's OpenCL optimization guide, where Table 2.2 shows the effect of LDS usage on wavefronts/CU.
2) When accessing the LDS, one of the main considerations is avoiding (or at least minimizing) bank conflicts. The optimization guide says:
"A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern."
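To make both points a bit more concrete, here is a rough sketch of the arithmetic in Python (not OpenCL). The hardware figures are my assumptions, not something taken from the quote above: 64 KB of LDS per CU, 64 work-items per wavefront, and 32 LDS banks that are each 4 bytes wide. Check them against your device's documentation. The function names are made up for illustration. The bank model only covers one 4-byte read per work-item at a fixed stride, so it does not reproduce the float2/float4 behaviour the guide describes.

```python
from collections import Counter

# Assumed hardware figures (hypothetical defaults; verify for your device).
LDS_BYTES_PER_CU = 64 * 1024   # assumed LDS capacity per compute unit
WAVEFRONT_SIZE = 64            # assumed work-items per wavefront
NUM_BANKS = 32                 # assumed number of LDS banks
BANK_WIDTH = 4                 # assumed bank width in bytes


def lds_limited_occupancy(lds_bytes_per_group, work_group_size=256):
    """Return (work_groups, wavefronts) per CU as limited by LDS usage alone."""
    waves_per_group = -(-work_group_size // WAVEFRONT_SIZE)  # ceiling division
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group
    return groups, groups * waves_per_group


def worst_bank_load(num_work_items, stride_bytes):
    """Largest number of 4-byte reads that land on a single bank.

    Work-item i reads the 4-byte word at byte address i * stride_bytes.
    A result of 1 means conflict-free in this simplified model.
    """
    hits = Counter()
    for wi in range(num_work_items):
        word_index = (wi * stride_bytes) // BANK_WIDTH
        hits[word_index % NUM_BANKS] += 1
    return max(hits.values())


if __name__ == "__main__":
    for size in (1024, 2048, 8192, 32768):
        groups, waves = lds_limited_occupancy(size)
        print(f"{size:6d} bytes/group: {groups} groups, {waves} wavefronts (LDS limit only)")

    for stride in (4, 8, 128, 132):
        print(f"stride {stride:3d} bytes: worst bank load {worst_bank_load(32, stride)}")
```

Keep in mind that the first function gives only the LDS-imposed ceiling. Register usage and the hardware cap on wavefronts per CU limit occupancy as well, so small LDS sizes will not actually reach the numbers it prints. That is one more reason to profile the candidate settings rather than rely on the arithmetic.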