My work group size is 256.
In the first part of my kernel, each work item writes to a designated area of a local buffer.
Then, all work items in the group cooperatively copy the entire local buffer to global memory.
For optimal performance, what are the best settings for:
1) Size of the local buffer, e.g. 1024 bytes, 2048 bytes, etc.
2) Number of bytes transferred per work item in each access, e.g. 4 bytes (32 bits), 8 bytes (64 bits), etc.
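
To make the pattern concrete, here is a rough sketch of what I mean, written as a plain C emulation of one work group rather than the actual kernel. The names (fill_own_slice, copy_strided, run_work_group), the 32-bit element type, and the 2048-byte buffer size are all placeholders I chose for illustration, not values I know to be optimal.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_SIZE     256u  /* work group size from the question */
#define WORDS_PER_ITEM 2u    /* assumed: each work item owns two 32-bit words */
#define LOCAL_WORDS    (GROUP_SIZE * WORDS_PER_ITEM)  /* 512 words = 2048 bytes */

/* Phase 1: work item `lid` writes only to its own slice of the local buffer. */
void fill_own_slice(uint32_t *local_buf, uint32_t lid)
{
    assert(lid < GROUP_SIZE);
    for (uint32_t k = 0; k < WORDS_PER_ITEM; ++k)
        local_buf[lid * WORDS_PER_ITEM + k] = lid * 1000u + k;  /* dummy payload */
}

/* Phase 2: work item `lid` copies words lid, lid + GROUP_SIZE, ...
 * so that neighbouring work items touch neighbouring addresses. */
void copy_strided(const uint32_t *local_buf, uint32_t *global_buf, uint32_t lid)
{
    assert(lid < GROUP_SIZE);
    for (uint32_t i = lid; i < LOCAL_WORDS; i += GROUP_SIZE)
        global_buf[i] = local_buf[i];
}

/* Emulates one work group by running the work items one after another. */
void run_work_group(uint32_t *global_buf)
{
    uint32_t local_buf[LOCAL_WORDS];  /* stands in for __local memory */

    for (uint32_t lid = 0; lid < GROUP_SIZE; ++lid)
        fill_own_slice(local_buf, lid);

    /* In a real kernel, barrier(CLK_LOCAL_MEM_FENCE) would separate the phases. */

    for (uint32_t lid = 0; lid < GROUP_SIZE; ++lid)
        copy_strided(local_buf, global_buf, lid);
}
```

In the real kernel the work items run concurrently, and the two tunables I am asking about correspond to LOCAL_WORDS (the buffer size) and the element type used in copy_strided (the bytes moved per access).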