Best way to transfer data from local to global memory on AMD GCN

My work group size is 256.


In the first part of my kernel, each work item writes to a designated area of a local buffer.

Then, all work items transfer the entire local buffer to global memory.
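
To make the pattern concrete, here is a minimal host-side C sketch that emulates one work group. It is not OpenCL kernel code, and the helper names (`fill_phase`, `copy_phase`, `run_work_group`), the placeholder payload, and the choice of two 32-bit words per work item (a 2048-byte local buffer) are assumptions made up for illustration. The strided copy is only one candidate layout, in which neighbouring work items write neighbouring addresses on each pass.

```c
#include <assert.h>
#include <stdint.h>

#define WG_SIZE 256u                            /* work group size from the question */
#define WORDS_PER_ITEM 2u                       /* assumption: 2 x 32-bit words per work item */
#define LOCAL_WORDS (WG_SIZE * WORDS_PER_ITEM)  /* 512 words = 2048 bytes */

/* Phase 1: work item `lid` fills its own designated slice of the local buffer. */
static void fill_phase(uint32_t *local_buf, uint32_t lid) {
    assert(lid < WG_SIZE);
    for (uint32_t k = 0; k < WORDS_PER_ITEM; ++k)
        local_buf[lid * WORDS_PER_ITEM + k] = lid * 1000u + k;  /* placeholder payload */
}

/* Phase 2: work item `lid` copies words lid, lid + WG_SIZE, ... so that
   neighbouring work items touch neighbouring addresses on every pass. */
static void copy_phase(const uint32_t *local_buf, uint32_t *global_buf, uint32_t lid) {
    assert(lid < WG_SIZE);
    for (uint32_t i = lid; i < LOCAL_WORDS; i += WG_SIZE)
        global_buf[i] = local_buf[i];
}

/* Emulates one work group sequentially. Finishing the first loop before
   starting the second stands in for barrier(CLK_LOCAL_MEM_FENCE) in a kernel. */
void run_work_group(uint32_t *global_buf) {
    uint32_t local_buf[LOCAL_WORDS];
    for (uint32_t lid = 0; lid < WG_SIZE; ++lid)
        fill_phase(local_buf, lid);
    for (uint32_t lid = 0; lid < WG_SIZE; ++lid)
        copy_phase(local_buf, global_buf, lid);
}
```

Calling `run_work_group` with an output array of `LOCAL_WORDS` elements fills it with a copy of the emulated local buffer. In this sketch, changing `WORDS_PER_ITEM` changes both the local buffer size and the number of passes each work item makes during the copy.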


For optimal performance, what are the best settings for:


1) Size of the local buffer, e.g. 1024 bytes, 2048 bytes, etc.

2) Number of bits transferred per work item, e.g. 32 bits, 64 bits, etc.