Two performance questions for an AMD OpenCL engineer (NB. Cypress only).
- A 64-thread wavefront is split into four groups of 16 threads that execute on consecutive cycles. How is this split made? If I have a group size of (4, 16, 1), will it be split into groups of (1, 16, 1) or groups of (4, 4, 1)?
- Can a single read from LDS be broadcast to multiple threads, or will the threads all queue to access the same LDS memory location?
The first question matters because I want to arrange my accesses to LDS memory so that they avoid bank conflicts.
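To make the first question concrete, here is a small host-side C sketch of the two candidate splits for a (4, 16, 1) group. It only models index arithmetic. It does not claim to describe what Cypress actually does, which is exactly what I am asking. The function names are made up for illustration, and the assumption in both cases is that the wavefront is cut into quarters by taking consecutive runs of 16 flattened thread indices.

```c
#include <assert.h>

/* Work-group shape from the question: (4, 16, 1); 16 threads per quarter. */
enum { SIZE_X = 4, SIZE_Y = 16, QUARTER = 16 };

/* Hypothetical mapping A: x varies fastest when the local ID is flattened.
 * Each quarter then covers all 4 x values and 4 consecutive y values,
 * i.e. a (4, 4, 1) sub-group. */
int quarter_x_fastest(int x, int y)
{
    assert(x >= 0 && x < SIZE_X && y >= 0 && y < SIZE_Y);
    return (y * SIZE_X + x) / QUARTER;
}

/* Hypothetical mapping B: y varies fastest when the local ID is flattened.
 * Each quarter then covers one x value and all 16 y values,
 * i.e. a (1, 16, 1) sub-group. */
int quarter_y_fastest(int x, int y)
{
    assert(x >= 0 && x < SIZE_X && y >= 0 && y < SIZE_Y);
    return (x * SIZE_Y + y) / QUARTER;
}
```

Under mapping A, threads (0, 0) and (3, 3) land in the same quarter; under mapping B, threads (0, 0) and (0, 15) do. Which of the two applies decides which threads hit LDS on the same cycle, and so how I should lay out my data.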
The second question is important for sharing single values between multiple threads in a group. For example:
__local int scratch[...];
int i = scratch[0];  /* every thread in the group reads the same element; index 0 is only illustrative */
Will the read from scratch cause a 64-clock wait?