malcolm3141
Journeyman III

LDS performance questions

Two performance questions for an AMD OpenCL engineer (NB. Cypress only).

 

  1. Given that a 64-thread wavefront is split into four groups of 16 threads that execute on consecutive cycles, how is this split made? If I have a group size of (4, 16, 1), will it be split into groups of (1, 16, 1) or groups of (4, 4, 1)?
  2. Can a single read from LDS be broadcast to multiple threads, or will the threads all queue to access the same LDS memory location?
The first question matters because I want to arrange my accesses to LDS memory so as to avoid bank conflicts.
The second question is important for sharing single values between multiple threads in a group. For example:
__local int scratch[...];
...
int i = scratch[0];
Will the scratch[0] read cause a 64-clock wait?
Thanks,
Malcolm

 

1 Reply
n0thing
Journeyman III

I think threads are assigned to thread processors (TPs) in quads, so in your case threads (0,0), (1,0), (0,1) and (1,1) will be assigned to TP0; threads (2,0), (3,0), (2,1) and (3,1) will be assigned to TP1; and so on.
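
To make the quad assignment concrete, here is a minimal host-side sketch in plain C (not kernel code). It assumes 2x2 quads numbered row-major across the work-group, which matches the two quads listed above but is only a guess for the rest of the "and so on". The helper name quad_to_tp is hypothetical.

```c
/* Hypothetical helper: maps a work-item's local ID (lx, ly) to a thread
   processor index, ASSUMING 2x2 quads numbered row-major across a
   work-group that is group_w work-items wide (group_w must be even). */
static int quad_to_tp(int lx, int ly, int group_w)
{
    int quads_per_row = group_w / 2;           /* quads along the x axis */
    return (ly / 2) * quads_per_row + (lx / 2);
}
```

Under this assumption, a (4, 16, 1) group forms 16 quads, one per TP, so each 16-thread step would take one thread from every quad rather than a (1, 16, 1) or (4, 4, 1) slab. Treat that as an inference from the quad model, not a confirmed hardware detail.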

Broadcast from LDS is supported on Cypress: in my benchmark, broadcast reads and 32-bit linear reads both reach the maximum bandwidth, i.e. read bandwidth is around 850 GB/s on the 5870.
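
A rough way to reason about this on the host is to count how many serialized passes one group of LDS reads would need. The sketch below is plain C, not kernel code, and rests on two assumptions that are not stated in this thread: that LDS has 32 banks of 32-bit words (NUM_BANKS), and that a word's bank is its index modulo the bank count. Reads of the same word are counted once, to model broadcast. The function name lds_read_passes is hypothetical.

```c
#include <stddef.h>

#define NUM_BANKS 32u  /* ASSUMPTION: bank count is not given in the thread */

/* Returns the number of serialized passes for one group of reads: the
   largest count of distinct 32-bit word indices that fall in any single
   bank. Identical indices are counted once, modelling broadcast. */
static int lds_read_passes(const unsigned *word_idx, size_t n)
{
    int worst = 0;
    for (unsigned bank = 0; bank < NUM_BANKS; ++bank) {
        int distinct = 0;
        for (size_t i = 0; i < n; ++i) {
            if (word_idx[i] % NUM_BANKS != bank)
                continue;
            /* Count word_idx[i] only on its first occurrence. */
            int seen = 0;
            for (size_t j = 0; j < i; ++j) {
                if (word_idx[j] == word_idx[i]) { seen = 1; break; }
            }
            if (!seen)
                ++distinct;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

In this model, 16 threads all reading scratch[0] and 16 threads reading consecutive words both need a single pass, which is consistent with the two patterns hitting the same peak in the benchmark; a stride equal to the bank count would be the worst case.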
