Archives Discussions

malcolm3141 · ‎03-19-2010

Two performance questions for an AMD OpenCL engineer (NB. Cypress only).

1. Given that a 64-thread wavefront is split into four groups of 16 threads that execute on consecutive cycles, how is this split made? If I have a group size of (4, 16, 1), will it be split into groups of (1, 16, 1) or groups of (4, 4, 1)?
2. Can a single read from LDS be broadcast to multiple threads, or will the threads all queue to access the same LDS memory location?

The first question matters because I want to arrange my accesses to LDS memory so that they avoid bank conflicts.

The second question matters for sharing a single value between multiple threads in a group. For example:

__local int scratch[...];

...

int i = scratch[0];

Will the scratch[0] read cause a 64-clock wait?

Thanks,

Malcolm

n0thing · ‎03-20-2010

I think threads are assigned to thread processors in quads, so in your case threads (0,0), (1,0), (0,1) and (1,1) will be assigned to TP0, threads (2,0), (3,0), (2,1) and (3,1) will be assigned to TP1, and so on.
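To make that mapping concrete, here is a small host-side C sketch of it. It is only an illustration of my guess above, not documented hardware behaviour: it assumes 2x2 quads are tiled across x first and then down y, and the helper names quad_tp and quad_slot are made up for this post.

```c
/* Hypothetical helper: index of the thread processor (TP) that work-item
 * (x, y) would land on, ASSUMING work-items are packed into 2x2 quads and
 * the quads are numbered across x first, then down y. size_x is the group
 * size in x and is assumed to be even. */
int quad_tp(int x, int y, int size_x)
{
    int quads_per_row = size_x / 2;           /* quads side by side in x */
    return (y / 2) * quads_per_row + (x / 2);
}

/* Hypothetical helper: position of work-item (x, y) inside its quad,
 * in the order (0,0) (1,0) (0,1) (1,1) -> 0 1 2 3. Whether this is also
 * the cycle on which the work-item executes is an assumption. */
int quad_slot(int x, int y)
{
    return (y % 2) * 2 + (x % 2);
}
```

With a (4, 16, 1) group, quad_tp(1, 1, 4) gives 0 and quad_tp(2, 1, 4) gives 1, matching the two quads listed above, and the last work-item (3, 15) comes out as TP 15, so under this assumption the 64 work-items fill exactly 16 thread processors with four work-items each.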

Broadcast from LDS is supported on Cypress: in my benchmark I get maximum bandwidth from both broadcast reads and 32-bit linear reads, i.e. read bandwidth is around 850 GB/s on a 5870.

LDS performance questions