1 Reply Latest reply on Mar 20, 2010 10:21 AM by n0thing

    LDS performance questions


      Two performance questions for an AMD OpenCL engineer (N.B. Cypress only).


      1. Given that a 64-thread wavefront is split into four groups of 16 threads that execute on consecutive cycles, how is this split made? If I have a group size of (4, 16, 1), will it be split into groups of (1, 16, 1) or groups of (4, 4, 1)?
      2. Can a single read from LDS be broadcast to multiple threads, or will the threads all queue to access the same LDS memory location?
      The first question matters because I want to arrange my LDS accesses so as to avoid bank conflicts.
      The second question is important for sharing single values between multiple threads in a group. For example:
      __local int scratch[...];
      int i = scratch[0];
      Will the read of scratch[0] cause a 64-clock wait?



          I think threads are assigned to thread processors (TPs) in quads, so in your case threads (0,0), (1,0), (0,1) and (1,1) will be assigned to TP0; threads (2,0), (3,0), (2,1) and (3,1) will be assigned to TP1; and so on.
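
          To make the quad assignment described above concrete, here is a minimal sketch in plain C (host-side, not kernel code). The helper name quad_tp_index is hypothetical, and the row-major ordering of the 2x2 quads beyond TP0 and TP1 is an assumption that extends the "and so on"; it is not confirmed hardware behaviour.

```c
#include <assert.h>

/* Hypothetical helper: maps a work-item's local id (lx, ly) to a
   thread-processor index. ASSUMPTION: work-items are packed into 2x2
   quads, and quads are numbered row-major across the work-group. */
static int quad_tp_index(int lx, int ly, int group_w)
{
    assert(group_w > 0 && group_w % 2 == 0); /* quads need an even width */
    assert(lx >= 0 && lx < group_w && ly >= 0);

    int quads_per_row = group_w / 2;         /* 2 for a (4, 16, 1) group */
    return (ly / 2) * quads_per_row + (lx / 2);
}
```

          For a (4, 16, 1) group, quad_tp_index(lx, ly, 4) yields indices 0 to 15, with four work-items per index, which would match 16 thread processors each handling one quad.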

          Broadcast from LDS is supported on Cypress: in my benchmark, broadcast reads reach the same maximum bandwidth as 32-bit linear reads, i.e. read bandwidth is around 850 GB/s on the 5870.
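
          A rough way to reason about broadcast versus bank conflicts is the simplified model sketched below in plain C. The layout of 32 banks, each 4 bytes wide, is an assumption here, as is the rule that reads of the same 32-bit word are merged while distinct words in one bank are serialized. The function name lds_read_passes is hypothetical, and this is a model only, not a measurement of the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED LDS layout: 32 banks, each 4 bytes wide. */
enum { LDS_BANKS = 32, LDS_BANK_BYTES = 4 };

/* Hypothetical model: returns how many serialized passes a set of
   simultaneous reads needs. Reads of the same 32-bit word count once
   (broadcast); distinct words that fall in the same bank are serialized. */
static int lds_read_passes(const unsigned *byte_addr, size_t n)
{
    int worst = 0;
    for (int bank = 0; bank < LDS_BANKS; ++bank) {
        int distinct = 0;
        for (size_t i = 0; i < n; ++i) {
            unsigned word = byte_addr[i] / LDS_BANK_BYTES;
            if ((int)(word % LDS_BANKS) != bank)
                continue;
            /* Count each word only the first time it appears. */
            int seen = 0;
            for (size_t j = 0; j < i; ++j) {
                if (byte_addr[j] / LDS_BANK_BYTES == word) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                ++distinct;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

          Under this model, 16 threads all reading scratch[0] need a single pass, the same as 16 threads reading consecutive 32-bit words, whereas 16 threads reading with a stride of 32 words all hit one bank and need 16 passes.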