0 Replies Latest reply on Apr 24, 2016 8:19 AM by venator

    Radeon: Performance of gather from random memory locations

    venator

      I have a computational application where I need to gather memory blocks from seemingly random locations. Here is a simplified kernel that demonstrates this:

       

      // External definitions:
      // DATA_SIZE_WORDS: size of the data array in words.
      // BLOCK_SIZE_WORDS: number of words to read contiguously.
      // ITERATIONS: number of blocks to read from random locations per group.
      __kernel void gather(
          __global uint* const data,
          __global uint* out)
      {
          // Divide threads into groups. Threads in the same group access a
          // contiguous memory region.
          uint group_id = get_global_id(0) / BLOCK_SIZE_WORDS;

          // Id of the thread within its group (assumes the work-group size
          // is a multiple of BLOCK_SIZE_WORDS).
          uint thread_id = get_local_id(0) % BLOCK_SIZE_WORDS;

          // Each group uses a different random stream to decide
          // which memory blocks to read.
          uint random_seed = group_id + 7890;

          // Read ITERATIONS blocks and compress them into a single value by
          // XORing them together. Each thread reads 1 word per iteration and
          // therefore 4 * ITERATIONS bytes in total.
          uint compressed = 0;
          for (int i = 0; i < ITERATIONS; ++i) {
              // Update the random seed to get the next location to read from.
              random_seed = 1103515245 * random_seed + 12345;
              uint block = random_seed % (DATA_SIZE_WORDS / BLOCK_SIZE_WORDS);
              compressed ^= data[block * BLOCK_SIZE_WORDS + thread_id];
          }

          // Conditionally write to out so that the compiler cannot
          // optimize everything away.
          if (compressed == 0x123456) {
              out[0] = group_id;
          }
      }
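       

      For reference, the block-index sequence a group reads can be reproduced on the host, since the kernel's LCG is fully deterministic. Below is a small C sketch of that; the constants match the kernel above, but the DATA_SIZE_WORDS, BLOCK_SIZE_WORDS, and ITERATIONS values are placeholder assumptions, not the ones used in the benchmark:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Placeholder values; in the real benchmark these are presumably
       * passed to the OpenCL compiler as -D defines. */
      #define DATA_SIZE_WORDS  (64u * 1024u * 1024u)  /* 256 MiB of uint data */
      #define BLOCK_SIZE_WORDS 16u                    /* 64-byte blocks */
      #define ITERATIONS       64u

      /* Same LCG step as in the kernel; uint32_t wraps around exactly
       * like OpenCL's uint. */
      static uint32_t next_seed(uint32_t seed)
      {
          return 1103515245u * seed + 12345u;
      }

      int main(void)
      {
          const uint32_t num_blocks = DATA_SIZE_WORDS / BLOCK_SIZE_WORDS;
          uint32_t seed = 0u + 7890u;  /* group_id 0 */

          for (uint32_t i = 0; i < ITERATIONS; ++i) {
              seed = next_seed(seed);
              uint32_t block = seed % num_blocks;
              /* Every index stays inside the data array. */
              assert(block < num_blocks);
              if (i < 4)
                  printf("iteration %u reads block %u\n", i, block);
          }
          return 0;
      }
      ```

      Each group walks its own deterministic sequence, so consecutive iterations of one group jump across the whole array while the threads within the group stay coalesced inside one block.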

       

      The memory area being read is too large to fit into local memory or the caches, so the reads hit global memory.

       

      This kernel measures the performance of reading contiguous memory blocks of N words. Unfortunately, for small N the performance is much worse than for large N, and even when reading blocks of 512 bytes it does not reach the theoretical limit of the device. I bought an R9 Nano to make use of the higher memory bandwidth that HBM provides, but the speedup was smaller than expected:

       

      Performance on an R9 290 and an R9 Nano:

      Block size    R9 290          R9 Nano
      64 B          104.904 GB/s    112.022 GB/s
      128 B         207.284 GB/s    221.711 GB/s
      256 B         271.56 GB/s     373.278 GB/s
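
      For context, figures like these are presumably derived from the total bytes the kernel reads divided by the elapsed time. A hedged sketch of that arithmetic in C (the work size and timing below are invented for illustration, not measurements):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Each work-item reads 4 * ITERATIONS bytes (one uint per iteration), so:
       *   bytes_read = global_work_size * iterations * 4
       *   bandwidth  = bytes_read / elapsed_seconds                          */
      static double bandwidth_gb_per_s(uint64_t global_work_size,
                                       uint64_t iterations,
                                       double elapsed_seconds)
      {
          double bytes_read = (double)global_work_size * (double)iterations * 4.0;
          return bytes_read / elapsed_seconds / 1e9;
      }

      int main(void)
      {
          /* Made-up example: 1M work-items, 1024 iterations, 40 ms elapsed. */
          double bw = bandwidth_gb_per_s(1u << 20, 1024, 0.040);
          printf("%.1f GB/s\n", bw);
          return 0;
      }
      ```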

       

      Any ideas how to increase the performance on the Nano, especially for the smaller block sizes?

       

      Attaching full source code of the benchmark.