2 Replies Latest reply on Apr 26, 2010 3:55 PM by ryta1203

    Coalescence

    marwen

      i'm trying to evaluate performance differences between global memory access in a 128 bytes aligned boundary manner against an access not aligned.

      the fact is that i can't find any difference although there should be an improvement of like an order of magnitude due to coalesced access.

      so what is the smallest kernel code which exhibit global memory access coalescence?

      Thanks

        • Coalescence
          omkaranathan

           

          Coalescing :  memory access pattern  that occurs when that the memory requests generated by each thread are in the same GPU cache line. So when each thread reads a vec4 sequentially from global memory that access pattern is coalesced.

           You can compare the performance between these 2 kernels –

           __kernel void read(__global float *input,__global float *output)

          {

          float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

          uint index = get_global_id(0);;

          val = val + input[index + 0];

          val = val + input[index + 1];

          val = val + input[index + 2];

          val = val + input[index + 3];

          val = val + input[index + 4];

          val = val + input[index + 5];

          val = val + input[index + 6];

          val = val + input[index + 7];

          val = val + input[index + 8];

          val = val + input[index + 9];

          val = val + input[index + 10];

          val = val + input[index + 11];

          val = val + input[index + 12];

          val = val + input[index + 13];

          val = val + input[index + 14];

          val = val + input[index + 15];

          output[get_global_id(0);] = val;

           }

           And

           __kernel void read(__global float4 *input,__global float4 *output)

          {

          float4 val = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

          uint index = get_global_id(0);;

          val = val + input[index + 0];

          val = val + input[index + 1];

          val = val + input[index + 2];

          val = val + input[index + 3];

          val = val + input[index + 4];

          val = val + input[index + 5];

          val = val + input[index + 6];

          val = val + input[index + 7];

          val = val + input[index + 8];

          val = val + input[index + 9];

          val = val + input[index + 10];

          val = val + input[index + 11];

          val = val + input[index + 12];

          val = val + input[index + 13];

          val = val + input[index + 14];

          val = val + input[index + 15];

          output[get_global_id(0);] = val;

           }

           Note the difference between float and float4 in arguments.

          The 2nd kernel is using coalesce accesses and you will get a good bandwidth using this pattern.



            • Coalescence
              ryta1203

              Using a similar benchmark in CAL/IL in compute shader mode, I actually get better performance using float than float4.

              Albeit, the float4 gets 4* as many floats as the float but just looking at it, the float global memory access is better overall, but not "per float".

              EDIT: what's oddly funny is that I get a better performance for global reads using pixel shader mode for both the float and float4 types.