1 Reply Latest reply on Sep 22, 2014 7:58 AM by dipak

    vload4 vs 4 individual memory accesses : bank conflicts


      What is the advantage of vload4 over 4 single memory accesses?

      Suppose I am loading data from local memory. Below are two kernels. The second kernel should exhibit no bank conflicts.

      Does the first kernel have bank conflicts? If one vload4 is executed per clock, then there should be conflicts within a half-wavefront.

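      __kernel void kernel1(__local int *localBuffer) {
          // localBuffer is assumed to be a __local int buffer.
          // vload4(offset, p) reads from p + offset*4, so the offset is the
          // work-item id itself, not id*4.
          int4 test = vload4(get_global_id(0), localBuffer);
      }
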
      __kernel void kernel2(__local int *localBuffer) {
          // localBuffer is assumed to be a __local int buffer.
          int4 test;
          int start = get_global_id(0)*4;
          test.x = localBuffer[start];
          test.y = localBuffer[start+1];
          test.z = localBuffer[start+2];
          test.w = localBuffer[start+3];
      }
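
For comparing the two kernels, it may help to spell out how vload4 addresses memory. In OpenCL C, vloadn(offset, p) reads n consecutive elements starting at p + offset*n, so the offset counts whole vectors rather than elements. The following is a minimal host-side sketch in plain C, not kernel code; the names `int4_emul`, `vload4_emul` and `scalar_load4` are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's int4. */
typedef struct { int x, y, z, w; } int4_emul;

/* Emulates vload4(offset, p): reads 4 consecutive ints
   starting at p + offset * 4. */
static int4_emul vload4_emul(size_t offset, const int *p) {
    assert(p != NULL);
    const int *base = p + offset * 4;
    int4_emul v = { base[0], base[1], base[2], base[3] };
    return v;
}

/* Emulates the four scalar reads of kernel2 for work-item gid. */
static int4_emul scalar_load4(size_t gid, const int *p) {
    assert(p != NULL);
    size_t start = gid * 4;
    int4_emul v = { p[start], p[start + 1], p[start + 2], p[start + 3] };
    return v;
}
```

With this addressing, vload4_emul(gid, buf) and scalar_load4(gid, buf) return the same four elements, whereas passing gid*4 as the vload4 offset would start reading at element gid*16.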


        • Re: vload4 vs 4 individual memory accesses : bank conflicts


          As per section 6.2, "Local Memory (LDS) Optimization", in chapter 6, "OpenCL Performance and Optimization for GCN Devices", of the APP Programming Guide:


          A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern.

          I guess that in kernel1, vload4 has the same effect as reading a float4 value, so the performance will accordingly be half that of the float access pattern.
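
To reason about the stride-4 indexing used in the question, one can tabulate which bank each work-item's element falls into. The sketch below is plain host-side C, not kernel code, and rests on assumptions that should be checked against the programming guide: 32 LDS banks of 4 bytes each, a bank index of (element index mod 32), and a simplified model in which a conflict means several work-items serviced together hit the same bank at different addresses. It ignores how the hardware actually schedules or merges vector loads, so it is an aid for reasoning rather than a performance predictor. The names `bank_of` and `worst_bank_load` are hypothetical.

```c
#include <assert.h>

enum { NUM_BANKS = 32 };  /* assumption: 32 LDS banks, each 4 bytes wide */

/* Bank holding the 4-byte element at index dword_index. */
static int bank_of(int dword_index) {
    assert(dword_index >= 0);
    return dword_index % NUM_BANKS;
}

/* Each of `lanes` work-items reads element gid * stride + component.
   Returns the largest number of work-items that land in a single bank
   (1 means no two work-items share a bank). */
static int worst_bank_load(int lanes, int stride, int component) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(lanes > 0 && stride > 0 && component >= 0);
    for (int gid = 0; gid < lanes; ++gid) {
        int b = bank_of(gid * stride + component);
        if (++hits[b] > worst) {
            worst = hits[b];
        }
    }
    return worst;
}
```

Under this model, worst_bank_load(32, 1, 0) is 1 (each of 32 work-items gets its own bank), while worst_bank_load(32, 4, 0) is 4, because a stride of four elements reaches only 8 of the 32 banks. Whether that shows up as a real slowdown depends on how many work-items the LDS services per cycle and on how the compiler lowers the loads.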