5 Replies Latest reply on Nov 29, 2012 12:22 PM by settle

    Wide memory access

    josopait

      Hello,

       

      is it possible to have wide memory access without declaring the buffer as float4? Does anybody know?

       

      I have a kernel similar to this:

       

      __kernel void test(__global float* uav0)
      {
        int r = get_global_id(0);
        uav0[r] = 0;
        uav0[r+1] = 0;
        uav0[r+2] = 0;
        uav0[r+3] = 0;
      }

       

      The assembly generated from this does not look optimal: each memory access gets its own tbuffer_store_format_x instruction. I could work around this by declaring the buffer as float4, but I don't want to do that. From what I understand, the program would be considerably faster if the memory were written with a single tbuffer_store_format_xyzw instruction, even if the index is not guaranteed to be aligned. Is there a way to enforce this? I am surprised that the compiler is not able to optimize this on its own.

       

      Thanks for any help

      Ingo Josopait

        • Re: Wide memory access
          binying

          Could you set the values in the buffer beforehand, since they are constants?

          Or use float2?

          • Re: Wide memory access
            Bdot

            Would it help if every thread wrote to different addresses instead of overlapping ones? I.e.

            r = get_global_id(0)*4;

            I have similar code for loading 4 single uints from global memory, and that is nicely packed by the compiler.

             

            You can also use vstore to explicitly write several elements at once, e.g.

             

            vstore4( (float4)0, r, uav0);

             

            (Though I'm not certain if you can use a constant for the first parameter ... - give it a try)
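            For reference, here is a minimal sketch of the vstore4 variant. Per the OpenCL C specification, vstore4(data, offset, p) stores to the address p + offset*4, so the offset counts groups of four floats rather than single floats, and for a float pointer the address only has to be float-aligned, not 16-byte aligned. The kernel name test_vstore and the host-only shim under #ifndef __OPENCL_VERSION__ (including shim_global_id) are assumptions added for illustration; the shim only lets the same file build as plain C so the addressing can be checked on the CPU. Whether the compiler really emits one tbuffer_store_format_xyzw for this depends on the compiler version and device, so it is worth checking the generated ISA.

```c
/* Host-only shim (an assumption for illustration): lets this file also build
   as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
typedef struct { float x, y, z, w; } float4;
static size_t shim_global_id = 0;  /* hypothetical: set by a host-side test */
static size_t get_global_id(unsigned int dim) { (void)dim; return shim_global_id; }
/* Mirrors the built-in: stores 4 floats starting at p + offset * 4. */
static void vstore4(float4 v, size_t offset, float *p)
{
    p += offset * 4;
    p[0] = v.x; p[1] = v.y; p[2] = v.z; p[3] = v.w;
}
#endif

__kernel void test_vstore(__global float *uav0)
{
    size_t gid = get_global_id(0);

    /* Component-wise initialization is valid in OpenCL C and in the C shim. */
    float4 zero;
    zero.x = 0.0f; zero.y = 0.0f; zero.z = 0.0f; zero.w = 0.0f;

    /* Writes uav0[4*gid] .. uav0[4*gid + 3]; the offset is in units of
       four floats, so gid must NOT be pre-multiplied by 4 here. */
    vstore4(zero, gid, uav0);
}
```

            Note that combining this call with r = get_global_id(0)*4 would make each work-item write at element 16*get_global_id(0), which is probably not what is intended.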

              • Re: Wide memory access
                josopait

                Thanks Bdot!

                 

                I meant non-overlapping memory writes. My mistake.

                 

                With

                r = get_global_id(0)*4;

                the compiler does actually optimize the code nicely. However, if the address is more complex than in this simple test case and the compiler cannot infer that the access is non-overlapping, then it won't pack the memory access anymore. So I think I will use vstore4 instead, which seems to do what I want.
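                A sketch of how this could look when the start index is not a simple multiple of 4: pass an offset of 0 and do the address arithmetic on the pointer, so the four floats land at an arbitrary float-aligned position. The kernel name test_strided and the parameters base and stride are hypothetical stand-ins for the "more complex" address, and the host-only shim under #ifndef __OPENCL_VERSION__ is again an assumption that just lets the file build as plain C. Keeping the writes non-overlapping (here, stride >= 4) is up to the caller; the compiler no longer has to prove it.

```c
/* Host-only shim (an assumption for illustration): lets this file also build
   as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
typedef struct { float x, y, z, w; } float4;
static size_t shim_global_id = 0;  /* hypothetical: set by a host-side test */
static size_t get_global_id(unsigned int dim) { (void)dim; return shim_global_id; }
/* Mirrors the built-in: stores 4 floats starting at p + offset * 4. */
static void vstore4(float4 v, size_t offset, float *p)
{
    p += offset * 4;
    p[0] = v.x; p[1] = v.y; p[2] = v.z; p[3] = v.w;
}
#endif

/* base and stride are hypothetical parameters standing in for a more
   complex index computation. */
__kernel void test_strided(__global float *uav0,
                           unsigned int base,
                           unsigned int stride)
{
    /* First element this work-item writes; need not be a multiple of 4. */
    size_t start = base + get_global_id(0) * stride;

    float4 zero;
    zero.x = 0.0f; zero.y = 0.0f; zero.z = 0.0f; zero.w = 0.0f;

    /* Offset 0 plus pointer arithmetic: writes uav0[start] .. uav0[start + 3]. */
    vstore4(zero, 0, uav0 + start);
}
```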

              • Re: Wide memory access
                settle

                Have you considered using async_work_group_copy()?  I know it's for copying between local and global addresses, so maybe you need to create some local temp variable, but it's worth a shot to see if the compiler will optimize better with it.