8 Replies Latest reply on Jun 10, 2010 8:49 PM by Raistmer

    vstore doubt

    bubu

      Should I write this using vstore on a GPU?

       

      __kernel void main ( __global float4 *outb )
      {
             const int x = get_global_id(0);
             const int y = get_global_id(1);
             const int width = get_global_size(0);

             const float4 data = ...;

             outb[y*width+x] = data;
      }

       

      or

       

      __kernel void main ( __global float4 *outb )
      {
             const int x = get_global_id(0);
             const int y = get_global_id(1);
             const int width = get_global_size(0);

             const float4 data = ...;

             /* vstore4 writes to p + offset*4, so the offset is in float4 units (no 16* scaling) */
             vstore4 ( data, y*width+x, (__global float*)outb );
      }

       

      ???

      (That way it would run efficiently on both GPU and CPU.)

       

      And what's the #define to tell that ATI's OpenCL implementation is running on the CPU? #ifdef ATI_OPENCL_CPU ?

      thx

       

        • vstore doubt
          omkaranathan

           

           

           


          The second way is better.

           

           

          And what's the #define to tell that ATI's OpenCL implementation is running on the CPU? #ifdef ATI_OPENCL_CPU ?

           

          There is no such predefined macro at present.

           

            • vstore doubt
              LeeHowes

              I don't think that's right. I'd say the first way is better, but it depends. The first way will do a vector write, so you will use fewer write instructions. It will only work if the pointer is 16-byte aligned, though. The second way will write a vector using float writes, so will work happily unaligned, but will always use multiple writes. Given that you passed a float4 output pointer I'm going to assume it is aligned and say that I'd go with the first version.
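
              To make the addressing difference concrete, here is a minimal sketch in plain host-side C (not OpenCL C) that models the two stores. `float4_model`, `vstore4_model` and `store_indexed_model` are hypothetical names invented for this illustration. It assumes the OpenCL rule that `vstore4(data, offset, p)` writes to `p + offset*4`, so the offset counts whole vectors, and it models the alignment-tolerant store as four scalar writes. That matches the behaviour described above; it is not a statement about what any particular compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4; not the real built-in type. */
typedef struct { float s[4]; } float4_model;

/* Models the addressing of vstore4(data, offset, p): the four components
 * land at p[offset*4 + 0..3]. Writing element by element means p only has
 * to be float-aligned. */
static void vstore4_model(float4_model data, size_t offset, float *p)
{
    assert(p != NULL);
    for (int i = 0; i < 4; ++i)
        p[offset * 4 + i] = data.s[i];
}

/* Models outb[index] = data on a float4 pointer: one whole-vector copy,
 * which on a device relies on outb being aligned to the vector size. */
static void store_indexed_model(float4_model data, size_t index,
                                float4_model *outb)
{
    assert(outb != NULL);
    outb[index] = data;
}
```

              In this model, `vstore4_model(v, i, (float *)outb)` and `store_indexed_model(v, i, outb)` touch the same four floats, so the index passed to `vstore4` would be `y*width+x` with no byte scaling.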

              Also, is an ATI_CPU macro really what you want? Other than SIMD width, what purpose do you have for knowing it's on the CPU? Because "CPU" isn't a very specific term (and of course neither is ATI/AMD), I'm not sure it would be a good macro to have. SIMD width could be, but in either case I think you're better off querying the runtime and specifying compiler macros of your own as necessary.
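
              As a rough sketch of "specify compiler macros of your own", the host could format an options string and pass it as the `options` argument of `clBuildProgram`. The example below is plain C with no OpenCL calls so that it stands alone. `format_build_options`, `DEVICE_IS_CPU` and `SIMD_WIDTH` are hypothetical names chosen for this illustration. The assumption is that `is_cpu` comes from a `clGetDeviceInfo` query for `CL_DEVICE_TYPE` and `simd_width` from a query such as `CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats an options string such as "-D DEVICE_IS_CPU=1 -D SIMD_WIDTH=4".
 * DEVICE_IS_CPU and SIMD_WIDTH are made-up macro names for this sketch.
 * Returns the number of characters written, or -1 if buf is too small. */
static int format_build_options(char *buf, size_t size,
                                int is_cpu, unsigned simd_width)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, "-D DEVICE_IS_CPU=%d -D SIMD_WIDTH=%u",
                     is_cpu ? 1 : 0, simd_width);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

              The kernel source could then branch on `#if DEVICE_IS_CPU` or on `SIMD_WIDTH`, which keeps the decision in the application rather than relying on a vendor-defined macro.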

                • vstore doubt
                  Raistmer
                  Will vstore4 use many separate writes? It is listed as a vector store ...
                    • vstore doubt
                      LeeHowes

                      Yes, they allow you to store vector types to a pointer in memory. The trick is that this could be worded as "allow you to store vector types to a pointer in memory, even if the pointer is not aligned to a multiple of the vector type". Note that the spec only specifies an alignment of gentype where the instruction is vstoren(gentype *...). Given that, unless the compiler can guarantee that the value is aligned, it has to write it as multiple unaligned writes. A float4 pointer, on the other hand, is assumed to be aligned, as I understand it.

                       

                      I'll quote Micah on this one:

                      [quote]Vload4 and vstore4 allow you to load unaligned data into a vector register. They explicitly do not do a single instruction, whereas loading from a float4 pointer would have to be aligned and do a 128-bit load. They're for the case where you can't do the float4 load rather than a way of forcing it to work.[/quote]

                • vstore doubt
                  Raistmer
                  Oops, very interesting info, thanks.
                  So a float4* pointer should be used instead of float* with vload4/vstore4 if one has (or can have) properly aligned data and wants to speed up memory accesses. It was not obvious from the documentation.
                  • vstore doubt
                    MicahVillmow
                    Raistmer,
                    The only time you want to use vload/vstore is if the alignment of your data is unknown. If you know that the pointers in question are aligned properly to the size of the vector you want to load, cast the pointer to that vector and load via array syntax.
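
                    If the alignment has to be established on the host before choosing between the two kernel variants, a check along these lines might do. It is a plain C sketch; `is_aligned` is a hypothetical helper, and it assumes `alignment` is a power of two (16 for float4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on an `alignment`-byte boundary, 0 otherwise.
 * `alignment` must be a power of two, e.g. 16 for a float4. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

                    A pointer that passes `is_aligned(p, 16)` is a candidate for the float4 array-syntax path; anything else would fall back to vload4/vstore4.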
                    • vstore doubt
                      Raistmer
                      I made the appropriate change in all my kernels and will report whether it leads to any difference in speed.