12 Replies Latest reply on Jan 21, 2010 9:49 AM by Fr4nz

    vector vs scalar memory operations

    tomhammo

From the performance guide:

"The GPU memory subsystem can coalesce multiple concurrent accesses to global memory, provided the memory addresses increase sequentially across the work-items in the wavefront and start on a 128-byte alignment boundary."

So code like the following would be most efficient:

float* data = ...;

data[get_global_id(0)] = ...;
... = data[get_global_id(0)];

However, does this also apply to vector data?

       

       

float4* data = ...;

data[get_global_id(0)] = ...;
... = data[get_global_id(0)];



Regards,

      - Tom

        • vector vs scalar memory operations
          n0thing

The linear access pattern optimization applies to both scalar and vector data types. In fact, you should use vectorized accesses (both reads and writes) to get the maximum bandwidth from global memory.

            • vector vs scalar memory operations
              Fr4nz

               

Originally posted by: n0thing The linear access pattern optimization applies to both scalar and vector data types. In fact, you should use vectorized accesses (both reads and writes) to get the maximum bandwidth from global memory.

               

Hmm, I'm not so sure that going from scalar to vector reads/writes is so straightforward, keeping bank conflicts in mind.

In fact, considering that the newest ATI 5xxx cards have 32 banks and each bank entry is 32 bits wide, when you store (for example) a uint from global to local memory you know that thread 0 will write to bank 0, thread 1 to bank 1, and so on. Everything is OK here: you have coalesced reads/writes and no bank conflicts (provided, obviously, that you use a correct access pattern).

BUT, if you write/read a uint4 (which is 128 bits wide) from global to local (and vice versa), thread 0 will write to banks 0-3, thread 1 to banks 4-7, and so on. You'll get coalesced writes/reads and no conflicts only for the first 8 threads of a workgroup.

In fact, if you do reads/writes of uint4s as reads/writes of simple uints, you'll get a bank conflict when, for example, you use thread 8, because it will read/write the same bank that thread 0 is reading/writing. The same applies to thread pairs (1,9), (2,10), and so on. Or at least, this is what I've understood from the lessons and comments here:

              http://www.macresearch.org/opencl_episode5

              http://www.macresearch.org/opencl_episode6

If I'm wrong, please correct me... nou, since you're the expert, do you have anything to say about this?

                • vector vs scalar memory operations
                  nou

                   

                  Originally posted by: Fr4nz

BUT, if you write/read a uint4 (which is 128 bits wide) from global to local (and vice versa), thread 0 will write to banks 0-3, thread 1 to banks 4-7, and so on. You'll get coalesced writes/reads and no conflicts only for the first 8 threads of a workgroup.

I think if you run a 64-wide workgroup, then on one SIMD core the first instruction from each thread executes over four cycles. So you still get nicely coalesced memory access: in the first cycle the first 16 threads read bytes 0-255 of memory, in the second cycle bytes 256-511, and so on.

If I'm wrong, please correct me... nou, since you're the expert, do you have anything to say about this?

I am not an expert. I am only a university student.

                    • vector vs scalar memory operations
                      Fr4nz

                       

                      Originally posted by: nou
                      Originally posted by: Fr4nz

                       

BUT, if you write/read a uint4 (which is 128 bits wide) from global to local (and vice versa), thread 0 will write to banks 0-3, thread 1 to banks 4-7, and so on. You'll get coalesced writes/reads and no conflicts only for the first 8 threads of a workgroup.

                       

                       

I think if you run a 64-wide workgroup, then on one SIMD core the first instruction from each thread executes over four cycles. So you still get nicely coalesced memory access: in the first cycle the first 16 threads read bytes 0-255 of memory, in the second cycle bytes 256-511, and so on.



So you're suggesting that each thread should read a uint4 as 4 uints and place the components in the same bank using an appropriate access pattern (for example "lid + vecComponent*localsize")?

The problem with this approach is that the written uint4 won't be laid out correctly in local memory, because the vector components aren't scattered across successive banks but are all placed in the same bank.

For example, the components of a uint4 should be placed in banks n, n+1, n+2, n+3 in order to be read back correctly, but with your approach they would all be placed in bank n.

Anyway, it would be nice to hear something from the AMD technicians about reading/writing vectors from/to LDS...

                • vector vs scalar memory operations
                  hazeman

The OpenCL compiler converts memory reads to uav_raw_load_id(n). The only difference between float* and float4* is the mask on the output register.

And from the IL reference guide: "Four consecutive 32-bit components are read from UAV(n), starting at address src0.x (post-swizzle). One to four dwords are written to dst, depending on the dst mask."

So only using float4* makes sense (from a performance point of view).

                   

                    • vector vs scalar memory operations
                      Fr4nz

                       

Originally posted by: hazeman The OpenCL compiler converts memory reads to uav_raw_load_id(n). The only difference between float* and float4* is the mask on the output register.

                       

And from the IL reference guide: "Four consecutive 32-bit components are read from UAV(n), starting at address src0.x (post-swizzle). One to four dwords are written to dst, depending on the dst mask."

                       

So only using float4* makes sense (from a performance point of view).



So, if you write a float4, are its components by default written to the same bank, avoiding conflicts between threads? Right?

And does this hold only for float4? No writing to the same bank for a uint4?

                        • vector vs scalar memory operations
                          hazeman

                           

                          Originally posted by: Fr4nz

So, if you write a float4, are its components by default written to the same bank, avoiding conflicts between threads? Right?

And does this hold only for float4? No writing to the same bank for a uint4?

I'm not sure what the case is with writing (the documentation is unclear sometimes). But from the IL perspective it doesn't matter whether it's float, uint, or int - so float4 is treated the same way as uint4. The distinction between types is made by the instruction used for the operation (like iadd (integer) versus add (float)). But for all writes, reads, and moves, only the size of the type matters.

                           

                            • vector vs scalar memory operations
                              Fr4nz

                               

                              Originally posted by: hazeman
                              Originally posted by: Fr4nz

                               

So, if you write a float4, are its components by default written to the same bank, avoiding conflicts between threads? Right?

                               

And does this hold only for float4? No writing to the same bank for a uint4?

                               

                               

I'm not sure what the case is with writing (the documentation is unclear sometimes). But from the IL perspective it doesn't matter whether it's float, uint, or int - so float4 is treated the same way as uint4. The distinction between types is made by the instruction used for the operation (like iadd (integer) versus add (float)). But for all writes, reads, and moves, only the size of the type matters.



Hmm, so maybe we have a difference here between ATI and Nvidia (obviously in favor of ATI), because David Gohara says here, when answering a question (and referring to his GTX285, if I remember correctly):

                               

Question: "Each float4 is 128 bits wide, so threads 0, 4, 8 & 12 would access banks 0-3, and threads 1, 5, 9 & 13 banks 4-7? This would suggest that bank conflicts are only avoidable when using 32-bit (or smaller) data items. Is that true?"

                              Answer: In this case the float4 data is mapped across 4 banks. Each bank entry is only 32 bits wide on current hardware. So the first 4 threads would read their respective data with no bank conflict. The 5th thread read would result in a bank conflict since it would overlap with the first and so forth. Although the -N constructs (where N > 1) are supported, the benefit from packing data into the vector type when shared memory is used isn't really there from my understanding. Note that bank conflicts can only occur when more than one thread in a single half-warp is accessing the same bank (with the exception of broadcast, of course). This is because a separate read/write instruction is issued by the hardware for each half-warp.



                              • vector vs scalar memory operations
                                n0thing

The documentation says: "In a single cycle, local memory can service a request for each bank (up to 32 accesses each cycle)."

Since only 16 threads are executed on a SIMD in one cycle, reading float and float2 linearly from LDS shouldn't give any bank conflicts, but using float2 shows up as a 100% bank conflict in the profiler (and performance is reduced too).

Bank conflicts shouldn't depend on the group size as long as it is greater than 16, but on changing the group size from 64 to 16 I get 0 bank conflicts when using float2.

                                 

                                  • vector vs scalar memory operations
                                    Fr4nz

                                     

Originally posted by: n0thing The documentation says: "In a single cycle, local memory can service a request for each bank (up to 32 accesses each cycle)."

                                     

Since only 16 threads are executed on a SIMD in one cycle, reading float and float2 linearly from LDS shouldn't give any bank conflicts, but using float2 shows up as a 100% bank conflict in the profiler (and performance is reduced too).

                                     

Bank conflicts shouldn't depend on the group size as long as it is greater than 16, but on changing the group size from 64 to 16 I get 0 bank conflicts when using float2.



So this suggests that the 2 components of a float2 are scattered across 2 consecutive banks instead of being allocated to the same bank. This gives conflicts and should explain why you get 0 conflicts with a group size of 16...

                                      • vector vs scalar memory operations
                                        n0thing

But it doesn't explain why I get 100% conflicts when using a group size of 64, since threads 0-15 execute in cycle 1, threads 16-31 in cycle 2, and so on...

So even when using float2, no more than 32 banks are being accessed in one cycle.

                                          • vector vs scalar memory operations
                                            Fr4nz

I was just guessing: what if we have to write short/ushort values to LDS? They are 16 bits wide, so a pair of ushort values should fit exactly into one bank entry (which is 32 bits wide). But this would mean we have pairs of threads accessing the same bank concurrently, e.g. threads 0 and 1 access bank 0, threads 2 and 3 access bank 1, and so on.

This would surely result in bank conflicts, if I'm correct.

Does it work this way for short values? Do we have to devise an ad hoc access pattern different from the one used when writing uint values to LDS?