10 Replies. Latest reply on Jun 13, 2010 6:43 PM by bubu

    array in kernel

    bubu

      What happens when I do this, pls?

       

      __kernel void main ()

      {

         int myArray[60];

      }

       

      Is myArray stored in local memory? Global memory? Private memory?

      I use a 256x128 grid with 16x4 workgroups.
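
      On the host I launch it roughly like this (a simplified sketch; "queue" and "kernel" stand for the command queue and kernel object I create earlier):

      size_t global[2] = { 256, 128 };   /* 256x128 grid of work-items */
      size_t local[2]  = { 16, 4 };      /* 16x4 work-groups */
      cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                          global, local, 0, NULL, NULL);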

      thx

        • array in kernel
          MicahVillmow
          Please reference the OpenCL spec for these types of questions; they are all answered in Section 6.5.
          The spec can be located here:
          http://www.khronos.org/opencl/
            • array in kernel
              bubu

               

              Originally posted by: MicahVillmow Please reference the OpenCL spec for these types of questions; they are all answered in Section 6.5. The spec can be located here: http://www.khronos.org/opencl/


              Private space. The question is: is that implemented as global memory, with a 120-cycle penalty if I don't coalesce the accesses?

              For example, in CUDA you can use the __local modifier to avoid the need for coalescing (that's not OpenCL's local memory, but a modifier to use private memory without having to coalesce the accesses). I want to know how this works on ATI cards with OpenCL, pls.

               

              My kernel is very complex and I cannot use OpenCL's local memory, so I need the fastest available way to get a 60-int stack per thread.
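
              To be concrete, by "stack" I mean something like this inside the kernel (a simplified sketch; "walk", "nodes" and "out" are just placeholders for what my real kernel does):

              __kernel void walk(__global const int *nodes, __global int *out)
              {
                  int stack[60];               /* the per-work-item 60-int stack */
                  int sp = 0;
                  int sum = 0;

                  stack[sp++] = 0;             /* push the root */
                  while (sp > 0)
                  {
                      int n = stack[--sp];     /* pop */
                      sum += nodes[n];
                      /* ... push up to 60 child indices here ... */
                  }
                  out[get_global_id(0)] = sum;
              }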

            • array in kernel
              MicahVillmow
              Edit: Private memory goes to scratch, which exists in Global memory.
                • array in kernel
                  bubu

                   

                  Originally posted by: MicahVillmow Edit: Private memory goes to scratch, which exists in Global memory.


                  That's fine, but... do I need to perform any kind of coalescing to improve the speed, pls? Or does it work like CUDA's __local memory? Or does your compiler take care of that automatically?

                • array in kernel
                  MicahVillmow
                  From Section 4.4 of our Programming Guide, on global memory:
                  "The GPU memory subsystem can coalesce multiple concurrent accesses to
                  global memory, provided the memory addresses increase sequentially across the
                  work-items in the wavefront and start on a 128-byte alignment boundary."

                  This is impossible to do with private memory sizes larger than a single 128-bit value, as private memory is allocated as one large chunk per thread and is not interleaved with other threads. So, if you want fast memory, you need to use local memory (you have up to 32KB) or registers (up to 256 vec4 registers).
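
                  For example (an illustrative access pattern, not code from the guide; it assumes the buffer holds 60 ints per work-item):

                  __kernel void pattern_demo(__global const int *in, __global int *out)
                  {
                      int gid = get_global_id(0);

                      /* Coalesced: work-item i reads element i, so addresses increase
                         sequentially across the wavefront. */
                      int a = in[gid];

                      /* Not coalesceable: each work-item reads the start of its own
                         60-int chunk, so consecutive work-items are 240 bytes apart. */
                      int b = in[gid * 60];

                      out[gid] = a + b;
                  }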

                  Also, the first of our General Tips is relevant here:
                  "Avoid declaring global arrays on the kernel's stack frame as these typically
                  cannot be allocated in registers and require expensive global memory
                  operations."
                    • array in kernel
                      bubu

                      But, as I indicated, I just need 60 ints, so will that fit in the "256 vec4" case (my kernel currently uses 39 GPRs)? What's the total private memory size per thread / work-group / compute unit?

                      I cannot use local memory, because 16x4 work-items x 60 ints x sizeof(int) is about 15 KB per work-group, so at most 2 wavefronts can run per compute unit (vs. the optimal 24, so kernel speed would suffer), or 2 work-groups vs. the optimal 8 work-groups per compute unit... resulting in 8% and 25% occupancy, respectively.

                      And if I'm forced to use global memory... how do I coalesce it correctly? I use a 128x256 grid with 16x4 work-groups. Should I then allocate a 128x256 x 60 x sizeof(int) global memory buffer? I was really looking for a feature like CUDA's __local, which gives you global memory without having to coalesce the accesses manually.
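
                      In other words, I assume the scratch buffer would have to be laid out interleaved, something like this (a sketch of what I have in mind; "scratch" and the flattened index are just my guess at what coalescing requires):

                      __kernel void mykernel(__global int *scratch /* 128x256x60 ints */)
                      {
                          /* flatten the 2D global ID */
                          size_t gid    = get_global_id(1) * get_global_size(0) + get_global_id(0);
                          size_t stride = get_global_size(0) * get_global_size(1);   /* 128x256 */

                          /* element i of this work-item's 60-int array lives at
                             scratch[i * stride + gid], so consecutive work-items touch
                             consecutive ints and the accesses can coalesce */
                          for (int i = 0; i < 60; ++i)
                              scratch[i * stride + gid] = 0;
                      }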

                        • array in kernel
                          Jawed

                          You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.

                          You need to use SKA to find out the actual GPR allocation of your code.

                          How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.
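
                          For example, something like this (just an illustration of the packing idea, assuming your values really fit in 16 bits; the names are mine, not from your kernel):

                          uint load16(__private const uint *p, int i)
                          {
                              return (p[i >> 1] >> ((i & 1) * 16)) & 0xFFFFu;
                          }

                          void store16(__private uint *p, int i, uint v)
                          {
                              int shift = (i & 1) * 16;
                              p[i >> 1] = (p[i >> 1] & ~(0xFFFFu << shift)) | ((v & 0xFFFFu) << shift);
                          }

                          __kernel void demo(__global uint *out)
                          {
                              uint packed[30];             /* 60 x 16-bit values in 30 ints (~8 GPRs) */
                              for (int w = 0; w < 30; ++w)
                                  packed[w] = 0;
                              for (int i = 0; i < 60; ++i)
                                  store16(packed, i, (uint)i);
                              out[get_global_id(0)] = load16(packed, 59);   /* 59 */
                          }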

                          If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.

                          Depending on scope, the impact might be less than an additional 15 GPRs.

                            • array in kernel
                              bubu

                               

                              Originally posted by: Jawed You want to aim for 6 or more hardware threads per SIMD which is a limit of about 42 GPRs (256/6), as a minimum. This allows pairs of hardware threads to occupy the 3 primary execution units: sequencer (control flow, mem export), ALU and load/TEX simultaneously.

                              Very nice advice, thx!

                               

                               

                              You need to use SKA to find out the actual GPR allocation of your code.

                              I'm lucky: 40.

                               

                               

                              How many bits of each of these 60 ints are you using? e.g. if you're only using 16 bits from each, they'll pack to half size.

                              I'm afraid I'm using all 32 bits.

                               

                              If you allocate another 15 GPRs for these ints then you'll be down to 4 hardware threads (256/54). This should be faster than putting this data in global memory.

                              I'll try to force global memory and see if that helps, thx.

                               

                               

                               

                                • array in kernel
                                  bubu

                                  HD 5750, manual global coalescing: 32-bit: 140s, 64-bit: 240s, 128-bit: 323s

                                  HD 5750, int var[60]: 120s

                                  NVIDIA GT240, int var[60]: 48s

                                  I'm using a 256x256 grid with a 16x8 work-group size, and 40 GPRs with the int v[60] (32 registers on NVIDIA, with 50% occupancy).

                          • array in kernel
                            MicahVillmow
                            Max private size is 64KB per thread, per the IL spec. On how to coalesce global memory, from Section 4.4 of our Programming Guide:
                            "The GPU memory subsystem can coalesce multiple concurrent accesses to
                            global memory, provided the memory addresses increase sequentially across the
                            work-items in the wavefront and start on a 128-byte alignment boundary."