5 Replies Latest reply on Mar 12, 2011 2:43 AM by dmeiser

    Global memory bandwidth ?

    thomasco

      Hello everyone,

      I experimented with OpenCL on a 5870 and measured 118 GB/s of bandwidth copying between two arrays in global memory.

      118 GB/s was the best result, using float4; with scalar 32-bit floats I got 98 GB/s.

      The code is similar to the "float4 vs float1" code in the OpenCL programming guide, just moving one float4 per work item.
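
For readers reproducing this measurement, here is a minimal sketch of how a copy-kernel bandwidth figure is usually derived. It assumes that both the read and the write of each element are counted and that the kernel time comes from event profiling in nanoseconds; the thread does not say how the 118 GB/s figure was computed, and the function name, work-item count and timing below are hypothetical.

```python
FLOAT4_BYTES = 16  # one float4 = four 32-bit floats


def copy_bandwidth_gbs(work_items, elapsed_ns, bytes_per_item=FLOAT4_BYTES):
    """Effective bandwidth of a copy kernel in GB/s (1 GB = 1e9 bytes).

    Assumption: each work item reads one element and writes one element,
    and both transfers count toward the total.
    """
    bytes_moved = 2 * bytes_per_item * work_items
    # Bytes per nanosecond is numerically equal to GB/s.
    return bytes_moved / elapsed_ns


if __name__ == "__main__":
    # Hypothetical run: 16 Mi work items, made-up kernel time of 4.55 ms.
    n = 16 * 1024 * 1024
    print(round(copy_bandwidth_gbs(n, 4_550_000), 1), "GB/s")
```

If only the read or only the write is counted, the reported number halves, so it is worth checking which convention a given benchmark uses before comparing results.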

       

      That's a bit low compared to the peak of 154 GB/s: it is only ~76 %, and I had hoped to see something closer to 130 GB/s. Is this number typical? I'm running on Linux. Does Windows give higher numbers?
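
The percentage quoted above can be checked with a short calculation. This sketch assumes the 5870's published memory configuration (256-bit bus, GDDR5 at an effective 4.8 Gbit/s per pin), which is not stated in the thread; the function names are illustrative only.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps_per_pin):
    """Theoretical peak in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps_per_pin


def efficiency(measured_gbs, peak_gbs):
    """Fraction of the theoretical peak actually achieved."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    # Assumed 5870 specs: 256-bit bus, 4.8 Gbit/s effective per pin.
    peak = peak_bandwidth_gbs(256, 4.8)
    for measured in (118, 98):  # float4 and scalar float results from the post
        print(f"{measured} GB/s is {efficiency(measured, peak):.1%} of {peak:.1f} GB/s")
```

Under these assumptions the peak works out to about 153.6 GB/s, consistent with the 154 GB/s figure used in the post.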

      What can I expect with the latest cards, like the 6970?

      Any idea how I can improve this number?

      Thanks,

        • Global memory bandwidth ?
          himanshu.gautam

          Many parameters affect global memory throughput. The important ones include vectorization, access alignment, channel conflicts, and coalesced reads/writes.

          For details, refer to the "Global Memory Optimization" section of the OpenCL Programming Guide. Also see the GlobalMemoryBandwidth sample from the SDK.
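
As an illustration of the channel-conflict item mentioned above, here is a deliberately simplified model. It assumes 8 memory channels interleaved every 256 bytes, which is my recollection of how the programming guide describes the 5870; treat both constants as assumptions, and note that real hardware also has banks and other effects this sketch ignores.

```python
from collections import Counter

NUM_CHANNELS = 8          # assumed channel count for the 5870
CHANNEL_INTERLEAVE = 256  # assumed bytes served by one channel before switching


def channel_of(byte_address):
    """Channel that serves a byte address under the simplified model."""
    return (byte_address // CHANNEL_INTERLEAVE) % NUM_CHANNELS


def channel_histogram(base, stride_bytes, count):
    """Count how many of `count` strided accesses land on each channel."""
    return Counter(channel_of(base + i * stride_bytes) for i in range(count))


if __name__ == "__main__":
    # A 256-byte stride spreads accesses evenly over all channels.
    print(sorted(channel_histogram(0, 256, 64).items()))
    # A 2048-byte stride (8 * 256) sends every access to the same channel.
    print(sorted(channel_histogram(0, 2048, 64).items()))
```

In this model, a stride that is a multiple of the channel count times the interleave size serializes all accesses on one channel, which is the kind of pattern the guide advises avoiding.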

          • Global memory bandwidth ?
            dmeiser

            I have been working with kernels that are entirely bandwidth bound on Windows and Linux machines with a 5870 and a 6970. The bandwidth you get is similar to what I get, regardless of optimizations. I tried everything systematically (coalescing, work-group sizes, vectorization, etc.), but I can't seem to get more than about 100 GB/s on a 5870.

            I suspect that lower-level programming (IL or assembly) is required to achieve the peak bandwidth.

            Cheers.

              • Global memory bandwidth ?
                thomasco

                 


                Thanks, that's helpful.

                Can you share the numbers you get on the 6970?

                 

                  • Global memory bandwidth ?
                    dmeiser

                     

                    Originally posted by: thomasco Can you share the numbers you get on the 6970?


                    Depending on the kernel, I get between about 80 GB/s and 100 GB/s. Note, however, that these are not pure copy kernels: they also contain a little ALU work and control flow.

                     

                    Cheers

                      • Global memory bandwidth ?
                        dmeiser

                        The GlobalMemoryBandwidth example in the SDK may be helpful here. It implements several types of memory access and measures the bandwidth for each. It appears that the theoretical peak bandwidth of ~160 GB/s (for the 5870) is only achieved if all threads access the same address. For linear memory accesses (i.e. perfect coalescing), that microbenchmark produces about 90 GB/s.

                        Cheers.