6 Replies Latest reply on Aug 22, 2011 10:55 PM by notzed

    Performance Discrepancy between Win7 and Linux

    jholewinski

      I'm hoping an AMD dev can shed some light on a performance discrepancy I am experiencing on an HD5870 between Windows 7 x86_64 and Red Hat Enterprise Linux 6.1 x86_64.  Both machines are up-to-date with fresh installations of Catalyst 11.8.

      With the OpenCL matrix multiplication sample from the AMD APP SDK 2.5, I am getting the following results:

       

      RHEL 6.1 x86_64, Catalyst 11.8:
      $ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
      GFlop/s: 361.148
      $ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q 
      GFlop/s: 459.921
      Win7 x86_64, Catalyst 11.8:
      $ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
      GFlop/s: 321.949
      $ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q 
      GFlop/s: 547.915
      With -i 24, Windows is about 19% faster than Linux (547.915 vs. 459.921 GFlop/s), which seems like a large gap in device performance.  Are there any known performance issues with the Linux drivers (11.8)?  I just want to get some "official" feedback before I spend a lot of time digging deeper into this one.
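For anyone who wants to check the percentage, here is a minimal sketch that recomputes the gap from the four GFlop/s figures quoted above. The dictionary layout and the function name `win7_gain_over_linux` are only illustrative; the numbers are copied from the benchmark output in this post.

```python
# GFlop/s figures copied from the benchmark output above.
results = {
    ("RHEL 6.1", 1): 361.148,
    ("RHEL 6.1", 24): 459.921,
    ("Win7", 1): 321.949,
    ("Win7", 24): 547.915,
}


def win7_gain_over_linux(iterations):
    """Percent by which the Win7 figure exceeds the RHEL figure."""
    linux = results[("RHEL 6.1", iterations)]
    win7 = results[("Win7", iterations)]
    return (win7 - linux) / linux * 100.0


for n in (1, 24):
    print(f"-i {n}: Win7 vs. RHEL {win7_gain_over_linux(n):+.1f}%")
```

With these inputs the gap at -i 24 comes out at roughly +19%, while at -i 1 the sign flips and Linux is ahead by roughly 11%, so the single-iteration numbers should probably not be read as device performance.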


        • Performance Discrepancy between Win7 and Linux
          genaganna

           


           

          Zero-copy buffers are not yet supported on Linux, and that is the cause of this difference.

          Make sure you use a high value for the -i option when you compare performance.

            • Performance Discrepancy between Win7 and Linux
              jholewinski

               


               

              How do zero copy buffers work on non-Fusion hardware?  The copy to device memory still has to occur, so what optimization is being done here?

                • Performance Discrepancy between Win7 and Linux
                  genaganna

                   


                  If you use zero-copy buffers, the computation and the transfer overlap.  Please read chapter 4 of the programming guide.

                    • Performance Discrepancy between Win7 and Linux
                      jholewinski

                       


                       

                      Wait, so computation/mem-transfer overlap is not even supported on Linux?  Wow.

                        • Performance Discrepancy between Win7 and Linux
                          genaganna

                           


                           

                          Transferring data while running a kernel is supported on both Linux and Windows.

                          • Performance Discrepancy between Win7 and Linux
                            notzed

                             


                             

                            Chapter 4 isn't the clearest piece of documentation.

                            I think genaganna means that the GPU accesses CPU memory directly as it computes, i.e. it interleaves computation and memory access.  Although these accesses are much slower than accesses to GPU memory, in certain rather limited cases the overall speed might be higher, since you avoid the batched copies bracketing the kernel.
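To make the "rather limited cases" a bit more concrete, here is a toy cost model. It is a sketch only: the two functions, the perfect-overlap assumption, and every number in it (buffer sizes, link bandwidth, kernel time, the `reuse` factor) are hypothetical and are not measurements from the HD5870 or from either driver.

```python
def explicit_copy_time(bytes_in, bytes_out, kernel_s, link_bytes_per_s):
    """Copy in, run the kernel out of device memory, copy back; no overlap."""
    return bytes_in / link_bytes_per_s + kernel_s + bytes_out / link_bytes_per_s


def zero_copy_time(bytes_in, bytes_out, kernel_s, link_bytes_per_s, reuse=1):
    """Kernel reads host memory directly over the link.

    Each input byte crosses the link `reuse` times. Link traffic and compute
    are assumed to overlap perfectly, so the slower of the two sets the time.
    """
    link_s = (bytes_in * reuse + bytes_out) / link_bytes_per_s
    return max(kernel_s, link_s)


# Hypothetical workload: 1 GB in, 1 GB out, 50 ms of compute, 5 GB/s link.
GB = 1e9
args = (1 * GB, 1 * GB, 0.05, 5 * GB)

batched = explicit_copy_time(*args)
streaming = zero_copy_time(*args, reuse=1)  # every input byte read once
reused = zero_copy_time(*args, reuse=4)     # every input byte read four times

print(f"batched copies:     {batched:.2f} s")
print(f"zero copy, reuse=1: {streaming:.2f} s")
print(f"zero copy, reuse=4: {reused:.2f} s")
```

Under these made-up numbers, zero copy wins only while each input byte is touched about once; as soon as the kernel re-reads its inputs, the slow link dominates and the batched copies come out ahead. A tiled matrix multiply re-reads its inputs many times, which is why it is not obvious that direct host access alone explains the gap in this thread.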