I'm hoping an AMD dev can shed some light on a performance discrepancy I am experiencing on an HD5870 between Windows 7 x86_64 and Red Hat Enterprise Linux 6.1 x86_64. Both machines are up-to-date with fresh installations of Catalyst 11.8.
With the OpenCL matrix multiplication sample from the AMD APP SDK 2.5, I am getting the following results:
RHEL 6.1 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 361.148
$ ./MatrixMultiplication -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 459.921

Win7 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 321.949
$ ./MatrixMultiplication -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 547.915

A ~19% difference in device performance seems a bit high. Are there any known performance issues with the Linux drivers (11.8)? I just want to get some "official" feedback before I spend a lot of time trying to dig deeper into this one.
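For anyone checking the arithmetic, here is a minimal Python sketch that recomputes the gap from the four GFlop/s figures quoted above. The names `results` and `percent_faster` are made up for this illustration and are not part of the SDK sample.

```python
# GFlop/s figures copied from the four runs reported in the post.
# Keys are (platform, value passed to -i).
results = {
    ("RHEL 6.1", 1): 361.148,
    ("RHEL 6.1", 24): 459.921,
    ("Win7", 1): 321.949,
    ("Win7", 24): 547.915,
}


def percent_faster(a, b):
    """Return how much faster throughput a is than throughput b, in percent."""
    return (a / b - 1.0) * 100.0


# With -i 24, Windows comes out ahead of Linux.
win_vs_linux_i24 = percent_faster(results[("Win7", 24)], results[("RHEL 6.1", 24)])

# With -i 1, the ordering is reversed: Linux comes out ahead of Windows.
linux_vs_win_i1 = percent_faster(results[("RHEL 6.1", 1)], results[("Win7", 1)])

print(f"-i 24: Win7 is {win_vs_linux_i24:.1f}% faster than RHEL")
print(f"-i 1:  RHEL is {linux_vs_win_i1:.1f}% faster than Win7")
```

The -i 24 runs give the roughly 19% gap in favour of Windows, while the single-iteration runs go the other way by about 12%, so the iteration count clearly matters when comparing the two platforms.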
Zero copy buffers are not supported on Linux yet, and that is the cause of this difference.
Make sure you use a high value for the -i option when you compare performance.
How do zero copy buffers work on non-Fusion hardware? The copy to device memory still has to occur, so what optimization is being done here?
If you use zero copy buffers, the computation and the transfer overlap. Please go through chapter 4 of the programming guide.
Wait, so computation/mem-transfer overlap is not even supported on Linux? Wow.
Transferring data while a kernel is running is supported on both Linux and Windows.
Chapter 4 isn't the clearest bit of documentation.
I think genaganna means that the GPU accesses CPU memory directly as it computes, i.e. it interleaves computation and memory access. Although these accesses are much slower than accesses to GPU memory, in certain rather limited cases the overall speed might be higher, since you avoid the batched copies bracketing the kernel.
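To make that trade-off concrete, here is a toy cost model in Python. It is only a sketch of the reasoning above, not a description of what the driver actually does. The function names, the perfect-overlap assumption, and every number (bus bandwidth, buffer sizes, kernel time) are hypothetical and not measured on an HD5870.

```python
def batched_time(in_bytes, out_bytes, kernel_s, pcie_bps):
    """Copy inputs to the device, run the kernel out of device memory,
    then copy the results back. The three phases run one after another."""
    return in_bytes / pcie_bps + kernel_s + out_bytes / pcie_bps


def zero_copy_time(in_bytes, out_bytes, kernel_s, pcie_bps, reuse):
    """The kernel reads host memory directly, so every input byte crosses
    the bus `reuse` times. ASSUMPTION: bus traffic and arithmetic overlap
    perfectly, so the slower of the two determines the total time."""
    bus_s = (reuse * in_bytes + out_bytes) / pcie_bps
    return max(kernel_s, bus_s)


# Hypothetical workload: 200 MB in, 100 MB out, 20 ms of pure compute,
# and a 5 GB/s host-to-device link.
IN_BYTES, OUT_BYTES, KERNEL_S, PCIE_BPS = 200e6, 100e6, 0.02, 5e9

batched = batched_time(IN_BYTES, OUT_BYTES, KERNEL_S, PCIE_BPS)
streaming = zero_copy_time(IN_BYTES, OUT_BYTES, KERNEL_S, PCIE_BPS, reuse=1)
heavy_reuse = zero_copy_time(IN_BYTES, OUT_BYTES, KERNEL_S, PCIE_BPS, reuse=16)

print(f"batched copies:        {batched:.3f} s")
print(f"zero copy, reuse = 1:  {streaming:.3f} s")
print(f"zero copy, reuse = 16: {heavy_reuse:.3f} s")
```

In this model, zero copy wins when each input element is touched only once (a streaming kernel), because the copies no longer bracket the kernel. It loses badly once data is re-read many times over the bus, which fits the "rather limited cases" caveat.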