Archives Discussions

jholewinski · 08-18-2011

I'm hoping an AMD dev can shed some light on a performance discrepancy I am experiencing on an HD5870 between Windows 7 x86_64 and Red Hat Enterprise Linux 6.1 x86_64. Both machines are up-to-date with fresh installations of Catalyst 11.8.

With the OpenCL matrix multiplication sample from the AMD APP SDK 2.5, I am getting the following results:

RHEL 6.1 x86_64, Catalyst 11.8:

$ ./MatrixMultiplication -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q

GFlop/s: 361.148

$ ./MatrixMultiplication -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q

GFlop/s: 459.921

Win7 x86_64, Catalyst 11.8:

$ ./MatrixMultiplication -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q

GFlop/s: 321.949

$ ./MatrixMultiplication -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q

GFlop/s: 547.915

A ~19% difference in device performance seems a bit high. Are there any known performance issues with the Linux drivers (11.8)? I just want to get some "official" feedback before I spend a lot of time trying to dig deeper into this one.
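
For reference, the gap can be recomputed directly from the four figures quoted above. This is only a small arithmetic sketch; the dictionary layout and the helper name `percent_faster` are illustrative and not part of the SDK sample.

```python
# GFlop/s figures copied from the post above (HD5870, Catalyst 11.8).
results = {
    1: {"linux": 361.148, "windows": 321.949},
    24: {"linux": 459.921, "windows": 547.915},
}


def percent_faster(a, b):
    """Return how much higher rate a is than rate b, in percent."""
    return (a / b - 1.0) * 100.0


for iterations, rates in results.items():
    gap = percent_faster(rates["windows"], rates["linux"])
    print(f"-i {iterations}: Windows relative to Linux = {gap:+.1f}%")
```

At -i 24 the Windows figure is roughly 19% higher, which matches the number in the post. At -i 1 the sign flips and the Linux figure is the higher one by about 11%, so the direction of the gap depends on the iteration count.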

genaganna · 08-19-2011

Zero-copy buffers are not yet supported on Linux, and that is the cause of this difference.

Make sure you use a high value for the -i option when you compare performance.
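
One way to see why a high -i value matters is a simple two-parameter model. The sketch below assumes (the thread does not confirm this) that the reported GFlop/s is total work divided by total time, and that total time is a fixed one-time cost plus a constant cost per iteration. The function names are made up for illustration.

```python
def fit_overhead_model(rate_a, iters_a, rate_b, iters_b):
    """Fit time(n) = overhead + n * per_iter from two reported rates.

    Time is expressed in seconds per GFlop of a single iteration, so the
    operation count of the kernel cancels out and is not needed.
    """
    time_a = iters_a / rate_a
    time_b = iters_b / rate_b
    per_iter = (time_b - time_a) / (iters_b - iters_a)
    overhead = time_a - iters_a * per_iter
    return overhead, per_iter


def modelled_rate(overhead, per_iter, iters):
    """GFlop/s the model predicts for a given iteration count."""
    return iters / (overhead + iters * per_iter)


# Rates taken from the first post; everything derived from them is a model.
linux = fit_overhead_model(361.148, 1, 459.921, 24)
windows = fit_overhead_model(321.949, 1, 547.915, 24)

for n in (1, 24, 100, 1000):
    print(n, round(modelled_rate(*linux, n), 1), round(modelled_rate(*windows, n), 1))
```

Under this assumed model the fitted one-time cost is larger on Windows than on Linux, which would explain why the Windows figure is lower at -i 1 yet higher at -i 24. The predictions for larger n are extrapolations, not measurements.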

jholewinski · 08-20-2011

How do zero copy buffers work on non-Fusion hardware? The copy to device memory still has to occur, so what optimization is being done here?

genaganna · 08-21-2011

Zero-copy buffers let the computation and the transfer overlap. Please read Chapter 4 of the programming guide.

jholewinski · 08-21-2011

Wait, so computation/mem-transfer overlap is not even supported on Linux? Wow.

genaganna · 08-22-2011

Transferring data while a kernel is running is supported on both Linux and Windows.

notzed · 08-22-2011

Originally posted by: jholewinski
Originally posted by: genaganna

It overlap the computation and transfer if you use zero copy buffers. Please go through the chapter 4 for programming guide.

Wait, so computation/mem-transfer overlap is not even supported on Linux? Wow.

Chapter 4 isn't the clearest bit of documentation.

I think genaganna means that the GPU accesses CPU memory directly as it computes, i.e. it interleaves computation and memory access. Although these accesses are much slower than accesses to GPU memory, in certain rather limited cases the overall speed might be higher, since you avoid the batched copies bracketing the kernel.
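
The trade-off described here can be sketched with a rough bandwidth-only cost model. Everything in it is an assumption for illustration: the buffer size and the two bandwidth figures are hypothetical round numbers rather than measurements from this thread, and the model ignores compute time, latency, and the copy of results back to the host.

```python
def copy_then_compute(nbytes, passes, bus_bw, gpu_bw):
    """Copy the buffer to device memory once, then read it `passes` times."""
    return nbytes / bus_bw + passes * nbytes / gpu_bw


def zero_copy(nbytes, passes, bus_bw):
    """Read host memory over the bus on every pass, with no up-front copy."""
    return passes * nbytes / bus_bw


# Hypothetical round numbers, not measurements.
NBYTES = 64e6   # buffer size in bytes
BUS_BW = 6e9    # host-to-GPU bus bandwidth, bytes/s
GPU_BW = 150e9  # device memory bandwidth, bytes/s

for passes in (1, 2, 8):
    batched = copy_then_compute(NBYTES, passes, BUS_BW, GPU_BW)
    direct = zero_copy(NBYTES, passes, BUS_BW)
    winner = "zero copy" if direct < batched else "copy first"
    print(f"passes={passes}: copy first {batched * 1e3:.2f} ms, "
          f"zero copy {direct * 1e3:.2f} ms -> {winner}")
```

With these numbers direct access only wins when each byte is read once. As soon as the kernel re-reads its input, the up-front copy pays for itself, which fits the "rather limited cases" caveat.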

Performance Discrepancy between Win7 and Linux