jholewinski
Journeyman III

Performance Discrepancy between Win7 and Linux

I'm hoping an AMD dev can shed some light on a performance discrepancy I am experiencing on an HD5870 between Windows 7 x86_64 and Red Hat Enterprise Linux 6.1 x86_64.  Both machines are up-to-date with fresh installations of Catalyst 11.8.

With the OpenCL matrix multiplication sample from the AMD APP SDK 2.5, I am getting the following results:

RHEL 6.1 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 361.148
$ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q 
GFlop/s: 459.921
Win7 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 321.949
$ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q 
GFlop/s: 547.915
A ~19% difference in device performance seems a bit high.  Are there any known performance issues with the Linux drivers (11.8)?  I just want to get some "official" feedback before I spend a lot of time trying to dig deeper into this one.
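For anyone checking the arithmetic, the ~19% figure comes from the two 24-iteration runs. The snippet below is only a small sketch that recomputes it from the numbers quoted above; the dictionary layout and the helper name `percent_faster` are made up for illustration and are not part of the SDK sample.

```python
# GFlop/s figures copied from the runs quoted above (HD5870, Catalyst 11.8).
results = {
    ("linux", 1): 361.148,
    ("linux", 24): 459.921,
    ("win7", 1): 321.949,
    ("win7", 24): 547.915,
}


def percent_faster(a, b):
    """Return how much faster rate a is than rate b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


# Hypothetical variable names; the inputs are the measured values above.
win_vs_linux_24 = percent_faster(results[("win7", 24)], results[("linux", 24)])
linux_vs_win_1 = percent_faster(results[("linux", 1)], results[("win7", 1)])

print(f"-i 24: Win7 is {win_vs_linux_24:.1f}% faster than Linux")
print(f"-i 1:  Linux is {linux_vs_win_1:.1f}% faster than Win7")
```

Note that the ordering flips between the two settings: Linux is ahead with a single iteration, while Win7 is ahead with 24.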


6 Replies
genaganna
Journeyman III

Zero-copy buffers are not yet supported on Linux, and that accounts for this difference.

Make sure you use a high value for the -i option when comparing performance.
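To see why a high -i value matters, here is a rough sketch that splits the quoted timings into a one-time cost and a per-iteration cost. It rests on assumptions that are not confirmed anywhere in this thread: that one run performs 2*N^3 floating-point operations, that the sample reports i * flops / total time, and that total time is a fixed cost plus i identical iterations. The function name is hypothetical.

```python
N = 4000
# ASSUMPTION: conventional 2*N^3 operation count for an N x N x N multiply.
FLOPS_PER_RUN = 2 * N**3


def split_fixed_and_steady(gflops_1, gflops_n, n):
    """Fit total(i) = fixed + i * per_iter to two (i, GFlop/s) measurements.

    ASSUMPTION: the reported rate equals i * FLOPS_PER_RUN / total(i).
    Returns (fixed seconds, per-iteration seconds, steady-state GFlop/s).
    """
    total_1 = FLOPS_PER_RUN / (gflops_1 * 1e9)
    total_n = n * FLOPS_PER_RUN / (gflops_n * 1e9)
    per_iter = (total_n - total_1) / (n - 1)
    fixed = total_1 - per_iter
    steady_gflops = FLOPS_PER_RUN / per_iter / 1e9
    return fixed, per_iter, steady_gflops


# Measured values from the original post, -i 1 and -i 24.
estimates = {}
for name, g1, g24 in [("linux", 361.148, 459.921), ("win7", 321.949, 547.915)]:
    estimates[name] = split_fixed_and_steady(g1, g24, 24)
    fixed, per_iter, steady = estimates[name]
    print(f"{name}: fixed ~{fixed:.3f} s, per-iteration ~{per_iter:.3f} s, "
          f"steady-state ~{steady:.0f} GFlop/s")
```

Under this model the fixed cost dominates a single-iteration run, so -i 1 mostly measures startup rather than the kernel. If the sample times things differently, the split will not hold.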

Originally posted by: genaganna

Zero-copy buffers are not yet supported on Linux, and that accounts for this difference.

Make sure you use a high value for the -i option when comparing performance.

How do zero copy buffers work on non-Fusion hardware?  The copy to device memory still has to occur, so what optimization is being done here?


Originally posted by: jholewinski
How do zero copy buffers work on non-Fusion hardware?  The copy to device memory still has to occur, so what optimization is being done here?

Zero-copy buffers let the computation and the transfer overlap. Please read Chapter 4 of the programming guide.


Originally posted by: genaganna

Zero-copy buffers let the computation and the transfer overlap. Please read Chapter 4 of the programming guide.

Wait, so computation/mem-transfer overlap is not even supported on Linux?  Wow.


Originally posted by: jholewinski
Wait, so computation/mem-transfer overlap is not even supported on Linux?  Wow.

 

Transferring data while a kernel is running is supported on both Linux and Windows.


Originally posted by: jholewinski
Wait, so computation/mem-transfer overlap is not even supported on Linux?  Wow.

 

Chapter 4 isn't the clearest piece of documentation.

I think genaganna means that the GPU accesses CPU memory directly as it computes - i.e. interleaving computation and memory access. Although such access is much slower than access to GPU memory, in certain rather limited cases the overall speed might be higher, since you avoid the batched copies bracketing the kernel.
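A toy cost model may make the trade-off easier to see. This is only a sketch under loud assumptions: every number below is hypothetical, caches are ignored, overlap is treated as perfect, and zero-copy reads are assumed to cross the bus every time an element is touched. None of it is measured on an HD5870, and the function names are invented for illustration.

```python
MB = 1 << 20


def explicit_copy_time(bytes_in, bytes_out, kernel_s, bus_gb_per_s):
    """Copy inputs to the device, run the kernel, copy results back (serialized)."""
    copies_s = (bytes_in + bytes_out) / (bus_gb_per_s * 1e9)
    return copies_s + kernel_s


def zero_copy_time(bytes_in, bytes_out, kernel_s, bus_gb_per_s, reads_per_element):
    """Kernel touches host memory directly while it computes.

    ASSUMPTION: each read of an input element crosses the bus again, and bus
    traffic overlaps perfectly with compute, so the slower of the two wins.
    """
    bus_s = (bytes_in * reads_per_element + bytes_out) / (bus_gb_per_s * 1e9)
    return max(bus_s, kernel_s)


# Hypothetical workload: 256 MB in, 256 MB out, 10 ms of pure compute, 5 GB/s bus.
args = (256 * MB, 256 * MB, 0.010, 5.0)

explicit_s = explicit_copy_time(*args)
streaming_zero_s = zero_copy_time(*args, reads_per_element=1)   # each input read once
reuse_zero_s = zero_copy_time(*args, reads_per_element=16)      # each input read 16 times

print(f"explicit copies:        {explicit_s:.3f} s")
print(f"zero copy, streaming:   {streaming_zero_s:.3f} s")
print(f"zero copy, heavy reuse: {reuse_zero_s:.3f} s")
```

In this model zero copy only comes out ahead when each element is touched about once; as soon as the kernel rereads its inputs, the repeated bus traffic outweighs the saved copies, which matches the "rather limited cases" caveat.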
