OpenCL

jholewinski
Journeyman III

Performance Discrepancy between Win7 and Linux

I'm hoping an AMD dev can shed some light on a performance discrepancy I am experiencing on an HD5870 between Windows 7 x86_64 and Red Hat Enterprise Linux 6.1 x86_64.  Both machines are up-to-date with fresh installations of Catalyst 11.8.

With the OpenCL matrix multiplication sample from the AMD APP SDK 2.5, I am getting the following results:

RHEL 6.1 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 361.148
$ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 459.921

Win7 x86_64, Catalyst 11.8:
$ ./MatrixMultiplication  -i 1 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 321.949
$ ./MatrixMultiplication  -i 24 -t --eAppGflops -x 4000 -y 4000 -z 4000 -b 16 -q
GFlop/s: 547.915

A ~19% difference in device performance seems a bit high.  Are there any known performance issues with the Linux drivers (11.8)?  I just want to get some "official" feedback before I spend a lot of time trying to dig deeper into this one.
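
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the gap from the four reported figures. The only thing not taken from the runs above is the assumption that one dense 4000 x 4000 x 4000 multiply costs about 2*N^3 floating-point operations; the helper name `percent_faster` is made up for illustration.

```python
# Figures copied from the runs above (GFlop/s as reported by the sample).
linux = {1: 361.148, 24: 459.921}
win7 = {1: 321.949, 24: 547.915}


def percent_faster(a, b):
    """How much faster rate a is than rate b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


# Assumption: a dense N x N x N multiply costs about 2*N^3 flop.
N = 4000
flop_per_multiply = 2 * N**3

# Relative gap at each iteration count.
for iters in (1, 24):
    gap = percent_faster(win7[iters], linux[iters])
    print(f"-i {iters}: Win7 relative to Linux: {gap:+.1f}%")

# Implied average time per multiply at -i 24, under the 2*N^3 assumption.
for name, rates in (("Linux", linux), ("Win7", win7)):
    seconds = flop_per_multiply / (rates[24] * 1e9)
    print(f"{name}: roughly {seconds:.3f} s per multiply at -i 24")
```

Note that the sign of the gap flips between the two settings: Linux is ahead at -i 1 and Win7 is ahead at -i 24, so the ~19% figure refers to the -i 24 runs only.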


6 Replies
genaganna
Journeyman III


Zero-copy buffers are not yet supported on Linux. That is the cause of this difference.

Make sure you use a high value for the -i option when you compare performance.
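
To illustrate why the -i value matters, here is a rough sketch that fits a two-parameter model to the figures from the original post. It assumes (this is a guess, not something the sample documents here) that the timed region is a fixed one-time cost plus a constant cost per iteration, and that one multiply is about 2*N^3 flop. The function names `fit` and `reported_rate` are hypothetical.

```python
# Assumption: time for i iterations = fixed_cost + i * per_iter.
# Assumption: one 4000 x 4000 x 4000 multiply is about 2*N^3 flop.
FLOP = 2 * 4000**3


def fit(rate_1, rate_24):
    """Solve the model from the -i 1 and -i 24 rates (GFlop/s).

    Returns (fixed_cost, per_iter) in seconds.
    """
    total_1 = FLOP / (rate_1 * 1e9)         # total time of the -i 1 run
    total_24 = 24 * FLOP / (rate_24 * 1e9)  # total time of the -i 24 run
    per_iter = (total_24 - total_1) / 23
    fixed_cost = total_1 - per_iter
    return fixed_cost, per_iter


def reported_rate(fixed_cost, per_iter, iters):
    """GFlop/s the model predicts for a given iteration count."""
    return iters * FLOP / (fixed_cost + iters * per_iter) / 1e9


# Rates taken from the original post.
for name, r1, r24 in (("Linux", 361.148, 459.921), ("Win7", 321.949, 547.915)):
    fixed_cost, per_iter = fit(r1, r24)
    for iters in (1, 24, 100, 1000):
        rate = reported_rate(fixed_cost, per_iter, iters)
        print(f"{name} -i {iters}: {rate:.1f} GFlop/s (model)")
```

If the model is even roughly right, the reported rate keeps rising with -i until the fixed cost is negligible, so low -i values mostly compare setup overhead rather than steady-state device throughput.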

jholewinski
Journeyman III


How do zero copy buffers work on non-Fusion hardware?  The copy to device memory still has to occur, so what optimization is being done here?

genaganna
Journeyman III


If you use zero-copy buffers, the computation and the transfer overlap.  Please read chapter 4 of the programming guide.

jholewinski
Journeyman III


Wait, so computation/mem-transfer overlap is not even supported on Linux?  Wow.

genaganna
Journeyman III


Transferring data while a kernel is running is supported on both Linux and Windows.

notzed
Challenger


Chapter 4 isn't the clearest bit of documentation.

I think genaganna means that the GPU accesses the CPU memory directly as it computes, i.e. it interleaves computing and memory access.  Although those accesses are much slower than accesses to GPU memory, in certain rather limited cases the overall speed might be higher, since you avoid the batched copies bracketing the kernel.
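
A toy cost model of that trade-off, as a sketch only. It assumes perfect overlap of compute and host-memory traffic in the zero-copy case and no caching of host data on the device; every number below (link bandwidth, buffer sizes, kernel time, reuse factor) is hypothetical and chosen purely for illustration, not measured on any hardware.

```python
def staged_time(bytes_in, bytes_out, kernel_s, link_gb_per_s):
    """Copy inputs to the device, run the kernel, copy results back."""
    bw = link_gb_per_s * 1e9
    return bytes_in / bw + kernel_s + bytes_out / bw


def zero_copy_time(bytes_in, bytes_out, kernel_s, link_gb_per_s, reuse):
    """Kernel reads host memory directly while it computes.

    Assumption: each input byte crosses the link `reuse` times (no caching),
    and link traffic overlaps perfectly with computation.
    """
    bw = link_gb_per_s * 1e9
    traffic_s = (bytes_in * reuse + bytes_out) / bw
    return max(kernel_s, traffic_s)


# Hypothetical workload: 256 MB in, 256 MB out, 50 ms of pure compute, 5 GB/s link.
args = dict(bytes_in=256e6, bytes_out=256e6, kernel_s=0.05, link_gb_per_s=5.0)

print("staged copies:       ", staged_time(**args))
print("zero-copy, reuse = 1:", zero_copy_time(**args, reuse=1))
print("zero-copy, reuse = 16:", zero_copy_time(**args, reuse=16))
```

In this model a streaming kernel that touches each input once comes out ahead with zero-copy, while a kernel that re-reads its inputs many times (as a naive matrix multiply does) loses badly, which is one way to read "certain rather limited cases".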
