gahwtf
Journeyman III

ATI vs. NVIDIA performance issues with kernel invocations, large overhead

I've been encountering problems where a large overhead is caused by multiple kernel invocations on ATI hardware.

I have searched several forums for this problem but have not found anything similar, though I may not have searched well enough.


I am currently developing software that processes raw ultrasound radio-frequency data into a visible image. Processing such an image requires a number of steps, so several kernels need to run for each frame.
So far this has worked well: I can process raw data on an NVIDIA 9600GT at ~333 fps. I chose OpenCL so that different hardware is supported.


I have fully optimized the code following several best-practices guides, covering data transport, shared memory, coalesced memory access, etc., and I'm sure I can't squeeze any extra performance out of it.

Now that extensive testing has been successful on NVIDIA hardware, I wanted to try it on an ATI 5750. I use both profiling and CPU timers to measure processing time and any other overhead calls.
What struck me is that a large overhead appears when using ATI hardware. Here are some numbers:

NVIDIA 9600GT:

  • (mem) Frame copy to pinned memory in 225 us
  • (mem) Pinned to device copy in 213 us
  • (ker) Short to float conversion in 494 us
  • (mem) Device to device copy in 51 us
  • (ker) Scan Sum of data pass 1 in 279 us
  • (ker) Scan Sum of data pass 2 in 23 us
  • (ker) Scan Sum of data pass 3 in 16 us
  • (mem) Read sum of data (1 Float) in 1 us
  • (ker) DC removal in 63 us
  • (ker) Hilbert Transformation in 674 us
  • (ker) Envelope Detection in 92 us
  • (mem) Device to device copy in 50 us
  • (ker) Scan Max element pass 1 in 290 us
  • (ker) Scan Max element pass 2 in 23 us
  • (ker) Scan Max element pass 3 in 16 us
  • (mem) Read Max element (1 Float) in 1 us
  • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 68 us


Total GPU Processing Time 2354 us
Frame processing time with overhead: 3217.53 us

Each (ker) item in the list is a kernel with its execution time. The total GPU processing time is the sum of the listed items, not counting the frame copy to pinned memory. A CPU timer surrounds the entire process to capture the overhead of the calls; on NVIDIA this is about 1 ms of extra time.

ATI 5750:

  • (mem) Frame copy to pinned memory in 267 us
  • (mem) Pinned to device copy in 81 us
  • (ker) Short to float conversion in 477 us
  • (mem) Device to device copy in 45 us
  • (ker) Scan Sum of data pass 1 in 143 us
  • (ker) Scan Sum of data pass 2 in 10 us
  • (ker) Scan Sum of data pass 3 in 5 us
  • (mem) Read sum of data (1 Float) in 199 us
  • (ker) DC removal in 56 us
  • (ker) Hilbert Transformation in 1008 us
  • (ker) Envelope Detection in 58 us
  • (mem) Device to device copy in 44 us
  • (ker) Scan Max element pass 1 in 171 us
  • (ker) Scan Max element pass 2 in 12 us
  • (ker) Scan Max element pass 3 in 6 us
  • (mem) Read Max element (1 Float) in 199 us
  • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 76 us


Total GPU Processing Time 2590 us
Frame processing time with overhead: 6440.24 us

The overhead on ATI hardware is about 3.8 ms, more than three times the roughly 1 ms on NVIDIA. I've disabled some kernels and noticed that each extra kernel adds a large amount of overhead time. I believe calling multiple kernels on ATI hardware carries a massive amount of extra overhead.
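As a rough check of the figures above, the overhead can be recomputed as the wall-clock frame time minus the total GPU processing time. This is only a sketch that takes the posted numbers at face value; the function names `overhead_us` and `frames_per_second` are hypothetical helpers, not part of the original code.

```python
# Figures copied from the post, in microseconds.
gpu_total_us = {"NVIDIA 9600GT": 2354.0, "ATI 5750": 2590.0}
frame_total_us = {"NVIDIA 9600GT": 3217.53, "ATI 5750": 6440.24}


def overhead_us(device):
    """Wall-clock frame time minus the summed GPU profiling time."""
    return frame_total_us[device] - gpu_total_us[device]


def frames_per_second(device):
    """Frame rate implied by the wall-clock frame time."""
    return 1_000_000.0 / frame_total_us[device]


for device in gpu_total_us:
    print(f"{device}: overhead {overhead_us(device):.2f} us, "
          f"{frames_per_second(device):.1f} fps")

ratio = overhead_us("ATI 5750") / overhead_us("NVIDIA 9600GT")
print(f"ATI/NVIDIA overhead ratio: {ratio:.2f}")
```

With the exact values, the NVIDIA overhead comes out slightly under 1 ms, so the ratio is somewhat higher than three.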

Any thoughts on this problem?


I have also noticed that reading 1 float from GPU memory (clEnqueueReadBuffer) takes 199 us on ATI and 1 us on NVIDIA. I also ran into a problem where the ATI GPU would simply crash when there was a barrier(CLK_LOCAL_MEM_FENCE) inside a loop.
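To put the single-float reads in perspective, the sketch below sums the per-step times from the two lists and separates out the two readbacks. It uses only the numbers posted above; the helper `readback_total` and the index constant are hypothetical names introduced for illustration.

```python
# Per-step times in microseconds, copied from the two lists in the post.
# The first entry of each list (frame copy to pinned memory) is left out,
# because the posted GPU totals only match the sums without it.
nvidia_us = [213, 494, 51, 279, 23, 16, 1, 63, 674, 92, 50, 290, 23, 16, 1, 68]
ati_us = [81, 477, 45, 143, 10, 5, 199, 56, 1008, 58, 44, 171, 12, 6, 199, 76]

# Positions of the two single-float reads (sum and max) in the lists above.
READBACK_INDICES = (6, 14)


def readback_total(times_us):
    """Time spent in the two single-float reads."""
    return sum(times_us[i] for i in READBACK_INDICES)


for name, times in (("NVIDIA 9600GT", nvidia_us), ("ATI 5750", ati_us)):
    total = sum(times)
    reads = readback_total(times)
    print(f"{name}: total {total} us, readbacks {reads} us, "
          f"everything else {total - reads} us")
```

On these numbers the two reads account for 398 us per frame on ATI, and the remaining steps together take less time on the ATI card than on the NVIDIA card.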

 

5 Replies
Raistmer
Adept II

Do you distinguish between startup overhead and processing-loop overhead?
Longer times with more kernels involved may come from (considerably) longer compilation times for a big CL file.
I see this in my own app: even if a kernel is not used but is present in the CL file, execution time increases noticeably.

Raistmer,

These numbers are from the processing loop. The timers enclose only the kernel invocation calls, and there is a distinct difference on ATI hardware when a kernel is not run: the overhead decreases significantly.

 

Could it be that ATI's OpenCL support is at an "earlier" stage of development, and that this is causing the overhead?


gahwtf, 

The current implementation is not optimized for performance. You can expect this to change in upcoming releases.

 


omkaranthan,

OK, that explains a lot. Do you have any idea how often these releases happen?


Roughly 3 months.
