
ATI vs. NVIDIA performance issues with kernel invocations, large overhead

Discussion created by gahwtf on May 25, 2010
Latest reply on May 27, 2010 by nou
I've been encountering problems where a large overhead is caused by multiple kernel invocations on ATI hardware.

I have searched various forums for this issue but haven't found a similar report; then again, maybe I haven't been searching in the right places.


I am currently developing ultrasound processing that turns raw radio-frequency data into a visible image. Producing one image takes a number of processing steps, so several kernels have to run for every frame (a rough sketch of the per-frame structure follows below).
I have been successful so far: on an NVIDIA 9600GT I can process the raw data at ~333 fps. I chose OpenCL so that different hardware is supported.
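For context, the host side of one frame looks roughly like this. The function, kernel and work-size names are placeholders rather than my actual identifiers, and argument setup is omitted; the point is just that every stage is its own clEnqueueNDRangeKernel call into one in-order command queue.

#include <CL/cl.h>

/* Illustrative per-frame structure only; the real code sets kernel arguments
 * and work sizes per stage.  The in-order command queue keeps the stages
 * ordered without explicit events. */
void process_frame(cl_command_queue queue,
                   cl_kernel short_to_float, cl_kernel dc_removal,
                   cl_kernel hilbert, cl_kernel envelope, cl_kernel tone_map,
                   size_t global, size_t local)
{
    clEnqueueNDRangeKernel(queue, short_to_float, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, dc_removal,     1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, hilbert,        1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, envelope,       1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, tone_map,       1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);   /* one synchronisation point at the end of the frame */
}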


I have optimized the code thoroughly against several best-practices guides, from host-to-device data transport to local (shared) memory usage and coalesced memory access, and I'm fairly sure there is no extra performance left to squeeze out. (A tiny illustration of the access pattern I mean follows below.)
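This is not one of my actual kernels, just an illustration of the pattern: consecutive work-items reading consecutive addresses, staged through local memory.

/* Illustrative only: coalesced global reads staged through local memory. */
__kernel void stage_example(__global const float *in,
                            __global float *out,
                            __local  float *tile)      /* sized to the work-group by the host */
{
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    tile[lid] = in[gid];               /* neighbouring work-items read neighbouring addresses */
    barrier(CLK_LOCAL_MEM_FENCE);      /* make the tile visible to the whole work-group */

    out[gid] = tile[lid] * 2.0f;       /* placeholder for the real per-stage maths */
}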

Now that extensive testing on NVIDIA hardware has been successful, I wanted to try the code on an ATI 5750. I use both OpenCL profiling and CPU timers to clock the processing time and any other overhead from the host-side calls.
What struck me is that a much larger overhead appears on the ATI hardware. Here are some numbers:

NVIDIA 9600GT:

 

  • (mem) Frame copy to pinned memory in 225 us
  • (mem) Pinned to device copy in 213 us
  • (ker) Short to float conversion in 494 us
  • (mem) Device to device copy in 51 us
  • (ker) Scan Sum of data pass 1 in 279 us
  • (ker) Scan Sum of data pass 2 in 23 us
  • (ker) Scan Sum of data pass 3 in 16 us
  • (mem) Read sum of data (1 Float) in 1 us
  • (ker) DC removal in 63 us
  • (ker) Hilbert Transformation in 674 us
  • (ker) Envelope Detection in 92 us
  • (mem) Device to device copy in 50 us
  • (ker) Scan Max element pass 1 in 290 us
  • (ker) Scan Max element pass 2 in 23 us
  • (ker) Scan Max element pass 3 in 16 us
  • (mem) Read Max element (1 Float) in 1 us
  • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 68 us


Total GPU processing time: 2354 us
Frame processing time with overhead: 3217.53 us

Each (ker) item in the list is a kernel and its execution time; the total GPU processing time is the sum of all the items. A CPU timer surrounds the entire process to capture the overhead of the host-side calls; on NVIDIA this adds about 1 ms. (A sketch of how I take these two measurements follows below.)
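Roughly how the two numbers are measured (a sketch of the approach, not my exact code): the per-item times come from OpenCL event profiling on a queue created with CL_QUEUE_PROFILING_ENABLE, and the "with overhead" figure is a host wall-clock timer wrapped around all the enqueues for one frame plus the final clFinish.

#include <CL/cl.h>

/* Per-kernel timing via event profiling; the queue must have been created
 * with the CL_QUEUE_PROFILING_ENABLE property for the counters to be valid. */
double kernel_time_us(cl_command_queue queue, cl_kernel kernel,
                      size_t global, size_t local)
{
    cl_event ev;
    cl_ulong start, end;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
    clReleaseEvent(ev);
    return (double)(end - start) * 1e-3;   /* profiling counters are in nanoseconds */
}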

ATI 5750:

 

  • (mem) Frame copy to pinned memory in 267 us
  • (mem) Pinned to device copy in 81 us
  • (ker) Short to float conversion in 477 us
  • (mem) Device to device copy in 45 us
  • (ker) Scan Sum of data pass 1 in 143 us
  • (ker) Scan Sum of data pass 2 in 10 us
  • (ker) Scan Sum of data pass 3 in 5 us
  • (mem) Read sum of data (1 Float) in 199 us
  • (ker) DC removal in 56 us
  • (ker) Hilbert Transformation in 1008 us
  • (ker) Envelope Detection in 58 us
  • (mem) Device to device copy in 44 us
  • (ker) Scan Max element pass 1 in 171 us
  • (ker) Scan Max element pass 2 in 12 us
  • (ker) Scan Max element pass 3 in 6 us
  • (mem) Read Max element (1 Float) in 199 us
  • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 76 us


Total GPU processing time: 2590 us
Frame processing time with overhead: 6440.24 us

The overhead on the ATI hardware is about 3.85 ms, more than four times the ~0.86 ms I measure on NVIDIA. I have disabled some kernels and noticed that each additional kernel adds a noticeable chunk of overhead, so I believe there is a large per-invocation cost when calling multiple kernels on ATI hardware.
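To isolate this, the kind of probe I have in mind is the micro-benchmark sketched below (a hypothetical helper; 'queue' and a trivial 'kernel' are assumed to exist already): enqueue the same kernel N times and compare flushing once per batch against synchronising after every enqueue, which approximates the worst case.

#include <CL/cl.h>
#include <time.h>

/* Average host-side cost per kernel invocation.  The kernel itself should do
 * negligible work so the measurement is dominated by launch overhead. */
double launch_cost_us(cl_command_queue queue, cl_kernel kernel,
                      size_t global, int n, int finish_each)
{
    struct timespec t0, t1;
    clFinish(queue);                                   /* drain any pending work first */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        if (finish_each)
            clFinish(queue);                           /* worst case: sync after every kernel */
    }
    clFinish(queue);                                   /* include all remaining GPU work */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
    return us / n;
}

If the finish_each variant is far more expensive on the 5750 than on the 9600GT, that would point at per-launch and synchronisation cost in the driver rather than at the kernels themselves.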

Any thoughts on this problem?


I have also noticed that reading a single float from GPU memory with clEnqueueReadBuffer takes 199 us on ATI versus 1 us on NVIDIA. In addition, I ran into problems where the ATI GPU would simply crash when a barrier(CLK_LOCAL_MEM_FENCE) was placed inside a loop. Sketches of both points follow below.
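For the single-float read, the call is essentially the blocking pattern below, so the 199 us presumably includes a full host-device synchronisation per call. One workaround I am considering (an assumption on my part, not something I have verified) is to leave the scalar in a device buffer and pass that buffer to the next kernel instead of reading it back every frame.

#include <CL/cl.h>

/* Sketch of the current 1-float readback.  CL_TRUE makes the call block until
 * the copy has completed, so the host pays the full round-trip cost. */
float read_scalar(cl_command_queue queue, cl_mem buf)
{
    float value = 0.0f;
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(float), &value, 0, NULL, NULL);
    return value;
}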
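On the barrier crash: as I understand the OpenCL spec, barrier() inside a loop is only well defined if every work-item in the work-group executes the barrier the same number of times, so a work-item-dependent loop bound can hang or crash the GPU. A minimal illustration with hypothetical kernels (not my code):

/* Unsafe: the trip count depends on the work-item, so the work-items do not
 * all reach the barrier the same number of times -> undefined behaviour. */
__kernel void unsafe_loop(__global float *data, __local float *tmp)
{
    for (uint i = 0; i < get_local_id(0); ++i) {
        tmp[get_local_id(0)] = data[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

/* Safe: every work-item runs the same number of iterations and only the work
 * inside each iteration is predicated. */
__kernel void safe_loop(__global float *data, __local float *tmp, uint n)
{
    for (uint i = 0; i < n; ++i) {
        if (get_local_id(0) >= i)
            tmp[get_local_id(0)] = data[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Whether that is actually what happens in my kernels I can't say for certain, but it is the first thing I plan to check.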

 
