
    ATI vs. NVIDIA performance issues with kernel invocations, large overhead

    gahwtf
      I've been running into a problem where multiple kernel invocations cause a large amount of overhead on ATI hardware.

      I have searched various forums for this problem but haven't found anything similar; then again, maybe I haven't been searching in the right places.


      I am currently developing ultrasound processing software that turns raw radio-frequency data into a visible image. Producing one image requires a number of processing steps, so several kernels have to be run for each frame.
      So far this has been successful: I can process the raw data on an NVIDIA 9600GT at ~333 fps. I chose OpenCL so that different hardware is supported.


      I have thoroughly optimized the code against several best-practices guides, covering data transfers, shared (local) memory, coalesced memory access, and so on, and I'm fairly sure I can't squeeze any extra performance out of it.

      Now that extensive testing on NVIDIA hardware has been successful, I wanted to try it on an ATI 5750. I use both profiling and CPU timers to clock the processing time and any other overhead from the calls.
      What struck me is that a much larger overhead appears when using ATI hardware. Here are some numbers:

      NVIDIA 9600GT:


      • (mem) Frame copy to pinned memory in 225 us
      • (mem) Pinned to device copy in 213 us
      • (ker) Short to float conversion in 494 us
      • (mem) Device to device copy in 51 us
      • (ker) Scan Sum of data pass 1 in 279 us
      • (ker) Scan Sum of data pass 2 in 23 us
      • (ker) Scan Sum of data pass 3 in 16 us
      • (mem) Read sum of data (1 Float) in 1 us
      • (ker) DC removal in 63 us
      • (ker) Hilbert Transformation in 674 us
      • (ker) Envelope Detection in 92 us
      • (mem) Device to device copy in 50 us
      • (ker) Scan Max element pass 1 in 290 us
      • (ker) Scan Max element pass 2 in 23 us
      • (ker) Scan Max element pass 3 in 16 us
      • (mem) Read Max element (1 Float) in 1 us
      • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 68 us


      Total GPU Processing Time: 2354 us
      Frame processing time with overhead: 3217.53 us

      Each (ker) item in the list is a kernel and its execution time. The total GPU processing time is all of the items added up. A CPU timer surrounds the entire process to capture the overhead of the calls; on NVIDIA that adds up to about 1 ms of extra time per frame.
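
      For reference, each per-item time is taken from profiling roughly along the lines of the sketch below (simplified; the helper name and the blocking clWaitForEvents after every kernel are just for illustration), and the frame time with overhead is the CPU timer wrapped around the whole sequence:

      /* Simplified sketch of timing one (ker) item with event profiling; the
         queue must be created with CL_QUEUE_PROFILING_ENABLE. */
      #include <CL/cl.h>

      double time_kernel_us(cl_command_queue queue, cl_kernel kernel,
                            size_t global, size_t local)
      {
          cl_event evt;
          cl_ulong start, end;

          clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                 0, NULL, &evt);
          clWaitForEvents(1, &evt);            /* block until this kernel is done */

          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                  sizeof(start), &start, NULL);
          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                  sizeof(end), &end, NULL);
          clReleaseEvent(evt);

          return (double)(end - start) * 1e-3; /* profiling counters are in ns */
      }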

      ATI 5750:


      • (mem) Frame copy to pinned memory in 267 us
      • (mem) Pinned to device copy in 81 us
      • (ker) Short to float conversion in 477 us
      • (mem) Device to device copy in 45 us
      • (ker) Scan Sum of data pass 1 in 143 us
      • (ker) Scan Sum of data pass 2 in 10 us
      • (ker) Scan Sum of data pass 3 in 5 us
      • (mem) Read sum of data (1 Float) in 199 us
      • (ker) DC removal in 56 us
      • (ker) Hilbert Transformation in 1008 us
      • (ker) Envelope Detection in 58 us
      • (mem) Device to device copy in 44 us
      • (ker) Scan Max element pass 1 in 171 us
      • (ker) Scan Max element pass 2 in 12 us
      • (ker) Scan Max element pass 3 in 6 us
      • (mem) Read Max element (1 Float) in 199 us
      • (ker) Normalisation, Threshold, Log Compression, Tone mapping in 76 us


      Total GPU Processing Time: 2590 us
      Frame processing time with overhead: 6440.24 us

      The overhead on ATI hardware comes to about 3.8 ms, nearly four times the ~1 ms I measure on NVIDIA. I disabled some kernels and noticed that each additional kernel adds a large chunk of overhead time, so I believe there is a substantial per-invocation cost when launching multiple kernels on ATI hardware.
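
      A minimal way to pin down the per-launch cost would be a micro-benchmark along these lines (only a sketch; the empty kernel, the POSIX timer and the function name are made up for illustration). Comparing the number it prints on the two cards should show whether the extra ~3 ms really is per-enqueue overhead:

      /* Sketch of a launch-overhead micro-benchmark: enqueue a do-nothing
         kernel n times, wait once, and divide the wall-clock time by n. */
      #include <CL/cl.h>
      #include <stdio.h>
      #include <sys/time.h>

      static const char *nop_src = "__kernel void nop(void) { }";

      void measure_launch_overhead(cl_context ctx, cl_device_id dev,
                                   cl_command_queue queue, int n)
      {
          cl_program prog = clCreateProgramWithSource(ctx, 1, &nop_src, NULL, NULL);
          clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
          cl_kernel nop = clCreateKernel(prog, "nop", NULL);
          size_t global = 64;

          struct timeval t0, t1;
          gettimeofday(&t0, NULL);
          for (int i = 0; i < n; ++i)
              clEnqueueNDRangeKernel(queue, nop, 1, NULL, &global, NULL,
                                     0, NULL, NULL);
          clFinish(queue);                     /* wait for all n launches */
          gettimeofday(&t1, NULL);

          double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
          printf("average cost per kernel launch: %.1f us\n", us / n);

          clReleaseKernel(nop);
          clReleaseProgram(prog);
      }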

      Any thoughts on this problem?


      I have also noticed that reading a single float from GPU memory with clEnqueueReadBuffer takes 199 us on ATI versus 1 us on NVIDIA. I also ran into problems where the ATI GPU would simply crash when a barrier(CLK_LOCAL_MEM_FENCE) was placed inside a loop.
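
      The loop in question is shaped roughly like the reduction below (a simplified stand-in, not my actual kernel). As far as I understand, a barrier inside a loop is only defined when every work-item in the work-group executes it the same number of times, which should be the case here since the loop bound depends only on the work-group size:

      /* Simplified stand-in for the kind of loop that crashes on the 5750:
         a local-memory reduction with one barrier per iteration. */
      __kernel void reduce_sum(__global const float *in,
                               __global float *out,
                               __local float *scratch)
      {
          int lid = get_local_id(0);
          scratch[lid] = in[get_global_id(0)];
          barrier(CLK_LOCAL_MEM_FENCE);

          for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
              if (lid < offset)
                  scratch[lid] += scratch[lid + offset];
              barrier(CLK_LOCAL_MEM_FENCE);    /* reached by all work-items */
          }

          if (lid == 0)
              out[get_group_id(0)] = scratch[0];
      }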