I have been searching on different forums regarding this problem, but haven't been successful in finding a similar problem, or maybe I haven't been searching well enough.
I am currently developing ultrasound processing of raw radio frequency data to a visible image. Processing such an image requires a number of steps, so there are a number of kernels that need to be run for one frame.
So far I have been successful: I can process raw data on an NVIDIA 9600GT at ~333 fps. I chose OpenCL so that different hardware is supported.
I have fully optimized the code following multiple best-practices guides, covering everything from data transport to shared memory and coalesced memory access, and I'm sure I can't squeeze any extra power out of it.
Now that extensive testing has been successful on NVIDIA hardware, I wanted to try it on an ATI 5750. I use both profiling and CPU timers to clock the processing time and the overhead of any other calls.
What struck me is that a large overhead appears when using ATI hardware. Here are some numbers:
NVIDIA 9600GT:
Total GPU Processing Time: 2354 us
Frame processing time with overhead: 3217.53 us
Each (ker) item in the list is a kernel and its execution time. The total GPU processing time is all items added up. A CPU timer surrounds the entire process to time the overhead calls; on NVIDIA this is about 1 ms of extra overhead.
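To make the "all items added up" figure concrete, here is a minimal sketch of how the per-kernel times could be summed. It assumes each kernel's start and end timestamps have already been read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which report nanoseconds) on a queue created with CL_QUEUE_PROFILING_ENABLE. The kernel_span struct and total_gpu_time_us function are hypothetical names for illustration, not part of my actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for one kernel's profiling timestamps.
 * Assumption: both values come from clGetEventProfilingInfo and are
 * therefore expressed in nanoseconds. */
typedef struct {
    uint64_t start_ns; /* CL_PROFILING_COMMAND_START */
    uint64_t end_ns;   /* CL_PROFILING_COMMAND_END   */
} kernel_span;

/* Add up the execution time of every kernel in one frame and return
 * the total in microseconds. Idle gaps between kernels are not counted,
 * which is why the CPU timer around the whole frame reads higher. */
double total_gpu_time_us(const kernel_span *spans, size_t count)
{
    uint64_t total_ns = 0;
    for (size_t i = 0; i < count; ++i) {
        total_ns += spans[i].end_ns - spans[i].start_ns;
    }
    return (double)total_ns / 1000.0;
}
```

Anything the CPU timer measures beyond this sum is what I call overhead in the rest of this post.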
ATI 5750:
Total GPU Processing Time: 2590 us
Frame processing time with overhead: 6440.24 us
The overhead on ATI hardware is 3.8 ms, which is roughly four times the overhead on NVIDIA. I've disabled some kernels and noticed that each extra kernel brings a large increase in overhead time. I believe there is a massive amount of extra overhead in calling multiple kernels on ATI hardware.
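For anyone who wants to check the arithmetic, the overhead is simply the CPU-timed frame time minus the summed kernel time. The sketch below uses two hypothetical helper functions (frame_overhead_us and overhead_ratio are names made up for this post) and nothing beyond plain C.

```c
/* Overhead for one frame: wall-clock time measured by the CPU timer
 * minus the summed GPU kernel time, both in microseconds. */
double frame_overhead_us(double frame_us, double gpu_us)
{
    return frame_us - gpu_us;
}

/* How many times larger one overhead is than another. */
double overhead_ratio(double overhead_a_us, double overhead_b_us)
{
    return overhead_a_us / overhead_b_us;
}
```

With the numbers above, frame_overhead_us(3217.53, 2354.0) gives about 863.5 us for NVIDIA and frame_overhead_us(6440.24, 2590.0) gives about 3850.2 us for ATI, so overhead_ratio comes out at roughly 4.5 when the exact figures are used instead of the rounded 1 ms.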
Any thoughts on this problem?
I have also noticed that reading a single float from GPU memory with clEnqueueReadBuffer takes 199 us on ATI and 1 us on NVIDIA. I also ran into a problem where the ATI GPU would simply crash when there was a barrier(CLK_LOCAL_MEM_FENCE) inside a loop.
These numbers are from the processing loop. The timers enclose only the kernel invocation calls, and on ATI hardware the overhead drops significantly whenever a kernel is not run.
Could it be that OpenCL on ATI hardware is at an "earlier" stage of development, and that this is causing the overhead?
The current implementation is not optimized for performance. You can expect this to change in upcoming releases.