I have searched several forums for this problem but haven't found a similar one, or maybe I haven't been searching well enough.
I am currently developing an ultrasound pipeline that processes raw radio-frequency (RF) data into a visible image. Producing one image takes several steps, so several kernels have to run for each frame.
So far I have been able to process raw data on an NVIDIA 9600GT at ~333 fps. I chose OpenCL so that different hardware is supported.
I have optimized the code following several best-practices guides, from data transfer to shared memory use and coalesced memory access, and I don't think I can squeeze much more out of it.
Now that extensive testing has been successful on NVIDIA hardware, I wanted to try it on an ATI 5750. I use both profiling and CPU timers to measure the processing time and the overhead of any other calls.
What struck me is that a large overhead appears when using ATI hardware. Here are some numbers:
NVIDIA 9600GT:
Total GPU processing time: 2354 us
Frame processing time with overhead: 3217.53 us
Each (ker) item in the list is a kernel with its execution time. The total GPU processing time is the sum of all items. A CPU timer surrounds the entire process to capture the overhead of the calls; on NVIDIA that is about 1 ms of extra time.
ATI 5750:
Total GPU processing time: 2590 us
Frame processing time with overhead: 6440.24 us
The overhead on ATI hardware is about 3.8 ms, roughly four times the NVIDIA figure. I disabled some kernels and noticed that each extra kernel adds a large amount of overhead time. I believe calling multiple kernels on ATI hardware carries a massive amount of extra overhead.
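To make the arithmetic behind these overhead figures explicit, here is a minimal sketch in C. It assumes the first pair of numbers above is the NVIDIA measurement and the second pair is the ATI one, and it treats overhead as the CPU-timed frame time minus the summed kernel times. The function names `overhead_us` and `overhead_ratio` are made up for illustration and are not part of my actual code.

```c
#include <assert.h>

/* Overhead in microseconds: wall-clock frame time (CPU timer)
   minus the summed kernel execution times (profiling). */
double overhead_us(double frame_us, double gpu_us) {
    assert(frame_us >= gpu_us);  /* the frame time includes the GPU time */
    return frame_us - gpu_us;
}

/* How many times larger overhead a_us is than overhead b_us. */
double overhead_ratio(double a_us, double b_us) {
    assert(b_us > 0.0);
    return a_us / b_us;
}
```

With the numbers above, `overhead_us(3217.53, 2354.0)` gives about 864 us for NVIDIA and `overhead_us(6440.24, 2590.0)` gives about 3850 us for ATI, a ratio of roughly 4.5.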
Any thoughts on this problem?
I have also noticed that reading a single float from GPU memory with clEnqueueReadBuffer takes 199 us on ATI versus 1 us on NVIDIA. I also ran into problems where the ATI GPU would simply crash when a barrier(CLK_LOCAL_MEM_FENCE) was placed inside a loop.
These numbers are from the processing loop. The timers enclose only the kernel invocation calls, and the difference on ATI hardware is distinct: when a kernel is not run, the overhead decreases significantly.
Could it be that OpenCL support on ATI hardware is at an "earlier" stage of development, and that this is what causes the overhead?
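One way to narrow down where the per-kernel time goes would be to split each kernel's profiling event into its queued, submitted, started and finished timestamps. OpenCL reports these in nanoseconds through clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START, _END) when the queue is created with CL_QUEUE_PROFILING_ENABLE. The sketch below only shows the breakdown arithmetic on plain integers; the struct and function names are hypothetical, and the timestamps are assumed to have been read from the event already.

```c
#include <assert.h>
#include <stdint.h>

/* Event timestamps in nanoseconds, in the order OpenCL reports them:
   QUEUED, SUBMIT, START, END. (Hypothetical container type.) */
typedef struct {
    uint64_t queued, submit, start, end;
} event_times_ns;

/* The same event split into three phases, in microseconds. */
typedef struct {
    double queue_us;   /* enqueued on the host -> submitted to the device */
    double launch_us;  /* submitted -> kernel starts executing */
    double exec_us;    /* kernel starts -> kernel finishes */
} event_breakdown_us;

event_breakdown_us breakdown(event_times_ns t) {
    /* Timestamps are expected to be non-decreasing. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    event_breakdown_us b;
    b.queue_us  = (double)(t.submit - t.queued) / 1000.0;
    b.launch_us = (double)(t.start  - t.submit) / 1000.0;
    b.exec_us   = (double)(t.end    - t.start)  / 1000.0;
    return b;
}
```

If the extra time sits in `queue_us` or `launch_us` rather than `exec_us`, that would point at the driver or runtime rather than at the kernels themselves.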