I recently moved an OpenCL application from an NVIDIA GPU to a Radeon HD 6320 Fusion running Ubuntu 12.04, and it is unexpectedly running significantly slower.
On setup, my program copies a very large data structure to the GPU (the CPU never reads or accesses it again), and then it:
- Queues several kernels and a read buffer (to copy a very small data structure back to main memory).
- Calls clFinish to wait for the kernels and the read buffer to complete.
- Repeats this continually, occasionally copying some extra data depending on what the read buffer returns (so the read buffer has to complete before the next round of kernels can be added to the queue).
After profiling on both GPUs, the delay on the ATI GPU seems to come entirely from the interval between the first kernel being added to the queue (CL_PROFILING_COMMAND_QUEUED) and that kernel starting execution (CL_PROFILING_COMMAND_START). On the NVIDIA GPU, this takes a few microseconds each iteration. On the ATI GPU, it takes around 20 ms each iteration, which is far too long for my use.
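To make the measurement concrete, here is a minimal sketch of how the interval can be derived from the per-event profiling timestamps. It is an illustration under assumptions, not the actual application code: the `event_times` struct and the helper names are hypothetical, and `uint64_t` stands in for `cl_ulong` so the snippet compiles without the OpenCL headers. In a real program the four fields would be filled by `clGetEventProfilingInfo` on a queue created with `CL_QUEUE_PROFILING_ENABLE`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one event's profiling timestamps, in nanoseconds.
 * Each field corresponds to a clGetEventProfilingInfo query:
 * CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.
 * uint64_t is used here in place of cl_ulong (an assumption for portability). */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} event_times;

/* Convert the nanosecond interval [from, to] to milliseconds. */
double ns_to_ms(uint64_t from, uint64_t to) {
    return (double)(to - from) / 1.0e6;
}

/* Time between enqueueing on the host and submission to the device. */
double host_queue_ms(const event_times *t) {
    return ns_to_ms(t->queued, t->submit);
}

/* Time between submission to the device and the start of execution. */
double device_wait_ms(const event_times *t) {
    return ns_to_ms(t->submit, t->start);
}

/* Total launch delay: the QUEUED-to-START interval described above. */
double launch_delay_ms(const event_times *t) {
    return ns_to_ms(t->queued, t->start);
}

/* Time the command spent executing on the device. */
double exec_ms(const event_times *t) {
    return ns_to_ms(t->start, t->end);
}

/* Print the full breakdown for one event. */
void print_breakdown(const event_times *t) {
    printf("queued->submit: %.3f ms\n", host_queue_ms(t));
    printf("submit->start:  %.3f ms\n", device_wait_ms(t));
    printf("queued->start:  %.3f ms\n", launch_delay_ms(t));
    printf("start->end:     %.3f ms\n", exec_ms(t));
}
```

Splitting the QUEUED-to-START interval at CL_PROFILING_COMMAND_SUBMIT would show whether the time is spent before or after the command is handed to the device.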
What could be causing this large delay?