I would like to ask a question. I wrote and successfully compiled a fairly complicated program (at least for me) with OpenCL 1.2 on an AMD GPU with 20 CUs (I don't remember the exact model of the card). The output is numerically correct and the performance is good, but not as excellent as I expected at the beginning. Let me explain the problem. Unfortunately I cannot share the code yet for policy reasons. I hope to explain everything as clearly as possible, and I may later add some screenshots of the CodeXL GPU tracing results.
My code should beat the identical code written for the CPU using SIMD instructions. This seemed very hard to me at first, but I tried.
The results provided by CodeXL are quite good: no bank conflicts, LDS almost fully used, and satisfying kernel execution and data transfer times for the way I chose to implement the instructions. The problem is that the code is organized (or, more precisely, has to be organized) to process parameters from data sampled every 1 ms. So every millisecond the CPU collects samples, which have to be properly organized and attached to buffers for processing on the GPU. This has to be done over a very long simulation period. Unfortunately the work space needed for processing is quite small, and at the moment I have constraints on how the entire algorithm is structured, so no improvements can be applied short of entirely rethinking the algorithm tree and the order of operations.
In any case, the real problem at the moment is not the kernel structure or low GPU efficiency, since the performance of a single iteration is satisfying. The problem is that at every iteration I need to call APIs for buffer creation, for NDRange kernel execution, for reading buffer rects, and for releasing memory objects. This hugely increases the overall run time of the code.
My question is: does the CodeXL timeline shown in the upper part reflect the REAL duration of the entire code running on the CPU/GPU machine? I know it is possible to split operations between the CPU and GPU using different contexts and queues, so that there are no latencies or empty intervals between computations on the different devices, but what about the duration of the API calls? Is there a way to cut, or at least reduce, the calling time associated with API functions, or is it just something machine/device dependent?
Thanks in advance to everyone.
I guess you are talking about the execution time and overhead of the OpenCL run-time APIs. The actual numbers depend on various factors such as the device (and its architecture), the driver/run-time implementation, the OS, etc. Programmers have very limited control over these system-dependent factors. However, they can apply knowledge of such system-dependent information to improve the overall performance of the application.
Each command submitted through a clEnqueue call has an associated launch overhead. To reduce this overhead, the AMD OpenCL runtime combines several command submissions into a batch. One can take advantage of this feature when enqueuing multiple commands, for example several clEnqueueNDRangeKernel calls in a row.
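To make the batching point a bit more concrete, here is a toy C model. It is not the AMD runtime and makes no OpenCL calls: `MockQueue`, `mock_enqueue` and `mock_flush` are made-up names, and the assumption that one flush equals one batch submission is a simplification. It only counts how many submissions reach the "device" when the host synchronizes after every command, compared with enqueuing a group of commands and flushing once (in real host code the flush would roughly correspond to clFlush or clFinish).

```c
#include <assert.h>

/* Hypothetical mock of a command queue, NOT the AMD OpenCL runtime.
   It only tracks how often queued commands are submitted as a batch. */
typedef struct {
    int pending;      /* commands enqueued but not yet submitted */
    int submissions;  /* number of batch submissions so far */
    int executed;     /* total commands handed to the device */
} MockQueue;

/* Stands in for a clEnqueue* call: the command is only queued. */
void mock_enqueue(MockQueue *q) { q->pending++; }

/* Stands in for clFlush/clFinish: everything pending goes out as one batch. */
void mock_flush(MockQueue *q) {
    if (q->pending > 0) {
        q->executed += q->pending;
        q->pending = 0;
        q->submissions++;
    }
}

/* Pattern A: synchronize after every command, so each one is its own batch. */
int submissions_sync_each(int n_commands) {
    assert(n_commands >= 0);
    MockQueue q = {0, 0, 0};
    for (int i = 0; i < n_commands; i++) {
        mock_enqueue(&q);
        mock_flush(&q);
    }
    return q.submissions;
}

/* Pattern B: enqueue all commands first, then flush once. */
int submissions_batched(int n_commands) {
    assert(n_commands >= 0);
    MockQueue q = {0, 0, 0};
    for (int i = 0; i < n_commands; i++) {
        mock_enqueue(&q);
    }
    mock_flush(&q);
    return q.submissions;
}
```

With 8 commands, `submissions_sync_each(8)` returns 8 while `submissions_batched(8)` returns 1. How much time each avoided submission actually saves depends on the device and driver, so it has to be measured.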
Enabling profiling on a command queue adds a little more overhead (in the range of microseconds) to all clEnqueue calls, so one may want to leave profiling disabled in the final deliverable code.
As launching a kernel on the CPU is faster than on the GPU, it is sometimes better to run a simple, low-intensity kernel on the CPU rather than on the GPU.
Newer-generation devices with multiple ACEs and DMA engines can exhibit better performance than lower-end devices.
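For the per-iteration create/run/read/release pattern described in the question, a rough way to estimate the possible saving is the arithmetic below. It is only a sketch: the `ApiCost` struct and the function names are hypothetical, any microsecond values passed in are placeholders rather than measurements, and it assumes the buffer sizes stay constant so that the buffers can be created once and refilled every millisecond (for example with clEnqueueWriteBuffer) instead of being created and released each time.

```c
#include <assert.h>

/* Hypothetical per-call host costs in microseconds.
   These are placeholders to be replaced by values from a real trace. */
typedef struct {
    long create_us;   /* clCreateBuffer (assumed to include the initial copy) */
    long write_us;    /* clEnqueueWriteBuffer, used only when buffers are reused */
    long enqueue_us;  /* clEnqueueNDRangeKernel */
    long read_us;     /* clEnqueueReadBufferRect */
    long release_us;  /* clReleaseMemObject */
} ApiCost;

/* Every iteration creates the buffers, runs, reads back, and releases. */
long total_per_iteration_alloc(ApiCost c, long iterations) {
    assert(iterations >= 0);
    return iterations *
           (c.create_us + c.enqueue_us + c.read_us + c.release_us);
}

/* Buffers are created once before the loop and released once after it;
   each iteration only writes new samples, runs the kernel, and reads back. */
long total_hoisted_alloc(ApiCost c, long iterations) {
    assert(iterations >= 0);
    return c.create_us + c.release_us +
           iterations * (c.write_us + c.enqueue_us + c.read_us);
}
```

For instance, with placeholder costs of 50/15/20/30/10 µs (in the field order above) and 1000 iterations, the first function returns 110000 µs and the second 65060 µs. The real per-call costs would have to come from a CodeXL trace of the actual application.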
Hi Dipak, thank you very much for replying. You gave me a very thorough answer.
Yes, I verified in the past that enabling profiling events is time-consuming, and since I get all the relevant information from CodeXL, I use it only to test the correctness and computational performance of the code.
Yes, that is the problem: the overhead associated with the OpenCL run-time APIs. Since I'm still not very good at using OpenCL, I wrote the code by splitting it into several kernels, so the time spent in run-time API calls grows a lot and reduces the overall performance. I ran all the operations on the GPU because I thought it was better; I'll try doing the same on the CPU and compare the results. Unfortunately the OS is Windows, so there are many other threads running in the background. This is also why I chose to run all the code on the GPU.
I followed your suggestion and saw that batching of clEnqueue commands is documented in the AMD Accelerated Parallel Processing guide. I'll try grouping the clEnqueue command submissions into a batch. I hope that will improve things.
I'm not sure, but in Matthew Scarpino's book there is a chapter (chapter 8) about C++/OpenCL development. I only glanced at it, but it might be useful for this purpose.
Thank you again.