I would like to ask a question. I wrote and successfully compiled a fairly complicated program (at least for me) with OpenCL 1.2 on an AMD GPU with 20 CUs (I don't remember the exact model of the card). The output is numerically correct and the performance is good, but not as good as I expected at the beginning. Let me explain the problem. Unfortunately I cannot share the code yet for policy reasons. I hope to explain everything as clearly as possible, and I may be able to add some pictures of the CodeXL GPU tracing results later.
My code should beat the same, identical code written for the CPU using SIMD instructions. This seemed very hard to me at the beginning, but I tried.
The results reported by CodeXL are quite good: no bank conflicts, LDS almost fully used, and satisfactory kernel execution and data transfer times for the way I chose to implement the instructions. The problem is that the code is organized (or rather, has to be organized) to process parameters from data sampled at 1 ms intervals: every millisecond the CPU collects samples, which have to be properly organized and attached to buffers for processing on the GPU. This has to be done over a very long simulation period. Unfortunately the work space needed for each processing step is quite small, and at the moment I am constrained by how the entire algorithm is structured, so no improvement is possible without completely rethinking the algorithm tree and the order of operations.
That said, the real problem at the moment is not the kernel structure or low GPU efficiency, since the performance of a single iteration is satisfactory. The problem is that at every iteration I need to call the APIs for buffer creation, NDRange kernel execution, buffer rect reading, and releasing memory objects. This greatly increases the overall run time of the code.
My question is: does the CodeXL timeline shown in the upper part reflect the REAL duration of the entire code running on the CPU/GPU machine? I know it is possible to split operations between the CPU and the GPU using different contexts and queues so that there are no latencies or idle gaps between computations on the different devices, but what about the duration of the API calls themselves? Is there a way to eliminate or at least reduce the time spent in API calls, or is it just something machine/device dependent?
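To cross-check the timeline against what the host actually sees, one option would be to time the per-iteration loop with a host clock. The following is only a minimal C11 sketch under my own assumptions, not my actual code: `step_fn`, `counting_step`, and `average_step_ms` are hypothetical names. In the real program the step callback would wrap the buffer creation, kernel enqueue, rect read, and release calls for one 1 ms block of samples, ending with a blocking read or `clFinish` so that the GPU work is included in the measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call performs the work for one 1 ms block. */
typedef void (*step_fn)(void *user_data);

/* Stand-in step used for illustration only: it just counts its invocations.
   The real step would issue the OpenCL calls for one iteration. */
void counting_step(void *user_data) {
    int *calls = (int *)user_data;
    ++*calls;
}

/* Elapsed time between two timestamps, in milliseconds. */
static double elapsed_ms(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec) * 1e3 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Runs `step` `iterations` times and returns the average host wall-clock
   time per iteration in milliseconds. Uses the C11 timespec_get clock,
   which is wall-clock time rather than a monotonic clock. */
double average_step_ms(step_fn step, void *user_data, int iterations) {
    assert(step != NULL && iterations > 0);

    struct timespec start, end;
    timespec_get(&start, TIME_UTC);
    for (int i = 0; i < iterations; ++i) {
        step(user_data);
    }
    timespec_get(&end, TIME_UTC);

    return elapsed_ms(start, end) / iterations;
}
```

Calling `average_step_ms` with the real step and a few thousand iterations, then comparing the result with the per-iteration span in the CodeXL timeline, should show whether a significant amount of host time falls outside the traced API calls.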
Thanks in advance to everyone.