The section "1.4 OpenCL Data Transfer Optimization" in the AMD OpenCL Programming Optimization Guide describes various ways of creating buffers and transferring data to minimize buffer transfer/access overhead in some common application scenarios. I would suggest going through that section once.
There are many points to consider before choosing the particular mechanism best suited to your own application. Often the choice is not straightforward, so it is normally better to run a few experiments before making a final decision.
For example, I just want to mention a few points regarding the above code:
1) If you want to completely overwrite the contents of "inBuf", it is better to use the flag CL_MAP_WRITE_INVALIDATE_REGION instead of CL_MAP_WRITE, because it can save one memory-copy overhead: the runtime does not need to make the buffer's previous contents visible to the host.
2) If you want to fill "inBuf" from an existing host buffer, you may set the contents during buffer creation itself (CL_MEM_COPY_HOST_PTR) or even use the same memory as a pinned host buffer (CL_MEM_USE_HOST_PTR).
3) As "outBuf" is created on the host side, depending on the situation the kernel's access time to "outBuf" may be longer compared to a similar device-side buffer. So you may actually observe slower kernel performance, or even lower overall application performance.
Thanks for your suggestions, especially the CL_MAP_WRITE_INVALIDATE_REGION hint.
I took your recommendation and experimented a bit to find a good combination of flags - thanks to AMD's excellent CodeXL it is very easy to observe what is going on. For my use-case (low-bandwidth kernel), host-side buffers in a CPU-cacheable area seem to work best (CL_MEM_READ_WRITE |
I am also quite curious about map-free SVM buffers, although I understand the comfort they provide comes at the cost of throughput / bandwidth.
Thanks & best regards, Clemens
PS: Thanks again for CodeXL, especially for providing such an excellent linux version.
In general, a proper data-transfer methodology across multiple independent devices is to use "rotating buffers": instead of having a single set of input/result buffers, you keep two (three? four?) and start working on the (n+1)-th while you wait for the non-blocking map of the n-th, so you don't force a full CPU-GPU sync.
In my experience this makes each individual map even higher latency, but it improves overall bandwidth. Most importantly, you waste no GPU time.
Games have been doing that for decades.
Good news! Drivers internally apply some of those tricks for you. CL_MAP_WRITE_INVALIDATE_REGION is one of the longest-lived hints for buffer management. Hopefully it will improve your situation.
I'm honestly surprised mapping takes so long on such a recent APU; I think I have seen similar performance on my AM3 system.
I used to apply the same "rotating buffer" technique on digital signal processors. The downside, however, is that it can sometimes be hard to integrate such an approach into existing software whose whole application design relies on synchronous execution. That was the reason why GPUs didn't seem very attractive for our use-case; however, due to recent advancements in APUs, it looks like things have changed.
Thanks & br, Clemens