I don't know if anyone else has experienced this, but I just found that the order in which I allocate
OpenCL buffers has a dramatic effect on performance. I have one group of buffers that holds uncompressed data
and another group that holds compressed data. If I allocate all of the uncompressed buffers first and then
allocate all of the compressed buffers, performance is dramatically higher than if I interleave the allocations, i.e.
allocate one uncompressed buffer, then one compressed, then one uncompressed, and so on.
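For concreteness, the two allocation orders can be sketched like this. It is only an illustration: the enum and function names are made up, and the sketch builds the order as a plan rather than calling OpenCL, so it runs without a device. Real host code would walk the plan and call clCreateBuffer once per entry.

```c
#include <assert.h>
#include <stddef.h>

/* Kind of buffer to create at each step of the allocation sequence.
 * Hypothetical names, used only for this illustration. */
typedef enum { BUF_UNCOMPRESSED, BUF_COMPRESSED } buf_kind;

/* Grouped order: U U U ... C C C ... (the faster order reported above).
 * plan must have room for 2 * pairs entries. */
void plan_grouped(buf_kind *plan, size_t pairs)
{
    assert(plan != NULL);
    for (size_t i = 0; i < pairs; ++i) {
        plan[i] = BUF_UNCOMPRESSED;
        plan[pairs + i] = BUF_COMPRESSED;
    }
}

/* Interleaved order: U C U C ... (the slower order reported above).
 * plan must have room for 2 * pairs entries. */
void plan_interleaved(buf_kind *plan, size_t pairs)
{
    assert(plan != NULL);
    for (size_t i = 0; i < pairs; ++i) {
        plan[2 * i] = BUF_UNCOMPRESSED;
        plan[2 * i + 1] = BUF_COMPRESSED;
    }
}
```

Nothing else in the host code changes between the two runs. Only the order of the clCreateBuffer calls differs.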
This is on the Ellesmere architecture. I saw a similar issue on Cape Verde, where allocating a small dummy buffer
made performance go way up.
In the Cape Verde case, I was told this was related to the memory channels on the card. I suppose the same applies to
Ellesmere: if an uncompressed buffer and its corresponding compressed buffer are assigned to the same channel, performance is better.
My app is very memory-intensive.
It would be nice to have a deep dive into how the memory controllers are designed on Polaris. It would also be nice to be able to
give the compiler hints to place certain buffers on the same memory controller.
For my app, when I compress an image, I have a number of buffers assigned to that image, and having these buffers assigned to the same
controller (if that is what is going on) seems to give a huge performance boost (around 100% faster).
Any advice on how buffer allocation order can affect performance?
I also read in the AMD OpenCL best practices guide that there are two DMA engines on these cards, and that command queues are assigned to
one engine or the other based on when they are created: the first queue goes to engine 1, the second to engine 2, the third to engine 1, and so on.
I suppose the same logic applies to the memory controllers on the card?
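The queue assignment described above amounts to a round-robin rule. A minimal sketch, with a made-up function name, and with no claim that buffers are mapped to memory controllers this way:

```c
#include <assert.h>

/* Round-robin assignment: the item created at position creation_index
 * (counting from 0) goes to unit 1, 2, ..., unit_count, then back to 1.
 * Hypothetical helper for illustration; not an OpenCL or driver API. */
unsigned round_robin_unit(unsigned creation_index, unsigned unit_count)
{
    assert(unit_count > 0);
    return creation_index % unit_count + 1u;
}
```

With unit_count set to 2 this reproduces the pattern for the DMA engines: the first queue maps to engine 1, the second to engine 2, the third to engine 1 again.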