I don't know if anyone else has experienced this, but I just found that the order in which I allocate
OpenCL buffers has a dramatic effect on performance. I have one group of buffers that holds uncompressed data
and another group that holds compressed data. If I allocate all of the uncompressed buffers first and then
all of the compressed buffers, performance is dramatically higher than if I interleave the allocations, i.e.
allocate one uncompressed buffer, then one compressed, then one uncompressed, and so on.
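
To make the two orderings concrete, here is a minimal sketch in plain C. It only builds the order in which the buffers would be created; the real `clCreateBuffer` call is left as a comment because it needs a live context. The names `alloc_step`, `plan_grouped` and `plan_interleaved` are made up for illustration and are not from my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef enum { BUF_UNCOMPRESSED, BUF_COMPRESSED } buf_kind;

typedef struct {
    buf_kind kind;   /* which group the buffer belongs to */
    size_t   index;  /* which uncompressed/compressed pair it belongs to */
} alloc_step;

/* Grouped order (the fast case for me): all uncompressed buffers first,
 * then all compressed buffers. `out` must have room for 2 * pairs steps. */
size_t plan_grouped(size_t pairs, alloc_step *out)
{
    size_t n = 0;
    assert(out != NULL);
    for (size_t i = 0; i < pairs; ++i) {
        out[n].kind = BUF_UNCOMPRESSED;
        out[n].index = i;
        ++n;
    }
    for (size_t i = 0; i < pairs; ++i) {
        out[n].kind = BUF_COMPRESSED;
        out[n].index = i;
        ++n;
    }
    return n;
}

/* Interleaved order (the slow case for me): uncompressed, compressed,
 * uncompressed, compressed, ... */
size_t plan_interleaved(size_t pairs, alloc_step *out)
{
    size_t n = 0;
    assert(out != NULL);
    for (size_t i = 0; i < pairs; ++i) {
        out[n].kind = BUF_UNCOMPRESSED;
        out[n].index = i;
        ++n;
        out[n].kind = BUF_COMPRESSED;
        out[n].index = i;
        ++n;
    }
    return n;
}

/* In a real program each step would then become one call such as
 *   clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
 * issued in the order given by the plan. */
```

Whether the driver actually places buffers in memory in creation order is an assumption on my part; the sketch only shows the difference in call order between the two cases.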
This is on the Ellesmere architecture. I saw a similar issue on Cape Verde, where allocating a small dummy buffer made performance
go way up.
In the Cape Verde case, I was told this was related to the memory channels on the card. I suppose the same applies to Ellesmere:
if an uncompressed buffer and its corresponding compressed buffer are assigned to the same channel, then performance is better.
My app is very memory-intensive.