I have a kernel that does some computations on blocks of matrices (each block holds 128 matrices).
To process a block, I iterate over all the matrices in a block.
It seems that this loop is unrolled too aggressively, which reduces the number of in-flight wavefronts from 4 to 2.
Since I need a certain number of wavefronts to hide the memory latency, I would like to tell the OpenCL compiler NOT to unroll this loop. Is this possible?
I am currently passing the loop bound (the block size) as a kernel argument instead of a compile-time constant to avoid the unrolling, but that's not a great solution IMO...
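To make the workaround concrete, here is a minimal sketch of what I mean. It is not my real kernel: the names `process_block`, `matrices_per_block` and `MAT_ELEMS`, the 4x4 matrix size and the summing loop body are all made up for illustration. The only point is that the trip count arrives as a runtime argument, so the compiler does not know it is 128. The `#ifndef` part is just a host-side shim so the same snippet can be compiled as plain C for a quick check.

```c
/* Illustrative sketch only: all names and sizes here are assumptions. */
#define MAT_ELEMS 16  /* assumed: 4x4 matrices stored contiguously */

#ifndef __OPENCL_VERSION__
/* Host-only shim so the kernel body can be compiled and checked as plain C. */
#include <stddef.h>
#include <assert.h>
#define __kernel
#define __global
static size_t shim_group = 0;  /* stands in for the work-group index */
static size_t shim_local = 0;  /* stands in for the local work-item index */
static size_t get_group_id(unsigned dim) { (void)dim; return shim_group; }
static size_t get_local_id(unsigned dim) { (void)dim; return shim_local; }
#endif

/* One work-group per block, one work-item per matrix element.
   matrices_per_block is 128 in practice, but it is passed at run time,
   so the compiler cannot treat it as a constant trip count. */
__kernel void process_block(__global const float *matrices,
                            __global float *out,
                            const int matrices_per_block)
{
    const size_t block = get_group_id(0);
    const size_t elem  = get_local_id(0);

    float acc = 0.0f;
    /* The loop I do not want unrolled: iterate over the matrices of a block. */
    for (int m = 0; m < matrices_per_block; ++m) {
        const size_t mat = block * (size_t)matrices_per_block + (size_t)m;
        acc += matrices[mat * MAT_ELEMS + elem];
    }
    out[block * MAT_ELEMS + elem] = acc;
}
```

On the host side this just means one extra `clSetKernelArg` call with the value 128. It works, but it hides a value that really is a compile-time constant, which is why I would prefer a proper way to control the unrolling.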
Thank you for your help,