I've run into a problem where an inner loop does not execute the correct number of times unless I put #pragma unroll on it. In the code I was working on, the results were wrong, and I noticed that everything ran much quicker than it should. I then added an atomic counter to record how many times the loop body actually executes, and the count was wrong. With the unroll forced, the loop runs the correct number of times.
I've put the CL kernel and IL + ISA here
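For anyone who wants to reproduce the counting check, here is a minimal host-side sketch in plain C11. It is not the kernel linked above: the function names and the sequential "work-item" loop are hypothetical stand-ins. In a real OpenCL 1.1+ kernel the counter would be a volatile __global int buffer incremented with atomic_inc inside the inner loop, and the host would read it back and compare it with the expected total.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one work-item: runs the inner loop and
   bumps the shared counter once per iteration of the loop body. */
void fake_work_item(unsigned inner_trips, atomic_uint *counter)
{
    for (unsigned i = 0; i < inner_trips; ++i) {
        /* ... the real inner-loop work would go here ... */
        atomic_fetch_add(counter, 1u);
    }
}

/* Runs num_items work-items one after another (a real device would run
   them in parallel) and returns the observed iteration count. */
unsigned count_inner_iterations(unsigned num_items, unsigned inner_trips)
{
    assert(num_items > 0);

    atomic_uint counter;
    atomic_init(&counter, 0u);

    for (unsigned item = 0; item < num_items; ++item) {
        fake_work_item(inner_trips, &counter);
    }
    return atomic_load(&counter);
}

/* Returns 1 when the observed count equals num_items * inner_trips,
   which is the total the loop should reach if nothing is skipped. */
int iteration_count_matches(unsigned num_items, unsigned inner_trips)
{
    unsigned expected = num_items * inner_trips;
    return count_inner_iterations(num_items, inner_trips) == expected;
}
```

If the equivalent check on the device reports a count different from the expected total without #pragma unroll, and the expected total with it, that matches the behaviour described above.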