There seems to be a bug in Stream 2.5 related to loop unrolling:
This code works: http://pastebin.com/6gSaKpKD
That is with the loop written as
for (i = 0; i < 4095; i++)
Change it to
for (i = 0; i < 4096; i++)
and it seems like the kernel returns immediately, doing no work.
Change it again to
for (i = 0; i < 4097; i++)
and the kernel works as intended again.
This looks a lot like the problem I had here