#pragma unroll supported?
Hi,
I have a kernel with a fixed-size loop (64 iterations) that KernelAnalyzer does not unroll automatically. When I specify #pragma unroll, KernelAnalyzer crashes.
As a test, I used a modified version of the constantBandwidth_single_static kernel (posted below). When I specify no unroll count, or a high one, the crash is reproducible there as well, even though this kernel is unrolled automatically without any #pragma.
Oddly enough, the loop is even unrolled when specifying #pragma unroll 1 ...
My kernel additionally has the odd side effect that a higher unroll count results in a longer runtime (as predicted by KernelAnalyzer)! When I unroll the loop manually (writing out the block 64 times with no loop), the runtime is about half that of the loop, i.e. -50% runtime.
#pragma unroll 2 results in +50% runtime
#pragma unroll 4 results in +150% runtime
#pragma unroll 8 results in +350% runtime
This leads to the question: Is #pragma unroll supported? If so, in which cases is it useful?
Or is this just a problem in KernelAnalyzer? I currently cannot test on real hardware.
__kernel void constantBandwidth_single_static(__global DATATYPE *cb, __global DATATYPE *output)
{
    DATATYPE val = (DATATYPE)(0.0f);
    uint gid = get_global_id(0);
    uint index = 0;
    uint i;

    #pragma unroll 64
    for (i = 0; i < 256; i++) {
        val = val + cb[index + i];
    }
    output[gid] = val;
}
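
The manual unrolling mentioned earlier (the loop body written out 64 times with no loop) can be sketched with preprocessor macros instead of copy and paste. This is only an illustration under assumptions: the macro names REPEAT_4/REPEAT_16/REPEAT_64 and the functions sum64_loop/sum64_unrolled are hypothetical and not taken from the original kernel, and the code is written as plain C so it can be compiled and checked on the host. It relies only on standard C preprocessor macros, which OpenCL C also supports, so the same macros should be usable inside a kernel.

```c
/* Hypothetical helper macros (not part of any SDK): each level pastes
 * the given statements four times as often as the level before it. */
#define REPEAT_4(body)  body body body body
#define REPEAT_16(body) REPEAT_4(body) REPEAT_4(body) REPEAT_4(body) REPEAT_4(body)
#define REPEAT_64(body) REPEAT_16(body) REPEAT_16(body) REPEAT_16(body) REPEAT_16(body)

/* Reference version: an ordinary 64-iteration loop. */
float sum64_loop(const float *cb)
{
    float val = 0.0f;
    unsigned int i;
    for (i = 0; i < 64; i++) {
        val = val + cb[i];
    }
    return val;
}

/* Manually unrolled version: the preprocessor pastes the two statements
 * 64 times, so no loop construct is left for the compiler to handle. */
float sum64_unrolled(const float *cb)
{
    float val = 0.0f;
    unsigned int i = 0;
    REPEAT_64(val = val + cb[i]; i++;)
    return val;
}
```

Both functions add the elements in the same order, so they should return the same value for the same input. Any difference in runtime between the two forms then comes from the compiler's handling of the loop, not from the arithmetic.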