
conditional operator behaviour on vector operand
nou Jul 21, 2010 12:15 PM (in response to Meteorhead)look at select() function.

conditional operator behaviour on vector operand
Meteorhead Jul 23, 2010 12:24 PM (in response to nou)Thank you for the quick reply, however this did not the issue I was facing.
I was using the ?: operator in the first approach with vector types to achieve my goal, but the compiler gave me an "Error:E010:Irreducible ControlFlow Detected". However the error persists even with the select() function. The problem is becoming urgent, so if somebody could give an answer I would be grateful. Here is the exact line that causes the problem (I guess).
factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD  2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 );
This is the heart of the algorithm I wrote (inside a for cycle(i)). It is not simple, so let me explain real quick to save you time: I look through an integer on bit level by groups of two to find sequences of "10". If a group of two bits matches this sequence AND random is less than the lesser out of p and rp, than factor be equal to three, otherwise it should be zero.
I use ((A+B) == 2) instead of (A && B) in the conditional part because NVIDIA has not implemented && operator on vectors (no comment). all types in the code are uint4 except random,p,rp which are float4. Naturally in each element of this vectorexpression the results may differ but no if statements are included.
Is the problem really irreducible or only my code?

conditional operator behaviour on vector operand
omkaranathan Jul 23, 2010 12:34 PM (in response to Meteorhead)Meteorhead,
Please post your kernel code.

conditional operator behaviour on vector operand
Meteorhead Jul 23, 2010 12:42 PM (in response to omkaranathan)The code is long and it utilizes the sample Mersennetwister PRNG.
The most important part of the code is the last big for cycle: for (uint t = 0 ; t < timestep ; ++t)
This is where the algorithm runs. In Odd timestep both sequences of "10" and "01" are looked for. In Even timesteps communication is needed with neighbouring vectors (only the last bit).
Hope is it understandable.
// Vector type shifting with carryover between elements void lshift128(uint4 input, uint shift, uint4* output) { unsigned int invshift = 32u  shift; uint4 temp; temp.x = input.x << shift; temp.y = (input.y << shift)  (input.x >> invshift); temp.z = (input.z << shift)  (input.y >> invshift); temp.w = (input.w << shift)  (input.z >> invshift); *output = temp; } void rshift128(uint4 input, uint shift, uint4* output) { unsigned int invshift = 32u  shift; uint4 temp; temp.w = input.w >> shift; temp.z = (input.z >> shift)  (input.w << invshift); temp.y = (input.y >> shift)  (input.z << invshift); temp.x = (input.x >> shift)  (input.y << invshift); *output = temp; } __constant unsigned int thirteen = 13u; __constant unsigned int fifteen = 15u; __constant unsigned int shift = 8u * 3u; __constant unsigned int mask11 = 0xfdff37ffu; __constant unsigned int mask12 = 0xef7f3f7du; __constant unsigned int mask13 = 0xff777b7du; __constant unsigned int mask14 = 0x7ff7fb2fu; __constant float one = 1.0f; __constant float intMax = 4294967296.0f; inline float4 rand(uint4* r1, uint4* r2, uint4* a, uint4* b, uint4* e, uint4* f, uint4* state1, uint4* state2, uint4* state3, uint4* state4, uint4* state5, uint4* temp) { *r1 = *state4; *r2 = *state5; *a = *state1; *b = *state3; lshift128(*a, shift, e); rshift128(*r1, shift, f); (*temp).x = (*a).x ^ (*e).x ^ (((*b).x >> thirteen) & mask11) ^ (*f).x ^ ((*r2).x << fifteen); (*temp).y = (*a).y ^ (*e).y ^ (((*b).y >> thirteen) & mask12) ^ (*f).y ^ ((*r2).y << fifteen); (*temp).z = (*a).z ^ (*e).z ^ (((*b).z >> thirteen) & mask13) ^ (*f).z ^ ((*r2).z << fifteen); (*temp).w = (*a).w ^ (*e).w ^ (((*b).w >> thirteen) & mask14) ^ (*f).w ^ ((*r2).w << fifteen); return convert_float4(temp[0]) * one / intMax; } // rand // Chainlength is the number of vectors __kernel void test( __global uint4* seedArray, __global uint4* particles, uint chainlength, __local uint4* share, __global uint4* partial, uint timestep, float4 p, float4 q, __global float4* r_p, __global float4* r_q, __local float4* rp, __local float4* rq) { // Global init int gidX = get_global_id(0); int tidX = get_local_id(0); int lsiX = get_local_size(0); int GIDX = get_group_id(0); // KPZ init uint4 particle; uint vectors_per_thread = chainlength / lsiX; uint4 XOR = (uint4)(0); uint4 factor = (uint4)(0); uint SD = 8 * sizeof(uint); uint4 zero4 = (uint4)(0); uint my_share; for (uint v = 0 ; v < vectors_per_thread ; ++v) { my_share = tidX * vectors_per_thread + v; share[my_share] = particles[my_share]; rp[my_share] = r_p[my_share]; rq[my_share] = r_q[my_share]; } // PRNG init uint4 temp; uint4 state1 = seedArray[GIDX * lsiX + tidX]; uint4 state2 = (uint4)(0); uint4 state3 = (uint4)(0); uint4 state4 = (uint4)(0); uint4 state5 = (uint4)(0); uint stateMask = 1812433253u; uint thirty = 30u; uint4 mask4 = (uint4)(stateMask); uint4 thirty4 = (uint4)(thirty); uint4 one4 = (uint4)(1u); uint4 two4 = (uint4)(2u); uint4 three4 = (uint4)(3u); uint4 four4 = (uint4)(4u); uint4 r1 = (uint4)(0); uint4 r2 = (uint4)(0); uint4 a = (uint4)(0); uint4 b = (uint4)(0); uint4 e = (uint4)(0); uint4 f = (uint4)(0); //Initializing states. state2 = mask4 * (state1 ^ (state1 >> thirty4)) + one4; state3 = mask4 * (state2 ^ (state2 >> thirty4)) + two4; state4 = mask4 * (state3 ^ (state3 >> thirty4)) + three4; state5 = mask4 * (state4 ^ (state4 >> thirty4)) + four4; float4 random; barrier(CLK_LOCAL_MEM_FENCE); // All threads wait for each other for (uint t = 0 ; t < timestep ; ++t) { for (uint v = 0 ; v < vectors_per_thread ; ++v) { my_share = tidX * vectors_per_thread + v; particle = share[my_share]; // Odd timestep for (uint i = 0 ; i < SD ; i += 2) { random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD  2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 ); random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD  2) ) == one4) + abs(isless(random,min(q,rq[my_share])) ) ) == two4 ); XOR = XOR << two4; XOR = factor; } particle ^= XOR; share[my_share] = particle; mem_fence(CLK_LOCAL_MEM_FENCE); // Local memory operations before and after fence don't mix // Even timestep lshift128(particle, 1, &particle); particle.x = share[(my_share + 1) % chainlength].w >> (SD  1); for (uint i = 0 ; i < SD ; i += 2) { random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD  2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 ); random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD  2) ) == one4) + abs(isless(random,min(q,rq[my_share])) ) ) == two4 ); XOR = XOR << two4; XOR = factor; } particle ^= XOR; rshift128(particle, 1, &particle); share[my_share] = particle; share[(my_share + 1) % chainlength].w = particle.x << (SD  1); mem_fence(CLK_LOCAL_MEM_FENCE); } // vectors_per_thread } // timestep for (uint v = 0 ; v < vectors_per_thread ; ++v) { partial[GIDX * chainlength + tidX * vectors_per_thread + v] = share[tidX * vectors_per_thread + v];} } // main

conditional operator behaviour on vector operand
Meteorhead Jul 23, 2010 12:56 PM (in response to Meteorhead)I have figured that simply saying
random = rand(...)
factor = select(look for "10")
random = rand(...)
factor = select(look for "01")
will produce wrong behaviour on the level of the algorithm. It should rather be factor =, so that one doesn't overwrite the other.
Anyway the code should be able to compile.

conditional operator behaviour on vector operand
Meteorhead Jul 25, 2010 12:42 PM (in response to Meteorhead)I have found the cause of the Irreducible controlflow.
When one makes nested for cycles only the outermost cycle can have the length given as a parameter, all cycles inside must have hardcoded length. No matter how constant I try to make the length of the loop, it only compiles if I have it hardcoded. (Which is definately not flexible)
Inside my code is a nested cycle that looks like the following. 'timestep' is a constant argument that the kernel gets. 'vectors_per_thread' was originally a calculated constant by each parameter (identical to all threads, not just those inside a workgroup), now it is passed on as a constant argument also but it still doesn't want to compile.
It should be able to run cause NVIDIA compilers manage to create a running code out of it.
Could you tell me if this is normal behaviour?
for(int t = 0 ; t < timestep ; ++i) { for(int v = 0 ; v < vectors_per_thread ; ++v) { ... } }

conditional operator behaviour on vector operand
nou Jul 28, 2010 3:46 PM (in response to Meteorhead)i have seen several irreduciable control flow on the forum. i think this is limitation of current SDK.

conditional operator behaviour on vector operand
douglas125 Jul 28, 2010 5:07 PM (in response to nou)Hi
I have had this same issue. I've posted it some time ago and I'm posting the way I found to circumvent the driver limitation. The code attached shows how to loop i from 0 to imax and j from 0 to jmax without irreducible controlflow.
Just make sure to double check if this loop works for you as it is. It should solve the problem while a final solution doesn't come.
int i = 0; int j = 0; for (int kk = 0; kk < imax*jmax; kk++) { //Do your stuff int cor = DerivData[x + j + xfmx0 * (y + i)]; RowSig[i][j] += (float)cor * (float)corCentro; j++; if (j>jmax) //switching i { j = 0; i++; } }






