Archives Discussions

Meteorhead · ‎07-21-2010

Hi all!

I wanted to neatly vectorize a highly paralellable physics code, which requires only bit operations, so when one vector processor goes through an integer bit by bit to process it, it can do it with 3 neighbouring integers also, all operations are identical.

There is one point where flow control would be needed, but since only the outcome of a single variable depends on it, I thought of avoiding it with the "?" operator. The OpenCL computing doc mentioned a subtle difference between vector and scalar conditional operator behaviour. My first question is simple C:

Where does it matter what the output of the "?" operator is when you must always specify what the output should be through the ":" statement after?

Second question:

If used on vector types, can the outcome be different separately on each element? If I wanted to compare two vectors, and have the result differ by elements based on the elements of the input vectors, is there any way to achieve this without flow control?

nou · ‎07-21-2010

look at select() function.

Meteorhead · ‎07-23-2010

Thank you for the quick reply, however this did not the issue I was facing.

I was using the ?: operator in the first approach with vector types to achieve my goal, but the compiler gave me an "Error:E010:Irreducible ControlFlow Detected". However the error persists even with the select() function. The problem is becoming urgent, so if somebody could give an answer I would be grateful. Here is the exact line that causes the problem (I guess).

factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD - 2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 );

This is the heart of the algorithm I wrote (inside a for cycle(i)). It is not simple, so let me explain real quick to save you time: I look through an integer on bit level by groups of two to find sequences of "10". If a group of two bits matches this sequence AND random is less than the lesser out of p and rp, than factor be equal to three, otherwise it should be zero.

I use ((A+B) == 2) instead of (A && B) in the conditional part because NVIDIA has not implemented && operator on vectors (no comment). all types in the code are uint4 except random,p,rp which are float4. Naturally in each element of this vector-expression the results may differ but no if statements are included.

Is the problem really irreducible or only my code?

omkaranathan · ‎07-23-2010

Meteorhead,

Please post your kernel code.

Meteorhead · ‎07-23-2010

The code is long and it utilizes the sample Mersenne-twister PRNG.

The most important part of the code is the last big for cycle: for (uint t = 0 ; t < timestep ; ++t)

This is where the algorithm runs. In Odd timestep both sequences of "10" and "01" are looked for. In Even timesteps communication is needed with neighbouring vectors (only the last bit).

Hope is it understandable.

// Vector type shifting with carryover between elements void lshift128(uint4 input, uint shift, uint4* output) { unsigned int invshift = 32u - shift; uint4 temp; temp.x = input.x << shift; temp.y = (input.y << shift) | (input.x >> invshift); temp.z = (input.z << shift) | (input.y >> invshift); temp.w = (input.w << shift) | (input.z >> invshift); *output = temp; } void rshift128(uint4 input, uint shift, uint4* output) { unsigned int invshift = 32u - shift; uint4 temp; temp.w = input.w >> shift; temp.z = (input.z >> shift) | (input.w << invshift); temp.y = (input.y >> shift) | (input.z << invshift); temp.x = (input.x >> shift) | (input.y << invshift); *output = temp; } __constant unsigned int thirteen = 13u; __constant unsigned int fifteen = 15u; __constant unsigned int shift = 8u * 3u; __constant unsigned int mask11 = 0xfdff37ffu; __constant unsigned int mask12 = 0xef7f3f7du; __constant unsigned int mask13 = 0xff777b7du; __constant unsigned int mask14 = 0x7ff7fb2fu; __constant float one = 1.0f; __constant float intMax = 4294967296.0f; inline float4 rand(uint4* r1, uint4* r2, uint4* a, uint4* b, uint4* e, uint4* f, uint4* state1, uint4* state2, uint4* state3, uint4* state4, uint4* state5, uint4* temp) { *r1 = *state4; *r2 = *state5; *a = *state1; *b = *state3; lshift128(*a, shift, e); rshift128(*r1, shift, f); (*temp).x = (*a).x ^ (*e).x ^ (((*b).x >> thirteen) & mask11) ^ (*f).x ^ ((*r2).x << fifteen); (*temp).y = (*a).y ^ (*e).y ^ (((*b).y >> thirteen) & mask12) ^ (*f).y ^ ((*r2).y << fifteen); (*temp).z = (*a).z ^ (*e).z ^ (((*b).z >> thirteen) & mask13) ^ (*f).z ^ ((*r2).z << fifteen); (*temp).w = (*a).w ^ (*e).w ^ (((*b).w >> thirteen) & mask14) ^ (*f).w ^ ((*r2).w << fifteen); return convert_float4(temp[0]) * one / intMax; } // rand // Chainlength is the number of vectors __kernel void test( __global uint4* seedArray, __global uint4* particles, uint chainlength, __local uint4* share, __global uint4* partial, uint timestep, float4 p, float4 q, __global float4* r_p, __global float4* r_q, __local float4* rp, __local float4* rq) { // Global init int gidX = get_global_id(0); int tidX = get_local_id(0); int lsiX = get_local_size(0); int GIDX = get_group_id(0); // KPZ init uint4 particle; uint vectors_per_thread = chainlength / lsiX; uint4 XOR = (uint4)(0); uint4 factor = (uint4)(0); uint SD = 8 * sizeof(uint); uint4 zero4 = (uint4)(0); uint my_share; for (uint v = 0 ; v < vectors_per_thread ; ++v) { my_share = tidX * vectors_per_thread + v; share[my_share] = particles[my_share]; rp[my_share] = r_p[my_share]; rq[my_share] = r_q[my_share]; } // PRNG init uint4 temp; uint4 state1 = seedArray[GIDX * lsiX + tidX]; uint4 state2 = (uint4)(0); uint4 state3 = (uint4)(0); uint4 state4 = (uint4)(0); uint4 state5 = (uint4)(0); uint stateMask = 1812433253u; uint thirty = 30u; uint4 mask4 = (uint4)(stateMask); uint4 thirty4 = (uint4)(thirty); uint4 one4 = (uint4)(1u); uint4 two4 = (uint4)(2u); uint4 three4 = (uint4)(3u); uint4 four4 = (uint4)(4u); uint4 r1 = (uint4)(0); uint4 r2 = (uint4)(0); uint4 a = (uint4)(0); uint4 b = (uint4)(0); uint4 e = (uint4)(0); uint4 f = (uint4)(0); //Initializing states. state2 = mask4 * (state1 ^ (state1 >> thirty4)) + one4; state3 = mask4 * (state2 ^ (state2 >> thirty4)) + two4; state4 = mask4 * (state3 ^ (state3 >> thirty4)) + three4; state5 = mask4 * (state4 ^ (state4 >> thirty4)) + four4; float4 random; barrier(CLK_LOCAL_MEM_FENCE); // All threads wait for each other for (uint t = 0 ; t < timestep ; ++t) { for (uint v = 0 ; v < vectors_per_thread ; ++v) { my_share = tidX * vectors_per_thread + v; particle = share[my_share]; // Odd timestep for (uint i = 0 ; i < SD ; i += 2) { random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD - 2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 ); random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD - 2) ) == one4) + abs(isless(random,min(q,rq[my_share])) ) ) == two4 ); XOR = XOR << two4; XOR |= factor; } particle ^= XOR; share[my_share] = particle; mem_fence(CLK_LOCAL_MEM_FENCE); // Local memory operations before and after fence don't mix // Even timestep lshift128(particle, 1, &particle); particle.x |= share[(my_share + 1) % chainlength].w >> (SD - 1); for (uint i = 0 ; i < SD ; i += 2) { random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD - 2) ) == two4) + abs(isless(random,min(p,rp[my_share])) ) ) == two4 ); random = rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, &temp); factor = select(zero4, three4,( (uint4)( ((particle << (uint4)i) >> (uint4)(SD - 2) ) == one4) + abs(isless(random,min(q,rq[my_share])) ) ) == two4 ); XOR = XOR << two4; XOR |= factor; } particle ^= XOR; rshift128(particle, 1, &particle); share[my_share] = particle; share[(my_share + 1) % chainlength].w |= particle.x << (SD - 1); mem_fence(CLK_LOCAL_MEM_FENCE); } // vectors_per_thread } // timestep for (uint v = 0 ; v < vectors_per_thread ; ++v) { partial[GIDX * chainlength + tidX * vectors_per_thread + v] = share[tidX * vectors_per_thread + v];} } // main

Meteorhead · ‎07-23-2010

I have figured that simply saying

random = rand(...)

factor = select(look for "10")

random = rand(...)

factor = select(look for "01")

will produce wrong behaviour on the level of the algorithm. It should rather be factor |=, so that one doesn't overwrite the other.

Anyway the code should be able to compile.

Meteorhead · ‎07-25-2010

I have found the cause of the Irreducible control-flow.

When one makes nested for cycles only the outermost cycle can have the length given as a parameter, all cycles inside must have hardcoded length. No matter how constant I try to make the length of the loop, it only compiles if I have it hardcoded. (Which is definately not flexible)

Inside my code is a nested cycle that looks like the following. 'timestep' is a constant argument that the kernel gets. 'vectors_per_thread' was originally a calculated constant by each parameter (identical to all threads, not just those inside a workgroup), now it is passed on as a constant argument also but it still doesn't want to compile.

It should be able to run cause NVIDIA compilers manage to create a running code out of it.

Could you tell me if this is normal behaviour?

for(int t = 0 ; t < timestep ; ++i) { for(int v = 0 ; v < vectors_per_thread ; ++v) { ... } }

nou · ‎07-28-2010

i have seen several irreduciable control flow on the forum. i think this is limitation of current SDK.

douglas125 · ‎07-28-2010

Hi

I have had this same issue. I've posted it some time ago and I'm posting the way I found to circumvent the driver limitation. The code attached shows how to loop i from 0 to imax and j from 0 to jmax without irreducible controlflow.

Just make sure to double check if this loop works for you as it is. It should solve the problem while a final solution doesn't come.

int i = 0; int j = 0; for (int kk = 0; kk < imax*jmax; kk++) { //Do your stuff int cor = DerivData[x + j + xfmx0 * (y + i)]; RowSig += (float)cor * (float)corCentro; j++; if (j>jmax) //switching i { j = 0; i++; } }

Archives Discussions

conditional operator behaviour on vector operand