With a built-in bool4 type I could speed up my raytracing by nearly 5%, please!
I need it to precompute which ray direction components are < 0.
I tried to define my own structure but, unfortunately, the compiler seems to store it in global memory instead of using registers.
What about step? From Page 171 of this spec:
www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf
gentype step (gentype edge, gentype x)
Returns 0.0 if x < edge, otherwise it returns 1.0.
Could you do something like:
float4 results = step((float4)(0.0f), ray_vector);
?
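To make the quoted definition concrete, here is a minimal sketch in plain C that emulates a lane-wise step on four floats. The names vec4f and step4 are hypothetical stand-ins for float4 and the built-in step; they are not OpenCL API.

```c
#include <stddef.h>

/* Hypothetical plain-C stand-in for OpenCL's float4. */
typedef struct { float s[4]; } vec4f;

/* Lane-wise step as quoted from the spec:
   0.0 if x < edge, otherwise 1.0. */
static vec4f step4(vec4f edge, vec4f x)
{
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.s[i] = (x.s[i] < edge.s[i]) ? 0.0f : 1.0f;
    return r;
}
```

Note that with an edge of 0.0 the negative lanes get 0.0 and all other lanes (including exactly 0.0) get 1.0, so the result marks the non-negative components.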
Originally posted by: kbrafford What about step?
There's a "sign" built-in function. The problem is that every computation based on the float4 still requires a < 0.0f test, and I need a direct boolean value to save that comparison. Example:
Ideal:
bool4 mybool4 = ....
for ()...
{
if ( mybool4 [ j] )
{
}
else
{
}
}
currently:
float4 signs = sign(rayDir)....
for ()...
{
if ( signs [j ] < 0.0f )
{
}
else
{
}
}
How complicated is each path of that if statement? Can you post a more detailed example? If it's simple enough, can't you eliminate the if statement altogether and take advantage of the 1.0 and 0.0 given by the step function?
Originally posted by: kbrafford How complicated is each path of that if statement? Can you post a more detailed example? If it's simple enough, can't you eliminate the if statement altogether and take advantage of the 1.0 and 0.0 given by the step function?
Yes, I could use some kind of trick. But the point of this post is why there is no built-in bool4 type which, in my case, will be very useful.
Yes, I could use some kind of trick. But the point of this post is why there is no built-in bool4 type which, in my case, will be very useful.
And my point is that with data parallel programming you are supposed to start thinking differently about how you do things. It is not a "trick" to replace a branch that is running on hundreds of processors with calculations.
How complicated is each path of that if statement?
I'm afraid the branches there are quite complex.
Too complicated to post?
Originally posted by: kbrafford Too complicated to post?
Yep, it's complicated, plus I don't have the rights to post the code. Just assume the code there is very complex.
With a built-in bool4 I could save one floating point comparison per loop iteration (and the loop is quite large too). I hope bool4 will be fully supported in the CL 1.1 spec.
Well, if one floating point comparison is 5% of your processing, then it can't be too complicated 😉
I understand the proprietariness issue. Thinking out of the box here, can you move to an algorithm more like this, with no loop, where you get to keep using vectorized computations?
// 1.0 in the slots where ray_vector >= 0.0, 0.0 in the negative slots
float4 ifclause_factor = step((float4)(0.0f), ray_vector);
float4 elseclause_factor = (float4)(1.0f) - ifclause_factor;
// do the if clause work first
float4 ifclause_work = /* some secret work you are doing, assuming
                          all slots in the float4 are going to take the
                          if branch */;
// now do the else clause work
float4 elseclause_work = /* other secret work you are doing, assuming
                            all slots in the float4 are going to take the
                            else branch */;
// then commit the results using the step factors
actual_results = ifclause_factor * ifclause_work +
                 elseclause_factor * elseclause_work;
Wait! I'm stupid
I can precompute the ray dirs as an int4 and then
int4 rd = (int4)((rDir.x<0.0f)?1:0, ................ )
for (...)
{
if ( rd[ j ]!=0 )
{
}
else
{
}
}
but I still want bool4 as syntax sugar!
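As a rough illustration of the idea above, this plain-C sketch computes an integer mask once per ray and then tests only integers inside the loop. As far as I know, OpenCL C relational operators on vector operands already return an int4 with -1 in the true lanes and 0 in the false ones, which is why -1 is used here. The names vec4f, vec4i, negative_mask and count_negative_path are hypothetical, and the loop body is only a placeholder for the real per-branch work.

```c
#include <stddef.h>

typedef struct { float s[4]; } vec4f;   /* hypothetical stand-in for float4 */
typedef struct { int   s[4]; } vec4i;   /* hypothetical stand-in for int4   */

/* Computed once per ray: non-zero in lanes where the direction is negative. */
static vec4i negative_mask(vec4f dir)
{
    vec4i m;
    for (size_t i = 0; i < 4; ++i)
        m.s[i] = (dir.s[i] < 0.0f) ? -1 : 0;
    return m;
}

/* Placeholder inner loop: counts how often the "negative" path is taken
   over `steps` iterations, testing only the precomputed integers. */
static int count_negative_path(vec4i mask, int steps)
{
    int taken = 0;
    for (int k = 0; k < steps; ++k)
        for (size_t j = 0; j < 4; ++j)
            if (mask.s[j] != 0)
                ++taken;
    return taken;
}
```

The float comparison happens four times in negative_mask regardless of how many iterations the loop runs.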
Is the loop over the 4 slots of the float4 data?
I still think it is worth the effort to try and get rid of the loop and branch. Remember, in a GPU all work items end up taking both the if and the else clause anyway. Don't settle for 5% improvement. Go for 400% 🙂
Originally posted by: MicahVillmow bubu, Keep in mind that on many pieces of hardware, a boolean value is represented as an integer.
Uhm, not as 8-bit words (so a byte/char)?