Barsik107
Journeyman III

IF in OpenCL kernel

I have a vector of 40000 elements. I want to process it, but I need the following condition (IF):

uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
const uint range=200;
index = tid*range+tid1;
if(index>200)
{
    output_Tay[index]=index;
}

But this kernel processes only 7979 elements instead of 39800, and I can't understand why. Is there something special about using IF in kernels?

15 Replies
nou
Exemplar

What is your global range?

The global range is {200;200}. I don't think that's the problem, because if I use the kernel without the condition, it processes all 40000 elements. That seems very strange to me.

If you remove the if, does the output equal 0, 1, 2, ... 40000? If not, your indexing is messed up. There is nothing magical about if statements with respect to correctness. Branch divergence only affects performance. Can you try using a 1D global range of size 40,000?
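For reference, here is a small host-side C sketch (plain C, not OpenCL, and not code from this thread) that emulates the 200x200 global range with two loops and counts how many elements the `if (index > 200)` version ought to write. The function name `count_written` and its parameters are made up for illustration. If the index arithmetic is right, the count should be 39799 (indices 201 through 39999; `index >= 200` would give 39800), so a result like 7979 points at something other than the indexing.

```c
#include <stddef.h>

/* Emulates every work-item of a size0 x size1 NDRange running
 *     index = tid * range + tid1;
 *     if (index > threshold) output[index] = index;
 * and returns how many elements of output get written.
 * Assumes size1 <= range and that output holds at least
 * size0 * range elements. */
size_t count_written(unsigned *output, unsigned size0, unsigned size1,
                     unsigned range, unsigned threshold)
{
    size_t written = 0;
    for (unsigned tid = 0; tid < size0; ++tid) {
        for (unsigned tid1 = 0; tid1 < size1; ++tid1) {
            unsigned index = tid * range + tid1;
            if (index > threshold) {
                output[index] = index;
                ++written;
            }
        }
    }
    return written;
}
```

Calling `count_written(out, 200, 200, 200, 200)` on a zeroed 40000-element buffer gives the number to compare against what the device actually produced.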


Originally posted by: rick.weber If you remove the if, does the output equal 0, 1, 2, ... 40000? If not, your indexing is messed up. 


Yes, it does. If I remove the IF, everything is OK: it processes all 40000 values. I checked with this simple operation:

 output_Tay[index]=index;

index = tid*get_global_size([0|1])+tid1; 


That still doesn't help. I used it like this:

index = tid*get_global_size(1)+tid1; 

And still only 7979 elements were calculated.

I think you are messing up the indexing.

AFAIK it should be tid1*range + tid

or get_global_id(1) * get_global_size(0) + get_global_id(0)

Originally posted by: himanshu.gautam I think you are messing up the indexing.

AFAIK it should be tid1*range + tid

or get_global_id(1) * get_global_size(0) + get_global_id(0)

You're just transposing from column-major to row-major (or vice versa).  If the global dimensions are 200x200, then you should get the same results up to transposition.

Jeff


I KNOW that I'm messing up the indexes, but I can't find where. I just can't explain the fact that if I use a kernel like this:

uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
uint index;
index = tid*get_global_size(0)+tid1;
output_Tay[index]=index;
barrier(CLK_LOCAL_MEM_FENCE);

then it works and processes all 40000 elements, but if I just add a simple IF:

uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
uint index;
index = tid*get_global_size(0)+tid1;
if (index>200)
{
    output_Tay[index]=index;
    barrier(CLK_LOCAL_MEM_FENCE);
}

then it doesn't work. In that case only one in five elements is recorded in output_Tay. Recording does begin from number 200, but why only one in five elements? The element numbers are 205,210,215,200 and so on. Something really, really strange is going on.

output_Tay is a global buffer?  If so, then the barrier isn't needed (and is the wrong barrier since you are writing to global memory, not local) because no threads write to the same address.

Are you running this on the CPU or GPU?  What SDK are you using?  What driver do you have installed?

Jeff


Yeah, output_Tay is a global buffer. I'm running it on the CPU (simulation), with the ATI Stream SDK.

That barrier is bad. A barrier must be hit by all work-items in the work-group, or it will lead to undefined behaviour.

Yeah, nou seems to be right.

A barrier should not be placed inside an if block unless every work-item in the work-group takes the same path. If the work-items do not all take the same route at the if (which is the case here), you can expect undefined behaviour.
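To make the divergence concrete, here is a rough host-side C sketch (plain C, not OpenCL) that counts how many work-items of one work-group would reach a barrier placed inside `if (index > threshold)`. The thread never states the work-group size, so the size of 64 used in the note below and the assumption that a group covers a contiguous run of indices are purely illustrative; the names `barrier_arrivals` and `barrier_is_uniform` are hypothetical.

```c
#include <stdbool.h>

/* Counts how many work-items of one work-group would reach a barrier
 * placed inside `if (index > threshold)`. The group is assumed to cover
 * the contiguous indices first .. first + group_size - 1. */
unsigned barrier_arrivals(unsigned first, unsigned group_size,
                          unsigned threshold)
{
    unsigned arrivals = 0;
    for (unsigned lid = 0; lid < group_size; ++lid) {
        unsigned index = first + lid;
        if (index > threshold)
            ++arrivals;
    }
    return arrivals;
}

/* The barrier is reached consistently only when every work-item of the
 * group reaches it, or none of them does. */
bool barrier_is_uniform(unsigned first, unsigned group_size,
                        unsigned threshold)
{
    unsigned arrivals = barrier_arrivals(first, group_size, threshold);
    return arrivals == 0 || arrivals == group_size;
}
```

Under those assumptions, a group starting at index 192 with 64 work-items straddles the threshold of 200: only some of its work-items reach the barrier, which is exactly the situation to avoid. Groups that lie entirely above or entirely below the threshold are fine.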


All right! The barrier is removed and I get the correct result! Now I have another question about barriers. I have a simple kernel like this:

__kernel void KernelTay(__global  float * input_Tay,
                        __global  float * output_Tay)
{
    uint tid = get_global_id(0);
    uint tid1 = get_global_id(1);
    uint index;
    index = tid*get_global_size(0)+tid1;

    //saving to local memory
    float Tay= input_Tay[index];
    barrier(CLK_LOCAL_MEM_FENCE);

    if(index>get_global_size(0))
    {
        //some operations with Tay
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    output_Tay[index]=Tay;
}

Am I using barriers correctly, or should I just go and read the specification one more time?

Yes, every work-item in the work-group must hit the SAME barrier.

Even this is bad if the work-items diverge and take different paths:

if(some_condition)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
else
{
barrier(CLK_LOCAL_MEM_FENCE);
}

And your example is wrong. First, you use a local barrier when you work with global memory (it will work, but it is meaningless).

Second, a barrier is needed only if you modify global/local memory and then use the modified value in another work-item. Otherwise the barrier is unnecessary. For example:

__kernel void kkk(__global int *buf)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    __local int mem[64]; // assumes a work-group size of 64
    mem[lid] = buf[gid];
    barrier(CLK_LOCAL_MEM_FENCE); // barrier to ensure that every work-item sees the values written by the others
    int s=0;
    for(int i=0;i<64;i++)
        s+=mem[i];
    buf[gid] = s;
    //barrier(CLK_GLOBAL_MEM_FENCE); another barrier if you want to use the modified buf[gid] value in another work-item
    //working with buf[gid+-local_range]
}

Thanks a lot! I think I need to learn much more about the memory model.

It looks like your indexing might be messed up. Please use this calculation instead:
index = tid*get_global_size([0|1])+tid1;

This will give you the value for the x dimension of your launch size instead of a hard coded number.
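Which of the two sizes to use only matters once the range is not square. The host-side C sketch below (plain C, not OpenCL; the function name `covers_all` and the `row_width` parameter are made up for illustration) emulates the range with two loops and checks whether `index = tid * row_width + tid1` touches every element exactly once. It assumes tid comes from get_global_id(0) and tid1 from get_global_id(1), as in the kernels in this thread.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if index = tid * row_width + tid1 visits every element of
 * a size0 x size1 buffer exactly once, emulating the NDRange with loops.
 * tid runs over dimension 0 and tid1 over dimension 1. */
bool covers_all(unsigned size0, unsigned size1, unsigned row_width)
{
    size_t n = (size_t)size0 * size1;
    unsigned char *seen = calloc(n, 1);
    bool ok = (seen != NULL);
    for (unsigned tid = 0; ok && tid < size0; ++tid) {
        for (unsigned tid1 = 0; ok && tid1 < size1; ++tid1) {
            size_t index = (size_t)tid * row_width + tid1;
            if (index >= n || seen[index])
                ok = false;      /* out of range or written twice */
            else
                seen[index] = 1;
        }
    }
    free(seen);
    return ok;
}
```

For the 200x200 range both sizes are 200, so either choice covers all 40000 elements. For a 2x3 range, a row width of 3 (the size of dimension 1) works, while a row width of 2 (the size of dimension 0) makes two work-items write the same element.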