I have a vector of 40000 elements that I want to process, but I need the following condition (an if):
uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
const uint range=200;
uint index = tid*range+tid1;
if(index>200)
{
output_Tay[index]=index;
}
But this kernel writes only 7979 elements instead of the expected 39799 (indices 201 to 39999), and I can't understand why. Is there something special about using if in kernels?
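As a sanity check on the expected count, here is a minimal host-side sketch in plain C (not OpenCL) that walks the same 200x200 range sequentially and counts how many flattened indices pass the test. The function name count_passing and its parameters are made up for illustration.

```c
#include <stddef.h>

/* Counts the work-items of a dim0 x dim1 global range whose flattened
   index passes the test `index > threshold`. Hypothetical helper that
   emulates the kernel's index arithmetic on the host. */
size_t count_passing(size_t dim0, size_t dim1, size_t threshold)
{
    size_t count = 0;
    for (size_t tid = 0; tid < dim0; ++tid) {
        for (size_t tid1 = 0; tid1 < dim1; ++tid1) {
            size_t index = tid * dim1 + tid1;  /* same formula as tid*range+tid1 */
            if (index > threshold)
                ++count;
        }
    }
    return count;
}
```

Calling count_passing(200, 200, 200) gives 39799, because indices 0 through 200 (201 values) fail the test.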
What is your global range?
The global range is {200;200}. I don't think that's the problem, because if I use the kernel without the condition, it processes all 40000 elements. That seems very strange to me.
If you remove the if, does the output equal 0, 1, 2, ... 39999? If not, your indexing is messed up. There is nothing magical about if statements with respect to correctness. Branch divergence only affects performance. Can you try using a 1D global range of size 40,000?
Originally posted by: rick.weber If you remove the if, does the output equal 0, 1, 2, ... 39999? If not, your indexing is messed up.
Yes, it does. If I remove the if, everything is OK and it processes all 40000 values. I verified it with this simple operation:
output_Tay[index]=index;
index = tid*get_global_size([0|1])+tid1;
That still doesn't help. I use it like this:
index = tid*get_global_size(1)+tid1;
And again only 7979 elements were calculated.
I think you are messing up the indexing.
AFAIK it should be tid1*range + tid
or get_global_id(1) * get_global_size(0) + get_global_id(0)
Originally posted by: himanshu.gautam I think you are messing up the indexing.
AFAIK it should be tid1*range + tid
or get_global_id(1) * get_global_size(0) + get_global_id(0)
You're just transposing from column-major to row-major (or vice versa). If the global dimensions are 200x200, then you should get the same results up to transposition.
Jeff
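To illustrate the transposition point, here is a small host-side sketch in plain C (not OpenCL) that emulates both flattening formulas over a 200x200 range. It checks that each one visits every index exactly once, and that they differ only by swapping the two ids. The names DIM, flat_row_major, flat_col_major and covers_all are invented for this example.

```c
#include <stddef.h>

enum { DIM = 200 };  /* global range per dimension, as stated in the thread */

/* Flattening used in the original kernel: id0 * range + id1. */
size_t flat_row_major(size_t id0, size_t id1) { return id0 * DIM + id1; }

/* The suggested alternative: id1 * range + id0. */
size_t flat_col_major(size_t id0, size_t id1) { return id1 * DIM + id0; }

/* Returns 1 if `flatten` maps the DIM x DIM range onto every index in
   [0, DIM*DIM) exactly once, and 0 otherwise. */
int covers_all(size_t (*flatten)(size_t, size_t))
{
    static unsigned char seen[DIM * DIM];
    for (size_t i = 0; i < DIM * DIM; ++i)
        seen[i] = 0;
    for (size_t id0 = 0; id0 < DIM; ++id0) {
        for (size_t id1 = 0; id1 < DIM; ++id1) {
            size_t idx = flatten(id0, id1);
            if (idx >= DIM * DIM || seen[idx])
                return 0;  /* out of range, or index written twice */
            seen[idx] = 1;
        }
    }
    return 1;  /* DIM*DIM distinct in-range indices: full coverage */
}
```

Both covers_all(flat_row_major) and covers_all(flat_col_major) return 1, and flat_row_major(a, b) equals flat_col_major(b, a). For a square range, then, the choice of formula does not by itself explain missing elements.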
I KNOW that I'm messing up the indexes, but I can't find where. I just can't explain the fact that if I use a kernel like this:
uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
uint index;
index = tid*get_global_size(0)+tid1;
output_Tay[index]=index;
barrier(CLK_LOCAL_MEM_FENCE);
then it works and processes all 40000 elements, but if I just add a simple if:
uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
uint index;
index = tid*get_global_size(0)+tid1;
if (index>200)
{
output_Tay[index]=index;
barrier(CLK_LOCAL_MEM_FENCE);
}
then it doesn't work. In that case only one in five elements is recorded in output_Tay. Recording does begin at number 200, but why only one in five elements? The element numbers are 205,210,215,200 and so on. Something really, really strange is going on.
output_Tay is a global buffer? If so, then the barrier isn't needed (and is the wrong barrier since you are writing to global memory, not local) because no threads write to the same address.
Are you running this on the CPU or GPU? What SDK are you using? What driver do you have installed?
Jeff
Yeah, output_Tay is a global buffer. I'm running it on the CPU (simulation) with the ATI Stream SDK.
That barrier is bad. A barrier must be hit by all work-items in a work-group, or it will lead to undefined behaviour.
Yeah, nou seems to be right.
A barrier should not be placed inside an if block. If not all work-items take the same path at the if (which is the case here), you can expect undefined behaviour.
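The rule can be made concrete with a host-side sketch in plain C (not OpenCL) that counts how many work-items of one work-group would reach a barrier placed inside `if (index > threshold)`. The thread does not state the work-group shape, so the sketch assumes, purely for illustration, that a group covers a contiguous run of flat indices. The helper name barrier_is_uniform is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 when a barrier inside
   `if (index > threshold)` would be reached by every work-item of the
   work-group or by none of them, and 0 when the group diverges.
   Assumption (not from the thread): the group covers the contiguous
   flat indices [first, first + size). */
int barrier_is_uniform(size_t first, size_t size, size_t threshold)
{
    size_t reached = 0;
    for (size_t i = 0; i < size; ++i) {
        size_t index = first + i;
        if (index > threshold)
            ++reached;  /* this work-item would execute the barrier */
    }
    return reached == 0 || reached == size;
}
```

Under that assumption, with a group size of 64 and a threshold of 200, the group starting at index 192 diverges: indices 192 to 200 skip the barrier while 201 to 255 reach it. That is the situation the OpenCL rule forbids. Moving the barrier outside the if, or removing it when no data is shared between work-items, avoids the problem.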
All right! The barrier is removed and I get a good result! Now I have more questions about barriers. I have a simple kernel like this:
__kernel void KernelTay(__global float * input_Tay,
__global float * output_Tay)
{
uint tid = get_global_id(0);
uint tid1 = get_global_id(1);
uint index;
index = tid*get_global_size(0)+tid1;
//saving to local memory
float Tay= input_Tay[index];
barrier(CLK_LOCAL_MEM_FENCE);
if(index>get_global_size(0))
{
//some operations with Tay
}
barrier(CLK_LOCAL_MEM_FENCE);
output_Tay[index]=Tay;
}
Am I using barriers correctly, or should I just go and read the specification one more time?
Yes, every work-item in a work-group must hit the SAME barrier.
Even this is bad if work-items diverge and take different paths:
if(some_condition)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
else
{
barrier(CLK_LOCAL_MEM_FENCE);
}
And your example is wrong. First, you use a local barrier when you work with global memory (it will work, but it is meaningless).
Second, a barrier is needed only if you modify global/local memory and another work-item will use the modified value; otherwise the barrier is unnecessary. For example:
__kernel void kkk(__global int *buf)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    __local int mem[64]; // sized for a work-group of 64 work-items
    mem[lid] = buf[gid];
    barrier(CLK_LOCAL_MEM_FENCE); // ensure every work-item sees the values written to mem
    int s = 0;
    for (int i = 0; i < 64; i++)
        s += mem[i];
    buf[gid] = s;
    // barrier(CLK_GLOBAL_MEM_FENCE); // another barrier if you want to use the modified buf[gid] value in another work-item
    // working with buf[gid+-local_range]
}
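To see why the barrier between the write to mem and the summation loop matters, the kernel above can be emulated sequentially in plain C (not OpenCL) for a single work-group. This is only a sketch: the name emulate_group is hypothetical, the group size of 64 is taken from the mem[64] declaration, and the boundary between the two loops stands in for barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <stddef.h>

enum { GROUP = 64 };  /* work-group size implied by mem[64] in the kernel */

/* Emulates one work-group of the kernel; buf must hold GROUP elements. */
void emulate_group(int *buf)
{
    int mem[GROUP];  /* stands in for the __local array */

    /* Phase 1: every work-item copies its own element into local memory. */
    for (size_t lid = 0; lid < GROUP; ++lid)
        mem[lid] = buf[lid];

    /* The end of the loop above plays the role of the barrier: no
       work-item starts phase 2 before all of them have finished phase 1. */

    /* Phase 2: every work-item sums the whole local array, including
       values written by the other work-items. */
    for (size_t lid = 0; lid < GROUP; ++lid) {
        int s = 0;
        for (size_t i = 0; i < GROUP; ++i)
            s += mem[i];
        buf[lid] = s;
    }
}
```

If buf initially holds 0, 1, ..., 63, every element ends up as 2016, the sum of 0 to 63. On a real device, without the barrier, a work-item could start summing before its neighbours have written their mem entries, so the result would not be guaranteed.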
Thanks a lot! I think I need to learn much more about memory model.