Hi,
First, my apologies: this is probably a classic newbie mistake. To help me visualize what is going on between local (compute unit) memory and global memory, I thought I'd do something very simple. My local work-group size is 256, I'm using an R9 390 card, and I'm explicitly setting the global work size (array size + 1) in the host code. The interesting thing is that once my arrays get bigger than 5-10k floating-point elements, I start to get erratic behaviour. Often in this range the second run works while the first fails; when I go bigger, to 100,000 elements, the result is almost random. I truly believe it's the kernel and not the host code, because I can append one line of code to overwrite one value in the output array, and the result is always consistent with what I've overwritten it with, irrespective of the input array sizes.
I know the code is rubbish and, as you can see, I was confused/desperate with the local and global fences. The code was written mainly to help me get my head around OpenCL and the GPU, but this issue is beyond me.
Thanks
Sean
__kernel void dot_product(__global float* a_vec, __global float* b_vec, __global float* output, __local float* partial_dot)
{
    int Offset = 256;
    int gid = get_global_id(0);
    int globalSize = get_global_size(0);
    int lid = get_local_id(0);
    int localSize = get_local_size(0);
    //local float* partial_dot;

    /* Copy this work-item's two operands from global into local memory */
    partial_dot[lid] = a_vec[gid];
    partial_dot[lid + Offset] = b_vec[gid];
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    /* Multiply the pair held in local memory and store the product in global memory */
    output[gid] = partial_dot[lid] * partial_dot[lid + Offset];
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    /* Transfer final result to global memory, only want ONE thread to sum */
    if (gid == 0) {
        for (int Count = 1; Count <= globalSize; Count++)
            output[0] += output[Count];
    }
    //output[0] = globalSize;
}
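
One likely cause of the erratic results: barrier() only synchronizes the work-items inside one work-group, so with a local size of 256 the gid == 0 loop above can read output[] entries that other work-groups have not written yet. The loop bound Count <= globalSize also reads one element past the last index any work-item writes. The usual way around this is a two-stage reduction: each work-group reduces its own products to a single partial sum in local memory, and the partial sums are added only after every group has finished (on the host, or in a second kernel launch). Below is a minimal sketch of that pattern as a plain C model that runs on the CPU, not a drop-in kernel. The names round_up, group_partial_dot and dot_product_model are hypothetical, and the sketch assumes a power-of-two work-group size and a global size rounded up to a multiple of it (which OpenCL 1.x requires when a local size is given).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to a multiple of the work-group size. */
size_t round_up(size_t n, size_t local_size)
{
    return (n + local_size - 1) / local_size * local_size;
}

/* Models ONE work-group. The loops over lid stand in for the work-items
 * of the group; on a GPU a barrier(CLK_LOCAL_MEM_FENCE) would sit at each
 * marked point. scratch plays the role of __local memory and must hold
 * local_size floats. */
float group_partial_dot(const float *a, const float *b, size_t n,
                        size_t group_id, size_t local_size, float *scratch)
{
    /* Assumption: the tree reduction below needs a power-of-two size. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    for (size_t lid = 0; lid < local_size; lid++) {
        size_t gid = group_id * local_size + lid;
        /* Work-items past the end of the arrays contribute zero. */
        scratch[lid] = (gid < n) ? a[gid] * b[gid] : 0.0f;
    }
    /* barrier point: all products are now in local memory */

    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; lid++)
            scratch[lid] += scratch[lid + stride];
        /* barrier point: finish this step before halving the stride */
    }
    return scratch[0]; /* one partial sum per work-group */
}

/* Stage 2: add the per-group partial sums once every group is done. */
float dot_product_model(const float *a, const float *b, size_t n,
                        size_t local_size, float *scratch)
{
    size_t num_groups = round_up(n, local_size) / local_size;
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; g++)
        total += group_partial_dot(a, b, n, g, local_size, scratch);
    return total;
}
```

In a real kernel, each group would write its partial sum to output[get_group_id(0)], and the host (or a second, single-group kernel) would add up those few values; no work-item then depends on data written by another work-group within the same launch.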