
Calcs Inconsistent with Large Datasets (using Local Memory)

Question asked by qld_sean on Dec 16, 2015
Latest reply on Dec 17, 2015 by qld_sean



First, my apologies: this is probably a classic newbie mistake. To help me visualize what goes on between local (compute unit) memory and global memory, I thought I'd do something very simple. My local work-group size is 256, I'm using an R9 390 card, and I explicitly set the global work size (array size + 1) in the host code. The interesting thing is that once my arrays get bigger than 5-10k floating-point elements, I start to get erratic behaviour. Often in this range the second run works while the first fails; when I go bigger, to 100,000 elements, the result is almost random. I truly believe it's the kernel and not the host code: if I append one line to the kernel that overwrites one value in the output array, the result is always consistent with what I've overwritten it with, irrespective of the input array sizes.
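One host-side detail worth ruling out with a local size of 256: under OpenCL 1.x rules, the global work size passed to `clEnqueueNDRangeKernel` has to be a multiple of the local work size, and "array size + 1" usually is not. A minimal sketch of the usual rounding follows. The helper name `round_up_global_size` is hypothetical, not part of the OpenCL API or of the host code in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of local_size,
   so the padded value can be used as the global work size. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

With 10,000 elements and a local size of 256 this gives 10,240 work-items. If the launch is padded this way, the kernel would also need a guard such as `if (gid < n)` (with `n` passed as an extra argument, an assumption here) so the padding work-items never touch the buffers.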


I know the kernel below is rubbish, and as you can see I was confused (and desperate) about the local and global locks. The code was written mainly to help me get my head around OpenCL and the GPU, but this issue is beyond me.





__kernel void dot_product(__global float* a_vec, __global float* b_vec, __global float* output, __local float* partial_dot)
{
   /* Offset of the b_vec values inside partial_dot; equal to the local work-group size (256) */
   int Offset = 256;
   int gid = get_global_id(0);
   int globalSize = get_global_size(0);
   int lid = get_local_id(0);
   int localSize = get_local_size(0);

   //local float* partial_dot;

   /* Copy this work-item's two input values from global into local memory */
   partial_dot[lid] = a_vec[gid];
   partial_dot[lid + Offset] = b_vec[gid];

   /* Multiply the two local values and store the product in global memory */
   output[gid] = partial_dot[lid] * partial_dot[lid + Offset];

   /* Accumulate the final result in output[0]; only want ONE thread to sum */
   if (gid == 0) {
      for (int Count = 1; Count <= globalSize; Count++)
         output[0] += output[Count];
   }

   //output[0] = globalSize;
}
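One possible reading of the erratic results: in the kernel above, work-item 0 starts summing `output[]` as soon as it reaches the loop, but OpenCL only lets work-items synchronize within a single work-group, so other work-groups may not have written their products yet. The loop bound `Count <= globalSize` also reads `output[globalSize]`, one element past anything a work-item writes. The usual alternative is a two-stage reduction: each work-group produces only its own partial sum, and the partial sums are added after the whole kernel has finished. Below is a plain C model of that structure, not a GPU implementation; the function names `group_partial_dot` and `dot_two_stage` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stage 1: the part one work-group can safely compute on its own.
   Sums a[i] * b[i] over the group's slice, clipped to n elements. */
float group_partial_dot(const float *a, const float *b, size_t n,
                        size_t group, size_t group_size)
{
    size_t start = group * group_size;
    float sum = 0.0f;
    for (size_t i = start; i < start + group_size && i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Stage 2: add the per-group partial sums. On a GPU this step would run
   only after the first kernel has completed, e.g. on the host or in a
   second kernel launch. */
float dot_two_stage(const float *a, const float *b, size_t n, size_t group_size)
{
    assert(group_size > 0);
    size_t num_groups = (n + group_size - 1) / group_size;
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; ++g)
        total += group_partial_dot(a, b, n, g, group_size);
    return total;
}
```

A serial model like this can also serve as a CPU reference when checking the GPU output, keeping in mind that floating-point sums may differ slightly when the order of additions changes.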