2 Replies Latest reply on Dec 17, 2015 4:11 AM by qld_sean

    Calculations Inconsistent with Large Datasets (Using Local Memory)

    qld_sean

      Hi,

       

      First, my apologies; this is probably a classic newbie mistake. To help me visualize what is going on between local (compute unit) memory and global memory, I thought I'd do something very simple. My local work-group size is 256, I'm using an R9 390 card, and I'm explicitly setting the global work size (array size + 1) in the host code. The interesting thing is that once my arrays get bigger than 5-10k floating-point elements, I start to get erratic behaviour: in this range the second run often works while the first fails, and when I go bigger, to 100,000 elements, the result is almost random. I truly believe it's the kernel and not the host code, because I can append one line of code that overwrites one value in the output array, and the result is always consistent with what I overwrote it with, irrespective of the input array sizes.

       

      I know the code is rubbish and, as you can see, I was confused/desperate with the local and global locks. The code was written mainly to help me get my head around OpenCL and the GPU, but this issue is beyond me.

       

      Thanks

      Sean

       

      __kernel void dot_product(__global float* a_vec, __global float* b_vec, __global float* output, __local float* partial_dot)
      {
         int Offset = 256;   /* start of the second half of the local buffer (the local work-group size) */
         int gid = get_global_id(0);
         int globalSize = get_global_size(0);
         int lid = get_local_id(0);
         int localSize = get_local_size(0);

         //local float* partial_dot;

         /* Copy this work-item's two inputs from global into local memory */
         partial_dot[lid] = a_vec[gid];
         partial_dot[lid + Offset] = b_vec[gid];
         barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

         /* Multiply the two local values and write the product to global memory */
         output[gid] = partial_dot[lid]*partial_dot[lid + Offset];
         barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

         /* Sum the products into output[0]; only want ONE thread to sum */
         if(gid == 0) {
            for (int Count = 1;Count<=globalSize;Count++)
               output[0] += output[Count];
         }

         //output[0] = globalSize;
      }
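
One possible explanation for the behaviour described above, offered as a sketch rather than a confirmed diagnosis: barrier() only synchronizes work-items inside the same work-group, so the gid == 0 work-item may start summing output[] before other work-groups have written their products. That would fit the symptom of small arrays working and large ones failing at random. Separately, if the buffers hold exactly array-size elements, then a global size of array size + 1 and the Count<=globalSize loop would both touch memory past the end. A common way around the first problem is a two-stage reduction: each work-group reduces its own products in local memory and writes one partial sum, and the partial sums are added only after every group has finished (on the host or in a second kernel). The plain C below models that idea on the host; it is not kernel code, and GROUP_SIZE, group_partial_dot and dot_product_two_stage are hypothetical names used only for illustration.

```c
#include <stddef.h>

#define GROUP_SIZE 256  /* assumed to match the local work-group size in the post */

/* Stage 1: models what ONE work-group would compute.
   Each "work-item" (lid) stores its product in a scratch array that stands in
   for __local memory, then the group tree-reduces the scratch array. */
static float group_partial_dot(const float *a, const float *b,
                               size_t n, size_t group)
{
    float scratch[GROUP_SIZE];

    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        size_t gid = group * GROUP_SIZE + lid;
        /* bounds guard: padding work-items past the end contribute zero */
        scratch[lid] = (gid < n) ? a[gid] * b[gid] : 0.0f;
    }

    /* In a kernel, each pass of the outer loop would be separated by a
       barrier(CLK_LOCAL_MEM_FENCE), which is valid because it stays
       within a single work-group. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
    }
    return scratch[0];
}

/* Stage 2: add the per-group partial sums once all groups are done.
   In OpenCL this would run on the host after reading back the partials,
   or in a second kernel launch. */
float dot_product_two_stage(const float *a, const float *b, size_t n)
{
    size_t num_groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;  /* round up */
    float total = 0.0f;

    for (size_t g = 0; g < num_groups; ++g) {
        total += group_partial_dot(a, b, n, g);
    }
    return total;
}
```

The round-up in num_groups corresponds to padding the global work size to a multiple of the local size on the host, with the gid < n guard keeping the padding work-items from reading or writing past the end of the buffers.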