2 Replies Latest reply on Aug 10, 2011 3:54 AM by akhal

    Global threads size and local threads size

    akhal

      Hello

      I am new to OpenCL and I am trying to apply a Gaussian filter to a simple matrix of ints. I have a 1000 x 1000 matrix and I need to run a separable filter, first in the x-direction and then in the y-direction. For the width-wise (row) filtering, I set up:

      size_t global_threads[2] = {1000, 1000};

      size_t local_threads[1] = {1000};

      clEnqueueNDRangeKernel(command_queue, row_kernel, 2, NULL, global_threads, local_threads, 0, NULL, NULL);

      And inside the kernel, I declare

      int lid = get_local_id(0);

      int IDy = get_global_id(1);

      __local int localSrc[1000];
       localSrc[lid] = Src[IDy*1000 + lid];

      What I want is for each work-group to cover one whole row (so 1000 rows = 1000 work-groups), with every work-item copying its own element so that the whole row ends up in that work-group's local memory. Then, when the filter runs for any work-item in the 1000 x 1000 matrix, it reads its neighbouring row elements from local memory, which should be faster and would avoid a race condition that could otherwise occur when reading from global memory.

       

      But my clEnqueueNDRangeKernel(---) call above fails with error code -54. What does this mean?

      I would also like to know whether my global_threads and local_threads sizes are correct, given that I want to copy each row into local memory for efficiency.

      Thanks in advance

        • Global threads size and local threads size
          genaganna

           

          1. Error -54 is CL_INVALID_WORK_GROUP_SIZE. The global and local work sizes must have the same number of dimensions: since work_dim is 2, local_threads needs two entries, as follows:

           

          size_t global_threads[2] = {1000, 1000};

          size_t local_threads[2] = {1000, 1};

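          2. The local work-group size (the product of all local dimensions) should not exceed 1024 on the CPU or 256 on the GPU, so {1000, 1} is valid only on the CPU. The exact limit is device-specific and can be queried with clGetDeviceInfo using CL_DEVICE_MAX_WORK_GROUP_SIZE.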
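
The two rules above can be sketched as a small host-side helper in plain C. This is only an illustration under stated assumptions: check_work_sizes and pick_local_size are hypothetical names, not OpenCL API calls, and the value -54 is copied from CL/cl.h so that the sketch compiles without the OpenCL headers. It models only two checks (each global size must be a multiple of the matching local size, as OpenCL 1.x requires, and the total work-group size must not exceed the device limit). A real clEnqueueNDRangeKernel validates more than this, and in a real program the limit would come from clGetDeviceInfo rather than being passed in by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Value of CL_INVALID_WORK_GROUP_SIZE in CL/cl.h, repeated here so the
   sketch builds without the OpenCL headers. */
#define SKETCH_INVALID_WORK_GROUP_SIZE (-54)

/* Hypothetical helper: returns 0 if the sizes pass the two checks below,
   otherwise SKETCH_INVALID_WORK_GROUP_SIZE.
   Both arrays must hold exactly work_dim entries. */
int check_work_sizes(unsigned work_dim,
                     const size_t *global,
                     const size_t *local,
                     size_t max_group_size)
{
    size_t group_items = 1;

    assert(work_dim >= 1 && work_dim <= 3);

    for (unsigned d = 0; d < work_dim; ++d) {
        /* Each global size must be a whole multiple of the local size. */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return SKETCH_INVALID_WORK_GROUP_SIZE;
        group_items *= local[d];
    }

    /* The product of the local sizes is the work-group size. */
    if (group_items > max_group_size)
        return SKETCH_INVALID_WORK_GROUP_SIZE;

    return 0;
}

/* Hypothetical helper: largest local size that divides global_size
   and does not exceed max_group_size (falls back to 1). */
size_t pick_local_size(size_t global_size, size_t max_group_size)
{
    assert(global_size > 0 && max_group_size > 0);

    size_t candidate =
        global_size < max_group_size ? global_size : max_group_size;

    while (global_size % candidate != 0)
        --candidate;

    return candidate;
}
```

For the 1000 x 1000 case, check_work_sizes accepts {1000, 1} with a limit of 1024 but rejects it with a limit of 256, and pick_local_size(1000, 256) gives 250, so {250, 1} would be a candidate on such a device. With a 250-wide group, each work-group holds only a quarter of a row, so the kernel's local array and indexing would have to be adjusted accordingly.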