akhal
Journeyman III

Global threads size and local threads size

Hello

I am new to OpenCL and I am trying to implement a Gaussian filter on a simple matrix of ints. I have a 1000 x 1000 matrix and I need to run a separable filter, first in the x-direction and then in the y-direction. For the width-wise (row) filtering, I set up:

size_t global_threads[2] = {1000, 1000};

size_t local_threads[1] = {1000};

clEnqueueNDRangeKernel(command_queue, row_kernel, 2, NULL, global_threads, local_threads, 0, NULL, NULL);

And inside the kernel, I declare

int lid = get_local_id(0);

int IDy = get_global_id(1);

__local int localSrc[1000];
 localSrc[lid] = Src[IDy*1000 + lid];

What I want is for each work-group to cover one whole row (so there are 1000 work-groups, one per row), with every work-item copying its own element so that the whole row ends up in that work-group's local memory. Then, when the filter runs over any work-item in the 1000x1000 matrix, each work-item reads its neighbouring row elements from local memory, which should be faster and should avoid the race condition I would otherwise have when reading from global memory.

 

But my clEnqueueNDRangeKernel(---) call as above fails with error code -54. What does this mean?

I would also like to know whether my global_threads and local_threads sizes are correct, given that I want to copy each row into the same local memory for efficiency.

Thanks in advance

2 Replies
genaganna
Journeyman III

1. The global and local work-size arrays must have the same number of dimensions, matching the work_dim argument (2 here), as follows:

 

size_t global_threads[2] = {1000, 1000};

size_t local_threads[2] = {1000, 1};

2. The local work-group size (the product of the local dimensions) should not be more than 1024 on the CPU or 256 on the GPU.
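
For reference, error code -54 is CL_INVALID_WORK_GROUP_SIZE in the OpenCL headers. The two points above can be sketched as a small pre-flight check in plain C. This is only an illustration under assumptions: the helper names (work_sizes_ok, largest_divisor_at_most, round_up_to_multiple) are hypothetical and not part of the OpenCL API, the divisibility rule is the OpenCL 1.x one, and the real limit for a device should be queried with clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): returns 1 if the sizes would
   pass the OpenCL 1.x work-group size rules, 0 otherwise.
   Both arrays must hold exactly work_dim entries. */
int work_sizes_ok(size_t work_dim, const size_t *global,
                  const size_t *local, size_t max_wg_size)
{
    size_t items = 1; /* work-items per work-group */
    assert(work_dim >= 1 && work_dim <= 3);
    for (size_t d = 0; d < work_dim; ++d) {
        /* each global size must be an exact multiple of the local size */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
        items *= local[d];
    }
    /* the product of the local sizes must fit the device limit */
    return items <= max_wg_size;
}

/* Largest local size that is <= limit and divides n evenly. */
size_t largest_divisor_at_most(size_t n, size_t limit)
{
    assert(n > 0 && limit > 0);
    size_t d = (n < limit) ? n : limit;
    while (n % d != 0)
        --d; /* always terminates: 1 divides every n */
    return d;
}

/* Smallest multiple of local that is >= n, for padding the global size. */
size_t round_up_to_multiple(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}
```

With the sizes from this thread, global {1000, 1000} and local {1000, 1} pass the check for a limit of 1024 but not for 256. For a 256 limit, largest_divisor_at_most(1000, 256) gives 250, so a local size of {250, 1} would be accepted; alternatively the global size can be padded to round_up_to_multiple(1000, 256) = 1024 with a bounds check in the kernel. In either case a work-group no longer covers a whole row, so the local array would have to hold one tile plus its neighbouring elements instead of all 1000 values.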

 


ok thank you sooooo much 🙂
