I'm new to OpenCL and am currently exploring some potential uses of it and GPU processing. So far I have written the following kernel:-
    __kernel void test(__global const read_only int* region_table, __global const read_only short* record, __global write_only short* results)
    {
        int i = get_global_id(0);
        // Total up the lengths of the regions before this one, to get the start position in the results buffer to write to.
        int res_idx = 0;
        for (int s = 0; s < i; s++)
        {
            res_idx += region_table[s * 2 + 1] - region_table[s * 2] + 1;
        }
        // Copy region data to results buffer.
        int reg_start = region_table[i * 2];
        int reg_end = region_table[i * 2 + 1];
        for (int r = reg_start; r <= reg_end; r++)
        {
            results[res_idx] = record[r];
            res_idx++;
        }
    }
The kernel is passed a "record" (60,000 values) via the second parameter. Most of this data is of no interest, so I also pass in a "region table" (first parameter) containing a series of repeating start and end values, each defining a subset of data within the record that I'm interested in. There will be around 200 regions, and they will not all be of the same size.
The kernel basically just copies the data from each of these regions into the results buffer, one after the other. The first part of the code figures out the start position within the results buffer where the region is to be copied, by summing the length of each region prior to the one being processed (global_size = number of regions). The second half of the code simply copies the region's data to the results buffer.
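The start-position calculation described above is an exclusive prefix sum over the region lengths. As a rough illustration only, here is a minimal host-side C sketch of that step. The function name `region_offsets` and the `offsets` array are hypothetical and not part of the original code; the sketch assumes the same table layout as the kernel (start at `[2*i]`, inclusive end at `[2*i + 1]`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exclusive prefix sum of region lengths.
 * region_table holds num_regions pairs of inclusive [start, end] indices.
 * offsets[i] receives the index in the results buffer where region i begins,
 * i.e. the value each work-item otherwise recomputes in its first loop.
 * Returns the total number of values across all regions. */
static int region_offsets(const int *region_table, size_t num_regions, int *offsets)
{
    int total = 0;
    for (size_t i = 0; i < num_regions; i++) {
        int start = region_table[2 * i];
        int end = region_table[2 * i + 1];
        assert(end >= start); /* assumption: every region is non-empty */
        offsets[i] = total;
        total += end - start + 1;
    }
    return total;
}
```

One possible use, again only as an assumption about how the code might be restructured: compute the offsets once on the host and pass them in an extra buffer, so that each work-item reads its start position directly instead of looping over all earlier regions. The return value also gives the size needed for the results buffer.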
The kernel works fine with global_size = number of regions and local_size = 1. It also works fine if I pass a NULL local size to `clEnqueueNDRangeKernel()` (which results in a local size of 200). However, if I use a local size of (say) 64, or configure it using `clGetKernelWorkGroupInfo()` (resulting in a local size of 1024), then the results buffer comes back containing just zeros. I'm curious to know what is going on here, i.e. why does the value of local_size have an effect on the kernel's behaviour?
(On a side-note, am I barking up the right tree with the way I'm tackling this problem? And is there anything I can do to improve efficiency/performance of this kernel?)
As per the clEnqueueNDRangeKernel specification in OpenCL 1.x, global_work_size must be evenly divisible by the corresponding value specified in local_work_size. Otherwise the call returns the CL_INVALID_WORK_GROUP_SIZE error.
In the above case, with global_work_size set to 60K, a local_work_size of 64 does not meet that condition. clEnqueueNDRangeKernel is therefore probably failing, and you are getting back the unmodified results buffer, which already contains zeros. For details, please check the error code returned by clEnqueueNDRangeKernel.
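To make the divisibility rule concrete, here is a small C sketch of two helpers, one that checks the rule and one that rounds a global size up to the next multiple of the local size. Both function names are hypothetical, and the sketch covers a single dimension only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if global_size is evenly divisible by
 * local_size, which is what OpenCL 1.x requires for each dimension. */
static int sizes_compatible(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return global_size % local_size == 0;
}

/* Hypothetical helper: rounds global_size up to the next multiple of
 * local_size (returns global_size unchanged if it is already a multiple). */
static size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

With these, `sizes_compatible(200, 64)` is 0 and `round_up_global(200, 64)` is 256. If the global size is padded this way, the kernel would also need a bounds check (for example an early return when `get_global_id(0)` is not below the real item count, passed as an extra argument) so that the surplus work-items do nothing. In any case, comparing the return value of clEnqueueNDRangeKernel against CL_SUCCESS will show whether the launch was rejected.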
Thanks.
Well spotted; my stupid mistake, sorry. The global work size is set to 200 (the number of "regions"), but you are right that I was forgetting this needed to be divisible by the local size. Thanks for your help.