I'm new to OpenCL and am currently exploring some potential uses of it and of GPU processing. So far I have written the following kernel:

    __kernel void test(__global const read_only int* region_table, __global const read_only short* record, __global write_only short* results)
    {
        int i = get_global_id(0);

        // Total up the lengths of the regions before this one, to get the start position in the results buffer to write to.
        int res_idx = 0;
        for (int s = 0; s < i; s++)
            res_idx += region_table[s * 2 + 1] - region_table[s * 2] + 1;

        // Copy region data to results buffer, advancing the write position after each value.
        int reg_start = region_table[i * 2];
        int reg_end = region_table[i * 2 + 1];
        for (int r = reg_start; r <= reg_end; r++)
            results[res_idx++] = record[r];
    }

The kernel is passed a "record" (60,000 values) via the second parameter. Most of this data is of no interest, so I also pass in a "region table" (first parameter) containing a flat sequence of start/end index pairs, each pair defining an inclusive subset of the record that I'm interested in. There will be around 200 regions, and they will not all be the same size.
The kernel basically just copies the data from each of these regions into the results buffer, one after the other. The first part of the code figures out the start position within the results buffer where the region is to be copied, by summing the length of each region prior to the one being processed (global_size = number of regions). The second half of the code simply copies the region's data to the results buffer.
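For comparison, the same copy can be sketched sequentially on the host, which may help when checking the kernel's output. This is only a minimal reference sketch: the function name `compact_regions` and the `num_regions` parameter are hypothetical and not part of the original code, and it assumes the region table holds inclusive start/end pairs as described above.

```c
#include <stddef.h>

/* Hypothetical host-side reference (not part of the original post).
   Copies each inclusive [start, end] region of `record` into `results`,
   one region after the other, and returns the number of values written. */
size_t compact_regions(const int *region_table, size_t num_regions,
                       const short *record, short *results)
{
    size_t res_idx = 0; /* running write position in the results buffer */

    for (size_t i = 0; i < num_regions; i++) {
        int reg_start = region_table[i * 2];
        int reg_end = region_table[i * 2 + 1];

        /* Copy this region; the end index is inclusive. */
        for (int r = reg_start; r <= reg_end; r++)
            results[res_idx++] = record[r];
    }
    return res_idx;
}
```

Here the running total `res_idx` plays the role of the per-work-item summation loop in the kernel: each region's start offset is simply the sum of the lengths of all regions before it.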
The kernel works fine with global_size = number of regions and local_size = 1. It also works fine if I pass a NULL local size to `clEnqueueNDRangeKernel()` (which results in a local size of 200). However, if I use a local size of (say) 64, or set it from the value returned by `clGetKernelWorkGroupInfo()` (resulting in a local size of 1024), then the results buffer comes back containing just zeros. I'm curious to know what is going on here: why does the value of local_size have an effect on the kernel's behaviour?
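One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in OpenCL 1.x, `clEnqueueNDRangeKernel()` returns `CL_INVALID_WORK_GROUP_SIZE` when an explicit local size does not evenly divide the global size, in which case the kernel never runs. With 200 regions, local sizes of 1 and 200 divide evenly, while 64 and 1024 do not. The helper names below are hypothetical; the sketch only shows the divisibility check and the usual round-up of the global size.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if global_size splits into whole
   work-groups of local_size, 0 otherwise. */
int is_valid_local_size(size_t global_size, size_t local_size)
{
    return local_size != 0 && global_size % local_size == 0;
}

/* Hypothetical helper: rounds global_size up to the next multiple of
   local_size. Assumes local_size > 0. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

If the global size is padded this way (for example from 200 to 256 for a local size of 64), the kernel would also need to ignore the extra work-items, for instance by taking the region count as an additional argument and returning early when `get_global_id(0)` is not below it. Checking the return code of `clEnqueueNDRangeKernel()` would show whether this is what is happening.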
(On a side-note, am I barking up the right tree with the way I'm tackling this problem? And is there anything I can do to improve efficiency/performance of this kernel?)