From my reading of the description of shuffle in the OpenCL spec, the function accepts only the built-in OpenCL vector types (uint8, float16, etc.), not an arbitrary array of values.
If you want to shuffle 1000 values, wouldn't you rather spread the work across multiple work-items instead of doing it all in a single one?
So I would suggest something like:
int gid = get_global_id(0);
out[gid] = shuffle2(data1[gid], data2[gid], mask[gid]);
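To make the element selection concrete, here is a small host-side model in plain C of what shuffle2 does for two 4-component vectors, as I read the spec: the two inputs are numbered as one 8-element vector, and each mask element picks one of them using only its low bits. This is only a sketch of the semantics, not the built-in itself, and the name shuffle2_ref4 is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of shuffle2(float4 x, float4 y, uint4 mask).
   The two inputs are treated as one 8-element vector (x first, then y),
   and only the low 3 bits of each mask element are used. */
static void shuffle2_ref4(const float x[4], const float y[4],
                          const unsigned mask[4], float out[4])
{
    assert(x && y && mask && out);

    /* Copy both inputs first so that out may alias x or y safely. */
    float combined[8];
    for (size_t i = 0; i < 4; ++i) {
        combined[i] = x[i];
        combined[i + 4] = y[i];
    }

    /* Each result element is selected by the matching mask element. */
    for (size_t i = 0; i < 4; ++i)
        out[i] = combined[mask[i] & 7u];
}
```

With x = {0, 1, 2, 3}, y = {4, 5, 6, 7} and mask = {7, 0, 5, 10}, this model selects elements 7, 0, 5 and 2 of the combined vector, because 10 keeps only its low 3 bits.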
I had a rough idea of what you are saying about shuffle2, but what if I need to shuffle a number of values that doesn't divide evenly? If I choose float8, the number of values has to be a multiple of 8 for this to work with get_global_id(0).
That is a common problem with any vector operation. How you handle it depends on what the algorithm requires and how many work-items have been launched. You can consider padding the data to a multiple of the vector width if it helps you gain performance.
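As a rough sketch of the padding idea, the host code can round the element count up to the next multiple of the vector width before allocating buffers and choosing the global size. The helper names round_up and vector_count are made up for illustration, and this assumes each work-item processes exactly one vector (one float8, say).

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of width (e.g. 8 for float8). */
static size_t round_up(size_t n, size_t width)
{
    assert(width > 0);
    return (n + width - 1) / width * width;
}

/* Number of work-items needed when each one handles a single vector. */
static size_t vector_count(size_t n, size_t width)
{
    return round_up(n, width) / width;
}
```

For 1001 floats and float8, the buffers would hold round_up(1001, 8) elements and the kernel would be launched with vector_count(1001, 8) work-items. The extra tail elements get some dummy value on input and are ignored when reading the result back.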