I have a kernel that basically does nothing but a table lookup. The total number of lookups is about 32,000.
Question: Is the overhead of launching and executing ONE thread (work-item) PER lookup worth it?
Or would it be faster to put a small loop inside the kernel so that each thread does several lookups, let's say 16? The idea is just to avoid the overhead of creating too many threads.
    __constant int lookup_table[256] = {...some values...};

    __kernel void some_kernel(__global int* in, __global int* out)
    {
        /* one work-item per lookup */
        uint tid = get_global_id(0);
        out[tid] = lookup_table[in[tid]];
    }
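For comparison, the looped variant I have in mind would look roughly like the sketch below. A few things in it are my own assumptions and not part of the kernel above: the name `some_kernel_looped`, the extra `n` argument (total number of lookups, so the last work-items can skip out-of-range indices), passing the table as a `__constant int* table` argument instead of using the program-scope array, and stepping by `get_global_size(0)` so that neighbouring work-items touch neighbouring elements on each iteration. The `#ifndef __OPENCL_VERSION__` part is only a hypothetical host-side shim so the file can also be compiled as plain C for a quick check; an OpenCL compiler skips it.

```c
/* Host-side shim (an assumption, not part of OpenCL): lets this file also
   compile as plain C for a quick check. An OpenCL compiler skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
typedef unsigned int uint;
#define __kernel
#define __global
#define __constant const
static size_t fake_global_id, fake_global_size;   /* set by the host check */
static size_t get_global_id(uint dim)   { (void)dim; return fake_global_id; }
static size_t get_global_size(uint dim) { (void)dim; return fake_global_size; }
#endif

#define LOOKUPS_PER_ITEM 16   /* the "16" from the question; a guess, not a tuned value */

/* Each work-item performs up to LOOKUPS_PER_ITEM lookups.
   `table` and `n` are hypothetical extra arguments (see the text above). */
__kernel void some_kernel_looped(__global const int* in,
                                 __global int* out,
                                 __constant int* table,
                                 uint n)
{
    size_t gid    = get_global_id(0);
    size_t stride = get_global_size(0);

    for (uint i = 0; i < LOOKUPS_PER_ITEM; ++i) {
        /* strided index: work-item g handles g, g + stride, g + 2*stride, ... */
        size_t idx = gid + (size_t)i * stride;
        if (idx < n)               /* guard: n need not be a multiple of 16 */
            out[idx] = table[in[idx]];
    }
}
```

With this version the global work size would be the number of lookups divided by 16, rounded up (2,000 work-items for 32,000 lookups), instead of one work-item per lookup. Whether that is actually faster presumably depends on the device, so I would have to time both variants.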