Question about optimizing to use VLIW

Discussion created by notyou on Sep 8, 2011
Latest reply on Sep 9, 2011 by notyou


I have a very simple algorithm that I'm using as an introduction to OpenCL, and I'm working on optimizing it to use VLIW to increase performance.

What I'm wondering is: when I vectorize the kernel, do I need to change the overall number of work items, e.g. size_t global[1] = {NUM_ITEMS/4}? I tested both with and without the /4, and the only difference (after verifying the results against a sequential run) is that the execution speed is halved when using the /4. Can anyone shed some light on this? Attached is the kernel I'm using.

__kernel void basic_kernel(__global int4* ocl_buffer, __local int4* local_ocl_buffer, int LOOP_ITERATIONS)
{
    /* get our thread's global ID so we know where to write in memory */
    int id = get_global_id(0);
    int i = 0;

    local_ocl_buffer[id] = ocl_buffer[id]; /* copy the data to local memory */

    for ( i = 0; i < LOOP_ITERATIONS; i += 4 ) /* simple constant to loop many instructions to create work */
    {
        /* this is a bad example as we can actually use registers instead of global/local memory. */
        local_ocl_buffer[id] *= 2;
        local_ocl_buffer[id] /= 2;
    }

    local_ocl_buffer[id] += 1; /* offset by 1 just to verify work has been done since we set the original value to id */
    ocl_buffer[id] = local_ocl_buffer[id]; /* transfer the data to global memory so we can retrieve it from the CPU */
}
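
To make the work-item question concrete, here is a minimal host-side sketch of the arithmetic involved. It assumes the buffer was allocated to hold NUM_ITEMS scalar ints, which the post does not state. The names VECTOR_WIDTH and vectorized_global_size are hypothetical and used only for illustration. Since one int4 packs four ints, such a buffer would contain NUM_ITEMS/4 int4 elements, and a kernel indexing it by get_global_id(0) would need that many work items.

```c
#include <stddef.h>

/* Hypothetical constant: an int4 packs four ints into one element. */
#define VECTOR_WIDTH 4

/* Number of work items needed when each one handles a single int4.
 * Assumes the buffer holds num_items scalar ints. Rounds up so that a
 * trailing partial vector still gets a work item; in that case the
 * buffer itself would have to be padded to a multiple of VECTOR_WIDTH. */
static size_t vectorized_global_size(size_t num_items)
{
    return (num_items + VECTOR_WIDTH - 1) / VECTOR_WIDTH;
}
```

On the host this would be used as size_t global[1] = { vectorized_global_size(NUM_ITEMS) };, which is the same as NUM_ITEMS/4 whenever NUM_ITEMS is a multiple of 4. Under the same assumption, launching the full NUM_ITEMS work items would have most of them index past the end of the int4 buffer.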