float4 vs. float in SDK 2.1? For simple kernels, float* args appear more efficient

Discussion created by kb1vc on Jul 7, 2010
Latest reply on Jul 10, 2010 by Jawed
In SDK 2.1 with the 10-5 driver (Linux), a simple vector add kernel runs faster with float* arguments than with float4* arguments

All the guides tell us to use float4. I even thought I had done testing in the past showing that float4s were more efficient than floats. So I was somewhat surprised when I ran a simple vector-add test with the two kernels shown here over a set of 4M-element vectors: vector_add, the plain float version, was the faster of the two.

For a full-blown tarball with test code and all the right stuff, take a look at


My results: (see the README file)

vector_add typically took 630e-6 sec per 4M-element sum, vs. ~820e-6 sec for vector_add4.



__kernel void vector_add(__global const float * a,
                         __global const float * b,
                         __global float * c)
{
    uint tid = get_global_id(0);
    c[tid] = a[tid] + b[tid];
}

__kernel void vector_add4(__global const float4 * a,
                          __global const float4 * b,
                          __global float4 * c)
{
    uint tid = get_global_id(0);
    // unroll the loop 4 times. So launch vecsize/4 threads.
    c[tid] = a[tid] + b[tid];
}