All the guides tell us to use float4. I even thought I'd done testing in the past showing that float4s were more efficient than plain floats. So I was somewhat surprised when I ran a simple vector test with the two kernels shown here over a set of 4M-element vectors: vector_add was the faster of the two operations.

For a full-blown tarball with test code and all the right stuff, take a look at

http://www.bigncomputing.org/Big_N_Computing/Test_Programs_files/vec4_test.tgz

My results: (see the README file)

vector_add took (typically) 630e-6 sec per 4M element sum vs. ~820e-6 sec for vector_add4

__kernel void vector_add(__global const float * a,
                         __global const float * b,
                         __global float * c)
{
    uint tid = get_global_id(0);
    c[tid] = a[tid] + b[tid];
}

__kernel void vector_add4(__global const float4 * a,
                          __global const float4 * b,
                          __global float4 * c)
{
    uint tid = get_global_id(0);
    // Each work-item handles 4 floats, so launch vecsize/4 work-items.
    c[tid] = a[tid] + b[tid];
}

As a rule of thumb with HD5870 (which you are apparently using) there are 10 ALU cycles available per single 128-bit result written to memory. (HD5850, HD5770 etc. are the same and older GPUs like HD4870/HD4850 are also the same.)

Your scalar kernel chooses to write 32 bits instead of 128, so in this case the kernel spends 10 cycles per 32-bit write.

Put naively, there are 100 FLOPs available per 128-bit write: 10 cycles * 5-way VLIW * 2 FLOPs (a multiply-add) per lane. Your scalar kernel does 1 FLOP per 128-bit write (with 96 bits wasted) and your vector kernel does 4 FLOPs per 128-bit write.

Obtaining the global work item ID will add some overhead.

So in both cases the kernel is not ALU bound. You can check this by obtaining the ISA for your kernel. Since you don't have Visual Studio 2008, you're forced to do this using GPU_DUMP_DEVICE_KERNEL - I presume that works. The ISA will have fewer than 10 ALU cycles. (I haven't actually checked this, to be honest - I'm betting compilation of these simple kernels isn't unspeakably terrible.)

In general when a GPU is not ALU bound it's bound by input rate, output rate, bandwidth or other more involved things.

One of those more involved things is the spin-up and spin-down rates for the GPU.

Workgroups on the GPU are created sequentially (not strictly true for HD5870 - but not material here).

The GPU can only create one workgroup every 2 cycles. Rasterisation is what generates workgroups and their work items, and it runs at a rate of 32 work items per cycle, so a 64-work-item workgroup takes 2 cycles to issue. This means that after 10 cycles only 5 workgroups have started work on the kernel, and every 10 cycles 5 workgroups will finish working on the kernel.

So you can see the problem here: the GPU's SIMD cores are basically idle. 20 SIMD cores are twiddling their thumbs because the GPU only hands every core a new workgroup once every 40 cycles, yet each core can complete a workgroup's ALU work in 10 cycles (excluding the latency of fetching from a and b).

I'm not sure how workgroups are allocated. Does the GPU attempt to fill one SIMD with workgroups before moving on to allocating work for the next SIMD? Or does the GPU allocate successive workgroups to successive SIMDs?

The devil is in the detail here as throughput is also affected by latency-hiding and cache-reuse factors. With the hundreds of cycles of worst-case latency for fetches from a and b and the lack of arithmetic intensity in the kernel to soak up these latencies, you can start to see how the pattern of workgroup allocation to cores affects throughput.

And, of course, the 4-fold difference in workgroup count between the two kernels is another variable. The scalar kernel requires 62,500 workgroups, while the vector version requires 15,625. There are two different memory access patterns here, so cache re-use and general latency hiding will differ.

To be honest I'm surprised scalar is faster, but the overall arithmetic intensity is so mind-bogglingly low that performance is a crap-shoot.

Performance in this case is dominated by cache access patterns. Those are mostly determined by the pattern of workgroup allocation to cores, which in turn is a function of the workgroup count and the time each workgroup spends in a core (10 cycles to compute a+b, plus some variable time spent waiting for a and b to be fetched).