As a rule of thumb with HD5870 (which you are apparently using) there are 10 ALU cycles available per single 128-bit result written to memory. (HD5850, HD5770 etc. are the same and older GPUs like HD4870/HD4850 are also the same.)
Your scalar kernel writes 32 bits instead of 128 bits, so in this case it spends 10 cycles to write 32 bits.
Put naively, there's 100 FLOP available per 128-bit write - 10 cycles * 5-way VLIW * 2 FLOP. Your scalar kernel is doing 1 FLOP per 128-bit write (with 96 bits wasted) and your vector kernel is doing 4 FLOP per 128-bit write.
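To make that budget concrete, here is a rough sketch of the arithmetic. The constants are just the figures quoted above; the function names are mine, for illustration only.

```python
# Rule-of-thumb figures quoted in the post for HD5870-class GPUs.
ALU_CYCLES_PER_128BIT_WRITE = 10
VLIW_LANES = 5
FLOP_PER_LANE_PER_CYCLE = 2  # the post's figure (presumably a multiply-add)


def flop_budget_per_write() -> int:
    """FLOP the ALUs could issue in the time one 128-bit result is written."""
    return ALU_CYCLES_PER_128BIT_WRITE * VLIW_LANES * FLOP_PER_LANE_PER_CYCLE


def alu_utilisation(flop_per_write: int) -> float:
    """Fraction of the available FLOP budget a kernel actually uses."""
    return flop_per_write / flop_budget_per_write()


print("budget per 128-bit write:", flop_budget_per_write(), "FLOP")
print("scalar kernel utilisation:", alu_utilisation(1))  # 1 add per write slot
print("float4 kernel utilisation:", alu_utilisation(4))  # 4 adds per write
```

Either way the kernel uses only a few percent of the ALU budget, which is the point being made here.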
Obtaining the global work item ID will add some overhead.
So in both cases the kernel is not ALU bound. You can check this by obtaining the ISA for your kernel. Since you don't have Visual Studio 2008 you're forced to do this using GPU_DUMP_DEVICE_KERNEL - I presume this works. The ISA should show fewer than 10 ALU cycles. (I haven't actually checked this, to be honest - I'm betting compilation of these simple kernels isn't unspeakably terrible.)
In general when a GPU is not ALU bound it's bound by input rate, output rate, bandwidth or other more involved things.
One of those more involved things is the spin-up and spin-down rates for the GPU.
Workgroups on the GPU are created sequentially (not strictly true for HD5870 - but not material here).
The GPU can only create one workgroup every 2 cycles - this is because rasterisation is what generates workgroups and their work items, and rasterisation runs at the rate of 32 work items per cycle. This means that after 10 cycles only 5 workgroups have started work on the kernel. And every 10 cycles 5 workgroups will finish working on the kernel.
So you can see the problem here: the GPU's SIMD cores are basically idle. 20 SIMD cores are twiddling their thumbs as the GPU is only allocating work for all cores every 40 cycles but each core can complete each work group in 10 cycles (excluding the latency for fetching from a and b).
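A back-of-envelope version of that spin-up argument, as a sketch only. It assumes each workgroup is one 64-item wavefront (which is what the 2-cycle figure implies), that successive workgroups go to successive SIMDs, and that fetch latency is ignored. None of that is confirmed for the real hardware.

```python
RASTER_ITEMS_PER_CYCLE = 32  # work items generated per cycle (from the post)
WAVEFRONT_SIZE = 64          # assumption: one workgroup == one wavefront
NUM_SIMDS = 20               # SIMD cores on HD5870 (from the post)
COMPUTE_CYCLES = 10          # ALU cycles per workgroup (from the post)

# Cycles the rasteriser needs to create one workgroup.
cycles_per_launch = WAVEFRONT_SIZE // RASTER_ITEMS_PER_CYCLE

# Workgroups that have started after the first 10 cycles.
started_after_10 = 10 // cycles_per_launch

# With round-robin allocation (an assumption), each SIMD receives a new
# workgroup only once every NUM_SIMDS launches.
refill_period = cycles_per_launch * NUM_SIMDS

# Fraction of time a SIMD has ALU work to do, ignoring memory latency.
busy_fraction = COMPUTE_CYCLES / refill_period

print("cycles per workgroup launch:", cycles_per_launch)
print("workgroups started after 10 cycles:", started_after_10)
print("cycles between refills of one SIMD:", refill_period)
print("ALU busy fraction per SIMD:", busy_fraction)
```

Under these assumptions each SIMD has ALU work for only about a quarter of the time, before fetch latency is even considered.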
I'm not aware of the method of allocation of workgroups. Does the GPU attempt to fill one SIMD with workgroups before moving on to allocating work for the next SIMD? Or does the GPU allocate successive workgroups to successive SIMDs?
The devil is in the detail here as throughput is also affected by latency-hiding and cache-reuse factors. With the hundreds of cycles of worst-case latency for fetches from a and b and the lack of arithmetic intensity in the kernel to soak up these latencies, you can start to see how the pattern of workgroup allocation to cores affects throughput.
And, of course, the 4-fold difference in workgroup count between the two kernels is another variable. The scalar kernel requires 62,500 workgroups, while the vector version requires 15,625. There are two different memory access patterns here, so cache re-use and general latency hiding will differ.
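Those counts can be reproduced with a few lines. This is a sketch that assumes a workgroup size of 256 and infers the element count as 62,500 * 256; neither number is stated outright alongside the counts.

```python
WORKGROUP_SIZE = 256  # assumption: 256 work items per workgroup

# Element count inferred from the scalar figure: 62,500 workgroups * 256.
ELEMENTS = 62_500 * WORKGROUP_SIZE


def workgroups_needed(elements: int, vector_width: int,
                      workgroup_size: int = WORKGROUP_SIZE) -> int:
    """Workgroups required when each work item handles vector_width elements."""
    work_items = -(-elements // vector_width)   # ceiling division
    return -(-work_items // workgroup_size)     # ceiling division


print("elements:", ELEMENTS)
print("scalar workgroups:", workgroups_needed(ELEMENTS, 1))
print("float4 workgroups:", workgroups_needed(ELEMENTS, 4))
```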
To be honest I'm surprised scalar is faster, but the overall arithmetic intensity is so mind-bogglingly low that performance is a crap-shoot.
Performance in this case is dominated by cache access patterns. Those are mostly determined by the pattern of workgroup allocation to cores, which in turn is a function of the workgroup count and the time a workgroup spends in a core (10 cycles to compute a+b plus some random time spent waiting for a and b to be fetched).
However, as you point out, since the kernel performance is bound by memory transfer and kernel startup cost, we should see the vector (float4) version run faster, not slower. (The reason I used a simple kernel was to ensure that the operation was not ALU bound, but instead was memory bound.)
That's why I posted the code. The result is repeatable, and doesn't make sense from any of the first principles reasoning.
Perhaps the cause is that a workgroup of 256 threads, each fetching float4s, will touch 256 x 4 x 4 or 4K bytes per operand. IIRC that's 2x or 4x a GDDR5 DRAM block? If so, then the float4 version of the code would require 20 * 3 * (2 or 4) open DRAM pages at a time (if we schedule just one workgroup per SIMD). That's more than the DRAM complement can handle, so there are a bunch of page close/RAS cycles going on because of all the bank conflicts. The scalar version, by contrast, has far fewer simultaneously open pages, so there should be fewer (though not zero) bank conflicts.
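Here is that estimate as a quick sketch. The page sizes (1 KB or 2 KB) are guesses that follow from the "2x or 4x" recollection, not datasheet values, and the model assumes page-aligned accesses with exactly one resident workgroup per SIMD.

```python
from math import ceil

NUM_SIMDS = 20    # SIMD cores, one resident workgroup each (assumption)
NUM_BUFFERS = 3   # a, b and the output buffer
ELEM_BYTES = 4    # sizeof(float)


def bytes_per_operand(workgroup_size: int, vector_width: int) -> int:
    """Contiguous bytes one workgroup touches in a single buffer."""
    return workgroup_size * vector_width * ELEM_BYTES


def open_pages(workgroup_size: int, vector_width: int, page_bytes: int) -> int:
    """DRAM pages open at once across all SIMDs and buffers (aligned case)."""
    span = bytes_per_operand(workgroup_size, vector_width)
    return NUM_SIMDS * NUM_BUFFERS * ceil(span / page_bytes)


for page_bytes in (2048, 1024):  # hypothetical page sizes
    print("page size", page_bytes,
          "float4:", open_pages(256, 4, page_bytes),
          "float:", open_pages(256, 1, page_bytes))
```

With either guessed page size the float4 kernel keeps 2x to 4x as many pages open as the scalar one.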
I'll work on some experiments for this one...
I'm looking forward to getting AMD's performance analysis support for Linux.
You should find this interesting, particularly from page 39 onwards:
and this is useful for some parameters:
This same example was brought up internally and it's actually quite complicated. Float was achieving 127 GB/s total bandwidth and float4 only 80 GB/s. The buffers allocated were 16 MB each, if I recall correctly.
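For reference, this is roughly how such figures relate to kernel time. It is a sketch that assumes "total bandwidth" means bytes read plus bytes written divided by elapsed time, that three 16 MB buffers are involved (two inputs, one output), and that MB means MiB here.

```python
MIB = 1024 * 1024
BUFFER_BYTES = 16 * MIB  # "16 MB each", read as MiB (assumption)
NUM_BUFFERS = 3          # two inputs plus one output (assumption)

TOTAL_BYTES = NUM_BUFFERS * BUFFER_BYTES


def effective_bandwidth_gbs(total_bytes: int, seconds: float) -> float:
    """Total bytes moved divided by elapsed time, in GB/s (1e9 bytes/s)."""
    return total_bytes / seconds / 1e9


def implied_time_ms(total_bytes: int, gbs: float) -> float:
    """Elapsed time in milliseconds implied by a quoted bandwidth figure."""
    return total_bytes / (gbs * 1e9) * 1e3


for name, gbs in (("float", 127.0), ("float4", 80.0)):
    print(f"{name}: {implied_time_ms(TOTAL_BYTES, gbs):.3f} ms per pass")
```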
If you look at the simple case where the group size is 64 threads, same as wavefront size, it makes analysis a little easier.
In the float case, each wavefront reads 256 bytes from each input buffer and writes 256 bytes. 256 bytes happens to be the width of a single memory channel. Since the allocations are likely exactly 16 MB apart, the starting address for each buffer is in the same bank and channel. So we are getting some channel and bank collisions on the reads. Writes happen at a later time (due to pipelining) so don't interfere much. Since you are only reading a float from each buffer, we are underutilized on the read side. This helps alleviate the costs of the collisions.
In the float4 case, each wavefront reads 1024 bytes from each input buffer and writes 1024 bytes. So now reads span 4 channels and collide on these channels. To make things worse, the reads are in the same bank. By offsetting one of the input buffers by 2048 bytes (1 bank) performance jumps to 113 GB/s. Still less than the float case, but this is likely due to the cost of frequent read/write switching (remember, float case is underutilizing the FIFOs and bandwidth so gets more help from them).
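A toy address-mapping model shows the collision pattern. It is a sketch, not the real memory controller: it assumes channels interleave every 256 bytes and banks every 2048 bytes (so 8 channels, implied by those two figures), that there are 16 banks (a made-up number), and that "16 MB apart" means exactly 16 MiB.

```python
CHANNEL_WIDTH = 256   # bytes per channel stripe (from the post)
BANK_WIDTH = 2048     # bytes per bank step (from the post)
NUM_CHANNELS = BANK_WIDTH // CHANNEL_WIDTH  # 8, implied by the two figures
NUM_BANKS = 16        # hypothetical, for illustration only
MIB = 1024 * 1024


def channel(addr: int) -> int:
    """Channel an address maps to under the assumed interleaving."""
    return (addr // CHANNEL_WIDTH) % NUM_CHANNELS


def bank(addr: int) -> int:
    """Bank an address maps to under the assumed interleaving."""
    return (addr // BANK_WIDTH) % NUM_BANKS


def channels_touched(base: int, nbytes: int) -> set:
    """Channels hit by one wavefront reading nbytes starting at base."""
    return {channel(a) for a in range(base, base + nbytes, CHANNEL_WIDTH)}


a_base = 0
b_base = 16 * MIB  # second input buffer, assumed exactly 16 MiB later

print("float  reads, a:", channels_touched(a_base, 256),
      "b:", channels_touched(b_base, 256))
print("float4 reads, a:", channels_touched(a_base, 1024),
      "b:", channels_touched(b_base, 1024))
print("banks without offset:", bank(a_base), bank(b_base))
print("banks with 2048-byte offset on b:", bank(a_base), bank(b_base + 2048))
```

In this model both buffers start in the same channel and bank, a float4 wavefront spreads each read over four channels, and the 2048-byte offset moves the second buffer to the next bank while leaving its channel unchanged.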
The profiler can give you some info. In the float case, you see no stalls on reads, yet the float4 case shows quite a few stalls (about 25%, as I recall).
In general, vector reads/writes are still preferable, but cases with frequent read/write switching may benefit from smaller vector sizes. For example, if you did 128 reads per thread and output the sum, you'd find that float4 was much faster.
We plan on documenting all of this; the main concern is how to broach the subject without confusing people. Also, I plan on proposing a way for the developer to request a certain alignment of buffers so you can control what bank/channel you start in. That way you know how to offset the buffers if collisions are an issue. Random access operations are less likely to collide and shouldn't need any offset.
Hope this helps.
Funny, I was going to suggest that performance be graphed for varying element count, but decided not to as I didn't think it would make a difference. Yet it seems like it would.
The behaviour with images is presumably quite different (though pixel shading in conventional graphics is where it'd be optimal on memory access pattern, I suppose). If that's so then when you document this stuff I guess you can add that in as a variable to play with.