In arithmetic code, no, they compile to the same thing. In global memory access, yes, because loading one 128-bit value is more efficient than loading four 32-bit values, and the compiler may not be able to infer the vector load from the set of scalars.
I ran several float and floatn saxpy kernels through the AMD APP KernelAnalyzer, but I can't make sense of several results.
First, float, float2, and float4 had the same throughput (threads/sec) on most of the listed devices, but throughput then decreased by a little less than half for float8, and again by a little less than half for float16. The throughput for float, float2, and float4 makes sense considering VLIW4. Could you explain why float8 and float16 provide even better FLOPS than float4?
Second, I tried to simulate the float4 kernel using four floats, but I couldn't find them being packed into the same vector multiply or add instruction in either the IL or the assembly. The throughput of the manually 4x unrolled kernel was significantly lower than that of the float4 kernel. I tried all the striding examples in the APP OpenCL Programming Guide's section on VLIW packing, but still no luck. Any idea what's going wrong? Do you know which compilers do and don't support this optimization?
BTW, how exactly are float4s stored in the LDS? As I understand it, floats are striped sequentially across consecutive memory banks (16 or 32), each of which holds 32 bits. I ask because the guide recommends that each float be strided 16 apart for manual VLIW packing. So is a float4 stored all in one memory bank, or across four memory banks?
Thanks a million!
Hi Settle, sorry about the delay; I've been at a conference.
If we ignore global memory reads and local memory banking, then using larger vectors is similar to unrolling a loop. It just gives you more ALU work to perform before you next need to do control flow or hit a dependency, and hence you get higher ILP. So you might get performance gains as a result, until you start to create register pressure.
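To make the unrolling analogy concrete, here is a minimal sketch in plain C rather than OpenCL C. The vec4 struct is a hypothetical stand-in for float4, and the function names are made up for illustration. The 4-wide loop body contains four multiply-adds that don't depend on each other, which is the kind of independent work a VLIW scheduler can pack between loop-control steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4; plain C has no built-in vector type. */
typedef struct { float x, y, z, w; } vec4;

/* Scalar saxpy: one multiply-add per iteration, then loop control. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* 4-wide saxpy: four multiply-adds per iteration with no dependency
 * between them, so there is more ALU work per loop-control step.
 * n4 is the number of vec4 elements, i.e. a quarter of the float count. */
void saxpy_vec4(size_t n4, float a, const vec4 *x, vec4 *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n4; ++i) {
        y[i].x = a * x[i].x + y[i].x;
        y[i].y = a * x[i].y + y[i].y;
        y[i].z = a * x[i].z + y[i].z;
        y[i].w = a * x[i].w + y[i].w;
    }
}
```

Both functions compute the same result for the same data; only the amount of independent work per iteration differs. Going to 8 or 16 components per iteration extends the same idea, at the cost of more live registers.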
The compiler doesn't really pay much attention to packing vector ops into VLIW packets; they're generated as scalar instructions and then scheduled from there. Four scalar floats should then be dealt with similarly, given that there are no vector instructions in the GPU ISA (it would come up as four scalar operations in IL). There are caveats to that. When you read from DRAM, 128-bit reads are more efficient, and that's particularly true if you read from bytes 0, 4, 8 and so on as you read x, x, x then y, y, y as scalars. Earlier versions of the compiler didn't pack scalar values into vector registers (the VLIW unit can access 4 separate register banks), and even on newer compilers it may be that we have to read all four banks of the register file from the same address, so if the register packing isn't quite right it may be wasteful, or only able to read 32 or 64 bits from the register file instead of 128.
As for LDS, it would break pointer arithmetic if we transposed the data, so a float4 takes 16 consecutive bytes (the order is officially undefined). That covers four banks, and hence will only give you 1/4 of peak LDS throughput if you are doing operations on scalars (1/2 if you access 2 or 4 elements at once and the compiler is being nice to you, I think).
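The bank arithmetic behind that 1/4 figure can be sketched in plain C. This assumes a simplified model that is not taken from any hardware documentation: banks are 4 bytes wide, consecutive 32-bit words go to consecutive banks, and each lane reads one 32-bit word. The function names lds_bank and distinct_banks are hypothetical.

```c
#include <assert.h>

/* Bank that holds a given byte address, assuming 4-byte-wide banks with
 * consecutive 32-bit words interleaved across banks (simplified model). */
unsigned lds_bank(unsigned byte_addr, unsigned num_banks)
{
    assert(num_banks > 0);
    return (byte_addr / 4u) % num_banks;
}

/* Number of distinct banks touched when `lanes` work-items each read one
 * 32-bit word, with lane i reading at byte offset i * stride_bytes. */
unsigned distinct_banks(unsigned lanes, unsigned stride_bytes, unsigned num_banks)
{
    unsigned char seen[64] = {0};
    unsigned count = 0;

    assert(num_banks > 0 && num_banks <= 64);
    for (unsigned i = 0; i < lanes; ++i) {
        unsigned b = lds_bank(i * stride_bytes, num_banks);
        if (!seen[b]) {
            seen[b] = 1;
            ++count;
        }
    }
    return count;
}
```

Under this model, 32 lanes reading consecutive floats (stride 4 bytes) touch all 32 banks, while 32 lanes each reading the .x component of consecutive float4s (stride 16 bytes) touch only 8, so four lanes share each bank. Real hardware may combine or broadcast accesses in ways this sketch does not capture.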