Originally posted by: FangQ
In the "ATI Stream SDK v2.01 Performance and Optimization" document, I noticed the following suggestion:
Vectorize. AMD GPU hardware is fundamentally a five-wide VLIW unit. Vectorization can lead to substantially greater efficiency. The ALUPacking counter ...
Does that mean using swizzle operators whenever possible helps improve speed? I know OpenCL does not support float3, so I use float4 in most cases. If I declare float4 var; and do the calculation with var.xyz=..., will I gain any speed?
The benefit of vectorization is that an operation between two n-component vectors uses n functional units at a time (note that n is limited by the number of functional units per processing element on your OpenCL device).
So, for example, an operation between two uint4/float4/etc. elements uses 4 of the 5 available functional units per stream processor at a given time.
When you use swizzle operators, the number of functional units used in parallel per stream processor should depend on the number of components the operation is applied to. For example, if you have two uint4 variables, vec1 and vec2, and you write:
vec1.x + vec2.x
you will use only one functional unit per SP, because you are adding only one of the four components of these two uint4 vectors. That said, it is possible that the compiler applies some optimization at kernel compile time, but only AMD staff can clarify that.
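To make the difference concrete, here is a minimal sketch in plain host-side C rather than OpenCL C. The struct `uint4_t` and the functions `add4` and `add_x` are hypothetical stand-ins for the built-in `uint4` type and its operators; they only illustrate how many independent operations each expression contains, not how the AMD compiler actually schedules them.

```c
/* Hypothetical host-side stand-in for OpenCL's built-in uint4. */
typedef struct { unsigned x, y, z, w; } uint4_t;

/* Like vec1 + vec2: four adds that do not depend on each other,
   so they could occupy four functional units at once. */
uint4_t add4(uint4_t a, uint4_t b) {
    uint4_t r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Like vec1.x + vec2.x: a single add, so only one unit has work. */
unsigned add_x(uint4_t a, uint4_t b) {
    return a.x + b.x;
}
```

In a real kernel the compiler may still fill the remaining units with unrelated operations from nearby code, so the single-component form is not necessarily slower.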
Hope this can help
Thanks for your reply.
I just tried it on a program I have, and I found that the run time became slightly longer when using swizzles.
The only place I modified is the following line (it is in the inner loop of the kernel):
Everything else is the same.
For a given configuration, the original code took 2271 ms, but the modified version took 2279 ms (consistent with several repetitions).
Is this expected when using .xyz for a float4 vector?
Not sure what the deal is but for the second code I only get 1 ALU instruction (it packs just fine), so there's no way the ALU should be the problem.
I couldn't get the first code to compile in SKA.
Either way though, you aren't going to get any cheaper than 1 ALU, so it must be something to do with how that code interacts with the rest of your kernel, not that code itself.
That's my guess from the limited code you provided.
In general the IL compiler does a good job of extracting VLIW parallelism, so even if you write plain scalar code you will get good resource utilization. For your example, the compiler is able to use just 1 ALU slot in both cases: it packs up to 5 operations into one big instruction that can be performed in one cycle.
However, it is still important to keep in mind that the underlying architecture is VLIW. For example, if you write a kernel that continuously performs dependent operations, only 1/5 of the ALU units will be used. Sometimes you can merge two kernels (even if they do unrelated things) to increase ALU efficiency. Under some circumstances the T unit, which handles special operations, can be a bottleneck, so some VLIW slots may contain only one instruction.
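The point about dependent operations can be sketched in plain host-side C (not OpenCL C). The function names `sum_chain` and `sum_four_way` are hypothetical. The first forms one long dependency chain; the second keeps four independent accumulators, giving a VLIW compiler four unrelated adds per iteration to pack together. Whether the real compiler packs them is an assumption, and Stream KernelAnalyzer is the place to check.

```c
#include <stddef.h>

/* One dependency chain: each add needs the previous result,
   so only one operation is ready at any time. */
float sum_chain(const float *v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Four independent accumulators: the four adds inside the loop
   body do not depend on each other. */
float sum_four_way(const float *v, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements when n is not a multiple of 4 */
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```

Note that the second version adds the elements in a different order, so for general floating-point data the result can differ from the first in the last bits.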
Another point is that hardware usually performs memory fetches in blocks, so using SIMD types instead of scalar types for vectors can increase efficiency. For example, if you need to compute 1024 scalar outputs you could as well compute 256 float4 outputs instead, writing 4 elements per kernel invocation and improving ALU utilization (SIMD packing is also applicable to reduce the number of read operations).
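As a rough illustration of the 1024-versus-256 example, here is a sketch in plain host-side C (not OpenCL C). `float4_t`, `scale_scalar`, `scale_vec4` and the multiply-by-two operation are all hypothetical; `gid` plays the role of the work-item index that `get_global_id(0)` would return in a kernel.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's built-in float4. */
typedef struct { float x, y, z, w; } float4_t;

/* Scalar version: one read and one write per work-item,
   so 1024 outputs need 1024 invocations. */
void scale_scalar(const float *in, float *out, size_t gid) {
    out[gid] = 2.0f * in[gid];
}

/* float4 version: one wide read, four independent multiplies and
   one wide write per work-item, so 1024 outputs need 256 invocations. */
void scale_vec4(const float4_t *in, float4_t *out, size_t gid) {
    float4_t v = in[gid];
    float4_t r = { 2.0f * v.x, 2.0f * v.y, 2.0f * v.z, 2.0f * v.w };
    out[gid] = r;
}
```

Calling `scale_scalar` for `gid` 0..1023 and `scale_vec4` for `gid` 0..255 over the same data produces the same 1024 values; only the amount of work per invocation changes.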
Those are just some examples. Using Stream KernelAnalyzer you can measure the ALU utilization of your code and obtain some other useful statistics.