Hi everyone,
I'm trying to understand VLIW utilization on a Radeon HD 5870.
Suppose you have the following kernel:
__kernel void saxpy(const __global float *x, __global float *y, const float a)
{
    uint guid = get_global_id(0);
    y[guid] = a * x[guid] + y[guid];
}
Each work-item operates on a single vector element, and there is no explicit vectorization (no float4).
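To make the comparison concrete, here is a rough host-side emulation in plain C (not OpenCL) of the scalar mapping above versus a float4-style mapping in which each work-item handles four consecutive elements. The function names are hypothetical, made up for illustration, and n is assumed to be a multiple of 4. The point is only that the float4-style work-item contains four independent multiply-adds, while the scalar one contains a single one.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation only: each *_work_item call mimics what one OpenCL
   work-item with global id gid would do. Names are hypothetical. */

/* One element per work-item, as in the saxpy kernel above:
   each work-item issues a single multiply-add. */
static void saxpy_work_item(size_t gid, const float *x, float *y, float a)
{
    y[gid] = a * x[gid] + y[gid];
}

/* float4-style: four consecutive elements per work-item. The four
   multiply-adds do not depend on each other, so a VLIW compiler would be
   free to place them in separate slots of one instruction bundle. */
static void saxpy4_work_item(size_t gid, const float *x, float *y, float a)
{
    for (size_t k = 0; k < 4; ++k) {
        size_t i = 4 * gid + k;
        y[i] = a * x[i] + y[i];
    }
}

/* Emulate an NDRange launch of the scalar kernel: global size n. */
void launch_saxpy(size_t n, const float *x, float *y, float a)
{
    for (size_t gid = 0; gid < n; ++gid)
        saxpy_work_item(gid, x, y, a);
}

/* Emulate a launch of the float4-style variant: global size n / 4. */
void launch_saxpy4(size_t n, const float *x, float *y, float a)
{
    assert(n % 4 == 0); /* assumption: n is a multiple of 4 */
    for (size_t gid = 0; gid < n / 4; ++gid)
        saxpy4_work_item(gid, x, y, a);
}
```

Both launches produce the same y; the float4-style one simply uses a quarter as many work-items, each with four times the independent arithmetic.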
Is the compiler still able to pack instructions so that all the slots of each VLIW unit are used (on the HD 5870 these are the four x/y/z/w ALUs plus the transcendental t unit)?
Is there a tool that shows how the instructions are packed into VLIW bundles?
Thank you very much!