Another Performance Question

Discussion created by ryta1203 on Feb 5, 2009
Latest reply on Feb 9, 2009 by ryta1203
I have two very simple kernels:

kernel void step1(float4 a<>, float4 b<>, out float c<>, out float d<>)
{
    // Sum the four components of each input into its own scalar output.
    c = a.x + a.y + a.z + a.w;
    d = b.x + b.y + b.z + b.w;
}

kernel void step2(float4 a<>, float4 b<>, out float4 e<>)
{
    // Pack both sums into one float4 output; e.z and e.w are left unassigned.
    e.x = a.x + a.y + a.z + a.w;
    e.y = b.x + b.y + b.z + b.w;
}

All the streams are the same size, let's say 2048. I iterate over the kernels 2048 times (just to get longer timing results for the GPU).
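
For scale, the per-element memory traffic of the two kernels can be tallied with a quick Python sketch. It assumes 4-byte floats, 2048 elements per stream, and that step2 writes the full 16-byte float4 even though only e.x and e.y are assigned. The function and variable names are made up for illustration.

```python
FLOAT_BYTES = 4                  # size of one 32-bit float
FLOAT4_BYTES = 4 * FLOAT_BYTES   # size of one float4

def traffic_per_element(inputs_float4, outputs_float, outputs_float4):
    """Return (bytes_read, bytes_written) for one stream element."""
    read = inputs_float4 * FLOAT4_BYTES
    written = outputs_float * FLOAT_BYTES + outputs_float4 * FLOAT4_BYTES
    return read, written

# step1: two float4 inputs, two float outputs
step1 = traffic_per_element(2, 2, 0)
# step2: two float4 inputs, one float4 output (assumed to be written in full)
step2 = traffic_per_element(2, 0, 1)

ELEMENTS = 2048     # assumed element count per stream
ITERATIONS = 2048   # number of kernel invocations from the post

for name, (read, written) in (("step1", step1), ("step2", step2)):
    total = (read + written) * ELEMENTS * ITERATIONS
    print(f"{name}: {read} B read, {written} B written per element, "
          f"{total / 2**20:.0f} MiB total")
```

Under these assumptions both kernels read the same amount per element, and step2 writes twice as many bytes as step1.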

My question is this:

Why does the first kernel run significantly faster than the second kernel?

Looking at the KSA, the GPR count is lower for the 2nd kernel, and the ALU:Fetch ratio is 1.25 for the 2nd kernel versus 2.5 for the 1st. Since the GPR count is higher for the 1st kernel, fewer of its wavefronts will fit in the run queue. The KSA also says that throughput and threads/clock should both be higher for the 2nd kernel (which is another reason the KSA "measurables" don't say much about actual performance).
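
The GPR argument can be put as a rough calculation: a SIMD has a fixed register budget, so the number of wavefronts that can be resident at once is roughly that budget divided by the GPRs each thread of the kernel uses. The sketch below is in Python. Its default budget of 256 GPRs per thread is an assumed figure that should be checked against the hardware documentation, and the GPR counts in the loop are hypothetical, not the KSA numbers for these two kernels.

```python
def wavefronts_in_flight(gprs_per_thread, gpr_budget=256):
    """Estimate resident wavefronts per SIMD from a kernel's GPR usage.

    gpr_budget is an assumed per-thread register budget, not a measured value.
    """
    if gprs_per_thread < 1:
        raise ValueError("a kernel uses at least one GPR")
    return gpr_budget // gprs_per_thread

# Hypothetical GPR counts, for illustration only.
for gprs in (3, 5, 10):
    print(gprs, "GPRs ->", wavefronts_in_flight(gprs), "wavefronts")
```

The point is only the direction of the effect: more GPRs per thread means fewer wavefronts available to switch between.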

My only guess is that the higher ALU:Fetch ratio allows better latency hiding across wavefronts on the GPU. Is that an accurate statement?
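
To spell out what that guess would mean, here is a toy Python model of latency hiding. It assumes each wavefront alternates a burst of ALU work with a fetch wait, and that other resident wavefronts can run during the wait. This is a simplification for illustration, not a description of the real scheduler, and all the numbers are hypothetical.

```python
def alu_utilization(wavefronts, alu_cycles, fetch_latency):
    """Fraction of time the ALUs are busy in a simple round-robin model.

    Each wavefront alternates alu_cycles of work with fetch_latency cycles
    of waiting; other wavefronts can fill the wait.
    """
    period = alu_cycles + fetch_latency
    return min(1.0, wavefronts * alu_cycles / period)

# Hypothetical numbers: with the same wavefront count and fetch latency,
# more ALU work per fetch keeps the ALUs busy for a larger share of the time.
print(alu_utilization(wavefronts=8, alu_cycles=10, fetch_latency=300))
print(alu_utilization(wavefronts=8, alu_cycles=20, fetch_latency=300))
```

In this model a higher ALU:Fetch ratio does raise utilization for a given number of wavefronts, but whether that explains the measured difference depends on the real latencies and wavefront counts.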