Odd Performance Results

Discussion created by ryta1203 on Mar 5, 2009
Latest reply on Mar 12, 2009 by ryta1203

Given three simple kernels:

kernel void foo1(float4 a<>, out float4 f1<>) { f1 = a; }

kernel void foo2(float4 a<>, out float4 f1<>, out float4 f2<>) { f1 = a; f2 = a; }

kernel void foo3(float4 a<>, out float4 f1<>, out float4 f2<>, out float4 f3<>) { f1 = a; f2 = a; f3 = a; }
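To make clear what the three kernels compute, here is a minimal CPU reference sketch in plain C. It assumes a hypothetical `float4` struct standing in for Brook+'s built-in type, and the names `foo1_ref`, `foo2_ref`, `foo3_ref` and `streams_equal` are illustrative only. Each kernel copies every element of `a` into each of its output streams. The sketch can be used to check output correctness after the GPU results are copied back to host memory; it says nothing about GPU timing.

```c
#include <stddef.h>

/* Hypothetical CPU stand-in for Brook+'s float4 (four packed floats). */
typedef struct { float x, y, z, w; } float4;

/* CPU equivalent of foo1: copy each element of a into f1. */
void foo1_ref(const float4 *a, float4 *f1, size_t n) {
    for (size_t i = 0; i < n; ++i) f1[i] = a[i];
}

/* CPU equivalent of foo2: copy each element of a into f1 and f2. */
void foo2_ref(const float4 *a, float4 *f1, float4 *f2, size_t n) {
    for (size_t i = 0; i < n; ++i) { f1[i] = a[i]; f2[i] = a[i]; }
}

/* CPU equivalent of foo3: copy each element of a into f1, f2 and f3. */
void foo3_ref(const float4 *a, float4 *f1, float4 *f2, float4 *f3, size_t n) {
    for (size_t i = 0; i < n; ++i) { f1[i] = a[i]; f2[i] = a[i]; f3[i] = a[i]; }
}

/* Returns 1 if the first n elements of p and q match component-wise, else 0. */
int streams_equal(const float4 *p, const float4 *q, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (p[i].x != q[i].x || p[i].y != q[i].y ||
            p[i].z != q[i].z || p[i].w != q[i].w)
            return 0;
    return 1;
}
```

For an <8,8> stream, `n` would be 64 elements. Because the kernels do no arithmetic, any timing difference between them comes from the extra output writes, not from computation.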

Why would foo3 run faster than foo1 and foo2 for small stream sizes such as <8,8>, <16,16>, etc.??? This confuses me. It doesn't seem to happen at larger stream sizes, say <1024,1024>.

I looked at the ISA, and it is the same except for two differences: foo2 has one more bundle than foo1 (all MOV instructions) and uses burstcount(1), and foo3 has one more bundle than foo2 (all MOV instructions) and uses burstcount(2).