9 Replies Latest reply on Mar 12, 2009 5:22 PM by ryta1203

    Odd Performance Results

    ryta1203

      Given three simple kernels:

      kernel void foo1(float4 a<>, out float4 f1<>) { f1 = a; }

      kernel void foo2(float4 a<>, out float4 f1<>, out float4 f2<>) { f1 = a; f2 = a; }

      kernel void foo3(float4 a<>, out float4 f1<>, out float4 f2<>, out float4 f3<>) { f1 = a; f2 = a; f3 = a; }

      Why would foo3 run faster than foo1 and foo2 at small stream sizes such as <8,8> and <16,16>? This confused me; it doesn't seem to happen at larger stream sizes, say <1024,1024>.

      I looked at the ISA and it's the same for all three, except that foo2 has one more bundle than foo1 (all MOV instructions) and has burstcount(1), and foo3 has one more bundle than foo2 (all MOV instructions) and has burstcount(2).

        • Odd Performance Results
          ryta1203

          In case this wasn't clear: foo3 is running FASTER than both foo2 and foo1, but only for very small stream sizes, for example 1 or 2 wavefronts (this is all I have tested so far).
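To put the stream sizes in terms of wavefronts, here is a rough sketch of the arithmetic. It assumes 64 threads per wavefront and one thread per stream element. Neither figure is stated in this thread, and both depend on the chip and on how the runtime maps the domain, so `WAVEFRONT_SIZE` and `wavefronts` are illustrative names only.

```python
# Assumption: 64 threads per wavefront (not stated in the thread; varies by chip).
WAVEFRONT_SIZE = 64


def wavefronts(width, height, wavefront_size=WAVEFRONT_SIZE):
    """Wavefronts needed for a width x height stream, one thread per element."""
    elements = width * height
    # Integer ceiling division: a partially filled wavefront still counts as one.
    return (elements + wavefront_size - 1) // wavefront_size


for dims in [(8, 8), (16, 16), (1024, 1024)]:
    print(dims, wavefronts(*dims))
```

Under these assumptions the small streams occupy only a handful of wavefronts, while <1024,1024> occupies thousands, which is why the two regimes can behave so differently.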

            • Odd Performance Results
              ryta1203

              I'm going to assume that this is some kind of bug and AMD has no idea why this happens.

                • Odd Performance Results
                  MicahVillmow

                  Ryta,

                   There are a lot of reasons why the performance can differ, and there just is not enough information right now to make a valid judgement on it. When you are dealing with sizes that small, you are no longer hitting the normal bottlenecks on the chip, and it requires detailed analysis to figure out exactly what is causing the perceived performance differences.
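One way to gather more information before drawing conclusions is to time each kernel many times and compare the spread, not a single run. The sketch below is a generic host-side harness, not part of Brook+ or any AMD tool. `benchmark`, `clearly_faster`, and `stand_in_kernel` are hypothetical names, and the comparison rule is a crude heuristic, not a statistical test. In practice `run_once` would launch the kernel and wait for it to finish.

```python
import statistics
import time


def benchmark(run_once, repeats=200, warmup=20, timer=time.perf_counter):
    """Call run_once() repeatedly and return timing statistics in timer units."""
    for _ in range(warmup):
        run_once()  # untimed runs, so one-time setup costs are excluded
    samples = []
    for _ in range(repeats):
        start = timer()
        run_once()
        samples.append(timer() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def clearly_faster(a, b):
    """Heuristic: a beats b only if the median gap exceeds both spreads combined."""
    return b["median"] - a["median"] > a["stdev"] + b["stdev"]


def stand_in_kernel():
    # Placeholder workload; replace with a real kernel launch plus synchronization.
    sum(range(1000))


result = benchmark(stand_in_kernel, repeats=50, warmup=5)
print(result)
```

If `clearly_faster` returns False in both directions for two kernels, the measured difference is within run-to-run noise and may not reflect the kernels themselves.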

                   

                    • Odd Performance Results
                      ryta1203

                       

                      Originally posted by: MicahVillmow Ryta,

                       There are a lot of reasons why the performance can differ, and there just is not enough information right now to make a valid judgement on it. When you are dealing with sizes that small, you are no longer hitting the normal bottlenecks on the chip, and it requires detailed analysis to figure out exactly what is causing the perceived performance differences.

                       

                       

                      OK. Yeah, it just seems odd that outputting 2 floats is slower than outputting 3, everything else being equal.