7 Replies Latest reply on Aug 23, 2008 12:59 AM by eduardoschardong

    Gflops with Simple_matmult and optimized_matmult

    jonathan81

      Hello,

      I ran some tests to measure Gflops on my Radeon 3870.

      In the simple_matmult sample I used the arguments -t -i 10000, and I added a line that prints Width, Height, Time and Gflops (the same output as optimized_matmult).

      With simple_matmult I get:

      Width   Height  Iterations      Time            Gflops
      64      64          10000           0.718681        6.794130

      with optimized_matmult:

      Width   Height  Iterations      Time            Gflops
      64      64          10000           1.084024        4.504341

      CPU Time : 15.086446

      Can you post your results for these two sample projects?

      I find these Gflops results poor, and it is very odd that optimized_matmult takes longer than simple_matmult.

      Thanks in advance

      Jonathan

      Edit :

      Better results with a width and height of 1024 and 100 iterations:

      simple_matmult :

      Width   Height  Iterations      Time            Gflops
      1024    1024    100             20.184839       9.908427

      optimized_matmult

      Width   Height  Iterations      Time            Gflops
      1024    1024    100             2.112175        94.689126
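
The Gflops column in the tables above can be reproduced from the other columns. A minimal sketch, assuming square matrices, 2*N^3 floating-point operations per multiplication, and a divisor of 2^30 rather than 10^9. That formula is inferred from the posted numbers, not taken from the sample source, and the function name is made up for illustration:

```python
def reported_gflops(width, height, iterations, seconds):
    """Reproduce the Gflops column from the other columns of a result row.

    Assumption: each multiplication costs 2 * width * height * width
    floating-point operations and the total is divided by 2**30.
    """
    flops = 2.0 * width * height * width * iterations
    return flops / seconds / 2**30


# Rows posted above for the Radeon 3870.
print(reported_gflops(64, 64, 10000, 0.718681))    # simple_matmult
print(reported_gflops(1024, 1024, 100, 2.112175))  # optimized_matmult
```

For these two rows the result agrees with the posted 6.794130 and 94.689126 to about three decimal places.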

       

       

       

        • Remotion

          Hi,


          Here are results on my new Radeon 4870:

          with simple_matmult :

          Width   Height  Iterations      Time            Gflops         
          64      64      10000           0.571594        8.542454 

          Width   Height  Iterations      Time            Gflops         
          640     640     100             2.423211        20.150175  

           

          with optimized_matmult:

          Width   Height  Iterations      Time            Gflops         
          64      64      10000           1.075558        4.539794  

          Width   Height  Iterations      Time            Gflops         
          640     640     100             0.852667        57.265160

          As you can see, in the second test the results are quite different.

          I think this is because with 10000 iterations a very small 64*64 kernel is launched from the CPU every time, and that is slow.

           

          Edit

          with simple_matmult :

          Width   Height  Iterations      Time            Gflops         
          1024    1024    100             10.006094       19.987819

          with optimized_matmult:

          Width   Height  Iterations      Time            Gflops         
          1024    1024    100             1.401922        142.661312

          Remo
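
To make the launch-overhead argument concrete, the tables above can be converted to time per iteration. A rough sketch using only the numbers posted for the 4870. The helper name is invented here, and it assumes Time is the total wall time in seconds for all iterations:

```python
def microseconds_per_iteration(total_seconds, iterations):
    """Average time of one iteration in microseconds."""
    return total_seconds / iterations * 1e6


# optimized_matmult rows posted above (Radeon 4870).
small = microseconds_per_iteration(1.075558, 10000)  # 64x64
large = microseconds_per_iteration(1.401922, 100)    # 1024x1024

# A 1024x1024 product needs (1024 / 64) ** 3 = 4096 times the arithmetic
# of a 64x64 product, yet it takes far less than 4096 times as long.
work_ratio = (1024 // 64) ** 3
time_ratio = large / small

print(f"64x64: {small:.1f} us, 1024x1024: {large:.1f} us")
print(f"work ratio: {work_ratio}, time ratio: {time_ratio:.0f}")
```

The time grows by roughly 130x while the work grows by 4096x, which is what one would expect if a fixed per-call cost dominates the small case.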

            • jonathan81

              Thanks

              Initialization in optimized_matmult is more expensive than in simple_matmult, so with many iterations and a small size we lose the benefit of the optimization because the kernels are so small.

               

              • eduardoschardong
                Originally posted by: Remotion
                Here are results on my new Radeon 4870:



                Great card

                Originally posted by: Remotion
                I think this is because with 10000 iterations a very small 64*64 kernel is launched from the CPU every time, and that is slow.



                Yep, those streams are too small for GPUs.

                Originally posted by: Remotion
                Width   Height  Iterations      Time            Gflops
                1024    1024    100             1.401922        142.661312



                But 1024 seems to be too small for the 4870 as well. Just to find the peak, could you run a single iteration with 8192x8192?

                Thanks in advance.
                  • Remotion

                    Here is a test with 4096*4096 using optimized_matmult:

                    Width   Height  Iterations      Time            Gflops         
                    4096    4096    100             50.357503       254.182580     

                    I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

                    Width   Height  Iterations      Time            Gflops         
                    6144    6144    1               2.919084        147.991647

                    This last test was pretty long and it appeared that my system was hanging, but here is the result.
                    Width   Height  Iterations      Time            Gflops          
                    6144    6144    100             168.628834      256.183946   

                    So it seems that 256 Gflops is the peak.   

                      • eduardoschardong
                        Originally posted by: Remotion
                        I cannot run 8192x8192 or even 7168x7168; only 6144x6144 works for me.

                        I suggested 8192x8192 because of the texture size limit and forgot about the installed memory...
                        7168x7168x4x3 = 588MB
                        6144x6144x4x3 = 432MB
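
The same arithmetic as a small sketch, assuming three square float matrices (A, B and the result) at 4 bytes per element, with MB meaning 2^20 bytes, which is what the figures above imply. The 512 MB of installed memory is an assumption for illustration; the thread does not state it, and in practice not all installed memory is available to streams.

```python
BYTES_PER_FLOAT = 4
MATRICES = 3  # A, B and the result


def footprint_mb(n):
    """Memory needed for three n x n float matrices, in units of 2**20 bytes."""
    return n * n * BYTES_PER_FLOAT * MATRICES / 2**20


ASSUMED_CARD_MB = 512  # hypothetical; not stated in the thread

for n in (6144, 7168, 8192):
    mb = footprint_mb(n)
    verdict = "fits" if mb <= ASSUMED_CARD_MB else "does not fit"
    print(f"{n}x{n}: {mb:.0f} MB, {verdict} in {ASSUMED_CARD_MB} MB")
```

Under that assumption only the 6144x6144 case fits, which matches what Remotion observed.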

                        Originally posted by: Remotion
                        So it seems that 256 Gflops is the peak.


                        Much higher than any CPU but still far below the 1.2 TFlops maximum. I am trying to find what is limiting it, maybe texture fetches? In that case a more_optimized_matmult would solve it...
                        Looking at the disassembly, it doesn't look like it's ALU limited.
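
For scale, here is the gap as a fraction of peak. A back-of-the-envelope sketch, assuming the 1.2 TFlops figure comes from 800 stream processors at 750 MHz, each doing one multiply-add (2 flops) per clock, and assuming the sample divides by 2^30, which the posted tables suggest. Neither assumption is confirmed in this thread:

```python
def peak_gflops(stream_processors=800, clock_ghz=0.75, flops_per_clock=2):
    """Theoretical single-precision peak in units of 1e9 flops per second."""
    return stream_processors * clock_ghz * flops_per_clock


reported = 256.183946                    # 6144x6144, 100 iterations
decimal_gflops = reported * 2**30 / 1e9  # rescaled if the divisor was 2**30
peak = peak_gflops()

print(f"peak: {peak:.0f} Gflops")
print(f"as reported: {reported / peak:.1%} of peak")
print(f"rescaled:    {decimal_gflops / peak:.1%} of peak")
```

Either way the measured rate is a bit over a fifth of the theoretical peak.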
                        BTW, looking at the disassembly I found what seems to be a poor job by the optimizer: it generates a good number of MULs and ADDs. Replacing
                        accumulator1 += A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
                        by
                        accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
                        was enough to turn all MULs and some ADDs into MULADDs, and that is even with fast math enabled. Also, after removing i0 and modifying the while loop, the compiler generated 3 MOVs at the end which I didn't understand. My loop:

                        while(index.w < loopVar0)
                        {
                            // Fetching values from A
                            float4 A11 = A1[index.wy];
                            float4 A22 = A2[index.wy];
                            float4 A33 = A3[index.wy];
                            float4 A44 = A4[index.wy];
                            float4 A55 = A5[index.wy];
                            float4 A66 = A6[index.wy];
                            float4 A77 = A7[index.wy];
                            float4 A88 = A8[index.wy];

                            // Fetching values from B
                            float4 B11 = B1[index.xw];
                            float4 B22 = B2[index.xw];
                            float4 B33 = B3[index.xw];
                            float4 B44 = B4[index.xw];

                            accumulator1 = accumulator1 + A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
                            accumulator2 = accumulator2 + A22.xxxx * B11.xyzw + A22.yyyy * B22.xyzw + A22.zzzz * B33.xyzw + A22.wwww * B44.xyzw;
                            accumulator3 = accumulator3 + A33.xxxx * B11.xyzw + A33.yyyy * B22.xyzw + A33.zzzz * B33.xyzw + A33.wwww * B44.xyzw;
                            accumulator4 = accumulator4 + A44.xxxx * B11.xyzw + A44.yyyy * B22.xyzw + A44.zzzz * B33.xyzw + A44.wwww * B44.xyzw;
                            accumulator5 = accumulator5 + A55.xxxx * B11.xyzw + A55.yyyy * B22.xyzw + A55.zzzz * B33.xyzw + A55.wwww * B44.xyzw;
                            accumulator6 = accumulator6 + A66.xxxx * B11.xyzw + A66.yyyy * B22.xyzw + A66.zzzz * B33.xyzw + A66.wwww * B44.xyzw;
                            accumulator7 = accumulator7 + A77.xxxx * B11.xyzw + A77.yyyy * B22.xyzw + A77.zzzz * B33.xyzw + A77.wwww * B44.xyzw;
                            accumulator8 = accumulator8 + A88.xxxx * B11.xyzw + A88.yyyy * B22.xyzw + A88.zzzz * B33.xyzw + A88.wwww * B44.xyzw;

                            index.w += 1.0f;
                            // Reducing iterator
                            //i0 = i0 - 1.0f;
                        }
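
One possible reading of why the rewrite helps: the += form adds the four products together first and only then adds that sum to the accumulator, whereas the explicit form is evaluated left to right, so every addition has a product as one operand and can become a MULADD. The sketch below is a toy model in Python, not the Brook+ compiler. It assumes a compiler that fuses an add with a multiply operand but never reorders additions, and it ignores swizzles. All names are invented for illustration.

```python
class Node:
    """Expression-tree node that records how an expression was grouped."""

    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

    def __add__(self, other):
        return Node("add", self, other)

    def __mul__(self, other):
        return Node("mul", self, other)


def leaf():
    return Node("leaf")


def count_ops(node, counts=None):
    """Count MUL/ADD/MAD, fusing add(x, mul) into one MAD, no reassociation."""
    if counts is None:
        counts = {"MUL": 0, "ADD": 0, "MAD": 0}
    if node.op == "leaf":
        return counts
    if node.op == "mul":
        counts["MUL"] += 1
        count_ops(node.left, counts)
        count_ops(node.right, counts)
        return counts
    # node.op == "add": fuse with one multiply operand if there is one.
    for prod, other in ((node.right, node.left), (node.left, node.right)):
        if prod.op == "mul":
            counts["MAD"] += 1
            count_ops(prod.left, counts)
            count_ops(prod.right, counts)
            count_ops(other, counts)
            return counts
    counts["ADD"] += 1
    count_ops(node.left, counts)
    count_ops(node.right, counts)
    return counts


def four_products():
    return [leaf() * leaf() for _ in range(4)]


# acc += p1 + p2 + p3 + p4   (groups as acc + (((p1 + p2) + p3) + p4))
p1, p2, p3, p4 = four_products()
acc = leaf()
acc += p1 + p2 + p3 + p4
compound = count_ops(acc)

# acc = acc + p1 + p2 + p3 + p4   (groups as (((acc + p1) + p2) + p3) + p4)
p1, p2, p3, p4 = four_products()
acc = leaf()
acc = acc + p1 + p2 + p3 + p4
explicit = count_ops(acc)

print("+= form:      ", compound)
print("explicit form:", explicit)
```

In this model each accumulator line costs 1 MUL, 3 MADs and 1 ADD in the += form and 4 MADs in the explicit form, which is at least consistent with the disassembly described above.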