8 Replies Latest reply on Jan 16, 2009 5:27 AM by zpdixon

    Measuring HD 4850 performance

    zpdixon

      I wrote the following IL kernel to benchmark the MAD instruction on my HD 4850. It's a simple loop of 0x20000 iterations over 120 MAD instructions working on registers only. When disassembled to R700 asm, I can see it is translated to 480 MULADD instructions using the 5 SPUs (X, Y, Z, W, T). Even assuming the T SPU is not used, it should be capable of executing at least 1 MAD (4 MULADD) per clock, right? The HD 4850 is clocked at 625 MHz, so the loop should execute in at most 120*0x20000/625e6 = 0.025 sec. However, on my system I measure almost 10 times that number: 0.220 sec. I am using the SDK 1.3-beta on Linux x86-64. I am confident I am measuring the time correctly and that this is not a matter of fixed overhead, because if I execute 10 times more instructions, the kernel takes exactly 10 times longer to complete (2.2 sec). What could be the reason for achieving only 1/10th of the theoretical perf of the HD 4850?

       

      il_ps
      dcl_output o0
      dcl_literal l0, 0x0, 0x20000, 0xffffffff, 0x0

      mov r0.x, l0.y ; counter
      mov r1.x, l0.x ; total

      ixor r2, r2, r2
      ixor r3, r3, r3
      ixor r4, r4, r4

      whileloop
      break_logicalz r0

      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4
      mad r2, r2, r2, r2
      mad r3, r3, r3, r3
      mad r4, r4, r4, r4

      iadd r0.x, r0.x, l0.z ; counter--
      endloop

      iadd r1, r1, r2
      iadd r1, r1, r3
      iadd r1, r1, r4
      mov o0, r1
      end
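
As a rough check of the estimate in the post, the arithmetic can be written out in C. This is only a sketch: the function names `ideal_seconds` and `slowdown` are made up for illustration, and the model (one 4-component MAD retired per clock by a single thread) is the poster's own assumption, not a statement about how the hardware schedules work.

```c
#include <stdint.h>

/* Ideal run time in seconds for one thread, ASSUMING one 4-component MAD
   retires per clock (the assumption made in the post). */
double ideal_seconds(uint64_t mads_per_iter, uint64_t iterations, double clock_hz)
{
    return (double)(mads_per_iter * iterations) / clock_hz;
}

/* How many times slower the measured run is than the ideal estimate. */
double slowdown(double measured_seconds, double ideal)
{
    return measured_seconds / ideal;
}
```

Calling `ideal_seconds(120, 0x20000, 625e6)` reproduces the 0.025 sec figure, and `slowdown(0.220, ...)` puts the measured gap at a bit under 9x.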

       

        • Measuring HD 4850 performance
          rahulgarg
          On a 4870, running 18 MAD inside the loop instead of your 120, and running it 0x2000000 times, I can reach about 99% of theoretical performance. Something is wrong with your timing code or initialization code.
          • Measuring HD 4850 performance
            josopait

            zpdixon,

            I did a similar benchmark some time ago, see the thread titled 'strange benchmark':

            http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100672

             

            The VLIW processors don't execute one instruction per clock cycle. Instead, they cycle between four or more threads. My analysis showed that each VLIW processor can execute 4 instructions in 5 clock cycles if it has 4 threads to work on, or it can execute 8 instructions in 8 clock cycles if it has 8 threads to work on. Therefore, you should generally parallelise your program so that every VLIW processor has at least 4, better 8, threads to work on.

            Given that, your code should execute with 20% of your calculated speed if the number of threads is small.

            My HD4870x2 board still does not run at full speed. Maybe this affects your board as well? Running

            "aticonfig --adapter=0 --od-getclocks"

            shows that the current clock is 507 MHz instead of 750 MHz, which is also consistent with my test results. I hope the people at ATI will fix this some time.

            Ingo

             

              • Measuring HD 4850 performance
                zpdixon

                I am indeed running only a small number of threads: I call calCtxRunProgram once with a domain size of {0, 0, 1, 1} -> 4 pixels, so 4 threads, right?

                Also my timing may be slightly inaccurate. I do exactly this:

                gettimeofday(tv0...);
                calCtxRunProgram(...);
                while (calCtxIsEventDone(...) == CAL_RESULT_PENDING)
                    nanosleep(...); // sleep 1ms
                gettimeofday(tv1...);

                // -> time is tv1 - tv0

                Technically I should start counting time right before calCtxIsEventDone() instead of before calCtxRunProgram(), because the kernel is scheduled for execution the first time calCtxIsEventDone() is called.
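
The "tv1 - tv0" step in the snippet above can be sketched as a small helper. The name `elapsed_seconds` is hypothetical; only `struct timeval` and its `tv_sec` / `tv_usec` fields come from the standard `gettimeofday` interface. The CAL calls are left out, since their arguments are elided in the post.

```c
#include <sys/time.h>

/* Seconds elapsed between two gettimeofday() samples, t0 taken first.
   The microsecond difference may be negative; adding it to the whole-second
   difference still gives the right result. */
double elapsed_seconds(const struct timeval *t0, const struct timeval *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec)
         + (double)(t1->tv_usec - t0->tv_usec) * 1e-6;
}
```

Note that with a 1 ms sleep inside the polling loop, the measured time has a granularity of at least about 1 ms, which is small next to a 0.22 sec run.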

                I'll check the clock frequency of my HD 4850 with aticonfig tonight, I'll also experiment with more threads.

                Thanks.

              • Measuring HD 4850 performance
                MicahVillmow
                zpdixon,
                If you are only running 4 threads, then you are only running on 1 SIMD, so you should only run at 1/10th speed. To fully utilize the GPU you must run a minimum of 1280 threads, as anything less will leave some SIMDs idle.


                Also, 4 threads utilize only 1/16th of a SIMD on the RV770, as the wavefront size is 64 threads. A wavefront is a hardware thread, and all software threads inside a wavefront run in parallel.
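
The occupancy figures in this reply can be worked through with a short sketch. The wavefront size of 64 is taken from the post; the count of 10 SIMDs is an assumption here (it is what the "1/10th speed" remark implies), as is the simple model that wavefronts are spread one per SIMD before any SIMD gets a second one. All function names are made up for illustration.

```c
/* WAVEFRONT_SIZE comes from the post; NUM_SIMDS = 10 is an ASSUMPTION
   consistent with the "1/10th speed" figure for an HD 4850. */
enum { WAVEFRONT_SIZE = 64, NUM_SIMDS = 10 };

/* Number of wavefronts needed to hold `threads` software threads. */
unsigned wavefronts_needed(unsigned threads)
{
    return (threads + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* SIMDs with at least one wavefront, ASSUMING one wavefront is placed
   on each SIMD before any SIMD receives a second. */
unsigned simds_busy(unsigned threads)
{
    unsigned w = wavefronts_needed(threads);
    return w < NUM_SIMDS ? w : NUM_SIMDS;
}

/* Fraction of wavefront lanes that carry a real thread. */
double wavefront_fill(unsigned threads)
{
    unsigned w = wavefronts_needed(threads);
    return w ? (double)threads / (double)(w * WAVEFRONT_SIZE) : 0.0;
}
```

Under these assumptions, 4 threads occupy one wavefront on one SIMD with 1/16th of its lanes filled, while 1280 threads form 20 full wavefronts, i.e. two per SIMD.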
                • Measuring HD 4850 performance
                  rahulgarg
                  josopait: About the clock speeds, the GPU downclocks when idle to save power. Under load, the clocks go back up to full speed automatically, similar to what happens on CPUs. The driver was probably just reporting the downclocked speed when you tested.
                    • Measuring HD 4850 performance
                      josopait

                      This is probably what is supposed to happen. However, even at full load, the clock speed always stays at 507 MHz:

                       

                      # aticonfig --adapter=0 --od-getclocks

                      Adapter 0 - ATI Radeon HD 4870 X2
                                                  Core (MHz)    Memory (MHz)
                                 Current Clocks :    507           500
                                   Current Peak :    750           900
                        Configurable Peak Range : [507-800]     [500-1000]
                                       GPU load :    99%

                       

                    • Measuring HD 4850 performance
                      rick.weber

                      zpdixon,

                      Also, you have instruction dependencies that will stall your pipeline if it is anything deeper than a 3-stage pipeline, i.e.:

                      mad r2, r2, r2, r2 < a
                      mad r3, r3, r3, r3
                      mad r4, r4, r4, r4
                      mad r2, r2, r2, r2 < depends on a
                      mad r3, r3, r3, r3
                      mad r4, r4, r4, r4

                       

                      I think I had to unroll to use 16 different registers when I did this test on a FireStream 9170. Also, as others have suggested, I used an 8192x8192 domain to maximize thread parallelism (though you can likely use something much smaller).

                        • Measuring HD 4850 performance
                          zpdixon

                          Ok, I finally reached ~970 GFLOPS on my 4850, or 97% of the peak theoretical perf :-) It turns out I had 2 problems:

                          1. I noticed that CAL was silently ignoring the domain size because my IL kernel did not define any input stream (!) It took me a while to figure that out...

                          2. And as you guys pointed out, one thread is far from sufficient to estimate the perf of even just one VLIW processor. In my case I need about 32 threads per VLIW processor to reach the peak perf. I use a domain size of 32x160 == 5120 threads.

                          rahulgarg: 18 MAD only allow my 4850 to attain 77% of the theoretical perf, 36 MAD 87%, 72 MAD 89%, and 144 MAD 95%. Above that it varies between 93-97% (because the CAL compiler fails to always make use of the 5 SPUs per VLIW processor). I personally unrolled the loop to execute 249 MAD per iteration as it is one of the numerous values above 150 that allows it to not waste any SPU.
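
For reference, the percentages above can be related to raw timings with a sketch like the following. It assumes the HD 4850's 800 stream processors each retire one MULADD (2 flops) per clock, and that every MAD in the kernel is a 4-component operation (8 flops). The function names are hypothetical, and the thread and iteration counts passed in are whatever the benchmark actually ran; the post does not give the final run's timing.

```c
/* Peak GFLOPS, ASSUMING each stream processor retires one MULADD
   (2 flops) per clock. */
double peak_gflops(unsigned stream_processors, double clock_hz)
{
    return stream_processors * 2.0 * clock_hz / 1e9;
}

/* Achieved GFLOPS for a kernel in which every thread executes
   `mads_per_iter` 4-component MADs (8 flops each) per loop iteration. */
double achieved_gflops(double threads, double mads_per_iter,
                       double iterations, double seconds)
{
    return threads * mads_per_iter * iterations * 8.0 / seconds / 1e9;
}
```

With 800 stream processors at 625 MHz the peak comes to 1000 GFLOPS, which matches the post's "970 GFLOPS, or 97%". By the same formula, the original 0.220 sec run would correspond to well under 1 GFLOPS if only a single thread executed.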

                          It makes no difference where I place the first gettimeofday call (before calCtxRunProgram or before calCtxIsEventDone).

                          josopait: according to aticonfig, my 4850 runs at 500MHz when idle and automatically jumps to 625MHz when running the IL kernel. I don't know why your 4870 stays at the low clock frequency...

                          rick.weber: well, instruction dependencies don't seem to matter at all to my IL kernel. In fact I even modified it to use only 2 registers (mad on r1; mad on r2; mad on r1; ...), as I noticed it is the simplest case where the CAL compiler is always able to optimize to use the 5 SPUs (when repeating the MAD instruction on the same register r1 over and over, CAL decides to "waste" the T SPU and only uses the X, Y, Z, W SPUs). And the fact that I run such a high number of threads (5120) renders instruction dependencies irrelevant, as the GPU context-switches to the next thread when a dependency stall occurs.