
    strange benchmark

    josopait

      I did a few benchmark tests and am a bit puzzled about the results. Maybe someone here can help. I have a HD 4870 X2 graphics board, running on 64 bit gentoo linux.

      I tried the following IL kernel:

       

      il_ps_2_0
      dcl_literal l1, 0,-1,0,0
      dcl_literal l10,0x33333333,0x3fe33333,0x66666666,0x3fe66666  ; l10.xy=0.6, l10.zw=0.7
      dcl_input_interp(linear) v0.xy
      dcl_output_generic o0
      dcl_cb cb0[2]              ; cb0[0]=100000
      mov r100, v0.x
      add r100, r100, v0.y
      f2d r100.xy__, r100        ; convert the float sum to double precision
      mov r0.x___, cb0[0].x000   ; load the loop counter (100000)

      whileloop
      break_logicalz r0.x000     ; exit the loop when the counter reaches 0
      iadd r0.x___,r0.x000,l1.y000   ; decrement the counter (l1.y = -1)
      dmad r100.xy__, r100, l10.xy00, l10.zw00   ; double-precision r100 = r100*0.6 + 0.7
      ...
      < repeats 100 times >
      ...
      dmad r100.xy__, r100, l10.xy00, l10.zw00
      endloop

      mov o0, r100
      ret_dyn
      end


      It is basically a loop that is executed a hundred thousand times, with 100 dmad instructions inside the loop. It involves no memory reads, so it should give a good estimate of the theoretical performance. The assembler code looks as expected, with 100 MULADD_64 instructions within the loop.

      First, I executed this kernel with only one thread, using a domain size of 1x1. It takes about 0.216s to run. When I increase the number of dmad instructions in the loop, the execution time increases linearly by 1.92E-3s per instruction (as long as the number of instructions stays below about 1000; after that the execution time increases significantly). Dividing this by the loop count of 100000 gives an absolute time of 1.92E-8s per instruction. According to the specs, the GPU clock rate is 750 MHz. Multiplying 1.92E-8s by 750 MHz gives 14.4, which is the number of clock cycles required for one instruction. Why is this so large? I remember reading somewhere that one instruction takes only 4 clock cycles to run. By the way, the measured time is the time it takes to execute the kernel; it does not include the compilation time (I made that mistake before).
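
      For reference, the measurement loop looks roughly like this (a minimal sketch: it assumes the IL has already been compiled, linked, and loaded through the CAL module calls, and all error checking is omitted):

      #include <sys/time.h>
      #include "cal.h"

      /* Time only the kernel execution, not the compilation.
         ctx and func are assumed to be set up beforehand. */
      double run_and_time(CALcontext ctx, CALfunc func, CALuint w, CALuint h)
      {
          CALdomain domain = { 0, 0, w, h };   /* 1x1 for the single-thread test */
          CALevent ev = 0;
          struct timeval t0, t1;

          gettimeofday(&t0, NULL);
          calCtxRunProgram(&ev, ctx, func, &domain);
          while (calCtxIsEventDone(ctx, ev) == CAL_RESULT_PENDING)
              ;                                /* busy-wait until the kernel finishes */
          gettimeofday(&t1, NULL);

          return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
      }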

      The above result is for Catalyst driver version 8.9. Today I updated the Catalyst driver from 8.8 to 8.9. It seems to be a bit more stable, though it still crashes from time to time. Before the update, with Catalyst 8.8, the test described above executed about 40% faster.

      I then increased the domain size. The execution time stays very constant at 0.216s for larger domain sizes, until I reach a domain size of 1280 (with both x and y sizes being multiples of 2). For larger domain sizes, the time increases. This seems to suggest that there are 1280 stream processors available on each GPU core. But the specs say that there are 1600 stream processors, presumably 800 per GPU core. How does this fit together? What domain size should I choose for maximum performance?

      Thanks for any help,

      Ingo

       

        • strange benchmark
          sgratton

          Hi there,

          I wonder if your first observation could be explained by the fact that each VLIW processor runs the 4 threads of a "quad" from a wavefront in an interleaved fashion. If you only run 1 thread, it makes sense that you see roughly 1/4 of the performance.

          On the second one, remember that 800 stream cores = 160 VLIW processors. Each of these can do a DP MAD per clock, assuming it has 4 threads to work on. So 160*4 = 640, a factor of 2 off. (I remember reading somewhere that wavefronts were actually executed interleaved in pairs, which could explain this, but I'm not sure.)

          It's interesting to see how closely your results agree with the theoretical performance!

          Best,
          Steven.
          • strange benchmark
            MicahVillmow
            Ingo,
            It seems that at the 1280 domain size you are fully utilizing the chip with no delays anywhere. Once you go to larger domain sizes, the chip is already at 100% load, so adding more work adds more time.
            For more information on performance analysis, check out the presentations given by Justin Hensley and Jason Yang here:
            http://coachk.cs.ucf.edu/courses/CDA6938/
              • strange benchmark
                josopait

                Steven,

                The factor of 2 may come from the fact that the 4870 X2 has two GPU cores. I wasn't sure whether I had to handle the board as one device or as two, but this would suggest that the CrossFire link is already working and that I don't have to care about the fact that there are two GPU cores involved(?).

                I always get these stream processors, thread processors, and SIMD processors mixed up. But it seems to make more sense to me now. So there are 2*160 VLIW processors on the board. Each of them can do a DP MAD per clock, or five single-precision MADs. Five SP MADs can be regarded as 10 floating-point operations. Multiplying it all together gives the theoretical performance of
                (2*160 VLIW processors) * (10 FLOP/clock) * (750 MHz) = 2.4 TFLOPS.
                This is also what is written on the packaging.

                But then the question remains why my performance tests give such low results. In the test described above I use 1280 threads, each of which executes one VLIW instruction every 1.92E-8s. If I count each DP MAD instruction as five SP MAD instructions, I get
                (10 FLOP/VLIW) * (1280 threads) / (1.92E-8s) = 667 GFLOPS.
                That's too low by a factor of 3.6. Is there any way to speed this up?

                 

                  • strange benchmark
                    lpw

                     

                    Originally posted by: josopait

                    The factor of 2 may come from the fact that the 4870 X2 has two GPU cores. I wasn't sure whether I had to handle the board as one device or as two, but this would suggest that the CrossFire link is already working and that I don't have to care about the fact that there are two GPU cores involved(?).



                    For my 3870 X2, each core is exposed as an independent CAL device.  I can run a kernel on each in parallel.  I suspect that this would be the case for the 4870 X2 also.  Check the output of the calDeviceGetCount function.  If it shows two devices, then running your kernel simultaneously on both should take the same amount of time as running it on a single one.  Please keep us posted (I'm considering getting a 4870 X2 myself).
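
                    Something like this minimal sketch (error checking omitted) will show how many devices CAL exposes:

                    #include <stdio.h>
                    #include "cal.h"

                    int main(void)
                    {
                        CALuint count = 0;
                        calInit();
                        calDeviceGetCount(&count);    /* should report 2 for an X2 board */
                        printf("CAL devices: %u\n", count);
                        calShutdown();
                        return 0;
                    }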

                    L

                      • strange benchmark
                        josopait

                        lpw,

                        It is similar in my case: both cores show up as independent devices, and I can run the kernel on both simultaneously. But then the execution time increases from 0.216s to 0.361s. This doesn't seem quite right either.

                        Ingo

                          • strange benchmark
                            josopait

                            Ok, I just fixed the problem with the two cores. It appears that, for whatever reason, the execution of a kernel does not start before the execution status is checked by calCtxIsEventDone(). Because I first polled this function in a loop for core 0 and then for core 1, core 1 started only after core 0 had finished. You amd guys may want to fix this.
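
                            In case anyone else runs into this, the workaround is to launch on both contexts first and then poll both events in a single loop, roughly like this (a sketch with error checking omitted; ctx0/ctx1 and func0/func1 stand for the per-device contexts and entry points):

                            #include "cal.h"

                            /* ctx0/ctx1 and func0/func1 are assumed to be set up beforehand. */
                            void run_on_both(CALcontext ctx0, CALfunc func0,
                                             CALcontext ctx1, CALfunc func1,
                                             const CALdomain *domain)
                            {
                                CALevent ev0 = 0, ev1 = 0;

                                /* Kick off both kernels before polling anything: polling
                                   core 0 to completion first serializes the two GPUs. */
                                calCtxRunProgram(&ev0, ctx0, func0, domain);
                                calCtxRunProgram(&ev1, ctx1, func1, domain);

                                /* Poll both events in one loop so the cores run concurrently. */
                                int done0 = 0, done1 = 0;
                                while (!done0 || !done1) {
                                    if (!done0 && calCtxIsEventDone(ctx0, ev0) != CAL_RESULT_PENDING)
                                        done0 = 1;
                                    if (!done1 && calCtxIsEventDone(ctx1, ev1) != CAL_RESULT_PENDING)
                                        done1 = 1;
                                }
                            }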

                            I can now run the test kernel on both cores simultaneously, using 1280 threads each. This would correspond to an SP performance of 1.33 TFLOPS. So now I am 'only' off by a factor of 1.8.
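
                            Spelling out the arithmetic with the same numbers as before:
                            (10 FLOP/VLIW) * (2 * 1280 threads) / (1.92E-8s) = 1.33 TFLOPS.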

                              • strange benchmark
                                eduardoschardong
                                One more piece of information about latency that may help you.

                                The latency is in fact 8 cycles, since each wavefront must be interleaved with another. That is exactly 1.8 times faster than what you are achieving (14.4 / 8 = 1.8). With at least 2 wavefronts per SIMD core and 10 SIMD cores, the minimum domain size is 2 * 64 * 10 = 1280, which is the number you got.

                                So the question now is why your kernel is taking longer than expected. Try reducing the number of MADs, and increasing and decreasing the number of iterations. How does it perform with different parameters?
                                  • strange benchmark
                                    josopait

                                    I just downgraded the Catalyst driver from version 8.9 to 8.8. It now runs faster. For domains of size 1280 I get 2 TFLOPS, which is still 16% slower than the theoretical result. If I double the domain size to 2560, I get exactly the theoretical result of 2.4 TFLOPS.

                                    What did you AMD people do to the driver???

                        • strange benchmark
                          bayoumi
                          To be clear: you are dividing your data between the two cores and running two parallel kernels with independent data sets (as opposed to 1600 ALUs working on the same data)?
                          thanks
                          Amr
                          • strange benchmark
                            bayoumi
                            Thanks Josopait
                            Amr
                              • strange benchmark
                                ahu

                                It's great to hear that X2 multi-GPU actually works.

                                Has anybody done this in Windows? I have two 4870 X2 cards and I haven't yet been able to successfully use all four cores in GPGPU programs.

                                The best I could get from calDeviceGetCount was 3, and according to the SiSoft Sandra 2009 GPGPU test, the performance was only equal to a single 4870 GPU.

                                I'm using Windows Server 2003 x64 and Catalyst 8.9. I'm beginning to think that I have a hardware problem though, because games show severe texture problems in some scenes.