strange benchmark

Discussion created by josopait on Sep 25, 2008
Latest reply on Sep 29, 2008 by ahu

I ran a few benchmark tests and am a bit puzzled by the results. Maybe someone here can help. I have an HD 4870 X2 graphics board, running on 64-bit Gentoo Linux.

I tried the following IL kernel:


dcl_literal l1, 0,-1,0,0
dcl_literal l10,0x33333333,0x3fe33333,0x66666666,0x3fe66666  ; l10.xy=0.6, l10.zw=0.7
dcl_input_interp(linear) v0.xy
dcl_output_generic o0
dcl_cb cb0[2]              ; cb0[0]=100000
mov r100, v0.x
add r100, r100, v0.y
f2d r100.xy__, r100
mov r0.x___, cb0[0].x000
whileloop
break_logicalz r0.x000
iadd r0.x___,r0.x000,l1.y000
dmad r100.xy__, r100, l10.xy00, l10.zw00
< repeats 100 times >
dmad r100.xy__, r100, l10.xy00, l10.zw00
endloop
mov o0, r100

The kernel is basically a loop that executes a hundred thousand times, with 100 dmad instructions in the loop body. It involves no memory reads, so it should give a good estimate of the theoretical performance. The generated assembler code looks as expected, with 100 MULADD_64 instructions inside the loop.

First, I executed this kernel with a single thread by using a 1x1 domain. It takes about 0.216s to run. When I increase the number of dmad instructions in the loop, the execution time increases linearly by 1.92E-3s per instruction (as long as the loop contains fewer than about 1000 instructions; beyond that the execution time increases significantly). Dividing this by the loop count of 100000 gives an absolute time of 1.92E-8s per instruction. According to the specs, the GPU clock rate is 750 MHz. Multiplying 1.92E-8s by 750 MHz gives 14.4, which is the number of clock cycles required for one instruction. Why is this so large? I remember reading somewhere that one instruction takes only 4 clock cycles. By the way, the measured time is only the time it takes to execute the kernel. It does not include the compilation time (I made that mistake before).
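The arithmetic in this paragraph can be rechecked with a minimal Python sketch. It uses only the figures quoted above (the 1.92E-3s slope, the 100000 loop iterations, the 750 MHz clock and the 0.216s total). The function and variable names are made up for illustration, and the fixed-overhead estimate assumes the linear trend holds all the way down to an empty loop body, which was not measured.

```python
# Figures quoted in the post; the names are illustrative only.
GPU_CLOCK_HZ = 750e6             # GPU clock rate from the specs
LOOP_ITERATIONS = 100_000        # loop count stored in cb0[0]
SLOPE_S_PER_INSTRUCTION = 1.92e-3  # extra run time per dmad added to the loop body
TOTAL_TIME_S = 0.216             # measured run time with 100 dmads in the loop
DMADS_IN_LOOP = 100


def cycles_per_instruction(slope_s, iterations, clock_hz):
    """Convert the measured slope into clock cycles per executed instruction."""
    seconds_per_instruction = slope_s / iterations
    return seconds_per_instruction * clock_hz


cycles = cycles_per_instruction(
    SLOPE_S_PER_INSTRUCTION, LOOP_ITERATIONS, GPU_CLOCK_HZ
)

# Assumption: the linear fit extends to zero instructions, so whatever is
# left after subtracting the dmad time is fixed cost (loop control, launch).
fixed_overhead_s = TOTAL_TIME_S - DMADS_IN_LOOP * SLOPE_S_PER_INSTRUCTION

print(f"{cycles:.1f} clock cycles per dmad")
print(f"{fixed_overhead_s:.3f} s not accounted for by the dmads")
```

Under that assumption, the 100 dmads account for 0.192s of the 0.216s run, so the per-instruction cost dominates the measurement rather than any fixed launch cost.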

The above result is for Catalyst driver version 8.9; I updated from 8.8 to 8.9 today. The new driver seems to be a bit more stable, though it still crashes from time to time. Before the update, with Catalyst 8.8, the test described above executed about 40% faster.

I then increased the domain size. The execution time stays constant at 0.216s for larger domains until I reach a total domain size of 1280 (with both the x and y sizes being multiples of 2). For larger domain sizes, the time increases. This seems to suggest that there are 1280 stream processors available on each GPU core. But the specs say that there are 1600 stream processors, presumably 800 for each GPU core. How does this fit together? What domain size should I choose for maximum performance?
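To put the 1280-thread plateau in perspective, here is a rough Python sketch of the aggregate throughput implied by the numbers above. It assumes that a 1280-thread domain still finishes in 0.216s, that every thread executes all 100000 x 100 dmads, and that one dmad counts as two floating-point operations (the usual convention for a multiply-add). The names are illustrative, and the result is an estimate derived from the post's own timings, not a measured peak.

```python
# Figures quoted in the post; the names are illustrative only.
THREADS = 1280               # largest domain size that still runs in 0.216 s
LOOP_ITERATIONS = 100_000
DMADS_PER_ITERATION = 100
RUN_TIME_S = 0.216
GPU_CLOCK_HZ = 750e6
FLOPS_PER_DMAD = 2           # assumption: multiply + add counted separately


def dmads_per_second(threads, iterations, dmads_per_iteration, run_time_s):
    """Total dmad instructions retired per second across all threads."""
    total_dmads = threads * iterations * dmads_per_iteration
    return total_dmads / run_time_s


rate = dmads_per_second(THREADS, LOOP_ITERATIONS, DMADS_PER_ITERATION, RUN_TIME_S)
dmads_per_clock = rate / GPU_CLOCK_HZ
gflops = rate * FLOPS_PER_DMAD / 1e9

print(f"{dmads_per_clock:.1f} dmads retired per clock cycle")
print(f"{gflops:.1f} double-precision GFLOP/s")
```

Comparing this figure at several domain sizes (for example just below and just above 1280) would show directly where the throughput stops scaling.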

Thanks for any help,