I wrote the following IL kernel to benchmark the MAD instruction on my HD 4850. It's a simple loop of 0x20000 iterations over 120 MAD instructions working on registers only. When disassembled to R700 asm, I can see it is translated to 480 MULADD instructions using the 5 SPUs (X, Y, Z, W, T). Even assuming the T SPU is not used, it should be capable of executing **at least** 1 MAD (4 MULADD) per clock, right? The HD 4850 is clocked at 625 MHz, so the loop should take at most 120*0x20000/625e6 = 0.025 sec. However, on my system I measure almost 10 times that number: 0.220 sec. I am using the SDK 1.3-beta on Linux x86-64. I am confident I am measuring the time correctly and that this is not some fixed overhead, because if I execute 10 times more instructions, the kernel takes exactly 10 times longer to complete (2.2 sec). What could be the reason for reaching only about 1/10th of the theoretical performance of the HD 4850?

il_ps

dcl_output o0

dcl_literal l0, 0x0, 0x20000, 0xffffffff, 0x0

mov r0.x, l0.y ; counter

mov r1.x, l0.x ; total

ixor r2, r2, r2

ixor r3, r3, r3

ixor r4, r4, r4

whileloop

break_logicalz r0

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2

mad r3, r3, r3, r3

mad r4, r4, r4, r4

iadd r0.x, r0.x, l0.z ; counter--

endloop

iadd r1, r1, r2

iadd r1, r1, r3

iadd r1, r1, r4

mov o0, r1

end
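The estimate in the opening post can be reproduced with a few lines of host-side C. This is only a sketch of that arithmetic, under the post's own assumption that a single thread retires one 4-component MAD per clock with the T unit idle; the function names `expected_seconds` and `slowdown` are made up for illustration and are not part of CAL.

```c
#include <assert.h>

/* Lower bound on run time in seconds, assuming (as the post does) that
 * one 4-component MAD retires per clock and the T unit stays idle. */
double expected_seconds(unsigned long mads_per_iter,
                        unsigned long iterations,
                        double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)mads_per_iter * (double)iterations / clock_hz;
}

/* How many times slower the measurement is than the estimate. */
double slowdown(double measured_s, double expected_s)
{
    assert(expected_s > 0.0);
    return measured_s / expected_s;
}
```

With the numbers from the post, `expected_seconds(120, 0x20000, 625e6)` gives about 0.0252 s, and `slowdown(0.220, ...)` on that value comes out near 8.7, which is the "almost 10 times" gap being asked about.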

zpdixon,

I did a similar benchmark some time ago, see the thread titled 'strange benchmark':

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100672

The VLIW processors don't execute one instruction per clock cycle. Instead, they cycle between four or more threads. My analysis showed that each VLIW processor can execute 4 instructions in 5 clock cycles if it has 4 threads to work on, or it can execute 8 instructions in 8 clock cycles if it has 8 threads to work on. Therefore, you should generally parallelise your program so that every VLIW processor has at least 4, better 8, threads to work on.

Given that, your code should run at about 20% of your calculated speed if the number of threads is small.

My HD4870x2 board still does not run at full speed. Maybe this affects your board as well? Running

"aticonfig --adapter=0 --od-getclocks"

shows that the current clock is 507 MHz instead of 750 MHz, which is also consistent with my test results. I hope the people at ATI will fix this at some point.

Ingo

I am indeed running only a small number of threads: I call calCtxRunProgram once with a domain of {0, 0, 1, 1}, which means 4 pixels and therefore 4 threads, right?

Also my timing may be slightly inaccurate. I do exactly this:

gettimeofday(tv0...);
calCtxRunProgram(...);
while (calCtxIsEventDone(...) == CAL_RESULT_PENDING)
    nanosleep(...); // sleep 1ms
gettimeofday(tv1...);
// -> time is tv1 - tv0

Technically I should start counting time right before calCtxIsEventDone() instead of before calCtxRunProgram(), because the kernel is scheduled for execution the first time calCtxIsEventDone() is called.
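For reference, the "tv1 - tv0" step can be written as a small helper. This is a minimal sketch rather than the poster's actual code; `elapsed_seconds` is a hypothetical name, and the CAL calls appear only in the comment, with their arguments elided exactly as in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Difference t1 - t0 in seconds. tv_usec may be smaller in t1 than in t0,
 * so the subtraction is done in signed arithmetic before scaling.
 *
 * Intended use (CAL arguments elided as in the post):
 *   gettimeofday(&tv0, NULL);
 *   calCtxRunProgram(...);
 *   while (calCtxIsEventDone(...) == CAL_RESULT_PENDING) nanosleep(...);
 *   gettimeofday(&tv1, NULL);
 *   double t = elapsed_seconds(&tv0, &tv1);
 */
double elapsed_seconds(const struct timeval *t0, const struct timeval *t1)
{
    assert(t0 != NULL && t1 != NULL);
    return (double)(t1->tv_sec - t0->tv_sec)
         + (double)(t1->tv_usec - t0->tv_usec) / 1e6;
}
```

Because the loop polls with a 1 ms sleep, the result carries up to roughly a millisecond of polling error, which is small next to the 0.220 s being measured.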

I'll check the clock frequency of my HD 4850 with aticonfig tonight, I'll also experiment with more threads.

Thanks.

If you are only running 4 threads, then you are only running on 1 SIMD, so you should expect at most 1/10th speed. To fully utilize the GPU you must run a minimum of 1280 threads, as anything less will leave some SIMDs idle.

Also, 4 threads only utilize 1/16th of a SIMD on the RV770, as the wavefront size is 64 threads. A wavefront is a hardware thread, and all software threads inside a wavefront run in parallel.
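The occupancy figures quoted above follow from simple integer arithmetic. The sketch below assumes a 64-thread wavefront and 10 SIMDs, as stated or implied in the replies; reading the 1280-thread minimum as two wavefronts per SIMD is my interpretation, not something the reply spells out. All names are illustrative.

```c
#include <assert.h>

enum { WAVEFRONT_SIZE = 64 }; /* threads per wavefront, per the reply above */

/* Number of wavefronts the hardware must launch to cover `threads`. */
unsigned wavefronts_needed(unsigned threads)
{
    return (threads + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Fraction of launched wavefront lanes that do useful work. */
double lane_utilization(unsigned threads)
{
    assert(threads > 0);
    unsigned lanes = wavefronts_needed(threads) * WAVEFRONT_SIZE;
    return (double)threads / (double)lanes;
}

/* Threads needed to give every SIMD `wavefronts_per_simd` full wavefronts. */
unsigned min_threads(unsigned simds, unsigned wavefronts_per_simd)
{
    return simds * wavefronts_per_simd * WAVEFRONT_SIZE;
}
```

Here `lane_utilization(4)` is 0.0625, the 1/16th mentioned above, and `min_threads(10, 2)` is 1280.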

This is probably what is supposed to happen. However, even at full load, the clock speed always stays at 507 MHz:

# aticonfig --adapter=0 --od-getclocks

Adapter 0 - ATI Radeon HD 4870 X2
                            Core (MHz)    Memory (MHz)
           Current Clocks :    507           500
             Current Peak :    750           900
  Configurable Peak Range : [507-800]     [500-1000]
                 GPU load :    99%

zpdixon,

Also, you have instruction dependencies that will stall your pipeline if it is deeper than 3 stages, i.e.:

mad r2, r2, r2, r2 < a

mad r3, r3, r3, r3

mad r4, r4, r4, r4

mad r2, r2, r2, r2 < depends on a

mad r3, r3, r3, r3

mad r4, r4, r4, r4

I think I had to unroll to use 16 different registers when I did this test on a FireStream 9170. Also, as others have suggested, when I did it I used an 8192x8192 domain to maximize thread parallelism (though you can likely use something much smaller).
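Unrolling across more registers by hand is tedious, so the loop body can be generated instead. The following is a hypothetical host-side helper (`emit_mads` is not part of any SDK) that writes MAD lines cycling round-robin over a chosen number of registers, so that consecutive instructions do not depend on each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write `count` IL MAD lines into buf, cycling through the registers
 * r<first> .. r<first + regs - 1>. Returns the number of characters
 * written (excluding the terminating NUL), or -1 if buf is too small. */
int emit_mads(char *buf, size_t cap, unsigned first, unsigned regs,
              unsigned count)
{
    assert(buf != NULL && regs > 0);
    if (cap == 0)
        return -1;
    buf[0] = '\0';

    size_t used = 0;
    for (unsigned i = 0; i < count; i++) {
        unsigned r = first + i % regs;
        int n = snprintf(buf + used, cap - used,
                         "mad r%u, r%u, r%u, r%u\n", r, r, r, r);
        if (n < 0 || (size_t)n >= cap - used)
            return -1; /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it with `first = 2, regs = 3, count = 120` reproduces the body of the kernel in the opening post; `regs = 16` gives the wider unrolling described above.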

Ok, I finally reached ~970 GFLOPS on my 4850, or 97% of the peak theoretical perf :-) It turns out I had 2 problems:

1. I noticed that CAL was silently ignoring the domain size because my IL kernel did not define any input stream (!) It took me a while to figure that out...

2. And as you guys pointed out, one thread is far from sufficient to estimate the perf of even just one VLIW processor. In my case I need about 32 threads per VLIW processor to reach the peak perf. I use a domain size of 32x160 == 5120 threads.
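The percentages in this post are ratios of achieved to peak GFLOPS. A rough sketch of both calculations follows, assuming 800 SPUs on the HD 4850 (160 VLIW processors times 5 units) and counting each scalar multiply-add as 2 floating-point operations; the function names are illustrative only.

```c
#include <assert.h>

/* Theoretical peak: every SPU retires one multiply-add (2 flops) per clock. */
double peak_gflops(unsigned spus, double clock_hz)
{
    return (double)spus * 2.0 * clock_hz / 1e9;
}

/* Achieved rate for a kernel in which each thread runs `iterations` loop
 * iterations of `mads_per_iter` 4-component MAD instructions. */
double achieved_gflops(double threads, double iterations,
                       double mads_per_iter, double seconds)
{
    assert(seconds > 0.0);
    const double flops_per_mad = 4.0 * 2.0; /* 4 components x (mul + add) */
    return threads * iterations * mads_per_iter * flops_per_mad
           / seconds / 1e9;
}
```

`peak_gflops(800, 625e6)` is 1000, so ~970 GFLOPS corresponds to the 97% quoted above. The iteration count and run time of the final kernel are not given in the thread, so any call to `achieved_gflops` needs your own measured values.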

rahulgarg: 18 MADs only allow my 4850 to attain 77% of the theoretical perf, 36 MADs 87%, 72 MADs 89%, and 144 MADs 95%. Above that it varies between 93% and 97% (because the CAL compiler does not always manage to use all 5 SPUs per VLIW processor). I personally unrolled the loop to execute 249 MADs per iteration, as it is one of the numerous values above 150 that does not waste any SPU.

It makes no difference where I place the first gettimeofday call (before calCtxRunProgram or before calCtxIsEventDone).

josopait: according to aticonfig, my 4850 runs at 500 MHz when idle and automatically jumps to 625 MHz when running the IL kernel. I don't know why your 4870 stays at the low clock frequency...

rick.weber: well, instruction dependencies don't seem to matter at all to my IL kernel. In fact I even modified it to use only 2 registers (mad on r1; mad on r2; mad on r1; ...), as I noticed it is the simplest case where the CAL compiler is always able to use all 5 SPUs (when repeating the MAD instruction on the same register r1 over and over, CAL decides to "waste" the T SPU and only uses the X, Y, Z, W SPUs). And the fact that I run such a high number of threads (5120) renders instruction dependencies irrelevant, as the GPU context-switches to the next thread when a stall occurs.