

hendrix
Journeyman III

Brook+ 1.4 performance seems to be very limited, CAL is much faster

I simply tried the matmul samples from the ATI Stream SDK 1.4 and found that the Brook+ sample "optimized matmult" with 256x256 matrices achieves only 0.146 GFLOPS. Increasing the size to 1024x1024, the result was 12.7 GFLOPS.

The CAL sample "simple matmult" runs at 55.4 GFLOPS (size 256x256), which is more than 4 times faster than the best Brook+ result.

Can anyone explain why Brook+ suffers from such high overhead, or what else limits Brook+ efficiency?

I am using a Radeon HD 2600 XT with 120 stream processors and 192 GFLOPS peak performance (Windows XP SP3, 32-bit). The CAL sample "compute_matmul" didn't start, because the GPU doesn't support compute kernels.


22 Replies
Jawed
Adept II


Take a look at the IL produced when you compile the Brook+.

Does it contain lots of mul_ieee instructions? If so, do a replace-all to make them all "mul". When this version of the IL is compiled, the GPU-specific assembly will contain loads of MAD instructions; the prior version just contains MUL_e and ADD. Well, that's what happens here when I experiment with Brook+ in Stream KernelAnalyzer.

The next thing to note is that the Brook+ version performs significantly fewer fetches from memory per loop: it only fetches 12 vec4s, whereas the IL version performs 48. You can try to increase the number of fetches in the Brook+ version. See this presentation deck, which includes a presentation on optimisation of matrix multiplication. Tricky stuff involving cache-use optimisation.

http://ati.amd.com/technology/...ing/PLDI08Tutorial.pdf

Jawed

ryta1203
Journeyman III


AMD hasn't really provided enough information to properly optimize in Brook+. If you are happy getting "some" speedup, then I think it's perfectly fine for that; however, if you want to really optimize the code, then Brook+ is not the way to go. Besides, OpenCL will be out soon, and I doubt most people will be using Brook+ after it comes out.

AMD has not given a solid performance model and it seems that some of them are not quite sure exactly what the performance model even is for these cards.

rick_weber
Adept II


The CAL optimized matrix multiply uses the global buffer for computation, which allows kernels to be aggressively unrolled. Since Brook+ can't directly use the global buffer in this way, kernels are limited to being unrolled 8 ways (the maximum number of outputs a kernel can have). On top of that, Brook+ presently provides far-from-optimal implementations of kernels using the CAL backend.

MicahVillmow
Staff


Except for the overhead of the Brook+ runtime, there is not much stopping someone from re-writing the Brook+ optimized_matmult example to match most of the performance of the simple_matmult example from CAL. If the kernel is written in an efficient enough manner and the data set is large enough, the overhead from the Brook+ runtime will be negligible.

Jawed
Adept II


Originally posted by: rick.weber The CAL optimized matrix multiply uses the global buffer for computation, which allows kernels to be aggressively unrolled.


But it only writes 8 float4s to the global buffer (in memexport_matmult), which is, as you observed, also possible within Brook+'s limit of 8 output streams.

It seems to me the Brook+ version could be coded with exactly the same unrolling as the IL. But the mul_ieee problem will still get in the way...

The IL uses 39 registers, which is pretty severe. I think that means only 6 wavefronts can run on each SIMD: 16384 registers per SIMD / 64 threads per wavefront / 39 registers per thread = 6.6. That's approaching the point at which latency-hiding through wavefront switching stops working well (though the kernel is bandwidth-bound anyway). So the IL seems to have reached the limit of unrolling - an unrolling that appears to be within the capability of Brook+.

Jawed

ryta1203
Journeyman III


Jawed,

  Even reducing the GPR usage does not help performance. I guess this is because it is still bandwidth-bound?

Jawed
Adept II


See the discussion that starts here:

http://forum.beyond3d.com/show...?p=1290019#post1290019

We're mostly talking about IL rather than Brook+ implementations of matrix multiply.

There's a difference in performance between the "pixel shader" IL and the "compute shader" IL, with the latter being slower seemingly due to poor cache locality (i.e. increased cache miss rate).

I don't know how much slower the CS version is.

If you have a high GPR allocation and a low ALU:fetch ratio, then both need to change dramatically before you'll see any benefit. You're sort of stuck with a large no-man's-land to cross before things swing in your favour. Say your kernel's main loop has ~2.5x as many ALU instructions as TEX instructions, e.g. 48 fetches and 120 ALUs. This is so far from the ideal of 4x that you have to change something radically.

Similarly if you have ~40 GPRs then that's only ~6 (or maybe 5 I'm unclear on niggly details) wavefronts. Such a low number of wavefronts makes cache hit ratio very important - it's easier to run out of ALU instructions entirely with so few wavefronts.

vasionok's proposal in post #34 is interesting. I'm currently playing with it but I'm getting different numbers: 24 GPRs, 7.3 ALU:fetch and 10 wavefronts. I think I'm doing something wrong...

It should be possible for anyone who's interested to try this in Brook+, but it requires scatter output (C is 16 vec4s). Also, as far as I can tell, this technique keeps A, B and C as single matrices, rather than breaking them up into the parts seen in the SDK Brook+ sample. So the CPU code that handles mapping the domain to sub-block coordinates, and the incrementers for selecting from A and B, is different.

I suppose I should do a Brook+ version, because I can at least test that using the CPU backend, whereas I can't test IL without a Stream GPU.

I presume that in using scatter output in Brook+ the cache access pattern will be sub-optimal, like IL-CS.

Jawed

ryta1203
Journeyman III


Which part of his post?

Are you getting those numbers (24 GPRs, 7.3 ALU:Fetch, etc.) using the CS example or the PS simple example?

With all your curiosity about this, you should invest in a Stream GPU, since in my experience the numbers in the SKA do not often match actual results (as in one of our previous conversations).

Yes, the ideal ALU:Fetch ratio is 4:1 (which, as you mentioned, the SKA automatically takes into consideration when reporting the ALU:Fetch ratio); HOWEVER, there is more to that ratio in the SKA than simply ALU instructions. For example, simple_matmult reports an ALU:Fetch ratio of 13.43 yet has 83 ALU and 24 TEX.

Personally, I tried working with the cache perf counters but had no luck; I kept getting a runtime crash and couldn't get help with it. Though I will say that this perf counter would be extremely useful if I could get it to work.

Sadly, the SKA is not all it's cracked up to be, and I would still love to see a good profiler, or at least more detailed SKA documentation.

Jawed
Adept II


Originally posted by: ryta1203 Which part of his post?


The paragraph starting "Why so many rows in B?"

Are you getting those numbers (24 GPRs, 7.3 ALU:Fetch, etc.) using the CS example or the PS simple example?


I have butchered the IL-CS code. I haven't done the correct address computations, but I've done dummy computations in order to ensure uniqueness of fetches.

The Brook+ version I've just coded has 28 GPRs. If I take the raw ISA (with mul_e and add) I get 7.6 ALU:fetch for the inner loop. Correcting those to MADs results in 4 ALU:fetch, also with 28 GPRs. I suspect I've done something wrong...

With all your curiosity about this, you should invest in a Stream GPU, since in my experience the numbers in the SKA do not often match actual results (as in one of our previous conversations). 🙂

Yes, the ideal ALU:Fetch ratio is 4:1 (which, as you mentioned, the SKA automatically takes into consideration when reporting the ALU:Fetch ratio); HOWEVER, there is more to that ratio in the SKA than simply ALU instructions. For example, simple_matmult reports an ALU:Fetch ratio of 13.43 yet has 83 ALU and 24 TEX.



I need an entirely new PC (since mine is old) just to install the card, and it's not going to happen soon.

Simple examination of the main loop of intense kernels like this gives you the count of ALU cycles and TEX cycles, so you can derive your own ALU:fetch. The SKA number is generally meaningless if you have a loop or control flow.

Obviously there's no reasonable way to account for memory/cache performance without profiling/counters.

Really the issue here is just how complex the kernel is: how much divergence there is in control flow, and how random and intense the memory operations are.

The fact that Brook+ never compiles MADs but always compiles mul_e+add is just annoying. I can't help wondering if that on its own would make Brook+ MM as fast as IL-PS.

Jawed
