
hendrix
Journeyman III

Brook+ 1.4 performance seems to be very limited, CAL is much faster

I simply tried the matmul samples from the ATI Stream SDK 1.4 and found that the Brook+ sample "optimized matmult" with 256x256 matrices achieves only 0.146 GFLOPS. Increasing the size to 1024x1024, the result was 12.7 GFLOPS.

The CAL sample "simple matmult" runs at 55.4 GFLOPS (size 256x256), which is more than four times faster than the best Brook+ result.

Can anyone explain why Brook+ suffers from such high overhead, or what else limits Brook+ efficiency?

I am using a Radeon HD 2600 XT with 120 stream processors and 192 GFLOPS peak performance (Windows XP SP3, 32-bit). Also, the CAL sample "compute_matmul" didn't start, because the GPU doesn't support compute kernels.

 

 

0 Likes
22 Replies
Jawed
Adept II

Take a look at the IL produced when you compile the Brook+ code.

Does it contain lots of mul_ieee instructions? If so, do a replace-all to make them all "mul". When this version of the IL is compiled, the GPU-specific assembly will contain loads of MAD instructions. The prior version just contains MUL_e and ADD. At least, that's what happens here when I experiment with Brook+ in the Stream KernelAnalyzer.

The next thing to note is that the Brook+ version performs significantly fewer fetches from memory per loop. It only fetches 12 vec4s, whereas the IL version performs 48. You can try to increase the number of fetches in the Brook+ version. See the deck below, which includes a presentation on optimisation of matrix multiplication. Tricky stuff involving cache-use optimisation.

http://ati.amd.com/technology/...ing/PLDI08Tutorial.pdf

Jawed

0 Likes

AMD hasn't really provided enough information to properly optimize in Brook+. If you are happy getting "some" speedup then I think it's perfectly fine for that; however, if you want to really optimize the code then Brook+ is not the way to go. Besides, OpenCL will be out soon, and I doubt most people will be using Brook+ after that comes out.

AMD has not given a solid performance model, and it seems that even some people at AMD are not quite sure exactly what the performance model is for these cards.

0 Likes

The CAL optimized matrix multiply uses the global buffer for computation, which allows kernels to be aggressively unrolled. Since Brook+ can't directly use the global buffer in this way, kernels are limited to being unrolled 8 ways (the maximum number of outputs a kernel can have). On top of that, Brook+ presently provides far-from-optimal implementations of kernels using the CAL backend.

0 Likes

Except for the overhead of the Brook+ runtime, there is not much stopping someone from re-writing the Brook+ optimized_matmult example to match most of the performance of the simple_matmult example from CAL. If the kernel is written in an efficient enough manner and the data set is large enough, the overhead from the Brook+ runtime will be negligible.

0 Likes

Originally posted by: rick.weber The CAL optimized matrix multiply uses the global buffer for computation, which allows kernels to be aggressively unrolled.


But it only writes 8 float4s to the global buffer (in memexport_matmult), which is, as you observed, also possible within Brook+'s limit of 8 output streams.

It seems to me that the Brook+ version could be coded with exactly the same unrolling as the IL. But the mul_ieee problem will still get in the way...

The IL uses 39 registers, which is pretty severe. I think that means only 6 wavefronts can run on each SIMD - 16384 registers per SIMD / 64 threads per wavefront / 39 = 6.6. Approaching the point at which latency-hiding through wavefront switching stops working well (though the kernel is bandwidth bound anyway)? So the IL seems to have reached the limit of unrolling, an unrolling that appears to be within the capability of Brook+.
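
For anyone who wants to play with that arithmetic, here is a minimal C sketch of the back-of-the-envelope occupancy estimate, assuming the figures above (16384 registers per SIMD, 64 threads per wavefront); the function name is just mine, for illustration.

#include <stdio.h>

/* Back-of-the-envelope wavefronts-per-SIMD estimate from GPR usage,
   assuming 16384 registers per SIMD and 64 threads per wavefront. */
static int wavefronts_per_simd(int gprs_per_thread)
{
    const int registers_per_simd = 16384;
    const int threads_per_wavefront = 64;
    return registers_per_simd / threads_per_wavefront / gprs_per_thread;
}

int main(void)
{
    /* 39 GPRs, as reported for the IL kernel above: 16384/64/39 -> 6 */
    printf("39 GPRs -> %d wavefronts per SIMD\n", wavefronts_per_simd(39));
    return 0;
}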

Jawed

0 Likes

Jawed,

  Even reducing the GPR usage does not help performance. I guess this is because it is still bandwidth bound?

0 Likes

See the discussion that starts here:

http://forum.beyond3d.com/show...?p=1290019#post1290019

We're mostly talking about IL rather than Brook+ implementations of matrix multiply.

There's a difference in performance between the "pixel shader" IL and the "compute shader" IL, with the latter being slower seemingly due to poor cache locality (i.e. increased cache miss rate).

I don't know how much slower the CS version is.

If you have a high GPR allocation and a low ALU:fetch ratio then both need to change dramatically before you'll see any benefit. You're sort of stuck with a large no-man's-land to cross before things swing in your favour. Say your kernel's main loop has ~2.5x as many ALU instructions as TEX instructions, e.g. 48 fetches and 120 ALU ops. That is so far away from the ideal of 4x that you have to change something radically.
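
As a rough check on that ratio, here is a hedged C sketch using the example numbers above (the 4x target is the rule of thumb discussed in this thread, not an official figure):

#include <stdio.h>

/* ALU:fetch ratio for a kernel's main loop. The 4.0 target is the
   rule of thumb discussed in this thread, not an official figure. */
static double alu_fetch_ratio(int alu_ops, int tex_fetches)
{
    return (double)alu_ops / (double)tex_fetches;
}

int main(void)
{
    /* the example loop above: 120 ALU ops, 48 fetches -> 2.50 */
    printf("ALU:fetch = %.2f (target ~4)\n", alu_fetch_ratio(120, 48));
    return 0;
}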

Similarly, if you have ~40 GPRs then that's only ~6 wavefronts (or maybe 5; I'm unclear on the niggly details). Such a low number of wavefronts makes the cache hit ratio very important - it's easier to run out of ALU instructions entirely with so few wavefronts.

vasionok's proposal in post #34 is interesting. I'm currently playing with it but I'm getting different numbers: 24 GPRs, 7.3 ALU:fetch and 10 wavefronts. I think I'm doing something wrong...

It should be possible for anyone who's interested to try this in Brook+, but it requires scatter output (C is 16 vec4s). Also, as far as I can tell, this technique keeps A, B and C as single matrices, rather than breaking them up into the parts seen in the SDK Brook+ sample. So the CPU code that handles mapping of the domain to sub-block coordinates, and the incrementors for selecting from A and B, is different.

I suppose I should do a Brook+ version because I can at least test that using the CPU backend whereas I can't test IL with a Stream GPU.

I presume that in using scatter output in Brook+ the cache access pattern will be sub-optimal, like IL-CS.

Jawed

0 Likes

Which part of his post?

Are you getting those numbers, 24GPRs, 7.3ALU:Fetch, etc... using the CS example or the PS simple example?

With all your curiosity in this you should invest in a Stream GPU, since in my experience the numbers in the SKA do not often equate to expected results (as in one of our previous conversations).

Yes, the ideal ALU:Fetch ratio is 4:1 (which, as you mentioned, the SKA automatically takes into consideration when reporting the ALU:Fetch ratio); HOWEVER, there is more to that ratio in the SKA than simply ALU instruction counts. For example, simple_matmult reports an ALU:Fetch ratio of 13.43 yet has 83 ALU and 24 TEX instructions.

Personally, I tried working with the cache perf counters but had no luck; I kept getting a runtime crash and couldn't get help with it. Though I will say that this perf counter would be extremely useful if I could get it to work.

Sadly, the SKA is not all it's cracked up to be and I would still love to see a good profiler or even the SKA have more detailed documentation.

0 Likes

Originally posted by: ryta1203 Which part of his post?


The paragraph starting "Why so many rows in B?"

Are you getting those numbers, 24GPRs, 7.3ALU:Fetch, etc... using the CS example or the PS simple example?


I have butchered the IL-CS code. I haven't done the correct address computations, but I've done dummy computations in order to ensure uniqueness of fetches.

The Brook+ version I've just coded has 28 GPRs. If I take the raw ISA (with mul_e and add) I get 7.6 ALU:fetch for the inner loop. Correcting to MADs results in 4 ALU:fetch, also with 28 GPRs. I suspect I've done something wrong...

With all your curiosity in this you should invest in a Stream GPU, since in my experience the numbers in the SKA do not often equate to expected results (as in one of our previous conversations). 🙂

Yes, the ideal ALU:Fetch ratio is 4:1 (which, as you mentioned, the SKA automatically takes into consideration when reporting the ALU:Fetch ratio); HOWEVER, there is more to that ratio in the SKA than simply ALU instruction counts. For example, simple_matmult reports an ALU:Fetch ratio of 13.43 yet has 83 ALU and 24 TEX instructions.



I need an entirely new PC (since mine is old) just to install the card, and it's not going to happen soon.

Simple examination of the main loop of intense kernels like this gives you the count of ALU cycles and TEX cycles, so you can derive your own ALU:fetch. The SKA number is generally meaningless if you have a loop or control flow.

Obviously there's no reasonable way to account for memory/cache performance without profiling/counters.

Really the issue here is just how complex the kernel is: how much divergence there is in control flow, and how random and intense the memory operations are.

Things like Brook+ never compiling MADs but always compiling mul_e+add are just annoying. I can't help wondering if that on its own would make Brook+ MM as fast as IL-PS.

Jawed

0 Likes

Originally posted by: Jawed Things like Brook+ never compiling MADs but always compiling mul_e+add are just annoying. I can't help wondering if that on its own would make Brook+ MM as fast as IL-PS.

Jawed

Here I have changed the brcc source to solve this: in iltranscriber.hpp, at line 372, I replaced MUL_e with MUL. I also changed a few other lines, but they're not very important.

This could be a compiler option, a bit less precise?

 

After doing that, there is another trick. In Brook+, write:

a = a + b*c + d*e

and not:

a += b*c + d*e

The compiler won't reorder operations; that's maybe another point to improve, but I'm too lazy to make that modification to the brcc source...
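
To make the trick concrete, here is a tiny illustrative sketch in plain C (floats instead of float4s) of the two accumulation forms. The instruction mappings in the comments are only what I'd expect from the Brook+/IL tool chain, not verified compiler output.

/* Two accumulation forms; only the expression shape differs. */
float accumulate_mad_friendly(float a, float b, float c, float d, float e)
{
    /* (a + b*c) + d*e - each step is a multiply-add, so it can become
       two MADs once MUL_e has been turned into MUL. */
    a = a + b*c + d*e;
    return a;
}

float accumulate_mad_unfriendly(float a, float b, float c, float d, float e)
{
    /* b*c + d*e is evaluated first, then added to a, which tends to
       leave a separate MUL(_e) and ADD in the compiled code. */
    a += b*c + d*e;
    return a;
}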

 

0 Likes

Originally posted by: eduardoschardong Here I have changed the brcc source to solve this: in iltranscriber.hpp, at line 372, I replaced MUL_e with MUL. I also changed a few other lines, but they're not very important.

This could be a compiler option, a bit less precise?



Neat.

I've just successfully done a test compile of brcc.exe (had to go find flex and bison), so now I'll have a play.

I guess MAD is more precise in matrix multiplication than MUL_e + ADD, as there's only one rounding instead of two. It seems that MUL_e only differs from MUL in using IEEE rules for multiplication by 0, but I'm a noob.

After doing that, there is another trick. In Brook+, write:

a = a + b*c + d*e

and not:

a += b*c + d*e

The compiler won't reorder operations; that's maybe another point to improve, but I'm too lazy to make that modification to the brcc source...



Yes, I've tripped up on that a little and found the same solution.

I think general re-ordering is a tricky subject - e.g. the programmer's chosen ordering for maximum precision would get undone if the compiler fully optimises.

Precision in matrix multiplication is a tricky subject I guess. It's not something I've ever worked with in any detail, but I can imagine single-precision MM is quite imprecise on huge matrices simply due to the quantity of computations per result cell. Additionally, it seems to me that the order/size of sub-blocks that are fetched and computed before summing into the result cell has a noticeable effect on precision.

Jawed

0 Likes

Yes, you can count the number of ALU and TEX instructions yourself, but why bother if the SKA does it for you (not talking about the ALU:Fetch ratio, just the instruction counts)?

Your Brook+ example uses pixel shader? So this is more like the simple_matmult?

Personally, I'm not seeing any performance change from lowering the GPRs, even though it allows 1-3 more wavefronts to run.

Yes, it's unfortunate you don't have a card since you can't get performance results.

0 Likes

Originally posted by: ryta1203 Yes, you can count the number of ALU and TEX instructions yourself, but why bother if the SKA does it for you (not talking about the ALU:Fetch ratio, just the instruction counts)?


You have to count them yourself when you're only interested in the inner loop. Just take the final ALU instruction number (e.g. 267), subtract the first ALU instruction number in the loop (e.g. 16), subtract the count of TEX instructions in the loop, and add 1. Almost too trivial to describe.
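
Here's that counting rule as a trivial C helper, just to pin the arithmetic down (the TEX count in main is a made-up figure purely to exercise it):

#include <stdio.h>

/* ALU count for a loop in an SKA ISA dump: instructions are numbered in
   one sequence, so take the inclusive span from the first to the last
   ALU instruction and subtract the TEX instructions inside it. */
static int loop_alu_count(int first_alu, int last_alu, int tex_in_loop)
{
    return last_alu - first_alu + 1 - tex_in_loop;
}

int main(void)
{
    /* e.g. first ALU instruction 16, last 267; the 48 TEX is a made-up
       figure just to exercise the helper */
    printf("%d ALU instructions in the loop\n", loop_alu_count(16, 267, 48));
    return 0;
}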

Your Brook+ example uses pixel shader? So this is more like the simple_matmult?


Normal Brook+ kernels are compiled to produce il_ps_2_0 shaders, which are "pixel shaders".

If you put "Attribute[GroupSize(64)]" before the kernel definition (multiples of 64, up to 1024), then that produces il_cs_2_0, which is a compute shader.

I'm still trying to understand how to get the matrix C back from this, so I think I will have to convert to a compute shader rather than pixel shader.

I am going to try testing it today. I'm also fairly sure I got it wrong earlier, but that's because I was fetching rows from B in the wrong order, so I should see a much higher ALU:fetch.

Personally, I'm not seeing any performance change from lowering the GPRs, even though it allows 1-3 more wavefronts to run.


Do you also have ALU:fetch >4?

Jawed

0 Likes

You can do both PS and CS in Brook+, which is why I was asking. I was curious to compare your GPR usage with what I had gotten using IL for the same examples; the CS GPR usage is much higher than the PS GPR usage.

I haven't played with the ALU count at all. According to the SKA the bottleneck is the ALU ops, so I'm a little confused about how adding more ALU instructions is going to help increase performance (unless it increases cache hits). Either way, it's worth a try, so I will look into it and get some results.

0 Likes

"SKA bottleneck" is completely useless here because we're dealing with a loop. You have no choice but to count the ALU and TEX instructions.

Having played a bit, I think I have 136 ALUs and 18 fetches, giving 7.6 ALU:fetch, with 26 GPRs and 2 Ts = 9 wavefronts. So 7.5 FLOPs per cycle results in 905 GFLOPs. But I want to try to get it working before I post the code. It could be a while...

Jawed

0 Likes

Jawed,

  That's ~75% of theoretical peak for the 4870 (peak is 1200 GFLOPs). I would be interested to see if your actual results are anywhere near your expected results.

  I'm assuming you are still using pixel shader.

0 Likes

I've looked at this more closely and can't find the ALU:fetch ratios he was alluding to, let alone the even higher ones I thought would result.

Maybe he'll clarify what he meant. I'm all out of ideas.

LDS, when broadcast reads work properly, should prove a boon. I might have a fiddle with that...

I have noticed, a couple of times, that CS can use one less register than PS. But if there's a rasterisation-pattern penalty that causes code to get bogged down in the cache then it may not be worth it.

Jawed

0 Likes

Jawed,

  Would you mind showing the CS kernel that allows you to get 1 less GPR than the PS kernel? I'd be interested to see that.

0 Likes

Attribute[GroupSize(64)]
kernel void
thin_matmult(int AWidth, int BHeight,
             float4 A[][],
             float4 B[][],
             out float4 C[][])
{
    // Setting zero
    float4 zero = float4(0.0f, 0.0f, 0.0f, 0.0f);

    // Declaring and initializing accumulators, left then right columns
    float4 acc00 = zero;
    float4 acc01 = zero;

    // Row number of output position
    int i = instance().y;

    // Column number of output position - 2 columns per thread
    int j = instance().x * 2;

    int k = 0;
    int l;
    for (; k < AWidth; k += 2)
    {
        for (l = 0; l < BHeight; l++)
        {
            // B, left then right columns
            float4 B0 = B[l][j];
            float4 B1 = B[l][j+1];

            // A, both columns, one row
            float4 A0 = A[i][k];
            float4 A1 = A[i][k+1];

            acc00 = acc00 + A0.x * B0 + A0.y * B0 + A0.z * B0 + A0.w * B0;
            acc00 = acc00 + A1.x * B0 + A1.y * B0 + A1.z * B0 + A1.w * B0;
            acc01 = acc01 + A0.x * B1 + A0.y * B1 + A0.z * B1 + A0.w * B1;
            acc01 = acc01 + A1.x * B1 + A1.y * B1 + A1.z * B1 + A1.w * B1;
        }
    }

    C[i  ][j  ] = acc00;
    C[i  ][j+1] = acc01;
}

Note this is junk (but conveniently short for posting); it was just a way to get 64 MADs. It uses 10 GPRs as a pixel shader (delete the Attribute) or 9 GPRs as a compute shader.

In the PS version, it appears register R0 contains the position of the "pixel" and it is adjusted by subtracting 0.5 from x and y. In the CS no adjustment is performed.

Additionally, the PS contains a superfluous write, EXP_DONE, which to me implies that memory is allocated somewhere to hold the results of this kernel, even though the Brook+ source specifies no stream output. Notice in this case that R2 is written by the final instruction:

11 ENDLOOP i0 PASS_JUMP_ADDR(2)
12 ALU: ADDR(130) CNT(14) KCACHE0(CB0:0-15)
     34  z: ADD_INT     T1.z,  R7.x,  1     
         t: MULLO_INT   ____,  0.0f,  KC0[2].x     
     35  x: MOV         R2.x,  0.0f     
         y: MOV         R2.y,  0.0f     
         z: MOV         R2.z,  0.0f     
         w: MOV         R2.w,  0.0f     
         t: MULLO_INT   T0.z,  PS34,  KC0[2].x     
     36  t: MULLO_INT   ____,  R8.x,  KC0[2].x     
     37  x: ADD_INT     ____,  T0.z,  PS36     
     38  y: ADD_INT     ____,  T1.z,  PV37.x     
         w: ADD_INT     ____,  R7.x,  PV37.x     
     39  x: LSHL        R0.x,  PV38.w,  (0x00000002, 2.802596929e-45f).x     
         t: LSHL        R1.x,  PV38.y,  (0x00000002, 2.802596929e-45f).x     
13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R4, ELEM_SIZE(3)
14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R5, ELEM_SIZE(3)
15 EXP_DONE: PIX0, R2
END_OF_PROGRAM

after first setting R2 = 0. So, I guess if your kernel's performance doesn't depend on scatter then setting GroupSize might be a useful tweak. I'm not sure if setting GroupSize like this artificially restricts the number of wavefronts that can be in flight on each SIMD. I know that defining a shared variable with a size (i.e. using LDS) does restrict the wavefront count, but I'm not sure what happens if no reference to a shared variable exists in the kernel.

Jawed

0 Likes

Jawed,

  Yeah, it's interesting to look at some of these things but at the same time I'm still interested in results, which sadly you can't get on your end.

0 Likes

In the end it's just one of those variables that's pretty subtle - perhaps 95% of the time it'll make no difference or CS will be overall worse for other reasons.

Jawed

0 Likes

Well, in the end it's actually just going to be AMD/ATI who suffer, because they fail to bring an accurate, predictable performance model to the table so that developers can at least try to get the best performance out of their cards without having to throw darts in the dark.

0 Likes