
ryta1203
Journeyman III

Calculating the Bottleneck!?

I'm not sure I'm doing this correctly so I just want to make sure:

There are three equations on page 1-24 of the SDK Guide describing how to calculate theoretical performance for stream core instructions, fetch instructions and memory instructions.

I have a few questions:

1. If you have an input stream and an output stream, both float4, is the input counted toward both the fetch AND the memory calculations, or just the fetch calculation?

2. The Stream Core Instructions figure for the RV770 is 160, correct? (16 * 10)

3. The Fetch Instructions figure for the RV770 is 40, correct? (4 * 10)

So, say I have a kernel with 4 inputs and 1 output (pixel shader, no global buffer, etc.)... a simple kernel with an ALU:Fetch of 1.0 (16 ALU ops). Then for the RV770, over a 256x256 (2D) domain....

Stream Core should be:

((256*256)*(16)) / ((160)*(750MHz))

Fetch should be:

((256*256)*(4)) / ((40)*(750MHz))

Memory should be:

((256*256)*(128)) / ((256)*(900MHz*2DDR)). This assumes only the output is counted, not the input as well, and a float4 (32 bits * 4 = 128 bits).
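
For reference, here's the arithmetic as I'm doing it, as a small Python sketch (the constants are the RV770 numbers above; whether I'm using the right factors is exactly what I'm asking):

    # Plugging the RV770 numbers above into the three SDK Guide formulas.
    DOMAIN = 256 * 256        # threads in the 2D domain
    ENGINE_CLK = 750e6        # engine clock, Hz
    MEM_CLK = 900e6           # memory clock, Hz (DDR: 2 transfers per clock)
    STREAM_CORES = 16 * 10    # VLIW stream cores
    FETCH_UNITS = 4 * 10      # fetch units
    BUS_BITS = 256            # memory bus width, bits

    alu_s   = DOMAIN * 16  / (STREAM_CORES * ENGINE_CLK)   # 16 VLIW instructions/thread
    fetch_s = DOMAIN * 4   / (FETCH_UNITS * ENGINE_CLK)    # 4 fetches/thread
    mem_s   = DOMAIN * 128 / (BUS_BITS * MEM_CLK * 2)      # 128 bits written/thread

    for name, t in (("stream core", alu_s), ("fetch", fetch_s), ("memory", mem_s)):
        print(f"{name:11s} {t * 1e6:6.2f} us")

This prints roughly 8.74 us for stream core, 8.74 us for fetch and 18.20 us for memory.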

In this case the memory is the bottleneck (if I am calculating correctly), but it shouldn't be: according to the SKA, the ALU ops should be the bottleneck.

So I figure I must be calculating something incorrectly here: what am I doing wrong?

ryta1203
Journeyman III

And on top of that I'm not getting anywhere close to the expected times even for very simple kernels (I'm only timing from RunProgram to EventDone).


Does the memory work such that writing out a float costs the same as writing out a float4? So that instead of 128 bits for a float4 I should be using 32 bits? I get results closer to the SKA's with this, but I want to make sure.


For me it looks like:

The 4870 contains 160 ALUs, each of which can perform up to five 32-bit integer or SPFP instructions per clock (not sure about DPFP, probably only one).

All memory reads/writes operate on 128-bit operands; it's not possible to read anything smaller than a 128-bit vector.

Actually, both of these can be figured out from the SKA output.


If you look at the equations in the SDK Guide, though, the Stream Core one says "# VLIW stream core instructions/thread", which to me implies that the 5-wide VLIW is already included.

Also, the memory equation is in BITS, and the example given in the SDK Guide has one byte in and one byte out (for a total of 16 bits).

If you look at page 1-24 you will see this.

Anyone have any idea on the specifics?


Anyone from AMD have any advice, since the docs don't explicitly (or at least not clearly) state this information?

The only way I was able to recreate the bottleneck given by the SKA was to make the bits for the input/output memory reads/writes 32 and not 128 (as they should be).

Help please!


I'm really just looking for clarification on the documentation... this should be a pretty straightforward thing to find out.


Is it safe to assume that AMD can't understand their own documentation enough to give an accurate explanation?


The "# VLIW stream core instructions/thread" is inclusive of the 5 way core.

As for your calculations, you are correct, memory is the bottleneck. Unless you have a lot of ALU/TEX, simple kernels are almost always memory bound.

Whether a read/write is 32 bits or 128 bits is up to your kernel; the hardware can do both.
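
To make that concrete, here is a rough sketch of how that choice changes the memory-bound estimate for a 256x256 domain, using the bus width and clocks assumed earlier in this thread (an illustration of the formula, not a hardware model):

    # Memory-bound time estimate, 32-bit (float) vs. 128-bit (float4) writes.
    # Bus width and clocks are the RV770 figures assumed earlier in the thread.
    def mem_time_s(bits_per_thread, threads=256 * 256,
                   bus_bits=256, mem_clk_hz=900e6, ddr=2):
        return threads * bits_per_thread / (bus_bits * mem_clk_hz * ddr)

    print(f"float  (32 bits):  {mem_time_s(32)  * 1e6:.2f} us")   # ~4.55 us
    print(f"float4 (128 bits): {mem_time_s(128) * 1e6:.2f} us")   # ~18.20 us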

SKA is good for getting the exact numbers of ALU/TEX/CF ops for non-looping programs and very helpful for calculating them in looping programs, but it does not currently have enough knowledge about the execution to give a 100% correct analysis. It still requires the developer to apply the formulas in the documentation.

Another example is the following PDF file.
http://coachk.cs.ucf.edu/cours...8/s08/PerfModeling.pdf

So the bits in the memory calculation are correct? If you are writing a float4 it should be 128, and if you are writing a float it should be 32?

Also, I had assumed that the SKA assumes a float4 write for IL kernels, but it seems to be quite the opposite: it assumes a float write.


The presentation shows nothing that isn't already in the docs... unfortunately. Looking at the title I had really hoped for something deeper; it really only deals with how to calculate the bottleneck (the slides are essentially copied and pasted from the docs, or the other way around, whatever) and with the fact that accessing system memory is faster.

I do have one question about the slides, though: they talk about the access pattern for rasterization; HOWEVER, the docs say this is done transparently to the user...?


It is transparent to the user, but the knowledge of the rasterization pattern is a must to correctly optimize your texture accesses in pixel shader mode. For example, the rasterization pattern is tiled, but if you are using a linear buffer you probably are going to have bad cache performance. If the rasterization pattern is linear, as it is in compute shader, and you use a linear buffer, then you can reach close to peak performance.

So a simple rule of thumb is to use tiled surfaces for pixel shader and linear surfaces for compute shader (in CAL this is done via setting the global buffer flag on calResAlloc). That way you can get the best performance.

Is there an example of this somewhere we can look at? I'm not sure what you mean by "correctly optimize your texture accesses in pixel shader mode".

Is there somewhere in the AMD docs or samples that gives an example of doing this?


simple_matmult and compute_matmult are most likely the best examples.
simple_matmult was written with rasterization and caching kept in mind, and the compute_matmult version is a fairly direct port to compute shader. The memory access pattern for these two examples works well when both rasterization and the texture are tiled (simple_matmult), but not when rasterization is linear and the texture is tiled (compute_matmult).

Micah

Micah,

  Is AMD planning on releasing more/better documentation for the compute shader mode? There really isn't much out there to go on, unfortunately, and it seems that it's better (or at least more straightforward) for GPGPU.

  Also, is there any way to know the max # of threads that can be run per SIMD at one time? I believe you said it's 1024 for CS, but what about PS? Is it up to the driver? How can we find this information out?


The maximum number of threads in PS mode is basically set by the number of registers per thread.

An estimate for this is:

wavefront_size * floor(256 / registers_per_thread) 

RV770 (HD 4870), for example, has a wavefront size of 64; some GPUs have smaller wavefronts. A thread cannot use more than 127 registers, I believe. Registers allocated beyond this limit are automatically spilled to memory (and performance suffers).

That doesn't account for temporary registers. The kernel can use from 0 to 4 temporary registers - something you can't control - and you can only identify the count by searching the assembly shown by SKA for assignments to registers T0, T1, T2 and T3.

Including temporary registers:

wavefront_size * floor((256 - (2 * temp_registers_count)) / registers_per_thread)

If you are using global shared registers (only possible with IL) then you also need to subtract the total count of those from 256 (no need to multiply by 2).
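
Putting that together as a small sketch (my reading of the above; the 256-register file, the doubled temp registers and the 127-register cap are as described, so treat it as an estimate, not gospel):

    # Estimated wavefronts and threads per SIMD from register pressure,
    # following the formulas above. A sketch, not an authoritative model.
    def threads_per_simd(gprs_per_thread, temp_regs=0, shared_regs=0,
                         wavefront_size=64, register_file=256):
        assert 1 <= gprs_per_thread <= 127, ">127 GPRs spill to memory"
        available = register_file - 2 * temp_regs - shared_regs
        wavefronts = available // gprs_per_thread
        return wavefronts, wavefronts * wavefront_size

    print(threads_per_simd(17, temp_regs=2))   # (14, 896):  252 // 17
    print(threads_per_simd(10, temp_regs=2))   # (25, 1600): 252 // 10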


Jawed,

  Yes, I understand how to calculate the threads; however, is it true that it's only limited by the resources allowed?

  If that were the case I think you would see a good improvement from reducing register pressure; however, for certain sizes this is not what I have noticed... for example, going from 64 to 32 registers does reduce execution time, but going from 64 to, say, 58 does not, and going from 33 all the way down to 10 doesn't seem to either. I'm just a little curious about this because in CUDA register pressure is a big deal, but it doesn't seem to be as big a deal in FireStream (if I remember correctly, the ATI register file is much larger).

  EDIT: This assumes the same number of fetch ops (inputs), outputs, ALU ops and CF ops; according to the SKA, all the kernels I ran also had the same Min, Max, Avg, Est Cycles and ALU:Fetch (obviously).


Originally posted by: ryta1203 Jawed,

  Yes, I understand how to calculate the threads; however, is it true that it's only limited by the resources allowed?



I can't think of any other restrictions specifically impacting the number of threads per SIMD. Maybe someone else can?

In graphics the vertex shader and the pixel shader are both running, time-sliced, on the GPU. So both types of kernel have their separate register allocations and the counts of wavefronts of each type can vary over the lifetime of those kernels (simple load-balancing). That makes it much trickier to ascertain the number of threads.

Under Brook+ there's presumably a vertex shader that forms the domain of pixels that are executed by the kernel you write. This vertex shader only needs to produce the 3 corners of a triangle - so it is trivial and disappears before pixel shading (i.e. your kernel) starts.

If that were the case I think you would see a good improvement from reducing register pressure; however, for certain sizes this is not what I have noticed... for example, going from 64 to 32 registers does reduce execution time, but going from 64 to, say, 58 does not, and going from 33 all the way down to 10 doesn't seem to either. I'm just a little curious about this because in CUDA register pressure is a big deal, but it doesn't seem to be as big a deal in FireStream (if I remember correctly, the ATI register file is much larger).


Yes, NVidia is rather restrictive, which causes all sorts of problems. It probably needs to be doubled again (GT200 doubled it over G80) to get to a reasonable level.

Going from 64 to 58 shouldn't really make any difference since in both cases that's 4 wavefronts.

In general if you have an algorithm with an inherently-high ALU:fetch ratio (this is counting cycles of hardware execution) then you'll be hard-pushed to discern any variation in performance. In other words, if the kernel is arithmetically intense then only a few (4 or 8) wavefronts are required to hide any latency.

But dynamic control flow and scatter both add latency that is often forgotten when trying to define ALU:fetch. In a way it should be called "ALU:everything-else".

Cache access patterns and the whole rasterisation versus linear topic also have very real effects, resulting in performance that you can only derive empirically.

Jawed


Ryta,
There are some restrictions in the driver that limit the number of threads in a pixel shader; I'm not sure what that number is, however, or how it is calculated. In compute shader, if using LDS, the max number of threads is 1024, but without LDS it is determined by the normal resource constraints.

One of the things that makes this difficult is that lowering the register count to increase the number of wavefronts in flight can also lead to cache thrashing, because more data is being read in. For example, assume you have a cache of 12 elements and 8 wavefronts in flight, each using 32 registers and reading two sequential elements (we are only considering 1 thread per wavefront). Each thread reads a neighboring pixel, so you have a slight overlap in data, which means your cache will be very useful: you will only read in 9 elements, which all fit in cache. Now suppose you figure out how to recode your algorithm so you only use 16 registers, which doubles the number of wavefronts in flight. However, this also doubles the amount of data you are reading in, to 17 elements. 5 of those 17 elements do not fit into cache and thus evict the first 5 elements. Because of this cache thrashing, your performance would not necessarily increase with the increase in the number of wavefronts in flight. In this specific case, the optimal number of wavefronts would be 11, since you would fill the cache exactly once each time you read, and only the new element in the cache would evict the old element.

Now, this is a very simple case that is easy to understand, but this phenomenon can explain why you do not see the performance gains you were expecting.
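
A toy version of that arithmetic (one thread per wavefront, two sequential reads per thread overlapping the neighbor by one element, a 12-element cache; purely illustrative, not a model of the real cache):

    # Toy model of the example above: W wavefronts touch W + 1 distinct
    # elements (two reads each, overlapping the neighbor by one).
    CACHE_ELEMENTS = 12

    for wavefronts in (8, 11, 16):
        touched = wavefronts + 1
        evicted = max(0, touched - CACHE_ELEMENTS)
        print(f"{wavefronts:2d} wavefronts: {touched:2d} elements touched, "
              f"{evicted} evicted")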

Micah,

The cache counter returns 0, so I'm assuming there are no cache hits. My experiments run a domain of 768x768 with an ALU:Fetch of slightly less than 1.0 (for Jawed, that's 3.87xxxx), or better put, 247 ALU ops and 64 texture fetches.

 

Jawed, yes you are correct that there is no difference between 64 and 58, it was just an example. What about the difference between 17 and 10? 17 GPRs should allow 15 wavefronts while 10 GPRs should allow 25, a difference of 10 wavefronts, yet I don't see a difference in performance; again, the cache counter is ZERO, the ALU count is 247 and the texture fetch count is 64.

EDIT: each input is accessed only once; however, the cache counter does return some values (~30 for GPR <= 33) for larger domain sizes and larger ALU:Fetch ratios, yet the performance result is the same. Sorry, I also forgot to mention that each kernel uses 2 T registers, so those should really be 19 and 12 GPRs, giving 13 and 21 wavefronts respectively.


Micah,

   So is it possible that there is something wrong with the cache counter?


Originally posted by: ryta1203 The cache counter returns 0, so I'm assuming there are no cache hits. My experiments run a domain of 768x768 with an ALU:Fetch of slightly less than 1.0 (for Jawed, that's 3.87xxxx), or better put, 247 ALU ops and 64 texture fetches.


Aha, so you've got counters working! 

Does this kernel have a loop or loops? The loop that consumes the most run-time cycles is probably the place to focus on. What are the ALU and fetch counts for such a loop, if there is one?

Jawed, yes you are correct that there is no difference between 64 and 58, it was just an example. What about the difference between 17 and 10? 17 GPRs should allow 15 wavefronts while 10 GPRs should allow 25, a difference of 10 wavefronts, yet I don't see a difference in performance; again, the cache counter is ZERO, the ALU count is 247 and the texture fetch count is 64.


If 15 wavefronts are capable of hiding all the latency of non-ALU operations, and having 25 wavefronts doesn't radically increase latency, then you won't see a worthwhile performance improvement. You could try altering the size of the domain and/or how "square" it is, to see what kind of effects there are on performance.

If 15 wavefronts couldn't hide that latency, yet 25 could, then there'd be a performance gain.

The variations in performance arise solely due to shifting of bottlenecks amongst the types of operations the GPU is performing.

As your kernel is so evenly matched on both ALU and fetch, you're left to investigate writes to memory, cache access patterns and clause structure. The structure of the clauses (and count of them) can have an effect on latency.

EDIT: each input is accessed only once; however, the cache counter does return some values (~30 for GPR <= 33) for larger domain sizes and larger ALU:Fetch ratios, yet the performance result is the same. Sorry, I also forgot to mention that each kernel uses 2 T registers, so those should really be 19 and 12 GPRs, giving 13 and 21 wavefronts respectively.


Is that 30% cache hit rate? I don't know which cache that's referring to - I suspect it means a hit in L2. I don't know what the latency of a fetch from L2 (through L1) into registers is. 16 cycles? Anyway, whatever it is, much less than ~250 cycles (bit of a guess) for a fetch from video memory.

This talks about shuffling the kernel code in order to tune cache access patterns:

http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf

The 2 T registers in your code will not result in 13 and 21 wavefronts, they will result in 14 (252/17) and 25 (252/10) wavefronts.

How does cache hit vary with varying wavefront count?


Jawed,

1. Thanks for the thoughtful response.

2. Yes, I got the counters working after someone pointed out what was not in the docs... sadly, you should be able to simply call the API functions, but instead you have to set them up as well.

3. There are no loops.

4. The number of CF clauses is the same for every kernel, regardless of the number of GPRs. As for their structure, each ALU clause has the max number (~124) of ALU ops and each TEX clause has 8 fetches.

5. Yes, once I thought about it, it seems that either 1) the bottleneck has shifted to the ALU, OR 2) texture fetch bandwidth has been saturated.

6. I've run A LOT of experiments with all different ALU:Fetch ratios and domain sizes... even though the cache hits change, the overall performance does not. I assume this is because the bottleneck has changed at that point.

7. Yes, I need to look at how the cache works more closely.

8. I don't know which cache that's referring to either, since the docs don't specify. This could be to help AMD hide the cache sizes, etc., or it could just be bad docs; I don't really know.

9. 252? So if a T register is used, then all 4 are? Shouldn't that be 254 (256 - 2)?


The T registers count twice, because the assigned number of Ts has to be allocated for "odd" and "even" wavefronts.

The ALUs execute a pair of wavefronts at any one time, in a pattern of cycles that goes AAAABBBBAAAA....

So the T registers, which are private to the clause that's being executed, have to be allocated for A and B.

See sections 2.5 and 2.6:

http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Presumably you've also experimented with a compute shader version of your kernel. Though given the performance-insensitivity you currently have, CS might not be of any use, unless there's a radical gain in cache hit rate to be found somewhere...


I have not run these in compute shader mode yet. Since I started in pixel shader mode I would like to get a full understanding of that first, then when I move to compute shader mode, things should move a little faster for me.

As far as the temp clause registers, it would seem you are correct. I had read those sections already but didn't give it much thought.

Sadly, the SDK Guide does not describe very well (or rather, accurately) how these wavefronts are executed or scheduled on the ALUs. This does bring up another question I have.

Does the term "slot" refer to the wavefront's position in the queue?

As a note to AMD: it would be very helpful if the T registers were accounted for in the GPR count reported by the SKA, since they affect performance and the overall number of GPRs used. Not sure why they weren't included to begin with.


"Slot" in 2.6.1.2 is merely referring to the wavefront pairing that is created for execution of instructions. A is the odd slot and B is the even slot in that AAAABBBBAAAA.... description I gave earlier.

Jawed


So really what I'm asking is this:

Is every wavefront paired with another wavefront to execute in parallel? What if there is not an even number of wavefronts?


Ryta,
If you can find a test case that fails reliably you can send it to stream computing email address to have the CAL team look into why it isn't working as expected.

Micah,

  OK, thanks. I get it; I just need to look at how the cache is configured. I guess I'm hitting the same cache lines (though I thought sequential access would avoid this)... either way, I just need to take a better look at my numbers and try to figure out that cache. Thanks again for your time.


Ryta,
Wavefronts are not actually 'paired', but they will execute when there is room to execute. So if there is no wavefront able to fit in the odd slot, then only half of the SIMD will be used.

Micah,

  Thanks, that helps.


Micah,

  Sorry, this brings up another question: why is only half the SIMD used? According to the docs, the RV770 has 16 SPs with a quad per SP (4 instructions over 4 cycles, for a total of 64) and a wavefront per SIMD, so if there were only 1 wavefront, why would only half the SIMD be used?


Also, my original question stands:

When calculating those formulas, am I to include the inputs in both the memory calculations and the fetch calculations?


Ryta,
Fetch instructions are to be included as fetch only, because that is a separate unit (texture) and it is possible to be bound by this unit. Only the amount of data is included in the memory calculations.
As for the wavefronts, although it is seen as 4 instructions over 4 cycles, that assumes that both wavefronts are executing in parallel. So wavefront A executes 4 instructions over 4 cycles, then wavefront B executes, then A, then B. If B does not exist, A only executes every 8 cycles rather than every 4.
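
A sketch of that schedule (my own illustration of the AAAABBBBAAAA... pattern described earlier in the thread, not a hardware model):

    # Illustrative odd/even slot schedule: each resident wavefront gets the
    # SIMD for 4 cycles, alternating slot A and slot B. A "." is an idle cycle.
    def schedule(slot_a, slot_b, cycles=16):
        out = []
        for c in range(cycles):
            name, busy = ("A", slot_a) if (c // 4) % 2 == 0 else ("B", slot_b)
            out.append(name if busy else ".")
        return "".join(out)

    print(schedule(True, True))    # AAAABBBBAAAABBBB  (full utilization)
    print(schedule(True, False))   # AAAA....AAAA....  (A runs every 8 cycles)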

Ok. I'm not quite understanding the fetch thing fully.

If I have 3 inputs and 1 output, and the generated code gives 3 fetch instructions (for the 3 inputs), do the inputs go into both formulas? From your answer I'll assume no unless you tell me otherwise.

The only problem I'm having is that I'm looking at some kernels that are obviously memory bound, but according to the calculations they are ALU bound (I'm pretty sure they aren't ALU bound, since the execution time remains the same as the number of ALU ops increases). I only have this problem when dealing with float4s, not floats, so I'm not sure if my formula calculations are correct.


Originally posted by: MicahVillmow Ryta, Fetch instructions are to be included as fetch only, because that is a separate unit (texture) and it is possible to be bound by this unit. Only the amount of data is included in the memory calculations. As for the wavefronts, although it is seen as 4 instructions over 4 cycles, that assumes that both wavefronts are executing in parallel. So wavefront A executes 4 instructions over 4 cycles, then wavefront B executes, then A, then B. If B does not exist, A only executes every 8 cycles rather than every 4.


This thread revisited:

The texture fetch formula does not include "bytes", so I assumed the latency should be the same for float4 and float... by experiment this does NOT seem to be the case at all. In fact, the formula follows the experiment almost exactly for float, but only follows float4 once the ALU becomes the bottleneck. If FETCH is the bottleneck for a float4 kernel, the latency is much higher than for float, though the docs don't seem to talk about this.

Should the fetch latencies be the same? If not, what should the float2, float3 and float4 formulas look like?

 


Ryta,
Here is a thread you might find interesting as it gives some good information for HD48XX performance and optimizations.
http://forum.beyond3d.com/showthread.php?t=54842

Ok, thanks. I read it once, but I probably need to go back, read it again and think about it. I'm not sure it answers my questions, but it's a useful thread, so thanks.


I don't think it will answer all of them, but it does have useful low-level information in it (i.e. achievable L1 peak is 444 GB/s versus a theoretical peak of 480 GB/s, and the local data share isn't faster), etc...