I'm not sure I'm doing this correctly so I just want to make sure:
There are three equations on page 1-24 of the SDK Guide describing how to calculate theoretical performance for stream core instructions, fetch instructions and memory instructions.
I have a few questions:
1. If the kernel has a float4 input stream and a float4 output stream, does the input count toward both the fetch AND the memory calculations, or just the fetch calculation?
2. The Stream Core Instructions value for the RV770 is 160 (16 cores * 10 SIMDs), correct?
3. The Fetch Instructions value for the RV770 is 40 (4 fetch units * 10 SIMDs), correct?
So, say I have a simple kernel with 4 inputs and 1 output (pixel shader, no global buffer, etc.), an ALU:Fetch of 1.0 (16 ALU ops), running on the RV770 over a 2D domain of 256x256. Then:
Stream Core should be:
(256*256 * 16) / (160 * 750MHz)
Fetch should be:
(256*256 * 4) / (40 * 750MHz)
Memory should be:
(256*256 * 128) / (256 * 900MHz * 2 [DDR]). This assumes only the output is counted, not the input, and a float4 write (4 * 32 = 128 bits).
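Incidentally, here's the arithmetic as code, a sketch of my reading of the page 1-24 formulas; the 160/40 unit counts and the 750 MHz / 900 MHz clocks are the RV770 figures above, everything else is my assumption:

```python
# Sketch of the three theoretical-performance formulas from the SDK Guide,
# plugged with RV770 numbers. Clocks in Hz, memory traffic in bits per thread.

def stream_core_time(threads, vliw_instructions, cores=160, clock=750e6):
    # time = (threads * VLIW stream core instructions/thread) / (cores * engine clock)
    return threads * vliw_instructions / (cores * clock)

def fetch_time(threads, fetch_instructions, fetch_units=40, clock=750e6):
    return threads * fetch_instructions / (fetch_units * clock)

def memory_time(threads, bits_per_thread, bus_width=256, mem_clock=900e6, ddr=2):
    # 256-bit memory bus; GDDR transfers twice per clock
    return threads * bits_per_thread / (bus_width * mem_clock * ddr)

threads = 256 * 256
times = {
    "alu":   stream_core_time(threads, 16),
    "fetch": fetch_time(threads, 4),
    "mem":   memory_time(threads, 128),  # one float4 written out = 128 bits
}
bottleneck = max(times, key=times.get)
print(times, bottleneck)
```

With these inputs the ALU and fetch times come out identical and memory dominates, which is exactly the puzzle described above.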
In this case memory is the bottleneck (if I am calculating correctly), but it shouldn't be: according to the SKA, the ALU ops should be the bottleneck.
I guess I figure that I am calculating incorrectly here and am asking what am I doing wrong?
And on top of that I'm not getting anywhere close to the expected times even for very simple kernels (I'm only timing from RunProgram to EventDone).
Does the memory work such that writing out a float is the same as writing out a float4? So that instead of 128 bits for a float4 I should be using 32? I get results closer to the SKA with this, but I want to make sure.
For me it looks like:
The 4870 contains 160 ALUs, each of which can perform up to 5 32-bit integer or SPFP instructions per clock (not sure about DPFP, probably only 1).
All memory reads/writes operate on 128-bit operands; it's not possible to read anything smaller than a 128-bit vector.
Actually these can be figured out from SKA output.
If you look at the equations in the SDK Guide though, the Stream Core says "# VLIW stream core instructions/thread", which to me implies that the 5 wide is inclusive.
Also, the memory equation is in BITS and the example they give in the SDK Guide has a one byte in and one byte out (for a total of 16 bits).
If you look at page 1-24 you will see this.
Anyone have any idea on the specifics?
Anyone from AMD have any advice since the docs don't explicitly (or not at least clearly) state this information?
The only way I was able to recreate the bottleneck given in the SKA was to make the bits for the input/output memory read/writes 32 and not 128 (as they should be).
Help please!
I'm really just looking for clarification on the documentation... this should be a pretty straightforward thing to find out.
Is it safe to assume that AMD can't understand their own documentation enough to give an accurate explanation?
So the bits in the memory are correct? If you are writing a float4 then it should be 128 and if you are writing a float it should be 32?
Also, I had assumed that the SKA assumed a write of float4 for IL kernels, but it seems quite the opposite, that it assumes a write of float.
The presentation shows nothing that isn't already in the docs, unfortunately. From the title I had really hoped for something deeper, but it only covers how to calculate the bottleneck (the slides are essentially copy-pasted from the docs, or the other way around, whatever) and the fact that accessing system memory is faster.
I do have one question about the slides though: they talk about the access pattern for rasterization; HOWEVER, the docs say this is done transparently to the user.... ??
Is there an example of this somewhere we can look at? I'm not sure what you mean by "correctly optimize your texture access in ps mode".
Is there somewhere in the AMD docs or samples that gives an example of doing this?
Micah,
Is AMD planning on releasing more/better documentation for the compute mode shader? There really isn't that much out there to go off of, unfortunately, and it seems that it's better (or at least more straightforward) for GPGPU?
Also, is there any way to know the max # of threads that can be run per SIMD at one time? I believe you said for CS it's 1024 but what about PS? It's up to the driver? How can we find this information out?
The maximum number of threads in PS mode is basically set by the number of registers per thread.
An estimate for this is:
wavefront_size * floor(256 / registers_per_thread)
The RV770 (HD4870, for example) has a wavefront size of 64; some GPUs have smaller wavefronts. A thread cannot use more than 127 registers, I believe. Registers allocated beyond this limit are spilled to memory automatically (and performance suffers).
That doesn't account for temporary registers. The kernel can use from 0 to 4 temporary registers, something you can't control; you can only identify this by searching the assembly shown by the SKA for assignments to registers T0, T1, T2 and T3.
Including temporary registers:
wavefront_size * floor((256 - (2 * temp_registers_count)) / registers_per_thread)
If you are using global shared registers (only possible with IL) then you also need to subtract the total count of those from 256 (no need to multiply by 2).
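For my own sanity, here's the estimate as code. The 256-register budget, the doubled T-register allocation, and the global-register subtraction are as described above; the function names are my own:

```python
import math

def max_wavefronts(gprs_per_thread, temp_regs=0, global_regs=0, budget=256):
    # T (clause-temporary) registers are allocated twice, once for each of
    # the odd/even wavefront slots; global shared registers count only once.
    available = budget - 2 * temp_regs - global_regs
    return math.floor(available / gprs_per_thread)

def max_threads(gprs_per_thread, wavefront_size=64, **kw):
    # PS-mode thread limit is wavefront_size * wavefronts that fit in registers
    return wavefront_size * max_wavefronts(gprs_per_thread, **kw)

print(max_wavefronts(64), max_wavefronts(58))  # 4 4: why 64 -> 58 GPRs gains nothing
print(max_wavefronts(17, temp_regs=2))         # 14
print(max_wavefronts(10, temp_regs=2))         # 25
```

The usage lines reproduce the numbers debated later in the thread: 64 and 58 GPRs both allow only 4 wavefronts, and with 2 T registers the 252-register budget gives 14 and 25 wavefronts for 17 and 10 GPRs.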
Jawed,
Yes, I understand how to calculate the threads; however, is it true that it's only limited by the resources allowed?
If that were the case I think you would see a good improvement from reducing register pressure; however, for certain sizes this is not what I have noticed. For example, going from 64 to 32 registers does reduce execution time, but going from 64 to, say, 58 does not, and going from 33 all the way down to 10 doesn't seem to either. I'm just a little curious about this, because in CUDA register pressure is a big deal, but it doesn't seem to be as big a deal in FIRESTREAM (if I remember correctly, the ATI register file is much larger).
EDIT: This assumes the same number of: fetch ops (inputs), outputs, ALU ops, CF, and according to the SKA all the kernels I ran also had the same Min, Max, Avg, Est Cycles, ALU:Fetch (obviously).
Originally posted by: ryta1203 Jawed,
Yes, I understand how to calculate the threads; however, is it true that it's only limited by the resources allowed?
I can't think of any other restrictions specifically impacting the number of threads per SIMD. Maybe someone else can?
In graphics the vertex shader and the pixel shader are both running, time-sliced, on the GPU. So both types of kernel have their separate register allocations and the counts of wavefronts of each type can vary over the lifetime of those kernels (simple load-balancing). That makes it much trickier to ascertain the number of threads.
Under Brook+ there's presumably a vertex shader that forms the domain of pixels that are executed by the kernel you write. This vertex shader only needs to produce the 3 corners of a triangle - so it is trivial and disappears before pixel shading (i.e. your kernel) starts.
If that were the case I think you would see a good improvement from reducing register pressure; however, for certain sizes this is not what I have noticed. For example, going from 64 to 32 registers does reduce execution time, but going from 64 to, say, 58 does not, and going from 33 all the way down to 10 doesn't seem to either. I'm just a little curious about this, because in CUDA register pressure is a big deal, but it doesn't seem to be as big a deal in FIRESTREAM (if I remember correctly, the ATI register file is much larger).
Yes, NVidia is rather restrictive, which causes all sorts of problems. It probably needs to be doubled again (GT200 doubled it over G80) to get to a reasonable level.
Going from 64 to 58 registers shouldn't really make any difference, since in both cases that's 4 wavefronts (floor(256/64) = floor(256/58) = 4).
In general, if you have an algorithm with an inherently high ALU:fetch ratio (counting cycles of hardware execution), you'll be hard-pushed to discern any variation in performance. In other words, if the kernel is arithmetically intense then only a few (4 or 8) wavefronts are required to hide any latency.
But dynamic control flow and scatter both add latency that is often forgotten when trying to define ALU:fetch. In a way it should be called "ALU:everything-else".
Cache access patterns and the whole rasterisation versus linear topic also have very real effects, resulting in performance that you can only derive empirically.
Jawed
Micah,
The cache counter returns 0, so I'm assuming there are no cache hits. My experiments run a domain of 768x768 and an ALU:Fetch of slightly less than 1.0 (for Jawed, that's 3.87xxxx) or better put 247 ALU Ops and 64 Texture Fetches.
Jawed, yes you are correct that there is no difference between 64 and 58, it was just an example. What about the difference between 17 and 10? 17 GPRs should allow 15 wavefronts while 10 GPRs should allow 25, a difference of 10 wavefronts yet I don't see a difference in performance, again the cache counter is ZERO, the ALUs are 247 and the texture fetches are 64.
EDIT: each input is accessed only once; however, the cache counter does return some values (~30 for GPR <= 33) for larger domain sizes and larger ALU:Fetch ratios, yet the performance is the same. Sorry, I also forgot to mention that each kernel uses 2 T registers, so really that should be 19 and 12 GPRs, equaling 13 and 21 wavefronts respectively.
Micah,
So is it possible that there is something wrong with the cache counter?
Originally posted by: ryta1203 The cache counter returns 0, so I'm assuming there are no cache hits. My experiments run a domain of 768x768 and an ALU:Fetch of slightly less than 1.0 (for Jawed, that's 3.87xxxx), or better put, 247 ALU ops and 64 texture fetches.
Aha, so you've got counters working!
Does this kernel have a loop or loops? The loop that consumes the most run-time cycles is probably the place to focus on. What are the ALU and fetch counts for such a loop, if there is one?
Jawed, yes you are correct that there is no difference between 64 and 58, it was just an example. What about the difference between 17 and 10? 17 GPRs should allow 15 wavefronts while 10 GPRs should allow 25, a difference of 10 wavefronts yet I don't see a difference in performance, again the cache counter is ZERO, the ALUs are 247 and the texture fetches are 64.
If 15 wavefronts are capable of hiding all the latency of non-ALU operations, and having 25 wavefronts doesn't radically increase latency, then you won't see a worthwhile performance improvement. You could try altering the size of the domain and/or how "square" it is, to see what kind of effects there are on performance.
If 15 wavefronts couldn't hide that latency, yet 25 could, then there'd be a performance gain.
The variations in performance arise solely due to shifting of bottlenecks amongst the types of operations the GPU is performing.
As your kernel is so evenly matched on both ALU and fetch, you're left to investigate writes to memory, cache access patterns and clause structure. The structure of the clauses (and count of them) can have an effect on latency.
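To make the "enough wavefronts hide the latency" point concrete, here's a toy model. This is entirely my own back-of-envelope sketch with made-up cycle counts, not anything from the docs:

```python
def kernel_cycles(alu_cycles, fetch_latency, wavefronts):
    # While one wavefront waits on a fetch, the others run their ALU clauses.
    # Latency is fully hidden once (wavefronts - 1) * alu_cycles covers it;
    # whatever is left over shows up as exposed stall cycles.
    hidden = (wavefronts - 1) * alu_cycles
    exposed = max(0, fetch_latency - hidden)
    return alu_cycles + exposed

# Assume ~250 cycles of memory latency and 20 ALU cycles per wavefront:
for w in (5, 15, 25):
    print(w, kernel_cycles(20, 250, w))
```

With these made-up numbers, 15 and 25 wavefronts give identical times (latency already hidden), while 5 wavefronts leave exposed stall cycles, which is the behaviour described above.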
EDIT: each input is accessed only once; however, the cache counter does return some values (~30 for GPR <= 33) for larger domain sizes and larger ALU:Fetch ratios, yet the performance is the same. Sorry, I also forgot to mention that each kernel uses 2 T registers, so really that should be 19 and 12 GPRs, equaling 13 and 21 wavefronts respectively.
Is that a 30% cache hit rate? I don't know which cache that's referring to; I suspect it means a hit in L2. I don't know what the latency of a fetch from L2 (through L1) into registers is. 16 cycles? Anyway, whatever it is, it's much less than the ~250 cycles (a bit of a guess) for a fetch from video memory.
This talks about shuffling the kernel code in order to tune cache access patterns:
http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf
The 2 T registers in your code will not result in 13 and 21 wavefronts; they will result in 14 (floor(252/17)) and 25 (floor(252/10)) wavefronts.
How does cache hit vary with varying wavefront count?
Jawed,
1. Thanks for the thoughtful response.
2. Yes, I got the counters working after someone pointed out what was not in the docs... sadly, you should be able to simply call the API functions, but instead you have to set them up as well.
3. There are no loops.
4. The number of CF clauses is the same for every kernel, regardless of the number of GPRs. As for their structure, each ALU clause has the maximum number of ALU ops (~124) and each TEX clause has 8 fetches.
5. Yes, once I thought about it, it seems that either 1) the bottleneck has been shifted to the ALU OR 2) texture fetch bandwidth has been saturated.
6. I've run A LOT of experiments with all different ALU:Fetch ratios and domain sizes.... even though the cache hits change the overall performance does not. I assume this is because the bottleneck has changed at that point.
7. Yes, I need to look at how the cache works more closely.
8. I don't know which cache that's referring to either, since the docs don't specify. This could be to help AMD hide the cache sizes, etc., or it could just be bad docs; I don't really know.
9. 252? So if a T register is used then all 4 are? Shouldn't that be 254 (256-2?)
The T registers count twice, because the assigned number of Ts has to be allocated for "odd" and "even" wavefronts.
The ALUs execute a pair of wavefronts at any one time, in a pattern of cycles that goes AAAABBBBAAAA....
So the T registers, which are private to the clause that's being executed, have to be allocated for A and B.
See sections 2.5 and 2.6:
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf
Presumably you've also experimented with a compute shader version of your kernel. Though with the performance-insensitivity you currently have, it appears CS might not be any use. Unless there's a radical gain in cache hit rate to be found somewhere?...
I have not run these in compute shader mode yet. Since I started in pixel shader mode I would like to get a full understanding of that first, then when I move to compute shader mode, things should move a little faster for me.
As far as the temp clause registers, it would seem you are correct. I had read those sections already but didn't give it much thought.
Sadly, the SDK Guide does not describe very well (or rather, accurately) how these WFs are executed or scheduled on the ALUs. This brings up another question I have.
Does the term "slot" refer to the wavefront's position in the queue?
As a note to AMD: it would be very helpful if the T registers were accounted for in the GPR count reported by the SKA, since they affect performance and the overall number of GPRs used. Not sure why they weren't included to begin with.
"Slot" in 2.6.1.2 is merely referring to the wavefront pairing that is created for execution of instructions. A is the odd slot and B is the even slot in that AAAABBBBAAAA.... description I gave earlier.
Jawed
So really what I'm asking is this:
Is every wavefront paired with another wavefront to execute in parallel? What if there is not an even number of wavefronts?
Micah,
OK, thanks. I get it, I just need to look at how the cache is configured. I guess I'm hitting the same cache lines (though I thought sequential access would avoid this)... either way, I just need to take a better look at my numbers and try to figure out the cache. Thanks again for your time.
Micah,
Thanks, that helps.
Micah,
Sorry, this brings up another question: why is only half the SIMD used? According to the docs, the RV770 has 16 SPs per SIMD, each handling a quad (4 instructions over 4 cycles, for a total of 64 threads), with a wavefront per SIMD. So if there were only 1 wavefront, why would only half the SIMD be used?
Also, my original question stands:
When calculating those formulas, for inputs, am I to include them both in the memory calculations and the fetch calculations?
Ok. I'm not quite understanding the fetch thing fully.
If I have 3 inputs and 1 output, and the generated code gives 3 fetch instructions (for the 3 inputs), do the inputs go into both formulas? From your answer I'll assume no, unless you tell me otherwise.
The only problem I'm having is that I'm looking at some kernels that are obviously memory bound, but according to the calculations they are ALU bound (I'm pretty sure they aren't ALU bound, since the execution time remains the same as the number of ALU ops increases). I only have this problem when dealing with float4s, not floats, so I'm not sure if my formula calculations are correct.
Originally posted by: MicahVillmow Ryta, fetch instructions are to be included as fetch only, because that is a separate unit (texture) and it is possible to be bound by this unit. Only the amount of data is included in memory calculations. As for the wavefronts, although it is seen as 4 instr over 4 cycles, that assumes that both wavefronts are executing in parallel. So wavefront A executes 4 instr over 4 cycles, then wavefront B executes, then A, then B. If B does not exist, A only executes every 8 cycles and not every 4.
This thread revisited:
The texture fetch formula does not include "bytes", so I assumed the latency should be the same for float4 and float... but by experiment this does NOT seem to be the case at all. In fact, the formula follows the experiment almost exactly for float, but only follows float4 once ALU becomes the bottleneck. If FETCH is the bottleneck for float4, the latency is much higher than it is for float, though the docs don't seem to talk about this.
Should the fetch latencies be the same? If not, then what should the float2, float3 and float4 formulas look like??
Ok, thanks. I read it once, but I probably need to go back, read it again and think about it. I'm not sure it answers my questions, but it's a useful thread, so thanks.