Archives Discussions

ryta1203 · ‎02-12-2009

Are wavefronts put in a run-queue, as one presentation suggests, or are they run in parallel (through some type of switching mechanism), as another presentation suggests?

AMD still seems to be having a hard time getting all of its terminology on the same page. I am VERY curious about this question.

If Wavefronts are put in a queue, then that suggests that they are not run in parallel. If they are run in parallel, what purpose does the queue serve? A queue by nature is FIFO and therefore should not serve as some switching structure.

The docs on this are a little confusing, IMO. I have also looked at a few presentations and they seem to conflict, at least to me (I'm sure they make perfect sense to people who already know how the hardware works).

The runtime suggests that the lower the GPR usage the better, which in turn suggests that the wavefronts are running in parallel (switching) to hide memory latency, on top of the threads in a quad running in parallel to hide memory latency. So is this the same mechanism working on two fronts (threads in a quad, and wavefronts)?

MicahVillmow · ‎02-12-2009

Ryta,
Can you post the links to the presentations and what docs you are referencing so I can work to get them cleaned up and clear up any ambiguity?
Thanks,

ryta1203 · ‎02-12-2009

Originally posted by: MicahVillmow

Ryta,

Can you post the links to the presentations and what docs you are referencing so I can work to get them cleaned up and clear up any ambiguity?

Thanks,

http://developer.amd.com/media...ets/Rubin-CGO2008.pdf

Page 11

http://ati.amd.com/technology/...ng/PLDI08Tutorial.pdf

Page 9 and 10

THIRDLY, these forums. You have stated that wavefronts run in parallel; however, I see no evidence of that in the docs. Obviously they do, though; otherwise register usage would not be so important.

So, can you answer my original question?

rahulgarg · ‎02-12-2009

Wavefronts do run in parallel, if I understand correctly.
However, a SIMD can only handle up to 1024 threads, or up to 1024/64 = 16 wavefronts, and that is only if enough registers are available.
So if you have, let's say, 2**20 = 1024*1024 threads in your application, then 1024*10 threads will be running on the RV770 at any given time (assuming enough registers) while the rest of them wait to be dispatched by the ultra-threaded dispatcher. A wavefront is itself composed of quads of threads, but those quads map to a single thread processor containing 5 stream processors (I am not sure on this point).
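
The arithmetic above can be sketched roughly as follows. This is only an illustration: the 64-thread wavefront and the 1024-thread limit come from this thread, the count of 10 SIMDs is inferred from the 1024*10 figure, and the per-thread register budget of 256 GPRs is an assumption that is not stated anywhere in this discussion. The function names are hypothetical.

```python
WAVEFRONT_SIZE = 64           # threads per wavefront (from this thread)
MAX_THREADS_PER_SIMD = 1024   # per-SIMD thread limit quoted above
NUM_SIMDS = 10                # inferred from the 1024*10 figure for RV770


def resident_wavefronts_per_simd(gprs_per_thread, gpr_budget_per_thread=256):
    """Wavefronts one SIMD can hold at once.

    gpr_budget_per_thread is an ASSUMED register budget, not a figure
    taken from this discussion; adjust it to match the hardware docs.
    """
    if gprs_per_thread < 1:
        raise ValueError("a kernel uses at least one GPR")
    thread_cap = MAX_THREADS_PER_SIMD // WAVEFRONT_SIZE      # 1024 / 64 = 16
    register_cap = gpr_budget_per_thread // gprs_per_thread  # fewer GPRs -> more wavefronts
    return min(thread_cap, register_cap)


def resident_threads(gprs_per_thread):
    """Threads live on the whole chip at once under the assumptions above."""
    return resident_wavefronts_per_simd(gprs_per_thread) * WAVEFRONT_SIZE * NUM_SIMDS


def waiting_threads(total_threads, gprs_per_thread):
    """Threads still waiting to be dispatched."""
    return max(0, total_threads - resident_threads(gprs_per_thread))
```

Under these assumptions, a kernel using 16 GPRs per thread reaches the full 16 wavefronts per SIMD, while one using 32 GPRs is limited to 8.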

ryta1203 · ‎02-12-2009

Originally posted by: rahulgarg

Wavefronts do run in parallel, if I understand correctly.

However, a SIMD can only handle up to 1024 threads, or up to 1024/64 = 16 wavefronts, and that is only if enough registers are available.

So if you have, let's say, 2**20 = 1024*1024 threads in your application, then 1024*10 threads will be running on the RV770 at any given time (assuming enough registers) while the rest of them wait to be dispatched by the ultra-threaded dispatcher. A wavefront is itself composed of quads of threads, but those quads map to a single thread processor containing 5 stream processors (I am not sure on this point).

rahulgarg, thanks. In which doc did you find this information?

I understand the quads aspect (although I'm still not sure why they are 2x2 rather than 4x1, and how that improves performance) and how 2x2 threads are run in parallel on a single thread processor to help hide memory latency. Since the SIMD engine is just that, you only need one thread-switching mechanism (I assume the "Ultra-Threaded Dispatcher") per SIMD engine. This is pretty straightforward in the documentation, I believe.

I am having a hard time understanding where the Run-Queue comes into play and how that allows the Wavefronts to run in parallel!?

ryta1203 · ‎02-18-2009

Originally posted by: rahulgarg

Wavefronts do run in parallel, if I understand correctly.

However, a SIMD can only handle up to 1024 threads, or up to 1024/64 = 16 wavefronts, and that is only if enough registers are available.

So if you have, let's say, 2**20 = 1024*1024 threads in your application, then 1024*10 threads will be running on the RV770 at any given time (assuming enough registers) while the rest of them wait to be dispatched by the ultra-threaded dispatcher. A wavefront is itself composed of quads of threads, but those quads map to a single thread processor containing 5 stream processors (I am not sure on this point).

1) Your first point would suggest that you can have up to 160 Wavefronts running in "parallel" (i.e. their data is "live" in the register file and they are not finished but have begun their execution, meaning they have been issued previously).

2) The second part is correct and is more easily understood (since it's better documented).

3) What I would like to see documented more is the effect of the number of wavefronts on performance and the effect of cache usage on performance. It's fairly easy to use some of the more obvious optimization techniques pointed out by the docs to get some good improvement, but what is not obvious is how to squeeze the remainder of the performance out of the GPUs. For example, compared to CUDA (which makes it very easy to fully optimize with its exposed memory layout, etc.), it is harder to optimize those last little bits with AMD.

MicahVillmow · ‎02-12-2009

Ryta,
The answer is that it is a little bit of both. Yes, wavefronts do run in parallel, but only as many as the hardware can handle at once; those that are either stalled or waiting to execute sit in the thread queue/run-queue/ultra-threaded dispatcher, or whatever you want to call it.
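
As a toy illustration of "a little bit of both", the sketch below models one SIMD whose ALU runs a single wavefront's clause at a time while the other wavefronts wait in a FIFO run-queue or sit stalled on a fetch. Everything here is an assumption made for illustration: the cycle counts are invented, a fetch is modeled as taking no ALU time, and real hardware arbitration is certainly more involved than a plain FIFO.

```python
from collections import deque


def simulate(num_wavefronts, clauses, alu_cycles=4, fetch_latency=12):
    """Toy single-SIMD model. Returns (total_cycles, alu_busy_cycles).

    clauses is a list such as ["ALU", "TEX", "ALU"], run by every wavefront.
    alu_cycles and fetch_latency are made-up numbers, not hardware figures.
    """
    if not clauses:
        raise ValueError("each wavefront needs at least one clause")
    ready = deque(range(num_wavefronts))  # run-queue of wavefronts ready to issue
    next_clause = [0] * num_wavefronts    # per-wavefront index of the next clause
    stalled = []                          # (wake_time, wavefront) waiting on a fetch
    now = busy = end = finished = 0
    while finished < num_wavefronts:
        # Wavefronts whose fetch has returned rejoin the run-queue.
        ready.extend(w for t, w in stalled if t <= now)
        stalled = [(t, w) for t, w in stalled if t > now]
        if not ready:
            # Nothing to run: the ALU idles until the earliest fetch returns.
            now = min(t for t, _ in stalled)
            continue
        w = ready.popleft()
        kind = clauses[next_clause[w]]
        next_clause[w] += 1
        if kind == "ALU":
            now += alu_cycles             # the ALU is occupied by this clause
            busy += alu_cycles
            done_at = now
        else:
            done_at = now + fetch_latency  # fetch proceeds without the ALU
        end = max(end, done_at)
        if next_clause[w] == len(clauses):
            finished += 1
        elif kind == "ALU":
            ready.append(w)               # clause break: back to the queue
        else:
            stalled.append((done_at, w))
    return end, busy
```

With these made-up numbers, one wavefront running ["ALU", "TEX", "ALU"] keeps the ALU busy for 8 of 20 cycles, while four wavefronts keep it busy for 32 of 44, which is the latency-hiding effect that makes the number of resident wavefronts matter.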

As for the terminology, we are working on that, but the compute world and the graphics world often have different terms for the same thing.

As for how many threads can run on the SIMD, pages 10 and 11 of the CGO2008 slides give information on that. In compute shader mode with LDS, the limit is 1024.

To break down how it works as simply as possible:
Your execution domain is broken into blocks of 64 threads, called wavefronts, which are scheduled to execute on a SIMD.
When executing on a SIMD, each wavefront is broken into 4 groups of 16, with each group executing on the four 2x2 blocks of thread processors per SIMD.
Each thread processor processes 5 instructions for a single thread, also called an ALU clause.
A wavefront continues executing on a SIMD for that ALU CF clause, after which it returns to the thread dispatcher until it is scheduled to execute again.
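
The first two steps of that breakdown can be sketched as below. The sizes (64 threads per wavefront, 4 groups of 16, quads of 4) are from the description above; the linear assignment of thread IDs to groups and quads is purely an assumption for illustration, since the real mapping of a 2x2 quad onto the domain is not spelled out here. The function names are hypothetical.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront
GROUP_SIZE = 16      # threads per group; a wavefront is 4 such groups
QUAD_SIZE = 4        # a 2x2 quad of threads


def split_into_wavefronts(num_threads):
    """Break a 1D execution domain into wavefronts of up to 64 thread IDs."""
    return [
        list(range(start, min(start + WAVEFRONT_SIZE, num_threads)))
        for start in range(0, num_threads, WAVEFRONT_SIZE)
    ]


def split_wavefront(wavefront):
    """Break one wavefront into groups of 16, each made of quads of 4.

    Thread IDs are assigned linearly here, which is an assumption.
    """
    groups = [
        wavefront[i:i + GROUP_SIZE]
        for i in range(0, len(wavefront), GROUP_SIZE)
    ]
    return [
        [group[j:j + QUAD_SIZE] for j in range(0, len(group), QUAD_SIZE)]
        for group in groups
    ]
```

For example, a domain of 200 threads yields four wavefronts, the last of which is only partially filled with 8 threads.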

ryta1203 · ‎02-12-2009

Ok, still a little foggy. Do the wavefronts switch per instruction group or per clause?

Because:

1) An instruction group does not necessarily have 5 instructions in it, sometimes it only has 1, sometimes it has up to 5.
2) Can't a clause (which is different from an instruction group according to the docs) have more than 5 instruction groups in it?

MicahVillmow · ‎02-12-2009

An instruction group, also known as an ALU bundle or an ALU Clause (per the R600_Assembly_Language_Format.pdf doc), can have between 1 and 5 instructions, although ideally you want 5. An ALU CF clause, which is different from an ALU clause, can have up to 128 ALU bundles in it. On a clause break, a wavefront switches either to the TEX/VTX unit or back to the queue to be run on the ALU.
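
To make those limits concrete, here is a small sketch that checks a clause against them and reports how full its bundles are. The limits (1 to 5 instructions per bundle, up to 128 bundles per ALU CF clause) are the ones stated above; representing a clause as a list of lists of opcode strings, and the function names, are assumptions made only for this example.

```python
MAX_SLOTS_PER_BUNDLE = 5      # 1 to 5 instructions per ALU bundle
MAX_BUNDLES_PER_CLAUSE = 128  # up to 128 bundles per ALU CF clause


def validate_clause(bundles):
    """Raise ValueError if a clause breaks the limits described above."""
    if not 1 <= len(bundles) <= MAX_BUNDLES_PER_CLAUSE:
        raise ValueError("a clause holds between 1 and 128 bundles")
    for index, bundle in enumerate(bundles):
        if not 1 <= len(bundle) <= MAX_SLOTS_PER_BUNDLE:
            raise ValueError(f"bundle {index} must hold 1 to 5 instructions")


def slot_utilization(bundles):
    """Fraction of the 5 slots per bundle that the clause actually fills."""
    validate_clause(bundles)
    used = sum(len(bundle) for bundle in bundles)
    return used / (MAX_SLOTS_PER_BUNDLE * len(bundles))


# Placeholder opcode names; these are not real ISA mnemonics.
example_clause = [["op_a", "op_b", "op_c", "op_d", "op_e"], ["op_f"]]
```

Here `slot_utilization(example_clause)` is 0.6, since 6 of the 10 available slots are filled; a clause made entirely of 5-instruction bundles would score 1.0.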

Understanding the ISA will help with understanding this concept.

ryta1203 · ‎02-13-2009

Micah,

Thanks again for your time.

Yes, I have looked at the ISA.

For example, see the first paragraph on page 4-1 of the R600 ISA doc, which refers to an "ALU Clause with one of the CF_INST_ALU* control-flow instructions, all of which use the CF_ALU_DWORD[0,1] microcode formats". Then compare that with the second paragraph on page 4-2: "Software issues ALU instructions in variable-length groups called instruction groups."

This is all under "Chapter 4 ALU Clauses". There are also chapters called "Vertex-Fetch clauses" and "Texture-Fetch clauses".

I'm just not seeing where they call an instruction group an ALU Clause, and an ALU Clause an ALU CF Clause. It appears that the docs (or at least the R600 ISA doc) call an ALU CF Clause an ALU Clause, and an ALU Clause an instruction group. If the other doc names them the other way around, then it's a good idea to get them all on the same page so there isn't any mix-up.

MicahVillmow · ‎02-13-2009

Yeah, we've spent a lot of time getting many of the other docs to be uniform in their terminology. It seems that the ISA doc needs some work too.

ryta1203 · ‎02-13-2009

Yeah, it just gets confusing when you start to talk about the details. It's a language all its own, so when the wrong terms are used (or the right terms are misunderstood) it becomes more difficult to convey the message.

Also, I noticed that the ISA output by KSA is a "shorthand" of the ISA from the doc. This isn't a big deal by any stretch of the imagination; I just didn't know whether it was done on purpose. For example, RCP_e in KSA is actually RECIP_ieee in the docs, etc. Like I said, it's more or less irrelevant.

ryta1203 · ‎02-15-2009

Micah,

Also, I forgot, the SCUG also calls them "Clauses" and NOT "CF Clauses", per 1.3.1 page 1-22.

There are also other places in the docs I have seen them called "Clauses" and not "CF Clauses". If AMD is planning on calling them "CF Clauses" then they should also change the SCUG to reflect this. Thanks.

Wavefronts Question