I'm working on developing a Cholesky matrix factorization routine. While thinking about how to do this I've come up with a few questions. If it's okay I'll post them as separate threads in case someone else has already thought about or is interested in particular ones.
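In case it's useful context, here's roughly the plain scalar algorithm I'm trying to map onto the card (a minimal Python sketch of an unblocked lower-triangular Cholesky factorization; the function name is just mine):

```python
import math

def cholesky(a):
    """Return the lower-triangular L with L * L^T == A, for a
    symmetric positive-definite matrix A given as a list of lists."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: square root of the remaining pivot.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l
```

The column-by-column dependency structure here is exactly what makes me want to understand the execution model before deciding how to block it for the GPU.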
To start with, I'm afraid I find it a bit unclear from the beta documentation simply how best to think about the cards. (If all this'll be in the 1.0 documentation please just say!) Unfortunately I don't have a graphics background. My impression from the documentation and from other pdf's (with diagrams) I've seen about the r600 is that a 3870 should be thought of as 64 processors arranged in four groups of sixteen, with each of the 64 processors being composed of 5 subprocessors. Is this correct so far? And how does this tie into "simds", "simd arrays", "pipelines", "alu units", "spus" etc?
Relatedly, how "independent" is each processor in a group? Do they all execute the same instruction on different data, or can they independently proceed through the same program? Presumably they can't each run a different program; in fact, in cal, do all processors (even in different groups) have to run the same program? Funnily enough, I feel most confident that I understand how things work at the processor/subprocessor level, with all subprocessors within a processor having their operation directed in a synchronized manner by "VLIW" instructions in a single "r600isa" program.
Then, is an execution domain split into blocks of 64 "elements"? Are the 64 elements processed between the 16 processors in a group, 4 each, or are all elements processed just by one processor? And does a thread mean what I've called a block? I read somewhere threads are operated on in pairs: does each group of 16 processors then process two blocks each in an interleaved manner? And so does a thread group mean the pair of threads assigned to what I've called a group of 16 processors? (Or does a thread mean one of the 64 elements and a thread group mean the set of 64 elements?)
Finally, some of the lower end cards seem to have different numbers of processors per group as well as a different number of groups; which concepts are invariant? E.g. do they still operate on blocks of 64 elements?
Sorry for the confusion and thanks a lot for any help!
sgratton, I'll try to answer your questions, but if I miss something or confuse you, feel free to ask more.

On the RV670 (Radeon 3870), there are four SIMD's on the card. Together these SIMD's form what is called a simd array; a pipeline can also be considered a SIMD, but should not be confused with the graphics or compute pipelines. Inside each SIMD are 16 alu processors, or shader processing units (spu's), grouped into quads. Each of these alu processors is a 5-wide scalar processor (or a 4-element vector plus 1 scalar), conveniently labeled X, Y, Z, W, and T. X, Y, Z, and W are the vector elements and T is the scalar.

Each SIMD processes 16 elements per cycle over four cycles, giving a granularity of 64 elements on the 670. This granularity is the wavefront size, which differs from chip to chip, and every instruction in a kernel is executed on a group of elements of the wavefront size. This is why it is bad to have flow control in a shader with a branching pattern that doesn't match the wavefront granularity: part of the wavefront goes down one branch, part goes down the other, no optimizations can be done, and both branches are executed for all threads.

When you decide what your execution domain does, the GPU launches a wavefront for every 64 elements that should be processed. If the final group of elements does not fill a wavefront, it is launched partially full, possibly losing some performance. Wavefronts go to the SIMD in pairs, i.e. thread A and thread B, which operate on their respective elements in parallel, alternating between ALU and texture clauses.
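As a rough sketch of that launch bookkeeping (plain Python; the 64 is the RV670 wavefront size from the description above, and the helper name is just for illustration):

```python
WAVEFRONT_SIZE = 64  # RV670 granularity: 16 elements/cycle over 4 cycles

def wavefronts_launched(domain_elements, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts the GPU launches for an execution domain,
    plus how full the last one is (a partially full wavefront can
    waste some performance)."""
    full, remainder = divmod(domain_elements, wavefront_size)
    count = full + (1 if remainder else 0)
    last_occupancy = (remainder or wavefront_size) / wavefront_size
    return count, last_occupancy

# A 130-element domain launches 3 wavefronts; the last carries only
# 2 of its 64 slots.
```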
As for your usage of threads, blocks, and groups: these aren't explicitly defined in our current GPGPU paradigm. Although threads are used synonymously with pixels, blocks and groups are not directly mapped to pixel shaders. In the pixel shader paradigm, what is important to keep track of is the wavefront size, as this is the execution granularity of your pixels.
Thanks very much for your helpful post. To check I've understood, let's take a 2400 card. Does it have 2 simd's, each consisting of 4 spu's, for a wavefront size of 16? If so, it is interesting to know that different cards have different granularities.
Two quick follow-up questions: What is the significance of the grouping of spu's into quads? And does each simd have to run the same kernel, in principle and in practice?
(By the way, I think I got the notion of threads and thread groups from the definitions in the cal programming guide and also in the r600isa guide. I'm not sure these documents describe things quite the same way as you do, e.g. they talk about a simd pipeline as having 5 alus, whereas I think you'd say a simd pipeline has 16 alu processors/spu's for a 3870 at least.)
Steven, I just reread the relevant portions of the r600isa guide and you are right, I mixed up the simd, alu, and spu definitions. Each SPU has 4 simd's, and each simd has 4 alu.vector units and 1 alu.trans unit. I hope I got the information across even though the terminology was a little mixed up.
1) Since the SIMD's are grouped into quads, the output mask granularity is larger than a single element. This mainly has to do with tiling formats, how the hardware handles them, and cache behavior. I also remember someone stating that the smallest granularity in the DX world is 4 elements, or a quad, but I can't find where that is referenced in our documentation. The r600isa guide talks about things like valid_pixel_mode and whole_quad_mode in section 3.6, and this is one place where grouping into quads comes in useful.
2) Currently we only support one kernel running at a time, but I can see that kind of support becoming possible in future hardware. CPU's went from sequential program execution to multi-program execution, so I'm sure GPU's could follow the same path.
I was kind of hoping your terminology was correct; it was certainly clearer, and seemed to fit what was used in e.g. a presentation at Stanford about the r600 by Eric Demers (which I got off the internet) more closely than the official r600isa documentation did!
I don't quite know what you mean with the new numbers: how many "new" spu's does a 3870 have? Would you mind re-enumerating everything in the new scheme?
The quad stuff sounds pretty complicated; maybe I can leave that for the moment!
We are actually trying to clean up the terminology a bit to be less graphics-speak and more compute-speak.
Try this on for size... 🙂
On the Radeon HD 3870, we have 4 SIMD arrays.
Each of those SIMD arrays has 16 thread processors (used to be called SPUs or shader processing units but not in AMD Stream Computing).
Each of those thread processors can be viewed as a VLIW processor.
Each of those VLIW processors has 5 stream cores (or in VLIW-speak, scalar processors). We name them X, Y, Z, W, and T. Each of those stream cores can do integer ops and single-precision floating point ops (with minor exceptions as noted in the ISA docs). The T unit can also do a few transcendental operations like sin() and cos(). A double-precision floating point operation is performed by combining the circuitry of the X, Y, Z, and W stream cores to carry out a single double-precision operation.
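Putting those counts together, the single-precision peak throughput works out as follows (a back-of-the-envelope Python sketch; the 775 MHz engine clock for the HD 3870 and counting a multiply-add as 2 flops are my assumptions here, not something stated above):

```python
def peak_sp_gflops(simd_arrays=4, thread_processors=16,
                   stream_cores=5, clock_ghz=0.775):
    """Single-precision peak, assuming every stream core can issue
    one multiply-add (2 flops) per cycle."""
    total_cores = simd_arrays * thread_processors * stream_cores
    return total_cores * 2 * clock_ghz

# 4 * 16 * 5 = 320 stream cores -> 496 GFLOPS at an assumed 775 MHz
```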
All of the thread processors in a SIMD array must execute the same instruction path. In reality, multiple instruction paths are time multiplexed on a SIMD array in order to cover up memory and ALU latency. I believe the number that is multiplexed is 4, but I can't be 100% certain about that detail.
Different generations of ASICs may vary the number of SIMD arrays and thread processors as well as some other capabilities. But for the time being, this architectural view should hold.
Each stream core in a thread processor can have a different instruction as long as the instruction mix is the same across all of the thread processors in a SIMD array at a particular time slice (once again, because it is single instruction multiple data).
The ability to maximally utilize all of the stream cores does, of course, depend on your application and how well it maps to this architecture (as expected).
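That mapping question can be made concrete with a toy estimate (a sketch only; "average slots filled per bundle" is something you would read off from the compiled ISA of your kernel, and the function name is illustrative):

```python
def vliw_utilization(avg_slots_filled, width=5):
    """Fraction of the X/Y/Z/W/T stream cores doing useful work,
    given the average number of slots the compiler manages to fill
    per VLIW instruction bundle."""
    return avg_slots_filled / width

# A purely scalar dependency chain fills ~1 of 5 slots -> 20% utilization,
# while fully packed bundles reach 100%.
```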
Does this make more sense? I know there may be some stuff that is out of sync with the technical docs, but this simplified compute view is something that we have been trying to clean up and haven't fully propagated into all of our docs yet... I apologize for the confusion!
That sounds very good and clear! (And very close to Micah's original description.)
I think I mentioned to AMD via email some time ago about the relevance of the documentation for GPGPU and was informed you were working on a new description; I guess I got a bit worried when the 1.0 beta docs were pretty similar to the alpha ones!
You'll see a progressive cleanup (or perhaps a more dramatic cleanup, depending on how fast I can get the tech writer to work :-)) over the next few releases. I can't promise it'll make it for the next release, but the changes should definitely be coming the release after that, in the form of additional overview docs as well.