cconvey
Journeyman III

Newbie question: "stream core" vs. "simd processor" ?

What's the difference? What supports MIMD?

I'm new to GPU programming, but I understand the MIMD-SIMD distinction.

I have an existing algorithm that's data-parallel, but the algorithm it applies to a given datum is quite complicated and contains lots of branching.

For this reason, I'm considering using a FireStream board as a highly parallel MIMD device, regardless of what its SIMD capabilities might be.

AMD advertises its 9370 board's GPU as having 1600 stream cores and 20 SIMD processors, but I'm not clear on what they mean by "stream core" vs. "SIMD processor". Can anyone explain (or point me to a good document)?

If I'm going to use that board as a MIMD device, I'm trying to understand if this thing offers me 20x MIMD parallelism, 1600x MIMD parallelism, or something else.

5 Replies
bridgman
Staff

I don't have a handy link to good documentation, but the basic idea is:

- GPU has 20 SIMD engines

- each SIMD works on 16 data elements at a time, where each data element consists of 4 32-bit values (in graphics a data element might be a vertex or a pixel, each of which has multiple components)

- on any given cycle a SIMD runs the same instructions on all 16 data elements, but another SIMD might be running a different instruction from a different program (on 16 *different* data elements) on the same clock

- the "instructions" run on each data element per clock allow 5 different operations simultaneously, for a total of 20 SIMDs x 16 elements per SIMD x 5 operations per element per clock, or 1600 simultaneous ALU operations

Stream core in this context would refer to a single ALU, while SIMD refers to a bank of 16x5 ALUs performing up to 5 instructions simultaneously on each of 16 data elements.
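
The arithmetic above can be sketched in a few lines. This is only an illustration of the counting, using the figures quoted in this reply; the constant and function names are made up for the example and are not AMD terminology.

```python
# Figures quoted in the reply above; names are illustrative, not AMD terms.
SIMD_ENGINES = 20        # each SIMD can run a different program on a given clock
ELEMENTS_PER_SIMD = 16   # data elements processed in lockstep by one SIMD
OPS_PER_ELEMENT = 5      # operations per data element per clock (VLIW slots)


def alus_per_simd(elements, ops):
    """ALUs in one SIMD: every element gets its own set of operation slots."""
    return elements * ops


def stream_cores(simds, elements, ops):
    """Total ALUs ("stream cores") across the whole GPU."""
    return simds * alus_per_simd(elements, ops)


if __name__ == "__main__":
    print("ALUs per SIMD:", alus_per_simd(ELEMENTS_PER_SIMD, OPS_PER_ELEMENT))
    print("Stream cores: ", stream_cores(SIMD_ENGINES, ELEMENTS_PER_SIMD, OPS_PER_ELEMENT))
```

Read this way, the number of different programs that can be running on a given clock is the SIMD count (20), while 1600 is the number of individual ALU operations that can be in flight at once.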

Clear as mud ?

0 Likes

- on any given cycle a SIMD runs the same instructions on all 16 data elements, but another SIMD might be running a different instruction from a different program (on 16 *different* data elements) on the same clock

- the "instructions" run on each data element per clock allow 5 different operations simultaneously, for a total of 20 SIMDs x 16 elements per SIMD x 5 operations per element per clock, or 1600 simultaneous ALU operations



So what does it mean when you say the same "instruction" is running on all data elements (first bullet point), but 5 different "operations" can be applied to the data elements?

Is this where the VLIW comes in?  I.e., a single VLIW "instruction" can indicate different specific operations (add, subtract, compare, etc.) on the different data elements? 

Also, in these clusters of 5, can a single instruction perform both a trig function and some other 64-bit floating-point operation (addition, etc.)? Or must trig functions be done in their own instructions?

0 Likes

Same "instructions" actually, not "instruction", but yes this is where VLIW comes in.

I haven't gone through the latest ISA guide in detail, but I believe trig instructions are separate operations from other math ops. IIRC one of the 5 instructions can be trig or integer ops; the others are 32-bit float ops. I haven't looked at 64-bit float, but my guess is that you can only run one or two instructions per VLIW if you are using 64-bit (rather than 5 for 32-bit ops). Will check.

EDIT : looking at the ISA guides at :

http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx

... 64-bit operations use either 2 or 4 instruction slots of the 5 slots available on pre-Cayman (VLIW-5) GPUs / 4 slots available on Cayman (VLIW-4).
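
As a rough check on the earlier "one or two instructions per VLIW" guess, the slot counts above can be turned into an upper bound. This sketch assumes that slot count is the only limit; the ISA may impose further restrictions on which slots a 64-bit operation can occupy, so treat the result as a ceiling, not a guarantee. The function name is hypothetical.

```python
def max_ops_per_bundle(slots_in_bundle, slots_per_op):
    """Upper bound on how many ops of a given slot cost fit in one VLIW bundle.

    Assumption: only the slot count limits packing. Real hardware may
    restrict which slots a multi-slot operation is allowed to use.
    """
    return slots_in_bundle // slots_per_op


if __name__ == "__main__":
    # Slot counts from the post: 5 on pre-Cayman (VLIW-5), 4 on Cayman (VLIW-4).
    for name, slots in (("VLIW-5", 5), ("VLIW-4", 4)):
        for cost in (2, 4):  # a 64-bit op uses either 2 or 4 slots
            print(name, "ops using", cost, "slots each: at most",
                  max_ops_per_bundle(slots, cost), "per bundle")
```

By slot counting alone, both designs fit at most two 2-slot operations or one 4-slot operation per bundle, which is consistent with the guess of one or two 64-bit instructions per VLIW.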

0 Likes

What do "complex" and "simple" stream processors mean? What is the difference? When I run an OpenCL kernel, are all the stream processors working, or are there cases where some of them are not?

0 Likes

Where are you seeing the complex vs simple description ?

For now I'm guessing that "complex" refers to the 5th ALU block on a VLIW-5 processor, which can handle some additional functions such as trig and integer operations.

If so, when you are running an OpenCL program then it's likely that only 4 of the 5 ALUs would be working on a VLIW-5 processor. On the VLIW-4 processors such as Cayman (HD69xx) all 4 ALUs are identical, i.e. the complex/simple distinction goes away, which makes it easier to get full utilization on OpenCL and other compute applications.
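
To put a number on that, here is a small sketch of the best-case ALU utilization implied by the figures above. The "4 of 5" value is the estimate given in this reply, not a measurement, and the function name is invented for the example.

```python
def alu_utilization(busy_alus, total_alus):
    """Fraction of the ALUs behind one data element that do useful work."""
    if total_alus <= 0:
        raise ValueError("total_alus must be positive")
    return busy_alus / total_alus


if __name__ == "__main__":
    # VLIW-5: assumes only the 4 identical ALUs are kept busy (estimate from the reply).
    print("VLIW-5: {:.0%}".format(alu_utilization(4, 5)))
    # VLIW-4 (Cayman): all 4 ALUs are identical, so all 4 can in principle be used.
    print("VLIW-4: {:.0%}".format(alu_utilization(4, 4)))
```

These are ceilings under the stated assumption; actual utilization also depends on how well the compiler can pack independent operations into each VLIW instruction.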
