5 Replies Latest reply on Sep 27, 2011 3:18 PM by bridgman

    Newbie question: "stream core" vs. "simd processor" ?

    cconvey
      What's the difference? What supports MIMD?

      I'm new to GPU programming, but I understand the MIMD-SIMD distinction.

      I have an existing algorithm that's data-parallel, but the algorithm it applies to a given datum is quite complicated and contains lots of branching.

      For this reason, I'm considering using a FireStream board as a highly parallel MIMD device, regardless of what its SIMD capabilities might be.

      AMD advertises its 9370 board's GPU as having 1600 stream cores and 20 SIMD processors, but I'm not clear on what they mean by "stream core" vs. "SIMD processor".  Can anyone explain (or point me to a good document)?

      If I'm going to use that board as a MIMD device, I'm trying to understand if this thing offers me 20x MIMD parallelism, 1600x MIMD parallelism, or something else.

        • Newbie question: "stream core" vs. "simd processor" ?
          bridgman

          I don't have a handy link to good documentation, but the basic idea is :

          - GPU has 20 SIMD engines

          - each SIMD works on 16 data elements at a time, where each data element consists of 4 32-bit values (in graphics a data element might be a vertex or a pixel, each of which has multiple components)

          - on any given cycle a SIMD runs the same instructions on all 16 data elements, but another SIMD might be running a different instruction from a different program (on 16 *different* data elements) on the same clock

          - the "instructions" run on each data element per clock allow 5 different operations simultaneously, for a total of 20 SIMDs x 16 elements per SIMD x 5 operations per element per clock, or 1600 simultaneous ALU operations

          Stream core in this context would refer to a single ALU, while SIMD refers to a bank of 16x5 ALUs performing up to 5 instructions simultaneously on each of 16 data elements.
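
          The arithmetic above can be sketched in a few lines. This is only an illustration of the counting, using the figures quoted in this thread; the function and variable names are made up for the example and are not part of any AMD API.

```python
# Figures quoted in this thread for the FireStream 9370 GPU.
NUM_SIMDS = 20          # SIMD engines, each able to run its own program
ELEMENTS_PER_SIMD = 16  # data elements processed in lock step per clock
OPS_PER_ELEMENT = 5     # VLIW-5: up to 5 operations per element per clock


def peak_alu_ops_per_clock(simds, elements, ops):
    """Peak simultaneous ALU operations, i.e. the advertised stream core count."""
    return simds * elements * ops


def independent_instruction_streams(simds):
    """Different SIMDs may run different programs, so this is the number
    of distinct instruction streams that can be active on one clock."""
    return simds


stream_cores = peak_alu_ops_per_clock(NUM_SIMDS, ELEMENTS_PER_SIMD, OPS_PER_ELEMENT)
mimd_width = independent_instruction_streams(NUM_SIMDS)
print(stream_cores, mimd_width)
```

          In other words, the 1600 figure counts ALUs, while the number of truly independent instruction streams on a given clock is closer to 20.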

          Clear as mud ?

            • Newbie question: "stream core" vs. "simd processor" ?
              cconvey

               

              - on any given cycle a SIMD runs the same instructions on all 16 data elements, but another SIMD might be running a different instruction from a different program (on 16 *different* data elements) on the same clock

              - the "instructions" run on each data element per clock allow 5 different operations simultaneously, for a total of 20 SIMDs x 16 elements per SIMD x 5 operations per element per clock, or 1600 simultaneous ALU operations



              So what does it mean when you say the same "instruction" is running on all data elements (first bullet point), but 5 different "operations" can be applied to the data elements?

              Is this where the VLIW comes in?  I.e., a single VLIW "instruction" can indicate different specific operations (add, subtract, compare, etc.) on the different data elements? 

              Also, in these clusters of 5 elements, can a single instruction perform both a trig function and some other 64-bit floating point operation (addition, etc.) using the very same instruction?  Or must trig functions be done in their own instructions?

                • Newbie question: "stream core" vs. "simd processor" ?
                  bridgman

                  Same "instructions" actually, not "instruction", but yes this is where VLIW comes in.

                  I haven't gone through the latest ISA guide in detail but I believe trig instructions are separate operations from other math ops. IIRC one of the 5 instructions can be trig or integer ops, the others are 32-bit float ops. I haven't looked at 64-bit float but my guess is that you can only run one or two instructions per VLIW if you are using 64-bit (rather than 5 for 32-bit ops), will check.

                  EDIT : looking at the ISA guides at :

                  http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx

                  ... 64-bit operations use either 2 or 4 instruction slots of the 5 slots available on pre-Cayman (VLIW-5) GPUs / 4 slots available on Cayman (VLIW-4).
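
                  As a rough sketch of what that means for throughput, the snippet below counts how many operations of a given slot cost fit in one VLIW instruction. It assumes a 32-bit op costs 1 slot and that slots can be packed freely, which is a simplification: the real packing rules are in the ISA guides and may be stricter. The helper name is hypothetical.

```python
def max_ops_per_bundle(total_slots, slots_per_op):
    """Upper bound on ops of one kind per VLIW instruction, assuming free packing."""
    return total_slots // slots_per_op


# Slot counts quoted in this thread; costs of 2 and 4 are the 64-bit cases.
ARCHITECTURES = {"VLIW-5 (pre-Cayman)": 5, "VLIW-4 (Cayman)": 4}

for name, slots in ARCHITECTURES.items():
    for cost in (1, 2, 4):
        print(name, "slots per op:", cost, "max ops:", max_ops_per_bundle(slots, cost))
```

                  Under these assumptions a 64-bit op that needs 2 slots fits at most twice per instruction on either architecture, and one that needs 4 slots fits once.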

                • Newbie question: "stream core" vs. "simd processor" ?
                  szabi_h

                  What do "complex" and "simple" stream processor mean? What is the difference? When I run an OpenCL kernel, are all the stream processors working, or are there cases where one of them is not working?

                    • Newbie question: "stream core" vs. "simd processor" ?
                      bridgman

                      Where are you seeing the complex vs simple description ?

                      For now I'm guessing that "complex" refers to the 5th ALU block on a VLIW-5 processor, which can handle some additional functions such as trig and integer operations.

                      If so, when you are running an OpenCL program then it's likely that only 4 of the 5 ALUs would be working on a VLIW-5 processor. On the VLIW-4 processors such as Cayman (HD69xx) all 4 ALUs are identical, i.e. the complex/simple distinction goes away, which makes it easier to get full utilization on OpenCL and other compute applications.
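
                      To put a number on that, here is a back-of-the-envelope sketch. It takes the "4 of 5 ALUs busy" guess above at face value and reuses the 20 SIMDs x 16 elements figures from earlier in the thread; real utilization depends on the kernel and the compiler, so treat the result as an illustration only. The names are invented for the example.

```python
from fractions import Fraction

NUM_SIMDS = 20          # figures quoted earlier in this thread
ELEMENTS_PER_SIMD = 16


def alu_utilization(busy_alus, alus_per_element):
    """Fraction of the ALUs serving each data element that are doing work."""
    return Fraction(busy_alus, alus_per_element)


def busy_stream_cores(busy_alus):
    """Stream cores doing work per clock across the whole GPU."""
    return NUM_SIMDS * ELEMENTS_PER_SIMD * busy_alus


vliw5_util = alu_utilization(4, 5)   # the 5th "complex" ALU is assumed idle
vliw4_util = alu_utilization(4, 4)   # all 4 ALUs identical
print(float(vliw5_util), float(vliw4_util), busy_stream_cores(4))
```

                      So if the guess holds, a 1600-stream-core VLIW-5 part would have about 1280 ALUs doing work per clock in that case.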