
    GCN SIMD vs SIMT

    savage309

      Hey,

      I know that this topic has been raised a lot, but I want to bring it up one more time...

      I am working on some GPGPU apps and I need to make them run fast on all kinds of hardware, so for quite some time now I have been doing some in-depth research on the hardware and what it does exactly.

      I have more experience with CUDA and nVidia GPUs (since the nVidia OpenCL implementation is really bad), but I am more and more interested in AMD GPUs, since I can see a lot of potential in them that is not being used properly today on our side.

      I am trying to find out whether there is any difference between the SIMD model in GCN 1.2 and the SIMT model (in, let's say, Maxwell), or whether SIMT is just a marketing buzzword used by nVidia (honestly, I don't see much of a difference; if there is any, it has to be in the way branching is handled). If there is a difference, how does all this compare to the Intel GPUs?

       

      Furthermore, we lack good video lectures on GCN (or at least I can't find any; on the other hand, we have the Stanford nVidia lectures, which are quite good). The GCN whitepaper could also use a bit of refining (I am not a hardware expert, but I have read quite a few whitepapers and have some view on hardware, and at some point it lost me).

       

      Thanks!

        • Re: GCN SIMD vs SIMT
          Raistmer

          AFAICT AMD GPUs use SIMD registers and the corresponding SIMD asm instructions.

          That is, it's a true SIMD architecture, like SSE in the CPU world for example. Each thread (work-item) can issue a hardware SIMD instruction that operates on 4 float numbers.

          On the other hand, nVidia's devices have always been scalar. They can operate on many threads/work-items simultaneously (in this respect I see no big difference between AMD and nVidia, both are SIMT), but the ISA is scalar. That is, a float4 is "emulated" by using 4 threads (or 4 serial operations in one thread, depending on the actual implementation), while the AMD chip has a single vector instruction.

           

          It would be interesting to get more info from experienced people indeed, especially about iGPUs, which are relatively new and not well covered.

            • Re: GCN SIMD vs SIMT
              savage309

              If I have ...

              float4 a = global_data_0[thread_id];

              float4 b = global_data_1[thread_id];

              float4 res = a + b;

              And if I run N threads, each one of them will execute (a.x + b.x), then (a.y + b.y) and so on, right? It is SIMD because I have a single instruction (the sum) executed on multiple data fields (from global_data), not because it will do a SIMD sum on a float4, right (the execution is spread across those compute units, which run in lock step and so on and so forth)? And this is exactly what nVidia SIMT does, so they should be (roughly) the same.
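
              To make the question concrete, the component-wise version of the kernel I have in mind would be roughly this (just my own illustration; the kernel name and the 'out' buffer are made up):

              __kernel void add4(__global const float4 *global_data_0,
                                 __global const float4 *global_data_1,
                                 __global float4 *out)
              {
                  int thread_id = get_global_id(0);
                  float4 a = global_data_0[thread_id];
                  float4 b = global_data_1[thread_id];
                  float4 res;
                  // Each of these component sums is one instruction per work-item,
                  // executed across all the lanes/threads in lock step.
                  res.x = a.x + b.x;
                  res.y = a.y + b.y;
                  res.z = a.z + b.z;
                  res.w = a.w + b.w;
                  out[thread_id] = res;
              }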

                • Re: GCN SIMD vs SIMT
                  maxdz8

                  Sounds right.

                  It's very simple: what you write in a kernel definition is what the ALU assigned to the WI will do.

                  If you want to do SIMD in the sense of having multiple ALUs "collaborating" in adding vectors, you will most likely go through LDS and write a kernel which has an explicit notion of WIs running "in tandem" with each other.
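
                  A minimal sketch of what I mean (names are made up, and it assumes a power-of-two work-group size): a work-group sum where the WIs cooperate through LDS (__local memory).

                  __kernel void group_sum(__global const float *in,
                                          __global float *out,
                                          __local float *scratch)
                  {
                      int lid = get_local_id(0);
                      scratch[lid] = in[get_global_id(0)];
                      barrier(CLK_LOCAL_MEM_FENCE);
                      // Tree reduction: each step halves the number of active WIs,
                      // so several ALUs "collaborate" on one result through LDS.
                      for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
                          if (lid < s)
                              scratch[lid] += scratch[lid + s];
                          barrier(CLK_LOCAL_MEM_FENCE);
                      }
                      if (lid == 0)
                          out[get_group_id(0)] = scratch[0];
                  }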

                • Re: GCN SIMD vs SIMT
                  maxdz8

                  Most definitely not. As far as I can tell, all instructions in AMD GCN are scalar in nature. Note: I'm not an expert at the ISA level; I've just rapidly skimmed the ISA manual several times.

                  AMD GCN classifies instructions as "vector" or "scalar" based on whether they run on the vector ALUs or the scalar unit. They are not "vector" in the sense that "they operate on vectors", but rather in the sense that they are "executed across a vector of ALUs".


                  Your interpretation was sort of correct for VLIW. AFAIK.

                • Re: GCN SIMD vs SIMT
                  tzachi.cohen

                  From a high-level perspective, AMD's GCN architecture is a scalar SIMT design, much like our green competitor's.

                  There are, of course, several implementation differences; for example, our SIMT execution unit is called a wavefront and is 64 threads wide, while theirs is called a warp and is 32 threads wide.

                  AMD has a unique scalar engine. While observing kernel code we noticed that some instructions have identical data across all threads of a wavefront; these cases can be detected by the compiler, which will issue the instruction to the scalar engine instead of the vector engine. A scalar instruction is executed once for all threads instead of being executed identically 64 times. This helps improve power consumption. Moreover, since the scalar and vector engines are independent of each other, they can process instructions from different wavefronts in parallel.
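
                  As a rough illustration (a made-up kernel, not taken from any real code): in the sketch below, 'scale' and 'bias' are identical for every thread of a wavefront, so the multiply is a candidate for the scalar engine, while the per-thread add must run on the vector engine. Whether the compiler actually does this for a particular kernel is up to the compiler.

                  __kernel void scale_add(__global float *data, float scale, float bias)
                  {
                      int gid = get_global_id(0);
                      float uniform_part = scale * bias;    // same value for the whole wavefront:
                                                            // candidate for the scalar engine
                      data[gid] = data[gid] + uniform_part; // per-thread data: vector engine
                  }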

                   

                  AMD GPUs can execute different OCL kernels in parallel. When initializing several OCL queues, they can be bound to different GPU entry points and submit kernels for execution concurrently. This helps saturate the GPU when launching small kernels that do not fully occupy the machine individually.
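
                  On the host side, a minimal sketch of what this looks like (error handling omitted; the context, device and the two kernels are assumed to have been created elsewhere):

                  #include <CL/cl.h>

                  /* Two independent in-order queues on the same device; kernels flushed
                     to different queues may be scheduled on the GPU concurrently. */
                  void launch_concurrently(cl_context ctx, cl_device_id dev,
                                           cl_kernel kernelA, cl_kernel kernelB)
                  {
                      cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, NULL);
                      cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, NULL);

                      size_t gsize = 4096; /* small grids that do not fill the machine alone */
                      clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);
                      clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);
                      clFlush(q0);
                      clFlush(q1);

                      clFinish(q0);
                      clFinish(q1);

                      clReleaseCommandQueue(q0);
                      clReleaseCommandQueue(q1);
                  }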

                    • Re: GCN SIMD vs SIMT
                      Raistmer

                      Thanks for the correction; yes, I was speaking about older architectures.

                      BTW, from what family onward is this

                      tzachi.cohen wrote:

                      AMD GPUs can execute different OCL kernels in parallel. When initializing several OCL queues, they can be bound to different GPU entry points and submit kernels for execution concurrently. This helps saturate the GPU when launching small kernels that do not fully occupy the machine individually.

                       

                      feature supported (and exposed in the OCL runtime)?

                      AFAIK nVidia has done the same since the Fermi family, no?

                        • Re: GCN SIMD vs SIMT
                          tzachi.cohen

                          Async compute is supported on all GCN GPUs, with a varying number of ACEs (Asynchronous Compute Engines). SI devices have two ACEs; Hawaii has 8.

                          You can read more about it here:

                          AMD Dives Deep On Asynchronous Shading

                           

                          According to AnandTech, this feature is not supported on Fermi/Kepler, at least not while the graphics queue is active.

                            • Re: GCN SIMD vs SIMT
                              Raistmer

                              Thanks for the link.

                               

                              There was mention of Fermi being able to run 2 kernels simultaneously. And for Kepler they announce this:

                               

                              Hyper-Q – Hyper‐Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper‐Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware‐managed connections (compared to the single connection available with Fermi). Hyper‐Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see up to dramatic performance increase without changing any existing code.

                              http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

                               

                              And for Fermi too:

                              Concurrent kernel execution: at the GPU level, GPU functions can be executed simultaneously; Architecture 2.x: 16 kernels [1]; Architecture 3.x: 32 kernels [1].

                              https://rcc.its.psu.edu/education/seminars/pages/advanced_cuda/AdvancedCUDA5.pdf

                              (slide 11/77)

                               

                              Could you comment on how this differs from GCN's ACEs?

                               

                              As I understood it, the main benefit highlighted in that AnandTech article is GCN's ability to run several shaders (specifically, graphics shaders) simultaneously. It's definitely a great advantage, but it has little to do with the GPGPU area; here we mostly use compute shaders. And even that AnandTech article lists 32 compute queues for Kepler.

                              Also, IMHO this conclusion

                              So pre-Maxwell 2 GPUs have to either execute in serial or pre-empt to move tasks ahead of each other, which would indeed give AMD an advantage

                              directly contradicts both the listed table (32 compute queues for Kepler) and nVidia's claims about their hardware's ability (starting from Fermi) to execute warps from 2 different kernels simultaneously.

                              BTW, reading that sentence in context makes much more sense of it: it becomes obvious they are again speaking about mixing compute and graphics tasks (shaders), not compute and compute ones. But again, while being able to compute and render simultaneously is a good thing, it's irrelevant for many pure GPGPU, computational applications.

                          • Re: GCN SIMD vs SIMT
                            savage309

                            Thanks, that is some helpful information, especially about the purpose of the scalar unit. I've read about it in the whitepaper; however, your answer made it much clearer than the explanation there.

                            So, is it true that the scheduling of the instructions can be a bit more flexible in GCN? For example, is it not always SIMT (every thread owning one lane of the vector unit), but instead the instructions can be scheduled over the lanes in a SIMD manner?

                            In other words, do you support the optional __attribute__((vec_type_hint(<type>)))?

                              • Re: GCN SIMD vs SIMT
                                tzachi.cohen

                                Hi,

                                 

                                GCN GPUs are strict SIMT.

                                Decorating a kernel with '__attribute__((vec_type_hint(<type>)))' will not influence GPU compilation artifacts.

                                Since GCN is a scalar architecture, i.e. each thread can execute at most a single-component ALU operation per cycle, there is no point in trying to horizontally vectorize the code.
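
                                For reference, the hint in question looks like this (a made-up kernel just to show the syntax); on GCN it is accepted but, as noted above, it does not change the generated code:

                                __kernel __attribute__((vec_type_hint(float4)))
                                void scale2(__global float4 *data)
                                {
                                    int gid = get_global_id(0);
                                    // The hint does not change how this is compiled
                                    // for GCN: each component is still a scalar
                                    // operation per thread.
                                    data[gid] = data[gid] * 2.0f;
                                }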


                                Tzachi


                                 
