4 Replies Latest reply on Mar 25, 2013 1:06 PM by LeeHowes

    SIMD - some more explanation

    sajis997

      Hello forum,

       

      While going through the book "Programming Massively Parallel Processors" I found the following statement in chapter 4 confusing:

       

      "At any instant in time, one instruction is fetched and executed for all threads in the warp."

       

      I believe that the concept of a warp is not part of either the CUDA or the OpenCL specification. Let me explain the warp here. Once a block is assigned to a streaming multiprocessor, it is further divided into 32-thread units called warps.

       

      What exactly do we mean by "instruction" here? Does one instruction mean one kernel function, or one of the statements inside the kernel function?

       

       

      Some light on the matter would be appreciated.

       

       

      Regards

      Sajjad

        • Re: SIMD - some more explanation
          himanshu.gautam

          WARP is a CUDA term. The AMD analogue is the "wavefront".

           

          Be it wavefront or warp, these bunches of threads/work-items always execute together, so one program counter is enough to track each of them. Given a workgroup of 128 work-items on AMD (64 work-items per wavefront), there will be 2 wavefronts in progress during execution, and 2 program counters are enough to track all of these work-items.

           

          Instruction - means an assembly language instruction here.

          • Re: SIMD - some more explanation
            LeeHowes

            I strongly recommend purchasing "Heterogeneous Computing with OpenCL", because I made a conscious effort to do a good job of explaining this concept and the way architectures deal with it, and I hope I succeeded; I had a good number of pages and diagrams to explain it over there.

             

            Himanshu is right. Although I'd explain it a little differently.

             

            Picture an x86 CPU. On the CPU we have SSE and AVX units. We usually program those with intrinsics, or we let the compiler infer the vectorisation. But there is one thread that drives through that AVX pipeline. Note that what you have there is one thread (one program counter) with 8 SIMD lanes on an AVX unit. Call that a "wavefront".

             

            Then look at the modern AMD GPU. We have one scalar unit with a wide vector pipeline. That vector pipeline has 16 lanes but it takes 4 cycles to execute, so you can view it as 64 lanes (sometimes AVX is implemented on top of a narrower SSE unit in the same way). That there is your wavefront again. There is one program counter.

             

            Separately from this we have the programming model on top, in this case CUDA or OpenCL. These models assume that there is a set of "work-items" with a certain degree of independence - indeed you are banned from doing certain types of communication between work-items. The reason for doing this is that it allows a single OpenCL work-item to map either to an underlying hardware thread or to a single lane of that SIMD vector. So when we compile OpenCL down to the GPUs we take 64 work-items and map them to a single program counter, a single hardware thread, and indeed a single wavefront.

             

            On that single thread we emulate the behaviour of branches that diverge across the wavefront and therefore we simulate the existence of multiple "threads", which is why CUDA uses that terminology. In reality those threads are merely simulated on top of that single program counter. When you diverge you need to execute both halves of the divergence, and that is why divergence is expensive.

             

            As there is only one program counter, pointing to just one machine code instruction (roughly analogous to a statement in C, though a complicated statement would be split into multiple instructions depending on the instruction set), all "threads" in the warp have a single instruction loaded, and execute that single instruction. Some may be masked out because they did not take that branch; others will commit results from the instruction, and later the masking will be reversed. You could map OpenCL similarly onto AVX - 8 work-items onto a single AVX unit - but it works less well than it does on the GPU because AVX is less capable than the GPU's vector unit.

              • Re: SIMD - some more explanation
                sajis997

                Hello ,

                 

                 

                I shall grab the book as soon as possible. What are the prerequisites for understanding the concepts in the book you recommend?

                Currently I am going through the latest edition of "Programming Massively Parallel Processors". If I have the basics from this book, am I good to go with the book you recommend?

                 

                 

                Regards

                Sajjad

                  • Re: SIMD - some more explanation
                    LeeHowes

                    Well, I only recommend it because I wrote it and I know I wanted to address this kind of question. It also gives you an OpenCL point of view rather than a CUDA one and describes a mapping to AMD hardware, which, as you're asking questions on AMD's OpenCL forum, might be useful. It starts at the beginning, so you don't need any extra background.

                     

                    However, the book you're reading is an excellent book. It just has a different point of view on the subject.