8 Replies Latest reply on Jun 21, 2010 12:00 AM by niravshah00

    A couple of question

    niravshah00

      1.
      When you set BRT_RUNTIME to CPU, does it emulate the parallelization, or does it just run the code serially?

      When I run a serial version (I have written a C program), it takes less time, while the Brook+ code run with the CPU backend takes a very, very long time. The input range is the same in both cases.

      I would really like to know the reason for this.

      2.

      I read the hardware architecture description provided with Brook+. It says stream processors have SIMD engines, and SIMD engines have thread processors, which have stream cores. So how many threads are executed at a time? I am using a FireStream 9170. The document says the wavefront size is 64. I checked the specification of the 9170 on the AMD website, but it doesn't have the full description that they have for the 9270.

      I would really like answers to these questions.

        • A couple of question
          ryta1203

          1. Not sure, I believe serialization but don't quote me on that.

          2. SIMD engine = 16 TPs x 5-wide VLIW, with 2 wavefront slots per SIMD engine. The number of simultaneous (switching) wavefronts depends on the resources available (registers, etc.). WF size is 64, organized into 16 quads of 2x2 threads.

          Someone please correct me if I'm wrong.

            • A couple of question
              niravshah00

              Thanks ryta1203,

              Well, when I present my code and describe the architecture of the stream processors, the next question from the panel will be: you are using a 9170 card, so how many SIMD engines does it have, and how many threads can be executed in parallel?

              If it is serializing the code, then why does it take more time than the simple C code? Is there any explanation for this?

                • A couple of question
                  ryta1203

                  Last question first: just a guess, but I'm sure there is more overhead. This could be due to several things: emulation, extra code, unoptimized code, etc.

                  http://ati.amd.com/products/streamprocessor/specs.html

                  320 cores. 320/5 = 64 thread processors, and 64/16 = 4, so 4 SIMD engines? (Going off what I posted above about 5-wide VLIW per thread processor and 16 TPs per SIMD engine.) That's what makes sense to me, particularly considering the 9170 came out at the same time as the 3870, which also has 4 SIMD engines (320 "stream" processing cores).

                  I don't think the above formula applies to all AMD GPUs; however, it does apply to the "higher" end models, at least for now.
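
The arithmetic above can be written out as a small sketch. It only restates the figures quoted in this post (320 stream cores, 5-wide VLIW, 16 thread processors per SIMD engine); the function name and its default parameters are made up for illustration and are not part of any AMD tool or of Brook+.

```python
def simd_engine_count(stream_cores, vliw_width=5, tps_per_simd=16):
    """Derive thread-processor and SIMD-engine counts from a stream-core total.

    Assumes every thread processor is vliw_width wide and every SIMD engine
    holds tps_per_simd thread processors, as described in the post above.
    """
    thread_processors, leftover = divmod(stream_cores, vliw_width)
    if leftover:
        raise ValueError("stream_cores is not a multiple of the VLIW width")
    simd_engines, leftover = divmod(thread_processors, tps_per_simd)
    if leftover:
        raise ValueError("thread processors do not divide evenly into SIMD engines")
    return thread_processors, simd_engines


# Figures quoted for the FireStream 9170: 320 stream cores.
tps, engines = simd_engine_count(320)
print(tps, engines)  # 64 4
```

As noted, this ratio is only expected to hold for the parts discussed here, so treat the defaults as assumptions rather than a general rule.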

                    • A couple of question
                      niravshah00

                       


                       

                      OK, you said emulation? There is no question of extra code, except that I filter the result from the stream (i.e. each element of the stream is checked) after each kernel call.

                       

                      About the number of threads that execute, I get it:

                      4 SIMD engines, each containing 16 thread processors, so it gives the wavefront size of 64.

                      Thanks a lot

                        • A couple of question
                          ryta1203

                          1. Extra code is probably added in the emulation. I didn't mean by you; I meant by the compilation.

                          2. Yes, 64 threads per wavefront. 2 wavefronts (odd/even slot) per SIMD engine at ONE time; however, you can have many, many wavefronts running simultaneously. How many depends on your resource usage.

                          For example, if there are 256 registers allocated per SIMD engine and your kernel uses all the T registers (clause temporary registers, taken from the general pool of registers; 4 in this example), then that leaves 252 per SIMD engine. Now, if your kernel uses 10 GPRs, then you have 252/10 = 25.2, so you have 25 wavefronts running simultaneously.

                          Now, at EXACTLY one time you can only have 2 WFs per SIMD engine (odd/even slot); however, when one of these wavefronts stalls (memory read/write, etc.), it can be switched out for another wavefront (to help hide latency).

                          Something like this, in general, from my understanding.
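
The register example above can be sketched as a quick calculation. The numbers (256 registers, 4 clause temporaries, 10 GPRs per kernel, 2 slots per SIMD engine) are the illustrative ones from this post, not verified hardware limits, and the function and parameter names are hypothetical.

```python
def resident_wavefronts(total_regs=256, clause_temp_regs=4, gprs_per_kernel=10):
    """Wavefronts that fit on one SIMD engine when limited only by registers.

    Mirrors the post's example: subtract the clause temporaries from the pool,
    then see how many whole kernels' worth of GPRs fit in what is left.
    """
    if gprs_per_kernel <= 0:
        raise ValueError("gprs_per_kernel must be positive")
    available = total_regs - clause_temp_regs
    return available // gprs_per_kernel  # fractional wavefronts do not count


# Assumption from the post: only 2 wavefronts (odd/even slot) issue at one instant.
ISSUE_SLOTS_PER_SIMD = 2

resident = resident_wavefronts()
print(resident)                              # 25
print(min(resident, ISSUE_SLOTS_PER_SIMD))   # 2
```

A kernel that needs more GPRs leaves fewer wavefronts available to swap in when one stalls, which is why register usage affects how well latency is hidden.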

                            • A couple of question
                              niravshah00

                               


                               

                               

                              Cool, this was very helpful. Just one more question: where can one get all this information?
                              I mean, none of the documents say this. Or is it something you understand from the general architecture of GPUs?