5 Replies Latest reply on Sep 14, 2009 11:56 PM by Gipsel

    Slow Reduction Kernel

    dinaharchery

      Please help,

      I apologize that this topic is so similar to the other one I posted, but this is a very specific question that I hope someone who has run into the same problem can help me with. It is driving me insane, and I am hoping it is just a simple issue that we newbies run into.

      I am implementing a matrix-vector multiplication (similar to the one included with the Brook+ samples), and everything seems to work well except for a large bottleneck at the reduction kernel. Is there a way to speed up the reduction kernel, or should I write my own? If so, how? Hints and ideas are welcome (both my code and I are slow).

      Relevant code is attached. Thank you to anyone with ideas or simple code examples.

      // Call kernel(s):
      gatherMult(aStrm, xStrm, indices, tmpMat);
      // THIS IS THE SLOW-DOWN ====> the "REDUCTION" kernel
      sumRows(tmpMat, yStrm);

      kernel void gatherMult(float a<>, float b[], float index<>, out float result<>)
      {
          result = a*b[index];
      }

      reduce void sumRows(float nzValues<>, reduce float result<>)
      {
          result += nzValues;
      }

        • Slow Reduction Kernel
          gaurav.garg

          What are the sizes of the streams that you are passing to the kernel?

            • Slow Reduction Kernel
              dinaharchery

              Thank you for the reply.

              I am assuming you mean the streams to the 'sumRows' kernel. The 'nzValues' stream has 75,996 elements and the 'result' has 6,333 elements.
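              With these sizes the reduction factor is 75,996 / 6,333 = 12, so each element of 'result' is the sum of 12 elements of 'nzValues'. As a way to check the GPU output, here is a minimal CPU reference sketch in plain C of what the two kernels compute. It assumes that row i owns the `per_row` consecutive entries starting at `i * per_row` (a padded layout); that layout and the names `gather_mult_ref` and `sum_rows_ref` are assumptions for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the gather-multiply pass: tmp[k] = a[k] * x[index[k]].
   The index array is an int here; the Brook+ kernel above uses a float stream. */
void gather_mult_ref(const float *a, const float *x, const int *index,
                     float *tmp, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        tmp[k] = a[k] * x[index[k]];
}

/* CPU reference for the reduction pass.
   ASSUMPTION: row i owns the per_row consecutive entries starting at
   i * per_row, so tmp holds rows * per_row values in total. */
void sum_rows_ref(const float *tmp, float *y, size_t rows, size_t per_row)
{
    assert(per_row > 0);
    for (size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < per_row; ++j)
            acc += tmp[i * per_row + j];
        y[i] = acc;
    }
}
```

              For the sizes quoted here, the call would be `sum_rows_ref(tmp, y, 6333, 12)`.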

                • Slow Reduction Kernel
                  riza.guntur

                  I think you should take the optimized_matmult approach. Split the vector into 8 streams, and then do the gather.

                  Unless you call the kernel rapidly (I mean, call it iteratively), you won't benefit from the reduction at such a size.

                  For a 100-to-1 reduction, it will definitely be faster than any other approach.
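                  One possible reading of this advice is to skip the separate reduction pass and accumulate each row's products inside the gather kernel itself, so the intermediate stream is never written. The plain C sketch below shows only that fused arithmetic on the CPU. It is not Brook+ code and not the optimized_matmult sample; the padded row layout and the name `spmv_fused_ref` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fused gather-multiply-accumulate: one pass, no intermediate buffer.
   ASSUMPTION: row i owns the per_row consecutive entries starting at
   i * per_row in both a and index. */
void spmv_fused_ref(const float *a, const float *x, const int *index,
                    float *y, size_t rows, size_t per_row)
{
    assert(per_row > 0);
    for (size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < per_row; ++j) {
            size_t k = i * per_row + j;
            acc += a[k] * x[index[k]];   /* gather, multiply, accumulate */
        }
        y[i] = acc;
    }
}
```

                  On a GPU, each output element would be computed by one thread running the inner loop, which trades the extra kernel launch and intermediate stream for a short serial loop per row.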

                    • Slow Reduction Kernel
                      dinaharchery

                      Thank you for the advice.

                      Is the Reduction kernel always slow - relative to the other basic kernels? Does anyone know the exact algorithm that is being used for the reduce kernel in Brook+ ?

                      Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+: what is the general state of the CUDA version of a reduction kernel?

                       

                      Once again, thank you.

                        • Slow Reduction Kernel
                          Gipsel

                           

                          Originally posted by: dinaharchery

                          Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+: what is the general state of the CUDA version of a reduction kernel?



                          You have to write it yourself in CUDA, afaik. There is no integrated reduction as in Brook if I'm not mistaken.
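                          Hand-written GPU reductions are commonly built on a pairwise (tree) pattern, where each step adds the upper half of the data onto the lower half until one value remains. The sketch below emulates that pattern serially in plain C to show the idea. It is not CUDA code, and the name `tree_reduce_sum` is hypothetical; in a real kernel the inner loop would be spread across threads with a synchronization between steps.

```c
#include <assert.h>
#include <stddef.h>

/* In-place pairwise (tree) sum. The buffer is overwritten.
   Each pass of the while loop halves the active length; the iterations
   of the inner loop are independent and could run in parallel. */
float tree_reduce_sum(float *buf, size_t n)
{
    assert(n > 0);
    while (n > 1) {
        size_t half = (n + 1) / 2;            /* ceil(n / 2) */
        for (size_t i = 0; i + half < n; ++i)
            buf[i] += buf[i + half];          /* fold upper half onto lower half */
        n = half;
    }
    return buf[0];
}
```

                          A reduction of n values finishes in about log2(n) such steps, compared with n - 1 sequential additions in a simple loop.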