Please help,
I apologize that this topic is so similar to my other post, but this is a very specific question that I hope someone who has run into the same problem can help me with. It is driving me insane, and I am hoping it is just a simple issue that we newbies run into.
I am implementing a matrix-vector multiplication operation (similar to the one included with the Brook+ samples), and everything seems to work well except for a large bottleneck at the reduction kernel. Is there a way to speed up the reduction kernel, or should I write my own? If so, how? (Hints and ideas are welcome; both my code and I are slow.)
Relevant code is attached. Thank you to anyone with ideas or simple code examples.
// Call Kernel(s):
gatherMult(aStrm, xStrm, indices, tmpMat);
// THIS IS THE SLOW-DOWN ====> The "REDUCTION" Kernel
sumRows(tmpMat, yStrm);

kernel void gatherMult(float a<>, float b[], float index<>, out float result<>)
{
    result = a*b[index];
}

reduce void sumRows(float nzValues<>, reduce float result<>)
{
    result += nzValues;
}
What are the sizes of the streams that you are passing to the kernel?
Thank you for the reply.
I am assuming you mean the streams to the 'sumRows' kernel. The 'nzValues' stream has 75,996 elements and the 'result' has 6,333 elements.
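For reference, 75,996 / 6,333 = 12 exactly, so each output element is the sum of 12 inputs. A minimal CPU sketch of what 'sumRows' is expected to produce can help when checking GPU results. It assumes that each output is the sum of a run of consecutive inputs and that the input length is an exact multiple of the output length; the function name sum_rows_ref is hypothetical and not part of Brook+.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the sumRows reduction (hypothetical helper, not Brook+ API).
 * Assumption: the n_in inputs are laid out row by row, so output r is the
 * sum of inputs [r * per_row, (r + 1) * per_row). */
void sum_rows_ref(const float *nz_values, size_t n_in,
                  float *result, size_t n_out)
{
    /* The input length must be an exact multiple of the output length. */
    assert(n_out > 0 && n_in % n_out == 0);

    size_t per_row = n_in / n_out;   /* 75996 / 6333 = 12 in this thread */

    for (size_t r = 0; r < n_out; ++r) {
        float acc = 0.0f;
        for (size_t k = 0; k < per_row; ++k)
            acc += nz_values[r * per_row + k];
        result[r] = acc;
    }
}
```

Calling sum_rows_ref(tmp, 75996, y, 6333) on host copies of the streams would give values to compare against yStrm, up to floating-point rounding differences caused by summation order.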
I think you should take the optimized_matmult approach: split the vector into 8 streams, and then do the gather.
Unless you call the kernel rapidly (I mean calling it iteratively), you won't benefit from reduction at this size.
For a 100-to-1 reduction, it will definitely be faster than any other approach.
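A related idea, sketched here on the CPU, is to fold the sum into the gather step so that no separate reduction pass is needed: each output row multiplies and accumulates its own entries in one loop. This is only an illustration of that fusion, not the optimized_matmult sample and not its 8-stream split. It assumes every row holds the same number of stored values (12 in this thread) and uses integer indices, whereas the original kernel takes a float index stream; the name gather_mult_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fused gather-multiply-accumulate (hypothetical helper, CPU sketch only).
 * Assumption: the matrix stores exactly per_row values for each row, with
 * a[] and indices[] laid out row by row. */
void gather_mult_sum(const float *a, const int *indices,
                     const float *x, size_t x_len,
                     float *y, size_t rows, size_t per_row)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t k = 0; k < per_row; ++k) {
            int j = indices[r * per_row + k];
            /* Guard against out-of-range column indices. */
            assert(j >= 0 && (size_t)j < x_len);
            acc += a[r * per_row + k] * x[j];
        }
        /* One write per row; the intermediate product array is never stored. */
        y[r] = acc;
    }
}
```

On a GPU the outer loop would map to one output element per thread, which trades the extra reduction pass for a short loop of per_row gathers inside the kernel.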
Thank you for the advice.
Is the reduction kernel always slow relative to the other basic kernels? Does anyone know the exact algorithm that is used for the reduce kernel in Brook+?
Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+: what is the general state of reduction kernels in CUDA?
Once again, thank you.
Originally posted by: dinaharchery
Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+: what is the general state of reduction kernels in CUDA?
You have to write it yourself in CUDA, afaik. There is no integrated reduction as in Brook if I'm not mistaken.
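For anyone who does write one by hand, a common pattern is a pairwise (tree) reduction: each pass adds the upper half of the active range onto the lower half, so n values are summed in roughly log2(n) passes. The sketch below emulates that pattern sequentially in plain C. It is not CUDA code and makes no claim about the algorithm Brook+ uses internally; the name tree_reduce_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise (tree) sum of data[0..n), done in place (hypothetical helper).
 * The iterations of the inner loop are independent of each other, so on a
 * GPU each one could be handled by a separate thread within a pass. */
float tree_reduce_sum(float *data, size_t n)
{
    assert(data != NULL || n == 0);
    if (n == 0)
        return 0.0f;

    size_t active = n;                   /* number of partial sums still live */
    while (active > 1) {
        size_t half = (active + 1) / 2;  /* round up so odd sizes keep the middle element */
        for (size_t i = 0; i + half < active; ++i)
            data[i] += data[i + half];
        active = half;
    }
    return data[0];
}
```

Note that the input array is overwritten with partial sums, so a copy is needed if the original values must be kept.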