dinaharchery
Journeyman III

Slow Reduction Kernel

Please help,

I apologize that this topic is so similar to the other one I posted, but this is a very specific question that I hope someone who has run into the same problem can help me with. It is driving me insane, and I am hoping it is just a simple issue that us newbies run into.

I am implementing a matrix-vector multiplication operation (similar to the one included with the Brook+ samples), and everything seems to work great except for a large bottleneck at the reduction kernel. Is there a way to speed up the reduction kernel, or should I write my own? And if so, how (hints/ideas - both my code and I are slow)?

Relevant code is attached. Thank you to anyone with any ideas or simple code example(s).

// Call Kernel(s):
gatherMult(aStrm, xStrm, indices, tmpMat);
// THIS IS THE SLOW-DOWN ====> The "REDUCTION" Kernel
sumRows(tmpMat, yStrm);

kernel void gatherMult(float a<>, float b[], float index<>, out float result<>)
{
    result = a*b[index];
}

reduce void sumRows(float nzValues<>, reduce float result<>)
{
    result += nzValues;
}

0 Likes
5 Replies
gaurav_garg
Adept I

What are the sizes of each stream that you are passing to the kernel?

0 Likes

Thank you for the reply.

I am assuming you mean the streams to the 'sumRows' kernel. The 'nzValues' stream has 75,996 elements and the 'result' has 6,333 elements.
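For reference, those sizes work out to 75,996 / 6,333 = 12 input values per output element. A minimal CPU sketch of what the two kernels compute can be useful for checking the GPU result. It assumes every row holds the same number of values, stored row by row, which the thread does not confirm. The names gatherMultRef and sumRowsRef are hypothetical, and the indices are plain ints here rather than the float stream used in the Brook+ code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for gatherMult: tmp[i] = a[i] * x[indices[i]].
// Assumption: indices are ints here; the Brook+ kernel takes a float stream.
std::vector<float> gatherMultRef(const std::vector<float>& a,
                                 const std::vector<int>& indices,
                                 const std::vector<float>& x) {
    assert(a.size() == indices.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        tmp[i] = a[i] * x[indices[i]];
    }
    return tmp;
}

// CPU stand-in for sumRows: each output sums one contiguous run of inputs.
// Assumption: tmp.size() is an exact multiple of rows (12 per row for the
// sizes quoted above).
std::vector<float> sumRowsRef(const std::vector<float>& tmp, std::size_t rows) {
    assert(rows > 0 && tmp.size() % rows == 0);
    const std::size_t perRow = tmp.size() / rows;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t k = 0; k < perRow; ++k) {
            y[r] += tmp[r * perRow + k];
        }
    }
    return y;
}
```

Calling sumRowsRef(gatherMultRef(a, indices, x), 6333) on the same input data gives a result to compare against yStrm, allowing for small floating-point differences from a different summation order.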

0 Likes

I think you should take the optimized_matmult approach: split the vector into 8 streams, and then do the gather.

If you don't call the kernel rapidly (I mean calling it iteratively), you won't benefit from reduction at this size.

For a 100-to-1 reduction, it will definitely be faster than any other approach.
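One way to avoid the separate reduction call altogether at this size is to let each output element accumulate its own products in a single pass. The sketch below shows that idea on the CPU. It is not the optimized_matmult sample and does not reproduce its 8-stream split; the function name fusedGatherSum and the fixed nnzPerRow layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused gather-multiply-sum: one output per row, no intermediate stream.
// Assumption: each row owns exactly nnzPerRow consecutive entries of a and
// indices (hypothetical layout, not confirmed by the thread).
std::vector<float> fusedGatherSum(const std::vector<float>& a,
                                  const std::vector<int>& indices,
                                  const std::vector<float>& x,
                                  std::size_t nnzPerRow) {
    assert(nnzPerRow > 0);
    assert(a.size() == indices.size() && a.size() % nnzPerRow == 0);
    const std::size_t rows = a.size() / nnzPerRow;
    std::vector<float> y(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;  // running sum for this row only
        for (std::size_t k = 0; k < nnzPerRow; ++k) {
            const std::size_t i = r * nnzPerRow + k;
            acc += a[i] * x[indices[i]];
        }
        y[r] = acc;
    }
    return y;
}
```

On a GPU the outer loop would map to one thread per output element and the inner loop would stay inside the kernel, so only one kernel call is needed and the 75,996-element temporary is never written.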

0 Likes

Thank you for the advice.

Is the Reduction kernel always slow - relative to the other basic kernels? Does anyone know the exact algorithm that is being used for the reduce kernel in Brook+ ?

Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+ - what is the general state of the CUDA version of a reduction kernel?

Once again, thank you.

0 Likes

Originally posted by: dinaharchery

Also, I know this is the wrong forum, but I just wanted to get an opinion on NVIDIA CUDA versus Brook+ - what is the general state of the CUDA version of a reduction kernel?



You have to write it yourself in CUDA, afaik. There is no integrated reduction as in Brook if I'm not mistaken.
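For anyone writing a reduction by hand, the usual starting point is a pairwise (tree) sum, where each step adds the upper half of the data onto the lower half. The sketch below emulates that pattern on the CPU. It is not CUDA code and makes no claim about the algorithm Brook+ uses internally; the name treeReduceSum is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum emulated on the CPU.
// Each pass of the while loop corresponds to one parallel step on a GPU,
// where every iteration of the inner loop would run in its own thread.
// Precondition: data holds at least one element.
float treeReduceSum(std::vector<float> data) {
    assert(!data.empty());
    std::size_t n = data.size();
    while (n > 1) {
        // Round up so that an odd count keeps its middle element in place.
        const std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
        }
        n = half;
    }
    return data[0];
}
```

The number of steps grows with the logarithm of the input length, so a 12-element row needs only 4 steps, which is one reason a separate reduction pass pays off mainly for much longer inputs.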

0 Likes