First of all, thanks a lot for your advice; reading this forum helps me a lot. I am now facing a problem and I am not sure what the best approach is:
I have a kernel that solves a linear system of equations. The system is underdetermined, so the kernel generates many solutions. The vast majority of them are bad and only a small subset is good.
So I have a second kernel that receives all the solutions from the first kernel, checks whether each one is good or bad, and adds a flag marking it accordingly. The final output is a buffer filled with all the solutions plus the good/bad flag for each one.
This scheme transfers a large amount of data from the GPU to the host, most of it bad solutions, so I am looking for a way to transfer only the good ones.
My first idea is to write a third kernel that copies only the good solutions from the output of the second kernel into an output buffer and stores the total number of good solutions in the first position of that buffer. On the host I would then issue two memory transfers: one to read the total number of good solutions, and a second one to read only that many solutions from the output buffer.
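To make the idea concrete, here is a minimal CPU-side sketch in C++ of what the third kernel would do. It is only an illustration, not my real code: the names `kSolutionSize`, `Compacted`, `compact_one` and `compact_good`, the fixed number of floats per solution, and the one-byte flag per solution are all assumptions. `compact_one` stands in for one work-item; on the GPU the `fetch_add` would be an atomic increment on a counter in global memory. For simplicity the counter is kept separate from the output buffer here instead of in its first position.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: every solution is a fixed number of floats.
constexpr std::size_t kSolutionSize = 4;

struct Compacted {
    std::uint32_t count;           // number of good solutions
    std::vector<float> solutions;  // count * kSolutionSize floats
};

// What one work-item of the "third kernel" would do for solution i:
// reserve a slot with an atomic increment, then copy the solution there.
inline void compact_one(std::size_t i,
                        const std::vector<float>& all,
                        const std::vector<std::uint8_t>& is_good,
                        std::atomic<std::uint32_t>& counter,
                        std::vector<float>& out) {
    if (!is_good[i]) return;  // bad solutions are simply skipped
    const std::uint32_t slot = counter.fetch_add(1, std::memory_order_relaxed);
    for (std::size_t k = 0; k < kSolutionSize; ++k) {
        out[slot * kSolutionSize + k] = all[i * kSolutionSize + k];
    }
}

// Host-side emulation: the loop plays the role of the kernel launch.
inline Compacted compact_good(const std::vector<float>& all,
                              const std::vector<std::uint8_t>& is_good) {
    std::atomic<std::uint32_t> counter{0};
    std::vector<float> out(all.size());  // worst case: every solution is good
    for (std::size_t i = 0; i < is_good.size(); ++i) {
        compact_one(i, all, is_good, counter, out);
    }
    // "First transfer": read back the counter.
    const std::uint32_t count = counter.load();
    // "Second transfer": keep only the part of the buffer that is valid.
    out.resize(static_cast<std::size_t>(count) * kSolutionSize);
    return Compacted{count, out};
}
```

Calling `compact_good(all, flags)` with four solutions of which the second and fourth are flagged good returns a count of 2 and a buffer holding just those two. One thing I am aware of: with an atomic counter on the GPU the good solutions would land in the output buffer in arbitrary order, which is fine for my case.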
I am not sure whether this is the best way to deal with this problem, so any insight would be much appreciated.