I have three nested loops.
Now the question is whether to put the nested loops inside the kernel, so that each thread runs them, or to keep the nested loops in the host code and launch the kernel multiple times.
Which would yield better performance?
FYI: it's three nested FOR loops, and I have the option of putting them either in the kernel or in the host code.
Please give me some insight on this.
I have to present it, and I am sure there will be questions about this.
I am very thankful to the members of this forum for being kind enough to reply to my questions.
I couldn't have done it without the help from the forum.
In my view, putting the loops in the kernel is likely to be better, since every kernel launch incurs some overhead in the CAL API, and with the loops on the host that overhead is paid once per iteration. But it depends on a lot of factors, for instance the memory access pattern.
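To make the difference concrete, here is a rough sketch of the two layouts. It is a CPU-only stand-in written in Python, not CAL code: the function names, the `body` computation, and the `per_launch_cost` parameter are all hypothetical and only illustrate how the number of launches differs. It says nothing about memory access patterns, which you would have to measure on the real hardware.

```python
# CPU-only stand-in: no GPU or CAL calls are made here.
# Every name below is hypothetical and exists only for illustration.

def body(tid, i, j, k):
    """Placeholder for the per-iteration work done by one thread."""
    return tid + i * j + k


def run_loops_on_host(nx, ny, nz, n_threads):
    """Option A: loops in host code, one simulated kernel launch per (i, j, k)."""
    out = [0] * n_threads
    launches = 0
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                launches += 1  # each iteration pays the launch overhead
                # This inner loop stands in for the threads of one launch.
                for tid in range(n_threads):
                    out[tid] += body(tid, i, j, k)
    return out, launches


def run_loops_in_kernel(nx, ny, nz, n_threads):
    """Option B: a single simulated launch, each thread runs all three loops."""
    out = [0] * n_threads
    launches = 1  # the launch overhead is paid once
    for tid in range(n_threads):  # stands in for the threads of that launch
        acc = 0
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    acc += body(tid, i, j, k)
        out[tid] = acc
    return out, launches


def launch_overhead(launches, per_launch_cost):
    """Total launch overhead, assuming a fixed (assumed) cost per launch."""
    return launches * per_launch_cost
```

Both functions compute the same result; only the launch count changes. For loop bounds 2, 3 and 4, option A performs 24 launches and option B performs 1, so whatever the real per-launch cost is on your setup, option A pays it 24 times. Whether that outweighs other effects is something to confirm by timing both versions.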