Please give me some insight on this.
I have to present it, and I am sure there will be questions about it.
I am very thankful to the members of this forum for being kind enough to reply to my questions.
I couldn't have done it without the help from the forum.
In my view, putting the loop inside the kernel may be better, since each kernel launch through the CAL API carries some overhead. But it depends on a lot of factors, for instance the memory access pattern.
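To make the trade-off concrete for a presentation, here is a rough back-of-the-envelope sketch in Python. It assumes a simple additive cost model (a fixed overhead per launch plus a fixed cost per iteration). The function name `total_time_us` and all the numbers are hypothetical and chosen only for illustration; they are not measurements of CAL or of any particular GPU.

```python
def total_time_us(iterations, launch_overhead_us, work_per_iteration_us, loop_in_kernel):
    """Estimate total time in microseconds under an assumed additive cost model."""
    # With the loop inside the kernel we pay the launch overhead once;
    # with the loop on the host we pay it once per iteration.
    launches = 1 if loop_in_kernel else iterations
    return launches * launch_overhead_us + iterations * work_per_iteration_us


# Hypothetical numbers: 1000 iterations, 10 us per launch, 2 us of work per iteration.
host_loop = total_time_us(1000, 10.0, 2.0, loop_in_kernel=False)
kernel_loop = total_time_us(1000, 10.0, 2.0, loop_in_kernel=True)

print("loop on host:  ", host_loop, "us")    # 12000.0 us
print("loop in kernel:", kernel_loop, "us")  # 2010.0 us
```

Note that this model deliberately ignores everything except launch overhead. In practice the per-iteration cost can change when the loop moves into the kernel, for example if the memory access pattern becomes less cache-friendly, so it is worth timing both versions on the real hardware before presenting a conclusion.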