I have a host-side for loop that calls the same kernel many times. On each call, an input buffer and an output buffer are passed as kernel arguments.
Using the AMD profiler, it turns out that there is a huge overhead caused by multiple kernel calls, buffer writes at the beginning, and buffer reads at the end of each for loop iteration.
I would like to replace the multiple kernel calls with a single kernel call that takes one array of buffers as its input argument and one array of buffers as its output argument. That "batch" kernel should run the for loop on the GPU (instead of on the host CPU) and call the existing single-buffer kernel code once per buffer. This batch call should replace the huge number of small data transfers between host and GPU with a single (larger) transfer of the whole buffer array in each direction.
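To make the idea concrete, here is a rough sketch of what I have in mind. It is not working OpenCL host code, only an illustration under several assumptions: every buffer holds the same number of `float` elements, the iterations are independent of each other, and the per-buffer work is a placeholder (multiply by 2). All names (`batch`, `process_one`, `pack_buffers`, `batch_reference`) are made up for this example. The kernel source is kept as a string, and a plain C function mirrors its indexing so the layout can be checked without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* OpenCL C source for a hypothetical batch kernel (illustrative names).
   One flat buffer holds num_bufs sub-buffers of buf_len floats each;
   sub-buffer b starts at element offset b * buf_len. */
const char *const BATCH_KERNEL_SRC =
    "void process_one(__global const float *in, __global float *out, int i)\n"
    "{\n"
    "    out[i] = 2.0f * in[i]; /* placeholder for the real per-buffer work */\n"
    "}\n"
    "__kernel void batch(__global const float *in, __global float *out,\n"
    "                    int buf_len, int num_bufs)\n"
    "{\n"
    "    int i = get_global_id(0); /* element index inside one sub-buffer */\n"
    "    if (i >= buf_len) return;\n"
    "    for (int b = 0; b < num_bufs; ++b) { /* loop moved from the host */\n"
    "        int base = b * buf_len;\n"
    "        process_one(in + base, out + base, i);\n"
    "    }\n"
    "}\n";

/* Host side: copy num_bufs separate arrays into one contiguous array,
   so that a single write transfer can replace num_bufs small ones. */
void pack_buffers(const float *const *bufs, size_t num_bufs,
                  size_t buf_len, float *flat)
{
    assert(bufs != NULL && flat != NULL);
    for (size_t b = 0; b < num_bufs; ++b)
        memcpy(flat + b * buf_len, bufs[b], buf_len * sizeof(float));
}

/* CPU reference that mirrors the kernel's indexing (no OpenCL needed). */
void batch_reference(const float *in, float *out,
                     size_t buf_len, size_t num_bufs)
{
    for (size_t i = 0; i < buf_len; ++i) {       /* stands in for get_global_id(0) */
        for (size_t b = 0; b < num_bufs; ++b) {  /* same loop as in the kernel */
            size_t base = b * buf_len;
            out[base + i] = 2.0f * in[base + i];
        }
    }
}
```

On the host I would then presumably create one `cl_mem` of `num_bufs * buf_len * sizeof(float)` bytes with `clCreateBuffer`, fill it with one `clEnqueueWriteBuffer` call from the packed array, and pass it together with `buf_len` and `num_bufs` through `clSetKernelArg`. I am not sure this flat layout is the right way to do it, hence the question.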
How do I declare an array of memory buffers and pass it as an argument to the kernel? How do I read that buffer array inside the kernel? Is there a code sample?
(I do not believe that calling clCreateSubBuffer is a good option)