I have an OpenCL program which, abstractly, looks like this:
int batch_size = X;  // X is much smaller than num
kernel1.setArg(dynamic local memory size)
kernel2.setArg(dynamic local memory size)
for (int done = 0; done < num; done += batch_size)
    do kernel1, offset = done, global_size = batch_size
    do kernel2, offset = done, global_size = batch_size
The kernels process a large dataset and have to work in small batches, otherwise the global_tmp buffer becomes too large.
When this loop runs, my allocated memory suddenly skyrockets, followed by massive disk access (which I assume is paging).
I'm guessing this happens because the OpenCL driver stores the configuration for each kernel invocation in order to be asynchronous, but I have 16 GB of RAM and it gets filled.
The only explanation I can think of is that global_tmp and/or the local memory is also stored in RAM.
Is this true? If so, I don't see why one would ever want that kind of behaviour.
If the local memory is saved, why exactly? Once a kernel is executed, it is run to completion, no interruption, right?
If the global memory is saved, the driver would be assuming prior knowledge of how the application uses that memory. Here it is shared between kernel1 and kernel2, so kernel2 must see the output of kernel1, not the state global_tmp was in when kernel2 was enqueued.
I know that kernel2 does see the correct output of kernel1, because the algorithm still works properly, so is the global_tmp only saved for each kernel1?
Exactly what happens here?
Note: there is no problem when queue.finish() is used; I am just wondering about the finer details of how the driver works.
I am running this on a 7970 with the latest driver.
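To make the queue.finish() observation concrete, here is a toy model of the loop above. It is plain C++ and does not touch OpenCL or say anything about what the driver really stores: MockQueue, peak_outstanding and finish_every are hypothetical names, and the model assumes the worst case, in which no command completes until finish() is called. It only counts how many enqueued commands could be outstanding at once.

```cpp
#include <algorithm>
#include <cstddef>

// Toy stand-in for a command queue; NOT the OpenCL API. It only counts
// commands that were enqueued but not yet waited on.
struct MockQueue {
    std::size_t pending = 0;  // commands enqueued since the last finish()
    std::size_t peak = 0;     // largest value `pending` ever reached

    void enqueue() {
        ++pending;
        peak = std::max(peak, pending);
    }

    // Models a blocking drain: afterwards nothing is outstanding.
    void finish() { pending = 0; }
};

// Runs the batch loop against the mock queue and returns the peak number
// of outstanding commands. batch_size must be non-zero.
// finish_every == 0 means finish() is never called inside the loop.
std::size_t peak_outstanding(std::size_t num, std::size_t batch_size,
                             std::size_t finish_every) {
    MockQueue queue;
    std::size_t iteration = 0;
    for (std::size_t done = 0; done < num; done += batch_size) {
        queue.enqueue();  // kernel1, offset = done
        queue.enqueue();  // kernel2, offset = done
        ++iteration;
        if (finish_every != 0 && iteration % finish_every == 0) {
            queue.finish();
        }
    }
    return queue.peak;
}
```

In this model, peak_outstanding(num, batch_size, 0) grows linearly with the number of batches, while any non-zero finish_every caps it at 2 * finish_every. If each outstanding command carries some per-invocation host-side state, that would be consistent with the memory growth I see without finish(), but that part is my guess.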