Hi! I posted this question on Stack Overflow and thought I would post it here too, since I am using AMD's SDK for OpenCL development and the answer could be implementation-defined.
You can read the full question above, but I will summarize here. Given a pipeline of kernel operations like:
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to be copied back to the host as well. I want to make everything as asynchronous as possible by specifying only the minimal event dependencies (so each read depends only on the kernel that produced its buffer, and kernels never wait on reads).
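The dependency structure described above can be sketched as a small plain-C model. It makes no OpenCL calls, so it runs without a device; the names `Command`, `build_plan`, and `print_plan` are hypothetical and only illustrate which single event each command would wait on. In real host code, the `waits_on` entry would correspond to the `event_wait_list` passed to `clEnqueueNDRangeKernel` or to a non-blocking `clEnqueueReadBuffer`. Whether the reads actually overlap with later kernels depends on the queue setup (for example an out-of-order queue or a separate transfer queue) and on the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_STAGES 3   /* kernel1..kernel3, as in the pipeline above */
#define NO_DEP (-1)    /* command has an empty wait list */

typedef enum { CMD_KERNEL, CMD_READ } CmdKind;

/* Hypothetical record of one enqueued command and its single dependency. */
typedef struct {
    CmdKind kind;   /* kernel launch or device-to-host read */
    int stage;      /* 1-based stage index (kernel N produces dataN) */
    int waits_on;   /* index in the plan of the command waited on, or NO_DEP */
} Command;

/* Fills plan[] with 2 * num_stages commands and returns that count.
 * plan must have room for 2 * num_stages entries. */
int build_plan(Command *plan, int num_stages)
{
    assert(plan != NULL && num_stages > 0);

    int count = 0;
    int prev_kernel = NO_DEP;

    for (int s = 1; s <= num_stages; ++s) {
        /* Kernel s waits only on kernel s-1 (its input producer),
         * never on any read. */
        plan[count] = (Command){ CMD_KERNEL, s, prev_kernel };
        int this_kernel = count++;

        /* The read of data s waits only on kernel s, so it may run
         * while kernel s+1 is executing. */
        plan[count++] = (Command){ CMD_READ, s, this_kernel };

        prev_kernel = this_kernel;
    }
    return count;
}

/* Prints one line per command with the command it waits on. */
void print_plan(const Command *plan, int count)
{
    for (int i = 0; i < count; ++i) {
        printf("%d: %s %d", i,
               plan[i].kind == CMD_KERNEL ? "kernel" : "read data",
               plan[i].stage);
        if (plan[i].waits_on == NO_DEP)
            printf(" (no wait list)\n");
        else
            printf(" waits on command %d\n", plan[i].waits_on);
    }
}
```

Calling `build_plan(plan, NUM_STAGES)` on an array of `2 * NUM_STAGES` entries and then `print_plan(plan, count)` lists the graph: every command has at most one dependency, and no kernel depends on a read.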
I have a few questions about managing the memory objects:
- Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
- Importantly, how does OpenCL handle the case where the combined size of all memory objects exceeds the total memory available on the device? At any point a kernel needs only its input and output buffers (which will fit in memory), but if 4 or 5 of these buffers together exceed the total, how does OpenCL allocate and deallocate the memory objects behind the scenes? How does this affect the reads/DMAs?
So the general question is, how do large task trees interact with large memory objects?
I would be grateful if someone could clarify what happens in these situations, or point me to anything relevant in the OpenCL spec.