When I run my application, I first enqueue around 50 sets of kernels, each set containing around 10 kernels.
The enqueued kernels wait on a user event before they begin. I am finding that simply enqueuing the kernels into the OpenCL
command queues consumes around 1.5 GB of host memory, and the memory is not released even after the kernels have
executed.
How can I troubleshoot this issue, and why does the queue use so much memory? Each set of kernels waits for a host-to-device
transfer of a 9 MB buffer before it executes, but I maintain a pool of these buffers, so only a handful are allocated.
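
For a sense of scale, the figures above can be turned into a rough per-kernel estimate. This is only a back-of-the-envelope sketch: it assumes the 1.5 GB is spread evenly over every enqueued kernel and that GB and MB are binary units, neither of which has been measured, and the function names are made up for illustration.

```python
# Figures quoted in the question; all are approximate ("around").
SETS = 50                            # sets of kernels enqueued up front
KERNELS_PER_SET = 10                 # kernels in each set
HOST_MEMORY_BYTES = 1.5 * 1024**3    # ~1.5 GB of host memory (assumed GiB)
BUFFER_BYTES = 9 * 1024**2           # one 9 MB transfer buffer (assumed MiB)


def total_kernels(sets, kernels_per_set):
    """Number of kernel enqueues pending behind the user event."""
    return sets * kernels_per_set


def bytes_per_kernel(total_bytes, kernel_count):
    """Host memory per enqueue, assuming an even split (an assumption)."""
    return total_bytes / kernel_count


def buffers_equivalent(total_bytes, buffer_bytes):
    """How many transfer buffers would add up to the observed memory."""
    return total_bytes / buffer_bytes


kernels = total_kernels(SETS, KERNELS_PER_SET)
per_kernel_mib = bytes_per_kernel(HOST_MEMORY_BYTES, kernels) / 1024**2
buffer_count = buffers_equivalent(HOST_MEMORY_BYTES, BUFFER_BYTES)

print(f"pending kernel enqueues: {kernels}")
print(f"host memory per enqueue: {per_kernel_mib:.3f} MiB")
print(f"equivalent 9 MB buffers: {buffer_count:.1f}")
```

Under those assumptions the usage works out to roughly 3 MiB per enqueued kernel, or the equivalent of about 170 of the 9 MB buffers. That is more than one buffer copy per set (50 copies would be about 450 MB), so the pooled buffers alone do not seem to account for it.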