I am trying to process a huge amount of data: essentially one buffer several GB in size. Each of my kernel executions needs only a small fraction of it (a few hundred kB), and each execution needs a different part. I'd like to enqueue everything with a single command and a global work size of a few tens of thousands, as this seems to be much faster than enqueueing smaller pieces with a smaller global work size.
Within my kernel execution I then access the right part of the data by using the global id to calculate the index range.
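To make the indexing concrete, here is a rough sketch in plain C of the kind of calculation I mean. It assumes every execution owns one contiguous, equal-sized slice of the buffer, which is a simplification for illustration; the names `slice_range` and `slice_for_work_item` are made up for this example. In the actual kernel the `global_id` argument would come from `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: half-open byte range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} slice_range;

/* Assumed layout: work-item i reads the slice_bytes bytes that start
 * at offset i * slice_bytes. The range is clamped to the buffer size,
 * so the last slice may be shorter and out-of-range ids get an empty
 * range. */
static slice_range slice_for_work_item(size_t global_id,
                                       size_t slice_bytes,
                                       size_t total_bytes)
{
    slice_range r;

    assert(slice_bytes > 0);

    r.begin = global_id * slice_bytes;
    if (r.begin > total_bytes) {
        r.begin = total_bytes;
    }

    r.end = r.begin + slice_bytes;
    if (r.end > total_bytes) {
        r.end = total_bytes;
    }

    return r;
}
```

For example, with 256 kB slices, the execution with global id 3 would read bytes 786432 up to (but not including) 1048576.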
Now my question: is there a way to transfer the data gradually, just in time, so that I never exceed my CL_DEVICE_MAX_MEM_ALLOC_SIZE?
Thanks for your help in advance!