I am trying to process a huge amount of data: essentially a buffer a few GB in size. Each of my kernel executions only needs a small fraction of it (a few hundred kB), and each execution needs a different part. I'd like to enqueue the work with one command and a global work size of a couple of tens of thousands, as this seems to be much faster than enqueueing smaller fractions with a smaller global work size.
Within the kernel I then access the right part of the data by using the global ID to calculate the index range.
Now my question: is there a way to gradually transfer the data just in time so I never exceed my CL_DEVICE_MAX_MEM_ALLOC_SIZE?
Thanks for your help in advance!
You must split it into smaller buffers. You can allocate buffers up to the total amount of host RAM, and according to my experiments buffers are only transferred to the device once they are needed, but they still clog the device. For example, you can create ten 128 MB buffers and so far all is good. But when you enqueue a kernel that reads/writes data from these buffers, you get a CL_OUT_OF_RESOURCES error after the fourth buffer.