What is the global memory caching policy on the ATI Stream platform? clCreateBuffer() does not take a device argument. This would imply either that calling clCreateBuffer() causes *all* OpenCL devices to cache the corresponding memory, or that any caching of memory only happens on specific devices when the kernel using that memory is enqueued to run on the devices. Which of these represents the actual scenario?
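For reference, here is the call in question; it takes only a context, with no device argument (the `context` and `nbytes` variables are assumed from the usual setup code):

```c
/* clCreateBuffer is tied to a context, not to a specific device. */
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, &err);
```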
The second scenario is correct. Memory buffers are only moved to a device when the executing kernel actually needs them.
Thanks Himanshu. That leads to my next question: is the buffering eager or lazy, i.e., is all memory buffered at the start, or is buffering done piecewise on an as-needed basis? In either case, what happens if the device memory is already full?
E.g. when you call clEnqueueWriteBuffer(), the whole transfer is enqueued to the command queue, not just a part of the data. If device memory is already full, you get a CL_OUT_OF_RESOURCES error. See the OpenCL spec; it also lists which errors each API function can return.
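As a sketch of checking for that error (names like `queue`, `buf`, and `host_data` are illustrative and assumed to have been created earlier):

```c
/* Blocking write of nbytes from host_data into buf, with the
 * error check the spec calls for. */
cl_int err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                                  nbytes, host_data, 0, NULL, NULL);
if (err == CL_OUT_OF_RESOURCES) {
    /* Device could not allocate resources for the transfer. */
    fprintf(stderr, "out of device resources\n");
} else if (err != CL_SUCCESS) {
    fprintf(stderr, "clEnqueueWriteBuffer failed: %d\n", err);
}
```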
Buffers are created eagerly, as you say. If any element of a buffer is needed, the whole buffer is transferred to the GPU's global memory in one go.
Thanks to both of you. Karbous, you are right about the CL_OUT_OF_RESOURCES error in case of clEnqueueWriteBuffer() as specified in the spec. But my question was more platform specific, imagining maybe there is some kind of a buffer eviction policy that exists on the ATI Stream platform that on seeing insufficient device memory would take the least recently used buffer out of device memory and into system memory (a bit like how CPU caches work). Is there some such mechanism? Or do buffers on device memory keep occupying device memory till they are released?
There isn't such a mechanism; it is up to the programmer to check how much memory the card has and choose how to divide the work. You can use clGetDeviceInfo() to retrieve information about an OpenCL device (how much memory it has, what it supports, and so on).
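For example, the relevant queries might look like this (assuming `device` is a valid `cl_device_id` obtained from clGetDeviceIDs()):

```c
/* Query total global memory and the largest single allocation
 * the device allows, so buffers can be sized to fit. */
cl_ulong global_mem = 0, max_alloc = 0;
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem), &global_mem, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
printf("global memory: %llu bytes, max allocation: %llu bytes\n",
       (unsigned long long)global_mem, (unsigned long long)max_alloc);
```

Note that CL_DEVICE_MAX_MEM_ALLOC_SIZE is typically smaller than the total global memory, so a single buffer may be limited even when the card has free memory.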
But what if you run a kernel that takes, for example, a 400 MB buffer, and then run another (or the same) kernel that takes a different 400 MB buffer? Will the implementation swap the first buffer out of GPU memory?
I can understand the restriction that the buffers needed to execute a kernel must fit into device memory. But will it swap?
Apparently AMD has removed the restriction that all allocated buffers must fit in device memory.
The AMD implementation has persistent global memory. Buffers remain in global memory until they are properly released.