OpenCL

ep-98d · 06-22-2022

From what I understand, clEnqueueMigrateMemObjects() explicitly copies memory buffer objects from one device to another (GPU to GPU, CPU to GPU, etc.). What I do not understand is why the function's input parameters are not as explicit. It only takes a command queue that is associated with a specific device and the buffer objects that we would like to copy to that command queue, NOT a destination buffer object that is associated with a different device. For example, clEnqueueReadBuffer() and clEnqueueWriteBuffer() explicitly need the sender and receiver of the memory objects, while clEnqueueMigrateMemObjects() does not. Once clEnqueueMigrateMemObjects() is called and the migrated memory buffer object is now 'owned' by a different command queue, how do I 'use' this memory buffer object? Should I set the kernel arguments again?

So, my question is:

What is the proper way of transferring memory objects from one device to another (NOT between the device and the host), given that we are using two GPUs, excluding the CPU as an OpenCL device, and all the devices are under the same OpenCL context? I would also like to know how to transfer memory buffer objects between GPUs that are assigned to different contexts.

dipak · 06-29-2022

>>Should I set the kernel arguments again?

No need to set the kernel arguments again.

>>What is the proper way of transferring memory objects from one device to another (NOT between the device and the host), given that we are using two GPUs, excluding the CPU as an OpenCL device, and all the devices are under the same OpenCL context?

The performance depends on how the data is transferred between the two GPU devices, i.e. whether the transfer happens GPU-to-GPU directly (say, DMA over PCIe) or is staged through system memory (GPU1 -> system memory -> GPU2).

You can check whether any vendor-specific extensions or optimizations are available that support direct GPU-to-GPU transfer. For example, AMD DirectGMA technology allows direct peer-to-peer transfer between two GPUs. For more information about DirectGMA, please refer to the links below:

http://developer.amd.com/wordpress/media/2014/09/DirectGMA_Web.pdf

https://github.com/GPUOpen-LibrariesAndSDKs/DirectGMA_P2P

>>I would also like to know how to transfer memory buffer objects between GPUs that are assigned to different contexts.

Again, you need to check whether any vendor-specific extension is available that supports this transfer. Otherwise, you have to transfer the data explicitly through the host, like this: buffer1 (context1) -> host -> buffer2 (context2)

By the way, here is an example that demonstrates how DirectGMA can be used to transfer data between two GPUs with different contexts: https://github.com/GPUOpen-LibrariesAndSDKs/DirectGMA_P2P/tree/master/GPUtoGPU_OpenCL/GPUtoGPU_OpenC...

Thanks.

ep-98d · 07-02-2022

Thank you for the reply!

Could you elaborate on your answer, "No need to set the kernel arguments again"? Once clEnqueueMigrateMemObjects() is called, how do I use the buffer (which is now owned by a different command queue for a different device) with the other command queue? Since you said I do not need to set the kernel arguments again, what should happen next, once clEnqueueMigrateMemObjects() is called, in order for me to use this transferred memory buffer object with the other command queue?

As I mentioned in my original post, clEnqueueMigrateMemObjects() does not take sender/receiver memory buffer objects as inputs the way clEnqueueWriteBuffer() or clEnqueueReadBuffer() do. This is why I am confused.

Also, I recently found that I can implicitly transfer memory buffer objects between GPUs that are under the same context just by passing a memory buffer object as a kernel argument (e.g. passing a memory buffer object that was used with the command queue for GPU1 into a kernel that is enqueued on the command queue for GPU2). For this type of implicit operation, does the data transfer between the GPUs involve the data going through the host, as you explained?

Lastly, how is the DirectGMA that you linked different from clEnqueueCopyBufferP2PAMD() (link: https://github.com/jlgreathouse/test_cl_amd_copy_buffer_p2p)? Are they basically the same thing? If so, which one should I use?

dipak · 07-06-2022

As described in the OpenCL specification, memory objects are typically migrated implicitly to the device for which enqueued commands are targeted. clEnqueueMigrateMemObjects allows this migration to be performed explicitly, ahead of the dependent commands. Once the event returned from clEnqueueMigrateMemObjects is marked as CL_COMPLETE, the memory objects have been successfully migrated to the device associated with the command queue. The application can then enqueue a kernel or other command that uses the memory objects.

Please note that managing the event dependencies is important to avoid overlapping access to memory objects. The spec says:

"The user is responsible for managing the event dependencies, associated with this command, in order to avoid overlapping access to memory objects. Improperly specified event dependencies passed to clEnqueueMigrateMemObjects could result in undefined results."

Proper usage of clEnqueueMigrateMemObjects()