Hi everybody, I have the following situation:
I have an AMD Radeon Pro WX 7100 running on Windows 10.
I can successfully use DirectGMA by allocating a buffer on the GPU, making the buffer resident with clEnqueueMakeBuffersResidentAMD, handing the bus_address to a 3rd party capture device, and having that device DMA directly to GPU memory.
Next, I am trying to make the GPU write directly to an FPGA. The FPGA maps a memory region to a PCIe BAR, and I can obtain the physical address backing that BAR from the FPGA driver.
// allocation stage
cl_bus_address_amd addr = {};                   // descriptor passed as host_ptr below
addr.surface_bus_address = remote_bus_address;  // physical address of the FPGA BAR
addr.marker_bus_address = remote_bus_address;   // marker set to the same address as the surface
cl_int create_buff_err = CL_SUCCESS;
cl_mem remote_buffer = clCreateBuffer(context, CL_MEM_EXTERNAL_PHYSICAL_AMD | CL_MEM_WRITE_ONLY, byteSize, &addr, &create_buff_err);
assert(create_buff_err == CL_SUCCESS);
What I see next puzzles me: clCreateBuffer always succeeds. In fact, it succeeds as long as 'remote_bus_address' is page-aligned (it can even be a random number), which is expected because no actual allocation is being done. Yet when I try to copy content to the returned cl_mem, I always get CL_MEM_OBJECT_ALLOCATION_FAILURE.
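For reference, the page-alignment condition mentioned above can be sketched as follows. This is only an illustration: is_page_aligned is a hypothetical helper (not part of OpenCL or the AMD extension), and the 4096-byte default page size is an assumption. Passing this check says nothing about whether the address really backs a device BAR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when `bus_address` is a multiple of
// `page_size`. `page_size` must be a non-zero power of two; the 4096-byte
// default is an assumption, not something the driver reports here.
inline bool is_page_aligned(std::uint64_t bus_address,
                            std::uint64_t page_size = 4096) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (bus_address & (page_size - 1)) == 0;
}
```

For example, is_page_aligned(0x10000) is true, while is_page_aligned(0x10010) is false.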
I would expect the OpenCL driver to copy the data "no questions asked" (maybe causing a blue screen on the way), yet I get an allocation failure.
Can anyone explain this? How can I tell why these functions failed?
cl_int err = clEnqueueWriteBuffer(queue.get(), remote_buffer, CL_TRUE, 0, vec.size() * sizeof(uint32_t),vec.data(), 0, nullptr, nullptr); // returns CL_MEM_OBJECT_ALLOCATION_FAILURE
cl_int err = clEnqueueCopyBuffer(queue.get(), deviceVec.get_buffer().get(), remote_buffer, 0,0, vec.size() * sizeof(uint32_t), 0, nullptr, nullptr); // returns CL_MEM_OBJECT_ALLOCATION_FAILURE
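As a small aid when logging the return values above, a helper along these lines turns the numeric code into its symbolic name. It is only a sketch: cl_error_name is a hypothetical function, and the numeric values are copied from the Khronos CL/cl.h header so that the snippet compiles without the OpenCL headers. In a real program the CL_* macros should be used instead. It does not reveal why the driver failed, only which code came back.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: maps a few OpenCL status codes to their names.
// Values mirror the definitions in CL/cl.h; the list is deliberately partial.
inline std::string cl_error_name(std::int32_t err) {
    switch (err) {
        case 0:   return "CL_SUCCESS";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -30: return "CL_INVALID_VALUE";
        case -36: return "CL_INVALID_COMMAND_QUEUE";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -61: return "CL_INVALID_BUFFER_SIZE";
        default:  return "unrecognized OpenCL status code";
    }
}
```

Typical use would be printing cl_error_name(err) next to the raw value right after each clEnqueue* call.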
Just wanted to share a couple of suggestions, in case they help.
cl_mem remote_buffer = clCreateBuffer(context, CL_MEM_EXTERNAL_PHYSICAL_AMD, ..);
clEnqueueMigrateMemObjects(queue, 1, &remote_buffer, 0, 0, NULL, NULL);
clEnqueueCopyBuffer(queue, local_buffer, remote_buffer, ..);
Thanks for the reply,
I tried your first suggestion, and it did not work. clEnqueueMigrateMemObjects also returned CL_MEM_OBJECT_ALLOCATION_FAILURE.
As for your second suggestion, could you elaborate on how exactly I can make sure those addresses are set correctly?
My understanding is that these addresses are the ones I can see in the device's resource properties in Device Manager. They are also the 'physical addresses' returned by my driver. When my driver loads, it maps these addresses into my virtual address space, and I can memcpy / clEnqueueReadBuffer to that address.
Besides that, are there any additional steps my driver should perform in order to support this?
Regarding the cl_bus_address_amd structure, I meant that you should use the physical bus address of the buffer, not a mapped user-space pointer or address, as described in the thread "DirectGMA between a FPGA and GPU". From your last reply, it seems you have already used the physical bus address to fill in the structure.
Adding dmitryk, who is an expert in this domain, in case he can offer any suggestions.
As long as the remote buffer is created using the physical address of the buffer on your FPGA you should be fine.
Don't forget we also need the marker address.
The allocation can fail if the size of your buffer is bigger than the aperture you have.
Also, we do lazy allocation, so during clCreateBuffer we don't really allocate anything. I'm not surprised that it fails during the migrate or copy call.
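To illustrate the two points above (lazy allocation and the aperture limit), here is a toy model in plain C++. It is not the AMD driver's implementation. LazyBuffer, create_lazy, first_use, and the aperture_bytes parameter are all hypothetical names, and the only check modeled is the size-versus-aperture one mentioned in this reply.

```cpp
#include <cstddef>
#include <cstdint>

// Toy model only: NOT the AMD driver's code.
enum class Status { Success, AllocationFailure };

struct LazyBuffer {
    std::uint64_t bus_address;  // recorded, but not validated at creation
    std::size_t   size_bytes;
    bool          backed;       // set once the first use succeeds
};

// "Creation" only records the request, so it cannot fail in this model.
inline LazyBuffer create_lazy(std::uint64_t bus_address, std::size_t size_bytes) {
    LazyBuffer buf;
    buf.bus_address = bus_address;
    buf.size_bytes = size_bytes;
    buf.backed = false;
    return buf;
}

// The first use performs the deferred work. The only condition modeled here
// is that the buffer must not be larger than the aperture.
inline Status first_use(LazyBuffer& buf, std::size_t aperture_bytes) {
    if (buf.size_bytes > aperture_bytes) {
        return Status::AllocationFailure;
    }
    buf.backed = true;
    return Status::Success;
}
```

In this model, creating a 64 KiB buffer always succeeds, and the failure only appears when first_use is called with a 32 KiB aperture, which mirrors the create-succeeds, copy-fails pattern reported in this thread.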
Thanks for the reply,
It seems to me that I follow the requirements, yet it just does not work. Just to be sure I'm not missing anything, I have a few further questions.
1. How can the allocation possibly fail?
My understanding is that CL_MEM_EXTERNAL_PHYSICAL_AMD implies that the buffer has already been allocated externally by the FPGA device. What additional steps does the AMD driver perform?
2. How can the AMD driver validate the buffer address supplied by the application? Does my FPGA driver have to communicate with the AMD driver somehow? In my case, my driver does nothing other than map the FPGA DDR memory to a BAR and expose the physical address.
3. Does the "aperture" have a certain minimum size? This is the last thing I can think of that might be responsible for the failure.
The hardware I currently use exposes a relatively small memory range (8 pages, to be exact) and will only support larger buffers in the future.
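To put a number on that range, here is a small sketch, assuming 4 KiB pages (an assumption; the page size is not stated above). The names kAssumedPageSize, kExposedPages, exposed_bytes, and transfer_fits are hypothetical. Under that assumption, 8 pages is 32768 bytes, i.e. room for 8192 uint32_t elements.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAssumedPageSize = 4096;  // assumption: 4 KiB pages
constexpr std::size_t kExposedPages = 8;        // from the post above

// Total size of the exposed region in bytes.
constexpr std::size_t exposed_bytes() {
    return kExposedPages * kAssumedPageSize;
}

// True when `count` uint32_t elements written at byte `offset` stay inside
// the exposed region.
constexpr bool transfer_fits(std::size_t count, std::size_t offset = 0) {
    return offset <= exposed_bytes() &&
           count <= (exposed_bytes() - offset) / sizeof(std::uint32_t);
}
```

A check like transfer_fits(vec.size()) before the write or copy call would at least rule out a transfer that is larger than the region itself.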