You CAN do asynchronous data transfers between the GPU (or any compute device) and system RAM, as well as from one compute device to another. The boolean parameter (CL_FALSE or CL_TRUE) indicates whether the read/write call is blocking. CL_FALSE enqueues the read/write and returns immediately, BEFORE the transfer has completed. CL_TRUE causes the function to block and not return until the copy/transfer is complete.
When using asynchronous calls, synchronization is accomplished through the use of events. You call the enqueue method, and later, when you need the data being copied, you call wait() on the event (clWaitForEvents() in the C API):
This function will block and NOT return until the copy/transfer is complete. If the transfer has already completed, the function will return immediately.
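Putting the two pieces together, here is a minimal sketch of the non-blocking read pattern with the C++ bindings. This is not runnable standalone (it needs an OpenCL device): `queue` and `image` are assumed to come from an already-initialized context, and the function name is mine.

```cpp
#include <CL/cl.hpp>   // Khronos C++ bindings for OpenCL 1.1

// Non-blocking read of a 2D image into host memory, synchronized via an event.
// 'queue' and 'image' are assumed to come from an already-initialized context;
// destPTR must point at enough host memory for width * height pixels.
void asyncReadExample(cl::CommandQueue& queue, cl::Image2D& image,
                      size_t width, size_t height, void* destPTR) {
    cl::size_t<3> origin;            // zero-initialized: (0, 0, 0)
    cl::size_t<3> region;
    region[0] = width; region[1] = height; region[2] = 1;

    cl::Event evnt;
    // CL_FALSE: enqueue and return immediately, BEFORE the copy completes.
    queue.enqueueReadImage(image, CL_FALSE, origin, region, 0, 0, destPTR,
                           NULL, &evnt);

    // ... overlap host-side work with the transfer here ...

    // Block until the transfer is finished (returns at once if it already is).
    evnt.wait();   // C API equivalent: clWaitForEvents(1, &event_handle)
    // destPTR now holds valid data.
}
```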
--- Spec Doc ---
C++ Bindings: http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf
--- C++ ---
Read (from compute device to system RAM):
queue.enqueueReadImage(image, CL_FALSE, origin, region, 0, 0, destPTR, NULL, &evnt);
Write (from system RAM to compute device):
queue.enqueueWriteImage(image, CL_FALSE, origin, region, 0, 0, srcDataPTR, NULL, &evnt);
Copy (from one compute device to another):
queue.enqueueCopyImage(srcImage, destImage, srcOrigin, destOrigin, region, NULL, &evnt);
--- C ---
Read (from compute device to system RAM):
clEnqueueReadImage(queue, image, CL_FALSE, origin, region, 0, 0, destPTR, 0, NULL, &evnt);
Write (from system RAM to compute device):
clEnqueueWriteImage(queue, image, CL_FALSE, origin, region, 0, 0, srcDataPTR, 0, NULL, &evnt);
Copy (from one compute device to another):
clEnqueueCopyImage(queue, srcImage, destImage, srcOrigin, destOrigin, region, 0, NULL, &evnt);
Originally posted by: sir.um You CAN do asynchronous data transfers between the GPU (or any compute device) and system RAM, as well as from one compute device to another. The boolean parameter (CL_FALSE or CL_TRUE) indicates whether the read/write call is blocking.
Indeed, OpenCL allows async transfers, but the main problem is the AMD implementation: the transfer rate over the PCIe bus is extremely low.
I get a transfer rate of 1.3-1.6 GB/s on my 58xx, while NVIDIA offers a steady 5.5 GB/s on GPUs of the same class. It is a huge difference and a serious problem for some kinds of applications.
Can someone from the devteam comment on why the transfer rates are so low?
And if something will be done to better them?
Now I'm confused...
I was under the impression that when transferring between GPUs [enqueueCopyImage()], the data was transmitted over the CrossFire bridge, bypassing the motherboard and the PCIe bus, at rates far beyond 1.6 GB/s. Is this not the case? (Dev team)
As far as PCIe transfer rates go, I would think they are somewhere in the range of the stated spec for the card (Radeon HD 5870):
( http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-specifications.aspx )
- Memory data rate: 4.8 Gbps
- Memory bandwidth: 153.6 GB/sec
Perhaps this is where my hardware knowledge starts to trail off, but I believe these refer to external transfer rates rather than just on-chip transfers. If that is the case, then the bottleneck is not the GPU.
Additionally, since PCIe v2.x x16 is rated at 8 GB/s, I would hope that ATI isn't just barely pushing 1.6 GB/s.
That said, I would assume that anything lower than those speeds is due to inefficiencies in code or hardware, such as PCIe bus version mismatches. Ex) your motherboard has a PCIe v1.x slot and you have a PCIe v2.x card. It is my understanding that in those circumstances the link falls back to the slower v1.x rate.
How did you do that benchmark to come up with your numbers?
I'd really like an answer to this if possible. For my OpenCL implementation on ATI hardware, high bandwidth from host RAM to GPU RAM is important. If this truly is a downside of ATI, it could influence our preference for ATI. We are implementing math operations on large matrices, and currently we are overwhelmingly swayed by the superior core count ATI offers; however, based on the benchmarks stated above, this appears to be a serious issue.
I had trouble recreating these benchmark numbers, or getting an accurate benchmark at all for that matter. Is there any provided way/code to benchmark the bandwidth from host RAM to GPU RAM and back? I tried coding up a test using enqueueWriteBuffer() with a 1GB array of floats, but the copy seemed to be instantaneous, even when I queued the copy many times in a row. I assume there is some sort of optimization in the runtime which seems to be “optimizing out” my benchmark.
I personally greatly prefer ATI to NVidia, and would be very disappointed if this bottleneck causes a sway. If this is a misunderstanding or if there is a way in OpenCL to “get around” this, any help would be appreciated.
Currently the OpenCL implementation most likely doesn't use DMA. I hope the next release will have DMA support.
DMA support is a driver-level issue, not an OpenCL one, and will appear in an upcoming Catalyst update.