I need to create an application that adds only very low latency to the data flow, as I have to process a stream.
Could any of you please confirm what transfer rates can be achieved? I know it depends on the card, chipset, operating system and everything else; I would just like to get a feeling for the order of magnitude.
Currently we plan on using a GV-R927X under Linux on a Dell 310 with a Xeon 3400 CPU.
I need to achieve at least 1k transfers a second.
Are any special techniques needed? Which Linux would you recommend?
Thank you very much in advance!
What I can tell you is that I have measured round-trip latencies of usually ~35 ms. This includes dispatching the clEnqueueNDRangeKernel call and waiting for the results with a full stop. I've seen this go lower than 20 ms on some occasions. As a rule of thumb, budget about "a frame", as in interactive entertainment; these measurements seem to be in the same ballpark. OpenCL seems to do better than generic graphics in terms of latency, but not so much better that you can forget about data transfer optimization.
Keep in mind that those measurements were taken with non-trivial kernels which included some computation. So I'm talking about something that hopefully resembles a pessimistic scenario.
Those measurements were taken on an old AMD K10 architecture and I'd expect the latency to be lower on Intel or modern AMD systems.
Most importantly, if you can pipeline your transfers you could likely have half as much effective latency. Avoid full stops at all costs.
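To put rough numbers on the pipelining advice, here is a back-of-envelope sketch in C. It is a simplified model, not a measurement: the three stage times (upload, kernel, download), the struct and the function names are all hypothetical, and it assumes the stages of consecutive items can overlap perfectly.

```c
/* Hypothetical per-item stage times in milliseconds (illustrative only). */
typedef struct {
    double upload_ms;
    double kernel_ms;
    double download_ms;
} stage_times;

/* Full stop after every item: each item pays for all three stages in turn,
 * so a new result completes every (upload + kernel + download) ms. */
double serialized_interval_ms(stage_times t)
{
    return t.upload_ms + t.kernel_ms + t.download_ms;
}

/* Ideal pipelining: stages of consecutive items overlap, so in steady state
 * a new result completes every max(stage) ms. */
double pipelined_interval_ms(stage_times t)
{
    double longest = t.upload_ms;
    if (t.kernel_ms > longest) longest = t.kernel_ms;
    if (t.download_ms > longest) longest = t.download_ms;
    return longest;
}
```

With made-up stage times of 10, 15 and 10 ms, the serialized interval is 35 ms while the ideal pipelined interval is 15 ms. Each individual item still takes 35 ms end to end in this model; what improves is how often results arrive, which is presumably what "effective latency" refers to here.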
If your transfers are less than 4k each you are likely on safe ground, but I'd rather reduce the number of transfers than the transfer size. Be sure to understand how the driver manages this. The AMD APP manual contains some hints on how drivers use pinned memory to handle transfer requests.
EDIT: making clear this is not just a simple round trip.
Thank you very much for your answer, however disappointing it is for me.
More likely I need small transfer sizes (n x 10k) but a very high transfer count, around 1 kHz.
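For scale, a quick sketch of what such a workload means in numbers. The function names are hypothetical, and reading "10k" as 10240 bytes is an assumption.

```c
#include <stdint.h>

/* Sustained payload rate in bytes per second for fixed-size transfers. */
uint64_t payload_bytes_per_second(uint64_t bytes_per_transfer,
                                  uint64_t transfers_per_second)
{
    return bytes_per_transfer * transfers_per_second;
}

/* Time budget per transfer in microseconds at a given transfer rate. */
double budget_us_per_transfer(uint64_t transfers_per_second)
{
    return 1e6 / (double)transfers_per_second;
}
```

Taking n = 4 as an example, payload_bytes_per_second(4 * 10240, 1000) gives 40,960,000 bytes/s (about 41 MB/s), and budget_us_per_transfer(1000) gives 1000 us per transfer. The payload rate is modest, so the per-transfer overhead is likely the harder constraint.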
However, I can keep the resources allocated.
Isn't there a way to stream to and from the GPU, like a FIFO (pipe)?
Which document do you call the APP manual? Is it the AMD Accelerated Parallel Processing OpenCL Programming Guide (rev 2.7)?
My data arrives from the network; I have to process it and then forward it to further devices. The latency target is 0, of course, but 2-5 ms is probably acceptable.
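A rough way to relate arrival rate and latency is Little's law: the number of items in flight equals the arrival rate times the time each item spends in the system. The sketch below applies it; the function name is hypothetical and the result is only an estimate of how many buffers would need to be cycled.

```c
#include <stdint.h>

/* Little's law: items in flight = arrival rate x time in the system.
 * Latency is given in microseconds to keep the math integral; the result
 * is rounded up to a whole number of buffers. */
uint64_t buffers_in_flight(uint64_t arrivals_per_second, uint64_t latency_us)
{
    uint64_t product = arrivals_per_second * latency_us;
    return (product + 999999u) / 1000000u;
}
```

At 1 kHz, a 5 ms latency (5000 us) means about 5 packets in flight at any time, and a 35 ms round trip would mean about 35.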
Please, I would really appreciate more responses.
32 ms, OK, but does that still hold if I don't reallocate everything every time?
I can keep my buffers allocated and use memcpy on them as many times as I like, can't I?
Like the bandwidth tester app:
// standard host alloc
h_data = (unsigned char *)malloc(memSize);
for(unsigned int i = 0; i < memSize/sizeof(unsigned char); i++)
    h_data[i] = (unsigned char)(i & 0xff);
// MAPPED: mapped pointers to device buffer and conventional pointer access
void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
if(memMode == PINNED)
    h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    memcpy(dm_idata, h_data, memSize);
// Exiting program
ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
So I create my buffers at the beginning and then keep copying data into them.
Isn't that so?
This piece of code performs 1M transfers of 8k of data in 696 seconds (including overheads).
That is pretty much what I need! (I only need 1k transfers a second.)
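The arithmetic behind that comparison, as a small sketch (the function names are hypothetical; the inputs are the figures quoted above):

```c
/* Average time per transfer in microseconds. */
double us_per_transfer(double total_seconds, double transfer_count)
{
    return total_seconds * 1e6 / transfer_count;
}

/* Average number of transfers completed per second. */
double transfers_per_second(double total_seconds, double transfer_count)
{
    return transfer_count / total_seconds;
}
```

For 1M transfers in 696 seconds this gives 696 us per transfer, or roughly 1437 transfers per second, which is above the 1k transfers a second target.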
Please give me just a little confirmation.
Thank you in advance!
The clEnqueueUnmapMemObject API expects a valid memory object. As described on the clEnqueueUnmapMemObject reference page:
clEnqueueMapBuffer and clEnqueueMapImage increment the mapped count of the memory object. The initial mapped count value of a memory object is zero. Multiple calls to clEnqueueMapBuffer or clEnqueueMapImage on the same memory object will increment this mapped count by appropriate number of calls.
clEnqueueUnmapMemObject decrements the mapped count of the memory object.
So, you can allocate a memory buffer once and then map and unmap the same buffer multiple times.
In addition to this, I would like to suggest the following points:
1. Data transfer performance also depends on how the memory buffer was allocated (i.e. the memory flags passed to the allocation APIs). Generally the choice depends on how the application will use the buffer. Please refer to sections 4.5 OpenCL Memory Objects and 4.6 OpenCL Data Transfer Optimization under Chapter 4: OpenCL Performance and Optimization in the AMD Accelerated Parallel Processing OpenCL Programming Guide. They should give you some helpful ideas.
2. You can go through the AMD APP SDK OpenCL sample "AsyncDataTransfer" to see how asynchronous memory transfers can be achieved and how the GPU can be utilized better.
I'm also trying to achieve low latencies, circa 2-3 ms, with AsyncDataTransfer. I'm running on the HSA Beta.
Thanks for the SDK example mentioned above - I had a good look at it yesterday.
The Kaveri OpenCL Programmer Guide.docx suggests that producer/consumer patterns are allowed; is it possible to get an example?
What can be used as a semaphore between the producer and consumer?
Thanks in advance.
Sorry, somehow I missed your response.
"2. You can go through the AMD APP SDK OpenCL sample "AsyncDataTransfer" to see how asynchronous memory transfers can be achieved and how the GPU can be utilized better."
I have none with that name; did you mean the TransferOverlap example?
By now I have a somewhat better understanding of OpenCL and have realized that I won't try to do the streaming. There is no way to do it without killing performance.
I found it in C:\Users\<etc>\AMD APP SDK\2.9\samples\opencl\cpp_cl\AsyncDataTransfer
I got the AMD APP SDK from http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/amd-accelerated-parallel-proce...
Hope that helps.