8 Replies Latest reply on Jul 25, 2014 12:36 PM by semir

    Transfer rate question

    semir

      Hello,

       

         I need to create an application that adds only very low latency to the data flow, because I have to process a stream.

       

         Could any of you please confirm what transfer rates can be achieved? I know it depends on the card, chipset, operating system and everything else; I would only like to get a feeling for the order of magnitude.

         Currently we plan on using a GV-R927X under Linux on a Dell 310 with a Xeon 3400 CPU.

         I need to achieve at least 1k transfers per second.

         Are any special techniques needed? Which Linux distribution would you recommend?

       

      Thank you very much in advance!

       

      Bests,

      Semirke

        • Re: Transfer rate question
          maxdz8

          What I can tell you is that I have measured round-trip latencies of usually ~35 ms. This includes dispatching the clEnqueueNDRangeKernel call and waiting for the results with a full stop. I've seen this go below 20 ms on some occasions. As a rule of thumb, think of "a frame" as in interactive entertainment; those measurements seem to be in the same ballpark. OpenCL seems to do better than generic graphics in terms of latency, but not so much better that you can forget about data transfer optimization.

          Keep in mind that those measurements were taken with non-trivial kernels that included some computation, so I'm talking about something hopefully resembling a pessimistic scenario.

           

          Those measurements were taken on an old AMD K10 architecture and I'd expect it to be lower on Intel or modern AMD systems.

           

          Most importantly, if you can pipeline your transfers you could likely halve the effective latency. Avoid full stops at all costs.
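          A toy model may help show where a figure like "half" can come from. This is only a sketch, not a measurement: the function names are hypothetical, the stage times are made-up placeholders, and it assumes the upload, kernel and download stages can overlap perfectly, which real hardware and drivers may not allow.

```c
#include <assert.h>
#include <stddef.h>

/* Interval between finished items when each item waits for the previous one
   to complete entirely (a "full stop"): the sum of all stage times. */
double serialized_period_ms(const double *stage_ms, size_t n)
{
    assert(stage_ms != NULL && n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stage_ms[i];
    return total;
}

/* Steady-state interval between finished items when all stages overlap
   perfectly (idealized assumption): the time of the slowest stage. */
double pipelined_period_ms(const double *stage_ms, size_t n)
{
    assert(stage_ms != NULL && n > 0);
    double slowest = stage_ms[0];
    for (size_t i = 1; i < n; i++)
        if (stage_ms[i] > slowest)
            slowest = stage_ms[i];
    return slowest;
}
```

          With hypothetical stage times of 1 ms upload, 2 ms kernel and 1 ms download, the serialized interval is 4 ms and the pipelined one is 2 ms. Note that this describes the interval between results; the delay seen by a single item does not shrink in this model.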

           

          If your transfers are smaller than 4k each you are likely on safe ground, but I'd rather reduce the number of transfers than the transfer size. Be sure to understand how the driver manages this. The AMD APP manual contains some hints on how drivers use pinned memory to handle transfer requests.

           

          EDIT: making clear this is not just a simple round trip.

            • Re: Transfer rate question
              semir

              Hi,

                Thank you very much for your answer, sad as it is for me.

               

                I will more likely need small transfer sizes (n x 10k) but a very high transfer count, around 1 kHz.

                However, I can keep the resources allocated.

                Isn't there a way to stream to and from the GPU, like a FIFO (pipe)?

               

                Which document do you call the APP manual? The AMD Accelerated Parallel Processing OpenCL Programming Guide (rev 2.7)?

               

                My data arrives from the network; I have to process it and then forward it to further devices. The latency target is 0, of course, but 2-5 ms is probably acceptable.
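                To put the rate and latency figures side by side: by Little's law, the number of items in flight equals rate times latency. The helper below is a minimal sketch of that arithmetic; its name and the microsecond unit are choices made for illustration only.

```c
#include <assert.h>

/* Number of items that must be in flight at once to sustain rate_hz items per
   second when each item takes latency_us microseconds end to end
   (Little's law: in flight = rate x latency, rounded up). */
unsigned long items_in_flight(unsigned long rate_hz, unsigned long latency_us)
{
    assert(rate_hz > 0);
    unsigned long long product = (unsigned long long)rate_hz * latency_us;
    return (unsigned long)((product + 999999ULL) / 1000000ULL);
}
```

                At 1 kHz, a latency of 2-5 ms means 2 to 5 items are in flight at any time, while a 35 ms round trip would mean 35, so a sustained 1 kHz rate is only reachable if that many transfers can overlap.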

               

               

                Thank you!

               

              Bests,

              Semir

                • Re: Transfer rate question
                  semir

                  Hi Guys,

                   

                  Please, I would really appreciate more responses.

                  ~35 ms, OK, but does that still hold if I don't reallocate everything every time?

                   

                  I can keep my buffers allocated and memcpy into them as many times as I like, can't I?

                  Like the bandwidth tester app:

                   

                          // standard host alloc
                          h_data = (unsigned char *)malloc(memSize);

                          // initialize with a repeating 0..255 byte pattern
                          for(unsigned int i = 0; i < memSize/sizeof(unsigned char); i++)
                          {
                              h_data[i] = (unsigned char)(i & 0xff);
                          }

                          // MAPPED: mapped pointers to device buffer and conventional pointer access
                          void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
                          oclCheckError(ciErrNum, CL_SUCCESS);

                          // PINNED: use a mapped pinned host buffer as the source instead
                          // (this overwrites the pointer returned by malloc above)
                          if(memMode == PINNED)
                          {
                              h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);
                              oclCheckError(ciErrNum, CL_SUCCESS);
                          }

                          // copy repeatedly into the mapped region
                          for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
                          {
                              memcpy(dm_idata, h_data, memSize);
                          }

                          // Exiting program
                          ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
                          oclCheckError(ciErrNum, CL_SUCCESS);

                   

                   

                  So I create my buffers at the beginning then keep copying data into them.

                  Isn't that so?

                   

                  This piece of code performs 1M transfers of 8k of data in 696 seconds (including overheads).

                  This is pretty much what I need! (I only need 1k transfers per second.)
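                  For reference, the figures above can be turned into rates with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and "8k" is assumed to mean 8192 bytes.

```c
#include <assert.h>

/* Transfers completed per second of wall-clock time. */
double transfers_per_second(double transfers, double seconds)
{
    assert(seconds > 0.0);
    return transfers / seconds;
}

/* Average wall-clock time spent per transfer, in milliseconds. */
double ms_per_transfer(double transfers, double seconds)
{
    assert(transfers > 0.0);
    return seconds * 1000.0 / transfers;
}

/* Payload throughput in MB/s, with 1 MB taken as 1e6 bytes. */
double megabytes_per_second(double transfers, double bytes_each, double seconds)
{
    assert(seconds > 0.0);
    return transfers * bytes_each / seconds / 1e6;
}
```

                  For 1M transfers of 8192 bytes in 696 seconds this gives about 1,437 transfers per second, roughly 0.7 ms per transfer, and about 11.8 MB/s of payload.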

                   

                  Please give me just a little confirmation.

                   

                  Thank you in advance!

                  Bests,

                  Semir    

                    • Re: Transfer rate question
                      dipak

                      Hi,

                      The clEnqueueMapBuffer and clEnqueueUnmapMemObject APIs expect a valid memory object. As described on the clEnqueueUnmapMemObject reference page:

                      clEnqueueMapBuffer and clEnqueueMapImage increments the mapped count of the memory object. The initial mapped count value of a memory object is zero. Multiple calls to clEnqueueMapBuffer or clEnqueueMapImage on the same memory object will increment this mapped count by appropriate number of calls. clEnqueueUnmapMemObject decrements the mapped count of the memory object.

                       

                      So, you can allocate a memory buffer once and then map and unmap the same buffer multiple times.

                       

                      In addition, I would like to suggest the following points:

                       

                      1. Data transfer performance also depends on how the memory buffer was allocated (i.e. the memory flags passed to the allocation APIs). Generally the choice is made according to how the application will use the buffer. Please refer to sections 4.5 OpenCL Memory Objects and 4.6 OpenCL Data Transfer Optimization under Chapter 4: OpenCL Performance and Optimization in the AMD Accelerated Parallel Processing OpenCL Programming Guide. They should give you some helpful ideas.

                      2. You can go through the AMD APP SDK OpenCL sample "AsyncDataTransfer" to see how asynchronous memory transfers can be done and how they lead to better GPU utilization.

                       

                      Regards,
