9 Replies Latest reply on Feb 12, 2010 1:03 PM by nou

    Memory buffer retaining or re-creating

    Raistmer
      What is more effective?

      My app uses a buffer on the GPU, about 4 MB in size.
      Its size is always the same, but from time to time new data from host memory has to be uploaded; the buffer is then used a few times in kernels before the next update from host memory.
      The question is:
      which course of action is better:

      1) delete and re-create the buffer each time an update from the host is needed, via clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, ...)

      or

      2) allocate the buffer for the lifetime of the app and do updates by mapping it into host address space when needed

      ?

      And an additional question about case 2.
      Can I update the GPU buffer directly in this case, that is, write the results of CPU computations into the buffer one by one? Or do I still need to write the results into some additional host-memory buffer and, only after that buffer is fully updated, upload it to the GPU buffer in one go? Which approach would give better performance?
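
      To make case 2 concrete, here is roughly what I have in mind (just a sketch; the context, queue and buffer names are placeholders, error checks omitted):

      /* one long-lived ~4 MB buffer, created once */
      cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, BUF_SIZE, NULL, &err);

      /* each time new host data is ready: */
      void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, BUF_SIZE, 0, NULL, NULL, &err);
      /* ... write CPU results into p, element by element or via memcpy ... */
      clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
      /* kernels then use buf several times until the next update */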
        • Memory buffer retaining or re-creating
          ibird

          In an application that calls the kernel many consecutive times, I found with a profiler that, in my situation, deleting and re-creating the buffers was a performance bottleneck. So I retain the buffers and reuse them, and I re-create a buffer only if the retained one is not big enough (like the std::vector class).


          I do not use CL_MEM_USE_HOST_PTR;

          instead, I use clEnqueueWriteBuffer and clEnqueueReadBuffer to load the input data and read the output.
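
          Roughly what I do (a sketch, not my exact code; ctx, queue and the host pointers are placeholders, error checks omitted):

          /* keep the cl_mem and its capacity around between kernel launches */
          if (needed_size > buf_capacity) {
              if (buf) clReleaseMemObject(buf);
              buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, needed_size, NULL, &err);
              buf_capacity = needed_size;
          }
          /* upload the new input, run the kernel, read the result back */
          clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, needed_size, host_in, 0, NULL, NULL);
          /* ... clEnqueueNDRangeKernel(...) ... */
          clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, needed_size, host_out, 0, NULL, NULL);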

          • Memory buffer retaining or re-creating
            Raistmer
            Thanks!
            What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?
              • Memory buffer retaining or re-creating
                genaganna


                Originally posted by: Raistmer Thanks! What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?


                It is possible to map a buffer into host address space to read from or write to it. Mapping by itself does not say anything about performance.
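
                For example, reading results back through a mapping would look roughly like this (just a sketch; queue, buf and size are assumed to exist, error handling omitted):

                float *p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                                      0, size, 0, NULL, NULL, &err);
                /* ... read results directly from p ... */
                clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);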

                  • Memory buffer retaining or re-creating
                    Raistmer
                    Originally posted by: genaganna

                    It is possible to map a buffer into host address space to read from or write to it. Mapping by itself does not say anything about performance.



                    LoL, sure, mapping by itself doesn't say anything about performance, but I hope there are some people who have tried this variant and can give some info on its performance versus the other possible methods.
                  • Memory buffer retaining or re-creating
                    gaurav.garg


                    What about mapping the buffer into host address space and writing element by element directly to GPU memory when an update from the CPU is required? Is that possible, and is it faster than having an additional buffer in host memory?


                    Direct mapping/unmapping is usually slower than using writeBuffer and copyBuffer.

                    I think the best way for you would be to create two CL buffers: one in GPU memory and another in host address space (use CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR). Then do mapping/unmapping on the host buffer and use clEnqueueCopyBuffer to copy the data from host to GPU.

                    This approach gives you the fastest data transfer (the copy goes through pinned memory) and avoids the overhead of creating and destroying CL buffers again and again.
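
                    A rough sketch of that setup (assuming ctx and queue already exist and size is your ~4 MB buffer size; error checks omitted):

                    /* pinned staging buffer in host-accessible memory */
                    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                                   size, NULL, &err);
                    /* working buffer in GPU memory */
                    cl_mem device_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

                    /* per update: fill the pinned buffer through a mapping ... */
                    void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                                 0, size, 0, NULL, NULL, &err);
                    /* ... write the CPU results into p ... */
                    clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);
                    /* ... then copy pinned -> device over PCIe */
                    clEnqueueCopyBuffer(queue, pinned, device_buf, 0, 0, size, 0, NULL, NULL);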

                      • Memory buffer retaining or re-creating
                        nou

                        I tried oclBandwidthTest on ATI

                        --access=mapped

                        Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            3325.4

                         Device to Host Bandwidth, 1 Device(s), Paged memory, mapped access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            3226.4

                         Device to Device Bandwidth, 1 Device(s)
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            43258.2

                        --access=direct

                         Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            2793.0

                         Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            757.9

                         Device to Device Bandwidth, 1 Device(s)
                           Transfer Size (Bytes)    Bandwidth(MB/s)
                           33554432            43329.9

                        So the mapped-buffer numbers are comparable with the PCIeSpeedTest numbers, and clEnqueueRead is significantly slower than the other approaches.

                          • Memory buffer retaining or re-creating
                            gaurav.garg


                            I tried oclBandwidthTest on ATI

                            --access=mapped

                            Host to Device Bandwidth, 1 Device(s), Paged memory, mapped access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            3325.4

                             Device to Host Bandwidth, 1 Device(s), Paged memory, mapped access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            3226.4

                             Device to Device Bandwidth, 1 Device(s)
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            43258.2

                            --access=direct

                             Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            2793.0

                             Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            757.9

                             Device to Device Bandwidth, 1 Device(s)
                               Transfer Size (Bytes)    Bandwidth(MB/s)
                               33554432            43329.9

                            So the mapped-buffer numbers are comparable with the PCIeSpeedTest numbers, and clEnqueueRead is significantly slower than the other approaches.



                            I just reviewed the oclBandwidthTest code, and the timing for mapped access doesn't seem correct. The unmap call is asynchronous and there is no wait for it before the timer is stopped. I am not sure whether the implementation actually makes this API asynchronous, but if it does, the timing is wrong.

                            I usually take these benchmarks with a pinch of salt until I see the source code. This benchmark is released by Nvidia and might not be the best fit for AMD's platform. For example, if I had to benchmark pinned data transfer, I would never do it the way it is done in this benchmark. The right way would be to use clEnqueueCopyBuffer directly, rather than first mapping the pinned buffer on the host and then copying from the mapped pointer via clEnqueueWriteBuffer.
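
                            For example, to time the mapped path correctly, something along these lines would be needed (just a sketch; queue, buf and size are assumed to exist, error checks omitted):

                            cl_event e;
                            void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                                         0, size, 0, NULL, NULL, &err);
                            /* ... fill p with the data being transferred ... */
                            clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, &e);
                            clWaitForEvents(1, &e);  /* unmap is asynchronous: wait before stopping the timer */
                            /* stop the timer here */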


                            Wow, why are read and write so asymmetric? Write is slower too, but the read speed is just terrible... Looks more like AGP than PCI-E...


                            As OpenCL is implemented on top of CAL, I can guess what the reason might be. In CAL there is no way to copy data directly from a host pointer to GPU local memory, so the copy has to happen in two steps: first from the host pointer to a CAL remote resource, and then from the remote resource to the local resource (and vice versa for device-to-host transfers). PCIe speed is usually the same in both directions; the performance difference comes from the step that copies data from the remote resource back to the host pointer.

                              • Memory buffer retaining or re-creating
                                nou

                                But these numbers correspond quite well with my own test, where I map and then unmap the buffer.

                                GPU->CPU: 3021.56 MiB/s
                                CPU->GPU: 2162.36 MiB/s

                                /* map asynchronously, wait for the map to complete, then read its profiling timestamps */
                                float *ptr = (float*)clEnqueueMapBuffer(queue, buff[0], CL_FALSE, CL_MAP_WRITE,
                                                                        0, 16*1024*1024, 0, NULL, &e_write, &err_code);
                                clWaitForEvents(1, &e_write);
                                clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_START, sizeof(long long), &start, NULL);
                                clGetEventProfilingInfo(e_write, CL_PROFILING_COMMAND_END, sizeof(long long), &end, NULL);

                        • Memory buffer retaining or re-creating
                          Raistmer
                          Thanks for replies!

                          That is via clEnqueueRead, yes?:
                          Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
                          Transfer Size (Bytes) Bandwidth(MB/s)
                          33554432 757.9

                          Wow, why are read and write so asymmetric? Write is slower too, but the read speed is just terrible... Looks more like AGP than PCI-E...