7 Replies Latest reply on Sep 18, 2015 4:16 AM by dipak

    Optimization guide memory allocation

    nibal

      I'm trying to implement the guide's suggestions for memory allocation, p. 1-9, but I can't understand what it means :-(

      In the CL_MEM_ALLOC_HOST_PTR, without-VM case, it states zero copy between the (PC's?) CPU and what?

      In my case I need it for transfers between my PC's CPU and my card's GPU, where I run the kernel. Is that the right case?

      I see no speed improvement, and I get wrong results due to synchronization issues.

      What is the difference between that and the default case without any flags?

        • Re: Optimization guide memory allocation
          dipak

          Nowadays, VM is almost always enabled (you can check the clinfo output, which reports it as "Driver version: 1800.8 (VM)"). So the CL_MEM_ALLOC_HOST_PTR flag and the default case correspond to different scenarios, as described in that chart.

          However, remember that in some cases, especially for dGPUs, the runtime may choose to copy the data from zero-copy pinned host memory to device memory to improve data access. That's why, after kernel execution, you are expected to do a map/unmap operation on that memory; the runtime will then ensure that everything is up to date.

          You may check section "1.4.2.4 Application Scenarios and Recommended OpenCL Paths" in the optimization guide, which describes various application scenarios and the corresponding paths in the OpenCL API that are known to work well on AMD platforms.

          To know more about zero copy and various access patterns, you may refer to this old but nice article: http://developer.amd.com/wordpress/media/2013/06/1004_final.pdf
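          To make the map/unmap suggestion concrete, here is a minimal sketch of the pattern (not code from the guide; `ctx`, `queue` and `kern` are placeholder handles assumed to be created elsewhere, and error checking is abbreviated):

```c
#include <string.h>
#include <CL/cl.h>

/* Sketch only: ctx/queue/kern are assumed to exist already. */
static void run_once(cl_context ctx, cl_command_queue queue, cl_kernel kern,
                     const float *in, float *out, size_t n)
{
    cl_int err;
    size_t nbytes = n * sizeof(float);

    /* Zero-copy candidate: let the runtime allocate pinned host memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                nbytes, NULL, &err);

    /* Map for writing (INVALIDATE avoids a device-to-host transfer), fill, unmap. */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                   CL_MAP_WRITE_INVALIDATE_REGION, 0, nbytes, 0, NULL, NULL, &err);
    memcpy(p, in, nbytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Map AFTER the kernel: this is what makes the results visible to the
     * host (on a dGPU the runtime may copy device memory back here). */
    p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                    0, nbytes, 0, NULL, NULL, &err);
    memcpy(out, p, nbytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
}
```

          The key point is the second map: it is the map operation, not kernel completion, that guarantees the host sees up-to-date buffer contents.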

           

          Regards,

            • Re: Optimization guide memory allocation
              nibal

              Hi Di,

               

              Ty for your fast response.

              Indeed my clinfo reports: Driver version: 1445.5 (VM)

              I am using SDK 2.9-1 and OpenCL 1.2, though, which doesn't support VM. I would be surprised if it supported it implicitly but not explicitly...

               

              The guide is not clear. In the case I mentioned, it doesn't specify where the zero copy takes place. It mentions from the CPU (the app?) to where?

               

              I'm well aware of zero-copy, kernel-space and user-space programming, since the days of TUX, the kernel web server ;-)

               

              Hmmm. You are suggesting that I map/unmap the buffer just before and after kernel execution. I have seen that in the samples as well and was going to try it next. Up till now, I mapped in setupCL and unmapped in shutdownCL, doing synchronization with clFlush, clFinish and events. Trying the map/unmap case, CodeXL's profiler reports ~200 ms, the second largest delay, for 7020 mapBuffer calls @ 0.0283 ms each. That doesn't quite seem like 0-copy to me :-(.

                • Re: Optimization guide memory allocation
                  dipak

                  Hi,

                  I'm using, though, SDK 2.9-1 and ocl 1.2, which doesn't support VM. I would be surprised if it supports it implicitly, but not explicitly..

                  Please don't confuse it with shared virtual memory (SVM), which is an OpenCL 2.0 feature and requires an OpenCL 2.0-compatible setup to use.

                   

                   

                  The guide is not clear. In the case I mentioned, it doesn't specify where the zero copy takes place. Mentions from CPU (app?) to where?

                  Sorry, I overlooked it last time. Yes, it seems that info is missing; thanks for pointing it out. I guess it would be host memory in the case of the CPU.

                   

                  Up till now, I mapped in setupCL and unmapped in shutdownCL, doing synchronizations by clFlush, clFinish and events.

                  clFinish ensures that a kernel execution is completed, but it does not ensure that the updated data has been mapped to the host address space. That's why a mapping is required. In the case of a dGPU, this mapping may take longer due to the copy of updated data from the device. As the optimization guide says:

                   

                  For CL_MEM_USE_HOST_PTR and the CL_MEM_ALLOC_HOST_PTR cases that use copy map mode, the runtime tracks if the map location contains an up-to-date copy of the memory object contents and avoids doing a transfer from the device when mapping as CL_MAP_READ. This determination is based on whether an operation such as  clEnqueueWriteBuffer/clEnqueueCopyBuffer or a kernel execution has modified the memory object.
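                  As a small sketch of the point above (placeholder names `queue`, `buf`, `nbytes`, not actual code from this thread): clFinish only guarantees the kernel finished on the device; it is the map, blocking or synchronized via its event, that guarantees the host pointer is up to date.

```c
#include <CL/cl.h>

/* Sketch: event-synchronized non-blocking map for reading results. */
static float *map_results(cl_command_queue queue, cl_mem buf, size_t nbytes)
{
    cl_int err;
    cl_event map_done;
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_FALSE /* async */,
                                           CL_MAP_READ, 0, nbytes,
                                           0, NULL, &map_done, &err);
    clWaitForEvents(1, &map_done);  /* only now is p safe to read */
    clReleaseEvent(map_done);
    return p;  /* caller releases the mapping with clEnqueueUnmapMemObject */
}
```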

                   

                  Regards,

                    • Re: Optimization guide memory allocation
                      nibal

                      >>> I'm using, though, SDK 2.9-1 and ocl 1.2, which doesn't support VM. I would be surprised if it supports it implicitly, but not explicitly..

                      > Please don't confuse it with shared virtual memory (SVM) which is an OpenCL 2.0 feature and requires OpenCL 2.0 compatible setup to use it.

                       

                      Indeed, I thought it was Virtual memory. Is it Virtual Machine? A glossary would be nice...

                       

                      >>The guide is not clear. In the case I mentioned, it doesn't specify where the zero copy takes place. Mentions from CPU (app?) to where?

                      >Sorry, I overlooked it last time. Yes, it seems the info is missing. Thanks for pointing out. I guess, it would be host memory in case of CPU.

                       

                      It took me a while, but I finally figured it out. The table is complete, but mislabeled. I would suggest a couple of changes:

                       

                      1) Title should be changed to "Use of mappings with different kernel devices". In the preceding text it could be explained as an example of memory objects and their properties.

                      2) Column "Device Type" should be changed to "Kernel Device"!!!

                      3) Column "Location" should be changed to "Used memory"

                      4) Column "Map Location" should be changed to "Memory returned by mapper"

                       

                      >>Up till now, I mapped in setupCL and unmapped in shutdownCL, doing synchronizations by clFlush, clFinish and events.

                      > clFinish ensures that a kernel execution is completed, but it does not ensure that the updated data has been mapped to the host-address space. That's why a mapping is required. In-case of dGPU, this mapping may take longer time due to copy of updated data from the device. As the optimization guide says:

                       

                      Sure, but this is guaranteed by the following clFlush and event.

                       

                      > For CL_MEM_USE_HOST_PTR and the CL_MEM_ALLOC_HOST_PTR cases that use copy map mode, the runtime tracks if the map location contains an up-to-date copy of the memory object contents and avoids doing a transfer from the device when mapping as CL_MAP_READ. This determination is based on whether an operation such as clEnqueueWriteBuffer/clEnqueueCopyBuffer or a kernel execution has modified the memory object.

                       

                      Thanks for your advice. I was ready to give up on maps, but I will pursue it further. Will update the ticket when I get it working.

                        • Re: Optimization guide memory allocation
                          nibal

                          Hi Dipak,

                           

                          Using CodeXL's profiler I was able to analyze 2 versions of my program: the first with writeBuffer and readBuffer calls (no mappings), and the other with mapBuffer and unmapBuffer calls (no read/write/fillBuffers). The rest of the code is identical, except for the extra CL_MEM_ALLOC_HOST_PTR flag in the createBuffer for case (B). The kernel used is a modification of your 1024-size FFT sample. Therefore, all data transfers except the result are 1024 floats long, from my application to the GPU:

                           

                          A) Read/Write Buffers

                          WriteBuffer: 173 ms for 6241 calls, each @ 0.02773 ms

                          ReadBuffer: 122 ms for 390 calls, each @ 0.312 ms

                          B) Map/Unmap Buffers

                          MapBuffer: 193 ms for 6630 calls, each @ 0.02907 ms

                          UnmapBuffer: 120 ms for 6630 calls, each @ 0.01811 ms

                           

                          I notice 3 things:

                          1) ReadBuffer is more expensive than WriteBuffer in (A): both copy a buffer, but ReadBuffer is used for the FFT result, which has a size of 36 B (instead of the 4 B that WriteBuffer uses)

                          2) WriteBuffer is faster than MapBuffer, even though I use MapBuffer in async mode and with INVALIDATE_REGION on the input maps. Strange, since it should just do a memory allocation, with no synchronization needed. Maybe it's because MapBuffer is also used to read the results, which are longer than the input.

                          3) According to the guide, this should be 0-copy. It doesn't look that way. Am I missing something?
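                          For reference, here is a side-by-side sketch of the two per-iteration transfer patterns being compared (placeholder names `queue`, `buf`, `in`, `nbytes`, not the real code):

```c
#include <string.h>
#include <CL/cl.h>

/* (A) explicit copy into a default (no-flags) buffer */
static void send_copy(cl_command_queue queue, cl_mem buf,
                      const float *in, size_t nbytes)
{
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, in, 0, NULL, NULL);
}

/* (B) map/unmap on a CL_MEM_ALLOC_HOST_PTR buffer. The memcpy into the
 * mapped pointer replaces the WriteBuffer, so "zero copy" here only means
 * the runtime itself may skip an extra staging copy, not that no bytes
 * move at all. */
static void send_mapped(cl_command_queue queue, cl_mem buf,
                        const float *in, size_t nbytes)
{
    cl_int err;
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                   CL_MAP_WRITE_INVALIDATE_REGION, 0, nbytes,
                   0, NULL, NULL, &err);
    memcpy(p, in, nbytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```

                          One plausible reading of the numbers above, offered as a guess: at ~4 KB per transfer, the per-call enqueue overhead dominates both paths, so each map/unmap pair costs about as much as a WriteBuffer regardless of whether the underlying memory is zero-copy.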

                           

                          What does VM mean?