12 Replies Latest reply on Mar 18, 2010 10:21 PM by Raistmer

    Question about pinned memory implementation in OpenCL

    Raistmer
      For the current SDK 2.01:

      1) Does it exist at all? That is, does the current SDK 2.01 obey the CL_MEM_ALLOC_HOST_PTR flag, or does it just ignore it?

      2) If it exists (that is, if some memory really is allocated from the non-paged pool), is it mapped to an uncached region, or does the "cacheability" of this region change when the map/unmap procedures are used?
      That is, can the CPU work efficiently with this memory region (cacheable memory when mapped into the host address space), or will it suffer when reading/writing this memory region?
        • Illusio

          I'm not sure I understand your questions, but I'll give it a shot:

          1) What makes you think CL_MEM_ALLOC_HOST_PTR allocates pinned memory? The standard says this flag requests "host accessible memory", not necessarily from the nonpaged pool. (Although I guess it's pretty likely that, whatever the type of memory, it will be locked down by the driver while the GPU is using the buffer.)

          2) Given that 1 seems to arise from a misunderstanding of the spec: If you want nonpaged memory, with specific caching properties, why don't you just allocate it yourself on the host and create your buffer using CL_MEM_USE_HOST_PTR?

           

            • Raistmer
              1) I've seen a mention in one of the "Best Practices" guides, maybe from NVidia, that CL_MEM_ALLOC_HOST_PTR could result in a pinned memory allocation. So the question is: how is it implemented in the ATI SDK? Will it allocate pinned memory or not?

              2) If I create a buffer with CL_MEM_USE_HOST_PTR, it means that the data is copied from my host buffer to a newly allocated GPU buffer.
              That is, if I want to update the data, I need to re-create the GPU buffer, using CL_MEM_USE_HOST_PTR again. I suppose re-creating the GPU buffer is a more costly operation than just copying from the host buffer to an already created GPU buffer.

                • Illusio

                  1. Ahh. I have no idea what the default is, but you have no guarantees of any specific behavior in that respect.

                   

                  2. No. When you create a buffer using CL_MEM_USE_HOST_PTR, you pass along a pointer to a memory area in host memory that you have allocated previously. The contents are not copied (but can be cached on the GPU during operations). You also do not have to re-create any buffers. The two flags are almost identical, except that CL_MEM_ALLOC_HOST_PTR does the allocation of host memory for you. This means that USE_HOST_PTR is an optimization if you already have the data available in host memory somewhere: you don't need to copy anything to a new OpenCL buffer. It also gives you the freedom to tweak paging settings if required.

                  You're probably confusing it with CL_MEM_COPY_HOST_PTR btw.

                   

                  Anyway, am I to understand that you really don't care whether the memory is pinned or not - you're just worried about a potential performance hit if the memory is uncacheable? In that case, allocate the memory yourself and use CL_MEM_USE_HOST_PTR, or try to write your host code in a way that minimizes the potential problem (such as using streaming SSE instructions that read/write huge chunks that aren't cached anyway).

                  Given that the spec is entirely silent on the pinning issue, I'd think that this is something that could change from hardware to hardware in both nVidia's and ATI's implementations anyway.

                    • Raistmer
                      Originally posted by: Illusio
                      "You're probably confusing it with CL_MEM_COPY_HOST_PTR btw."

                      Probably yes, thanks for the correction.

                      "Anyway, am I to understand that you really don't care whether the memory is pinned or not - you're just worried about a potential performance hit if the memory is uncacheable?"

                      Actually, I'm worried about both issues:
                      1) getting the optimal memory transfer speed between the GPU and host memory (currently such transfers take most of the time in my app; I described this issue in another thread)
                      2) not harming CPU performance by using uncached memory regions during buffer updates on the CPU side.

                      "(such as using streaming SSE instructions that read/write huge chunks that aren't cached anyway)."

                      SSE reads/writes are cached unless one uses the non-temporal versions (streaming writes and non-temporal prefetches).
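
To illustrate that distinction, here is a minimal sketch, assuming an x86 compiler that provides the SSE2 intrinsics in `<emmintrin.h>`. `_mm_store_si128` is an ordinary store that goes through the cache, while `_mm_stream_si128` is the non-temporal variant. The function name `copy_block` is hypothetical, and the `memcpy` fallback is an addition so that the sketch also builds on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: copies `bytes` bytes from src to dst.
 * With use_streaming != 0 the stores are non-temporal (cache-bypassing);
 * otherwise ordinary cached stores are used. */
void copy_block(void *dst, const void *src, size_t bytes, int use_streaming)
{
#ifdef HAVE_SSE2
    /* Both store intrinsics need a 16-byte aligned destination. */
    if (((uintptr_t)dst % 16) == 0 && (bytes % 16) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i, n = bytes / 16;

        for (i = 0; i < n; ++i) {
            __m128i v = _mm_loadu_si128(s + i);   /* ordinary load */
            if (use_streaming)
                _mm_stream_si128(d + i, v);       /* non-temporal store */
            else
                _mm_store_si128(d + i, v);        /* ordinary, cached store */
        }
        if (use_streaming)
            _mm_sfence();   /* order the streamed stores before later stores */
        return;
    }
#endif
    (void)use_streaming;
    memcpy(dst, src, bytes);   /* fallback: unaligned input or no SSE2 */
}
```

Both paths produce the same bytes in `dst`; they differ only in whether the written lines are kept in the CPU cache.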

                      That is, my question could be reformulated this way:
                      Which is better: to use pinned memory (if there is any; that is probably a question for AMD staff) only as the source for copying to a GPU buffer, and to use a separate buffer in host memory for data preparation (a few reads/writes are needed per element, so the CPU would suffer badly if the memory is uncached)? Or is it safe to prepare the data right in the "pinned" host buffer?
                      Using host memory as the single buffer and hoping that the runtime caches it on the GPU is not really an option either: it would be implementation-specific, and the implementation could decide to update the host buffer between kernel calls, when the data is not needed on the host at all but is certainly needed on the GPU. It is best to have explicit control over data placement and movement.

                        • Illusio

                          Yes, I was thinking of explicitly using the noncached SSE instructions (I thought Intel referred to those as "streaming", but it's been a while since I coded at that level myself), but sadly it sounds like that would be a worthless option for you, from what you write.

                          You can prepare data right in the host buffer if you want, but the buffer must be mapped using clEnqueueMapBuffer before you start modifying it. This is needed to notify the runtime that it needs to invalidate any GPU-side caching of the data (or possibly copy back data if the GPU can modify the buffer).

                          You then have to do a clEnqueueUnmapMemObject after your modifications.

                          Obviously, you can't make such changes while the buffer is in use by the GPU. I'm not sure if this clarifies anything with regard to the caching issue. The GPU will cache host-side buffers on any realistic hardware, and it will be able to do so because of the synchronization mechanisms above.

                          The main performance issue when updating is probably the amount of memory you have to map to complete your host-side modifications, because all of the mapped memory must be invalidated in the GPU's cache. If you need to map it all, performance may be worse than with a full buffer copy, because cache misses stall the GPU for longer when it has to fetch data from host memory. Depending on your application's cache friendliness, it might be simpler to just copy the modified buffer to the GPU.

                          It's a bad sign that data transfer dominates your application's execution time, by the way. Have you tried running it on the CPU device? When memory transfer is the bottleneck, it's quite possible that the CPU is faster than the GPU if you have a new CPU with lots of cores, and then you don't have to worry about all the issues with anti-social pinning of memory and the like.

                           

                  • Raistmer
                    I'm porting a CPU app to the GPU, so using only the CPU is not an option.

                    For now I use the sequence described in the best practices guide and also proposed by Gaurav Garg in this thread:
                    http://forums.amd.com/devforum...threadid=126084

                    "
                    Direct mapping/unmapping is usually slower than using writeBuffer and copyBuffer.

                    I think the best way for you would be to create two CL buffers, one in GPU memory and another in host address space (use CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR). Now do mapping/unmapping on the host buffer and then use clEnqueueCopyBuffer to copy data from host to GPU.

                    This approach will make sure you have the fastest data transfer (transferring data using pinned memory) and has no overhead of creating and destroying CL buffers again and again.
                    "
                    But the picture you draw supposes that both buffers will be cached on the GPU? Or did I misunderstand you?
                    The reason I returned to this question is very bad performance. Mapping/unmapping the host buffer takes more time than the buffer copy itself (described in this thread:
                    http://forums.amd.com/devforum...d=129722&enterthread=y )


                    CL_MEM_USE_HOST_PTR
                    This flag is valid only if host_ptr is not NULL. If
                    specified, it indicates that the application wants the
                    OpenCL implementation to use memory referenced by
                    host_ptr as the storage bits for the memory object.
                    OpenCL implementations are allowed to cache the buffer
                    contents pointed to by host_ptr in device memory. This
                    cached copy can be used when kernels are executed on a
                    device.
                    The result of OpenCL commands that operate on multiple
                    buffer objects created with the same host_ptr or
                    overlapping host regions is considered to be undefined.

                    CL_MEM_ALLOC_HOST_PTR
                    This flag specifies that the application wants the OpenCL
                    implementation to allocate memory from host accessible
                    memory.
                    CL_MEM_ALLOC_HOST_PTR and
                    CL_MEM_USE_HOST_PTR are mutually exclusive.

                    There is nothing more about CL_MEM_ALLOC_HOST_PTR: will it be shadowed on the device or not?

                    I think the reference is too vague here. I need info from ATI staff on how these allocations are implemented in their current SDK version. Additional hidden host<->GPU memory transfers are too important to not know about or to ignore. They have too much impact on performance!
                      • omkaranathan

                        Raistmer,

                        Current implementation does not use pinned memory. You can expect the support in one of the upcoming releases.

                        • Illusio

                          When I say "cached" I do not mean that the entire thing would be copied and stored on the GPU, just that any hardware that isn't completely useless will have some kind of cached access to host-side memory.

                          I also agree with the thing you quoted, and that's why I mentioned "cache friendliness" and that it might be simpler to just copy. The cost of cache misses will probably be ridiculous when the GPU references host memory, so unless you've tuned your code to use prefetches to hide latency, you may well benefit from having a complete copy on the GPU. The part of the quote about creating and re-creating buffers is not an inherent part of memory mapping, though. It's fine to create a buffer just once and wrap the buffer modification code in a map/unmap pair.

                          I'm not sure what to say about the timing you got for the mapping and unmapping. On one hand, it should be wrong, because the copy operation has to do the same work as the map operation in addition to doing the copy (that is: lock the pages into RAM -> issue DMA transfer -> unlock pages on completion). On the other hand, you issued a ton of operations, so it's hard to imagine that freak task switches are responsible for those results.

                          But have you tried what I suggested a bunch of posts ago? If you're using Windows, try using SetProcessWorkingSetSize and VirtualLock to lock a memory region you've allocated yourself, create an OpenCL buffer from it, and see if that helps. If nothing else, it might stabilize the timing of the OpenCL mapping functions closer to the minimum.

                          Or just forget about the mapping and do a copy. =)

                           

                          You got me interested in the timing anyway. I'll do some testing on my machine and see how it works here.

                            • Raistmer


                              "Or just forget about the mapping and do a copy. =)"

                              It's impossible. The buffer must be unmapped before the copy and mapped to do the update from the CPU. I could use Write instead of Copy, but currently, when I replace the sequence Map/update/Unmap/Copy with Map/update/Write/Unmap (the map/unmap pair is still needed here just because the same buffer is used in other places; it's a model situation), I receive invalid results at some point in the computations (after many cycles, btw). But that's another story.
                              • Illusio

                                After some testing, it looks to me like the mapping operation tends to be near identical to a copy in time consumption. It's possible that it always does a copy as well (maintaining a full buffer on the GPU), because there is a large delay in both the mapping and unmapping operations, even in situations where it should not be necessary to do anything (such as a read mapping of a read-only buffer, which should be a no-op on the host side if the implementation were optimal).

                                However, the unmap function consumes a similar amount of time to the map function, so in total the map/unmap process is slower by a factor of 2. I suspect it does a copy every time as well. Chances are AMD has some optimization opportunities here anyway.

                                That said, I was able to cause large variations in the time each operation took, like you reported in that other thread, but they appeared to be entirely deterministic and related to what kind of memory was fed to the map or write functions. To take a stab in the dark at an explanation, I'd go with the variation being explained by the need for cache writeback on the host before issuing DMA transfers to the GPU.

                                By the way, you may want to print out all the profiling information. The 500 ns timing of the mapping in your other thread is likely due to you using the wrong profiling info. For some reason, most of the time tends to be spent between Submit and Start, not between Start and End. (I have a near constant 420 ns delay between Start and End for both copy and mapping of a 64 MB buffer. The delay between Submit and Start is between 13 and 15 million ns, though.)
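
To make the Submit/Start/End distinction concrete, here is a minimal sketch that splits one command's profiling timestamps into intervals. In real code the four values would come from `clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_QUEUED`, `CL_PROFILING_COMMAND_SUBMIT`, `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, on a queue created with `CL_QUEUE_PROFILING_ENABLE`. The struct and function names are hypothetical, and the sketch takes plain 64-bit nanosecond counters so that it builds without an OpenCL SDK.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one command's profiling counters (nanoseconds). */
typedef struct {
    uint64_t queued;   /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;   /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;    /* CL_PROFILING_COMMAND_START  */
    uint64_t end;      /* CL_PROFILING_COMMAND_END    */
} cmd_times;

/* The three consecutive intervals plus the overall time, in nanoseconds. */
typedef struct {
    uint64_t queued_to_submit;
    uint64_t submit_to_start;
    uint64_t start_to_end;
    uint64_t total;
} cmd_intervals;

/* Splits the four counters into intervals. */
cmd_intervals split_intervals(cmd_times t)
{
    cmd_intervals r;
    r.queued_to_submit = t.submit - t.queued;
    r.submit_to_start  = t.start  - t.submit;
    r.start_to_end     = t.end    - t.start;
    r.total            = t.end    - t.queued;
    return r;
}

/* Prints every interval, so that no phase is silently left out. */
void print_intervals(cmd_times t)
{
    cmd_intervals r = split_intervals(t);
    printf("queued -> submit: %llu ns\n", (unsigned long long)r.queued_to_submit);
    printf("submit -> start:  %llu ns\n", (unsigned long long)r.submit_to_start);
    printf("start  -> end:    %llu ns\n", (unsigned long long)r.start_to_end);
    printf("total:            %llu ns\n", (unsigned long long)r.total);
}
```

With made-up counters in which Submit to Start is 14,000,000 ns and Start to End is 420 ns (figures of the same order as those quoted in this post), reading only `start_to_end` would report 420 ns for a command that took about 14 ms overall.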

                            • Raistmer
                              Well, I replaced map/unmap/copy with write and got a very nice speedup.
                              A strange runtime implementation, to put it politely...