I'm not sure I understand your questions, but I'll give it a shot:
1) What makes you think CL_MEM_ALLOC_HOST_PTR allocates pinned memory? The standard says this flag requests "host accessible memory", not necessarily from the nonpaged pool. (Although I guess it's pretty likely that, whatever the type of memory, it will be locked down by the driver while the GPU is using the buffer.)
2) Given that 1 seems to arise from a misunderstanding of the spec: If you want nonpaged memory, with specific caching properties, why don't you just allocate it yourself on the host and create your buffer using CL_MEM_USE_HOST_PTR?
Originally posted by: Illusio
I'm not sure I understand your questions, but I'll give it a shot:
1) What makes you think CL_MEM_ALLOC_HOST_PTR allocates pinned memory? The standard says this flag requests "host accessible memory", not necessarily from the nonpaged pool. (Although I guess it's pretty likely that, whatever the type of memory, it will be locked down by the driver while the GPU is using the buffer.)
2) Given that 1 seems to arise from a misunderstanding of the spec: If you want nonpaged memory, with specific caching properties, why don't you just allocate it yourself on the host and create your buffer using CL_MEM_USE_HOST_PTR?
1. Ahh. I have no idea what the default is, but you have no guarantees of any specific behavior in that respect.
2. No. When you create a buffer using CL_MEM_USE_HOST_PTR you pass along a pointer to a memory area in host memory that you have allocated previously. The contents are not copied (but can be cached on the GPU during operations). You also do not have to re-create any buffers. The two flags are almost identical, except that CL_MEM_ALLOC_HOST_PTR does the allocation of host memory for you. That makes CL_MEM_USE_HOST_PTR an optimization for when you already have the data available in host memory somewhere: you don't need to copy anything into a new OpenCL buffer, and it also gives you the freedom to tweak paging settings if required.
You're probably confusing it with CL_MEM_COPY_HOST_PTR btw.
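To make the difference concrete, something like this is what I mean - just a sketch, assuming you already have a cl_context called ctx; the size is arbitrary and error handling is left out:

/* Two ways to get a host-accessible buffer, as discussed above. */
#include <CL/cl.h>
#include <stdlib.h>

#define BUF_SIZE (64 * 1024 * 1024)

cl_mem create_buffers(cl_context ctx)
{
    cl_int err;

    /* Option A: you allocate the host memory yourself (so you control
       paging/caching), and the runtime just wraps it. Nothing is copied. */
    void *host_ptr = malloc(BUF_SIZE);
    cl_mem use_buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    BUF_SIZE, host_ptr, &err);

    /* Option B: the runtime allocates host-accessible memory for you.
       Whether that memory is pinned is up to the implementation. */
    cl_mem alloc_buf = clCreateBuffer(ctx,
                                      CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                      BUF_SIZE, NULL, &err);

    /* CL_MEM_COPY_HOST_PTR, by contrast, would copy the contents of
       host_ptr into a new buffer at creation time. */
    (void)alloc_buf;
    return use_buf;
}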
Anyway, am I to understand that you really don't care whether the memory is pinned or not - you're just worried about a potential performance hit if the memory is uncachable? In that case, allocate the memory yourself and use CL_MEM_USE_HOST_PTR, or try to write your host code in a way that minimizes the potential problem (such as using streaming SSE instructions that read/write huge chunks that aren't cached anyway).
Given that the spec is entirely silent on the pinning issue, I'd think this is something that could change from hardware to hardware in both NVIDIA's and ATI's implementations anyway.
Originally posted by: Illusio
You're probably confusing it with CL_MEM_COPY_HOST_PTR btw.
Anyway, am I to understand that you really don't care whether the memory is pinned or not - you're just worried about a potential performance hit if the memory is uncachable?
(such as using streaming SSE instructions that read/write huge chunks that aren't cached anyway).
Yes, I was thinking of explicitly using the noncached SSE instructions (I thought Intel referred to those as "streaming", but it's been a while since I coded at that level myself), but sadly it sounds like that would be a worthless option for you from what you write.
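For reference, this is roughly what I had in mind with the streaming stores - just a sketch using SSE2 intrinsics; it assumes dst is 16-byte aligned and the byte count is a multiple of 16:

/* Non-temporal ("streaming") stores: _mm_stream_si128 writes bypass the
   cache, so filling a large host buffer this way avoids polluting (and
   later flushing) the CPU cache. */
#include <emmintrin.h>
#include <stddef.h>

void stream_fill(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

    _mm_sfence();  /* make the non-temporal writes globally visible */
}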
You can prepare data right in the host buffer if you want, but the buffer must be mapped using clEnqueueMapBuffer before you start modifying it. This is needed to notify the runtime that it has to invalidate any GPU-side caching of the data (or possibly copy data back if the GPU can modify the buffer).
You then have to do a clEnqueueUnmapMemObject after your modifications.
Obviously, you can't make such changes while the buffer is in use by the GPU. I'm not sure if this clarified anything with regard to the caching issue? The GPU will cache host-side buffers on any realistic hardware, and it is able to do so precisely because of the synchronization mechanisms above.
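In code, the cycle looks roughly like this (a sketch only - it assumes an existing queue and buffer, uses a blocking map for simplicity, and skips error handling):

/* Map -> modify -> unmap cycle for updating a buffer from the host. */
#include <CL/cl.h>
#include <string.h>

void update_buffer(cl_command_queue queue, cl_mem buf,
                   const void *new_data, size_t size)
{
    cl_int err;

    /* Blocking map: the runtime synchronizes/invalidates any GPU-side
       copy before handing us a host pointer. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);

    memcpy(p, new_data, size);   /* prepare the data in place */

    /* Unmap tells the runtime the host is done, so the GPU may use
       (and re-cache) the buffer again. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}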
The main performance issue when updating is probably the amount of memory you have to map to complete your host-side modifications, because all of the mapped memory must be invalidated from the GPU's cache. If you need to map all of it, performance may well end up worse than a full buffer copy, since GPU stalls on cache misses are more expensive when the data has to be fetched from host memory. Depending on how cache friendly your application is, it might be simpler to just copy the modified buffer to the GPU.
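Note that clEnqueueMapBuffer takes an offset and a size, so if only a small region changes you can map just that region rather than the whole buffer - a variation of the sketch above (names are placeholders, error handling omitted):

/* Map only the sub-range being modified, so only that region needs to be
   invalidated/synchronized. */
#include <CL/cl.h>

void patch_region(cl_command_queue queue, cl_mem buf,
                  size_t offset, size_t len, const unsigned char *patch)
{
    cl_int err;
    unsigned char *p = (unsigned char *)
        clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                           offset, len, 0, NULL, NULL, &err);
    for (size_t i = 0; i < len; ++i)
        p[i] = patch[i];
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}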
It sounds like a bad warning sign that data transfer should dominate your application's execution time, by the way. Have you tried running it on the CPU device? When memory transfer is the bottleneck, it's quite possible that the CPU will be faster than the GPU if you have a new CPU with lots of cores, and then you don't have to worry about all the issues with anti-social pinning of memory and the like.
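Picking the CPU device is just a matter of asking for CL_DEVICE_TYPE_CPU when enumerating devices - a minimal sketch assuming a single platform and no error handling:

#include <CL/cl.h>

cl_device_id get_cpu_device(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    return device;
}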
Raistmer,
The current implementation does not use pinned memory. You can expect support in one of the upcoming releases.
Originally posted by: omkaranathan
Raistmer,
The current implementation does not use pinned memory. You can expect support in one of the upcoming releases.
When I say "cached" I do not mean that the entire thing would be copied and stored on the GPU, just that any hardware that isn't completely useless will have some kind of cached access to host-side memory.
I also agree with the part you quoted, and that's why I mentioned cache friendliness and that it might be simpler to just copy. The cost of cache misses will probably be ridiculous when the GPU references host memory, so unless you've tuned your code to use prefetches to hide the latency, you may well benefit from having a complete copy on the GPU. The part about creating and recreating buffers is not an inherent part of memory mapping, though: it's fine to create a buffer once and wrap the buffer modification code in a map/unmap pair.
I'm not sure what to say about the timing you got on the mapping and unmapping. On one hand, it looks wrong, because the copy operation has to do the same work as the map operation in addition to the copy itself (that is: lock the pages into RAM -> issue the DMA transfer -> unlock the pages on completion); on the other hand, you issued a ton of operations, so it's hard to imagine that freak task switches are responsible for those results.
But have you tried what I suggested a bunch of posts ago? If you're using Windows, try using SetProcessWorkingSetSize and VirtualLock to lock a memory region you've allocated yourself, create an OpenCL buffer from it, and see if that helps. If nothing else, it might stabilize the timing of the OpenCL mapping functions closer to the minimum.
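Roughly what I'm suggesting, as a Windows-only sketch (the working-set numbers are just illustrative slack, and error handling is omitted):

/* Raise the working-set limit, lock the pages, then wrap them in an
   OpenCL buffer via CL_MEM_USE_HOST_PTR. */
#include <windows.h>
#include <CL/cl.h>

cl_mem make_locked_buffer(cl_context ctx, size_t size)
{
    cl_int err;

    /* Allow the process to hold `size` bytes of locked pages (plus slack). */
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size + (16u << 20), size + (32u << 20));

    void *p = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                           PAGE_READWRITE);
    VirtualLock(p, size);   /* pin the region in physical memory */

    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          size, p, &err);
}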
Or just forget about the mapping and do a copy. 😃
You got me interested in the timing anyway. I'll do some testing on my machine and see how it works here.
Or just forget about the mapping and do a copy. 😃
Some testing later, it looks to me like the mapping operation tends to be nearly identical in time consumption to a copy. It's possible that it always does a copy as well (maintaining a full buffer on the GPU), because there is a large delay in both the mapping and unmapping operations, even in situations where nothing should need to be done (such as a read mapping of a read-only buffer, which should be a no-op on the host side in an optimal implementation).
However, the unmap function consumes a similar amount of time to the map function, so in total the map/unmap process is slower than a copy by roughly a factor of 2. I suspect it does a copy every time as well. Chances are AMD has some optimization opportunities here anyway.
That said, I was able to cause large variations in the time each operation took, like you reported in that other thread, but it appeared to be entirely deterministic and related to what kind of memory was fed to the map or write functions. To take a stab in the dark at an explanation, I'd guess the variation is explained by the need for cache writeback on the host before the DMA transfers to the GPU are issued.
By the way, you may want to print out all the profiling information. The 500 ns timing on the mapping in your other thread is likely due to reading the wrong profiling info. Most of the time tends to be spent between Submit and Start, not between Start and End, for some reason. (I see a near-constant 420 ns delay between Start and End for both copy and mapping of a 64 MB buffer; the delay between Submit and Start is between 13 and 15 million ns, though.)
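For completeness, this is how you can dump all four timestamps for an event (the queue has to be created with CL_QUEUE_PROFILING_ENABLE); a small sketch with no error handling:

/* Print the Queued->Submit->Start->End breakdown for a single event. */
#include <CL/cl.h>
#include <stdio.h>

void print_profile(cl_event ev)
{
    cl_ulong queued, submit, start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT,
                            sizeof(submit), &submit, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("queued->submit: %llu ns\n", (unsigned long long)(submit - queued));
    printf("submit->start:  %llu ns\n", (unsigned long long)(start - submit));
    printf("start->end:     %llu ns\n", (unsigned long long)(end - start));
}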