OK, I think I understand most of your response. I will look into the Linux kernel/boot pool issue. The remaining mystery is why, even with DATA_S small enough that Map does not fail in either the AHP or UHP case, the UHP setup (with mapped DATA_S) does not run as fast as AHP (~7 GB/sec vs ~16 GB/sec).
// AHP case: the runtime allocates pinned host memory, mapped for CPU access
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status);

// UHP case: the app supplies page-aligned host memory
Data_s  = (float*) valloc(size_S);
DATA_S  = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);
Data_sx = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status);
status |= clEnqueueUnmapMemObject(Queue_1, DATA_S, Data_sx, 0, NULL, NULL);
Your reply indicated that Windows, using calls as above, runs both AHP and UHP at the same high rate.
Have you also tried this on a Linux system?
There are keys for the Linux base driver to increase the pool sizes, but I don't recall them, and I don't know whether they are publicly available.
You have to remove the clEnqueueUnmapMemObject call after the map. As soon as you call unmap, the runtime no longer considers the UHP allocation prepinned (no CPU access from the app). Call unmap only at the end, just before releasing the memory. Basically, as I mentioned before, AHP and UHP have the same behavior in the runtime.
BTW, the runtime guarantees (Data_s == Data_sx) for UHP.
Yes, I had checked that Data_s == Data_sx, but removing the Unmap still gives the lower
~7 GB/sec rate. If we can get UHP up to AHP speed, I might be able to get around the
Map size limit by:
1) get Data_s from valloc or equivalent (page aligned)
2) form other pointers from Data_s (e.g. p1, p2, ... one for each Read/Write transfer)
3) CreateBuffer( UHP ) a small buffer for each (an array of CPU buffers)
4) Map each one (small size, ~0.25 GB) just before the Read/Write
5) Unmap the small buffer
I still have to run your test under Linux; I didn't have time. Basically, I forgot about an extra limitation under Linux: the cacheable pool is much smaller than the USWC pool, and UHP allocations go to the cacheable pool. Personally, I don't see any reason to limit UHP allocations to any pool, but that's how the memory manager under Linux works. Windows also has limitations, but with a much bigger size. Anyway, try reducing the UHP allocations to 128 MB to see if you get 16 GB/s. If a UHP allocation fails, the runtime may disable zero-copy where necessary so that some tests can still work; that may explain your numbers. The pool size limitation under Linux could be fixed in the future, but I don't know the time frame.
Your pseudocode isn't optimal and will introduce bubbles between CPU and GPU execution. Any UHP allocation requires memory pinning, pinning involves a GPU page-table update, and GPU stalls are possible. I believe that under Windows the VidMM scheduling thread disables any submissions during that operation, and I doubt Linux is any more optimal than that. To be honest, I'm not sure there is an optimal way to bypass the UHP size limit, which shouldn't really exist in the first place. Then again, it depends on the system configuration and the amount of memory requested for pinning.
I would suggest implementing a double copy to see if you can get better overall performance, running the CPU copy asynchronously with the GPU transfers. Otherwise I think your new code won't be any faster than the current 7 GB/s, just without the size "limits".
You are correct ... reducing DATA_S to 128 MB has both AHP and UHP running at ~12 GB/sec, probably lower than 16 GB/sec due to the overhead relative to the smaller transfers. I will try my silly idea just to see what happens. I am willing to reserve a large pool in physical memory at boot, but I do not know how to configure it so that UHP will recognize it. For now, since I have a workable, if slow, UHP method, I will proceed to the more interesting job of the kernels. If you have any further thoughts on this, let me know ... and thanks for your help.