Hi all:
This is my first stab at GPU computing. I am running on 64-bit Linux with 32GB
memory and an AMD 7770 GPU with 2GB memory. The data set is large (28GB for
the largest mesh) and will be streamed through the GPU in pieces for computation.
In the best of all worlds a simple 3-buffer scheme with each buffer controlled
by a separate queue would allow transfers to and from the GPU as well as the GPU
computation to run concurrently.
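A minimal sketch of the rotation such a 3-buffer scheme implies, with no OpenCL calls. The names (pipeline_step, schedule_step, buffer_for_chunk, total_steps) are made up for illustration, and the assumption is that the data has already been cut into nchunks equal pieces. At each step one chunk is written, the previous one is computed, and the one before that is read back, so the three stages always touch three different GPU buffers.

```c
#include <assert.h>

/* Which chunk each stage handles at a given pipeline step; -1 means idle. */
typedef struct {
    long write_chunk;   /* host -> GPU transfer */
    long exec_chunk;    /* kernel execution     */
    long read_chunk;    /* GPU -> host transfer */
} pipeline_step;

/* Chunk c always lives in GPU buffer (c % 3). */
int buffer_for_chunk(long chunk)
{
    assert(chunk >= 0);
    return (int)(chunk % 3);
}

/* Stage assignment for one step of the 3-stage pipeline. */
pipeline_step schedule_step(long step, long nchunks)
{
    pipeline_step s;
    assert(step >= 0 && nchunks >= 0);
    s.write_chunk = (step < nchunks) ? step : -1;
    s.exec_chunk  = (step >= 1 && step - 1 < nchunks) ? step - 1 : -1;
    s.read_chunk  = (step >= 2 && step - 2 < nchunks) ? step - 2 : -1;
    return s;
}

/* Steps needed to fill and drain the pipeline. */
long total_steps(long nchunks)
{
    return nchunks > 0 ? nchunks + 2 : 0;
}
```

In a real host program each non-idle entry would become one enqueue call (write, kernel, read), with events keeping a buffer from being rewritten before its read-back has finished.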
To set up the CPU buffers I have tried two methods:
float* Data_s = (float*) valloc( size_S );
if( mlock( Data_s, size_S ) != 0 ) printf("*Data_s not locked\n" );
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);
or
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
To set up the GPU buffers :
cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
The runs indicate that kernel execution is overlapped but reads do not overlap writes.
The latter is a disappointment, but not totally unexpected. For small meshes the
ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample) but
the USE_HOST_PTR only runs at roughly 2/3 of that speed.
For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,
CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen!), but the
slower USE_HOST_PTR method will handle the largest mesh (28GB).
Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method
that gives the full transfer rates over the largest mesh. There are several posts
on this forum about Maps of CPU buffers requiring buffer allocation on the GPU.
Has this been fixed or a work-around provided? Also since the GCN family has dual
bi-directional DMA engines, does AMD expect to implement concurrent bi-directional transfers
in the future?
catalyst-13.1-linux-x86.x86_64 AMD-APP-SDK-v2.8-lnx64
The first thing to find out is whether VM is enabled. You can check by running "clinfo" and looking at the driver version string (it should be something like "1182.2 (VM)"). The presence of the "VM" string is what you should look for.
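The same check can be done from host code. A minimal sketch, assuming the driver version string has already been fetched (clGetDeviceInfo with CL_DRIVER_VERSION returns it); the helper name driver_has_vm is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if a driver version string such as "1084.4 (VM)" carries
 * the "(VM)" tag, 0 otherwise. */
int driver_has_vm(const char *driver_version)
{
    assert(driver_version != NULL);
    return strstr(driver_version, "(VM)") != NULL;
}
```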
Assuming VM is enabled, an AHP (CL_MEM_ALLOC_HOST_PTR) buffer will be accessed directly by the OpenCL kernel, i.e. the kernel's pointer accesses translate to PCIe transactions, which in turn access pinned memory. This means the kernel is not doing any work most of the time and is stalling (very badly) on memory operations. So the overlap that you intend is probably not happening. I suggest you allocate a buffer on the GPU and "enqueueWriteBuffer" to it.
I had noted earlier that using a UHP (CL_MEM_USE_HOST_PTR) buffer directly as a kernel argument slows it down very badly. You may want to build a small prototype to probe this.
Thanks German and Himanshu for your rapid response.
In response to Himanshu, the driver from clinfo is 1084.4 (VM). The test code
is a modified version of the HelloWorld sample and is quite simple. The kernel
simply modifies the GPU buffer to show that the data was actually moved to and
from the GPU. The large buffer on the CPU (DATA_S) is streamed in pieces to and
from the buffers on the GPU (Buffer_s1, etc) using WriteBuffer and ReadBuffer.
The host code must be able to access DATA_S but does not ever access Buffer_s1,
etc. Likewise the kernel will access Buffer_s1 but not DATA_S. The only
communication of data between CPU and GPU is via the Write/Read processes.
In response to German, I changed the command queues from separate queues for the
three buffers to separate queues for write, read, execute as you suggested and
that did enable read/write overlap. The round-trip bandwidth increased from
~12 GB/sec. to ~16 GB/sec., less than the ~22 GB/sec. that I hoped for, but a very
promising start. The different allocations implied by AHP and UHP do probably
account for the lower bandwidth of UHP (cache thrashing on the CPU) so I may be
forced to use AHP for DATA_S. But since the host needs a pointer with which to access
DATA_S, it seems I must Map it, as using Query to get an AHP pointer is explicitly
prohibited according to the OpenCL 1.2 ref. So that seems to imply that I must
somehow overcome the problem of mapping a large buffer. The problem may be in
the AMD software or in Linux. Is it possible to get more informative error
information from Map? It might also be useful to try the code under Windows to
see if Linux is the problem. If you think so, I can send you the code to try.
The code runs with DATA_S = 1.21 GB but fails at 1.23 GB.
22 GB/s? Do you have a PCIe Gen3 system? Attach the code for Windows and I'll tell you whether and how you can improve performance.
1.23GB doesn't look big, but originally the runtime didn't allow single allocations > 512MB even for AHP. The Linux base driver could be failing something; I'll try to check that. I assume you run a 64-bit build of your test? You shouldn't see this issue under Windows.
German:
I will try to attach a zip file with everything needed for Linux. For windows you
can substitute a suitable timer for WALLTIME. Other than that, the .cpp and .cl
files should work under Windows. To control the buffer sizes change NX, NY, NZ,
and BATCH. Running "transfer2 1" uses AHP, "transfer2 2" uses UHP. The largest
mesh I can run is NX=NY=NZ=1024, (with BATCH=16) 16 GB DATA_S. My machine is
an i7-3820 (32GB) HD7770 (2GB) with pcie-3.
I am now trying to avoid Mapping the large structure DATA_S and instead Mapping
each piece of it before Writing/Reading it, and UnMapping it after. Lots of
Map/UnMap. Getting seg. faults at the first Write at this point ... probably my bad.
Happy hunting.
1. You still have to prepin memory even for UHP allocations. In theory it's not really necessary, but the runtime uses the same path for AHP and UHP. Also, I believe the OpenCL 1.2 spec requires a map call for CPU access even for UHP allocations. So call clEnqueueMapBuffer for UHP just as for AHP and that should fix the app performance. Also don't forget about the unmap calls :-)
2. I can confirm that both transfers run asynchronously in HW, but when they run together DMA engine 1 is slower than DMA engine 0. On top of that, even DMA0 is slightly slower than a single transfer on either DMA0 or DMA1. So I would say 16GB/s is the best you can get for now.
German:
Yes, I did lock (prepin) DATA_S by calling mlock. On my machine any user can lock up to 4GB,
and above that I run as root. I had also tried Map for the UHP case and that Map failed just
like the one for AHP. Map simply fails if the buffer to be mapped is too large (> ~1.2GB).
As I said in my last post, I tried Mapping each separate piece ( <= .25GB) of DATA_S to get
the pointer for Read/Write, and that worked as before with DATA_S < 1.23GB but failed with
map error -12 with larger DATA_S. So the Map failure seems to be triggered by the size of
the buffer being mapped rather than the size of the region of that buffer that is Mapped.
The UHP case runs at the same rate whether or not it is prepinned and whether or not it is
Mapped. Does Windows do any better?
You don't have to call mlock. The base driver will lock memory when the UHP allocation is created. mlock has nothing to do with clEnqueueMapBuffer().
That's correct. There is a limit on the allocated AHP/UHP size in the Linux base driver. The pools have to be preallocated during boot. As far as I have heard, the limitation comes from the Linux kernel and has to be worked around. Windows should allow half of system memory for AHP/UHP allocations. The reason it works without clEnqueueMapBuffer is that the runtime defers memory allocations. Basically clCreateBuffer does nothing and the runtime allocates memory on first access. So when you call clEnqueueMapBuffer the actual allocation occurs (the error code can be fixed). Without the clEnqueueMapBuffer call the runtime doesn't know that the pointer passed to read/write buffer is a UHP allocation, so it will pin system memory in small chunks and perform multiple transfers. There are optimizations in the runtime that try to hide the pinning cost, but performance may vary depending on the CPU speed and OS. Currently the pinning cost in Linux is considerably higher than in Windows. In general it's much less efficient than prepinning (the clEnqueueMapBuffer call). In Windows with prepin the performance is identical. I ran with smaller buffers (my systems don't have 32GB RAM).
Please note: there are more limitations with big allocations. Currently any buffer allocation (AHP/UHP/Global) can't exceed the 4GB address space, but only if it is used in kernels. The runtime can work with >4GB AHP/UHP allocations for data upload/download, because transfers are done with the DMA engines and don't require a single address space.
German:
OK, I think I understand most of your response. I will look into the Linux kernel/boot pool
issue. The remaining mystery is why, even with DATA_S small enough that Map does not fail
in either the AHP or UHP case, the UHP setup (with Mapped DATA_S) does not run as fast as AHP
(~7GB/sec vs ~16GB/sec).
AHP:
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
UHP:
Data_s = (float*) valloc( size_S );
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);
Data_sx = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
status |= clEnqueueUnmapMemObject(Queue_1, DATA_S, Data_sx, 0, NULL, NULL );
Your reply indicated that Windows, using calls as above, runs both AHP and UHP at the same high rate.
Have you also tried this on a Linux system?
There are keys for the Linux base driver to increase the pools. I don't recall them, and I don't know whether they are publicly available.
You have to remove the clEnqueueUnmapMemObject call after the map. As soon as you call unmap, the runtime no longer considers the UHP buffer a prepinned allocation (no CPU access from the app). Call unmap at the end, before releasing the memory. Basically, as I mentioned before, AHP and UHP have the same behavior in the runtime.
BTW, runtime guarantees (Data_s == Data_sx) for UHP.
German:
Yes, I had checked that Data_s == Data_sx, but removing the Unmap still gives the lower
~7 GB/sec rate. If we can get UHP up to AHP speed, I might be able to get around the
Map size limit by:
1) get Data_s from valloc or equivalent (page aligned)
2) form other pointers from Data_s ( eg. p1, p2, ... one for each Read/Write transfer )
3) CreateBuffer( UHP ) small buffer for each ( an array of CPU buffers )
4) Map each one (small size ~ .25 GB ) just before Read/Write
5) Read/Write
6) Unmap small buffer
Silly idea?
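Steps 1) and 2) above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the names alloc_page_aligned, host_slice, slice_count and slice_at are invented here, posix_memalign is used in place of valloc (which is marked obsolete), and slice_bytes is assumed to be a multiple of the page size so that every slice pointer stays page aligned. Steps 3) to 6), the CreateBuffer/Map/Read/Write/Unmap calls per slice, are not shown.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* One piece of the large host array. */
typedef struct {
    char  *ptr;    /* start of the slice inside the big allocation */
    size_t bytes;  /* slice length (the last slice may be shorter) */
} host_slice;

/* Step 1: page-aligned host allocation; returns NULL on failure. */
void *alloc_page_aligned(size_t bytes)
{
    void *p = NULL;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || posix_memalign(&p, (size_t)page, bytes) != 0)
        return NULL;
    return p;
}

/* Number of slices needed to cover total bytes. */
size_t slice_count(size_t total, size_t slice_bytes)
{
    assert(slice_bytes > 0);
    return (total + slice_bytes - 1) / slice_bytes;
}

/* Step 2: pointer and length of slice i (p1, p2, ... in the list above). */
host_slice slice_at(void *base, size_t total, size_t slice_bytes, size_t i)
{
    host_slice s;
    size_t offset = i * slice_bytes;
    assert(base != NULL && i < slice_count(total, slice_bytes));
    s.ptr   = (char *)base + offset;
    s.bytes = (total - offset < slice_bytes) ? total - offset : slice_bytes;
    return s;
}
```

Each host_slice would then be the host_ptr and size handed to its own small UHP buffer.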
I still have to run your test under Linux; I haven't had time. Basically I forgot about an extra limitation under Linux: the cacheable pool is much smaller than the USWC pool, and UHP allocations go to cacheable. Personally I don't see any reason to limit UHP allocations to any pools, but that's how the memory manager under Linux works. Windows also has some limitations, but with a much bigger size. Anyway, try reducing the UHP allocations to 128MB to see if you get 16GB/s. In the case of a UHP alloc failure the runtime may disable zero-copy if necessary so that some tests can still work. That may explain your numbers. The pool size limitation under Linux can be fixed in the future, but I don't know the time frame.
Your pseudo code isn't optimal and will probably introduce bubbles between CPU/GPU executions. Any UHP allocation requires memory pinning. Pinning involves GPU page table updates, so GPU stalls are possible. I believe under Windows the VidMM scheduling thread will disable any submissions during that operation and I doubt Linux will be any more optimal than that. To be honest I'm not sure there is an optimal solution to bypass the UHP size limit, which shouldn't really exist in the first place. Well, again, it depends on the system configuration and the amount of memory requested for pinning.
I would suggest you implement a double copy to see if you can get better performance overall, running the CPU copy
asynchronously with the GPU transfers. Otherwise I think your new code shouldn't be any faster than the current setup (7GB/s and no size "limits").
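A rough sketch of what a double copy could look like on the host side, in plain C so the structure is visible without any OpenCL setup. Everything named here (transfer_fn, double_copy_upload, memcpy_transfer) is hypothetical. The assumption is that the two staging buffers would be small prepinned AHP buffers that stay mapped, and that the transfer callback would wrap a non-blocking clEnqueueWriteBuffer. As written, the stand-in callback copies synchronously, so this shows the data flow only, not the overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the GPU transfer, e.g. a wrapper around a non-blocking
 * clEnqueueWriteBuffer from the staging buffer. */
typedef void (*transfer_fn)(const void *staging, size_t bytes,
                            size_t dst_offset, void *user);

/* Stand-in transfer: "uploads" by copying into an ordinary array
 * passed through user. */
void memcpy_transfer(const void *staging, size_t bytes,
                     size_t dst_offset, void *user)
{
    memcpy((char *)user + dst_offset, staging, bytes);
}

/* Copies src piece by piece into two alternating staging buffers and
 * hands each filled buffer to transfer. Returns the number of
 * transfers issued. */
size_t double_copy_upload(const char *src, size_t total,
                          char *staging[2], size_t staging_bytes,
                          transfer_fn transfer, void *user)
{
    size_t issued = 0;
    size_t off;
    assert(staging_bytes > 0 && transfer != NULL);
    for (off = 0; off < total; off += staging_bytes, ++issued) {
        size_t n = (total - off < staging_bytes) ? total - off
                                                 : staging_bytes;
        char *buf = staging[issued % 2];
        /* Real code would first wait on the event of the transfer
         * that last used buf before overwriting it. */
        memcpy(buf, src + off, n);       /* CPU copy into staging  */
        transfer(buf, n, off, user);     /* hand off to the "GPU"  */
    }
    return issued;
}
```

With a non-blocking transfer, the CPU copy into one staging buffer can proceed while the DMA engine drains the other, which is the overlap suggested above.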
German:
You are correct ... reducing DATA_S to 128MB has both AHP and UHP running at ~12GB/sec.
Probably lower than 16GB/sec due to overhead relative to smaller transfers. I will try
my silly idea just to see what happens. I am willing to reserve at boot a large pool in physical
memory, but I do not know how to configure it so that UHP will recognize it. For now,
since I have a workable, if slow, UHP method I will proceed to the more interesting job
of the kernels. If you have any further thoughts on this, let me know ... and Thanks
for your help.