Hi all:
This is my first stab at GPU computing. I am running on 64-bit Linux with 32GB
memory and an AMD 7770 GPU with 2GB memory. The data set is large (28GB for
the largest mesh) and will be streamed through the GPU in pieces for computation.
In the best of all worlds a simple 3-buffer scheme, with each buffer controlled
by a separate queue, would allow transfers to and from the GPU as well as the GPU
computation to run concurrently.
To set up the CPU buffers I have tried two methods:
float* Data_s = (float*) valloc( size_S );
if( mlock( Data_s, size_S ) != 0 ) printf("*Data_s not locked\n" );
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);
or
DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
To set up the GPU buffers:
cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
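To make the intended pipeline concrete, here is a stripped-down sketch of the streaming loop I have in mind. The kernel, Queue_2/Queue_3 and Buffer_s2/Buffer_s3 (created like Queue_1 and Buffer_s1) are placeholders rather than my actual code, and error checking is omitted:

cl_mem buf[3] = { Buffer_s1, Buffer_s2, Buffer_s3 };      /* one device buffer per pipeline slot */
cl_command_queue q[3] = { Queue_1, Queue_2, Queue_3 };    /* one in-order queue per buffer */
size_t chunk = size_s;                                    /* bytes per piece; assume size_S % chunk == 0 */
size_t nchunks = size_S / chunk;
size_t global = chunk / sizeof(float);                    /* one work-item per float in a piece */
for (size_t i = 0; i < nchunks; ++i) {
    int b = i % 3;                                        /* rotate through the three buffers */
    float* src = Data_s + i * (chunk / sizeof(float));
    clEnqueueWriteBuffer(q[b], buf[b], CL_FALSE, 0, chunk, src, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[b]);   /* argument value is captured at enqueue time */
    clEnqueueNDRangeKernel(q[b], kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q[b], buf[b], CL_FALSE, 0, chunk, src, 0, NULL, NULL);
}
for (int b = 0; b < 3; ++b) clFinish(q[b]);

Within each in-order queue the write/kernel/read for a given buffer are automatically ordered, and the next reuse of that buffer lands on the same queue, so no events are needed; the hope is that the runtime overlaps the three queues against each other.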
The runs indicate that kernel execution is overlapped, but reads do not overlap writes.
The latter is a disappointment, but not totally unexpected. For small meshes the
ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample) but
the USE_HOST_PTR method only runs at roughly 2/3 of that speed.
For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,
CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen for these buffers!),
but the slower USE_HOST_PTR method will handle the largest mesh (28GB).
Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method
that gives the full transfer rates over the largest mesh. There are several posts
on this forum about Maps of CPU buffers requiring buffer allocation on the GPU.
Has this been fixed or a work-around provided? Also since the GCN family has dual
bi-directional DMA engines, does AMD expect to implement concurrent bi-directional transfers
in the future?
catalyst-13.1-linux-x86.x86_64 AMD-APP-SDK-v2.8-lnx64
First thing to find out is whether VM is enabled or not. You can check that by running "clinfo" and looking at the driver version string (it should be something like "1182.2 (VM)"). The presence of the "VM" string is what you should look for.
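If you want to check from inside the program instead of parsing clinfo output, the same string is available via CL_DRIVER_VERSION (small sketch, assuming your cl_device_id is in "device"; error checks omitted):

char drv[256];
clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(drv), drv, NULL);
printf("driver: %s\n", drv);    /* look for "(VM)" in this string */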
Assuming VM is enabled, an AHP buffer will be directly accessed by the OpenCL kernel, i.e. the kernel's pointer accesses translate to PCIe transactions which in turn access the pinned host memory. This means that the kernel is not doing any work most of the time and is stalling (very badly) on memory operations. So the overlap that you intend to create is probably not happening. I suggest you allocate a buffer on the GPU and "enqueueWriteBuffer" to it.
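Roughly what I mean, as a sketch (names are placeholders): stage each piece through a device-resident buffer so the kernel never dereferences pinned host memory over PCIe.

cl_mem devbuf = clCreateBuffer(context, CL_MEM_READ_WRITE, chunk_bytes, NULL, &status);  /* lives in GPU memory */
clEnqueueWriteBuffer(queue, devbuf, CL_FALSE, 0, chunk_bytes, host_ptr, 0, NULL, NULL);  /* host_ptr = mapped AHP/UHP pointer */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &devbuf);   /* pass the device buffer, not the AHP buffer */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, devbuf, CL_FALSE, 0, chunk_bytes, host_ptr, 0, NULL, NULL);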
I had earlier noted that using a "UHP" buffer directly as a kernel argument slows things down very badly. You may want to build a small prototype to probe this.
Thanks German and Himanshu for your rapid response.
In response to Himanshu, the driver from clinfo is 1084.4 (VM). The test code
is a modified version of the HelloWorld sample and is quite simple. The kernel
simply modifies the GPU buffer to show that the data was actually moved to and
from the GPU. The large buffer on the CPU (DATA_S) is streamed in pieces to and
from the buffers on the GPU (Buffer_s1, etc) using WriteBuffer and ReadBuffer.
The host code must be able to access DATA_S but does not ever access Buffer_s1,
etc. Likewise the kernel will access Buffer_s1 but not DATA_S. The only
communication of data between CPU and GPU is via the Write/Read calls.
In response to German, I changed the command queues from separate queues for the
three buffers to separate queues for write, read, execute as you suggested and
that did enable read/write overlap. The round-trip bandwidth increased from
~12 GB/sec. to ~16 GB/sec., less than the ~22 GB/sec. that I hoped for, but a very
promising start (the revised queue layout is sketched at the end of this post). The different
allocations implied by AHP and UHP probably account for the lower bandwidth of UHP (cache
thrashing on the CPU), so I may be forced to use AHP for DATA_S. But since the host needs a
pointer with which to access DATA_S, it seems I must Map it, since querying the buffer for
an AHP pointer is explicitly prohibited according to the OpenCL 1.2 reference. So that seems to imply that I must
somehow overcome the problem of mapping a large buffer. The problem may be in
the AMD software or in Linux. Is it possible to get more informative error
information from Map? It might also be useful to try the code under Windows to
see if Linux is the problem. If you think so, I can send you the code to try.
The code runs with DATA_S = 1.21 GB but fails at 1.23 GB.
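For reference, the revised structure looks roughly like this, reusing buf[], chunk, nchunks and global from the sketch in my first post (event names are illustrative; clReleaseEvent calls and error checks omitted):

cl_command_queue qWrite, qExec, qRead;        /* one in-order queue per operation, created elsewhere */
cl_event wrote[3], ran[3], readback[3];       /* per-buffer dependency events */
for (size_t i = 0; i < nchunks; ++i) {
    int b = i % 3;
    float* src = Data_s + i * (chunk / sizeof(float));
    cl_uint nwait = (i >= 3) ? 1 : 0;         /* don't overwrite a buffer until its previous read finished */
    clEnqueueWriteBuffer(qWrite, buf[b], CL_FALSE, 0, chunk, src,
                         nwait, nwait ? &readback[b] : NULL, &wrote[b]);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[b]);
    clEnqueueNDRangeKernel(qExec, kernel, 1, NULL, &global, NULL, 1, &wrote[b], &ran[b]);
    clEnqueueReadBuffer(qRead, buf[b], CL_FALSE, 0, chunk, src, 1, &ran[b], &readback[b]);
}
clFinish(qWrite); clFinish(qExec); clFinish(qRead);

With this layout the write of piece i+1, the kernel for piece i, and the read of piece i-1 can all be in flight at once, which is what finally let the reads overlap the writes.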
22 GB/s? Do you have a PCIe Gen3 system? Attach the code for Windows and I'll tell you if you can improve performance and how.
1.23GB doesn't look big, but originally the runtime didn't allow single allocations > 512MB even for AHP. The Linux base driver could be failing somewhere. I'll try to check that. I assume you run a 64-bit build of your test? You shouldn't see this issue under Windows.
German:
I will try to attach a zip file with everything needed for Linux. For Windows you
can substitute a suitable timer for WALLTIME. Other than that, the .cpp and .cl
files should work under Windows. To control the buffer sizes change NX, NY, NZ,
and BATCH. Running "transfer2 1" uses AHP, "transfer2 2" uses UHP. The largest
mesh I can run is NX=NY=NZ=1024 (with BATCH=16), giving a 16 GB DATA_S. My machine is
an i7-3820 (32GB) with an HD7770 (2GB) on PCIe 3.
I am now trying to avoid Mapping the large structure DATA_S and instead Mapping
each piece of it before Writing/Reading it, and UnMapping it after. Lots of
Map/UnMap. Getting seg. faults at the first Write at this point ... probably my bad.
Happy hunting.
1. You still have to prepin memory even for the UHP allocations. In theory it's not really necessary; however, the runtime uses the same path for AHP and UHP. Also, I believe the OpenCL 1.2 spec requires a map call for CPU access even for UHP allocations. So call clEnqueueMapBuffer for UHP just as for AHP (see the sketch after point 2) and that should fix the app's performance. Also don't forget about the unmap calls :-)
2. I can confirm that both transfers are running asynchronously in HW, but when they run together DMA engine 1 is slower than DMA engine 0. On top of that, even DMA0 is slightly slower than a single transfer on either DMA0 or DMA1. So I would say 16GB/s is the best you can get for now.
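Something like this around the host-side access, as a sketch (the same pattern works for a sub-range of the buffer by passing a non-zero offset):

float* p = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                       0, size_S, 0, NULL, NULL, &status);  /* prepins and returns the host pointer */
/* ... CPU reads/writes through p; Write/Read transfers use p as source/destination ... */
clEnqueueUnmapMemObject(Queue_1, DATA_S, p, 0, NULL, NULL);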
German:
Yes, I did lock (prepin) DATA_S by calling mlock. On my machine any user can lock up to 4GB,
and above that I run as root. I had also tried Map for the UHP case and that Map failed just
like the one for AHP. Map simply fails if the buffer to be mapped is too large (> ~1.2GB).
As I said in my last post, I tried Mapping each separate piece ( <= .25GB) of DATA_S to get
the pointer for Read/Write, and that worked as before with DATA_S < 1.23GB but failed with
map error -12 with larger DATA_S. So the Map failure seems to be triggered by the size of
the buffer being mapped rather than the size of the region of that buffer that is Mapped.
The UHP case runs at the same rate whether or not it is prepinned and whether or not it is
Mapped. Does Windows do any better?
You don't have to call mlock. The base driver will lock the memory when the UHP allocation is created. mlock has nothing to do with clEnqueueMapBuffer().
That's correct. There is a limit on the allocated AHP/UHP size in the Linux base driver. The pools have to be preallocated during boot. As far as I heard, the limitation comes from the Linux kernel and has to be worked around. Windows should allow half of system memory for AHP/UHP allocations.
The reason it works without clEnqueueMapBuffer is that the runtime defers memory allocations. Basically clCreateBuffer does nothing and the runtime allocates memory on the first access, so when you call clEnqueueMapBuffer the actual allocation occurs (the error code can be fixed). Without a clEnqueueMapBuffer call the runtime doesn't know that the pointer passed to read/write buffer is a UHP allocation, so it will pin system memory in small chunks and perform multiple transfers. There are optimizations in the runtime that try to hide the pinning cost, but performance may vary depending on the CPU speed and OS. Currently pinning in Linux is quite a bit more expensive than in Windows, and in general it's much less efficient than prepinning (the clEnqueueMapBuffer call). In Windows with prepinning the performance is identical; I ran with smaller buffers (my systems don't have 32GB RAM).
Please note: there are more limitations with big allocations. Currently any buffer allocation (AHP/UHP/Global) can't exceed 4GB of address space, but only if it is used in kernels. The runtime can work with >4GB AHP/UHP allocations for data upload/download, because those transfers are done with the DMA engines and don't require a single address space.