Journeyman III

streaming large datasets through the GPU

Hi all:

This is my first stab at GPU computing.  I am running on 64-bit Linux with 32GB

memory and an AMD 7770 GPU with 2GB memory.  The data set is large (28GB for

the largest mesh) and will be streamed through the GPU in pieces for computation.

In the best of all worlds a simple 3-buffer scheme with each buffer controlled

by a separate queue would allow transfers to and from the GPU as well as the GPU

computation to run concurrently.
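
A minimal sketch of the bookkeeping behind such a 3-buffer scheme, to make the intended overlap concrete. It is not the poster's code: the function names and the 256 MB piece size used in the note below are assumptions for illustration. At each pipeline step one piece is written to the GPU, the previous piece is computed on, and the piece before that is read back, so the three stages always touch three different buffers.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUFFERS 3  /* one being written, one computed on, one read back */

/* Number of pieces needed to stream total_bytes in piece_bytes chunks
   (rounded up, so a shorter final piece is counted). */
size_t piece_count(unsigned long long total_bytes, unsigned long long piece_bytes)
{
    assert(piece_bytes > 0);
    return (size_t)((total_bytes + piece_bytes - 1) / piece_bytes);
}

/* GPU buffer slot that holds a given piece (round-robin over the buffers). */
int buffer_for_piece(size_t piece)
{
    return (int)(piece % NUM_BUFFERS);
}

/* Piece handled by a stage at a given pipeline step, or -1 if that stage
   is idle. Stage 0 = write to GPU, 1 = kernel, 2 = read back to host. */
long piece_in_stage(size_t step, int stage, size_t pieces)
{
    assert(stage >= 0 && stage < NUM_BUFFERS);
    if (step < (size_t)stage)
        return -1;                      /* pipeline still filling */
    size_t piece = step - (size_t)stage;
    return piece < pieces ? (long)piece : -1;  /* -1 while draining */
}
```

With an assumed 256 MB piece size, a 28GB data set splits into 112 pieces, and the pipeline needs 114 steps to fill, run, and drain.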

To set up the CPU buffers I have tried two methods:

    /* Method 1 (USE_HOST_PTR): page-aligned, locked host memory wrapped in a buffer */

    float* Data_s = (float*) valloc( size_S );

    if( mlock( Data_s, size_S ) != 0 )    printf("*Data_s not locked\n" );

    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);


    /* Method 2 (ALLOC_HOST_PTR): runtime-allocated host buffer, mapped to get a host pointer */

    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);

    float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );

To set up the GPU buffers :

    cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);

The runs indicate that Kernel execution is overlapped but reads do not overlap writes.

The latter is a disappointment, but not totally unexpected.  For small meshes the

ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample) but

the USE_HOST_PTR only runs at roughly 2/3 of that speed. 

For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,

CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen!), but the slower USE_HOST_PTR

method will handle the largest mesh (28GB).

Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method

that gives the full transfer rates over the largest mesh.  There are several posts

on this forum about Maps of CPU buffers requiring buffer allocation on the GPU.

Has this been fixed or a work-around provided?  Also since the GCN family has dual

bi-directional DMA engines, does AMD expect to implement concurrent bi-directional transfers

in the future?

catalyst-13.1-linux-x86.x86_64    AMD-APP-SDK-v2.8-lnx64

14 Replies

Re: streaming large datasets through the GPU

  • AHP (ALLOC_HOST_PTR) from your code is USWC memory and UHP (USE_HOST_PTR) is cacheable.  Performance from the GPU side should be identical, but USWC doesn't pollute the CPU cache. I would suggest creating a simple test just for the data transfer to see if you can reproduce the performance difference.
  • What’s the size of AHP buffer? USWC allocation by default doesn’t request the CPU virtual address. That requires an extra operation – map(), which may fail for whatever reason. The error code can be changed, but it doesn’t mean runtime won’t fail the call.
  • I’m not sure about the allocation issue. Is it VM/zerocopy? HD7770 supports VM – no allocations on GPU for CPU mem.
  • Bidirectional transfers should work in the latest driver. Usually Windows is the main target for all performance tuning, because MS has the advanced tools for GPU profiling. However, I don’t expect any major issues under Linux. The OpenCL runtime pairs 2 CPs (command processors) with 2 DMA engines. So if the application creates 3 queues, then 2 queues will be assigned to CP0 and DMA0, and 1 queue to CP1 and DMA1. The application has to make sure read and write transfers go to different DMA engines without any synchronization between them.
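
The queue-to-engine pairing described in the last point can be pictured with a small sketch. The reply does not say how the runtime picks an engine for each queue, so the alternating-by-creation-order rule below is purely an assumption for illustration, and the function names are hypothetical. It does reproduce the stated outcome for 3 queues: two on DMA0, one on DMA1.

```c
#include <assert.h>

#define NUM_DMA_ENGINES 2

/* ASSUMPTION (not confirmed in the thread): queues are bound to CP/DMA
   pairs alternately, in the order the application creates them. */
int dma_engine_for_queue(int creation_index)
{
    assert(creation_index >= 0);
    return creation_index % NUM_DMA_ENGINES;
}

/* Returns 1 if the write and read queues land on different DMA engines,
   which is the condition for the two transfer directions to overlap. */
int transfers_can_overlap(int write_queue_index, int read_queue_index)
{
    return dma_engine_for_queue(write_queue_index)
        != dma_engine_for_queue(read_queue_index);
}
```

Under this assumed rule, creating the write queue first and the read queue second puts them on different engines, while creating them first and third would put both on DMA0.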

Re: streaming large datasets through the GPU

The first thing to find out is whether VM is enabled. You can check by running "clinfo" and looking at the driver version string (it should be something like "1182.2 (VM)"). The presence of the "VM" string is what you should look for.

Assuming VM is enabled, the AHP buffer will be accessed directly by the OpenCL kernel, i.e. the kernel's pointer accesses will translate to PCIe transactions, which in turn access pinned memory. This means that the kernel is not doing any work most of the time and is stalling (very badly) on memory operations. So the overlap that you intend to achieve is probably not happening. I suggest allocating a buffer inside the GPU and using "enqueueWriteBuffer" to fill it.

I had earlier noted that using "UHP" directly as kernel argument - slows it down very badly. You may want to build a small prototype to probe this.

Journeyman III

Re: streaming large datasets through the GPU

Thanks German and Himanshu for your rapid response.

In response to Himanshu, the driver from clinfo is 1084.4 (VM).  The test code

is a modified version of the HelloWorld sample and is quite simple.  The kernel

simply modifies the GPU buffer to show that the data was actually moved to and

from the GPU. The large buffer on the CPU (DATA_S) is streamed in pieces to and

from the buffers on the GPU (Buffer_s1, etc) using WriteBuffer and ReadBuffer.

The host code must be able to access DATA_S but does not ever access Buffer_s1,

etc.  Likewise the kernel will access Buffer_s1 but not DATA_S.  The only

communication of data between CPU and GPU is via the Write/Read processes.

In response to German, I changed the command queues from separate queues for the

three buffers to separate queues for write, read, execute as you suggested and

that did enable read/write overlap.  The round-trip bandwidth increased from

~12 GB/sec. to ~16 GB/sec., less than the ~22 GB/sec. that I hoped for, but a very

promising start. The different allocations implied by AHP and UHP do probably

account for the lower bandwidth of UHP (cache thrashing on the CPU) so I may be

forced to use AHP for DATA_S.  But since host needs a pointer with which to access

DATA_S it seems I must Map it as using Query to get an AHP pointer is explicitly

prohibited according to the opencl 1.2 ref.  So that seems to imply that I must

somehow overcome the problem of mapping a large buffer.  The problem may be in

the AMD software or in Linux.  Is it possible to get more informative error

information from Map?  It might also be useful to try the code under Windows to

see if Linux is the problem.  If you think so, I can send you the code to try.

The code runs with DATA_S = 1.21 GB but fails at 1.23 GB.


Re: streaming large datasets through the GPU

22 GB/s? Do you have a PCIe Gen3 system? Attach the code for Windows and I'll tell you if you can improve performance and how.

1.23GB doesn't look big, but originally runtime didn't allow single allocations > 512MB even for AHP. Linux base driver could fail something. I'll try to check that. I assume you run a 64bit build of your test? You shouldn't see this issue under Windows.
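
For context on the 22 GB/s figure, here is a rough sketch of the theoretical PCIe 3.0 link rate. The x16 link width is an assumption (the thread does not state it), and the function name is hypothetical. PCIe 3.0 signals at 8 GT/s per lane with 128b/130b encoding; packet and protocol overhead are ignored here, so achievable throughput is lower than this number.

```c
#include <assert.h>

/* Theoretical one-direction PCIe 3.0 bandwidth in GB/s (10^9 bytes/s),
   before packet and protocol overhead. */
double pcie3_gbytes_per_sec(int lanes)
{
    assert(lanes > 0);
    const double gigatransfers_per_sec = 8.0;   /* per lane, PCIe 3.0 */
    const double encoding = 128.0 / 130.0;      /* 128b/130b line code */
    return gigatransfers_per_sec * encoding * lanes / 8.0;  /* bits -> bytes */
}
```

For an assumed x16 link this gives roughly 15.75 GB/s in each direction, so a combined read-plus-write rate above 16 GB/s is only possible when both directions run at the same time.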

Journeyman III

Re: streaming large datasets through the GPU


I will try to attach a zip file with everything needed for Linux.  For windows you

can substitute a suitable timer for WALLTIME.  Other than that, the .cpp and .cl

files should work under Windows.  To control the buffer sizes change NX, NY, NZ,

and BATCH.  Running "transfer2 1" uses AHP, "transfer2 2" uses UHP. The largest

mesh I can run is NX=NY=NZ=1024, (with BATCH=16)  16 GB DATA_S.  My machine is

an i7-3820 (32GB) HD7770 (2GB) with pcie-3.

I am now trying to avoid Mapping the large structure DATA_S and instead Mapping

each piece of it before Writing/Reading it, and UnMapping it after.  Lots of

Map/UnMap.  Getting seg. faults at the first Write at this point ... probably my bad.
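
A minimal sketch of the per-piece region arithmetic this approach needs, since an off-by-one in the offset or in the size of the last piece is one common source of seg. faults. The struct and function names are hypothetical. The resulting offset and size are the values that would be passed as the offset and size arguments of clEnqueueMapBuffer for each piece. Whether mapping by region avoids the failure on large buffers depends on the runtime.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t offset;  /* byte offset of the piece inside the large host buffer */
    size_t size;    /* bytes in this piece; the last piece may be shorter */
} Region;

/* Region of piece `index` when total_bytes is cut into piece_bytes chunks. */
Region piece_region(size_t total_bytes, size_t piece_bytes, size_t index)
{
    assert(piece_bytes > 0);
    assert(index * piece_bytes < total_bytes);  /* piece must exist */
    Region r;
    r.offset = index * piece_bytes;
    r.size = (total_bytes - r.offset < piece_bytes)
                 ? total_bytes - r.offset       /* short final piece */
                 : piece_bytes;
    return r;
}
```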

Happy hunting.

Journeyman III

Re: streaming large datasets through the GPU

Here is the code.


Re: streaming large datasets through the GPU

1. You still have to prepin memory even for the UHP allocations. In theory it's not really necessary. However runtime uses the same path for AHP and UHP. Also I believe OpenCL 1.2 spec requires a map call for CPU access even for the UHP allocations. So call clEnqueueMapBuffer for UHP similar to AHP and that should fix the app performance. Also don't forget about the unmap calls:-)

2. I can confirm that both transfers are running asynchronously in HW, but when they run together DMA engine 1 is slower than DMA engine 0. On top of that, even DMA0 is slightly slower than a single transfer on either DMA0 or DMA1. So I would say 16GB/s is the best you can get for now.

Journeyman III

Re: streaming large datasets through the GPU


Yes, I did lock (prepin) DATA_S by calling mlock.  On my machine any user can lock up to 4GB,

and above that I run as root.  I had also tried Map for the UHP case and that Map failed just

like the one for AHP.  Map simply fails if the buffer to be mapped is too large (> ~1.2GB).

As I said in my last post, I tried Mapping each separate piece ( <= .25GB) of DATA_S to get

the pointer for Read/Write, and that worked as before with DATA_S < 1.23GB but failed with

map error -12 with larger DATA_S.  So the Map failure seems to be triggered by the size of

the buffer being mapped rather than the size of the region of that buffer that is Mapped.

The UHP case runs at the same rate whether or not it is prepinned and whether or not it is

Mapped.  Does Windows do any better?


Re: streaming large datasets through the GPU

You don't have to call mlock. The base driver will lock memory when UHP allocation is created. mlock has nothing to do with clEnqueueMapBuffer().

That's correct. There is a limit on the allocated AHP/UHP size in the Linux base driver. The pools have to be preallocated during boot. As far as I have heard, the limitation comes from the Linux kernel and has to be worked around. Windows should allow half of system memory for AHP/UHP allocations.

The reason it works without clEnqueueMapBuffer is that the runtime has deferred memory allocations. Basically clCreateBuffer does nothing and the runtime allocates memory on the first access. So when you call clEnqueueMapBuffer the actual allocation occurs (the error code can be fixed). Without the clEnqueueMapBuffer call the runtime doesn't know that the pointer in read/write buffer is a UHP allocation, so it will pin system memory in small chunks and perform multiple transfers. There are optimizations in the runtime that try to hide the pinning cost, but performance may vary depending on the CPU speed and OS. Currently the pinning cost in Linux is considerably higher than in Windows. In general it's much less efficient than prepin (the clEnqueueMapBuffer call). In Windows with prepin the performance is identical (I ran with smaller buffers; my systems don't have 32GB RAM).

Please note: there are more limitations with big allocations. Currently any buffer allocations (AHP/UHP/Global) can't exceed a 4GB address space, but only if they are used in kernels. The runtime can work with >4GB AHP/UHP allocations for data upload/download, because transfers are done with the DMA engines and don't require a single address space.