Journeyman III

streaming large datasets through the GPU

Hi all:

This is my first stab at GPU computing.  I am running 64-bit Linux with 32GB of memory and an AMD 7770 GPU with 2GB of memory.  The data set is large (28GB for the largest mesh) and will be streamed through the GPU in pieces for computation.  In the best of all worlds, a simple 3-buffer scheme with each buffer controlled by a separate queue would allow transfers to and from the GPU, as well as the GPU computation, to run concurrently.
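
For readers following along, here is a minimal sketch of the schedule such a 3-buffer scheme implies. It is plain C with no OpenCL calls, and the names (`stage_t`, `buffer_activity`, `print_schedule`) are hypothetical, not taken from the poster's code. It assumes chunk c is written at step c, processed at step c+1, read back at step c+2, and always lives in device buffer c % 3.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_BUFFERS 3

/* What a device buffer is doing during one pipeline step. */
typedef enum { STAGE_IDLE, STAGE_WRITE, STAGE_EXEC, STAGE_READ } stage_t;

/* Report which chunk buffer `buf` holds at `step` and what is happening to it.
 * Returns the chunk index, or -1 (with *stage = STAGE_IDLE) if the buffer is idle. */
long buffer_activity(long step, int buf, long num_chunks, stage_t *stage)
{
    assert(buf >= 0 && buf < NUM_BUFFERS);
    for (int lag = 0; lag < NUM_BUFFERS; ++lag) {
        long chunk = step - lag;          /* lag 0 = write, 1 = exec, 2 = read */
        if (chunk >= 0 && chunk < num_chunks && chunk % NUM_BUFFERS == buf) {
            *stage = (stage_t)(STAGE_WRITE + lag);
            return chunk;
        }
    }
    *stage = STAGE_IDLE;
    return -1;
}

/* Print the whole schedule; it takes num_chunks + 2 steps to drain the pipeline. */
void print_schedule(long num_chunks)
{
    static const char *names[] = { "idle", "write", "exec", "read" };
    for (long step = 0; step < num_chunks + NUM_BUFFERS - 1; ++step) {
        printf("step %ld:", step);
        for (int buf = 0; buf < NUM_BUFFERS; ++buf) {
            stage_t stage;
            long chunk = buffer_activity(step, buf, num_chunks, &stage);
            if (chunk >= 0)
                printf("  buf%d %s chunk %ld", buf, names[stage], chunk);
            else
                printf("  buf%d idle", buf);
        }
        printf("\n");
    }
}
```

In steady state every step has one write, one kernel execution, and one read in flight, each on a different buffer. Whether the hardware actually overlaps them is a separate question, which is what the rest of the thread is about.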

To set up the CPU buffers I have tried two methods:

    /* Method 1: page-aligned, locked host allocation wrapped with CL_MEM_USE_HOST_PTR */
    float* Data_s = (float*) valloc( size_S );
    if( mlock( Data_s, size_S ) != 0 )    printf("*Data_s not locked\n" );
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);

    /* Method 2: runtime-allocated host buffer (CL_MEM_ALLOC_HOST_PTR), mapped to obtain a CPU pointer */
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
    float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );

To set up the GPU buffers :

    cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);

The runs indicate that kernel execution is overlapped with the transfers, but reads do not overlap writes.  The latter is a disappointment, but not totally unexpected.  For small meshes the ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample), but the USE_HOST_PTR method only runs at roughly 2/3 of that speed.

For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12, which the OpenCL 1.2 spec explicitly states cannot happen!), but the slower USE_HOST_PTR method will handle the largest mesh (28GB).

Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method that gives the full transfer rate over the largest mesh.  There are several posts on this forum about Maps of CPU buffers requiring buffer allocation on the GPU.  Has this been fixed or a work-around provided?  Also, since the GCN family has dual bi-directional DMA engines, does AMD expect to implement concurrent bi-directional transfers in the future?

catalyst-13.1-linux-x86.x86_64    AMD-APP-SDK-v2.8-lnx64

14 Replies

  • AHP from your code is USWC memory and UHP is cacheable.  Performance from the GPU side should be identical, but USWC doesn't pollute the CPU cache. I would suggest creating a simple test just for the data transfer to see if you can reproduce the performance difference.
  • What’s the size of the AHP buffer? A USWC allocation by default doesn’t request a CPU virtual address. That requires an extra operation, map(), which may fail for whatever reason. The error code can be changed, but that doesn’t mean the runtime won’t fail the call.
  • I’m not sure about the allocation issue. Is it VM/zerocopy? HD7770 supports VM, so there are no allocations on the GPU for CPU memory.
  • Bidirectional transfers should work in the latest driver. Usually Windows is the main target for all performance tuning, because MS has advanced tools for GPU profiling. However, I don’t expect any major issues under Linux. The OpenCL runtime pairs 2 CPs (command processors) with 2 DMA engines. So if the application creates 3 queues, then 2 queues will be assigned to CP0 and DMA0 and 1 queue to CP1 and DMA1. The application has to make sure read and write transfers go to different DMA engines without any synchronization between them.
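
The queue pairing described in the last bullet can be modelled with a tiny sketch. This is only an illustration: the reply does not say how queues are matched to engines, so the round-robin-in-creation-order rule below is an assumption that merely reproduces the "2 queues on DMA0, 1 queue on DMA1" split for three queues. The function names are hypothetical.

```c
#include <assert.h>

#define NUM_DMA_ENGINES 2

/* ASSUMPTION: queues are paired with CP/DMA engines round-robin in creation
 * order (queue 0 -> engine 0, queue 1 -> engine 1, queue 2 -> engine 0, ...).
 * The actual runtime policy is not documented in this thread. */
int engine_for_queue(int queue_index)
{
    assert(queue_index >= 0);
    return queue_index % NUM_DMA_ENGINES;
}

/* Returns 1 if the write queue and the read queue would land on different
 * DMA engines under the assumed policy, so the two directions could overlap. */
int transfers_can_overlap(int write_queue, int read_queue)
{
    return engine_for_queue(write_queue) != engine_for_queue(read_queue);
}
```

Under this assumed policy, creating the write queue first and the read queue second would put the two transfer directions on different engines, while using the first and third queues for them would not.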

The first thing to find out is whether VM is enabled. You can check that by running "clinfo" and looking at the driver version string (it should be something like "1182.2 (VM)"). The presence of the "VM" string is what you should look for.

Assuming VM is enabled, the AHP buffer will be accessed directly by the OpenCL kernel, i.e. each pointer access in the kernel translates to a PCIe transaction, which in turn accesses pinned memory. This means that the kernel is not doing any work most of the time and is stalling (very badly) on memory operations. So the overlap that you intend to achieve is probably not happening. I suggest you allocate a buffer on the GPU and "enqueueWriteBuffer" to it.

I had noted earlier that using a "UHP" buffer directly as a kernel argument slows it down very badly. You may want to build a small prototype to probe this.

Journeyman III

Thanks German and Himanshu for your rapid response.

In response to Himanshu, the driver from clinfo is 1084.4 (VM).  The test code is a modified version of the HelloWorld sample and is quite simple.  The kernel simply modifies the GPU buffer to show that the data was actually moved to and from the GPU. The large buffer on the CPU (DATA_S) is streamed in pieces to and from the buffers on the GPU (Buffer_s1, etc.) using WriteBuffer and ReadBuffer.  The host code must be able to access DATA_S but never accesses Buffer_s1, etc.  Likewise the kernel will access Buffer_s1 but not DATA_S.  The only communication of data between CPU and GPU is via the Write/Read processes.

In response to German, I changed the command queues from separate queues for the three buffers to separate queues for write, read, and execute, as you suggested, and that did enable read/write overlap.  The round-trip bandwidth increased from ~12 GB/sec to ~16 GB/sec, less than the ~22 GB/sec that I hoped for, but a very promising start. The different allocations implied by AHP and UHP probably do account for the lower bandwidth of UHP (cache thrashing on the CPU), so I may be forced to use AHP for DATA_S.  But since the host needs a pointer with which to access DATA_S, it seems I must Map it, as using Query to get an AHP pointer is explicitly prohibited according to the OpenCL 1.2 reference.  That seems to imply that I must somehow overcome the problem of mapping a large buffer.  The problem may be in the AMD software or in Linux.  Is it possible to get more informative error information from Map?  It might also be useful to try the code under Windows to see if Linux is the problem.  If you think so, I can send you the code to try.

The code runs with DATA_S = 1.21 GB but fails at 1.23 GB.
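
On the question of more informative errors: OpenCL calls only return a numeric status, so about the most a test program can do on its own is translate that number into its symbolic name. Below is a minimal sketch of such a helper. The numeric values are the ones defined in the Khronos `cl.h` header, with -12 being `CL_MAP_FAILURE`. The function names are hypothetical, and only a subset of codes relevant to buffer mapping is listed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Translate an OpenCL status code into its symbolic name (subset only). */
const char *cl_status_name(int status)
{
    switch (status) {
    case   0: return "CL_SUCCESS";
    case  -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case  -5: return "CL_OUT_OF_RESOURCES";
    case  -6: return "CL_OUT_OF_HOST_MEMORY";
    case -12: return "CL_MAP_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -59: return "CL_INVALID_OPERATION";
    case -61: return "CL_INVALID_BUFFER_SIZE";
    default:  return "unknown OpenCL status";
    }
}

/* Returns 1 if the code is in the table above, 0 otherwise. */
int is_known_status(int status)
{
    return strcmp(cl_status_name(status), "unknown OpenCL status") != 0;
}

/* Format "<call> failed: <NAME> (<code>)" into out, truncating if needed. */
void describe_cl_failure(const char *call, int status, char *out, size_t n)
{
    assert(call != NULL && out != NULL && n > 0);
    snprintf(out, n, "%s failed: %s (%d)", call, cl_status_name(status), status);
}
```

A call site might format the message right after `clEnqueueMapBuffer` returns and print it to stderr. An implementation may also report extra detail through the `pfn_notify` callback passed to `clCreateContext`, though whether this driver does so for a failed map is not something the thread establishes.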


22 GB/s? Do you have a PCIe Gen3 system? Attach the code for Windows and I'll tell you if you can improve performance and how.

1.23GB doesn't look big, but originally the runtime didn't allow single allocations > 512MB even for AHP. The Linux base driver could be failing something. I'll try to check that. I assume you run a 64-bit build of your test? You shouldn't see this issue under Windows.



I will try to attach a zip file with everything needed for Linux.  For Windows you can substitute a suitable timer for WALLTIME.  Other than that, the .cpp and .cl files should work under Windows.  To control the buffer sizes change NX, NY, NZ, and BATCH.  Running "transfer2 1" uses AHP; "transfer2 2" uses UHP. The largest mesh I can run is NX=NY=NZ=1024 (with BATCH=16), a 16 GB DATA_S.  My machine is an i7-3820 (32GB) with an HD7770 (2GB) on PCIe 3.

I am now trying to avoid Mapping the large structure DATA_S, and instead Mapping each piece of it before Writing/Reading it and UnMapping it after.  Lots of Map/UnMap.  Getting seg. faults at the first Write at this point ... probably my bad.

Happy hunting.


Here is the code.


1. You still have to prepin memory even for the UHP allocations. In theory it's not really necessary. However runtime uses the same path for AHP and UHP. Also I believe OpenCL 1.2 spec requires a map call for CPU access even for the UHP allocations. So call clEnqueueMapBuffer for UHP similar to AHP and that should fix the app performance. Also don't forget about the unmap calls:-)

2. I can confirm that both transfers are running asynchronously in HW, but when they run together DMA engine 1 is slower than DMA engine 0. On top of that, even DMA0 is slightly slower than a single transfer on either DMA0 or DMA1. So I would say 16GB/s is the best you can get for now.



Yes, I did lock (prepin) DATA_S by calling mlock.  On my machine any user can lock up to 4GB, and above that I run as root.  I had also tried Map for the UHP case and that Map failed just like the one for AHP.  Map simply fails if the buffer to be mapped is too large (> ~1.2GB).  As I said in my last post, I tried Mapping each separate piece (<= .25GB) of DATA_S to get the pointer for Read/Write, and that worked as before with DATA_S < 1.23GB but failed with map error -12 with larger DATA_S.  So the Map failure seems to be triggered by the size of the buffer being mapped rather than the size of the region of that buffer that is Mapped.  The UHP case runs at the same rate whether or not it is prepinned and whether or not it is Mapped.  Does Windows do any better?


You don't have to call mlock. The base driver will lock memory when UHP allocation is created. mlock has nothing to do with clEnqueueMapBuffer().

That's correct. There is a limit on the allocated AHP/UHP size in the Linux base driver. The pools have to be preallocated during boot. As far as I have heard, the limitation comes from the Linux kernel and has to be worked around. Windows should allow half of system memory for AHP/UHP allocations.

The reason it works without clEnqueueMapBuffer is that the runtime has deferred memory allocations. Basically clCreateBuffer does nothing and the runtime allocates memory on the first access. So when you call clEnqueueMapBuffer the actual allocation occurs (the error code can be fixed). Without the clEnqueueMapBuffer call the runtime doesn't know that the pointer in read/write buffer is a UHP allocation, so it will pin system memory in small chunks and perform multiple transfers. There are optimizations in the runtime that try to hide the pinning cost, but performance may vary depending on the CPU speed and OS. Currently the pinning cost in Linux is considerably more expensive than in Windows. In general it's much less efficient than prepinning (the clEnqueueMapBuffer call). In Windows with prepinning the performance is identical; I ran with smaller buffers (my systems don't have 32GB of RAM).

Please note: there are more limitations with big allocations. Currently any buffer allocations (AHP/UHP/Global) can't exceed a 4GB address space, but only if they are used in kernels. The runtime can work with >4GB AHP/UHP allocations for data upload/download, because transfers are done with the DMA engines and don't require a single address space.


OK, I think I understand most of your response.  I will look into the Linux kernel/boot pool issue.  The remaining mystery is why, even with DATA_S small enough that Map does not fail in either the AHP or UHP case, the UHP setup (with Mapped DATA_S) does not run as fast as AHP (~7GB/sec vs ~16GB/sec).


    /* AHP case */
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
    Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );

    /* UHP case: map, then unmap immediately */
    Data_s = (float*) valloc( size_S );
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);
    Data_sx = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
    status |= clEnqueueUnmapMemObject(Queue_1, DATA_S, Data_sx, 0, NULL, NULL );

Your reply indicated that Windows, using calls as above, runs both AHP and UHP at the same high rate.  Have you also tried this on a Linux system?


There are keys for the linux base driver to increase the pools. I don't recall them. Don't know if they are publicly available or not.

You have to remove the clEnqueueUnmapMemObject call after the map. As soon as you call unmap, the runtime doesn't consider UHP a prepinned allocation anymore (no CPU access from the app). Call unmap at the end, before the memory release. Basically, as I mentioned before, AHP and UHP have the same behavior in the runtime.

BTW, runtime guarantees (Data_s == Data_sx) for UHP.



Yes, I had checked that Data_s == Data_sx, but removing the Unmap still gives the lower ~7 GB/sec rate.  If we can get UHP up to AHP speed, I might be able to get around the Map size limit by:

1) get Data_s from valloc or equivalent (page aligned)
2) form other pointers from Data_s (e.g. p1, p2, ... one for each Read/Write transfer)
3) CreateBuffer( UHP ) small buffer for each (an array of CPU buffers)
4) Map each one (small size, ~ .25 GB) just before Read/Write
5) Read/Write
6) Unmap small buffer

Silly idea?
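
Steps 1 and 2 of that list come down to slicing one large page-aligned allocation into pieces. A minimal sketch of the bookkeeping follows, in plain C with no OpenCL calls. The names `chunk_t`, `chunk_count`, and `chunk_at` are hypothetical. It assumes the piece size is a multiple of the page size, so that every piece starts on a page boundary whenever the base pointer does.

```c
#include <assert.h>
#include <stddef.h>

/* One piece of the large host allocation. */
typedef struct {
    size_t offset;  /* byte offset from the start of the host allocation */
    size_t bytes;   /* length of this piece (the last one may be shorter) */
} chunk_t;

/* Number of pieces needed to cover total_bytes with pieces of at most chunk_bytes. */
size_t chunk_count(size_t total_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Describe piece `index`. If chunk_bytes is a multiple of the page size and the
 * base pointer is page aligned, base + offset is page aligned as well. */
chunk_t chunk_at(size_t total_bytes, size_t chunk_bytes, size_t index)
{
    chunk_t c;
    assert(index < chunk_count(total_bytes, chunk_bytes));
    c.offset = index * chunk_bytes;
    c.bytes = (total_bytes - c.offset < chunk_bytes) ? total_bytes - c.offset
                                                     : chunk_bytes;
    return c;
}
```

The host pointer for piece i would then be `(char*)Data_s + chunk_at(size_S, piece_bytes, i).offset`, which is what step 3 would hand to clCreateBuffer with CL_MEM_USE_HOST_PTR. Here `piece_bytes` is a placeholder for the chosen piece size.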


I still have to run your test under Linux; I haven't had time. Basically I forgot about an extra limitation under Linux: the cacheable pool is much smaller than the USWC pool, and UHP allocations go to the cacheable pool. Personally I don't see any reason to limit UHP allocations to any pools, but that's how the memory manager under Linux works. Windows also has some limitations, but with a much bigger size. Anyway, try to reduce UHP allocations to 128MB to see if you get 16GB/s. In the case of a UHP alloc failure the runtime may disable zero-copy if necessary so that some tests can still work. That may explain your numbers. The pool size limitation under Linux can be fixed in the future, but I don't know the time frame.

Your pseudo code isn't optimal and will likely introduce bubbles between CPU/GPU executions. Any UHP allocation requires memory pinning. Pinning involves a GPU page table update, so GPU stalls are possible. I believe under Windows the VidMM scheduling thread will disable any submissions during that operation, and I doubt Linux will be any more optimal than that.  To be honest, I'm not sure there is an optimal solution to bypass the UHP size limit, which shouldn't really exist in the first place. Then again, it depends on the system configuration and the amount of memory requested for pinning.

I would suggest implementing a double copy to see if you can get better overall performance, running the CPU copy asynchronously with the GPU transfers. Otherwise I think your new code shouldn't be any faster than the current approach, which runs at 7GB/s and has no size "limits".
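
A rough sketch of what such a double copy could look like is given below. It is a synchronous, CPU-only model: the data is first copied from the large ordinary allocation into one of two small staging buffers, and from there "uploaded" through a callback. All names are hypothetical. In real code the staging buffers would presumably be small mapped AHP buffers and the callback would enqueue a non-blocking clEnqueueWriteBuffer; here the callback is a plain memcpy so the logic can be checked without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the staging-buffer -> GPU transfer. */
typedef void (*upload_fn)(const void *staging, size_t bytes,
                          size_t dst_offset, void *user);

/* Mock upload used for testing: copies into a host array passed via `user`. */
void mock_upload(const void *staging, size_t bytes, size_t dst_offset, void *user)
{
    memcpy((char *)user + dst_offset, staging, bytes);
}

/* Stream total_bytes from src through two staging buffers of staging_bytes
 * each, alternating between them. Returns the number of pieces sent.
 * NOTE: with asynchronous uploads, the code would have to wait for the
 * previous upload from a slot to finish before overwriting that slot. */
size_t stream_through_staging(const char *src, size_t total_bytes,
                              char *staging[2], size_t staging_bytes,
                              upload_fn upload, void *user)
{
    size_t offset = 0, pieces = 0;
    assert(staging_bytes > 0);
    while (offset < total_bytes) {
        size_t n = total_bytes - offset;
        if (n > staging_bytes)
            n = staging_bytes;
        char *slot = staging[pieces % 2];   /* ping-pong between the two slots */
        memcpy(slot, src + offset, n);      /* first copy: done by the CPU */
        upload(slot, n, offset, user);      /* second copy: DMA in real code */
        offset += n;
        ++pieces;
    }
    return pieces;
}
```

The point of the two slots is that the CPU can fill one while the DMA engine drains the other. Only the small staging buffers need to be pinned, so the large data set can stay in ordinary memory.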



You are correct ... reducing DATA_S to 128MB has both AHP and UHP running at ~12GB/sec.  That is probably lower than 16GB/sec due to overhead relative to smaller transfers. I will try my silly idea just to see what happens.  I am willing to reserve a large pool in physical memory at boot, but I do not know how to configure it so that UHP will recognize it.  For now, since I have a workable, if slow, UHP method, I will proceed to the more interesting job of the kernels.  If you have any further thoughts on this, let me know ... and thanks for your help.