streaming large datasets through the GPU

Question asked by bobrog on Feb 18, 2013
Latest reply on Feb 22, 2013 by bobrog

Hi all:


This is my first stab at GPU computing.  I am running 64-bit Linux with 32GB of
memory and an AMD 7770 GPU with 2GB of memory.  The data set is large (28GB for
the largest mesh) and will be streamed through the GPU in pieces for computation.
In the best of all worlds, a simple 3-buffer scheme with each buffer controlled
by a separate queue would allow transfers to and from the GPU, as well as the GPU
computation, to run concurrently.
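The rotation behind such a 3-buffer scheme can be sketched in plain C, without any OpenCL calls. This is only a scheduling model under stated assumptions: chunk c always lives in device buffer c % 3, and at pipeline step t the upload works on chunk t, the kernel on chunk t-1, and the download on chunk t-2. The names stage_chunk, buffer_for_chunk, and pipeline_steps are hypothetical and do not come from the original code.

```c
#include <assert.h>

enum { NUM_BUFFERS = 3 };                  /* one buffer per pipeline stage */
enum stage { UPLOAD = 0, COMPUTE = 1, DOWNLOAD = 2 };

/* Chunk handled by stage `s` at pipeline step `step`, or -1 if the stage is
   idle (pipeline still filling or already draining). */
long stage_chunk(long step, enum stage s, long num_chunks)
{
    assert(step >= 0 && num_chunks >= 0);
    long chunk = step - (long)s;           /* each stage lags the previous by one step */
    return (chunk >= 0 && chunk < num_chunks) ? chunk : -1;
}

/* Device buffer (0..2) that holds a given chunk for its whole lifetime. */
int buffer_for_chunk(long chunk)
{
    assert(chunk >= 0);
    return (int)(chunk % NUM_BUFFERS);
}

/* Total number of steps needed to push num_chunks through all three stages. */
long pipeline_steps(long num_chunks)
{
    assert(num_chunks >= 0);
    return num_chunks == 0 ? 0 : num_chunks + NUM_BUFFERS - 1;
}
```

At any step the three stages touch chunks t, t-1, and t-2, which map to three different buffers, so in this model no stage waits on a buffer that another stage is still using.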


To set up the CPU buffers I have tried two methods:


    /* Method 1 (USE_HOST_PTR): page-aligned, locked host memory wrapped in a buffer */
    float* Data_s = (float*) valloc( size_S );
    if( mlock( Data_s, size_S ) != 0 )    printf("*Data_s not locked\n" );
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status);

    /* Method 2 (ALLOC_HOST_PTR): let the runtime allocate the host memory, then map it */
    DATA_S = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status);
    float* Data_s = (float*) clEnqueueMapBuffer(Queue_1, DATA_S, CL_TRUE, CL_MAP_READ|CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );


To set up the GPU buffers:

    cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
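As a rough check on how large each GPU buffer can be and how many pieces the data set breaks into, here is a minimal sketch. It assumes the device limits have already been queried with clGetDeviceInfo (CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE) and are passed in as plain numbers. The helper names and the alignment parameter are hypothetical, and a real run would probably need to leave some headroom below the computed size.

```c
#include <assert.h>
#include <stdint.h>

/* Largest per-buffer size such that num_buffers buffers fit in global_mem,
   no single buffer exceeds max_alloc, and the size is a multiple of align. */
uint64_t device_buffer_size(uint64_t global_mem, uint64_t max_alloc,
                            unsigned num_buffers, uint64_t align)
{
    assert(num_buffers > 0 && align > 0);
    uint64_t size = global_mem / num_buffers;
    if (size > max_alloc)
        size = max_alloc;
    return size - (size % align);          /* round down to the alignment */
}

/* Number of pieces needed to stream total_bytes through chunks of chunk_bytes. */
uint64_t chunk_count(uint64_t total_bytes, uint64_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return total_bytes / chunk_bytes + (total_bytes % chunk_bytes != 0);
}

/* Byte length of chunk `index`; only the last chunk may be shorter. */
uint64_t chunk_length(uint64_t total_bytes, uint64_t chunk_bytes, uint64_t index)
{
    assert(index < chunk_count(total_bytes, chunk_bytes));
    uint64_t left = total_bytes - index * chunk_bytes;
    return left < chunk_bytes ? left : chunk_bytes;
}
```

For example, with an assumed 512MB per-buffer limit, a 28GB mesh would stream through in 56 pieces.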


The runs indicate that kernel execution overlaps the transfers, but reads do not overlap writes.
The latter is a disappointment, but not totally unexpected.  For small meshes the
ALLOC_HOST_PTR method runs at full speed (same as the BufferBandwidth sample), but
the USE_HOST_PTR method only runs at roughly 2/3 of that speed.
For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,
CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen!), but the
slower USE_HOST_PTR method will handle the largest mesh (28GB).
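To put the cost of reads not overlapping writes in perspective, a simplified steady-state model can be sketched as follows. It assumes a perfectly pipelined stream in which the time per chunk is set by the slowest resource. The function names and the timing inputs are illustrative only and are not measured values.

```c
#include <assert.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* Steady-state time per chunk when uploads and downloads serialize on one
   transfer path while the kernel runs concurrently with them. */
double chunk_time_serialized(double t_up, double t_kernel, double t_down)
{
    assert(t_up >= 0 && t_kernel >= 0 && t_down >= 0);
    return max2(t_kernel, t_up + t_down);
}

/* Steady-state time per chunk if upload, kernel, and download all overlap. */
double chunk_time_full_overlap(double t_up, double t_kernel, double t_down)
{
    assert(t_up >= 0 && t_kernel >= 0 && t_down >= 0);
    return max2(t_kernel, max2(t_up, t_down));
}

/* Speedup that full overlap would give over serialized transfers (1.0 = none). */
double overlap_speedup(double t_up, double t_kernel, double t_down)
{
    double full = chunk_time_full_overlap(t_up, t_kernel, t_down);
    assert(full > 0);
    return chunk_time_serialized(t_up, t_kernel, t_down) / full;
}
```

In this model, a transfer-bound code with equal upload and download times gains at most a factor of 2 from concurrent bi-directional transfers, and the gain disappears once the kernel time exceeds the combined transfer time.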


Since the CPU <--> GPU transfers are the bottleneck for the code, I need a method
that gives the full transfer rate over the largest mesh.  There are several posts
on this forum about maps of CPU buffers requiring a buffer allocation on the GPU.
Has this been fixed, or has a work-around been provided?  Also, since the GCN family has dual
bi-directional DMA engines, does AMD expect to implement concurrent bi-directional transfers
in the future?


catalyst-13.1-linux-x86.x86_64    AMD-APP-SDK-v2.8-lnx64