This is my first stab at GPU computing. I am running 64-bit Linux with 32GB of
system memory and an AMD 7770 GPU with 2GB of memory. The data set is large (28GB
for the largest mesh) and will be streamed through the GPU in pieces for computation.
In the best of all worlds, a simple 3-buffer scheme, with each buffer controlled
by a separate queue, would allow the transfers to and from the GPU and the GPU
computation to all run concurrently.
To set up the CPU buffers I have tried two methods.

Method 1, CL_MEM_USE_HOST_PTR with page-aligned, locked host memory:

float* Data_s = (float*) valloc( size_S );
if( mlock( Data_s, size_S ) != 0 ) printf( "*Data_s not locked\n" );
DATA_S = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size_S, Data_s, &status );

Method 2, CL_MEM_ALLOC_HOST_PTR with a blocking map:

DATA_S = clCreateBuffer( context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size_S, NULL, &status );
float* Data_s = (float*) clEnqueueMapBuffer( Queue_1, DATA_S, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, size_S, 0, NULL, NULL, &status );
To set up the GPU buffers:
cl_mem Buffer_s1 = clCreateBuffer(context, CL_MEM_READ_WRITE, size_s, NULL, &status);
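For context, here is a condensed sketch of the pipelining I am attempting (identifiers such as queue[], gpu_buf[], Result_s, and gws are hypothetical stand-ins, and error checking is omitted; it is a sketch of the intent, not my exact code). Each of the three queues cycles a non-blocking write, the kernel, and a non-blocking read, chained with events so a buffer is not overwritten until its previous read-back has drained:

```c
/* Triple-buffered streaming loop: events order write -> kernel -> read
   within a chunk; waiting on the previous read event keeps buffer b
   from being overwritten while it is still draining back to the host. */
cl_event write_ev[3] = {0}, kern_ev[3] = {0}, read_ev[3] = {0};

for (size_t i = 0; i < num_chunks; ++i) {
    int    b   = i % 3;               /* buffer/queue index        */
    size_t off = i * chunk_bytes;     /* offset into the host data */

    /* Don't reuse buffer b until its previous read-back completes. */
    if (read_ev[b]) { clWaitForEvents(1, &read_ev[b]); clReleaseEvent(read_ev[b]); }

    clEnqueueWriteBuffer(queue[b], gpu_buf[b], CL_FALSE, 0, chunk_bytes,
                         (char*)Data_s + off, 0, NULL, &write_ev[b]);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &gpu_buf[b]);
    clEnqueueNDRangeKernel(queue[b], kernel, 1, NULL, &gws, NULL,
                           1, &write_ev[b], &kern_ev[b]);
    clEnqueueReadBuffer(queue[b], gpu_buf[b], CL_FALSE, 0, chunk_bytes,
                        (char*)Result_s + off, 1, &kern_ev[b], &read_ev[b]);
    clReleaseEvent(write_ev[b]);
    clReleaseEvent(kern_ev[b]);
}
for (int b = 0; b < 3; ++b) clFinish(queue[b]);
```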
The runs indicate that kernel execution overlaps the transfers, but reads do not
overlap writes. The latter is a disappointment, but not totally unexpected. For
small meshes the ALLOC_HOST_PTR method runs at full speed (the same as the
BufferBandwidth sample), but the USE_HOST_PTR method runs at only roughly 2/3 of
that speed.
For larger meshes the ALLOC_HOST_PTR method fails at the MapBuffer call (error -12,
CL_MAP_FAILURE, which the OpenCL 1.2 spec explicitly states cannot happen for
ALLOC_HOST_PTR buffers!), but the slower USE_HOST_PTR method will handle the
largest mesh (28GB).
Since the CPU <--> GPU transfers are the bottleneck for this code, I need a method
that gives the full transfer rate over the largest mesh. There are several posts
on this forum about maps of CPU buffers requiring a buffer allocation on the GPU.
Has this been fixed, or is there a work-around? Also, since the GCN family has dual
bidirectional DMA engines, does AMD expect to implement concurrent bidirectional
transfers in the future?