
simon1
Journeyman III

More pinned host-memory than device-memory capacity

Context: I'm working on a matching algorithm: an unknown pattern is compared against a gallery in order to find the best match. The gallery contains up to a billion examples, roughly 30 GB, which fits in host memory in my case.

In the CUDA version of my implementation, I allocate two buffers on the GPU, each as large as possible, and split the gallery into chunks of pinned host memory (allocated with cudaMallocHost). This lets me upload the chunks to the device without any intermediate copy and at the highest bandwidth, and process one device buffer while the other one is being filled.
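Roughly, the CUDA side looks like this (a simplified sketch of the scheme described above; match_gallery, num_chunks and process_chunk are placeholder names, not my actual code):

#include <cuda_runtime.h>
#include <stdlib.h>

void match_gallery(size_t chunk_size, int num_chunks)
{
    int **host_chunks = (int **)malloc(num_chunks * sizeof(int *));
    int  *dev_buf[2];
    cudaStream_t stream[2];

    // gallery chunks live in page-locked host memory: DMA-able, no staging copy
    for (int i = 0; i < num_chunks; i++)
        cudaMallocHost((void **)&host_chunks[i], chunk_size * sizeof(int));
    // ... fill host_chunks[i] with gallery data ...

    for (int i = 0; i < 2; i++) {
        cudaMalloc((void **)&dev_buf[i], chunk_size * sizeof(int));
        cudaStreamCreate(&stream[i]);
    }

    for (int c = 0; c < num_chunks; c++) {
        int b = c & 1;
        cudaStreamSynchronize(stream[b]);            // wait until device buffer b is free again
        cudaMemcpyAsync(dev_buf[b], host_chunks[c],  // async upload straight from pinned memory
                        chunk_size * sizeof(int),
                        cudaMemcpyHostToDevice, stream[b]);
        // process_chunk<<<grid, block, 0, stream[b]>>>(dev_buf[b], ...);
    }
    cudaDeviceSynchronize();
}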

In section 3.1.1 of their OpenCL best practices guide, NVIDIA explains how to do the same in OpenCL.

Here's how I tried it with my AMD GPU:

// create host buffers
cl_mem host_buffers[num_host_buffers];
for (uint i = 0; i < num_host_buffers; i++) {
    host_buffers[i] = clCreateBuffer(context,
                                     CL_MEM_ALLOC_HOST_PTR,
                                     chunk_size * sizeof(int),
                                     ...);
}

// init host buffers
for (uint i = 0; i < num_host_buffers; i++) {
    int* m = (int*)clEnqueueMapBuffer(queue, host_buffers[i], CL_TRUE,
                                      CL_MAP_WRITE_INVALIDATE_REGION,
                                      0, chunk_size * sizeof(int),
                                      ...);
    // ... fill the chunk ...
    clEnqueueUnmapMemObject(queue, host_buffers[i], (void*)m, ...);
}

// alloc device buffers
cl_mem device_buffers[2];
for (uint i = 0; i < 2; i++) {
    device_buffers[i] = clCreateBuffer(context,
                                       CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                       chunk_size * sizeof(int), ...);
}

To upload the required chunk of data, I enqueue a CopyBuffer from a host_buffer to a device_buffer (sketched below). But the clEnqueueMapBuffer calls start failing with CL_MAP_FAILURE once the total amount of allocated memory reaches the VRAM capacity.
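The per-chunk upload and processing look roughly like this (reusing queue, host_buffers, device_buffers and chunk_size from the code above; match_kernel, global_size, copy_done and chunk are placeholder names):

cl_event copy_done;
size_t   global_size = chunk_size;
cl_uint  b = chunk % 2;                        // alternate between the two device buffers

// copy from the pinned host buffer into the currently free device buffer
clEnqueueCopyBuffer(queue, host_buffers[chunk], device_buffers[b],
                    0, 0, chunk_size * sizeof(int), 0, NULL, &copy_done);

// process that device buffer once the copy has completed
clSetKernelArg(match_kernel, 0, sizeof(cl_mem), &device_buffers[b]);
clEnqueueNDRangeKernel(queue, match_kernel, 1, NULL,
                       &global_size, NULL, 1, &copy_done, NULL);
clReleaseEvent(copy_done);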

According to the table in section 4.5.2 of the APP programming guide, it seems there is no way to allocate "upload-ready" memory chunks on the host only (at least not without that "VM" feature enabled).

Manually aligning, page-locking and marking a memory chunk as non-cacheable is not an option either. From the APP guide, section 4.5.1.2:

Currently, the runtime recognizes only data that is in pinned host memory for operation arguments that are memory objects it has allocated in pinned host memory.

To make things short: what is the best way to manage a (splittable) data set that fits in host memory but not in device memory? Is it possible to avoid extra copies and take advantage of the highest available bandwidth at the same time?
