Archives Discussions

royvm · ‎05-07-2013

I have a for loop (host side) which calls the same kernel many times, such that for each call an input buffer and an output buffer are passed as kernel arguments.

Using the AMD profiler, it turns out that there is a huge overhead caused by multiple kernel calls, buffer writes at the beginning, and buffer reads at the end of each for loop iteration.

I would like to replace the multiple kernel calls with a single kernel call which has a single buffer array as an input argument and a single buffer array as an output argument. That "batch" kernel should run a for loop on the GPU hardware (instead of on the host CPU) and call the existing single-buffer kernel multiple times. This batch call should replace the huge number of data transfers between host and GPU with a single (larger) transfer of a single buffer array.

How do I declare and pass a memory buffer array as an argument to the kernel? How do I read that buffer array in the kernel? Is there any code sample?

(I do not believe that calling clCreateSubBuffer is a good option)

Thanks,

Roy

dmeiser · ‎05-07-2013

In some OpenCL implementations the cl_mem objects that you use to refer to device memory are in fact plain pointers to device memory (NVIDIA's OpenCL and probably also the CPU drivers from Intel and AMD). In principle you could store the values of the cl_mem objects representing your buffers in another buffer and loop through them in your kernel.

However, this is not portable and I wouldn't recommend it. The more natural solution would be to pack your data into a larger array by hand (both input and output). For instance, in the case where you have N arrays that are each n elements long, you could do:

cl_float *input = (cl_float*)malloc(N * n * sizeof(cl_float));
for (int i = 0; i < N; ++i) {
    // Loop over buffers
    for (int j = 0; j < n; ++j) {
        // Fill each buffer
        input[i * n + j] = ...;
    }
}
cl_float *output = (cl_float*)malloc(N * n * sizeof(cl_float));

Then you'd copy your input to a buffer with something like clEnqueueWriteBuffer. In your kernel you could then process this batch as follows:

int j = get_global_id(0);
for (int i = 0; i < N; ++i) {
    float in = inputBuffer[i * n + j];
    outputBuffer[i * n + j] = someCoolFunctionOfInput(in);
}
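If the arrays differ in length, the same packing idea can be extended with an offset table. The following is only a minimal host-side sketch under that assumption: pack_arrays and its parameters are hypothetical names that do not come from this thread, and the code is plain C without any OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): packs `count` float arrays of
 * differing lengths into one contiguous block.
 * `offsets` must have room for count + 1 entries. On return, offsets[i] is
 * the start of array i (in elements) and offsets[count] is the total number
 * of elements. Returns NULL on allocation failure; the caller frees it. */
float *pack_arrays(const float *const *arrays, const size_t *lengths,
                   size_t count, size_t *offsets)
{
    assert(offsets != NULL);

    /* First pass: running sum of the lengths gives each start offset. */
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = total;
        total += lengths[i];
    }
    offsets[count] = total;

    /* Allocate at least one element so an empty batch is not reported
     * as an allocation failure. */
    float *packed = (float *)malloc((total ? total : 1) * sizeof *packed);
    if (packed == NULL)
        return NULL;

    /* Second pass: copy every non-empty array to its slot. */
    for (size_t i = 0; i < count; ++i) {
        if (lengths[i] > 0)
            memcpy(packed + offsets[i], arrays[i], lengths[i] * sizeof *packed);
    }
    return packed;
}
```

The packed block and the offset table would then each go to the device in one write. In the kernel, element j of array i sits at index offsets[i] + j, and array i has offsets[i + 1] - offsets[i] elements. The offsets would probably need converting to a fixed-width type such as cl_uint before upload, since size_t can differ in width between host and device.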

Hope this helps.

royvm · ‎05-07-2013

Thank you for your answer. However, this solution is inadequate for my specific problem, since the multiple kernel buffers are created at different times, not necessarily in order, and each buffer is very large. Also, buffer sizes may vary, although I may be able to work around that issue.

I need to run, in a single batch at a given time, all the buffers that have accumulated so far, but I have no control over their order, size, or time of creation and filling.

Therefore, I do need some method to pack multiple previously allocated buffers which were created at different times and not in order, into an array of buffers, and pass that array of buffers to the kernel.

Is there a way to do that ?

Thanks,

Roy

dmeiser · ‎05-07-2013

"Therefore, I do need some method to pack multiple previously allocated buffers which were created at different times and not in order, into an array of buffers, and pass that array of buffers to the kernel. Is there a way to do that?"

You could try the first approach I mentioned: filling a buffer with the values of the cl_mem objects that represent your individual buffers. This should work on a fair number of OpenCL implementations, but I'm not sure if it's guaranteed to work per the OpenCL standard.

roger512 · ‎05-07-2013

hi,

Firstly, I'd say be careful with allocations, because allocating is a costly operation; if you do it between kernels it will probably hurt performance badly.

You can try to allocate a very large uchar array and create your own memory manager. From there you can pass the uchar array to your kernel, together with an int buffer which gives you the offset of every buffer contained in the uchar buffer.

Then you can access the buffers this way:

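__kernel void megaExample(__global uchar* megaarray, __global int* offset)
{
    // The casts keep the __global address space qualifier.
    // Each byte offset must be suitably aligned for the type it is cast to.
    __global float4* f4array = (__global float4*)(megaarray + offset[0]);
    __global int2* intarray = (__global int2*)(megaarray + offset[1]);
}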

When the kernel ends, the megaarray memory manager "releases" the buffers that are to be deleted, and you can reuse that memory without allocating again.
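The host-side bookkeeping for such a memory manager could look roughly like the sketch below. It is only one possible design (a simple bump allocator that releases everything at once), and Arena, arena_alloc, arena_reset and ARENA_FAIL are hypothetical names, not part of OpenCL or of any AMD library. It only tracks byte offsets; the large buffer itself would still be created once with the usual OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one large device buffer (names are
 * illustrative only). Only offsets are tracked here, not the memory. */
typedef struct {
    size_t capacity; /* size of the large buffer in bytes */
    size_t used;     /* bytes handed out so far */
} Arena;

#define ARENA_FAIL ((size_t)-1)

/* Reserves `size` bytes at a byte offset that is a multiple of `align`
 * (which must be a power of two). Returns the offset, or ARENA_FAIL if
 * the request does not fit. */
size_t arena_alloc(Arena *arena, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the current position up to the requested alignment. */
    size_t start = (arena->used + align - 1) & ~(align - 1);

    /* Reject wrap-around and requests that exceed the remaining space. */
    if (start < arena->used || start > arena->capacity ||
        size > arena->capacity - start)
        return ARENA_FAIL;

    arena->used = start + size;
    return start;
}

/* Releases every sub-buffer at once so the space can be reused without
 * another device allocation. */
void arena_reset(Arena *arena)
{
    arena->used = 0;
}
```

The offsets returned by arena_alloc are what would be written into the int offset buffer that the kernel reads. Alignment matters here because OpenCL C vector types are aligned to their size, so a region later cast to float4 should start on a 16-byte boundary.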

Passing an array of pointers is possible, but it's a hack, and you can't be sure the runtime won't move the memory the pointers refer to.

Roger

himanshu_gautam · ‎05-08-2013

Passing an array of cl_mem objects will not work. It is not an OpenCL compatible solution.

Pointers may change during the course of execution depending on what runtime is up to.

Hence, I don't think the developer should try to infer pointers from cl_mem objects.

Earlier, software that needed to point inside a cl_mem object used two arguments: "cl_mem" and "offset".

As an example, you can look at the APPML APIs.

With 1.2, one can use the "Sub Buffer", but 1.2 has very few adopters.

In your case, you need to set as many arguments as there are buffers.

But before that, I would like to understand the memory-transfer overheads you are experiencing.

Is it necessary that you have to read back the buffers immediately?

Can't you just leave them idle on the GPU and use them directly next time?

If you have to make small changes to these buffers, you can consider using the CL_PERSISTENT_AMD flag so that the host can write directly into a cl_mem object resident on the GPU (each access through the pointer causes a PCIe transfer to reach the GPU). This is a slow, uncached access. But if you only need to set a flag or update a few bytes here and there, it will be useful. You don't need to transfer the whole data, update it, and write it back. It can save you a lot of time.


Is it possible to pass a kernel an array of memory buffers as an argument ?