Archives Discussions

ochensati · ‎06-14-2012

This is probably two questions, but I cannot figure out how to make this work.

I have a set of 500 images, each 1024x900 pixels. I need to load each one into memory and perform one convolution on it (this is one kernel).

Once the image is convolved, I need to push it into a 3D buffer that should stay on the device (this is a second kernel).

Since all these images and the 3D volume are prohibitively large in memory, it would be really helpful to put the two kernels into some sort of wait state and stream the images to the GPU, so that the first kernel runs when it notices a new image and the second kernel runs when it notices the output of the first.

Is this possible? I am using Cloo for .NET to load the buffer, run the first kernel, load the output into the next buffer, and run the second kernel. This is slower than just doing the processing on the CPU. Can someone point me to the correct way, or to an example of how to do such an operation? Is this something that would be better done with OpenGL interop?
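For a sense of scale, here is a rough footprint calculation for the image stack described above. The post does not state the pixel format, so the bytes-per-pixel values used below (1 for 8-bit, 4 for 32-bit float) are assumptions, and `stack_bytes` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold `count` images of width x height pixels.
   bytes_per_pixel is an assumption; the post does not give the format. */
uint64_t stack_bytes(uint64_t count, uint64_t width, uint64_t height,
                     uint64_t bytes_per_pixel) {
    assert(bytes_per_pixel > 0);
    return count * width * height * bytes_per_pixel;
}
```

With 4 bytes per pixel, `stack_bytes(500, 1024, 900, 4)` gives 1,843,200,000 bytes (about 1.84 GB) for the whole stack, while a single image, `stack_bytes(1, 1024, 900, 4)`, is only 3,686,400 bytes. That gap is why streaming one image at a time is attractive.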

nathan1986 · ‎06-14-2012

Hi, ochensati,

It seems you want to run the three steps (loading, convolution, converting the buffer) as a pipeline. How many GPU devices do you have in your system? If only one, I suggest running the two kernels on one command queue. You can still run the image loading (CPU) and the kernel execution (GPU) in parallel: while the GPU processes the current image, the CPU loads the next one. You can use a cl_event for synchronization, like:

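loading();  // CPU: read the first image from disk

clEnqueueWriteBuffer(....0, NULL, &event);

for (i = 0; i < 500; i++) {

    clWaitForEvents(1, &event);  // wait until the image upload has finished

    clReleaseEvent(event);

    EnqueueFirstKernel();  // convolution

    EnqueueSecondKernel();  // push the result into the 3D buffer

    clFlush(queue);  // queue is your command queue; submit the work so the GPU starts

    loading();  // CPU: read the next image while the GPU is busy

    clEnqueueWriteBuffer(...CL_FALSE....0, NULL, &event);  // non-blocking upload

}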

You said that the GPU version is slower than the CPU one. Have you timed which part is slower than on the CPU, and is there any chance to optimize that kernel to make it more efficient?
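The host-side ordering of the loop above can be sketched in plain C. This is only an illustration, not working OpenCL code: `load_image`, `enqueue_upload`, `wait_upload`, and `enqueue_kernels` are hypothetical stand-ins that just record the order in which they are called. In real code they would wrap the disk read, a non-blocking `clEnqueueWriteBuffer`, `clWaitForEvents` plus `clReleaseEvent`, and the two kernel launches followed by `clFlush`. The sketch also assumes you want to skip the load after the last image.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real calls; each one only records itself. */
enum op { OP_LOAD, OP_UPLOAD, OP_WAIT, OP_KERNELS };

struct trace_entry { enum op op; int image; };

#define TRACE_CAP 4096
static struct trace_entry trace[TRACE_CAP];
static size_t trace_len = 0;

static void record(enum op op, int image) {
    assert(trace_len < TRACE_CAP);
    trace[trace_len].op = op;
    trace[trace_len].image = image;
    trace_len++;
}

/* CPU: disk -> host memory */
static void load_image(int i)      { record(OP_LOAD, i); }
/* stands in for a non-blocking clEnqueueWriteBuffer */
static void enqueue_upload(int i)  { record(OP_UPLOAD, i); }
/* stands in for clWaitForEvents followed by clReleaseEvent */
static void wait_upload(int i)     { record(OP_WAIT, i); }
/* stands in for both kernel launches followed by clFlush */
static void enqueue_kernels(int i) { record(OP_KERNELS, i); }

/* Processes n images. After the kernels for image i are submitted,
   the CPU loads image i+1, so disk I/O overlaps with GPU work. */
void run_pipeline(int n) {
    if (n <= 0) return;
    load_image(0);
    enqueue_upload(0);
    for (int i = 0; i < n; i++) {
        wait_upload(i);          /* image i is now on the device */
        enqueue_kernels(i);      /* GPU starts working on image i */
        if (i + 1 < n) {         /* nothing to load after the last image */
            load_image(i + 1);
            enqueue_upload(i + 1);
        }
    }
}
```

The point of the ordering is that `load_image(i + 1)` is called right after the kernels for image `i` have been submitted, so the CPU is reading from disk while the GPU computes.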

ochensati · ‎06-16-2012

Perfect. It is now so much faster. Thanks.

Circular buffer with two kernels