cancel
Showing results for 
Search instead for 
Did you mean: 

OpenCL

jadr
Adept I

Performance of zero-ing OpenCL buffers on device

Hello!

In my project, I'm running a chain of several kernels in a loop with millions of iterations, and I need to zero out a buffer of up to 5000 floats at the start of every iteration of this loop. I tried using clEnqueueFillBuffer() and clEnqueueCopyBuffer() using strategies described in the AMD Optimization guide. It works, but it does not yield satisfactory performance. So I also tried implementing a kernel for it:

__kernel void p1d_zero_buffer(__global float4* data)
{
size_t i = get_global_id(0);
data = (float4)(0.0f);
}

I used vectorizing to see if it'd be any faster, but to no avail. Zeroing out the buffer absolutely kills the performance of my app. I'm running this on a Radeon Pro WX 9100, so memory bandwidth is definitely not the issue.

The interesting thing is that if I only run this kernel in the loop, it performs very fast. When I run all the other kernels, without the zero-ing one, the performance is very good also. But once I combine zero-ing kernel with any of the other kernels in the loop, the zero-ing causes a huge performance drop.

Running this kernel through the CodeXL profiler, I found out that the cache hit ratio for this kernel is extremely low (0,00 for L1CacheHit, 6.90 for L2CacheHit). Is that the reason for the massive performance drop?

What are the best strategies for zero-ing out buffers in a fast, efficient way?

I'm attaching the profiler outputs for a deeper insight into the kernels' performance.

0 Likes
Reply
4 Replies
dipak
Staff
Staff

Re: Performance of zero-ing OpenCL buffers on device

The interesting thing is that if I only run this kernel in the loop, it performs very fast. When I run all the other kernels, without the zero-ing one, the performance is very good also. But once I combine zero-ing kernel with any of the other kernels in the loop, the zero-ing causes a huge performance drop.

Did you see the "Application Timeline Trace" in CodeXL to find out if there is any time gap / delay in the kernel execution timeline while all the kernels are running? It would be helpful if you can provide the timeline traces for these different scenarios.

Thanks.

0 Likes
Reply
jadr
Adept I

Re: Performance of zero-ing OpenCL buffers on device

Sure, I'm attaching it.

This is interesting, it would seem like zeroing the "data" buffer either by doing the copy or executing the zeroing kernel makes the "interpolate" and "poisson_solve" kernels take 10 times longer to finish. Any idea why would that happen?.

EDIT: It is also worth noting that both the "interpolate" and "poisson_solve" kernels utilize global-memory atomics (summing floats) to store their results in the "data" array, and they also happen to be affected by zeroing the buffer. Maybe that's the problem?

Thank you!

0 Likes
Reply
dipak
Staff
Staff

Re: Performance of zero-ing OpenCL buffers on device

Yes, data dependency might be the reason for this poor performance. As I can see from the timeline trace report, "p1d_interpolate_pg" kernel is taking most of the total execution time (>85%), whereas the zeroing kernel itself is not consuming much time  (< 1% of total time). 

0 Likes
Reply
sowson
Adept II

Re: Performance of zero-ing OpenCL buffers on device

@jadr

Please look at this:

__kernel void fill_kernel(int N, __const float ALPHA, __global float *X, int OFFX, int INCX)
{
int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0);
if(i < N) X[i*INCX + OFFX] = ALPHA;
}

And this:

void fill_offset_gpu(int N, float ALPHA, cl_mem_ext X, int OFFX, int INCX)
{
dim2 dimGrid;
dimGrid = opencl_gridsize(N);
opencl_kernel(opencl_fill_kernel[opencl_device_id_t], dimGrid, 10, &N, sizeof(cl_int), &ALPHA, sizeof(cl_float), &X.mem, sizeof(cl_mem), &OFFX, sizeof(cl_int), &INCX, sizeof(cl_int));
/*
#ifdef BENCHMARK
clock_t t;
t = clock();
#endif
clEnqueueFillBuffer(opencl_queues[opencl_device_id_t], X.mem, &ALPHA, sizeof(cl_float), OFFX*sizeof(cl_float), N*sizeof(cl_float),0, NULL, NULL);
#ifdef BENCHMARK
t = clock() - t;
double time_taken = ((double)t);
printf("%s\t%d\n", "fill_kernel", (int)time_taken);
#endif
*/

}

Source of example I took from: GitHub - sowson/darknet: Convolutional Neural Networks on OpenCL on Intel & NVidia & AMD & Mali GPUs... 

I hope that will help you! Thanks!

0 Likes
Reply