Hello!
In my project, I'm running a chain of several kernels in a loop with millions of iterations, and I need to zero out a buffer of up to 5000 floats at the start of every iteration of this loop. I tried using clEnqueueFillBuffer() and clEnqueueCopyBuffer() following the strategies described in the AMD Optimization Guide. They work, but they do not yield satisfactory performance. So I also tried implementing a kernel for it:
__kernel void p1d_zero_buffer(__global float4* data)
{
    size_t i = get_global_id(0);
    data[i] = (float4)(0.0f);   /* each work-item clears one float4 */
}
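For reference, a minimal host-side sketch of how such a zeroing pass could be enqueued (the queue, kernel, and buffer names are illustrative, and num_floats is assumed to be a multiple of 4 so each work-item clears one float4), with the clEnqueueFillBuffer() variant mentioned above shown for comparison:

#include <CL/cl.h>

/* Hypothetical helper: clears num_floats floats in data_buf using the
   p1d_zero_buffer kernel above. queue and zero_kernel are assumed to
   already exist; num_floats must be a multiple of 4. */
static cl_int zero_with_kernel(cl_command_queue queue, cl_kernel zero_kernel,
                               cl_mem data_buf, size_t num_floats)
{
    size_t global_size = num_floats / 4;   /* one float4 per work-item */
    cl_int err = clSetKernelArg(zero_kernel, 0, sizeof(cl_mem), &data_buf);
    if (err != CL_SUCCESS) return err;
    return clEnqueueNDRangeKernel(queue, zero_kernel, 1, NULL,
                                  &global_size, NULL, 0, NULL, NULL);
}

/* The clEnqueueFillBuffer() alternative, filling the same range with 0.0f. */
static cl_int zero_with_fill(cl_command_queue queue, cl_mem data_buf,
                             size_t num_floats)
{
    const float zero = 0.0f;
    return clEnqueueFillBuffer(queue, data_buf, &zero, sizeof(zero),
                               0, num_floats * sizeof(float), 0, NULL, NULL);
}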
I tried vectorizing (float4 writes) to see if it would be any faster, but to no avail. Zeroing out the buffer absolutely kills the performance of my app. I'm running this on a Radeon Pro WX 9100, so memory bandwidth is definitely not the issue.
The interesting thing is that if I run only this kernel in the loop, it performs very fast. When I run all the other kernels without the zeroing one, the performance is also very good. But once I combine the zeroing kernel with any of the other kernels in the loop, the zeroing causes a huge performance drop.
Running this kernel through the CodeXL profiler, I found that the cache hit ratio for this kernel is extremely low (0.00 for L1CacheHit, 6.90 for L2CacheHit). Is that the reason for the massive performance drop?
What are the best strategies for zero-ing out buffers in a fast, efficient way?
I'm attaching the profiler outputs for a deeper insight into the kernels' performance.
Did you look at the "Application Timeline Trace" in CodeXL to find out whether there is any time gap or delay in the kernel execution timeline while all the kernels are running? It would be helpful if you could provide the timeline traces for these different scenarios.
Thanks.
Sure, I'm attaching it.
This is interesting; it would seem that zeroing the "data" buffer, either by doing the copy or by executing the zeroing kernel, makes the "interpolate" and "poisson_solve" kernels take 10 times longer to finish. Any idea why that would happen?
EDIT: It is also worth noting that both the "interpolate" and "poisson_solve" kernels use global-memory atomics (summing floats) to store their results in the "data" array, and they are the ones affected by zeroing the buffer. Maybe that's the problem?
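(For context: OpenCL C has no built-in atomic add for floats in global memory, so summing floats atomically is usually emulated with a compare-and-swap loop roughly like the one below. This is only an illustrative sketch of the common pattern, not code taken from the "interpolate" or "poisson_solve" kernels.)

/* Common float atomic-add emulation via atomic_cmpxchg on the bit pattern.
   Illustrative only; the actual kernels may implement it differently. */
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                            old_val.u, new_val.u) != old_val.u);
}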
Thank you!
Yes, a data dependency might be the reason for this poor performance. As I can see from the timeline trace report, the "p1d_interpolate_pg" kernel takes most of the total execution time (>85%), whereas the zeroing kernel itself does not consume much time (<1% of the total).
@jadr
Please look at this:
__kernel void fill_kernel(int N, __const float ALPHA, __global float *X, int OFFX, int INCX)
{
    int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0);
    if (i < N) X[i*INCX + OFFX] = ALPHA;
}
And this:
void fill_offset_gpu(int N, float ALPHA, cl_mem_ext X, int OFFX, int INCX)
{
    dim2 dimGrid;
    dimGrid = opencl_gridsize(N);
    opencl_kernel(opencl_fill_kernel[opencl_device_id_t], dimGrid, 10, &N, sizeof(cl_int), &ALPHA, sizeof(cl_float), &X.mem, sizeof(cl_mem), &OFFX, sizeof(cl_int), &INCX, sizeof(cl_int));
    /*
    #ifdef BENCHMARK
    clock_t t;
    t = clock();
    #endif
    clEnqueueFillBuffer(opencl_queues[opencl_device_id_t], X.mem, &ALPHA, sizeof(cl_float), OFFX*sizeof(cl_float), N*sizeof(cl_float), 0, NULL, NULL);
    #ifdef BENCHMARK
    t = clock() - t;
    double time_taken = ((double)t);
    printf("%s\t%d\n", "fill_kernel", (int)time_taken);
    #endif
    */
}
I took this example from: GitHub - sowson/darknet: Convolutional Neural Networks on OpenCL on Intel & NVidia & AMD & Mali GPUs...
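The wrapper above relies on darknet-specific helpers (opencl_gridsize, opencl_kernel, cl_mem_ext). As a rough plain-OpenCL sketch of the same launch, with illustrative names, an assumed work-group size of 256, and queue/fill_kernel assumed to be created elsewhere:

#include <CL/cl.h>

static cl_int fill_offset_plain(cl_command_queue queue, cl_kernel fill_kernel,
                                cl_mem X, int N, float ALPHA, int OFFX, int INCX)
{
    cl_int err = CL_SUCCESS;
    err |= clSetKernelArg(fill_kernel, 0, sizeof(cl_int),   &N);
    err |= clSetKernelArg(fill_kernel, 1, sizeof(cl_float), &ALPHA);
    err |= clSetKernelArg(fill_kernel, 2, sizeof(cl_mem),   &X);
    err |= clSetKernelArg(fill_kernel, 3, sizeof(cl_int),   &OFFX);
    err |= clSetKernelArg(fill_kernel, 4, sizeof(cl_int),   &INCX);
    if (err != CL_SUCCESS) return err;

    /* Round the global size up to a multiple of the work-group size;
       with a 1D launch the kernel's 2D group indexing reduces to the
       plain global id, and the i < N check discards the padding. */
    const size_t local_size = 256;
    size_t global_size = ((size_t)N + local_size - 1) / local_size * local_size;
    return clEnqueueNDRangeKernel(queue, fill_kernel, 1, NULL,
                                  &global_size, &local_size, 0, NULL, NULL);
}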
I hope that will help you! Thanks!