In my project, I'm running a chain of several kernels in a loop with millions of iterations, and I need to zero out a buffer of up to 5000 floats at the start of every iteration. I tried clEnqueueFillBuffer() and clEnqueueCopyBuffer(), following the strategies described in the AMD Optimization Guide. They work, but performance is not satisfactory. So I also tried implementing a kernel for it:
__kernel void p1d_zero_buffer(__global float4* data)
{
    size_t i = get_global_id(0);
    data[i] = (float4)(0.0f);
}
I used float4 vectorization to see if it would be any faster, but to no avail. Zeroing out the buffer absolutely kills the performance of my app. I'm running this on a Radeon Pro WX 9100, so memory bandwidth is definitely not the issue.
The interesting thing is that if I run only this kernel in the loop, it performs very fast. When I run all the other kernels without the zeroing one, performance is also very good. But once I combine the zeroing kernel with any of the other kernels in the loop, the zeroing causes a huge performance drop.
Running this kernel through the CodeXL profiler, I found that its cache hit ratio is extremely low (0.00 for L1CacheHit, 6.90 for L2CacheHit). Is that the reason for the massive performance drop?
What are the best strategies for zeroing out buffers quickly and efficiently?
I'm attaching the profiler outputs for a deeper insight into the kernels' performance.