Well, each work-item does its part of the whole copy operation.
He managed to reach 293 GFLOPS (4000x4000x16x16x2 FLOP / 0.028 s) on an AMD 5870 (peak 2.72 TFLOPS), which is about 11% efficiency. Not bad. I got about 18% on an AMD 6950, but I used recent drivers and optimized not only global memory access but local memory access as well.
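For anyone who wants to redo the arithmetic, here is a minimal sketch in plain C. It assumes the 0.028 figure is a runtime in seconds (that is the only reading under which the numbers give 293 GFLOPS), and the function and parameter names are made up for illustration.

```c
#include <stdio.h>

/* Achieved throughput in GFLOPS.
   n_x * n_y * w_x * w_y * flop_per_step is the total work, mirroring the
   4000x4000x16x16x2 product quoted above. The meaning of the individual
   factors is an assumption; only their product matters here. */
double achieved_gflops(double n_x, double n_y, double w_x, double w_y,
                       double flop_per_step, double seconds)
{
    double total_flop = n_x * n_y * w_x * w_y * flop_per_step;
    return total_flop / seconds / 1e9;
}

/* Efficiency as a percentage of the theoretical peak of the card. */
double efficiency_percent(double gflops, double peak_gflops)
{
    return 100.0 * gflops / peak_gflops;
}

/* Convenience helper: print both numbers for one measurement. */
void print_efficiency(double gflops, double peak_gflops)
{
    printf("%.1f GFLOPS, %.1f%% of peak\n",
           gflops, efficiency_percent(gflops, peak_gflops));
}
```

Calling `achieved_gflops(4000, 4000, 16, 16, 2, 0.028)` and then `efficiency_percent(..., 2720.0)` reproduces the roughly 293 GFLOPS and 11% quoted above.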
Thanks for your quick reply.
The copy operation is my actual problem. I know that it is best when all work-items copy approximately the same number of items into local memory. But I have no idea which work-item should copy which items.
Maybe you could show me your implementation. I do not want to copy and paste from you, though; I just need some ideas on how these copy operations can be organized.
I assume you used Buffers, or did you use Image2D?
Yep, I use local buffers.
Let's assume the local worksize is N. Then the 1st work-item is responsible for filling elements of the local buffer with indexes 0, N, 2N and so on. The 2nd work-item is responsible for indexes 1, N+1, 2N+1 and so on. You get the idea.
Since neighbouring work-items then touch neighbouring addresses, you get no bank conflicts when writing to LDS and mostly coalesced access when reading from global memory.
Of course the code gets a little more complex if the local buffer size (input window size) is not a multiple of the local worksize.
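To make the index pattern concrete, here is a rough sketch of that strided copy. It is plain C that simulates one work-group on the CPU, not my actual kernel, and it treats the input window as a flat 1D range; a 2D window would need an extra row-stride calculation on the global side. All function names are made up for illustration. In a real kernel `lid` would come from `get_local_id(0)` and `local_size` from `get_local_size(0)`.

```c
#include <stddef.h>

/* Copy performed by ONE work-item with local id `lid`.
   It handles indexes lid, lid + N, lid + 2N, ... where N = local_size. */
void copy_one_work_item(const float *src, float *local_buf,
                        size_t lid, size_t local_size, size_t window_size)
{
    /* The loop bound is what handles a window size that is not a
       multiple of local_size: some work-items simply stop one
       iteration earlier than others. */
    for (size_t i = lid; i < window_size; i += local_size)
        local_buf[i] = src[i];
}

/* CPU stand-in for a whole work-group: run each work-item in turn. */
void copy_work_group(const float *src, float *local_buf,
                     size_t local_size, size_t window_size)
{
    for (size_t lid = 0; lid < local_size; ++lid)
        copy_one_work_item(src, local_buf, lid, local_size, window_size);
    /* In an OpenCL kernel, barrier(CLK_LOCAL_MEM_FENCE) is needed at
       this point, before any work-item reads local_buf. */
}

/* How many elements a given work-item ends up copying. */
size_t items_for_work_item(size_t lid, size_t local_size, size_t window_size)
{
    if (lid >= window_size)
        return 0;
    return (window_size - lid + local_size - 1) / local_size;
}
```

With a local worksize of 64 and a window of 150 elements, work-items 0 to 21 copy three elements each and work-items 22 to 63 copy two, so the load differs by at most one element per work-item.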
P.S. Besides, using textures here no longer brings large improvements (if any), as the compiler is now able to generate cached read instructions for "const restrict" buffers.
Thanks for the idea.
I think that is a basis I can start from.
The local buffer size is almost never a multiple of the local worksize. However, I will try to get this done, too.
If I fail, I will post again ;-)
Two short pieces of advice:
1) Read the AMD Accelerated Parallel Processing OpenCL Programming Guide, if you haven't already.
2) Learn to read ISA code (visible in Kernel Analyzer and APP Profiler). It will help you greatly in understanding why the code runs faster or slower.