Sounds like a race condition to me: if you have a chain of buffers, you need to be sure, for a particular buffer,
that the previous kernel has finished writing that buffer as output before the current kernel uses it as input.
Events are necessary in this case to ensure synchronization between kernels.
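For reference, a minimal sketch of what I mean by event synchronization between kernels. All the names (`kernelA`, `kernelB`, `buf`, the queue, the work size) are placeholders, not taken from your project; this only illustrates the wait-list mechanism:

```c
/* Hedged sketch: chaining two kernels with an event so that kernelB never
 * reads `buf` before kernelA has finished writing it. With an out-of-order
 * queue (or separate queues) this wait list is what enforces the ordering. */
#include <CL/cl.h>

void enqueue_chain(cl_command_queue queue,
                   cl_kernel kernelA, cl_kernel kernelB,
                   cl_mem buf, size_t global)
{
    cl_event a_done;

    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &buf);   /* A writes buf */
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global,
                           NULL, 0, NULL, &a_done);

    clSetKernelArg(kernelB, 0, sizeof(cl_mem), &buf);   /* B reads buf  */
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global,
                           NULL, 1, &a_done, NULL);     /* wait on A    */

    clReleaseEvent(a_done);
}
```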
Thank you for replying. The fact is that every kernel is synchronized with the previous one through events.
Moreover, the buffer in question is not the output of the previous one, but it's an independent buffer created as CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR.
If I remember correctly, could synchronization inside the kernel (an asynchronous transfer, even though I shouldn't need local memory in that kernel) be a solution? Otherwise I don't know why the kernel is not reading the buffer correctly.
Still sounds like a race condition. You also need to synchronize the transfers with the kernels, otherwise the kernel may still be writing to the buffer
while you are transferring it to the host. I would carefully re-check how everything is synchronized. If you use a single command queue, then you have
a guarantee about the ordering of kernels and transfers. In my work, I used separate command queues for the transfers to improve performance,
but I had to be very careful about synchronizing everything.
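To make that concrete, here is a hedged sketch of synchronizing a device-to-host transfer with the kernel that produces the data. The names (`compute_q`, `transfer_q`, `out_buf`, etc.) are placeholders. With a single in-order queue the wait list is redundant, but the moment the transfer lives on a separate queue it becomes essential:

```c
/* Hedged sketch: the read on the transfer queue must wait for the kernel on
 * the compute queue, otherwise the host may see a half-written buffer. */
#include <CL/cl.h>

void read_result(cl_command_queue compute_q, cl_command_queue transfer_q,
                 cl_kernel kernel, cl_mem out_buf,
                 void *host_ptr, size_t bytes, size_t global)
{
    cl_event kernel_done;

    clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &global,
                           NULL, 0, NULL, &kernel_done);

    /* Blocking read, gated on the kernel's completion event. */
    clEnqueueReadBuffer(transfer_q, out_buf, CL_TRUE, 0, bytes, host_ptr,
                        1, &kernel_done, NULL);

    clReleaseEvent(kernel_done);
}
```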
Ok, I'll check everything again and I'll try to modify something.
I do use a single queue, so every command is issued in order. The fact is that the project is, to put it simply, divided into two sections:
1) at the top I create the basics (platform, device, context, program, kernels, queue, work sizes and other things). Here I also create SOME OF the pinned buffers (pointers to already stored data that is only read and never changes during execution) and the device buffers (pointers to null). Then I push to the different kernels only the arguments that do not change during execution (most of them) using SetKernelArg;
2) at the bottom (while loop - if statements) I pack all the Enqueue commands (memory mapping, kernel execution, buffer copying in some cases) to speed up processing (batching). Here I sometimes have to redefine some of the kernel args (a few, which change during execution) and I have to create THE REST OF the pinned buffers (a few, related to the changed data).
After about the first ten kernels, I encounter the problem. The kernel that takes one of the pinned buffers created at the top as an input (so it is independent of previous kernel outputs) does not read the buffer correctly, as shown by the CodeXL debugger. Some of the values are real and plausible (so not random numbers or buffer-pointer garbage) but do not correspond to the original values, even though printing on the host shows everything is fine, and even though every kernel is synchronized with the previous one (in this case the kernel is not waiting for a buffer produced by the previous kernel anyway).
Now that you make me think about it: this pinned buffer suffering from the problem is pushed (SetKernelArg) at the top, since it is a non-changing buffer (I have to save as much computation time as I can, so I placed only the vital commands at the bottom), and the kernel is enqueued at the bottom after the previous one has finished. Does enqueuing a kernel mean "OK kernel, you have been enqueued, so NOW you take all your input buffers, read them, and start computing", or is it more like "kernel enqueued, fire! The input buffers were ALREADY bound earlier with SetKernelArg, so don't worry, start!"? Could this be a race condition?
The fact is that two kernels before the problematic one there is another pinned-buffer input created at the top, and no errors are found there: the values are read correctly inside the kernel, everything's OK. So I'm asking myself why this possible race condition does not show up there but only some kernels later. In between there are just device buffers which act as output/input between each other.
A question aside: using more than one queue is powerful, since I can divide the scheduled work, but I gathered it was suggested ONLY for queues that do not have to share data. The project processes some data with about ten kernels, extracts something, and then uses this output as input for another set of about eight kernels; everything is connected, outputs automatically become inputs down the cascade, all wrapped in host if-while statements to select and loop over the proper path. Do you think it is advisable to create at least two different queues for the two sets of packed kernels, provided correct synchronization between the queues is employed? Do you also think that this kind of implementation could speed up (maybe only a bit) the overall processing?
Thank you very much for your patience.
Woah, Marco, you shouldn't be changing kernel args after you enqueue the kernels.
Perhaps this is the cause.
As for queues, I would recommend using multiple queues. The Hawaii architecture has 8 hardware async queues - make use of them.
While memory is being sent via DMA between host and card, the card can be crunching kernels.
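A hedged sketch of that overlap, using two queues and double buffering: while chunk i is being crunched on the compute queue, chunk i+1 is uploaded on the transfer queue. All names (`dev_buf`, `chunk_bytes`, the queues) are placeholders, not from the project under discussion:

```c
/* Hedged sketch: double-buffered upload/compute overlap across two queues.
 * Events guard both directions: a kernel waits for its chunk's upload, and
 * an upload waits until the previous kernel using that slot has finished. */
#include <CL/cl.h>

void process_chunks(cl_command_queue compute_q, cl_command_queue transfer_q,
                    cl_kernel kernel, cl_mem dev_buf[2],
                    const char *host_data, size_t chunk_bytes,
                    size_t nchunks, size_t global)
{
    cl_event up[2]  = { NULL, NULL };   /* upload-complete events  */
    cl_event run[2] = { NULL, NULL };   /* kernel-complete events  */

    for (size_t i = 0; i < nchunks; ++i) {
        int s = (int)(i % 2);

        /* Don't overwrite this slot until the kernel using it is done. */
        clEnqueueWriteBuffer(transfer_q, dev_buf[s], CL_FALSE, 0,
                             chunk_bytes, host_data + i * chunk_bytes,
                             run[s] ? 1 : 0, run[s] ? &run[s] : NULL,
                             &up[s]);
        if (run[s]) clReleaseEvent(run[s]);

        /* Args are captured at enqueue time, so re-setting them per
         * iteration before each enqueue is the safe pattern. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf[s]);
        clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &global,
                               NULL, 1, &up[s], &run[s]);
        clReleaseEvent(up[s]);
    }

    clFinish(compute_q);
    if (run[0]) clReleaseEvent(run[0]);
    if (run[1]) clReleaseEvent(run[1]);
}
```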
But why can't I change the value of kernel args on the fly? I mean, for example, if I have to modify the value of an index used in the kernel as a loop counter (an int parameter), and its value depends on conditions checked each time on the host based on the final processing output, shouldn't I change its value? Should I create separate if statements depending on which path is followed? The same goes for some buffers: they point to a memory area corresponding to a certain vector, which can change during processing, but the same memory is pointed to. In any case, the buffer in question doesn't change; it is created from static values at the beginning and it remains constant.