Archives Discussions

binghy · 12-24-2015

Hi everybody, I'm working on a GPU signal-processing project. It involves multiplications, additions and FFTs/IFFTs. The data set is quite large, and I've developed 9 kernels in total; the most complicated are, unsurprisingly, those for the FFTs/IFFTs. I've run into a problem that I really don't understand. I create the context, the queue and the buffers for all the kernels in the proper way (most of them are device buffers; the first ones in the chain are pinned buffers used to read input data from host variables), and then I set the kernel arguments (apart from a few that I have to set just before enqueuing the kernel, since I use the same kernel twice during processing). Most of the processing works fine, but at a certain point I get incorrect data (compared to a similar algorithm written to run on the CPU). This is because one of the pinned buffers in the middle of the chain is read incorrectly on the GPU, but only for some values (debugging with CodeXL, not all the values match those of the host variable associated with the buffer). However, if I read the buffer back on the host, either before it is passed to the kernel or after the kernel has produced its output, the values of the input buffer are identical to the originals, with no problems.

In your opinion, what could be the source of the problem?

I'm very sorry for giving such a short (and maybe slightly confusing) description of the problem; unfortunately I can't share pieces of the code because of privacy rules.

Regards to everybody

Marco

boxerab · 12-24-2015

Sounds like a race condition to me: if you have a chain of buffers, you need to be sure that, for any particular buffer, the previous kernel has finished writing that buffer as output before the current kernel uses it as input.

Events are necessary in this case to ensure synchronization between kernels.

binghy · 01-07-2016

Thank you for replying. The fact is that every kernel is synchronized with the previous one through events.

Moreover, the buffer in question is not the output of the previous kernel; it's an independent buffer created with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR.

If I remember correctly, could synchronization inside the kernel be a solution (as for asynchronous transfers, even though I shouldn't be using local memory in that kernel)? Otherwise I don't know why the kernel is not reading the buffer correctly.

boxerab · 01-07-2016

Still sounds like a race condition. You also need to synchronize the transfers with the kernels; otherwise the kernel may still be writing to the buffer while you are transferring it to the host. I would carefully re-check how everything is synchronized. If you use a single in-order command queue (the default), then you have a guarantee about the sequence of kernels and transfers. In my work, I used separate command queues for the transfers to improve performance, but I had to be very careful about synchronizing everything.

binghy · 01-08-2016

Ok, I'll check everything again and I'll try to modify something.

I do use a single queue, so every command executes in order. The project is, simply put, divided into two sections:

1) at the top I create the basics (platform, device, context, program, kernels, queue, work sizes and so on). Here I also create SOME OF the pinned buffers (host pointer to already-stored data that is only read and does not change during execution) and the device buffers (null host pointer). Then I pass to the various kernels only the arguments that do not change during execution (most of them) using clSetKernelArg;

2) at the bottom (a while loop with if statements) I pack all the enqueue commands together (memory mapping, kernel execution, buffer copying in some cases) to speed up processing (batching). Here I sometimes have to reset some of the kernel args (a few, which change during execution), and I have to create THE REST OF the pinned buffers (a few, related to data that changes).

After roughly the first ten kernels I hit the problem. The kernel that takes one of the pinned buffers created at the top as input (so it is independent of previous kernel outputs) does not read the buffer correctly, according to the CodeXL debugger. Some of the values it reads are plausible real numbers (so not random numbers or values related to buffer pointers), but they do not correspond to the original values. This happens even though printing the data on the host shows everything is fine, and even though every kernel is synchronized with the previous one (although in this case the kernel is not waiting to read a buffer from the previous kernel).

Now that you've got me thinking about it: the pinned buffer with this problem is set as a kernel argument (clSetKernelArg) at the top, since it's a buffer that never changes (I have to save as much computation time as I can, so I placed only the vital commands at the bottom), and the kernel is enqueued at the bottom after the previous one has finished executing. Does enqueuing a kernel mean "OK, kernel, you have been enqueued, so NOW you take all the input buffers, read them, and start computing", or is it more like "kernel enqueued, fire! The input buffers were ALREADY read earlier by clSetKernelArg, so don't worry, start!"? Could this be a race condition?
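On the question of when a kernel actually reads its input buffers: as far as the OpenCL specification describes it, clSetKernelArg with a buffer argument only records which buffer object is bound, and the buffer contents are read when the kernel executes. The sketch below is a toy model of that behaviour in plain C, not OpenCL code; every Toy* type and toy_* function is a hypothetical stand-in invented for illustration. One caveat for CL_MEM_USE_HOST_PTR buffers: the specification allows an implementation to cache the host data in device memory, so host-side changes made outside a map/unmap pair are not guaranteed to be visible to the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, NOT the OpenCL API. */
#define TOY_CAP 4
typedef struct { int data[TOY_CAP]; int n; } ToyBuffer;   /* plays the role of cl_mem */
typedef struct { const ToyBuffer *input; } ToyKernel;      /* plays the role of cl_kernel */
typedef struct { const ToyBuffer *input; } ToyCommand;     /* one queued kernel launch */

/* Like clSetKernelArg with a buffer: records WHICH buffer, reads no data. */
static void toy_set_arg(ToyKernel *k, const ToyBuffer *b) { k->input = b; }

/* Like a buffer write: fills the buffer's own storage. */
static void toy_write(ToyBuffer *b, const int *src, int n) {
    assert(n >= 0 && n <= TOY_CAP);
    for (int i = 0; i < n; ++i) b->data[i] = src[i];
    b->n = n;
}

/* Like clEnqueueNDRangeKernel: captures the current argument binding. */
static ToyCommand toy_enqueue(const ToyKernel *k) {
    ToyCommand c = { k->input };
    return c;
}

/* Kernel execution: only here are the buffer contents read. */
static int toy_execute(const ToyCommand *c) {
    assert(c->input != NULL);
    int sum = 0;
    for (int i = 0; i < c->input->n; ++i) sum += c->input->data[i];
    return sum;
}

int toy_demo(void) {
    ToyBuffer buf = { {0}, 0 };
    ToyKernel k = { NULL };
    const int samples[3] = {10, 2, 3};

    toy_set_arg(&k, &buf);          /* "top" section: bind while buffer is still empty */
    toy_write(&buf, samples, 3);    /* data arrives after the binding */
    ToyCommand cmd = toy_enqueue(&k);
    return toy_execute(&cmd);       /* sums the data written after the binding */
}
```

In this model toy_demo() returns 15, the sum of the data written after the argument was bound, which is the "read at execution time" interpretation. If that matches the real runtime, binding an unchanging buffer once at the top should not by itself cause stale reads.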

The fact is that two kernels before the problematic one there is another kernel with a pinned input buffer created at the top, and it reads its values correctly; everything is OK. So I'm asking myself why this possible race condition does not show up there but only a few kernels later. In between there are only device buffers, which act as outputs and inputs between kernels.

A side question: using more than one queue is powerful, since it divides up the scheduling work, but my understanding was that it is recommended ONLY for queues that do not have to share data. The project has to process some data with about ten kernels, extract something, and then use this output as input for a different set of about eight kernels. Everything is connected: outputs automatically become inputs down the cascade, all wrapped in host if/while statements that select and loop over the proper path. Do you think it is advisable to create two queues, at least one for each set of packed kernels, provided the queues are correctly synchronized? Do you also think this kind of implementation could speed up the overall processing (maybe only a bit)?

Thank you very much for your patience.

Marco

boxerab · 01-08-2016

Woah, Marco, you shouldn't be changing kernel args after you enqueue the kernels.

Perhaps this is the cause.

As for queues, I would recommend using multiple queues. The Hawaii architecture has 8 hardware async queues - make use of them.

While memory is being sent via DMA between host and card, the card can be crunching kernels.

binghy · 01-11-2016

But why can't I change the value of kernel args on the fly? For example, suppose I need (as is the case here) to modify an index used in the kernel as a loop counter (an int parameter), and its value depends on conditions checked each time on the host, based on the final processing output. Shouldn't I change its value? Should I create separate if statements depending on which path is followed? The same goes for some buffers: they point to a memory area corresponding to a certain vector whose contents can change during processing, but the memory location stays the same. In any case, the buffer in question doesn't change: it is created from static values at the beginning and remains constant.
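For what it's worth, my reading of the OpenCL specification is that clSetKernelArg copies the argument value, and that each enqueue uses whatever values are set at the moment of the enqueue call, so changing a scalar argument between two enqueues of the same kernel is a common pattern. The sketch below is a toy model of that reading in plain C, not OpenCL code; the Toy* types, the toy_* functions and the host_decision values are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, NOT the OpenCL API. */
typedef struct { int loop_count; } ToyKernel;    /* current argument value */
typedef struct { int loop_count; } ToyCommand;   /* value captured at enqueue */

#define TOY_QUEUE_CAP 8
typedef struct { ToyCommand cmds[TOY_QUEUE_CAP]; size_t len; } ToyQueue;

/* Like clSetKernelArg with an int: stores a copy of the value. */
static void toy_set_arg(ToyKernel *k, int value) { k->loop_count = value; }

/* Like clEnqueueNDRangeKernel: the queued launch keeps its own copy. */
static void toy_enqueue(ToyQueue *q, const ToyKernel *k) {
    assert(q->len < TOY_QUEUE_CAP);
    q->cmds[q->len++].loop_count = k->loop_count;
}

/* Fills out[0..2] with the value each queued launch would see. */
void toy_batch_demo(int out[3]) {
    ToyQueue q = { .len = 0 };
    ToyKernel k = { 0 };
    const int host_decision[3] = {4, 7, 2};  /* hypothetical per-iteration values */

    for (int i = 0; i < 3; ++i) {
        toy_set_arg(&k, host_decision[i]);   /* change the arg between enqueues */
        toy_enqueue(&q, &k);
    }
    toy_set_arg(&k, 99);                     /* later change: queued launches keep theirs */

    for (size_t i = 0; i < q.len; ++i) out[i] = q.cmds[i].loop_count;
}
```

In this model the three queued launches see 4, 7 and 2, and the later change to 99 affects only future enqueues. What the model does not cover is changing the contents of a buffer that a queued kernel will read; that still needs explicit synchronization (events, or map/unmap for host-pointer buffers).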

Uncorrect kernel buffer reading