boxerab
Challenger

Data corruption with multiple command queues

RX 470

Windows 10

I have one parent kernel and three child kernels, where the child kernels are independent of each other, but they can only start executing once the parent kernel completes.

The parent runs on command queue 0, and the three child kernels run on command queues 1,2 and 3 respectively.

I use events to synchronize parent and children.

There seems to be a race condition or bug where data sometimes gets corrupted at the end of the pipeline.

It happens roughly once every 10 runs of my application.

If I put all kernels on command queue 0, the problem goes away.

As this design is at the core of my application, it is hard to create a reproducer.

So I just want to file this vague bug report. Perhaps this kind of scenario can be added to AMD's unit tests.

Thanks,

Aaron

0 Likes
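
Editor's sketch: the dependency described in the post above can be modeled on the host without a GPU. What follows is a toy model in Python, not OpenCL code and not the poster's application. Single-worker executors stand in for in-order command queues, a `threading.Event` stands in for the parent's completion event, and every name (`run_pipeline`, `shared`, `results`) is hypothetical. It assumes each child only reads the parent's output and writes to its own private slot.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(n=8):
    """Toy model: one parent task, three independent children, one queue each."""
    shared = [0] * n                  # stands in for the buffer the parent writes
    results = [None, None, None]      # one private output slot per child
    parent_done = threading.Event()   # stands in for the parent's completion event

    def parent():
        for i in range(n):
            shared[i] = i * i
        parent_done.set()             # signal only after every write has finished

    def child(slot, scale):
        parent_done.wait()            # a child must not read before the parent completes
        results[slot] = [scale * v for v in shared]  # shared input, private output

    # A single-worker executor runs tasks in submission order, like an in-order queue.
    queues = [ThreadPoolExecutor(max_workers=1) for _ in range(4)]
    try:
        # Children are submitted first on purpose: correctness has to come from
        # the event, not from the order in which work happens to be submitted.
        futures = [queues[k + 1].submit(child, k, k + 1) for k in range(3)]
        futures.append(queues[0].submit(parent))
        for f in futures:
            f.result()                # re-raises any exception from a task
    finally:
        for q in queues:
            q.shutdown()
    return results
```

In OpenCL itself, the analogous pattern passes the parent's event in each child's event wait list. The OpenCL specification also requires a `clFlush` (or a blocking call that flushes implicitly) on the queue that owns an event before a command in a different queue waits on it. If the children wrote to overlapping data, waiting on the parent alone would not be enough to keep them from racing with each other.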
15 Replies
dipak
Big Boss

Thank you for reporting it and sharing your suggestions.

If you think it's a driver bug, not an application-side synchronization issue, then it would be helpful if you could provide us with a reproducible test-case.

Regards,

0 Likes

1. There really wouldn't be a need to synchronize parent and child, or at least in the simple case it's just a matter of waiting for the parent to finish prior to starting the children.

2. Are the child kernels accessing shared data? If so, how are you ensuring coherent access?

There are a lot of unknowns here, and I would be somewhat willing to bet this is an application sync issue unless more details are provided to explain the use case.

0 Likes

cgrant78@netzero.com Yes, it's looking like you are right: some type of sync issue in my code.

0 Likes

I can now confirm that this was a problem with my code: corruption has disappeared now. 

Thanks.

0 Likes

Nope. Looks like a driver issue, still. Will work on a reproducer.

0 Likes

The problem only happens when I am running two 470s.

A single 470 never has this problem.

I have analysed my code very carefully - all seems to be correct.

Is there someone at AMD that I can PM with more details about my code ?

Thanks.

0 Likes

Hi Aaron,

I've sent a private message. Please check your inbox.

Regards,

0 Likes

Thanks a lot, Dipak. It will take me some time to isolate the code from my main application.

0 Likes

Sure...take your time.

0 Likes

Also, CodeXL reports leaks when I use user events to start all command queues simultaneously. The same happens on the R7 240 and the HD 7870. Normal events between commands work flawlessly. I think either user event handling has a problem on 64-bit Windows 10, or CodeXL 2.2.733 has something to fix.

0 Likes

Could you please provide a reproducible test-case?

0 Likes

Yes, I also see problems with user events and CodeXL.

Even though my code runs fine, CodeXL lists errors for the calls where I create the user events.

i.e. clCreateUserEvent returns 0x0000000000001234AED f

0 Likes

Also, when I run with GPU performance counters, my application does not run at all.

Using application timeline trace, it does.

0 Likes

Hi Aaron,

I can see your other post (Bug: CodeXL falsely reports that clCreateUserEvent does not succeed), where you already reported the clCreateUserEvent issue to the CodeXL team. Please report it once again if you still observe the same in the latest version of CodeXL.

FYI:

Now that CodeXL is part of the GPUOpen initiative, please report all CodeXL-related issues here: https://github.com/GPUOpen-Tools/CodeXL/issues

Regards,

0 Likes

Thanks, Dipak. I just now reported these issues.

0 Likes