Yes, I agree with you. I also thought it was just the optimization pointed out in the optimization guide, i.e. that "this can help keep the GPU busy with kernel execution and DMA transfers".
Anyway, let me check with the OpenCL team. I believe they can provide more insight on this.
Thanks again for providing these valuable inputs.
Thanks.
I ran the latest attached code on my setup and got findings similar to what you describe above. It does indeed seem that synchronization using events has no effect on the ordering.
Also, I checked with the OpenCL team. The code looks good to them, and they have asked me to create a ticket so the issue can be investigated in detail. I'll create the ticket and include these test results. I'll let you know if I have any updates on this.
Thanks.
Good to hear you can reproduce. Does that mean you also require more than the single flush before reading to see correct results? Do you have any idea of a timescale on that ticket?
Thanks for all your continuing help with this!
Does that mean you also require more than the single flush before reading to see correct results?
In my case, a single flush before the reading is enough to produce the correct result.
When I tried the macros, I observed the following outputs and event orders:
I believe a clFinish before the read should work without any other clFlush. In that case, passing CL_TRUE to enqueueReadBuffer would effectively be a no-wait operation, since the queue would already be drained. I know these approaches may not be as efficient as event/barrier-based synchronization, but they can be used as a workaround until a fix is available.
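For reference, here is a minimal sketch of what I mean, using the plain C API; the names (queue, buf, size, host_ptr) are just placeholders for whatever your code actually uses:

    #include <CL/cl.h>

    /* Workaround sketch: drain the queue with clFinish, then do the blocking
     * read.  Because the queue is already idle by then, the CL_TRUE wait in
     * clEnqueueReadBuffer should return almost immediately. */
    static cl_int read_after_finish(cl_command_queue queue, cl_mem buf,
                                    size_t size, void *host_ptr)
    {
        cl_int err = clFinish(queue);   /* wait for all prior commands */
        if (err != CL_SUCCESS)
            return err;

        return clEnqueueReadBuffer(queue, buf, CL_TRUE, /* offset */ 0, size,
                                   host_ptr, 0, NULL, NULL);
    }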
Do you have any idea of a timescale on that ticket?
Sorry, it's difficult to provide any timeline at this moment.
Sadly, a finish before the read results in correct ordering but incorrect output (like the single flush):
1007736281121096(mapXPostEventStart)
1007736281153370(mapSpkCntPreEventStart)
1007736281307106(fillXPostEventStart)
1007736281312386(fillXPostEventEnd)
1007736281312746(fillInSynEventStart)
1007736281313066(fillInSynEventEnd)
1007736281317066(buildNeuronKernelEventStart)
1007736281318546(buildNeuronKernelEventEnd)
1007736281322146(buildPresynapticKernelEventStart)
1007736281323546(buildPresynapticKernelEventEnd)
1007736281497466(writeSpkCntPreEventStart)
1007736281502786(writeSpkCntPreEventEnd)
1007736281599546(updatePresynapticEventStart)
1007736281604866(updatePresynapticEventEnd)
1007736281605226(updateNeuronsEventStart)
1007736281605586(updateNeuronsEventEnd)
1007736281726036(readXPostEventStart)
1007736281729036(readXPostEventEnd)
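(For context, these are the per-event timestamps you get from clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE; a rough sketch of how they can be collected, with an illustrative helper name:)

    #include <CL/cl.h>
    #include <stdio.h>

    /* Illustrative helper: print start/end timestamps for one profiled event.
     * Assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that
     * evt was returned by the corresponding enqueue call. */
    static void print_event_times(const char *name, cl_event evt)
    {
        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("%llu(%sStart)\n%llu(%sEnd)\n",
               (unsigned long long)start, name,
               (unsigned long long)end, name);
    }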
So the only workaround we currently have is to flush between every kernel launch (sketched below), which is very detrimental to performance. However, I totally understand with respect to the timeline; if you could keep me updated via this thread, that would be great.
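(Concretely, something like the following per-launch flush, sketched with the C API and placeholder names for the queue and kernels:)

    #include <CL/cl.h>

    /* Sketch of the heavy-handed workaround: flush the queue after every
     * kernel launch so each command is submitted to the device before the
     * next one is enqueued.  queue, kernels, n and gws are placeholders. */
    static cl_int launch_with_flushes(cl_command_queue queue,
                                      cl_kernel *kernels, size_t n, size_t gws)
    {
        for (size_t i = 0; i < n; ++i) {
            cl_int err = clEnqueueNDRangeKernel(queue, kernels[i], 1, NULL,
                                                &gws, NULL, 0, NULL, NULL);
            if (err != CL_SUCCESS)
                return err;

            /* The flush between launches is what makes the ordering come out
             * right on the affected drivers, at a noticeable performance cost. */
            err = clFlush(queue);
            if (err != CL_SUCCESS)
                return err;
        }
        return CL_SUCCESS;
    }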
Sure, I'll let you know if I get any update about this issue.
a finish before the read results in correct ordering but incorrect output (like the single flush)
This is another unexpected behavior. Did you observe it on Windows or Linux? Please let me know your setup details. I'll mention this information in the related ticket.
I think it would be really helpful if you could provide a profiler report for these cases, i.e. with a single clFlush or with clFinish.
Thanks.
We can reproduce this on both a Linux system with a Radeon 5700 XT and GPU PRO 20.30 drivers, and a Windows system with a Radeon RX 580 and 20.5.1 drivers. If I can get the profiler to work, I'll post the results here.
Thanks for the information.
Just FYI.
It looks like more recent drivers are available for both Windows (Adrenalin 20.9.1 WHQL and 20.9.2 Optional) and Linux (AMDGPU-Pro 20.40). As it is always recommended to verify an issue with the latest drivers, I would suggest trying those recent drivers to see whether you observe anything different.
Please note, I tested with Adrenalin 20.9.1.
Thanks.
It's going to take us a little longer to get our Linux machine upgraded but, on Windows with an RX 580 and 20.9.2 drivers, the behaviour we see is unchanged, i.e. a single flush or finish before the read still does not produce correct results. What GPU are you testing on?
Thanks