Hi, I have the following sync issue:
I have a kernel that writes output to a buffer. After calling the kernel I issue a clFinish (I also tried clWaitForEvents, clEnqueueBarrier, etc.) on that queue and then do a blocking read of that buffer to host memory. If the device is under heavy load (3 other GPU BOINC projects running), there is a low chance (less than 1%) that the results of the last several thread-blocks (10-70, proportional to the number of CUs) have not been written by the time the copy occurs. When this happens, the profiling info for the kernel event's end time is also bogus. If I loop and retry the read until the whole buffer is non-zero (it was initialized to 0 for this experiment), the data eventually shows up after 1-50 ms. My guess is that the global cache is not flushed to memory at the end of the kernel, because the problem occurs at 128-byte granularity while my writes are not 128-byte aligned (so there are some warps whose output is only partially visible).
Workaround: The problem disappears if I enqueue a dummy kernel that uses the same buffer, even if the dummy kernel doesn't read from that buffer. A memory read issued after the dummy kernel finishes returns correct and complete data.
Is this a known issue? I'll provide "streamlined" repro code if needed.
win 7 pro x64
260X or 290X
catalyst 14.9 or 14.11.2
32bit or 64bit executable
heavy GPU load from other tasks exposes the race condition
Thanks for reporting this. Until now I was not aware of this kind of issue.
As clFinish is a synchronization point, the buffer should be updated properly before the data is copied. From your workaround, it seems that launching the (dummy) kernel is acting as the synchronization point here. It would be a great help if you could provide us the repro test code (host + kernel) along with the workaround version.
I have a few queries:
1) Does the issue occur always or randomly?
2) As you mentioned it occurs when the device is under heavy load. So can I assume in normal circumstances there is no synchronization problem?
3) Does it also occur with OpenCL 1.2?
1) Randomly; the chance increases with load. The frequency also seems to depend on CPU load. I've seen up to a 1% chance per kernel launch.
2) Yes, normally (when this app is the only OpenCL app running) it does not occur. It also does not occur if the card is being used by a game at the same time the OpenCL app is running.
3) Yes. I was wrong in the title - I'm using OpenCL 1.2 AMD-APP (1573.4).
As a workaround I've replaced the "dummy" kernel with a loop that does a read plus clFinish until it's satisfied with the contents of the buffer.
Tried to make a slim repro, but simple kernels worked fine.
Did some testing on a GTX 970 and a GTX 670, trying to repro the issue - everything worked fine over several hours of runtime.
A slim repro is proving to be trickier...
Thanks for your reply and efforts. I also tried a simple test but failed to recreate the issue. Given the nature of this issue, it would be very difficult for us to trace. If you find a repro test case, please share it with us.
More info about this repro: it seems to have stopped reproducing after a user rolled back to the 13.12 driver. Verified that 14.4 and newer fail, but I only have access to a 290X, which is not supported by 13.12, so I can't verify this myself.
In this regard, if you need any support from our side, or anything needs to be tested at our end, please let us know. BTW, do you see the same issue with the latest Catalyst Omega driver and APP SDK 3.0 beta too?
Edit: I was confused. I've tried both 2.9 and the 3.0 beta released in December.
I have not tried 3.0 beta yet. Will try today. Currently on 14.12.
If I provide a fairly large amount of .cl code (50 KB) with a C wrapper to run it, will someone on your side be able to look at the problem attentively? If yes, please send an email with instructions on whom to contact.
So far I've instrumented the kernel to make sure I don't write outside of global or local memory (automatically, and then going through the instrumented kernel manually to verify it has correct bounds checks), as well as instrumenting the CPU code to check for memory overruns. It runs fine when it's the only thing running on the AMD GPU, and the results are stable (running the code several times produces binary-identical output of float values). The CPU code is single-threaded with a clFinish call after each kernel or memory operation.
I sent you my email id via a private message. You can send the code there. Please provide some description of how to run/test the code and validate the results.