Okay, I wrote a big long-winded reply, but I think I've solved my problems now. No more access violations (which I thought were out-of-bounds accesses) and no deadlocks (or at least, not for the last few hours) on GPU or CPU.
My new setup:
100 host threads, 100 queues (one each)
N kernels, all instanced per-thread (no cross-queue/cross-thread kernels). All kernels on a thread use the same queue.
All writes are now non-blocking
All executions are non-blocking.
All reads are blocking.
[then the kernel and data are disposed]
The problem, I realised in the end, was that whenever I made a write or execution non-blocking, the data on the queue for that kernel wasn't ready. PERHAPS more threads and queues just highlighted a problem that was already there, or, as I read somewhere, having more than one queue for a context warrants more clFinish calls. (A clFinish before execution also worked, but clFlush still resulted in access violations.)
Anyway, now, for all my non-blocking writes I store the cl_event...
Before execution (though after clSetKernelArg) I do clWaitForEvents on all the events relevant to this kernel/queue.
All my crashes and deadlocks have gone away. I have NO host-side mutexes related to OpenCL, and execution is faster.
I wrongly assumed an execution (blocking or non-blocking) would ensure that the relevant data write on the queue would be finished, but it seems not.
From what I infer from your post, the bug was due to a misunderstanding of asynchronous execution and had nothing to do with AMD's OpenCL runtime. Please confirm.
And yes, good luck, and thanks for taking the time to post your experience here!
It can be a great time-saver for someone...
And I hope your code runs for many more hours to come and then one day terminates normally...!
Does all this mean that I can do the operations below?
1) Create a single context.
2) Create a single ordered queue for all kernels.
3) Create an ordered queue for each write/read operation. So if I have N read and M write operations, I create N+M queues.
4) From an OpenMP body, simultaneously do:
5) All writes/reads are done, so I can start computing on the GPU:
6) Do a very similar thing for reading the results as in step 4.
7) Repeat from step 4.
This way, can I get full PCI Express read/write bandwidth?
Right now I'm using only a single ordered queue for all read/write/compute operations, with a single singleQueue.clFinish() at the very end. This gets me only 1.4 GB/s for read/write buffer operations. I'm kind of hoping for 4 GB/s to 5 GB/s on my Gigabyte 990-xa-ud3 motherboard.
Yes, you can do that. You just need to ensure that all chunks are finished before starting any computation that uses the written data. But to achieve peak transfer rate you need to use a pre-pinned buffer. Look at the BufferBandwidth example to see how to achieve this peak transfer rate.
Thank you. Now I have a problem: one of the buffers is not read when doing concurrent reads. CodeXL shows holes: 3 simultaneous reads instead of 4, but only rarely. Maybe the drivers are bugged? All calls return CL_SUCCESS.
Edit: just tested the example. clEnqueueMapBuffer shows 6.5 GB/s, so I should replace clEnqueueWriteBuffer with clEnqueueMapBuffer. (Maybe it uses DMA?)
CPU read shows 1.3 GB/s, which must be the same thing as clEnqueueReadBuffer and the slowness of my implementation.
Note: even the AMD examples make CodeXL throw lots of leak errors. I think CodeXL produces a lot of false positives.