soylentgraham
Adept II

Re: Multiple contexts parallel allocating or writing to memory of a single device

Okay, I wrote a big long-winded reply, but I think I've solved my problems now. No more access violations (which I had thought were out-of-bounds accesses) and no deadlocks on GPU or CPU (or at least, not for the last few hours).

My new setup:

100 host threads, 100 queues (one each)

N kernels, all instanced per-thread (no cross-queue/cross-thread kernels). All kernels on a thread use the same queue.

All writes are now non-blocking

All executions are non-blocking.

All reads are blocking.

[then kernel and data are disposed]

In the end I realised my problem: whenever I made a write or execution non-blocking, the data on the queue for that kernel wasn't ready. Perhaps more threads and queues just highlighted a problem that was already there, or perhaps it's related to something I read about having more than one queue per context warranting more clFinish calls. (A clFinish before execution also worked, but a clFlush still resulted in access violations.)

Anyway, now, for all my non-blocking writes I store the cl_event...

Before execution (though after clSetKernelArg) I do a clWaitForEvents on all the events relevant to this kernel/queue.

All my crashes and deadlocks have gone away. I have NO mutexes host-side related to OpenCL, and execution is faster.

I wrongly assumed an execution (blocking or non-blocking) would ensure the relevant data write on the queue had finished, but it seems not.

himanshu_gautam
Grandmaster

Re: Multiple contexts parallel allocating or writing to memory of a single device

Oh, Thanks!

From what I infer from your post, the bug was due to a misunderstanding of asynchronous execution and had nothing to do with AMD's OpenCL runtime. Please confirm.

And yes, good luck, and thanks for taking the time to post your experience here!

It can be a great time-saver to someone...

And, I hope your code runs for many more hours to come and then one day terminates normally...!

Best,

Bruhaaa........

tugrul_512bit
Adept III

Re: Multiple contexts parallel allocating or writing to memory of a single device

Does all this mean that I can do the operations below?

1) Create a single context.

2) Create a single in-order queue for all kernels.

3) Create an in-order queue for each write/read operation. So if I have N read and M write operations, I create N+M queues.

4) Start.

5) From an OpenMP parallel body, simultaneously do:

chunk0
{
    clEnqueueWriteBuffer(queue0, buffer0, ...);
    clFinish(queue0);
}

chunk1
{
    clEnqueueWriteBuffer(queue1, buffer1, ...);
    clFinish(queue1);
}

...

chunkN
{
    clEnqueueWriteBuffer(queueN, bufferN, ...);
    clFinish(queueN);
}

6) All writes/reads are done, so I can start computing on the GPU:

clEnqueueNDRangeKernel(queueCompute, blabla);
clFinish(queueCompute);

7) Do a very similar thing for reading the results as in step 6.

8) Repeat from step 5.

This way, can I get the full PCI-Express read/write bandwidth?

Right now I'm using only a single in-order queue for all read/write/compute operations, with a single clFinish() on that queue at the very end. This lets me reach only 1.4 GB/s for read/write buffer operations. I'm kind of hoping for 4-5 GB/s on my Gigabyte 990XA-UD3 motherboard.

nou
Exemplar

Re: Multiple contexts parallel allocating or writing to memory of a single device

Yes, you can do that. You just need to ensure that all chunks are finished before moving on to the compute step that wants to use the written data. But to achieve the peak transfer rate you need to use pre-pinned buffers; look into the BufferBandwidth example to see how to reach that peak transfer rate.

tugrul_512bit
Adept III

Re: Multiple contexts parallel allocating or writing to memory of a single device

Thank you. Now I have a problem: one of the buffers is not read when doing concurrent reads. CodeXL shows holes: 3 simultaneous reads instead of 4, but only rarely. Maybe the drivers are buggy? All calls return CL_SUCCESS.

Edit: just tested the example. clEnqueueMapBuffer shows 6.5 GB/s, so I should exchange clEnqueueWriteBuffer for clEnqueueMapBuffer. (Does it use DMA, maybe?)

The CPU read shows 1.3 GB/s, which must be the same path as clEnqueueReadBuffer and the reason my implementation is slow.

Note: even the AMD examples make CodeXL throw lots of leak errors. I think CodeXL produces a lot of false positives.

nou
Exemplar

Re: Multiple contexts parallel allocating or writing to memory of a single device

Download the AMD OpenCL programming guide and read the whole of chapter 5.6.
