Archives Discussions

empty_knapsack · ‎06-19-2009

Running a simple CAL application like:

calResMap(...)

...

calResUnmap(...)

calCtxSetMem(...)

calCtxRunProgram(...)

while (calCtxIsEventDone(p->ctx, e) == CAL_RESULT_PENDING);

is no problem at all.

But if we have two (or more) CAL devices and try to run two instances at the same time (with different contexts, of course), performance is very poor because calResMap() always waits for kernel execution to complete. And since calResMap() doesn't take any context reference (it's global), it doesn't care which thread is currently executing a kernel and which one is trying to map memory; it just blocks both threads.

Is there any way to avoid this? Is there an async calResMap function? Are there any plans to implement one?

Atm, the only solution I've got is to run only one GPU calculation thread per process and start as many processes as we have GPUs in the system. This solution isn't pretty, but at least it works.
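The one-process-per-GPU workaround described above might look roughly like this on a POSIX system. It is only a sketch under assumptions: `gpu_worker` and `run_one_process_per_gpu` are hypothetical names, the worker body is a placeholder where the real CAL calls (open device, create context, map, run, unmap) would go, and fork() is an assumption about the platform. On Windows the same idea would need separately launched processes.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-GPU job. A real worker would open CAL device
 * `device_index`, create its context and loop over map / run / unmap.
 * This placeholder only reports success. */
int gpu_worker(int device_index)
{
    (void)device_index;
    return 0;
}

/* Fork one child process per GPU so that every GPU is driven from its own
 * process. Returns the number of children that exited successfully,
 * or -1 if not all children could be started. */
int run_one_process_per_gpu(int gpu_count, int (*worker)(int))
{
    int started = 0, succeeded = 0;

    assert(gpu_count >= 0);
    for (int i = 0; i < gpu_count; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* could not start more workers */
        if (pid == 0)                   /* child: do the work, then leave */
            _exit(worker(i) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
        started++;
    }
    /* Parent: reap every child that was started. */
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == EXIT_SUCCESS)
            succeeded++;
    }
    return started == gpu_count ? succeeded : -1;
}
```

In this layout each child would presumably initialise CAL itself after the fork, so that no runtime state is shared between the processes.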

The second question is more general: is AMD/ATI just abandoning ATI Stream? This forum doesn't look like a live one: tons of unanswered questions and nearly zero activity from AMD/ATI. There was one guy (Micah) who was at least trying to answer some questions, but I haven't seen him here for over a month. With this level of "support" there is no future for ATI Stream, imho.

Fithik1242 · ‎06-26-2009

Hi.

I have only one device, so I don't have this problem and unfortunately can't answer your first question.

But your second question is very relevant to me. I posted my own question about IL output arrays without a global buffer on 15.06.2009, but nobody has answered yet.

gaurav_garg · ‎06-27-2009

I think the best way to transfer data is to use DMA, or maybe a copy shader, to copy data from a host resource to a local resource and vice versa. Both DMA and a copy shader can be asynchronous.

Still, in this case you need to map the host resource to copy data from the CPU pointer, but I think the map-unmap time would be much lower.

empty_knapsack · ‎06-27-2009

Well, map-unmap time isn't that important for me, as there is only 16-128 KB of data to copy while the kernel runs for 200 ms minimum. The real problem is the calResMap sync issue.

Without resorting to multiple processes, it's possible to do:

1. map input data #1

2. map input data #2

3. start kernel on gpu #1

4. start kernel on gpu #2

5. wait for both of them to complete

6. map output data #1

7. map output data #2

(The funny thing is that you'll need two threads to wait for the kernels, because calCtxIsEventDone() blocks the thread until the kernel has finished executing, while the documentation says exactly the opposite. I already wrote about this in another unanswered topic. Anyway.)

It works, but there are two more problems:

1. While inputs/outputs are being mapped, ALL GPUs sit idle doing nothing. That's not much of a problem for small inputs/outputs but a really big problem for large data sets.

2. Execution speed is limited by the slowest GPU. It's no problem if there are only similar GPUs in the system (say, one 4870x2), but it is a problem when mixing anything else (like the 4770 + 4850 I have atm). In theory, it can be resolved by carefully selecting the input data size so that each kernel runs for (almost) the same time even on different GPUs. But in practice it's way too complex.
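The "carefully selecting input data size" idea from point 2 can be sketched as a proportional split. This is an illustration only, not something from the CAL SDK: `split_work` is a hypothetical helper, and it assumes you already have a relative throughput figure for each GPU (for example from a short timing run), which is the hard part in practice.

```c
#include <assert.h>
#include <stddef.h>

/* Split `total` work items across `n` devices in proportion to each
 * device's throughput, so that all kernels finish at about the same time.
 * `throughput` holds caller-supplied relative speeds (all > 0).
 * Device i receives the items between two rounded-down cumulative
 * boundaries, so the shares always add up to exactly `total`. */
void split_work(size_t total, const double *throughput, size_t n,
                size_t *share)
{
    double sum = 0.0, running = 0.0;
    size_t begin = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(throughput[i] > 0.0);
        sum += throughput[i];
    }
    for (size_t i = 0; i < n; i++) {
        running += throughput[i];
        /* End of device i's range, rounded down to a whole item. */
        size_t end = (size_t)((double)total * (running / sum));
        share[i] = end - begin;
        begin = end;
    }
}
```

With made-up relative speeds of 1.0 and 3.0 and 1000 items, the slower device gets 250 items and the faster one 750. Real kernels rarely scale this linearly with input size, so the split would likely need re-measuring per workload.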

Anyway, the calResMap behaviour is just the result of poor design by the ATI team, and it can be improved with some work on ATI's side, not ours. Unfortunately I can see a clear message from ATI: "We simply don't care".

Is there any way to avoid stalls when mixing calResMap with kernel execution?