Originally posted by: empty_knapsack
... It's really amazing that in 2010 ATI still can't provide a normal multi-threaded DLL to work with CAL (calResMap and some others also block everything when used); it's so inelegant to use IPC when it isn't needed at all...
I am in the process of moving code from Brook to CAL for memory management. I am wondering what the nature of this blocking is. I would like to access remote cacheable memory with the CPU while other kernels are running. Can I expect everything else to block between the map() and unmap() calls?
Update:
I am running a single thread of code that is very heavy on GPU computation. The CPU thread is in a tight loop that does one of these four operations:
1) Initiate DMA
2) Initiate kernel execution (23 ms compute time)
3) Map - CPU processing in cached memory (0.7 ms) - Unmap
4) Wait for event
When I run this on a single GPU with 2 contexts, calCtx...Counter shows idle times in the region of 3-7%. But when I run the same thread with 4 contexts on 2 GPUs (2+2), I have 2 contexts showing 3-7% idle and the other 2 showing 36-40% idle time.
But the strange thing is that each GPU has 1 "good" and 1 "bad" context, so these numbers may very well be a bug in the calCounters. Since there are multiple GPU operations between each Begin and End call, one would expect to see the exact same numbers for all contexts on a GPU, not the strange figures of 3% and 37%.
However, performance has really taken a hit. It should scale almost linearly, but I get:
1 GPU: 2200 completed jobs/second
2 GPUs: 3300 completed jobs/second (75% efficiency)
I will try to detect locking in the kernel, but it is quite obvious that a fork() is needed to bring performance up to expected levels.
More updates:
I discovered, quite unsurprisingly, that the Flush calls can take a long time to complete. But what was stranger is that if you execute a kernel and then immediately wait for its event, you get a long stall. Waiting for anything else is fine, and helps keep things flowing.
However, the multi-GPU slowdown happens even across multiple processes, after I removed these potential sources of locking (together with the strange idle figures). Brook+ somehow manages to avoid the issue that I am having, so I must be doing something "wrong".
...
and that turns out to be having multiple contexts per GPU. I was trying to avoid (re)binding kernel inputs/outputs all the time by keeping a circular buffer of contexts. Suddenly this all seems very fixable.