
realhet
Miniboss

Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Hi,

I've just downloaded the new driver and noticed that my program became 3.7% slower than before. I have figured out that the problem is that the new driver won't allow 2 kernels to execute overlapped.

Here is what worked well before cat 12.10:

start kernel1

at 90% of the exec time of kernel1 -> start kernel2

at 90% of the exec time of kernel2 -> start kernel3

and so on.

(kernel time is around 0.4 seconds)

And the way it worked on OpenCL: -> make 2 contexts and run the two overlapped kernels in different contexts. Check for completion in a 20 msec timer and launch a new kernel when needed.

Now this is not working with Catalyst 12.10. It's 3.7% slower now because the CUs are sleeping between kernels :S. You know, it's like turning off one of the 32 CUs in a HD7970, just because of those bad kernel-to-kernel transitions.
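For scale, here is a rough back-of-the-envelope check of those numbers. It is only a sketch: it assumes the 0.4 s kernel time quoted above, that the whole slowdown is idle time between kernels, and that throughput scales perfectly with the number of CUs. The function names are made up for illustration.

```python
def gap_per_kernel_ms(kernel_ms: float, slowdown: float) -> float:
    """Idle time per kernel if the whole slowdown is a gap between kernels."""
    return kernel_ms * slowdown


def slowdown_from_losing_cus(total_cus: int, lost_cus: int = 1) -> float:
    """Relative slowdown if work scales perfectly with the CU count (assumption)."""
    return total_cus / (total_cus - lost_cus) - 1.0


if __name__ == "__main__":
    # 0.4 s kernels that became 3.7% slower
    print(f"gap per kernel: {gap_per_kernel_ms(400.0, 0.037):.1f} ms")
    # one CU out of 32 switched off
    print(f"losing 1 of 32 CUs: {slowdown_from_losing_cus(32) * 100:.1f}% slower")
```

Under these assumptions the gap is roughly 15 ms per kernel, and switching off one of 32 CUs would cost a bit over 3%, which is the same order as the measured 3.7%.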

So if anyone knows, please tell me: what is the proper way to keep the CUs filled with work ALL THE TIME?

Thank you for your answers.

32 Replies
binying
Challenger

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Would you mind posting your code?

realhet
Miniboss

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Sorry, I can't post the exact code. If I did, my client would have to kill me. Though I should extract the relevant part and make a demo project out of it...

But the symptom is simple: if there's already a kernel running on the card, the API blocks you when you start another one, so there is a gap between the 2 kernels. The problem is that it worked well with previous drivers and now it seems to be broken or changed somehow.

BTW, it seems like the same behaviour that the CAL API has suffered from since win_cat_12.2.

yurtesen
Miniboss

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Why don't you just enqueue your kernels and let the runtime decide what to do with them? As far as I know, reliable synchronization between contexts is not supported by OpenCL in general.

I am not sure it was ever possible to run 2 kernels at the same time on the device anyway.

http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?

Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.

Also see: http://devgurus.amd.com/message/1283414#1283414

I think you would probably get better performance if you queued your kernels on the same command queue, one after the other, within the same context, without waiting on events in between.

In general it will be more time consuming to try to wait for one event to finish and then queue an event after that. Also, since you can't really check whether the execution will finish within the next 20 msec or not, I don't quite understand how you can queue another kernel in an overlapping fashion the way you described. Perhaps if you could at least provide some code fragments, it would be useful.

realhet
Miniboss

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Hi,

A few months ago, when I ported my stuff from CAL to OpenCL, the first thing I tried was the one-context method. There was that gap between the kernels (it was easy to notice because of the 1-2% slowdown while using exactly the same ISA microcode as the CAL version). Then I found the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag -> also a fail. And finally I tried making two contexts, and that worked excellently up until cat12.10. Btw, the two-context method was faster than CAL when I used more than one GPU device.

"In general it will be more time consuming to try to wait for one event to finish and then queue an event after that."

Sure it is! That's why I enqueue a new kernel when the current one is about to finish. I know that a kernel finishes in 0.3 sec, so I can enqueue the next one 0.2 sec after kernel1 was launched.
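A toy timeline of this idea, as a sketch only: it models a single device that runs kernels one after another, and a hypothetical fixed submission cost (`submit_ms`, an invented number, not a measured driver value) between the host enqueuing a kernel and the device being able to start it. It does not call OpenCL at all.

```python
def gpu_idle_ms(n_kernels, kernel_ms, submit_ms, lead_ms=None):
    """Total device idle time between kernels in a toy single-device timeline.

    lead_ms=None : the host enqueues the next kernel only after the previous one ended.
    lead_ms=x    : the host enqueues the next kernel x ms after the previous one started.
    A kernel starts once it has been submitted AND the previous kernel has ended.
    """
    start = submit_ms  # first kernel is enqueued at t=0 and starts after the submit cost
    idle = 0
    for _ in range(n_kernels - 1):
        end = start + kernel_ms
        enqueue = end if lead_ms is None else start + lead_ms
        ready = enqueue + submit_ms
        next_start = max(ready, end)
        idle += next_start - end  # time the device sat empty before the next kernel
        start = next_start
    return idle


if __name__ == "__main__":
    # 10 kernels of 300 ms each, assumed 15 ms submission cost
    print("enqueue after finish:", gpu_idle_ms(10, 300, 15), "ms idle")
    print("enqueue at 200 ms:   ", gpu_idle_ms(10, 300, 15, lead_ms=200), "ms idle")
```

In this model the early enqueue hides the submission cost completely, but only if the runtime accepts a second kernel while the first is still running, which is exactly the behaviour in question here.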

yurtesen
Miniboss

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

My colleague made a kernel which he enqueued thousands of times, and the average runtime was under 40 microseconds. (It is slightly strange, since the AMD manuals say the average latency of a kernel execution is 70 microseconds on a decent GPU.) On the other hand, I had a problem where my program's runtime went up when I divided the work into several enqueues and used offsets. That was rectified slightly by modifying the kernel code a bit (later we agreed it must have had something to do with cache behavior). So I have mixed experience with this myself. But the fastest option should be enqueuing all the kernels on a command queue in a single context and letting the runtime decide when to run what.

Since I have had mixed results myself in the past, all I can say is that your approach sounds unintuitive to me. For example, the problem could be caused by some improvement in the driver which protects data within one context from mixing with another context. (It would be bad security-wise if two programs with separate contexts could read each other's leftover data, right?) Therefore, at least in my opinion, I can't think of a reason to blame the driver for this.

Did you try to profile your program to see what it is doing exactly? You never mentioned how you figured out that your kernels were running overlapped in the first place (which should have been impossible as far as I know...). Maybe your original speedup was due to some other reason?

nou
Exemplar

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

70 microseconds is latency. If you send a 40 microsecond kernel 10 times, then the total execution time should be 70+10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.
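The arithmetic above, written out as a small sketch. The batched formula is the one from this post; the unbatched variant (paying the full latency on every launch) is only an assumption added for contrast, and the function names are made up.

```python
def batched_total_us(n, kernel_us=40, latency_us=70):
    """Latency paid once, then n kernels back to back: 70 + n*40."""
    return latency_us + n * kernel_us


def unbatched_total_us(n, kernel_us=40, latency_us=70):
    """Assumed worst case for contrast: full latency paid before every kernel."""
    return n * (latency_us + kernel_us)


if __name__ == "__main__":
    print(batched_total_us(10))            # 470
    print(unbatched_total_us(10))          # 1100
    print(batched_total_us(1000) / 1000)   # 40.07 microseconds per kernel on average
```

With thousands of enqueues the one-time latency is amortised to almost nothing per kernel, which would fit an observed average close to the kernel's own runtime.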

yurtesen
Miniboss

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

nou wrote:

70 microseconds is latency. If you send a 40 microsecond kernel 10 times, then the total execution time should be 70+10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.

Yes, we thought about the throughput being different, but I would think there must be at least some small additional latency when a kernel ends and the next one starts. I think it was still surprisingly fast. One thing is that we ran exactly the same kernel on exactly the same data (for testing purposes); if some kernel parameters were different, perhaps there could be some overhead.

Anyway, I guess this doesn't help realhet that much, beyond showing that kernels are able to run one after another without significant delay.

notzed
Challenger

Re: Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Why are you polling and sleeping? Any mismatch of the cycle time (which will be hard to predict) will cause idle 'stalls'.

Waiting for a kernel to finish and then starting the next one straight away is already quite a loss. It is obviously not something you can always avoid, but it is the best-case scenario with this approach.

Have you tried putting each kernel on a separate thread with a separate queue? And rather than a sleep/poll loop, just doing a blocking CL operation for the synchronisation? I presume from your description they are independent.
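To put a number on the polling 'stalls', here is a minimal model, assuming the 20 msec timer from the first post, a launch that is aligned with a timer tick, and completion that is only noticed at the next tick. A blocking wait would correspond to zero idle time in this model. The function names are invented for the example.

```python
def polling_idle_ms(kernel_ms: int, poll_ms: int) -> int:
    """Idle time when completion is only noticed at the next timer tick."""
    return -kernel_ms % poll_ms


def average_polling_idle_ms(durations_ms, poll_ms: int) -> float:
    """Mean idle time over a set of kernel durations (in whole milliseconds)."""
    durations_ms = list(durations_ms)
    return sum(polling_idle_ms(d, poll_ms) for d in durations_ms) / len(durations_ms)


if __name__ == "__main__":
    print(polling_idle_ms(400, 20))  # 0: the kernel ends exactly on a tick
    print(polling_idle_ms(390, 20))  # 10
    print(polling_idle_ms(401, 20))  # 19: just missed a tick
    # kernel durations spread evenly over 381..400 ms
    print(average_polling_idle_ms(range(381, 401), 20))  # 9.5
```

So unless the kernel time happens to line up with the timer, a wait-then-launch loop on a 20 msec timer loses about half a tick per kernel on average.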
