
realhet
Miniboss

Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10

Hi,

I've just downloaded the new driver and noticed that my program became 3.7% slower than before. I've figured out that the problem is that the new driver won't allow two kernels to execute overlapped.

Here is what worked well before cat 12.10:

start kernel1

at 90% of the exec time of kernel1 -> start kernel2

at 90% of the exec time of kernel2 -> start kernel3

and so on.

(kernel time is around 0.4 seconds)

And the way it worked in OpenCL: make 2 contexts and run the two overlapped kernels on different contexts. Check for completion in a 20 ms timer and launch a new kernel when needed.

Now this is not working with Catalyst 12.10. It's 3.7% slower now because the CUs are sleeping between kernels :S. It's like turning off one of the 32 CUs in an HD7970, just because of those bad kernel-to-kernel transitions.

So if anyone knows, please tell me the proper way to keep the CUs filled with work ALL THE TIME.

Thank you for your answers.
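The overlapped dispatch scheme described above can be sketched in a few lines. This is a hedged Python simulation of the timing arithmetic only, not OpenCL; the LAUNCH_COST value and the 10-kernel count are illustrative assumptions, while the 0.4 s kernel time is the figure from the post:

```python
# Sketch of the overlapped dispatch schedule (timing arithmetic only,
# no OpenCL). LAUNCH_COST and N_KERNELS are illustrative assumptions.

KERNEL_TIME = 0.4      # seconds per kernel, as stated in the post
LAUNCH_COST = 0.02     # hypothetical gap when launching back-to-back
N_KERNELS = 10

def serial_schedule():
    """Each kernel starts only after the previous one fully finished,
    so every transition pays the launch cost."""
    return N_KERNELS * (KERNEL_TIME + LAUNCH_COST)

def overlapped_schedule():
    """Kernel N+1 is enqueued at ~90% of kernel N's runtime, so its
    launch cost is hidden inside the tail of the running kernel."""
    # Only the first launch pays the cost; the rest are pre-queued.
    return LAUNCH_COST + N_KERNELS * KERNEL_TIME

if __name__ == "__main__":
    print(f"serial:     {serial_schedule():.2f} s")
    print(f"overlapped: {overlapped_schedule():.2f} s")
```

The point of the scheme is that the fixed per-launch gap is paid once instead of once per kernel.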

0 Likes
32 Replies
binying
Challenger

Would you mind posting your code?


Sorry, I can't post the exact code. If I did, my client would have to kill me. Though I should extract the thing and make a demo project out of it...

But the symptom is so simple: if there's already a kernel running on the card, the API will block you when you start another one, so there will be a gap between the 2 kernels. The problem is that it worked well with previous drivers, and now it seems broken or changed somehow.

BTW, it seems like the same behaviour the CAL API has suffered since win_cat_12.2.


Why don't you just enqueue your kernels and let the runtime decide what to do with them? I think any sort of reliable synchronization between contexts is not supported by OpenCL in general.

I am not sure if it was ever possible to run 2 kernels at the same time on the device anyway.

http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?

Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.

Also see: http://devgurus.amd.com/message/1283414#1283414

I think you would probably get better performance if you queue your kernels to the same command queue one after another, without waiting, using events within the same context.

In general it will be more time consuming to wait for one event to finish and then queue another after that. Also, since you can't really check whether the execution will finish after 20 ms or not, I don't quite understand how you can queue another kernel in an overlapping fashion the way you described. Perhaps if you could at least provide some code fragments, it could be useful.


Hi,

A few months ago when I ported my stuff from CAL to OCL, the first thing I tried was the one-context method. There was that gap between the kernels (easy to notice because of the 1-2% slowdown while using exactly the same ISA microcode as in the CAL version). Then I found the CL_OUT_OF_ORDER_EXECUTION flag -> also a fail. And finally I tried making two contexts, and the thing worked excellently up until Cat 12.10. BTW, that two-context method was faster than CAL when I used more than one GPU device.

"In general it will be more time consuming to wait for one event to finish and then queue another after that."

Sure it is! That's why I enqueue a new kernel when the current one is about to finish. I know that a kernel finishes in 0.3 s, so I can enqueue the next one 0.2 s after kernel1 was launched.


My colleague made a kernel which he enqueued thousands of times, and the average runtime was under 40 microseconds. (It is slightly strange, since the AMD manuals say the average latency of kernel execution is 70 microseconds on a decent GPU.) I did have a problem where my program's runtime went up when I divided the work into several enqueues and used offsets, although that was rectified slightly by modifying the kernel code a bit (later we agreed it must have been something to do with cache behavior). So I have mixed experience with this myself. But the fastest option should be enqueuing all the kernels to a command queue using a single context and letting the runtime decide when to run what.

Since I had some mixed results myself in the past, all I can say is that your approach sounds unintuitive to me. For example, the problem could be caused by some improvement in the driver which protects data within a context from mixing with another context. (It would be bad security-wise if two programs with separate contexts could read each other's leftover data, right?) Therefore, at least in my opinion, I can't think of a reason to blame the driver for this.

Did you try to profile your program to see what it is doing exactly? You never mentioned how you figured out your kernels were running overlapped in the first place (which should have been impossible as far as I know...). Maybe your original speedup was due to some other reason?


70 microseconds is latency. If you send 10 kernels of 40 microseconds each, total execution time should be 70 + 10*40. The driver can batch more kernel executions into one command, so they get executed faster.
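nou's arithmetic, as a quick sanity check (plain Python; the 70 µs latency and 40 µs kernel time are the numbers quoted in the thread):

```python
# nou's model: one dispatch latency, then back-to-back kernels.
LATENCY_US = 70        # dispatch latency, per the AMD manuals (as quoted)
KERNEL_US = 40         # measured average kernel runtime (as quoted)
N = 10

batched = LATENCY_US + N * KERNEL_US        # driver batches the dispatches
unbatched = N * (LATENCY_US + KERNEL_US)    # every launch pays full latency

print(batched, unbatched)   # 470 vs 1100 microseconds
```

So batching roughly halves the total time in this toy case, which is why sub-latency average kernel times are plausible.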


nou wrote:

70 microseconds is latency. If you send 10 kernels of 40 microseconds each, total execution time should be 70 + 10*40. The driver can batch more kernel executions into one command, so they get executed faster.

Yes, we thought about the throughput being different, but I would think there must be at least some small additional latency when a kernel ends and restarts. I think it was surprisingly fast still. One thing is that we ran exactly the same kernel on exactly the same data (for testing purposes); if some kernel parameters were different, perhaps there could be some overhead.

Anyway, this doesn't help realhet that much, I guess, other than showing that kernels are able to run after each other without significant delay.


notzed
Challenger

Why are you polling and sleeping? Any mismatch of the cycle time (which will be hard to predict) will cause idle 'stalls'.

Even just waiting for a kernel to finish and starting the next one straight away loses some time - obviously not something you can always avoid - but that is the best-case scenario possible.

Have you tried putting each kernel on a separate thread with a separate queue? And rather than a sleep/poll loop, just doing a blocking CL operation for the synchronisation? I presume from your description they are independent.
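notzed's suggestion can be sketched with ordinary threads. Here `threading.Event` stands in for an OpenCL event plus a blocking wait (like clWaitForEvents), and `run_kernel` is a hypothetical stand-in for the device, so this illustrates only the control flow, not real OpenCL:

```python
import threading, time

def run_kernel(done_evt, duration=0.05):
    """Pretend GPU: signals the event when the 'kernel' finishes."""
    time.sleep(duration)
    done_evt.set()

def worker(n_kernels, results):
    """One worker per 'queue': blocks on completion instead of polling."""
    for i in range(n_kernels):
        done = threading.Event()
        threading.Thread(target=run_kernel, args=(done,)).start()
        done.wait()            # blocking wait -- no 20 ms polling granularity
        results.append(i)

results = []
threads = [threading.Thread(target=worker, args=(3, results)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results))  # 6 completions across the two "queues"
```

The advantage over the 20 ms timer is that the wakeup happens as soon as the work completes, so no idle stall accumulates per kernel.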


notzed, why do you think using different queues will give better performance than queuing the kernels in the same queue?


It's more to do with removing the sleep in the poll loop.

I thought the OP was already using different queues, because there's no other way to get overlapping kernel execution.


There won't be overlapping kernel execution on a single card. You can achieve it if you have 2+ cards and a queue for each card; then different cards can run kernels in an overlapping fashion (on different devices). Therefore, having multiple queues should not increase performance if a single device is used. I believe it would be easier for the runtime to just run the next kernel in the same queue right away than to try to schedule which queue to run and when; intuitively, there might be additional delays when switching between queues.

See:

http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?

Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.


That is talking about running shader kernels and OpenCL kernels at the same time.


The text says "Multiple AMD APP applications" (while I can see how the title is confusing). If the current AMD APP SDK supported concurrent kernel execution, it would probably be listed as a feature; it would be weird if AMD had forgotten to mention such an important feature. Since there is no document (AFAIK) which says that kernels can run concurrently, how do you come up with the idea that you can run two or more kernels concurrently on a single card? Do you have an example program which can run kernels concurrently (on a single GPU, obviously)?

Furthermore, I made the following tests (albeit on Cypress; perhaps I should re-test on Tahiti to be 100% sure). I have a kernel which queues a large problem in 50k-sized pieces. I tried queuing the kernel to the same queue 0-50k, 50k-100k..., and also using threads and multiple queues, q1=0-50k, q2=50k-100k and so on (the next piece goes to the queue which finishes first). Both schemes took exactly the same total runtime. As a matter of fact, the individual kernel runtimes were exactly the same; if the kernels ran overlapped, there should have been some differences. (The program was originally designed to run multi-device, and it accomplishes simultaneous runs on multiple devices; I simply changed the code to run multi-queue on the same device.)

Total number of worker threads 2

Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds

Slave node 0 thread 1 offset 50000 length 50000 events 1 time 1.86 seconds

Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds

Slave node 0 thread 1 offset 150000 length 50000 events 1 time 2.10 seconds

Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds

Slave node 0 thread 1 offset 250000 length 50000 events 1 time 1.87 seconds

Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds

Slave node 0 thread 1 offset 350000 length 50000 events 1 time 1.15 seconds

Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds

Total number of worker threads 1

Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds

Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.86 seconds

Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds

Slave node 0 thread 0 offset 150000 length 50000 events 1 time 2.10 seconds

Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds

Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.86 seconds

Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds

Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.15 seconds

Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds

In another test, I tried to run two copies of an app where the kernel run took ~30 seconds. One finished in 31.3 seconds, the other in 62.4 seconds. (As it appears, after I queued the kernel, it did not start executing before the first program's kernel finished.)


I dunno - it was plastered all over the GCN marketing materials as a pretty big plus point.

From the programming guide, Section 6.11, 'Optimization guidelines for Southern Islands GPUs':

Execution of kernel dispatches can overlap if there are no dependencies between them and if there are resources available in the GPU. This is critical when writing benchmarks it is important that the measurements are accurate and that "false dependencies" do not cause unnecessary slowdowns.

...


I will run tests on a 7970 and report back. I also found that the guide says:

AMD Southern Islands GPUs can execute multiple kernels simultaneously when there are no dependencies.

However, as far as I understand, this does not mean that you need multiple queues (since that is not mentioned). AFAIK this feature was not supported before, even though the hardware allowed it. Also, if it is supported, there is no documentation of how the runtime checks whether there are dependencies, or of how we can know why it would or would not run kernels simultaneously. It might even decline to do so if it thinks the GPU resources are busy and overlapping would impede the performance of the already-running kernel; there are no guarantees, AFAIK. For the same reasons, it might decide not to mix kernels from different queues, since it can't tell which kernel should have precedence; or maybe queues don't know each other's business that well, so it can't tell whether there are dependencies and chooses the safe way...

While there are a lot of assumptions involved, my intuition says it would be best to queue kernels in a single queue for a single device and let the runtime decide what to do with them.


I was able to get multiple kernels to overlap (on Tahiti) but only under certain conditions.

Like Yurtesen, I tried various 2-queue methods that work on multiple GPUs, and they all failed.

It seems to work only with one queue (as suggested above). Here's what seemed to work:

1. Single queue.

2. Each kernel must use different output buffers.

3. Input buffers can be the same or different on overlapping kernels.

4. Works only with a single kernel.

With 2 kernels, it only works using the same kernel source file and the same kernel name. (Yeah, same kernel.)

I also tried the Catalyst 12.10 drivers; it does overlap, but my test kernel ran slower on this driver.

Times for 2 kernels with and without overlap:

                       overlap    no-overlap
Old driver (8.98??)     2.44        4.33
12.10 driver            4.23        8.20
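A quick check of what the table implies: with two independent kernels, perfect overlap should approach a 2x reduction, and the measured ratios come close (plain arithmetic on drallan's numbers):

```python
# Speedup ratios implied by drallan's overlap vs no-overlap timings.
old_overlap, old_serial = 2.44, 4.33     # old driver (8.98??)
new_overlap, new_serial = 4.23, 8.20     # Catalyst 12.10

print(round(old_serial / old_overlap, 2))  # ~1.77x
print(round(new_serial / new_overlap, 2))  # ~1.94x
```

The 12.10 driver is slower in absolute terms but overlaps more completely, which matches drallan's observation.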

drallan


It must be different buffers, as otherwise it would break in-order execution of kernels. Without an out-of-order queue, all kernel executions on the same queue are implicitly synchronized.


Drallan, now I have the opposite results. I tested the same program which I ran on Cypress, now on Tahiti with 12.10 drivers, and I have to say, things seem to be getting overlapped. This program is multi-queue, but exactly the same input/output buffers and kernel are used (in this modified version for single-GPU multi-queue). There is a 10% performance increase...? Does this mean that the SDK thinks these kernels must be independent because they are on different queues?

I say overlapping because the total event times add up to more than the 10.1-second wall time, so multiple events must have been active at the same time.

Two queues:

------------------

Slave node 0 thread 1 offset 50000 length 50000 events 1 time 2.64 seconds

Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.84 seconds

Slave node 0 thread 1 offset 100000 length 50000 events 1 time 1.27 seconds

Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.97 seconds

Slave node 0 thread 1 offset 200000 length 50000 events 1 time 2.64 seconds

Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.76 seconds

Slave node 0 thread 1 offset 300000 length 50000 events 1 time 2.20 seconds

Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.14 seconds

Slave node 0 thread 1 offset 400000 length 30932 events 1 time 1.30 seconds

WALL time =   10.1 seconds

Single queue:

--------------------

Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.53 seconds

Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.40 seconds

Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.27 seconds

Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.75 seconds

Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.43 seconds

Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.40 seconds

Slave node 0 thread 0 offset 300000 length 50000 events 1 time 0.95 seconds

Slave node 0 thread 0 offset 350000 length 50000 events 1 time 0.89 seconds

Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.54 seconds

WALL time =   11.2 seconds
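The overlap inference can be checked numerically from the logs above: if the per-event times sum to more than the wall time, some events must have run concurrently. A minimal Python check using the times listed:

```python
# Per-event times copied from the two logs above (seconds).
two_queue = [2.64, 1.84, 1.27, 1.97, 2.64, 1.76, 2.20, 1.14, 1.30]
one_queue = [1.53, 1.40, 1.27, 1.75, 1.43, 1.40, 0.95, 0.89, 0.54]

def overlapped(event_times, wall_time):
    """True if events must have overlapped: their total busy time
    exceeds the elapsed wall time."""
    return sum(event_times) > wall_time + 1e-9

print(round(sum(two_queue), 2), overlapped(two_queue, 10.1))   # 16.76 True
print(round(sum(one_queue), 2), overlapped(one_queue, 11.2))   # 11.16 False
```

The two-queue run packs 16.76 s of event time into 10.1 s of wall time, while the single-queue run is fully serialized.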


yurtesen wrote:

Drallan, now I have the opposite results. I tested the same program which I ran on Cypress, now on Tahiti with 12.10 drivers, and I have to say, things seem to be getting overlapped. This program is multi-queue, but exactly the same input/output buffers and kernel are used (in this modified version for single-GPU multi-queue). There is a 10% performance increase...? Does this mean that the SDK thinks these kernels must be independent because they are on different queues?

It certainly looks that way, so perhaps it's also driver dependent?

I tested a few drivers to see how overlapped kernel execution depends on both the number of queues and the output buffers.

I'm also using a small test kernel that consumes few GPU resources and easily runs in parallel.

Execution times are tabled below; the shorter times correspond to parallel kernel execution. The same results show in CodeXL.

Output buffers     same  separate   same  separate
No. of queues      ------one------  ------two------
Cat_12_8            73      37       73      73
Cat_12_9            64      33       62      61
Cat_12_10           65      30       32      31
Cat_12_11           58      58       28      28

Of course, it may depend on other things as well.

drallan, nice table; surely this information will come in handy at some point.


It is strange that AMD seems to think concurrent kernel execution is enabled in 12.10 only...

http://devgurus.amd.com/thread/159926


yurtesen wrote:

It is strange that AMD seems to think concurrent kernel execution is enabled in 12.10 only...

http://devgurus.amd.com/thread/159926

Yurtesen, can you, or anyone, access the link that you posted above?
When I click on the link I get the following message:

Access to this place or content is restricted. If you think this is a mistake, please contact your administrator or the person who directed you here.

Edit: (Answered by nou, please see below!)

It leads to the CodeXL section, which is private. It is my bug report about seeing kernels running in parallel in two queues on one GPU. An AMD employee commented that GCN can overlap kernels from Catalyst 12.10 on.


Hi All!

Nice to see the thread is running. I did a series of tests, and the results were disappointing :S. I don't know what was wrong; maybe a driver installation bug, or something else... Anyway, here's the test:

Priorities:

- Must not use 100% CPU

- Must be as fast as the dumb 100% CPU polled version.

- Also work on multigpu.

- Overlapped kernel dispatch is not the goal; it's just the tool to achieve good performance while not using noticeable CPU power.

Test kernel:

v_mad_i32_i24 v10, v10, v12, v11

v_add_i32 v10, vcc, 0, v10

unrolled 2200 times and executed in a 25x loop. It's 2*2200*25 instructions per kernel.

Number of threads per kernel run: 4000000

Number of kernel runs: 32

Number of batches: 2 (fastest time chosen)

Preview of the results: http://x.pgy.hu/~worm/het/oclspeedtest/PerfMeasure.png

Here are sources/compiled kernels (check UFrmMain.pas for 'pseudo-code', TFrmMain.tTwinUpdateTimer is the important part): http://x.pgy.hu/~worm/het/oclspeedtest/OclSpeedTest0.zip

Problems while installing drivers: there were some crashes and lockups with screen flickering. I recall that in July I was able to run 2 GPUs with slightly better efficiency than CAL, but now I can't reach that with OpenCL. There must be some voodoo magic I missed this time...

sleep: it's useless (on Win7); it needs a working Windows message loop (TTimer in Delphi).

The weird thing is that my 'dual queue method' (twin) works well on 11.12's CAL with 1 or 2 GPUs, and also works with any driver's OCL on 1 GPU, but OCL with 2 GPUs is so bad.

The next step for me would be to make a precise log and investigate what exactly is happening on OpenCL 2xGPU.

Update: Here are my results:

http://x.pgy.hu/~worm/het/oclspeedtest/ocltest_twin_timing.xlsx

twin[0] is the first ctx of the first gpu, twin[1] is the second ctx of the first gpu, twin[2] is ctx#1 of gpu#2...

'running' is a func that polls for kernel completion.

In the times you can see that when there is more than one kernel running on a context, the OpenCL API functions (enqueue and get-completion) become BLOCKING OPERATIONS (they wait for the kernel inside the OCL API). That's totally bad, and that's why multi-GPU performance is so slow compared to single GPU (both are blocking each other).

Meanwhile the exact SAME method works fine on 11.12/CAL. I remember I somehow did it in the past on OCL as well, but I've forgotten the trick since...


I forgot to ask: were you using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when overlapping using a single queue? (I think I might have forgotten to set that... hmm...)


No,

I checked; the queues are created with cl_command_queue_properties = 0.

I assume (with 50% probability of being correct) that overlap and out-of-order are unrelated.


Hello yurtesen and drallan!

Yes, the first thing I found was the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE option.

And if I recall, I was able to make it work with Cat 12-4. But if you ask me what is the best way to get around 100% ALU performance from MULTIPLE HD7970s with 0% CPU usage, then I'd say use the old 11-12 driver (for Win7 64) and CAL (Linux: 32-bit 12-2). On 7xxx I have to do voodoo magic (unfortunately I don't know how I did it in the past), otherwise clEnqueueNDRangeKernel() becomes a blocking operation until it can launch the task I've sent it. Although on 4850, 5970 and 6990 this parallel dispatch thing works perfectly with 0% CPU.


realhet, I wonder about the example in the OpenCL manual at page 1-20:

http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/AMD_Accelerated...

Why does it say:

6. The kernel is launched. While it is necessary to specify the global work size, OpenCL determines a good local work size for this device. Since the kernel was launched asynchronously, clFinish() is used to wait for completion.

while the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE option was not used? Is it a mistake in the documentation?

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html

I find the Khronos and AMD documentation to be conflicting in this area (unless I am misunderstanding something). Is the clFinish() necessary or not (in example 1 in the PDF)? If they are both correct, does this mean there is a bug in AMD's OpenCL implementation?

BTW, for multiple GPUs, isn't it better to have a queue per GPU? That way OpenCL won't have to be bothered with checking whether the input/output objects conflict.


AFAIK AMD doesn't implement out-of-order execution in their implementation. clFinish() or other synchronization is always required.

You need a queue per device/GPU. clEnqueueNDRangeKernel() can take longer at the first run; most likely it must check whether there are enough resources to execute the kernel and return an error. NVIDIA, on the other hand, does this check at execution time and then returns the error from clFinish().


nou, do you mean AMD implements only out-of-order? If execution is in order, why do we need clFinish(), etc.?


Only in order. And you need clFinish() to wait until all kernel executions are done; clEnqueue*() calls are all asynchronous.
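The in-order semantics nou describes can be modeled with a toy queue: enqueue returns immediately (like clEnqueue*()), kernels run strictly one after another on a worker, and finish() blocks like clFinish(). This is a sketch of the semantics only, not how the driver is implemented:

```python
import threading, queue, time

class InOrderQueue:
    """Toy model of an in-order OpenCL command queue."""
    def __init__(self):
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            fn = self._q.get()
            fn()                      # kernels execute strictly in order
            self._q.task_done()

    def enqueue(self, fn):
        self._q.put(fn)               # returns immediately, like clEnqueue*()

    def finish(self):
        self._q.join()                # blocks until all work done, like clFinish()

results = []
q = InOrderQueue()
for i in range(4):
    q.enqueue(lambda i=i: (time.sleep(0.01), results.append(i)))
q.finish()
print(results)  # [0, 1, 2, 3] -- in-order completion
```

The point of clFinish() in this model is visible directly: without it, the host would read `results` before the asynchronous enqueues had executed.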
