realhet · ‎10-25-2012

Hi,

I've just downloaded the new driver and noticed that my program became 3.7% slower than before. I have figured out that the problem is that the new driver won't allow 2 kernels to execute overlapped.

Here is what worked well before Catalyst 12.10:

start kernel1

at 90% of the exec time of kernel1 -> start kernel2

at 90% of the exec time of kernel2 -> start kernel3

and so on.

(kernel time is around 0.4 seconds)

And the way it worked on OpenCL: make 2 contexts and run the two overlapped kernels on different contexts. Check for completion in a 20 ms timer and launch a new kernel when needed.

Now this is not working with Catalyst 12.10. It's 3.7% slower now because the CUs are sleeping between kernels :S. You know, it's like turning off one of the 32 CUs in an HD7970, just because of those bad kernel-to-kernel transitions.
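The numbers in this post can be tied together with a back-of-the-envelope sketch. It assumes that a fixed idle gap after each 0.4 s kernel is the only source of the slowdown, and that performance scales perfectly with the number of CUs. The gap is derived here, not measured anywhere in the thread, and the function names are made up for illustration.

```python
# Back-of-the-envelope model; the idle gap is derived, not measured.
KERNEL_TIME_S = 0.4   # kernel time quoted in the post
SLOWDOWN = 0.037      # 3.7% slower, as reported
NUM_CUS = 32          # compute units on an HD7970, as stated in the post

def idle_gap_for_slowdown(kernel_time_s, slowdown):
    """Idle time per kernel that would stretch the total runtime by `slowdown`."""
    return kernel_time_s * slowdown

def slowdown_from_disabled_cus(num_cus, disabled=1):
    """Runtime increase if `disabled` CUs were switched off (perfect scaling assumed)."""
    return num_cus / (num_cus - disabled) - 1.0

gap_ms = idle_gap_for_slowdown(KERNEL_TIME_S, SLOWDOWN) * 1000
print(f"implied idle gap per kernel: {gap_ms:.1f} ms")
print(f"losing 1 of {NUM_CUS} CUs: {slowdown_from_disabled_cus(NUM_CUS) * 100:.1f}% slower")
```

Under these assumptions the reported slowdown corresponds to an idle gap of roughly 15 ms per kernel, which is the same order of magnitude as the 20 ms polling timer, and to roughly the cost of one idle CU out of 32.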

So if anyone knows, please tell me: what is the proper way to keep the CUs filled with work ALL THE TIME?

Thank you for answers.

binying · ‎10-25-2012

Would you mind posting your code?

realhet · ‎10-25-2012

Sorry, I can't post the exact code. If I did, my client would have to kill me. Though I should extract the relevant part and make a demo project out of it...

But the symptom is simple: if there's already a kernel running on the card, the API blocks you when you start another one. So there will be a gap between the 2 kernels. The problem is that it worked well with previous drivers, and now it seems to be broken or changed somehow.

BTW, it seems like the same behaviour the CAL API has suffered from since win_cat_12.2.

yurtesen · ‎10-25-2012

Why don't you just enqueue your kernels and let the runtime decide what to do with them? I think reliable synchronization between contexts is not supported by OpenCL in general?

I am not sure it was ever possible to run 2 kernels at the same time on the device anyway.

http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

11. Is it possible to run multiple AMD APP applications (compute and graphics)
concurrently?
Multiple AMD APP applications can be run concurrently, as long as they do not access the
same GPU at the same time. AMD APP applications that attempt to access the same GPU
at the same time are automatically serialized by the runtime system.

Also see: http://devgurus.amd.com/message/1283414#1283414

I think you would probably get better performance if you queued your kernels to the same command queue, one after the other, within the same context, without waiting on events.

In general it will be more time consuming to wait for one event to finish and then queue another after it. Also, since you can't really check whether the execution will finish within the next 20 ms or not, I don't quite understand how you can queue another kernel in an overlapping fashion the way you described. Perhaps if you could at least provide some code fragments, it would be useful.

realhet · ‎10-25-2012

Hi,

A few months ago, when I ported my stuff from CAL to OpenCL, the first thing I tried was the one-context method. There was that gap between the kernels (it was easy to notice because of the 1-2% slowdown while using exactly the same ISA microcode as the CAL version). Then I found the out-of-order flag (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) -> also a fail. And finally I tried making two contexts, and that worked excellently up until Catalyst 12.10. BTW, the two-context method was faster than CAL when I used more than one GPU device.

"In general it will be more time consuming to try to wait for one event to finish and then queue an event after that."

Sure it is! That's why I enqueue a new kernel when the actual one is about to finish. I know that a kernel finishes in 0.3 sec, so I can enqueue the next one at 0.2 sec after kernel1 was launched.

yurtesen · ‎10-25-2012

My colleague made a kernel which he enqueued thousands of times, and the average runtime was under 40 microseconds. (It is slightly strange, since the AMD manuals say the average latency of kernel execution is 70 microseconds on a decent GPU.) On the other hand, I had a problem where my program's runtime went up when I divided the work into several enqueues and used offsets, although that was rectified slightly by modifying the kernel code a bit (later we agreed it must have had something to do with cache behavior). So I have mixed experience with this myself. But the fastest option should be enqueuing all the kernels to a command queue using a single context and letting the runtime decide when to run what.

Since I have had mixed results myself in the past, all I can say is that your approach sounds unintuitive to me. For example, the problem could be caused by some improvement in the driver which protects data within one context from mixing with another context. (It would be bad security-wise if two programs with separate contexts could read each other's leftover data, right?) Therefore, at least in my opinion, I can't think of a reason to blame the driver for this.

Did you try to profile your program to see what it is doing exactly? You never mentioned how you figured out that your kernels were running overlapped in the first place (which should have been impossible as far as I know...). Maybe your original speedup was due to some other reason?

nou · ‎10-26-2012

70 microseconds is latency. If you send a 40 microsecond kernel 10 times, the total execution time should be 70 + 10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.
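The formula above can be written out as a small sketch. It compares paying the launch latency once for a batch of back-to-back kernels against paying it before every kernel. The 70 and 40 microsecond figures come from the posts; the function names and the "serialized" case are illustrative assumptions, not measurements.

```python
LAUNCH_LATENCY_US = 70   # launch latency figure quoted from the AMD manuals
KERNEL_TIME_US = 40      # per-kernel runtime reported earlier in the thread

def pipelined_total_us(n, latency_us=LAUNCH_LATENCY_US, kernel_us=KERNEL_TIME_US):
    """Latency is paid once; the kernels then run back to back."""
    return latency_us + n * kernel_us

def serialized_total_us(n, latency_us=LAUNCH_LATENCY_US, kernel_us=KERNEL_TIME_US):
    """Latency is paid before every kernel, e.g. when waiting on each one in turn."""
    return n * (latency_us + kernel_us)

for n in (1, 10, 1000):
    batched = pipelined_total_us(n)
    print(f"n={n}: batched {batched} us ({batched / n:.2f} us per kernel), "
          f"serialized {serialized_total_us(n)} us")
```

With thousands of enqueues the per-kernel average in the batched case approaches the bare kernel time, which would be consistent with an observed average under the quoted latency.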

yurtesen · ‎10-26-2012

nou wrote:
70 microseconds is latency. If you send a 40 microsecond kernel 10 times, the total execution time should be 70 + 10*40 microseconds. The driver can batch several kernel executions into one command so they get executed faster.

Yes, we thought about the throughput being different, but I would think there must be at least some small additional latency when a kernel ends and the next one starts. I think it was still surprisingly fast. One thing is that we ran exactly the same kernel on exactly the same data (for testing purposes); if some kernel parameters were different, perhaps there would be some overhead.

Anyway, I guess this doesn't help realhet that much, other than showing that kernels are able to run one after another without significant delay.

notzed · ‎10-29-2012

Why are you polling and sleeping? Any mismatch of the cycle time (which will be hard to predict) will cause idle 'stalls'.

Just waiting for a kernel to finish and starting one straight away is quite a loss - but obviously not something you can always avoid - but that is the best case scenario possible.

Have you tried putting each kernel on a separate thread with a separate queue? And rather than a sleep/poll loop, just doing a blocking CL operation for the synchronisation? I presume from your description they are independent.
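The stall caused by a polling timer can be estimated with a small sketch. It assumes the simplest possible scheme: timer ticks are aligned with the kernel start, and the next kernel is launched only at the first tick after the previous one has finished. This is a simplification and not the early-launch scheme described by the original poster; the function names are hypothetical.

```python
import math

def idle_gap_ms(kernel_ms, poll_ms):
    """Idle time between a kernel finishing and the next timer tick noticing it."""
    detected_at = math.ceil(kernel_ms / poll_ms) * poll_ms
    return detected_at - kernel_ms

def worst_case_slowdown(kernel_ms, poll_ms):
    """Upper bound on the relative slowdown: the idle gap is always below one poll period."""
    return poll_ms / kernel_ms

for kernel_ms in (395, 400, 401):
    print(f"{kernel_ms} ms kernel, 20 ms timer: idle {idle_gap_ms(kernel_ms, 20)} ms")
print(f"worst case for a 400 ms kernel: {worst_case_slowdown(400, 20) * 100:.0f}% slower")
```

A blocking wait in a dedicated thread avoids this gap because the thread wakes up as soon as the kernel completes rather than at the next tick.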

yurtesen · ‎10-30-2012

notzed, why do you think using different queues will give better performance than queuing the kernels in the same queue?

notzed · ‎10-30-2012

it's more to do with removing the sleep in the poll loop.

i thought the OP was already using different queues because there's no other way to get overlapping kernel execution.

yurtesen · ‎10-30-2012

There won't be overlapping kernel execution on a single card. You can achieve it if you have 2+ cards and a queue for each card; then the different cards (devices) can run kernels in an overlapping fashion. Therefore, having multiple queues should not increase performance if a single device is used? I believe it would be easier for the runtime to just run the next kernel in the same queue right away than to schedule which queue to run and when. Intuitively, there might be additional delays when switching between queues, perhaps?

See:

http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf

11. Is it possible to run multiple AMD APP applications (compute and graphics)
concurrently?
Multiple AMD APP applications can be run concurrently, as long as they do not access the
same GPU at the same time. AMD APP applications that attempt to access the same GPU
at the same time are automatically serialized by the runtime system.

notzed · ‎10-30-2012

That is talking about running shader kernels and opencl kernels at the same time.

yurtesen · ‎10-30-2012

The text says "Multiple AMD APP applications". (while I can see how the title is confusing). If current AMD APP SDK supported concurrent kernel execution, this would probably be listed as a feature. It would be weird if AMD would have forgotten to mention that they implemented such an important feature. Since there is no document (afaik) which says that kernels can be run concurrently how do you come up with the idea that you can run two or more kernels concurrently on a single card? Do you have an example program which can run kernels concurrently? (on a single GPU obviously)

Furthermore, I ran the following tests (albeit on Cypress; perhaps I should re-test on Tahiti to be 100% sure). I have a kernel which is queued over a large problem in 50k-sized pieces. I tried queuing the kernel to the same queue (0-50k, 50k-100k, ...) and also using threads and multiple queues (q1=0-50k, q2=50k-100k and so on; the next piece goes to the queue which finishes first). Both schemes took exactly the same total runtime. As a matter of fact, the individual kernel runtimes were exactly the same. If the kernels had run overlapped, there would have been some differences. (The program was originally designed to run multi-device, and it accomplishes simultaneous runs on multiple devices. I simply changed the code to run multi-queue on the same device.)

Total number of worker threads 2
Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds
Slave node 0 thread 1 offset 50000 length 50000 events 1 time 1.86 seconds
Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds
Slave node 0 thread 1 offset 150000 length 50000 events 1 time 2.10 seconds
Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds
Slave node 0 thread 1 offset 250000 length 50000 events 1 time 1.87 seconds
Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds
Slave node 0 thread 1 offset 350000 length 50000 events 1 time 1.15 seconds
Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds
Total number of worker threads 1
Slave node 0 thread 0 offset 0 length 50000 events 1 time 2.07 seconds
Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.86 seconds
Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.72 seconds
Slave node 0 thread 0 offset 150000 length 50000 events 1 time 2.10 seconds
Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.95 seconds
Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.86 seconds
Slave node 0 thread 0 offset 300000 length 50000 events 1 time 1.60 seconds
Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.15 seconds
Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.97 seconds

In another test, I tried to run two copies of an app where the kernel run took ~30 seconds. One finished in 31.3 seconds, the other in 62.4 seconds. (It appears that the second copy's kernel, once queued, did not start executing before the first program's kernel had finished.)

notzed · ‎10-30-2012

I dunno - it was plastered all over the GCN marketing materials as a pretty big plus-point.

From the programming guide, Section 6.11, 'Optimisation guidelines for Southern Islands GPUs':

Execution of kernel dispatches can overlap if there are no dependencies
between them and if there are resources available in the GPU. This is critical
when writing benchmarks it is important that the measurements are accurate
and that “false dependencies” do not cause unnecessary slowdowns.
...

yurtesen · ‎10-31-2012

I will run tests on 7970 and return back. I also found that the guide says

AMD Southern Islands GPUs can execute multiple kernels simultaneously when
there are no dependencies.

However, as far as I understand, this does not mean that you need multiple queues (since that is not mentioned). Afaik this feature was not supported before, even though the hardware allowed it. Also, if it is supported, there is no documentation of how the runtime checks whether there are dependencies, or of how we can know why it would or would not run kernels simultaneously. It might even decline to do so if it thinks the GPU resources are busy and overlapping would impede the performance of the already running kernel. There are no guarantees afaik. For the same reasons, it might decide not to mix kernels from different queues, since it can't tell which kernel should have precedence, or maybe the queues do not know each other's business well enough to tell whether there are dependencies, so it chooses the safe way...

While there are a lot of assumptions involved, my intuition says that it would be best to queue kernels in a single queue for a single device and let the runtime decide what to do with them.

drallan · ‎10-31-2012

I was able to get multiple kernels to overlap (on Tahiti) but only under certain conditions.

Like yurtesen, I tried various 2-queue methods that work on multiple GPUs, and they all fail.

It seems to work only with one queue (as suggested above). Here's what seemed to work.

1. Single queue.

2. Each kernel must use different output buffers.

3. Input buffers can be the same or different on overlapping kernels.

4. Works only with a single kernel.

With 2 kernels, it only works using the same kernel source file and same kernel name. (yah, same kernel)

I also tried with Catalyst version 12.10 drivers and it does overlap, but my test kernel ran slower on this driver.

Times for 2 kernels with and without overlap:

                      overlap   no-overlap
Old driver (8.98??)    2.44       4.33
12.10 driver           4.23       8.20

drallan

nou · ‎10-31-2012

The buffers must be different, as otherwise overlapping would break the in-order execution of kernels. Without an out-of-order queue, all kernel executions on the same queue are implicitly synchronized.

yurtesen · ‎10-31-2012

Drallan, now I have the opposite results. I ran the same test program which I had run on Cypress, this time on Tahiti with the 12.10 drivers. I have to say, things seem to be getting overlapped. This program is multi-queue, but exactly the same input/output buffers and kernel are used (in this modified version for single-GPU multi-queue). There is a 10% performance increase....? Does this mean that the SDK thinks these kernels must be independent because they are on different queues?

I say overlapping because the event times add up to more than the 10.1 second wall time, so multiple events must have been active at the same time.

Two queues:
------------------
Slave node 0 thread 1 offset 50000 length 50000 events 1 time 2.64 seconds
Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.84 seconds
Slave node 0 thread 1 offset 100000 length 50000 events 1 time 1.27 seconds
Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.97 seconds
Slave node 0 thread 1 offset 200000 length 50000 events 1 time 2.64 seconds
Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.76 seconds
Slave node 0 thread 1 offset 300000 length 50000 events 1 time 2.20 seconds
Slave node 0 thread 0 offset 350000 length 50000 events 1 time 1.14 seconds
Slave node 0 thread 1 offset 400000 length 30932 events 1 time 1.30 seconds
WALL time = 10.1 seconds
Single queue:
--------------------
Slave node 0 thread 0 offset 0 length 50000 events 1 time 1.53 seconds
Slave node 0 thread 0 offset 50000 length 50000 events 1 time 1.40 seconds
Slave node 0 thread 0 offset 100000 length 50000 events 1 time 1.27 seconds
Slave node 0 thread 0 offset 150000 length 50000 events 1 time 1.75 seconds
Slave node 0 thread 0 offset 200000 length 50000 events 1 time 1.43 seconds
Slave node 0 thread 0 offset 250000 length 50000 events 1 time 1.40 seconds
Slave node 0 thread 0 offset 300000 length 50000 events 1 time 0.95 seconds
Slave node 0 thread 0 offset 350000 length 50000 events 1 time 0.89 seconds
Slave node 0 thread 0 offset 400000 length 30932 events 1 time 0.54 seconds
WALL time = 11.2 seconds
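The overlap argument can be checked directly against the two logs above. The sketch below sums the per-event times and compares them with the reported wall time; the event times and wall times are copied from the logs, while the function name and the 0.1 s tolerance are assumptions made for illustration.

```python
# Per-event times (seconds) copied from the two logs above.
two_queue_events = [2.64, 1.84, 1.27, 1.97, 2.64, 1.76, 2.20, 1.14, 1.30]
single_queue_events = [1.53, 1.40, 1.27, 1.75, 1.43, 1.40, 0.95, 0.89, 0.54]

def looks_overlapped(event_times_s, wall_time_s, tolerance_s=0.1):
    """Events must have overlapped if their summed duration exceeds the wall time."""
    return sum(event_times_s) > wall_time_s + tolerance_s

for label, events, wall in (("two queues", two_queue_events, 10.1),
                            ("single queue", single_queue_events, 11.2)):
    print(f"{label}: sum of events {sum(events):.2f} s, wall {wall} s, "
          f"overlapped: {looks_overlapped(events, wall)}")
```

In the two-queue run the events add up to well over the wall time, while in the single-queue run the sum and the wall time are nearly equal, which matches the interpretation given in the post.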

drallan · ‎11-01-2012

yurtesen wrote:
Drallan, now I have opposite results I now tested the same test program which I ran on Cypress, now on Tahiti with 12.10 drivers. I have to say, things seem to be getting overlapped. This program has multi-queue, but exactly same input/output buffers and kernel are used (in this modified version for single-gpu multi-queue). There is 10% performance increase....? Does this mean that the SDK thinks these kernels must be independent because they are on different queues?

It certainly looks that way, so perhaps it's also driver dependent?

I tested a few drivers to see how overlapped kernel execution depends on both the number of queues and the output buffers.

I'm also using a small test kernel that consumes few GPU resources and easily runs in parallel.

Execution times are tabled below with parallel kernel execution in red. The same results show in CodeXL.

Output buffers    same   separate   same   separate
No. of queues     ---- one ----     ---- two ----
Cat_12_8           73      37        73      73
Cat_12_9           64      33        62      61
Cat_12_10          65      30        32      31
Cat_12_11          58      58        28      28

Of course, it may depend on other things as well.
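The colour highlighting of the table did not survive in this archive, so the sketch below guesses which cells were the overlapped ones. It assumes that a time of at most 60% of the slowest configuration for the same driver indicates parallel execution; that threshold and the function name are assumptions, and the times are copied from the table.

```python
COLUMNS = ("one queue / same buffers", "one queue / separate buffers",
           "two queues / same buffers", "two queues / separate buffers")

# Execution times copied from the table, in the column order above.
TIMES = {
    "Cat_12_8":  (73, 37, 73, 73),
    "Cat_12_9":  (64, 33, 62, 61),
    "Cat_12_10": (65, 30, 32, 31),
    "Cat_12_11": (58, 58, 28, 28),
}

def likely_overlapped(times, threshold=0.6):
    """Flag configurations that ran in at most `threshold` of the slowest time."""
    baseline = max(times)
    return [t <= threshold * baseline for t in times]

for driver, times in TIMES.items():
    flagged = [name for name, hit in zip(COLUMNS, likely_overlapped(times)) if hit]
    print(f"{driver}: {', '.join(flagged) or 'none'}")
```

Read this way, the conditions under which overlap occurs shift from "one queue with separate buffers" in the older drivers to "two queues" in Cat_12_11, with Cat_12_10 accepting both.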

yurtesen · ‎11-02-2012

drallan, nice table. Surely this information will come in handy at some point.

yurtesen · ‎11-02-2012

It is strange that AMD seems to think concurrent kernel execution is enabled in 12.10 only...

http://devgurus.amd.com/thread/159926

drallan · ‎11-02-2012

yurtesen wrote:
It is strange that AMD seems to think concurrent kernel execution is enabled in 12.10 only...
http://devgurus.amd.com/thread/159926

Yurtesen, can you, or anyone, access the link that you posted above?

When I click on the link I get the following message......

Access to this place or content is restricted. If you think this is a mistake, please contact your administrator or the person who directed you here.

Edit: (Answered by nou, please see below! )

nou · ‎11-03-2012

It leads to the CodeXL section, which is private. It is my bug report about seeing kernels run in parallel in two queues on one GPU, and an AMD employee commented that GCN can overlap kernels from Catalyst 12.10 on.

realhet · ‎11-04-2012

Hi All!

Nice to see the thread is running. I did a series of tests, and the results were disappointing :S I dunno what was wrong, maybe a driver installation bug or don't know... Anyways, here's the test:

Priorities:

- Must not use 100% CPU

- Must be as fast as the dumb 100% CPU polled version.

- Also work on multigpu.

- Overlapped kernel dispatch is not the goal, it's just the tool to achieve good performance while not using noticeable CPU power.

Test kernel:

v_mad_i32_i24 v10, v10, v12, v11

v_add_i32 v10, vcc, 0, v10

unrolled 2200 times and executed in a 25x loop. It's 2*2200*25 instructions per kernel.

Number of threads per kernel run: 4000000

Number of kernel runs: 32

Number of batches: 2 (fastest time chosen)
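The size of this test workload can be worked out from the parameters listed above. The sketch counts only the two ALU instructions in the unrolled body and ignores loop control and any setup code, so the totals are a lower bound on the instructions actually executed; the variable names are made up for illustration.

```python
INSTRUCTIONS_PER_ITERATION = 2 * 2200   # two ALU instructions, unrolled 2200 times
LOOP_ITERATIONS = 25
THREADS_PER_RUN = 4_000_000
KERNEL_RUNS = 32

per_thread = INSTRUCTIONS_PER_ITERATION * LOOP_ITERATIONS
per_run = per_thread * THREADS_PER_RUN
total = per_run * KERNEL_RUNS

print(f"instructions per thread:     {per_thread:,}")
print(f"instructions per kernel run: {per_run:,}")
print(f"instructions per batch:      {total:,}")
```

Note that the 2*2200*25 figure is the per-thread count; a single kernel run multiplies it by the 4,000,000 threads.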

Preview of the results: http://x.pgy.hu/~worm/het/oclspeedtest/PerfMeasure.png

Here are sources/compiled kernels (check UFrmMain.pas for 'pseudo-code', TFrmMain.tTwinUpdateTimer is the important part): http://x.pgy.hu/~worm/het/oclspeedtest/OclSpeedTest0.zip

Problems while installing drivers: there were some crashes and lockups while the screen was flickering. I recall that in July I was able to run 2 GPUs with slightly better efficiency than CAL. But now I can't reach it with OpenCL. There must be some voodoo magic I missed this time...

sleep: It's useless (on Win7). It needs a working Windows message loop (TTimer in Delphi).

The weird thing is that my 'dual queue method' (twin) works well on 11.12's CAL with 1 or 2 GPUs, and also works with any driver's OpenCL on 1 GPU, but OpenCL on 2 GPUs is so bad.

The next step for me would be to make a precise log and investigate what is exactly happening on OpenCL 2xGPU.

Update: Here are my results:

http://x.pgy.hu/~worm/het/oclspeedtest/ocltest_twin_timing.xlsx

twin[0] is the first ctx of the first gpu, twin[1] is the second ctx of the first gpu, twin[2] is ctx#1 of gpu#2...

'running' is a func that polls for kernel completion.

From the timings you can see that when there is more than one kernel running on a context, the OpenCL API functions (enqueue and the completion query) become BLOCKING OPERATIONS (the call waits for the kernel inside the OpenCL API). That's totally bad, and that's why multi-GPU performance is so slow compared to single GPU (the two are blocking each other).

The exact SAME method works fine on 11.12/CAL, and I remember that I somehow did it in the past on OpenCL as well, but I have since forgotten the trick...

yurtesen · ‎11-17-2012

I forgot to ask, were you using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when overlapping using single queue? ( I think I might have forgotten to set that...hmm...)

drallan · ‎11-17-2012

No,

I checked, the queues are created with cl_command_queue_properties = 0.

I assume (with 50% probability of being correct) that overlap and out of order are unrelated.

realhet · ‎11-18-2012

Hello yurtesen and drallan!

Yes, the first thing I've found was the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE option.

And if I recall correctly, I was able to make it work with Catalyst 12.4. But if you ask me what the best way is to get around 100% ALU performance from MULTIPLE HD7970s with 0% CPU usage, then I'd say use the old 11.12 driver (for Win7 64) and CAL (Linux: 32-bit 12.2). On the 7xxx I have to do voodoo magic (unfortunately I dunno how I did it in the past), otherwise clEnqueueNDRangeKernel() becomes a blocking operation until it can launch the task I've sent it. On the 4850, 5970 and 6990, though, this parallel dispatch thing works perfectly with 0% CPU.

yurtesen · ‎11-18-2012

realhet, I wonder why there is an example in OpenCL manuals where at page 1-20:

http://developer.amd.com.php53-23.ord1-1.websitetestlink.com/wordpress/media/2012/10/AMD_Accelerated...

Why does it say

6. The kernel is launched. While it is necessary to specify the global work size,
OpenCL determines a good local work size for this device. Since the kernel
was launch asynchronously, clFinish() is used to wait for completion.

even though the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE option was not used? Is it a mistake in the documentation?

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html

I find the Khronos and AMD documentation to be conflicting in this area (unless I am misunderstanding something?). Is the clFinish() necessary or not (in example 1 in the PDF)? If they are both correct, does this mean there is a bug in AMD's OpenCL implementation?

BTW, for multiple GPUs, isn't it better to have a queue per GPU? This way OpenCL won't have to bother checking whether the input/output objects conflict or not.

nou · ‎11-18-2012

AFAIK AMD doesn't implement out-of-order execution in their implementation. clFinish() or some other synchronization is always required.

You need a queue per device/GPU. clEnqueueNDRangeKernel() can take longer on the first run. Most likely it must check whether there are enough resources to execute the kernel, and return an error if not. nVidia, on the other hand, does this check at execution time and then returns the error from clFinish().

yurtesen · ‎11-18-2012

Nou, do you mean AMD implements only out of order? If execution is in order, why do we need clFinish() etc. ?

nou · ‎11-19-2012

Only in order. And you need clFinish() to wait until all kernel executions are done. clEnqueue*() calls are all asynchronous.
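The difference between "in order" and "synchronous" can be illustrated with an analogy in plain Python. This is not OpenCL code: a single-worker thread pool stands in for an in-order command queue, submit() stands in for a clEnqueue*() call, and wait() plays the role of clFinish(). All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

gate = threading.Event()   # stands in for "the device is still busy"
order = []                 # records the order in which the fake kernels ran

def fake_kernel(name):
    gate.wait()            # block until the gate is opened
    order.append(name)
    return name

# One worker thread: tasks run strictly in submission order, like an in-order queue.
in_order_queue = ThreadPoolExecutor(max_workers=1)

# "Enqueue" returns immediately, even though nothing has finished yet.
futures = [in_order_queue.submit(fake_kernel, f"kernel{i}") for i in range(3)]
done_after_enqueue = [f.done() for f in futures]

gate.set()                 # let the work proceed
wait(futures)              # the caller must wait explicitly, like clFinish()
in_order_queue.shutdown()

print("done right after enqueue:", done_after_enqueue)
print("execution order:", order)
```

The tasks execute in submission order, yet the submitting thread learns nothing about completion until it waits, which is why an explicit synchronization call is still needed with an in-order queue.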

Recommended way to overlap Ocl kernels on HD7970 Catalyst 12.10