16 Replies Latest reply on Aug 23, 2017 3:53 PM by joej

    Vulkan API flaw prevents efficient async compute usage

    joej

      This is more kind of feedback for improving a future API version. Also, I don't have much experience with async compute yet and may be wrong.

       

      So, my use case is quite different from the typical 'do large compute workloads while rendering shadowmaps'.

      I work on realtime GI based on a tree of surface samples.

      The algorithm is very complex and often requires one indirect dispatch per tree level with barriers in between, so there are tiny workloads near the root.

      It also requires many maintenance tasks (e.g. interpolating samples from their parents when they enter the view), again resulting in tiny workloads and mostly zero-work dispatches for each tree level.

      Similar problems will arise in almost any algorithm with complex work reduction/distribution, or highly variable workloads (e.g. collision detection for all possible pairs of shape primitives in a physics engine). So we really need fine-grained compute, even for dispatches of only a few wavefronts running short programs.

       

       

      Async compute is just perfect for solving this problem, but it seems synchronization cost and overhead are still too high to do it with full efficiency.

      This is what I'm trying to do:

 

      Usually I use a single command buffer containing indirect dispatches and memory barriers for every potential workload.

      To go async, I need to divide it into multiple command buffers for multiple queues at each synchronization point (there seems to be no other way to synchronize two command buffers).

      I divided it like this:

       

      Task A (0.1ms - 34 invocations, mostly zero or tiny workloads)

      Task B (0.5ms - 16 invocations, starting with tiny, ending with heavy workloads)

      Task C (0.5ms - final work; at this point I need results from both A and B)

       

      So I can run A and B simultaneously. My goal is to hide the runtime of A behind B, and this indeed works.

       

      Option 1 (better):

      queue1: process A

      queue2: process B, wait on A, process C

       

      Option 2:

      queue1: process A

      queue2: process B

      queue3: wait on A and B, process C

       

      The problem is, I end up with a runtime of 1.05 ms, not the expected 1.00 ms.

      This is disappointing because if I remove task C, A+B needs only the time of B (0.5 ms).

      The problem persists if I remove the semaphores, so it seems to be more about enqueuing multiple command buffers (additional CPU <-> GPU interaction?).

      But I can't be sure of anything - when we're talking about 0.05 ms, even using timestamps for profiling affects performance by an amount of that magnitude (for some details see Confusing performance with async compute - Graphics Programming and Theory - GameDev.net )

       

      However, if you think this makes sense and indicates an API limitation,

      maybe it would work to extend synchronization between queues (something like DX12 split barriers or VK events),

      or to enable async compute within a single queue with user-defined dependencies, barriers, etc., avoiding the need to divide any command buffer.

      Maybe an improvement is possible on the driver side alone.

       

      Also, let me know if you have an idea for something else I could try.

      I'll continue with this when I'm finished with my whole algorithm and have more options for async combinations...

        • Re: Vulkan API flaw prevents efficient async compute usage
          dwitczak

          We've been discussing this internally. Would you be able to send us a repro we could use for internal investigation?

          • Re: Vulkan API flaw prevents efficient async compute usage
            joej

            Somebody has enlightened me that AMD already uses async compute within a single queue (Question concerning internal queue organisation - Graphics and GPU Programming - GameDev.net )

             

            So the situation seems perfect and there is nothing to improve, neither in the API nor in the drivers.

            Sorry for wasting your time!

            • Re: Vulkan API flaw prevents efficient async compute usage
              joej

              One day later, I'm less optimistic again.

              Async compute within a single queue works great as long as there are not too many pipeline barriers.

              With barriers the whole queue stalls, and we can only use a second queue to keep the GPU busy.

              But then the problems mentioned above pop up again: saturated queues slow each other down, and using semaphores to avoid this is too slow.

 

              Sorry again, but you may still want to look at this when you have time.

              You can contact me to ask for a less buggy repro - validation is broken and there are rare freezes, but I don't know why yet.

                • Re: Vulkan API flaw prevents efficient async compute usage
                  dwitczak

                  Apologies for the late reply. Can you confirm that the latest version of the validation layers, as built from the latest sources available at the corresponding GitHub repository, does not report any errors?

                   

                  Also, which GPU are you specifically using? What OS are you running? Which driver version do you have installed?

                   

                  Thanks.

                    • Re: Vulkan API flaw prevents efficient async compute usage
                      joej

                      I'll update and check for validation errors, but back then there were none. However, I know the validation code within this old archive does not work anymore; I'll fix this...

 

                      Unrelated: there actually seems to be a serious driver bug introduced in one of the last 1-3 driver versions. It's very difficult to track down (the system hangs all the time), but I have at least one case of a compute shader generating wrong results and should be able to reproduce a test case from there in the next days. Let's focus on this first...

 

                      FuryX, Win10, using the older driver 17.5.1 to avoid the bug I noticed in the most recent driver.

                        • Re: Vulkan API flaw prevents efficient async compute usage
                          dwitczak

                          If you could send me the compute shader in GLSL form + the SPIR-V blob you pass via vkCreateComputePipelines(), I could look into it right away.

                            • Re: Vulkan API flaw prevents efficient async compute usage
                              joej

                              It's not that easy; you also need the data. I just found out the shader works correctly thousands of times and the bug only appears after that. This increases the chance it's all my fault - maybe the (working) OpenCL version compiles math instructions in a different order and avoids something like a rare indexing bug of mine. I'll let you know...

                              • Re: Vulkan API flaw prevents efficient async compute usage
                                joej

                                Ok, got it. It's probably the same bug that happened with OpenCL but has been fixed, see here: OpenCL Driver Bug FuryX 32bit

                                I created a gist with some code: https://gist.github.com/JoeJGit/965136069803a26a7c24b04639d54341

                                The code is just a set of prefix sums depending on workgroup size (WG_WIDTH), and it failed with sizes 64 and 256. (lID = get_local_id(0))

                                It's a bit cryptic, so let me know if you want a simplified working GLSL version; I can create one tomorrow, time for bed now.

                                  • Re: Vulkan API flaw prevents efficient async compute usage
                                    dwitczak

                                    The code snippet is useful *but* I'm still going to need to ask you for the SPIR-V blob you're using. The crashes you're seeing could also be triggered by external factors (e.g. invalid GLSL->SPIR-V translation) and we need to rule those out.

 

                                    If you can't share it publicly, please feel free to drop me an e-mail.

                                      • Re: Vulkan API flaw prevents efficient async compute usage
                                        joej

                                        Hi Dominik,

                                         

                                         

                                        I included the GLSL and SPIR-V generated from the code below with SDK 1.0.51.

 

                                        Note that removing the last 2 lines generates another example of wrong results (see #if); there should also be a way to provoke the error with a thread group size of 64.

                                         

                                         

                                         

                                        I'll update the AsyncCompute testcase soon (actually it's just confusing because I was not aware of automatic async happening within a single queue, so don't look at it).

 

                                        I also have an issue where CreateDevice crashes if I enable all device extensions with the recent driver - I'll cover this too in that update...

                                         

                                         

                                        Best regards, Joe

                                         

                                         

                                         

                                         

                                         

                                         

                                        // Assumed surrounding declarations for this excerpt (the full shader was attached):
                                        // #define WG_WIDTH 256, layout (local_size_x = WG_WIDTH) in;
                                        // shared uint _lds[WG_WIDTH]; Gtest is a float output buffer.
                                        void main ()

                                        {

                                            uint lID = gl_LocalInvocationID.x;

                                            uint index = lID;

                                         

                                            _lds[lID] = 1;

                                            memoryBarrierShared(); barrier();

                                         

                                        #if 1 // wrong result: (1...128), (1...128) instead of (1...256)

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 0) << 1) | (lID &   0) |   1) ]    += _lds[(((lID >> 0) << 1) |   0) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 1) << 2) | (lID &   1) |   2) ]    += _lds[(((lID >> 1) << 2) |   1) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 2) << 3) | (lID &   3) |   4) ]    += _lds[(((lID >> 2) << 3) |   3) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 3) << 4) | (lID &   7) |   8) ]    += _lds[(((lID >> 3) << 4) |   7) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 4) << 5) | (lID &  15) |  16) ]    += _lds[(((lID >> 4) << 5) |  15) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 5) << 6) | (lID &  31) |  32) ]    += _lds[(((lID >> 5) << 6) |  31) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 6) << 7) | (lID &  63) |  64) ]    += _lds[(((lID >> 6) << 7) |  63) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 7) << 8) | (lID & 127) | 128) ]    += _lds[(((lID >> 7) << 8) | 127) ];    memoryBarrierShared(); barrier();

                                        #else // wrong result: (1...64), (1...64), (1...32), (1...32), (1...32), (1...32)

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 0) << 1) | (lID &   0) |   1) ]    += _lds[(((lID >> 0) << 1) |   0) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 1) << 2) | (lID &   1) |   2) ]    += _lds[(((lID >> 1) << 2) |   1) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 2) << 3) | (lID &   3) |   4) ]    += _lds[(((lID >> 2) << 3) |   3) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 3) << 4) | (lID &   7) |   8) ]    += _lds[(((lID >> 3) << 4) |   7) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 4) << 5) | (lID &  15) |  16) ]    += _lds[(((lID >> 4) << 5) |  15) ];    memoryBarrierShared(); barrier();

                                            if (lID<(WG_WIDTH>>1)) _lds[(((lID >> 5) << 6) | (lID &  31) |  32) ]    += _lds[(((lID >> 5) << 6) |  31) ];    memoryBarrierShared(); barrier();

                                        #endif

                                         

                                            Gtest[index] = float(_lds[lID]);

                                        }

                                • Re: Vulkan API flaw prevents efficient async compute usage
                                  joej

                                  Back on topic: I have updated the async test demo and created a proper (hopefully) repro here: GitHub - JoeJGit/Vulkan-Async-Compute-Test: May help to demystify Async Compute

                                  I have added more options to make it a practical testcase, and things look pretty good (I'm still some months away from putting the things I learned into practice).

                                  My initial criticism persists, but it has much less impact. There is some room for improvement; use the buttons in the 'Example settings' window to run some good/bad cases.

                                  Most important: there seems to be a bug where atomics to global memory do not work correctly from multiple queues.

                                   

                                   

                                   

                                  Excerpt from the included readme file:

                                   

                                  To utilize async compute it is necessary to use many command buffers and semaphores, causing gaps between executing shaders.

                                  It would be nice to have another synchronization primitive, similar to events but working across queues, so we would need only one command buffer per queue.

                                  Additionally, it would be possible to build a dependency graph from command buffer dispatches and memory barriers, to keep processing unaffected dispatches while executing a barrier within a single queue. (Probably against the spec, but you could make an extension.)

                                  And of course - maybe you can just shorten the gaps.

                                   

                                  Another related feature I'd like to see in the future:

                                  The possibility to skip over commands (e.g. by inserting labels and a skipif() command).

                                  I have a lot of indirect zero-work dispatches followed by memory barriers every frame, and it would be great to skip over them.

                                  (You could again do this automatically using a dependency graph.)
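                                  The skip idea can be illustrated with a small CPU-side model (purely hypothetical; skipif(), the Cmd representation, and the execute() helper are my own illustration, not any Vulkan API): a zero-work indirect dispatch and its trailing barrier are both skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command-stream model: each entry is an indirect
   dispatch (group count read at execution time) optionally
   followed by a memory barrier. */
typedef struct {
    unsigned group_count;   /* 0 = zero-work dispatch */
    int      has_barrier;   /* barrier recorded after the dispatch */
} Cmd;

/* Executes the stream, skipping zero-work dispatches together with
   the barriers that only exist to order their (absent) writes. */
static unsigned execute(const Cmd *cmds, size_t n, unsigned *barriers_run)
{
    unsigned dispatched = 0;
    *barriers_run = 0;
    for (size_t i = 0; i < n; ++i) {
        if (cmds[i].group_count == 0)
            continue;                /* "skipif": no work, no barrier */
        dispatched += cmds[i].group_count;
        if (cmds[i].has_barrier)
            ++*barriers_run;
    }
    return dispatched;
}
```

                                  With a stream like {{0,1},{4,1},{0,1},{2,1}}, only the two barriers belonging to non-empty dispatches would execute, which is the saving I'm after.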

                                   

                                  GPU-generated command buffers could do this as well. I don't know what's possible with current hardware, but NVidia's extensions cannot insert barriers, so they're not very useful here.

                                   

                                   

                                  A big source of confusion and disappointment for me was the fact that the graphics queue is faster than the 3 compute/transfer queues - simply because I did not know this.

                                  I accidentally compared the non-async graphics queue against async compute/transfer queues and got bad async results.

                                  Now, knowing this, things look pretty good, but it took me a long time and luck to get there.

                                  I also did not know that async happens automatically within a single queue as long as no barriers get in the way.

                                  Also, people assume things like queues mapping directly to ACEs and do other bad guesswork.

                                  My next guess is that the compute/transfer queues only have access to a subset of all CUs. It would be great if you could enlighten us.

                                  I think you should do a blog post to clarify things like that. (Fine-grained compute is not yet that common in games, but there should be interest.)

                                  (Click the buttons in the 'Example Settings' window to see what questions may arise for a developer.)

                                   

                                   

                                  There are two more issues you should look at (the main issue is the bug with atomics - click the 'Atomic Bug' button and look at 'Executed Wavefront count' to see it):

                                   

                                  1. VkHelper.cpp, Line 92, change the last bool to true: I did not check which extension causes the failure.

                                  2. AsyncComputeTest_VK.h, Line 1059, minor issue: I think there should be no validation warning if it is guaranteed that, when a queue waits on a semaphore, other queues start working.