I'm encountering increased CPU execution times when using Vulkan on an APU.
The issue occurs once a command buffer (empty or not) is submitted. After that, CPU computations seem to take roughly twice as long.
I modified the Vulkan SDK triangle demo (version 1.0.24) to reproduce the problem, performing only initialization and a single submission, then measuring some CPU spinning. (See this Gist)
More complex applications consistently behave the same.
- Carrizo (rev 4) APU
- Windows 7
- Radeon driver package 16.7.3
I could not reproduce similar behavior on other platforms.
Thanks for reporting this issue. Before we proceed, I have a few questions.
Are you able to observe the same issue in release builds?
Does this issue reproduce in LunarG's cube.exe? (uses Immediate Presents)
Does the slowdown scale? If you gave it 3 milliseconds of work, would it take 6 milliseconds after the submission?
Did you have any other graphics applications running in the background while observing this? On my Carrizo setup, I saw the 2nd submission decrease in time, probably due to warmer caches.
Any further info you can offer would be appreciated.
- I observe the issue in both release and debug builds. I first encountered it in managed .NET applications, which had the same behavior.
- I can reproduce it in cube.exe. It persists in every present mode, and even while not presenting at all. Also, it seems to start slightly delayed (some work items immediately after the submission are not affected).
- The slowdown does not scale and seems to be consistently around x2.
- No other graphics applications running.
- Performance returns to normal after tearing down Vulkan.
- The issue does not occur on the same device under Linux (Ubuntu 16.04) with Vulkan driver 16.50.
When I tried to reproduce your test case in Debug on my Carrizo laptop, I saw that the second run and all subsequent runs were quicker, which seemed correct.
I then tried the Release build, but your loop "for (int i = 0; i < 10000000; i++)" was optimized away: the time printed to the console was 0.00000, which does not match your findings.
It would be an interesting experiment to modify your test case to do more CPU work, maybe about 3 ms worth, and see whether it scales 2x (as you are seeing) to roughly 6 ms on the runs after the submission.
Also, can you share an ETL trace of this behavior?