Hi,
When executing a queue, why is it faster to wait for the completion signal and then add the next packet to the queue, rather than add N packets and wait only for the last packet to signal completion? (The queue does not overflow.)
Vladimir.
To answer this question, more information is required. Specifically, we need to understand the format of the AQL packets being submitted and the kernels being executed for each dispatch. We also need to see how the timing is measured in both cases. A test case would be helpful.
Hi,
Here is the modified vector_copy.c: vector_copy.c
The output of the test run:
bsp@ubunta:~/HSA-Runtime-AMD/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Calling hsa_iterate_agents succeeded.
Checking if the GPU device is non-zero succeeded.
Querying the device name succeeded.
The device name is Spectre.
Querying the device maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Creating the brig module from vector_copy.brig succeeded.
Creating the hsa program succeeded.
Adding the brig module to the program succeeded.
Finding the symbol offset for the kernel succeeded.
Finalizing the program succeeded.
Querying the kernel descriptor address succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Registering the argument buffer succeeded.
!!!!! Elapsed submit->wait->repeat 1154830
Creating a HSA signal succeeded.
Destroying the signal succeeded.
!!!!! Elapsed submit->repeat->wait 1303541
Passed validation.
Destroying the program succeeded.
Destroying the queue succeeded.
Shutting down the runtime succeeded.
bsp@ubunta:~/HSA-Runtime-AMD/sample$
As you can see, the first approach, which also includes creating and destroying the signal, is faster. That really puzzles me, as I was expecting exactly the opposite 😃
I can't access the link you provided for the modified vector_copy.c test. Could you just cut and paste the section regarding AQL packet prep and your timing code? Thanks.
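For comparison, here is a minimal sketch of how the two cases could be timed with the same clock. It is not the code from the modified vector_copy.c. The names `submit_fn`, `wait_fn`, `count_submit` and `count_wait` are hypothetical stand-ins for "write one AQL packet and ring the doorbell" and "block on the completion signal"; the counting stubs exist only so the harness can be exercised without a GPU.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime with strict -std= flags */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callbacks standing in for the real dispatch and signal wait. */
typedef void (*submit_fn)(void *ctx, int i);
typedef void (*wait_fn)(void *ctx, int i);

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Case 1: submit one packet, wait for its completion, repeat n times. */
uint64_t time_submit_wait_repeat(submit_fn submit, wait_fn wait,
                                 void *ctx, int n) {
    assert(submit && wait && n > 0);
    uint64_t t0 = now_ns();
    for (int i = 0; i < n; i++) {
        submit(ctx, i);
        wait(ctx, i);
    }
    return now_ns() - t0;
}

/* Case 2: submit all n packets, then wait only for the last one. */
uint64_t time_submit_all_then_wait(submit_fn submit, wait_fn wait,
                                   void *ctx, int n) {
    assert(submit && wait && n > 0);
    uint64_t t0 = now_ns();
    for (int i = 0; i < n; i++) {
        submit(ctx, i);
    }
    wait(ctx, n - 1);
    return now_ns() - t0;
}

/* Stub callbacks that only count calls (illustration only, no GPU work). */
struct call_counts { int submits; int waits; int last_waited; };

void count_submit(void *ctx, int i) {
    (void)i;
    ((struct call_counts *)ctx)->submits++;
}

void count_wait(void *ctx, int i) {
    struct call_counts *c = ctx;
    c->waits++;
    c->last_waited = i;
}
```

Replacing the two stubs with the real packet submission and signal wait would put both loops under the same clock. Anything done inside the timed region, such as creating and destroying a signal per iteration, is then charged to that case.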
The issue appears to be caused by latency introduced by the APU's power-saving protocols interacting with these specific workloads. The second loop (dispatch all, then wait) appears to allow the CPU to enter a power-saving state, while the first does not, because it keeps interacting with signals. I ran this simple program in the background during execution:
int main(void) {
    while (1) { }   /* spin forever to keep one CPU core busy */
}
With this running, the numbers for 1000 loop iterations were:
submit->wait->repeat 0.818 seconds
submit->all->repeat 0.743 seconds
Without the loop app running, if I bump the iteration count to 10k, I get the following numbers:
submit->wait->repeat 7.782 seconds
submit->all->repeat 6.492 seconds
These numbers seem reasonable. Give these two scenarios a try.
Is there a way to disable power-saving mode without running an infinite loop in the background?
There are usually options in the system BIOS for the system's power profile. That is the first place I suggest looking.
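On Linux, one software-side option that may be worth trying is the kernel's PM QoS interface. A process that writes a 32-bit latency limit (in microseconds) to /dev/cpu_dma_latency and keeps the file open asks the kernel to avoid CPU idle states with a longer wake-up latency. Whether this removes the slowdown discussed in this thread is an assumption and has not been tested here. The helper names `hold_cpu_latency` and `release_cpu_latency` are hypothetical, and opening the device normally requires root.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Open `path` (normally "/dev/cpu_dma_latency") and write the requested
 * maximum wake-up latency in microseconds as a 32-bit integer.
 * Returns the open descriptor, or -1 on failure. The request stays in
 * effect only while the descriptor remains open. */
int hold_cpu_latency(const char *path, int32_t max_latency_us) {
    assert(path != NULL);
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        return -1;
    }
    ssize_t written = write(fd, &max_latency_us, sizeof max_latency_us);
    if (written != (ssize_t)sizeof max_latency_us) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Close the descriptor, which drops the latency request. */
void release_cpu_latency(int fd) {
    if (fd >= 0) {
        close(fd);
    }
}
```

A benchmark could call `hold_cpu_latency("/dev/cpu_dma_latency", 0)` before the timed loops and `release_cpu_latency(fd)` afterwards, then compare the results with the background-loop numbers above.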
Well, I played with the power options in the BIOS: no luck so far ;(
It would be nice to get some kind of official guidance on BIOS/kernel tuning 😃