cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

vladimir_1
Adept II

HSA queue

Hi,

Why it is faster when executing a queue to wait for the completion signal and then add next packet to the queue rather than add N packets and only wait for the last packet to signal completion? (The queue does not overflow).

Vladimir.

0 Likes
1 Solution

There are usually options in the system BIOS for the systems power profile. That is the first place I suggest looking.

View solution in original post

0 Likes
8 Replies
jedwards
Staff

To answer this question more information is required. Specifically, the format of the AQL packets being submitted and the kernels being executed for each dispatch need to be understood. Also, understanding how the timing for both cases is being measured must be analyzed as well. A test case would be helpful.

0 Likes

Hi,

Here is modified vector_copy.c : vector_copy.c

The output of the test run :

bsp@ubunta:~/HSA-Runtime-AMD/sample$ ./vector_copy

Initializing the hsa runtime succeeded.

Calling hsa_iterate_agents succeeded.

Checking if the GPU device is non-zero succeeded.

Querying the device name succeeded.

The device name is Spectre.

Querying the device maximum queue size succeeded.

The maximum queue size is 131072.

Creating the queue succeeded.

Creating the brig module from vector_copy.brig succeeded.

Creating the hsa program succeeded.

Adding the brig module to the program succeeded.

Finding the symbol offset for the kernel succeeded.

Finalizing the program succeeded.

Querying the kernel descriptor address succeeded.

Registering argument memory for input parameter succeeded.

Registering argument memory for output parameter succeeded.

Finding a kernarg memory region succeeded.

Allocating kernel argument memory buffer succeeded.

Registering the argument buffer succeeded.

!!!!! Elapsed submit->wait->repeat 1154830

Creating a HSA signal succeeded.

Destroying the signal succeeded.

!!!!! Elapsed submit->repeat->wait 1303541

Passed validation.

Destroying the program succeeded.

Destroying the queue succeeded.

Shutting down the runtime succeeded.

bsp@ubunta:~/HSA-Runtime-AMD/sample$

As you can see first approach which also includes creation and destruction of signal is faster which really puzzles me as i was expecting exactly the opposite 😃

0 Likes

I can't access the link you provided for the modified vector_copy.c test. Could you just cut an paste the section regarding AQL packet prep and you timing code? Thanks.

0 Likes

  1. void submit_packet( hsa_ext_code_descriptor_t *hsaCodeDescriptor, hsa_queue_t* commandQueue, void* kernel_arg_buffer, hsa_signal_t signal)
  2. {
  3.     hsa_status_t err;
  4.     /*
  5.      * Initialize the dispatch packet.
  6.      */
  7.     hsa_dispatch_packet_t aql;
  8.     memset(&aql, 0, sizeof(aql));
  9.     /*
  10.      * Setup the dispatch information.
  11.      */
  12.     aql.completion_signal=signal;
  13.     aql.dimensions=1;
  14.     aql.workgroup_size_x=256;
  15.     aql.workgroup_size_y=1;
  16.     aql.workgroup_size_z=1;
  17.     aql.grid_size_x=1024*1024;
  18.     aql.grid_size_y=1;
  19.     aql.grid_size_z=1;
  20.     aql.header.type=HSA_PACKET_TYPE_DISPATCH;
  21.     aql.header.acquire_fence_scope=2;
  22.     aql.header.release_fence_scope=2;
  23.     aql.header.barrier=1;
  24.     aql.group_segment_size=0;
  25.     aql.private_segment_size=0;
  26.   
  27.     /*
  28.      * Bind kernel code and the kernel argument buffer to the
  29.      * aql packet.
  30.      */
  31.     aql.kernel_object_address=hsaCodeDescriptor->code.handle;
  32.     aql.kernarg_address=(uint64_t)kernel_arg_buffer;
  33.     /*
  34.      * Obtain the current queue write index.
  35.      */
  36.     uint64_t index = hsa_queue_load_write_index_relaxed(commandQueue);
  37.     /*
  38.      * Write the aql packet at the calculated queue index address.
  39.      */
  40.     const uint32_t queueMask = commandQueue->size - 1;
  41.     ((hsa_dispatch_packet_t*)(commandQueue->base_address))[index&queueMask]=aql;
  42.     /*
  43.      * Increment the write index and ring the doorbell to dispatch the kernel.
  44.      */
  45.     hsa_queue_store_write_index_relaxed(commandQueue, index+1);
  46.     hsa_signal_store_relaxed(commandQueue->doorbell_signal, index);
  47. }

  48. ....
  49. #define ITERATIONS 1000
  50.     struct timeval tv1;
  51.     struct timeval tv2;
  52.     gettimeofday(&tv1, 0);
  53.     for (int i = 0 ;i < ITERATIONS ; ++i)
  54.     {
  55.             hsa_signal_t signal;
  56.             hsa_signal_create(1, 0, NULL, &signal);
  57.             submit_packet(hsaCodeDescriptor, commandQueue, kernel_arg_buffer, signal);
  58.             /*
  59.              * Wait on the dispatch signal until the kernel is finished.
  60.              */
  61.             hsa_signal_value_t value = hsa_signal_wait_acquire(signal, HSA_LT, 1, (uint64_t) -1, HSA_WAIT_EXPECTANCY_UNKNOWN);
  62.             /*
  63.              * Cleanup all allocated resources.
  64.              */
  65.             hsa_signal_destroy(signal);
  66.     }
  67.     gettimeofday(&tv2, 0);
  68.     
  69.     printf("!!!!! Elapsed submit->wait->repeat %ld\n", time_difference(tv1, tv2));
  70.     {
  71.             hsa_signal_t signal;
  72.             err=hsa_signal_create(1, 0, NULL, &signal);
  73.             check(Creating a HSA signal, err);
  74.            gettimeofday(&tv1, 0);
  75.             for (int i = 0 ;i < ITERATIONS ; ++i)
  76.             {
  77.                 submit_packet(hsaCodeDescriptor, commandQueue, kernel_arg_buffer, i == ITERATIONS -1 ? signal : 0);
  78.             }
  79.             /*
  80.              * Wait on the dispatch signal until the kernel is finished.
  81.              */
  82.             hsa_signal_value_t value = hsa_signal_wait_acquire(signal, HSA_LT, 1, (uint64_t) -1, HSA_WAIT_EXPECTANCY_UNKNOWN);
  83.          gettimeofday(&tv2, 0);
0 Likes

The issue appears to be caused by latency introduced by the APU's
power saving protocols interacting with these specific workloads. The second
loop (dispatch all and then wait) appears to allow the CPU to enter a power
saving state, while the first (because it is interacting with signals), does
not. I ran this simple program in the background during execution:

int main() {

    while(1) {;}

}

With this running the numbers from the 1000 loop iteration were:

submit->wait->repeat 0.818 seconds

submit->all->repeat 0.743 seconds

Without the loop app running, if I bump the iteration count
to 10k I get the following numbers:

submit->wait->repeat 7.782 seconds

submit->all->repeat 6.492 seconds

These numbers seem reasonable. Give these two scenarios a try.

Is there a way to disable power saving mode without running infinite loop in the background ?

0 Likes

There are usually options in the system BIOS for the systems power profile. That is the first place I suggest looking.

0 Likes

Well played with power options in bios - no luck so far ;(

Probably it would be nice to get some kind of official guidance on bios/kernel tuning 😃

0 Likes