8 Replies Latest reply on Feb 18, 2015 5:20 PM by vladimir_1

    HSA queue

    vladimir_1

      Hi,

       

      Why it is faster when executing a queue to wait for the completion signal and then add next packet to the queue rather than add N packets and only wait for the last packet to signal completion? (The queue does not overflow).

       

      Vladimir.

        • Re: HSA queue
          jedwards

          To answer this question more information is required. Specifically, the format of the AQL packets being submitted and the kernels being executed for each dispatch need to be understood. Also, understanding how the timing for both cases is being measured must be analyzed as well. A test case would be helpful.

            • Re: HSA queue
              vladimir_1

              Hi,

               

              Here is modified vector_copy.c : vector_copy.c

              The output of the test run :

              bsp@ubunta:~/HSA-Runtime-AMD/sample$ ./vector_copy

              Initializing the hsa runtime succeeded.

              Calling hsa_iterate_agents succeeded.

              Checking if the GPU device is non-zero succeeded.

              Querying the device name succeeded.

              The device name is Spectre.

              Querying the device maximum queue size succeeded.

              The maximum queue size is 131072.

              Creating the queue succeeded.

              Creating the brig module from vector_copy.brig succeeded.

              Creating the hsa program succeeded.

              Adding the brig module to the program succeeded.

              Finding the symbol offset for the kernel succeeded.

              Finalizing the program succeeded.

              Querying the kernel descriptor address succeeded.

              Registering argument memory for input parameter succeeded.

              Registering argument memory for output parameter succeeded.

              Finding a kernarg memory region succeeded.

              Allocating kernel argument memory buffer succeeded.

              Registering the argument buffer succeeded.

              !!!!! Elapsed submit->wait->repeat 1154830

              Creating a HSA signal succeeded.

              Destroying the signal succeeded.

              !!!!! Elapsed submit->repeat->wait 1303541

              Passed validation.

              Destroying the program succeeded.

              Destroying the queue succeeded.

              Shutting down the runtime succeeded.

              bsp@ubunta:~/HSA-Runtime-AMD/sample$

              As you can see first approach which also includes creation and destruction of signal is faster which really puzzles me as i was expecting exactly the opposite =)

                • Re: HSA queue
                  jedwards

                  I can't access the link you provided for the modified vector_copy.c test. Could you just cut an paste the section regarding AQL packet prep and you timing code? Thanks.

                    • Re: HSA queue
                      vladimir_1
                      1. void submit_packet( hsa_ext_code_descriptor_t *hsaCodeDescriptor, hsa_queue_t* commandQueue, void* kernel_arg_buffer, hsa_signal_t signal)
                      2. {
                      3.     hsa_status_t err;
                      4.     /*
                      5.      * Initialize the dispatch packet.
                      6.      */
                      7.     hsa_dispatch_packet_t aql;
                      8.     memset(&aql, 0, sizeof(aql));
                      9.     /*
                      10.      * Setup the dispatch information.
                      11.      */
                      12.     aql.completion_signal=signal;
                      13.     aql.dimensions=1;
                      14.     aql.workgroup_size_x=256;
                      15.     aql.workgroup_size_y=1;
                      16.     aql.workgroup_size_z=1;
                      17.     aql.grid_size_x=1024*1024;
                      18.     aql.grid_size_y=1;
                      19.     aql.grid_size_z=1;
                      20.     aql.header.type=HSA_PACKET_TYPE_DISPATCH;
                      21.     aql.header.acquire_fence_scope=2;
                      22.     aql.header.release_fence_scope=2;
                      23.     aql.header.barrier=1;
                      24.     aql.group_segment_size=0;
                      25.     aql.private_segment_size=0;
                      26.   
                      27.     /*
                      28.      * Bind kernel code and the kernel argument buffer to the
                      29.      * aql packet.
                      30.      */
                      31.     aql.kernel_object_address=hsaCodeDescriptor->code.handle;
                      32.     aql.kernarg_address=(uint64_t)kernel_arg_buffer;
                      33.     /*
                      34.      * Obtain the current queue write index.
                      35.      */
                      36.     uint64_t index = hsa_queue_load_write_index_relaxed(commandQueue);
                      37.     /*
                      38.      * Write the aql packet at the calculated queue index address.
                      39.      */
                      40.     const uint32_t queueMask = commandQueue->size - 1;
                      41.     ((hsa_dispatch_packet_t*)(commandQueue->base_address))[index&queueMask]=aql;
                      42.     /*
                      43.      * Increment the write index and ring the doorbell to dispatch the kernel.
                      44.      */
                      45.     hsa_queue_store_write_index_relaxed(commandQueue, index+1);
                      46.     hsa_signal_store_relaxed(commandQueue->doorbell_signal, index);
                      47. }

                      48. ....
                      49. #define ITERATIONS 1000
                      50.     struct timeval tv1;
                      51.     struct timeval tv2;
                      52.     gettimeofday(&tv1, 0);
                      53.     for (int i = 0 ;i < ITERATIONS ; ++i)
                      54.     {
                      55.             hsa_signal_t signal;
                      56.             hsa_signal_create(1, 0, NULL, &signal);
                      57.             submit_packet(hsaCodeDescriptor, commandQueue, kernel_arg_buffer, signal);
                      58.             /*
                      59.              * Wait on the dispatch signal until the kernel is finished.
                      60.              */
                      61.             hsa_signal_value_t value = hsa_signal_wait_acquire(signal, HSA_LT, 1, (uint64_t) -1, HSA_WAIT_EXPECTANCY_UNKNOWN);
                      62.             /*
                      63.              * Cleanup all allocated resources.
                      64.              */
                      65.             hsa_signal_destroy(signal);
                      66.     }
                      67.     gettimeofday(&tv2, 0);
                      68.     
                      69.     printf("!!!!! Elapsed submit->wait->repeat %ld\n", time_difference(tv1, tv2));
                      70.     {
                      71.             hsa_signal_t signal;
                      72.             err=hsa_signal_create(1, 0, NULL, &signal);
                      73.             check(Creating a HSA signal, err);
                      74.            gettimeofday(&tv1, 0);
                      75.             for (int i = 0 ;i < ITERATIONS ; ++i)
                      76.             {
                      77.                 submit_packet(hsaCodeDescriptor, commandQueue, kernel_arg_buffer, i == ITERATIONS -1 ? signal : 0);
                      78.             }
                      79.             /*
                      80.              * Wait on the dispatch signal until the kernel is finished.
                      81.              */
                      82.             hsa_signal_value_t value = hsa_signal_wait_acquire(signal, HSA_LT, 1, (uint64_t) -1, HSA_WAIT_EXPECTANCY_UNKNOWN);
                      83.          gettimeofday(&tv2, 0);
                        • Re: HSA queue
                          jedwards

                          The issue appears to be caused by latency introduced by the APU's
                          power saving protocols interacting with these specific workloads. The second
                          loop (dispatch all and then wait) appears to allow the CPU to enter a power
                          saving state, while the first (because it is interacting with signals), does
                          not. I ran this simple program in the background during execution:

                           

                          int main() {

                              while(1) {;}

                          }

                           

                          With this running the numbers from the 1000 loop iteration were:

                          submit->wait->repeat 0.818 seconds

                          submit->all->repeat 0.743 seconds

                           

                          Without the loop app running, if I bump the iteration count
                          to 10k I get the following numbers:

                          submit->wait->repeat 7.782 seconds

                          submit->all->repeat 6.492 seconds

                           

                          These numbers seem reasonable. Give these two scenarios a try.

                          1 of 1 people found this helpful