
    Understanding scheduling and/or i-cache behaviour on AMD GPUs

    mindsporter

      Hi,

       

      While measuring OpenCL kernel run times under Linux (Ubuntu 12.04, kernel 3.2.0-29-generic, AMD APP SDK v2.7) on an HD7970, I made the following observations in an experiment:

       

      Case 1: kernel1 enqueued 10000 times. Run time: 0.20 seconds

      clFinish(gpu_queue);

      gettimeofday(start_time);

      for (i = 0; i < 10000; i++) {

          clEnqueueNDRangeKernel(gpu_queue, kernel1);

      }

      clFinish(gpu_queue);

      gettimeofday(end_time);

       

      Case 2: kernel2 enqueued 10000 times. Run time: 0.24 seconds

      clFinish(gpu_queue);

      gettimeofday(start_time);

      for (i = 0; i < 10000; i++) {

         clEnqueueNDRangeKernel(gpu_queue, kernel2);

      }

      clFinish(gpu_queue);

      gettimeofday(end_time);

       

      Case 3: kernel1 enqueued 10000 times, followed by kernel2 enqueued 10000 times. Run time: 0.40 seconds.

      clFinish(gpu_queue);

      gettimeofday(start_time);

      for (i = 0; i < 10000; i++) {

         clEnqueueNDRangeKernel(gpu_queue, kernel1);

      }

      for (i = 0; i < 10000; i++) {

         clEnqueueNDRangeKernel(gpu_queue, kernel2);

      }

      clFinish(gpu_queue);

      gettimeofday(end_time);

       

      Case 4: both kernel1 and kernel2 enqueued 10000 times, in an interleaved manner. Run time: 2.07 seconds!

      clFinish(gpu_queue);

      gettimeofday(start_time);

      for (i = 0; i < 10000; i++) {

         clEnqueueNDRangeKernel(gpu_queue, kernel1);

         clEnqueueNDRangeKernel(gpu_queue, kernel2);

      }

      clFinish(gpu_queue);

      gettimeofday(end_time);
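
      For reference, the snippets above are abbreviated. A fuller sketch of the case-4 timing loop would look something like the following (the work sizes and the time_case4 wrapper are placeholders for illustration, not what I actually used, and gpu_queue, kernel1 and kernel2 are created elsewhere):

      #include <stdio.h>
      #include <sys/time.h>
      #include <CL/cl.h>

      /* Sketch of the case-4 loop. gpu_queue, kernel1 and kernel2 are assumed
       * to be created elsewhere; the work sizes are placeholders. */
      void time_case4(cl_command_queue gpu_queue, cl_kernel kernel1, cl_kernel kernel2)
      {
          size_t global_size = 1024;            /* placeholder NDRange         */
          size_t local_size  = 64;              /* placeholder work-group size */
          struct timeval start_time, end_time;

          clFinish(gpu_queue);
          gettimeofday(&start_time, NULL);
          for (int i = 0; i < 10000; i++) {
              clEnqueueNDRangeKernel(gpu_queue, kernel1, 1, NULL,
                                     &global_size, &local_size, 0, NULL, NULL);
              clEnqueueNDRangeKernel(gpu_queue, kernel2, 1, NULL,
                                     &global_size, &local_size, 0, NULL, NULL);
          }
          clFinish(gpu_queue);                  /* wait for all 20000 launches */
          gettimeofday(&end_time, NULL);

          double elapsed = (end_time.tv_sec  - start_time.tv_sec)
                         + (end_time.tv_usec - start_time.tv_usec) * 1e-6;
          printf("case 4: %.2f s\n", elapsed);
      }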

       

      The reported run times were in the same ballpark over repeated runs of the experiment.

      Why is the run time in case 4 so much higher than in case 3? Is it to do with the scheduling of the kernels and/or instruction cache behaviour? Is the i-cache invalidated at the start of every kernel run?

       

      Thanks.

      Albert.

       

      ---------------------

      Message was edited by: Albert Antony

       

      Scroll down a few responses for code that reproduces the behaviour

        • Re: Understanding scheduling and/or i-cache behaviour on AMD GPUs
          Wenju

          Hi Albert,

          Let's assume run time = setup time (not related to the calculation; it may not even exist) + calculation time (the +, -, *, / operations, reads, writes and so on), and that you have two kernels, kernel1 and kernel2.

          Case 1: I think run time = 1 setup time + 10000 calculation times (because of the cache, the time may be smaller).

          Case 2: the same as case 1.

          Case 3: run time = 2 setup times (kernel1 + kernel2) + 20000 calculation times (kernel1 + kernel2).

          Case 4: run time = 20000 setup times (kernel1 + kernel2) + 20000 calculation times (kernel1 + kernel2).

          Just speculating; maybe I'm wrong.
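
          To make the speculation concrete, here is what that model would predict with purely made-up numbers (say about 100 us of setup per launch that pays it, and about 20 us of calculation per kernel invocation):

          #include <stdio.h>

          int main(void)
          {
              double setup = 100e-6, calc = 20e-6;           /* made-up values, illustration only */

              double case3 = 2 * setup     + 20000 * calc;   /* setup paid only twice             */
              double case4 = 20000 * setup + 20000 * calc;   /* setup paid on every launch        */

              printf("model case 3: %.4f s\n", case3);       /* prints 0.4002 s */
              printf("model case 4: %.4f s\n", case4);       /* prints 2.4000 s */
              return 0;
          }

          With those made-up numbers the model gives roughly 0.40 s for case 3 and 2.4 s for case 4, which has the same shape as your measurements. Just an illustration, of course.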

            • Re: Understanding scheduling and/or i-cache behaviour on AMD GPUs
              mindsporter

              Hi Wenju,

               

              Thanks for your answer. I had started out thinking the same. However, one should note a couple of points from the APP SDK Programming Guide that are relevant here:

              1. For CPU devices, the kernel launch time is fast (tens of μs), but for discrete GPU devices it can be several hundred μs.

              2. To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch.

               

              Assuming the runtime is doing a decent job of combining the submissions, one could expect the several hundred μs overhead to effectively get diluted to a few μs (or less) per kernel, depending on how many submissions get batched together.
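
              Incidentally, instead of inferring the overhead from wall-clock differences, one can measure it per launch by creating the queue with profiling enabled and comparing an event's timestamps. A minimal sketch (context, device and kernel1 are assumed to exist already; the work sizes and the profile_one_launch wrapper are placeholders):

              #include <stdio.h>
              #include <CL/cl.h>

              /* Sketch: separate launch overhead from device execution time using
               * event profiling. Assumes context, device and kernel1 already exist. */
              void profile_one_launch(cl_context context, cl_device_id device, cl_kernel kernel1)
              {
                  cl_command_queue q = clCreateCommandQueue(context, device,
                                                            CL_QUEUE_PROFILING_ENABLE, NULL);
                  size_t gsize = 1024, lsize = 64;      /* placeholder work sizes */
                  cl_event ev;
                  cl_ulong queued, start, end;

                  clEnqueueNDRangeKernel(q, kernel1, 1, NULL, &gsize, &lsize, 0, NULL, &ev);
                  clWaitForEvents(1, &ev);

                  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
                  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
                  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);

                  printf("enqueue to start (overhead): %llu ns\n", (unsigned long long)(start - queued));
                  printf("start to end (execution):    %llu ns\n", (unsigned long long)(end - start));

                  clReleaseEvent(ev);
                  clReleaseCommandQueue(q);
              }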

               

              Another intriguing observation I made was that when I removed some code from kernel2, thus reducing its code length, the total run time for case 4 came down to the same ballpark as case 3. This is what made me start suspecting the i-cache behaviour, but I may be completely wrong on this.
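
              For what it's worth, a crude way to compare the compiled code sizes of the two kernels is to query the program binary sizes (the binary includes container overhead on top of the raw ISA, so it is only a rough indicator). A sketch, assuming kernel1 and kernel2 were built into separate programs program1 and program2, each for a single device:

              #include <stdio.h>
              #include <CL/cl.h>

              /* Sketch: compare compiled code sizes. Assumes program1/program2 are
               * already-built cl_program objects, each built for a single device. */
              void compare_binary_sizes(cl_program program1, cl_program program2)
              {
                  size_t size1 = 0, size2 = 0;
                  clGetProgramInfo(program1, CL_PROGRAM_BINARY_SIZES, sizeof(size1), &size1, NULL);
                  clGetProgramInfo(program2, CL_PROGRAM_BINARY_SIZES, sizeof(size2), &size2, NULL);
                  printf("program1 binary: %zu bytes, program2 binary: %zu bytes\n", size1, size2);
              }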

               

              Perhaps someone from AMD can shed more light?

               

              Thanks,

              Albert.

            • Re: Understanding scheduling and/or i-cache behaviour on AMD GPUs
              mindsporter

              Attached is a quickly hacked modification of the MatrixMulImage sample in SDK 2.7 that reproduces the behaviour I am observing. Only the cpp and hpp files have been touched, so only these are attached. Back up the original cpp and hpp files in the MatrixMulImage source directory in the SDK, put the new ones in their place, then build and run the MatrixMulImage sample with the -q argument.

               

              The changes of interest are towards the end of runCLKernels() in MatrixMulImage.cpp, where you can run the different cases from my experiment by setting the appropriate #define to 1 (and the others to 0). The observant eye will notice that "kernel" is mmmKernel3 and "kernel2" is mmmKernel2.

               

              Here are the numbers from a run on my machine:

              Case 1: 3.62 seconds

              Case 2: 3.76 seconds

              Case 3: 13.06 seconds

              Case 4: 28.40 seconds

               

              We are essentially observing the same behaviour, i.e., the run time for case 4 is much higher than that for case 3. It is also interesting that case 3 runtime is now almost 2 x (case 1 runtime + case 2 runtime), which is not what I had seen in the original experiment (with different kernels).

               

              Hope someone can explain the observations.

               

              Thanks,

              Albert.