
Understanding scheduling and/or i-cache behaviour on AMD GPUs

Question asked by mindsporter on Sep 12, 2012
Latest reply on Sep 17, 2012 by realhet

Hi,

 

While measuring OpenCL kernel run times under Linux (Ubuntu 12.04, kernel 3.2.0-29-generic, AMD APP SDK v2.7) on an HD 7970, I made the following observations:

 

Case 1: kernel1 enqueued 10000 times. Run time: 0.20 seconds

clFinish(gpu_queue);

gettimeofday(start_time);

for (i = 0; i < 10000; i++) {

    clEnqueueNDRangeKernel(gpu_queue, kernel1);

}

clFinish(gpu_queue);

gettimeofday(end_time);
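The gettimeofday calls in these listings are shorthand; the real POSIX call takes a pointer to a struct timeval and a second argument that is normally NULL. As a minimal sketch of how the reported run times could be derived from two such samples (elapsed_seconds is a hypothetical helper, not code from the original experiment):

```c
#include <sys/time.h>

/* Hypothetical helper (an assumption, not part of the original post):
   seconds elapsed between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start,
                       const struct timeval *end)
{
    /* tv_usec may be smaller in 'end' than in 'start'; the signed
       difference handles that once the seconds are added in. */
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Intended use around each enqueue loop:
 *
 *   struct timeval start_time, end_time;
 *   gettimeofday(&start_time, NULL);
 *   ... enqueue kernels, then clFinish(gpu_queue) ...
 *   gettimeofday(&end_time, NULL);
 *   double run_time = elapsed_seconds(&start_time, &end_time);
 */
```

Because the second sample is taken after clFinish, the measured interval covers both the host-side enqueue overhead and the device execution of every queued kernel.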

 

Case 2: kernel2 enqueued 10000 times. Run time: 0.24 seconds

clFinish(gpu_queue);

gettimeofday(start_time);

for (i = 0; i < 10000; i++) {

   clEnqueueNDRangeKernel(gpu_queue, kernel2);

}

clFinish(gpu_queue);

gettimeofday(end_time);

 

Case 3: kernel1 enqueued 10000 times, followed by kernel2 enqueued 10000 times. Run time: 0.40 seconds.

clFinish(gpu_queue);

gettimeofday(start_time);

for (i = 0; i < 10000; i++) {

   clEnqueueNDRangeKernel(gpu_queue, kernel1);

}

for (i = 0; i < 10000; i++) {

   clEnqueueNDRangeKernel(gpu_queue, kernel2);

}

clFinish(gpu_queue);

gettimeofday(end_time);

 

Case 4: kernel1 and kernel2 each enqueued 10000 times, interleaved. Run time: 2.07 seconds!

clFinish(gpu_queue);

gettimeofday(start_time);

for (i = 0; i < 10000; i++) {

   clEnqueueNDRangeKernel(gpu_queue, kernel1);

   clEnqueueNDRangeKernel(gpu_queue, kernel2);

}

clFinish(gpu_queue);

gettimeofday(end_time);

 

The reported run times were in the same ballpark over repeated runs of the experiment.

Why is the run time in case 4 so much higher than in case 3, given that both enqueue the same 20000 kernel launches? Does it have to do with the scheduling of the kernels and/or instruction cache behaviour? Is the i-cache invalidated at the start of every kernel run?
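To put the gap in per-launch terms, the totals above can be divided by the number of enqueued kernels. This is only arithmetic on the reported figures, and per_launch_us is a hypothetical helper introduced for illustration:

```c
/* Hypothetical helper (an assumption, not part of the original post):
   average wall-clock time per enqueued kernel, in microseconds. */
double per_launch_us(double total_seconds, long launches)
{
    return total_seconds * 1e6 / (double)launches;
}

/* Applied to the reported run times:
 *   case 1: per_launch_us(0.20, 10000)  is about  20 us
 *   case 2: per_launch_us(0.24, 10000)  is about  24 us
 *   case 3: per_launch_us(0.40, 20000)  is about  20 us
 *   case 4: per_launch_us(2.07, 20000)  is about 103.5 us
 */
```

On these numbers, interleaving costs roughly five times as much per launch as running the two kernels back to back, which works out to about 167 microseconds of extra time per kernel1/kernel2 pair.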

 

Thanks.

Albert.

 

---------------------

Message was edited by: Albert Antony

 

Scroll down a few responses for code that reproduces the behaviour
