
mindsporter
Journeyman III

Understanding scheduling and/or i-cache behaviour on AMD GPUs

Hi,

While noting OpenCL kernel run times under Linux (Ubuntu 12.04, kernel 3.2.0-29-generic, AMD APP SDK v2.7) on the HD7970, I made the following observations in an experiment:

Case 1: kernel1 enqueued 10000 times. Run time: 0.20 seconds

clFinish(gpu_queue);
gettimeofday(start_time);
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel1);
}
clFinish(gpu_queue);
gettimeofday(end_time);

Case 2: kernel2 enqueued 10000 times. Run time: 0.24 seconds

clFinish(gpu_queue);
gettimeofday(start_time);
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel2);
}
clFinish(gpu_queue);
gettimeofday(end_time);

Case 3: kernel1 enqueued 10000 times, followed by kernel2 10000 times. Run time: 0.40 seconds.

clFinish(gpu_queue);
gettimeofday(start_time);
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel1);
}
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel2);
}
clFinish(gpu_queue);
gettimeofday(end_time);

Case 4: both kernel1 and kernel2 enqueued 10000 times, in an interleaved manner. Run time: 2.07 seconds!

clFinish(gpu_queue);
gettimeofday(start_time);
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel1);
    clEnqueueNDRangeKernel(gpu_queue, kernel2);
}
clFinish(gpu_queue);
gettimeofday(end_time);

The reported run times were in the same ballpark over repeated runs of the experiment.

Why is the run time in case 4 so much higher than in case 3? Is it to do with the scheduling of the kernels and/or instruction cache behaviour? Is the i-cache invalidated at the start of every kernel run?
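In case it helps with diagnosing this, one thing I could try is timing the kernels with OpenCL event profiling instead of wall-clock time, to see whether the extra time in case 4 is even spent on the GPU. A minimal sketch only, assuming gpu_queue was created with CL_QUEUE_PROFILING_ENABLE and that global_size and local_size stand in for whatever work sizes the real code uses; error checking is omitted, and note that waiting on every event will itself change the batching behaviour:

cl_event evt;
cl_ulong t_start, t_end, gpu_ns = 0;

for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel1, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    gpu_ns += t_end - t_start;   /* pure GPU execution time, in ns */
    clReleaseEvent(evt);
}
/* If gpu_ns comes out about the same for case 3 and case 4, the extra
   wall-clock time is going into submission/scheduling rather than into
   kernel execution itself. */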

Thanks.

Albert.

---------------------

Message was edited by: Albert Antony

Scroll down a few responses for code that reproduces the behaviour

Wenju
Elite

Hi Albert,

Let's assume run time = setup time (not related to the calculation; maybe it does not even exist) + calculation time (+, -, *, /, read and write operations, and so on), and that you have two kernels, kernel1 and kernel2.

Case 1: I think run time = 1 setup time + 10000 calculation times (because of the cache, the time may be smaller). Case 2 is the same as case 1. Case 3: run time = 2 setup times (kernel1 + kernel2) + 20000 calculation times (kernel1 + kernel2). Case 4: run time = 20000 setup times (kernel1 + kernel2) + the calculation time (kernel1 + kernel2).

Just speculating; maybe I'm wrong.


Hi Wenju,

Thanks for your answer. I had started out thinking the same. However, one must note a couple of points mentioned in the APP SDK Programming Guide in relation to this:

1. For CPU devices, the kernel launch time is fast (tens of μs), but for discrete GPU devices it can be several hundred μs.

2. To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch.

Assuming the runtime is doing a decent job of combining the submissions, one could expect the several hundred μs overhead to effectively get diluted to a few μs (or less) per kernel, depending on how many submissions get batched together.
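Conversely, if the interleaving in case 4 somehow defeated the batching entirely, the launch overhead alone would be roughly the right size to explain the number. Purely illustrative arithmetic, not a measurement: the 100 μs launch overhead below is an assumed mid-range value for the guide's "several hundred μs", and the 20 μs per-kernel execution time is back-calculated from case 1.

#include <stdio.h>

int main(void)
{
    /* Illustrative only: plug assumed figures into the "setup + calculation"
       model discussed above. Both numbers are guesses. */
    double launch_us = 100.0; /* assumed per-launch overhead when not batched */
    double exec_us   = 20.0;  /* per-kernel execution time, ~0.20 s / 10000   */

    double case3_s = (2 * launch_us + 20000 * exec_us) / 1e6; /* two big batches */
    double case4_s = (20000 * (launch_us + exec_us)) / 1e6;   /* no batching     */

    printf("case 3 ~ %.2f s, case 4 ~ %.2f s\n", case3_s, case4_s);
    /* Prints roughly 0.40 s and 2.40 s, the same ballpark as the measured
       0.40 s and 2.07 s. */
    return 0;
}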

Another intriguing observation I made was that when I removed some code from kernel2, thus reducing its code length, the total runtime for case 4 came down to the same ballpark as for case 3. This is what made me start suspecting the i-cache behaviour. But I may be completely wrong on this.
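One control experiment that might separate the two explanations (batching versus i-cache): keep the kernels un-interleaved as in case 3, but force small submissions by flushing the queue every iteration. This is only a sketch of the idea; global_size and local_size again stand in for the real work sizes, and the arguments are otherwise the same as in my snippets above.

for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel1, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFlush(gpu_queue);   /* deliberately defeat the runtime's batching */
}
for (i = 0; i < 10000; i++) {
    clEnqueueNDRangeKernel(gpu_queue, kernel2, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFlush(gpu_queue);
}
clFinish(gpu_queue);
/* If this un-interleaved version becomes as slow as case 4, the cost is in
   submission/batching; if it stays near case 3, switching between kernels
   (e.g. the i-cache) looks more likely. */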

Perhaps someone from AMD can shed more light?

Thanks,

Albert.


IIRC there is 64 kB of instruction memory. If your kernel is bigger than that, you get a huge slowdown, as it is then stored in global memory. This may be the case here, since it must switch between the kernels in instruction memory.


Hi nou,

The ISA dump for kernel2 had reported:

codeLenInByte        = 1136;Bytes

I don't know how this translates to the total length in instructions, but I would guess that, on its own, the kernel is small enough to fit in the i-cache. But what you say could be true if the i-cache is shared by multiple kernels from different contexts and is not invalidated at the start of every kernel run. Does anyone know if this is the case? If it is, is there a way to make the kernel more i-cache friendly?
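(For anyone wanting a quick sanity check on code size from the host without going through the ISA dump, the compiled binary size can be queried; this is only an upper bound, since on AMD the returned binary is a container holding more than the raw ISA. A minimal sketch, assuming a single-device program and with error checking omitted:)

size_t binary_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                 sizeof(binary_size), &binary_size, NULL);
printf("program binary size: %zu bytes\n", binary_size);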

Thanks,

Albert.

mindsporter
Journeyman III

Attached is a quickly hacked modification of the MatrixMulImage sample in SDK 2.7 that reproduces the behaviour I am observing. Only the cpp and hpp files have been touched, hence only these have been attached. Back up the original cpp and hpp files in the MatrixMulImage source directory in the SDK and put the new ones in their place. Build and run the MatrixMulImage sample with the -q argument.

The changes of interest in the source are towards the end of runCLKernels() in MatrixMulImage.cpp, where you will find that you can run the different cases from my experiments by setting the appropriate #define to 1 (set the others to 0). The observant eye would notice that "kernel" is mmmKernel3 and "kernel2" is mmmKernel2.

Here are the numbers from a run on my machine:

Case 1: 3.62 seconds

Case 2: 3.76 seconds

Case 3: 13.06 seconds

Case 4: 28.40 seconds

We are essentially observing the same behaviour, i.e., the run time for case 4 is much higher than that for case 3. It is also interesting that case 3 runtime is now almost 2 x (case 1 runtime + case 2 runtime), which is not what I had seen in the original experiment (with different kernels).

Hope someone can explain the observations.

Thanks,

Albert.


Anybody?


I think:

In case 1 and case 2 you use the memory in a way that it is cached well.

In case 4 the combined local memory usage became twice as big, so the cache is less effective.

In case 3 there could be an overlapping time interval which behaves like case 4 -> 1111111212121212222222

Recently I've found that the 7970 is über-sensitive to memory access patterns, so that's why I think this could be the situation.

(I-cache: 32 KB on GCN, 48 KB on VLIW (shared across 4 CUs), so your ~1 KB of code fits in perfectly.)
