Hi everybody,
I'm running some benchmarks on discrete and integrated GPUs, measuring completion time and energy consumption and collecting GPU counters.
For almost all the algorithms I'm executing, ranging from Saxpy through Reduction to Convolution, I'm getting results that are difficult to interpret once the input data gets bigger than 32 MB.
For example, in Convolution the completion time grows roughly linearly with the matrix size (9 ms, 16 ms, 30 ms, ...) for matrices smaller than 8M elements (32 MB). At 64 MB the completion time jumps to 130 ms, i.e. 4 times the completion time for 32 MB.
In Saxpy I see the same pattern: the completion time goes from 25 ms to 50 ms to 100 ms, then jumps to 300 ms for 64 MB of data, which is 3x the completion time for half the input size.
A huge increase in completion time for input data bigger than 32 MB seems to affect all my benchmarks.
Moreover, looking at GPU counters such as GPUBusy, it seems that for such input sizes the GPU resources are underutilized.
Is the increase in completion time due to the cost of memory pinning?
Can you help me explain the drop in most of the GPU counter values?
Below are two tables: the first is for Saxpy, executed on the discrete GPU with no-flags buffers (device allocation); the second is for Convolution (3x3 filter, single precision), executed on the A8 integrated GPU with the ALLOC_HOST | READ_ONLY flags (host-visible, pre-pinned allocation).
These are only two examples, but the jump in completion time and the drop in GPU counters actually show up across all the buffer allocation strategies and devices I used.
Saxpy (vector size expressed in bytes): http://www.gabrielecocco.it/Workbook2.htm
Convolution (matrix size expressed in total elements): http://www.gabrielecocco.it/Workbook3.htm
Thank you very much for your help!