ke0m
Journeyman III

Benchmarking using OpenCL on AMD GPU

Hello all,

I am new to OpenCL development and I am currently doing some benchmark tests using OpenCL on an AMD Radeon HD 7870.

The code I have written in JOCL (the Java bindings for OpenCL) simply adds two 2D arrays (z = x + y), but it does so many times over (z = x + y + y + y + ...).

The two arrays are 500 by 501, and I vary the number of iterations for which I add them together on the GPU, increasing it by a factor of ten each run: first once, then ten times, then one hundred times, and so on.
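The host-side timing loop looks roughly like the sketch below (illustrative names only; runArraySumKernel() stands in for my actual JOCL set-arg/enqueue/wait code):

// Sketch of the 5-second timing window; runArraySumKernel(iters) is a hypothetical
// stand-in for setting the kernel arguments, enqueueing it over all 500*501 elements,
// and waiting for completion.
int[] iterationCounts = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000};
for (int iters : iterationCounts) {
    long counter = 0;
    long start = System.nanoTime();
    while (System.nanoTime() - start < 5_000_000_000L) {   // 5-second window
        runArraySumKernel(iters);                          // hypothetical helper
        counter++;
    }
    double timeElapsed = (System.nanoTime() - start) / 1e9;
    System.out.println("Number of Iterations: " + iters);
    System.out.println("Counter: " + counter);
    // the FLOPS rate is then computed from counter, iters and timeElapsed (formula below)
}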

The maximum number of iterations that I loop to is 100,000,000. Below is what the log file looks like when I run my code (counter is the number of times my program executes in 5 seconds):

Number of Iterations: 1
Counter: 87
FLOPS Rate: 0.0043310947 GFLOPs/s

Number of Iterations: 10
Counter: 88
FLOPS Rate: 0.043691948 GFLOPs/s

Number of Iterations: 100
Counter: 84
FLOPS Rate: 0.41841218 GFLOPs/s

Number of Iterations: 1000
Counter: 71
FLOPS Rate: 3.5104263 GFLOPs/s

Number of Iterations: 10000
Counter: 8
FLOPS Rate: 3.8689642 GFLOPs/s

Number of Iterations: 100000
Counter: 62
FLOPS Rate: 309.70895 GFLOPs/s

Number of Iterations: 1000000
Counter: 17
FLOPS Rate: 832.0814 GFLOPs/s

Number of Iterations: 10000000
Counter: 2
FLOPS Rate: 974.4635 GFLOPs/s

Number of Iterations: 100000000
Counter: 1
FLOPS Rate: 893.7945 GFLOPs/s

Do these numbers make sense? I feel that 0.97 TeraFLOPS is quite high and that I must be calculating the number of FLOPs incorrectly.

Just for reference, I am calculating the FLOPS in the following way:

FLOPS = counter*(500)*(501)*(iterations)/(time_elapsed)
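Or, spelled out in Java with illustrative variable names (not my exact code):

// Same formula in Java; counter, iterations and timeElapsed (seconds) are the
// values described above.
double flops  = (double) counter * 500 * 501 * iterations / timeElapsed;
double gflops = flops / 1e9;
// Rough check against the 1,000,000-iteration row, taking timeElapsed to be about 5 s:
// 17 * 500 * 501 * 1e6 / 5 is roughly 8.5e11 FLOPS, i.e. ~850 GFLOPs/s, in the same
// ballpark as the logged 832 GFLOPs/s.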

Any help with this issue will be greatly appreciated.

Thank you

nou
Exemplar

The maximum theoretical peak of the Radeon HD 7870 is 2.5 TFLOPS, but that figure is for the MAD instruction (x*y+z), which counts as two operations. With a simple ADD you can reach only ~1.2 TFLOPS, so at 0.9 TFLOPS you are pretty close to the theoretical maximum.
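For reference, a back-of-the-envelope version of those peak numbers, assuming the stock HD 7870 configuration of 1280 stream processors at 1000 MHz (factory-overclocked cards will differ):

// Rough peak-throughput estimate for an assumed stock HD 7870
// (1280 stream processors, 1000 MHz engine clock).
long streamProcessors = 1280;
double clockHz = 1.0e9;                                // 1000 MHz
double madPeakFlops = streamProcessors * clockHz * 2;  // MAD = multiply + add = 2 ops/clock
double addPeakFlops = streamProcessors * clockHz * 1;  // plain ADD = 1 op/clock
// madPeakFlops is about 2.56e12 (2.56 TFLOPS); addPeakFlops is about 1.28e12 (1.28 TFLOPS)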

ke0m
Journeyman III

Thank you nou for the response. I am now a bit concerned about some other benchmark results that I am encountering. I have created a surface plot that shows how the GigaFLOPS value changes as the size of the arrays and the number of iterations change. A link to the plot is here:

Senior_Design/JOCL/Graphing/GoodGPUPlot.PNG at master · ke0m/Senior_Design · GitHub

I showed this plot to an expert and he said that the results, while feasible, were artificially high. His reasoning was that the slope from 32,000 iterations to 130,000 iterations is too steep to make physical sense. He suggested that once I pass the 32,000-iteration mark, the number I am iterating over (declared as an int) is converted to a short by the OpenCL compiler (presumably because a signed short tops out at 32,767), so from that point on the actual number of iterations stays constant. Since I calculate the GigaFLOPS assuming the iteration count keeps increasing, the figures I compute would then be artificially high.

I tested this by adding two 2x2 arrays (filled with ones) 40,000 times and the result was 40,002 (i.e. 1 + 1 + 40,000 × 1), which suggests that the kernel really does iterate more than 32,000 times and that the OpenCL compiler does not cast the int to a short. I am very confused by these results. Any explanation for the steep slope in my graph would be greatly appreciated.

My OpenCL kernel code is shown below:

__kernel void arraysum(__global const float *d_xx, __global const float *d_yy, __global float *d_zz, int iters)
{
    // one work-item per element of the 500x501 arrays
    int i = get_global_id(0);

    float t_d_zz = d_xx[i] + d_yy[i];

    for (int j = 0; j < iters; j++) {
        t_d_zz += d_yy[i];
    }

    d_zz[i] = t_d_zz;
}

The variable I am referring to above (the int that is supposedly converted to a short, and can therefore only reach a certain value) is "iters".
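For reference, iters is passed to the kernel as a 4-byte int on the JOCL side, along the lines of the sketch below (illustrative object names, not my exact host code; assumes import static org.jocl.CL.* and the org.jocl classes):

// Sketch of the argument setup and launch; memXX, memYY, memZZ, kernel and queue are
// placeholders for already-created cl_mem, cl_kernel and cl_command_queue objects.
clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memXX));
clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memYY));
clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memZZ));
clSetKernelArg(kernel, 3, Sizeof.cl_int, Pointer.to(new int[]{ iters }));   // 32-bit int

// one work-item per array element: global size = 500 * 501
clEnqueueNDRangeKernel(queue, kernel, 1, null, new long[]{ 500 * 501 }, null, 0, null, null);
clFinish(queue);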


nou
Exemplar

Well, you should manually unroll your loop, since the loop itself has high overhead; aim for something like 64 additions per loop iteration. Also, it doesn't make sense that the int would be cast to a short. Finally, you should run your kernel about 10 times in a row and take the median, so that the GPU is "warmed up"; the first kernel execution is always much slower.

// Example of the unrolled loop: step j by the unroll factor so the total number of
// additions still equals iters (assumes iters is a multiple of 4; the same pattern
// extends to an unroll factor of 64).
for (int j = 0; j < iters; j += 4) {
    t_d_zz += d_yy[i];
    t_d_zz += d_yy[i];
    t_d_zz += d_yy[i];
    t_d_zz += d_yy[i];
}
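A minimal sketch of the warm-up/median idea on the JOCL host side (runKernelOnce() is a hypothetical stand-in for enqueueing the kernel and waiting for it to finish):

// Run the kernel several times and take the median, so the first (slow) launch does
// not dominate the measurement. runKernelOnce() is a hypothetical helper wrapping
// clEnqueueNDRangeKernel(...) followed by clFinish(queue).
double[] seconds = new double[10];
for (int run = 0; run < seconds.length; run++) {
    long t0 = System.nanoTime();
    runKernelOnce();                                   // hypothetical helper
    seconds[run] = (System.nanoTime() - t0) / 1e9;
}
java.util.Arrays.sort(seconds);
double median = (seconds[4] + seconds[5]) / 2.0;       // median of 10 samples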
