
Archives Discussions

bjang
Journeyman III

GFLOPS calculation and relationship with input size

I have two questions.

1. I am wondering why the SDK sample includes streamRead and streamWrite time when calculating GFLOPS. I admit transfer time is one GPU performance figure, but it has nothing to do with GPU compute throughput. I believe most people don't include data-transfer time in GFLOPS figures (including many research papers using CUDA).

2. Taking the simple_matmult example in the Brook+ SDK, its GFLOPS increases with input size up to a point and then stays almost flat (saturation). I am wondering why that is. Below is my data from a 3870 X2. As you can see, GFLOPS saturates at around 2048*2048. Any thoughts are more than welcome. I don't think it is due to a lack of threads.

Input matrix size    GFLOPS

128*128    0.13293
256*256    0.8292
384*384    1.90313
512*512    2.73721
640*640    3.24699
768*768    4.20032
896*896    5.58104
1024*1024    6.62299
1152*1152    7.3679
1280*1280    7.85343
1408*1408    8.17544
1536*1536    7.50756
1664*1664    8.01407
1792*1792    8.36521
1920*1920    8.77919
2048*2048    8.94158
2176*2176    8.93896
2304*2304    8.80759
2432*2432    9.00353
2560*2560    9.18522
2688*2688    9.11353
2816*2816    9.23887
2944*2944    9.18581
3072*3072    9.09957
3200*3200    9.20422
3328*3328    9.20185
3456*3456    9.30503
3584*3584    9.29831
3712*3712    9.28772

0 Likes
5 Replies
ryta1203
Journeyman III

1. Just my thought: most papers don't explain how they computed their GFLOPS numbers, which is, IMO, a HUGE problem in the academic community. Many people take traditional algorithms and tweak them endlessly for maximum benchmark performance, but the program they end up with often doesn't give "real world" results. This isn't an issue for simple algorithms, but it can be for more complex ones. IMO, transfer time isn't worth counting IF the transfers happen outside of the computation.

2. The only thing I can think of is that the card is being saturated, i.e., you are getting 100% occupancy: all the processors are busy doing calculations.
0 Likes

If I time only the kernel call (I mean after streamRead() and before streamWrite()), I easily get more than 3 TFLOPS on an HD 3870 for large input matrices in the simple_matmult SDK example. That doesn't make sense. Is this because of asynchronous operation between the CPU and GPU? If so, how can I measure pure kernel time (and GFLOPS) excluding transfer time?

0 Likes

streamRead and kernels are asynchronous calls -- they return immediately without waiting for the data transfer or the kernel execution to complete. streamWrite waits for the data transfer and the previous streamReads and kernels to complete.

So you would do a streamWrite on a small stream, say 1 element wide, and put the timer after that. The sequence would be: 1. start the timer; 2. call the kernel many times, say in a loop of 100 or 1000 iterations; 3. streamWrite 1 element; 4. stop the timer.
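Roughly, that sequence looks like the following Brook+-style pseudocode sketch. Only streamRead/streamWrite and the asynchronous semantics come from this thread; the stream, kernel, and timer names (simple_matmult, getTime, etc.) are illustrative, not taken from the SDK:

```
// Sketch of the timing sequence above (Brook+-style pseudocode).
float A<N, N>, B<N, N>, C<N, N>;
float sync<1>;                  // tiny 1-element stream, used only to block
float dummy[1];

streamRead(A, a);               // asynchronous: returns immediately
streamRead(B, b);

t0 = getTime();                 // 1. start timer
for (i = 0; i < ITERS; i++)     // 2. launch the kernel many times
    simple_matmult(A, B, C);
streamWrite(sync, dummy);       // 3. blocks until all queued work finishes
t1 = getTime();                 // 4. stop timer

// 2*N^3 flops per multiply, ITERS multiplies
gflops = (2.0 * N * N * N * ITERS) / ((t1 - t0) * 1e9);
```

The 1-element streamWrite is cheap to transfer but, per the semantics described above, forces the runtime to drain the whole queue of pending kernel launches, so the measured interval covers the actual GPU work rather than just the launch overhead.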

The samples in the SDK should have examples that use timers properly.

0 Likes

Thanks, udeepta@amd, now I understand.

BTW, would you be kind enough to address my original second question: why does performance saturate at a certain point as input size increases? I don't think the number of active threads explains this behavior 100%.

Any thoughts are welcome.

0 Likes

Ok, but first, what's your question?
1) Why the numbers aren't exact and vary a little from one input size to another? Thread utilization, caching behavior, interference from other processes, etc.

2) Why it tops out at only ~9 GFLOPS? Because the bottleneck isn't the execution units but probably the cache and/or texture units.
0 Likes