I have a question about reporting the performance of a kernel, specifically its throughput in FLOPS. In one of the kernels (executed twice per iteration), there is a single line at the very end where all work-items write their results (a single float4 each) to global memory. I would really love to showcase the performance of the GPUs, but here's my dilemma.
With this single line that writes to global memory I get very low FLOPS, but without it I get very good FLOPS. To put things into perspective, on a single AMD HD 7970 the results are ~200 GFLOPS with that line versus 1.3 TFLOPS without it.
My question is: if I want to report the computational prowess of this card in my experiment, is it okay (ethically, or in every sense of the word) to report the performance without the global memory write bottleneck? I should mention that the write is coalesced, but I read in the docs that only coalesced reads are supported, not coalesced writes, hence the massive degradation in performance. The kernels also perform a couple of reads at the beginning, but these do not affect the figures above.
You cannot be sure, because omitting the write of the results back to device memory can lead to dead-code elimination by the optimizer. I would try placing an "if" statement just above the memory write, using a condition that is definitely false but cannot be resolved at compile time. That would keep the compiler from eliminating any code.
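To illustrate the guarded write, here is a minimal sketch in plain C that stands in for one work-item's body. Everything in it is hypothetical: the names work_item, float4_t and write_guard, and the placeholder arithmetic, are not from the original kernel. In OpenCL C the function would be a __kernel and write_guard would be an ordinary kernel argument that the host always sets to 0.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } float4_t;

/* Emulates one work-item. write_guard is a run-time value, so the compiler
   cannot prove the store below is dead and must keep the arithmetic that
   feeds it, even if the caller always passes 0. */
void work_item(size_t gid, const float *in, float4_t *out, int write_guard)
{
    float4_t acc = { in[gid], in[gid], in[gid], in[gid] };

    /* Placeholder arithmetic; the real kernel's computation goes here. */
    for (int i = 0; i < 64; ++i) {
        acc.x = acc.x * 1.0001f + 0.5f;
        acc.y = acc.y * 1.0002f + 0.5f;
        acc.z = acc.z * 1.0003f + 0.5f;
        acc.w = acc.w * 1.0004f + 0.5f;
    }

    /* The global-memory write, skipped at run time when write_guard == 0. */
    if (write_guard) {
        out[gid] = acc;
    }
}
```

Passing 0 for write_guard times the computation alone, and passing 1 times it with the write. In plain C an aggressive whole-program optimizer could still see through a constant argument, so treat this only as an illustration of the pattern rather than a guarantee.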
Thanks for your feedback. I can guarantee that the compiler does not omit the necessary code, because the value being written by each work-item is reused within the code a number of times before it is finally written to global memory. However, I will try your suggestion and compare the difference just to be sure.
Having said that, if only this line (writing the value to global memory) is omitted, is it okay to report the FLOPS figure as a metric of computational throughput? Of course, I will include an explanation so nothing is hidden. I just need opinions, as I haven't really had to document such things before. Thanks.
If you just want to measure the kernel's compute performance, then it's okay to skip this line. If you want to measure overall sample/application performance, then the write is required.
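For reporting both figures, the arithmetic is simple enough to sketch in C. The helper names gflops and write_bandwidth_gbs are made up for illustration, and the FLOP count and timings are assumed to come from your own counting and profiling. The only fixed quantity is that a float4 is 16 bytes.

```c
#include <stddef.h>

/* Throughput in GFLOPS from a FLOP count and an elapsed time in seconds. */
double gflops(double flop_count, double seconds)
{
    return flop_count / seconds / 1.0e9;
}

/* Effective write bandwidth in GB/s when each work-item stores one float4
   (4 floats of 4 bytes each = 16 bytes). */
double write_bandwidth_gbs(size_t work_items, double seconds)
{
    const double bytes_per_item = 4.0 * sizeof(float);
    return (double)work_items * bytes_per_item / seconds / 1.0e9;
}
```

Calling gflops once with the time measured without the write and once with the time including it gives the two numbers to report side by side, and write_bandwidth_gbs shows how close the store is to the card's memory bandwidth.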
Thanks very much for your response. I was thinking exactly along this same line of reporting both figures in terms of kernel performance and overall performance.