I have a question about reporting the performance of a kernel, specifically its throughput in FLOPS. In one of my kernels (executed twice per iteration), there is a single line at the very end where every work-item writes its result (a single float4) to global memory. I would really love to showcase the performance of the GPUs, but here's my dilemma.
With this single global memory write in place, I get very low FLOPS; with it removed, I get very good FLOPS. To put things into perspective: on a single AMD HD 7970, I measure ~200 GFLOPS with the line and ~1.3 TFLOPS without it.
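For reference, figures like these are usually derived from a hand-counted number of floating-point operations per work-item and a measured elapsed time. The sketch below shows that arithmetic in Python. Every number in it (operation count, problem size, elapsed time) is a hypothetical placeholder and not a measurement from my runs, and the function names are made up for illustration.

```python
def gflops(flops_per_work_item, num_work_items, launches, elapsed_seconds):
    """Achieved throughput in GFLOPS for a timed run."""
    total_flops = flops_per_work_item * num_work_items * launches
    return total_flops / elapsed_seconds / 1e9


def write_bandwidth_gbs(num_work_items, launches, elapsed_seconds, bytes_per_item=16):
    """Effective global-memory write bandwidth in GB/s.

    bytes_per_item defaults to 16 because each work-item writes one
    float4 (4 floats x 4 bytes).
    """
    total_bytes = num_work_items * bytes_per_item * launches
    return total_bytes / elapsed_seconds / 1e9


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    num_work_items = 1 << 20       # assumed global work size
    flops_per_work_item = 2000     # assumed hand-counted operations
    launches = 2                   # the kernel runs twice per iteration
    elapsed_seconds = 0.02         # assumed measured time for both launches

    print("GFLOPS:", gflops(flops_per_work_item, num_work_items,
                            launches, elapsed_seconds))
    print("Write GB/s:", write_bandwidth_gbs(num_work_items, launches,
                                             elapsed_seconds))
```

Reporting the effective write bandwidth next to the FLOPS figure makes it easier to judge whether the kernel is limited by memory traffic or by arithmetic.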
My question is this: if I want to report the computational prowess of this card in my experiment, is it okay (ethically, or in any other sense of the word) to report the performance without the global memory write bottleneck? I should mention that the write is laid out to be coalesced, but I read in the docs that only coalesced reads are supported, not coalesced writes, which I assume explains the massive degradation in performance. Apart from that, the kernels perform a couple of reads at the beginning, which do not affect the figures above.