4 Replies Latest reply on Nov 22, 2013 4:02 AM by wayne_static

    Reporting results on kernel performance




      I have a question about reporting the performance of a kernel, specifically its throughput in FLOPS. In one of the kernels (executed twice per iteration), there is a single line at the very end where every work-item writes its result (a single float4) to global memory. I would really love to showcase the performance of the GPUs, but here's my dilemma.


      With this single line that writes to global memory I get very low FLOPS, but without it I get very good FLOPS. To put things into perspective, on a single AMD HD 7970 the results are ~200 GFLOPS with that line versus 1.3 TFLOPS without it.
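      For anyone wanting to reproduce this kind of figure, here is a minimal sketch of the arithmetic, not the code used for the numbers above. It assumes the floating-point operations per work-item are counted by hand and the kernel time is measured separately; the helper names `gflops` and `write_gb_per_s` are hypothetical. The second helper turns the single float4 store per work-item into an effective write bandwidth, which can be reported next to the FLOPS figure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: throughput in GFLOPS, given a hand-counted number of
   floating-point operations per work-item and the measured kernel time. */
double gflops(double flops_per_item, size_t work_items, double seconds)
{
    assert(seconds > 0.0);
    return flops_per_item * (double)work_items / seconds / 1e9;
}

/* Hypothetical helper: effective write bandwidth in GB/s, assuming each
   work-item stores exactly one float4 (four floats), as in the post. */
double write_gb_per_s(size_t work_items, double seconds)
{
    assert(seconds > 0.0);
    const double bytes_per_item = 4.0 * sizeof(float);
    return bytes_per_item * (double)work_items / seconds / 1e9;
}
```

      Calling both helpers with the same work-item count and the same measured time gives a compute figure and a memory figure for one run, so the two can be quoted together.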


      My question is: if I wanted to report the computational prowess of this card in my experiment, is it okay (ethical, or okay in every sense of the word) to report the performance without the global memory write bottleneck? I should mention that the write is coalesced, but I read in the docs that coalesced writes are not supported, only coalesced reads, hence the massive degradation in performance. The kernels also perform a couple of reads at the beginning, which do not affect the figures above.



        • Re: Reporting results on kernel performance

          You cannot be sure the figure without the write is meaningful, because omitting the write of the results back to device memory can lead to code elimination by the optimizer. I would try placing an "if" statement just above the memory write, using a condition that is definitely false but cannot be resolved at compile time. That would trick the compiler into not omitting any code.
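          A rough sketch of that guard, written in plain C as a stand-in for the OpenCL kernel (the original kernel is not shown in this thread). The function `work_item`, the flag `never_true`, and the placeholder arithmetic are all hypothetical. In a real kernel the flag would presumably be a kernel argument that the host always sets to 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-item of the kernel.
   never_true is supplied at run time, so the compiler cannot prove the
   store below is dead and has to keep the loop that produces acc. */
void work_item(float x, int iterations, int never_true, float *out)
{
    assert(out != NULL);
    float acc = x;
    for (int i = 0; i < iterations; ++i) {
        acc = acc * 2.0f + 1.0f;  /* placeholder for the real computation */
    }
    /* The caller always passes 0 here when benchmarking, so the global
       write is skipped while the arithmetic above still executes. */
    if (never_true) {
        *out = acc;
    }
}
```

          With the flag set to 0 the output location is left untouched, and with a non-zero flag the result is stored as usual, so the same code can be used for both the timed run and a correctness check.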