12 Replies Latest reply on Aug 31, 2018 4:26 AM by dipak

    Disappointing OpenCL half-precision performance on Vega - any advice?


      I bought a Vega 64 recently. According to the specs, it has 23 TFLOPS of fp16 throughput compared to 12 TFLOPS of fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.


      The changes were made to a core function, which I believe is the bottleneck of the code (it may account for about 1/4 of the run-time). See the key changes in this commit:


      add half precision raytracer, support AMD Vega · fangq/mcxcl@0c11f79 · GitHub


      For comparison, here is the float counterpart:


      mcxcl/mcx_core.cl at master · fangq/mcxcl · GitHub


      My kernel is compute-bound.
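
      For reference, here is a rough back-of-envelope estimate of the best case I could have hoped for. It is only a sketch: it assumes the spec-sheet peaks (23 vs 12 TFLOPS) are actually reachable, that the converted function really is about 1/4 of the run-time, and that Amdahl's law applies. The function and constant names are made up for illustration. Under those assumptions the ceiling is only about 1.14x overall, so even a small conversion or register overhead could plausibly wipe out the gain.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of the run-time is sped up by `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    # The untouched part keeps its cost; the converted part shrinks.
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumed inputs, taken from the figures quoted in this post.
FP16_TFLOPS = 23.0    # spec-sheet fp16 peak
FP32_TFLOPS = 12.0    # spec-sheet fp32 peak
HALF_FRACTION = 0.25  # guessed share of run-time in the converted function

best_case = amdahl_speedup(HALF_FRACTION, FP16_TFLOPS / FP32_TFLOPS)
print(f"best-case overall speedup: {best_case:.3f}x")
```

      Changing HALF_FRACTION shows how much of the kernel would have to run in half before the fp16 peak starts to matter.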


      I don't know in which scenarios converting to half typically brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?





      PS: the code can be tested with:

      git clone https://github.com/fangq/mcxcl.git 
      cd mcxcl
      git checkout
      cd src
      make clean all
      cd ../example/benchmark
      ./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

      Removing the -J "-DUSE_HALF" option runs the original fp32 code.