5 Replies Latest reply on Nov 22, 2017 2:25 AM by dipak

    Disappointing OpenCL half-precision performance on Vega - any advice?

    FangQ

      I recently bought a Vega 64. According to the specs, it has 23 TFLOPS of fp16 throughput compared to 12 TFLOPS of fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.

       

      The changes were made to a core function, which I believe is the bottleneck of the code (it may account for about 1/4 of the run-time); see the key changes in this commit:

       

      add half precision raytracer, support AMD Vega · fangq/mcxcl@0c11f79 · GitHub

       

      For comparison, here is the float counterpart:

       

      mcxcl/mcx_core.cl at master · fangq/mcxcl · GitHub

       

      My kernel is compute-bound.

       

      I don't know in which common scenarios converting to half brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?
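
      For reference, here is a rough best-case estimate of what I could have expected. It is only a sketch under two assumptions: that the converted function really is about 1/4 of the run-time (my guess, not a profiled number), and that fp16 code could reach the full 23/12 peak ratio from the specs. The function name `amdahl_speedup` is made up for this illustration; it just applies Amdahl's law.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the run-time becomes
    `local_speedup` times faster and the rest is unchanged (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumption: the converted core function is ~1/4 of the total run-time.
fraction = 0.25
# Assumption: fp16 reaches the full peak ratio from the specs (23 vs 12 TFLOPS).
peak_ratio = 23.0 / 12.0

best_case = amdahl_speedup(fraction, peak_ratio)
print(f"best-case overall speedup: {best_case:.3f}x")
```

      So even in the ideal case the overall gain would be capped at roughly 14%, and any conversion overhead or extra register pressure eats into that margin.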

       

      thanks

       

       

      PS: the code can be tested with:

      git clone https://github.com/fangq/mcxcl.git 
      cd mcxcl
      git checkout
      cd src
      make clean all
      cd ../example/benchmark
      ./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

      Removing the -J "-DUSE_HALF" option enables the original fp32 code.