Disappointing OpenCL half-precision performance on Vega - any advice?

Discussion created by FangQ on Nov 19, 2017
Latest reply on Aug 31, 2018 by dipak

I recently bought a Vega 64. According to the specs, it has 23 TFLOPs of fp16 throughput compared to 12 TFLOPs of fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% speed drop.


The changes were made to a core function, which I believe is the bottleneck of the code (it may account for about 1/4 of the run time); see the commit:


add half precision raytracer, support AMD Vega · fangq/mcxcl@0c11f79 · GitHub


For comparison, here is the float counterpart:


mcxcl/ at master · fangq/mcxcl · GitHub


My kernel is compute-bound.


I don't know in which common scenarios converting to half brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?





PS: the code can be tested with:

git clone 
cd mcxcl
git checkout
cd src
make clean all
cd ../example/benchmark
./ -G 1 -J "-DUSE_HALF"

Removing the -J "-DUSE_HALF" option will enable the original fp32 code.