FangQ

Disappointing OpenCL half-precision performance on Vega - any advice?

Discussion created by FangQ on Nov 19, 2017
Latest reply on Aug 31, 2018 by dipak

I bought a Vega 64 recently. According to the specs, it has 23 TFLOPS of fp16 throughput compared to 12 TFLOPS of fp32, so I converted a portion of my Monte Carlo code to half precision, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.

 

The changes were made to a core function, which I believe is the bottleneck of the code (it may account for about 1/4 of the run-time); see the key commit:

 

add half precision raytracer, support AMD Vega · fangq/mcxcl@0c11f79 · GitHub

 

For comparison, here is the float counterpart:

 

mcxcl/mcx_core.cl at master · fangq/mcxcl · GitHub

 

My kernel is compute-bound.
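As a rough sanity check on what speedup is even possible here, Amdahl's law gives an upper bound from the two numbers quoted in this post: roughly 1/4 of the run-time spent in the converted function, and a spec-sheet fp16:fp32 ratio of 23/12. The sketch below is only an estimate under those assumptions (that the 1/4 figure is accurate and that the converted code reaches the full peak ratio, which real kernels rarely do); the function name is illustrative and is not part of mcxcl.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the run-time is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Numbers taken from the post; both are rough estimates, not measurements.
fraction_in_half = 0.25      # share of run-time in the converted core function
peak_ratio = 23.0 / 12.0     # spec-sheet fp16 vs fp32 throughput, assumed fully achieved

best_case = amdahl_speedup(fraction_in_half, peak_ratio)
print(f"best-case overall speedup: {best_case:.3f}x")  # prints 1.136x
```

So even in the ideal case the whole program would run only about 14% faster, which means that a modest overhead from float/half conversions or extra register use could be enough to cancel the gain.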

 

I don't know in which common scenarios converting to half brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Are there any dos and don'ts when using half?

 

Thanks

 

 

PS: the code can be tested with:

git clone https://github.com/fangq/mcxcl.git 
cd mcxcl
git checkout
cd src
make clean all
cd ../example/benchmark
./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

Removing the -J "-DUSE_HALF" option enables the original fp32 code.
