I bought a Vega 64 recently. From the specs, it has 23 TFLOPS fp16 throughput compared to 12 TFLOPS fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speed-up. Disappointingly, instead of gaining speed, I got a 5% slowdown.
The changes were made in a core function, which I believe is the bottleneck of the code (accounting for maybe 1/4 of the run-time); see the key changes here:
add half precision raytracer, support AMD Vega · fangq/mcxcl@0c11f79 · GitHub
In comparison, here is the float counterpart:
mcxcl/mcx_core.cl at master · fangq/mcxcl · GitHub
My kernel is compute-bound.
I don't know in what scenarios converting to half typically brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Any dos and don'ts when using half?
thanks
PS: the code can be tested by
git clone https://github.com/fangq/mcxcl.git
cd mcxcl
git checkout
cd src
make clean all
cd ../example/benchmark
./run_benchmark1.sh -G 1 -J "-DUSE_HALF"
Removing the -J "-DUSE_HALF" option runs the original fp32 code.
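For context, the switch works via the preprocessor: the macro passed with -J "-DUSE_HALF" selects the half-precision code paths at kernel compile time. A minimal sketch of this pattern (the typedef name here is hypothetical, not the actual mcxcl code):

```c
/* Hypothetical sketch of a -DUSE_HALF compile-time switch in an OpenCL kernel. */
#ifdef USE_HALF
  /* half storage/arithmetic requires the cl_khr_fp16 extension */
  #pragma OPENCL EXTENSION cl_khr_fp16 : enable
  typedef half4  real4;   /* hypothetical alias used by the kernel */
#else
  typedef float4 real4;
#endif
```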
Actually, the rapid packed math (RPM) feature, which improves FP16 performance, is currently not exposed to OpenCL under amdgpu-pro. That's why there may be no performance gain over FP32 in your case. At the moment, RPM is supported on the ROCm stack. The following thread suggests that ROCm 1.6.4 has the support: OpenCL rapid packed math support for Vega · Issue #219 · RadeonOpenCompute/ROCm · GitHub
Thanks, dipak. I installed ROCm on one of my Ubuntu 16.04 boxes; unfortunately, it does not support my kernel well. My code runs without a problem with the amdgpu-pro OpenCL driver (and previously the fglrx driver), but it now hangs with the ROCm libamdocl64.
Is there a way to enable RPM on amdgpu-pro, or is this simply not possible?
Currently, the compiler tool-chain under amdgpu-pro does not support packed math.
Thanks. I managed to get my code to work on ROCm for some specific simulation settings, but it still fails in most other tests. Even in the test that worked, the speed is about 10% of that with the amdgpu-pro driver.
Is there a place for reporting compatibility issues like these? I saw the GitHub repos for the different modules, but I'm not sure if there is a better place to report them.
Currently, ROCm-related issues are managed on GitHub only. You can report your problem here: Issues · RadeonOpenCompute/ROCm · GitHub. I can see many OpenCL-related issues posted there. Here is another place to report ROCm OpenCL issues: Issues · RadeonOpenCompute/ROCm-OpenCL-Runtime · GitHub
Regarding the performance question, please make sure that you're using FP16/INT16 data types and operations properly to enable packed math. For example, operations on vector types like half2 or short2 can benefit from RPM if supported by the compiler.
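To illustrate the half2 point, here is a minimal OpenCL kernel sketch (not from the thread; the kernel and variable names are hypothetical). With packed math, an operation on a half2 value can map to a single packed instruction covering both fp16 lanes, whereas scalar half operations are issued one at a time:

```c
/* Hypothetical OpenCL kernel fragment illustrating half2 usage.
   Requires the cl_khr_fp16 extension on the device. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void axpy_half2(__global const half2 *x,
                         __global half2 *y,
                         const half a)
{
    size_t i = get_global_id(0);
    /* One fused multiply-add on a half2 processes two fp16 values;
       a compiler with RPM support can emit a packed instruction
       (e.g. v_pk_fma_f16 on Vega) for this. */
    y[i] = fma((half2)(a, a), x[i], y[i]);
}
```

Note that merely declaring half variables is not enough: data must be laid out and operated on as two-wide vectors for the compiler to have a chance to pack the operations.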
I am curious whether the latest amdgpu-pro now supports the half-precision hardware in the Vega 64, or whether there is a timeline for when this support will be added.
Currently, my code has lots of trouble with ROCm: very slow speed, and even infinite loops in many simulations. I am not sure it is worth the trouble of going the ROCm path.
thanks
I don't know its current support status under amdgpu-pro stack. I'll check and get back to you.
Half precision is supported on Vega with amdgpu-pro. What is not supported is packed F16 math, only scalar F16 operations are issued. There is no immediate plan for adding packed math support at this moment.
dipak wrote:
Half precision is supported on Vega with amdgpu-pro. What is not supported is packed F16 math, only scalar F16 operations are issued. There is no immediate plan for adding packed math support at this moment.
But why? Rapid Packed Math support has been promised since the Vega Technology Preview in January 2016. Why can't you, or are you not allowed to, enable it for OpenCL in AMDGPU-Pro and on Windows?
At this moment, RPM is only supported by the newer compiler toolchain under the ROCm stack. There is a plan to implement it on amdgpu-pro, but I can't give an ETA.
Hi dipak,
I just want to follow up on this previous issue: is there any confirmation or plan on whether fp16 support for Vega has been added to the amdgpu driver?
I am currently playing with ROCm 1.8.3; rocminfo does say fp16 is supported on my card, but I did not observe any speed improvement. The biggest issue with ROCm is that it has a 10-fold slowdown compared to the amdgpu driver.
thanks
any confirmation/plan if the fp16 support for vega has been added to amdgpu driver?
As I said earlier, half precision is already supported on Vega with amdgpu-pro. Are you referring to packed FP16 math support? If so, I need to check with the compiler team on the current status.
Thanks.