I am running an OpenCL kernel on a Piledriver CPU (specifically an A10-5750M Richland APU) using AMD's OpenCL implementation. Piledriver CPUs support FMA3 and FMA4 operations, so I expected the "fma" builtin in OpenCL to generate the corresponding hardware instruction. Instead, performance is terrible, and I found that rather than a single instruction, the compiler generates a function call to a software implementation of FMA.
Any idea why fma in OpenCL does not generate FMA hardware instructions? I tested on openSUSE 13.1 64-bit and on Windows 8.1 64-bit, both with Catalyst 13.12 on the same hardware. clinfo reports the following for the driver version: "Driver version: 1348.5 (sse2,avx,fma4)", so the OpenCL runtime is clearly detecting the presence of FMA instructions.
You are right. I tried this on a machine with Catalyst 14.2 and a Kaveri processor, and it also did not generate FMA instructions in the assembly for the CPU, while the GPU code had them.
I see in clinfo for CPU:
Driver version: 1411.4 (sse2,avx,fma4)
Just for reference, here is the test code I used:
__kernel void myfma(__global float *a, __global float *b, __global float *c, __global float *d) {
    size_t i = get_global_id(0);  /* one work-item per element */
    d[i] = fma(a[i], b[i], c[i]);
}
Intel's OpenCL does not generate FMA instructions for Haswell either.
My belief is that the CPU FMA instructions don't conform to the IEEE 754-2008 floating-point standard, and hence to the OpenCL standard.
moozoo, you seem to be incorrect. Intel OpenCL does generate FMA instructions. I see the following assembly when I compile with Intel (OpenCL 1.2 (Build 56860)):
vmovups (%r10,%rbx,4), %xmm2
vmovups (%rdx,%rbx,4), %xmm1
vfmadd132ps %xmm0, %xmm2, %xmm1
vmovups %xmm1, (%r9,%rbx,4)