I am running an OpenCL kernel on a Piledriver CPU (specifically an A10-5750M Richland APU) using AMD's OpenCL implementation. Piledriver CPUs support both FMA3 and FMA4 instructions, so I expected the "fma" builtin in OpenCL to compile to the corresponding hardware instruction. Instead, performance is terrible, and on inspection I found that the compiler emits a call to a software implementation of FMA rather than a single instruction.
Any idea why the "fma" builtin in OpenCL does not generate FMA hardware instructions? I tested on openSUSE 13.1 64-bit and on Windows 8.1 64-bit, both with Catalyst 13.12 and on the same hardware. clinfo reports the following for the driver version: "Driver version: 1348.5 (sse2,avx,fma4)", so the OpenCL runtime is clearly detecting the presence of FMA instructions.