The AMD OpenCL compiler seems to have changed a lot going from Catalyst 11.11 to 11.12. A kernel source that would compile into tightly packed VLIW ALU instruction groups on 11.11 is likely to have horrible performance on 11.12.
Might it be an idea to distribute kernel binaries (in addition to source) with my program in case compiling the source on the user's computer yields bad performance?
Is there a way to compile binaries for AMD GPUs I don't have? AMD APP KernelAnalyzer seems to do this, but I see no way to do it through the OpenCL API. Nor does there seem to be a way to save the different binaries the KernelAnalyzer makes. I have a 6990 and a 5970, but nothing from the 4000-series.
Are 3 binaries enough: VLIW5 on the 4000-series, VLIW5 on the 5000-series and later, and VLIW4? Or do I need more specific binaries than that?
Is there an easy way to match precompiled binaries to the user's GPUs? Anything other than going by the device name reported by OpenCL and mapping that to architecture myself?
You need more than 3 binaries. It is 3 binaries per generation.
I don't see any way to match devices other than by name. Why should that be a problem?
You should also make a test case and send it to AMD so they can look into that regression.
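Matching by device name, as suggested above, could be sketched roughly like this. It is only an illustration: the table entries, the file paths and the function name find_kernel_binary are made up for the example, and the name strings should be replaced by whatever CL_DEVICE_NAME actually reports on the target systems.

```c
#include <stddef.h>
#include <string.h>

/* One entry per device for which a precompiled binary is shipped.
   Both the device names and the paths are hypothetical placeholders. */
struct kernel_binary {
    const char *device_name;  /* string expected from CL_DEVICE_NAME */
    const char *path;         /* binary file distributed with the program */
};

static const struct kernel_binary shipped_binaries[] = {
    { "Cypress", "kernels/cypress.bin" },
    { "Cayman",  "kernels/cayman.bin"  },
};

/* Returns the path of the binary shipped for this device name,
   or NULL if there is none and the caller should build from source. */
const char *find_kernel_binary(const char *device_name)
{
    size_t i;
    size_t count = sizeof shipped_binaries / sizeof shipped_binaries[0];

    if (device_name == NULL)
        return NULL;

    for (i = 0; i < count; i++) {
        if (strcmp(shipped_binaries[i].device_name, device_name) == 0)
            return shipped_binaries[i].path;
    }
    return NULL;
}
```

The name would come from clGetDeviceInfo with CL_DEVICE_NAME. A non-NULL result would be loaded and passed to clCreateProgramWithBinary, and a NULL result (or a binary that fails to build on the user's driver) would fall back to clCreateProgramWithSource. For the GPUs you do own, the binaries themselves can be read back from a built program with clGetProgramInfo and CL_PROGRAM_BINARIES.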
Thanks for useful info on offline compilation!
For an example of performance degradation in the recent SDK, take a look at this:
Notice the difference: the HD5870 version compiled under Catalyst 11.7 yields a kernel with 1363 ALU instruction groups, while Catalyst 11.12 yields 1426 (1400 after they tweaked it). I am having similar issues. It seems very difficult to get the latest compiler to compile anything well for VLIW5.
You can download the kernel from a link at the URL above and try for yourself with different versions of the AMD compiler.