The AMD OpenCL compiler seems to have changed a lot between Catalyst 11.11 and 11.12. Kernel source that compiled into tightly packed VLIW ALU instruction groups on 11.11 is likely to perform far worse when compiled on 11.12.
Would it be a good idea to distribute kernel binaries (in addition to the source) with my program, in case compiling the source on the user's computer yields bad performance?
Is there a way to compile binaries for AMD GPUs I don't have? AMD APP KernelAnalyzer seems to do this, but I see no way to do it through the OpenCL API, and there also seems to be no way to save the different binaries that KernelAnalyzer generates. I have a 6990 and a 5970, but nothing from the 4000-series.
Are three binaries enough: VLIW5 for the 4000-series, VLIW5 for the 5000-series and later, and VLIW4? Or do I need more specific binaries than that?
Is there an easy way to match precompiled binaries to the user's GPUs? Is there anything better than taking the device name reported by OpenCL and mapping it to an architecture myself?
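For reference, the name-based mapping I would like to avoid looks roughly like the sketch below. It assumes the string comes from `clGetDeviceInfo` with `CL_DEVICE_NAME`. The name fragments in the table are only my assumption of what the AMD runtime reports and would need checking on real hardware. The enum, the function names and the binary file names are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buckets, one per precompiled binary I would ship. */
typedef enum {
    ARCH_UNKNOWN = 0,     /* no match: compile from source instead */
    ARCH_VLIW5_4000,      /* VLIW5, 4000-series */
    ARCH_VLIW5_5000_PLUS, /* VLIW5, 5000-series and later */
    ARCH_VLIW4            /* VLIW4 */
} gpu_arch;

typedef struct {
    const char *name_fragment; /* substring expected in CL_DEVICE_NAME */
    gpu_arch arch;
} arch_entry;

/* ASSUMPTION: these fragments are guesses at the reported device names;
 * the list is neither verified nor complete. */
static const arch_entry arch_table[] = {
    {"RV770",   ARCH_VLIW5_4000},
    {"RV730",   ARCH_VLIW5_4000},
    {"RV710",   ARCH_VLIW5_4000},
    {"Cypress", ARCH_VLIW5_5000_PLUS},
    {"Juniper", ARCH_VLIW5_5000_PLUS},
    {"Redwood", ARCH_VLIW5_5000_PLUS},
    {"Cedar",   ARCH_VLIW5_5000_PLUS},
    {"Barts",   ARCH_VLIW5_5000_PLUS},
    {"Turks",   ARCH_VLIW5_5000_PLUS},
    {"Caicos",  ARCH_VLIW5_5000_PLUS},
    {"Cayman",  ARCH_VLIW4},
};

/* Map a device name string to one of the buckets above by substring match. */
gpu_arch arch_from_device_name(const char *device_name)
{
    if (device_name == NULL)
        return ARCH_UNKNOWN;
    for (size_t i = 0; i < sizeof arch_table / sizeof arch_table[0]; ++i) {
        if (strstr(device_name, arch_table[i].name_fragment) != NULL)
            return arch_table[i].arch;
    }
    return ARCH_UNKNOWN;
}

/* Pick the (hypothetical) binary file for a bucket; NULL means
 * "no precompiled binary, build from source". */
const char *binary_file_for_arch(gpu_arch arch)
{
    switch (arch) {
    case ARCH_VLIW5_4000:      return "kernel_vliw5_4000.bin";
    case ARCH_VLIW5_5000_PLUS: return "kernel_vliw5_5000plus.bin";
    case ARCH_VLIW4:           return "kernel_vliw4.bin";
    default:                   return NULL;
    }
}
```

With this, `binary_file_for_arch(arch_from_device_name(name))` either yields a file to load or `NULL`, in which case the program would fall back to compiling the source. The obvious weakness is that the table has to be maintained by hand for every new chip, which is why I am hoping for something better.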