For a while now I've been tracking down a uint4 swizzle bug that first appeared in the Catalyst 11.6 driver and has persisted in all subsequent versions (including the just-released 11.9).
I can't post the kernel here since I haven't been able to reduce it to a test case that is both simple and generic enough to release. But I can describe the scenario, and AMD can get the reproducing OpenCL kernel from me; it is also attached to KnowledgeBase ticket #1507.
Basically, I have a four element vector (a uint4) that I initialize before entering a loop that goes around four times. So that I can avoid using a switch statement, I rotate the elements one position using a swizzle so that the loop body can work on element s0 of the vector, and yet will process each of the four elements by the end of all the loop iterations.
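To make the pattern concrete, here is a minimal sketch in plain C that models it (this is not the real kernel). In OpenCL C the rotate would be a swizzle along the lines of `v = v.s1230;`. The rotation direction, the `u4` type, and the `rotate1` and `sum_via_s0` names are all assumptions made up for illustration, and the loop body is a trivial stand-in.

```c
#include <stdint.h>

/* Plain-C stand-in for an OpenCL uint4 (hypothetical type name). */
typedef struct { uint32_t s0, s1, s2, s3; } u4;

/* Models the OpenCL swizzle `v = v.s1230;`.
   The rotation direction is an assumption; the post does not state it. */
static u4 rotate1(u4 v) {
    u4 r = { v.s1, v.s2, v.s3, v.s0 };
    return r;
}

/* Hypothetical loop body: it only ever reads s0, yet it visits every
   element because the vector is rotated once per iteration. */
static uint32_t sum_via_s0(u4 *v) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += v->s0;      /* work on element s0 only */
        *v = rotate1(*v);  /* bring the next element into s0 */
    }
    /* Four rotations of a four-element vector: *v should be back
       in its original order at this point. */
    return acc;
}
```

With a correct compiler, the vector passed in comes back unchanged after the call, which is exactly the property the later code relies on.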
Then, here's the bug... if I reference the vector elements after the loop is all done, they should be back in the original order... and they are with Catalyst 11.5. Yet, with Catalyst 11.6 through 11.9, the later references to the vector elements are scrambled. Well, actually, they appear to have been rotated only three times, not four. I think this is a simple off-by-one value-tracking bug in the IL-to-ISA compiler built into the Catalyst 11.6+ driver, or a bad interaction with dead-code elimination, since the final "rotate" isn't needed if subsequent references are adjusted. If the IL-to-ISA compiler were open source, I could have fixed this a month ago...
Anyway, my workaround is to save a copy of the uint4 vector before entering the loop, and then use the copy after the loop. I'm sure it would also work to copy the uint4 to a temporary variable that is only used inside the loop... either way, all the workaround does is bypass the erroneous loop-unrolled value tracking of a series of swizzles.
Unfortunately, in my testing, simple constructed examples do not exhibit the bug; it appears only when the loop body has enough complexity.
I've only tested under Linux (CentOS 5.6, the RHEL 5.6 clone), but I have tested on three different GPU families: Radeon HD 4650, 6670, and 6970. I've also tested both APP SDK 2.4 and 2.5, and while the different SDK versions generate different IL and ISA for the kernel, the error is still the same.
My point in posting this is that I really wish AMD would add more unit testing and more "real codes" testing to their software releases for OpenCL and the underlying Catalyst driver, or release more of the infrastructure as open source so the community can augment their QA/development efforts.
Finally, I have a workaround that, at the moment, doesn't seem to impact the performance of my code, but it could increase register pressure.
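The shape of the workaround, again modeled in plain C rather than taken from the real kernel, is roughly the following. The `vec4u` type, the `rot` and `loop_then_read` names, and the loop body are hypothetical. In OpenCL C the snapshot would simply be a second `uint4` variable assigned before the loop.

```c
#include <stdint.h>

/* Plain-C stand-in for an OpenCL uint4 (hypothetical type name). */
typedef struct { uint32_t s0, s1, s2, s3; } vec4u;

/* Models a one-position rotate swizzle such as `v = v.s1230;`
   (the direction is an assumption). */
static vec4u rot(vec4u v) {
    vec4u r = { v.s1, v.s2, v.s3, v.s0 };
    return r;
}

/* Runs the four-iteration loop on v, then reads an element afterwards.
   The post-loop read goes through a snapshot taken before the loop, so it
   does not depend on the compiler tracking the chain of rotates correctly. */
static uint32_t loop_then_read(vec4u v, uint32_t *acc_out) {
    const vec4u saved = v;  /* workaround: copy before entering the loop */
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += v.s0 * (uint32_t)(i + 1);  /* stand-in for the real loop body */
        v = rot(v);
    }
    *acc_out = acc;
    return saved.s1;        /* use the copy, not the rotated vector */
}
```

The extra copy is what may cost a few registers, since both the rotating vector and the snapshot stay live across the loop.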