For a while now I've been tracking down a uint4 swizzle bug that first appeared in the Catalyst 11.6 driver and has persisted in all subsequent versions (including the just-released 11.9).
I can't post the kernel here, since I haven't been able to build a test case that is both simple and generic enough to release. But I can describe the scenario, and AMD can get the reproducing OpenCL kernel from me or as an attachment in KnowledgeBase ticket #1507.
Basically, I have a four element vector (a uint4) that I initialize before entering a loop that goes around four times. So that I can avoid using a switch statement, I rotate the elements one position using a swizzle so that the loop body can work on element s0 of the vector, and yet will process each of the four elements by the end of all the loop iterations.
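For illustration only (this is not the actual kernel, which isn't posted here), the pattern can be sketched in plain C. The names `vec4u`, `rotate_left` and `sum_via_rotation` are hypothetical, the rotation direction (the equivalent of the OpenCL swizzle `v = v.s1230`) is an assumption, and the trivial "add s0" body stands in for the much more complex real loop body.

```c
#include <stdint.h>

/* Plain-C stand-in for an OpenCL uint4 (hypothetical type name). */
typedef struct { uint32_t s0, s1, s2, s3; } vec4u;

/* Emulates the swizzle v = v.s1230 (assumed direction):
 * each element moves down one slot and s0 wraps around to s3. */
static vec4u rotate_left(vec4u v) {
    vec4u r = { v.s1, v.s2, v.s3, v.s0 };
    return r;
}

/* The loop body only ever touches s0, yet visits all four elements
 * because the vector is rotated once per iteration. */
static uint32_t sum_via_rotation(vec4u v, vec4u *after_loop) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += v.s0;          /* placeholder for the real loop body */
        v = rotate_left(v);   /* bring the next element into s0 */
    }
    *after_loop = v;          /* four rotations: original order again */
    return acc;
}
```

With a correct compiler, the vector written to `after_loop` is identical to the one passed in, because four single-position rotations compose to the identity.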
Then, here's the bug... if I reference the vector elements after the loop is all done, they should be back in the original order... and they are with Catalyst 11.5. Yet with Catalyst 11.6 through 11.9, the later references to the vector elements are scrambled. Well, actually, they appear to have been rotated only three times, not four. I think this is either a simple off-by-one value-tracking bug in the IL to ISA compiler built into the Catalyst 11.6+ driver, or a bad interaction with dead-code elimination, since the final "rotate" isn't needed if subsequent references are adjusted. If the IL to ISA compiler were open sourced, I could have fixed this a month ago...
Anyway, my workaround is to save a copy of the uint4 vector before entering the loop, and then use the copy after the loop. I'm sure it would also work to copy the uint4 to a temporary variable that is only used inside the loop... either way, all the workaround does is bypass the erroneous loop-unrolled value tracking of a series of swizzles.
Unfortunately, in my testing, simple constructed examples do not exhibit the bug; it appears only when the loop body has enough complexity.
I've only tested under Linux (CentOS 5.6, the RHEL 5.6 clone), but I have tested on three different GPU families: Radeon HD 4650, 6670, and 6970. I've also tested both APP SDK 2.4 and 2.5, and while the different SDK versions generate different IL and ISA for the kernel, the error is still the same.
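A minimal plain-C sketch of the "save a copy" workaround, under the same assumptions as before: hypothetical names (`u4`, `rot1`, `process_with_saved_copy`), an assumed rotation direction, and a placeholder loop body. The point is only that code after the loop reads the saved copy, so nothing depends on how many rotations the compiler thinks were applied.

```c
#include <stdint.h>

/* Plain-C stand-in for an OpenCL uint4 (hypothetical type name). */
typedef struct { uint32_t s0, s1, s2, s3; } u4;

/* Emulates a one-position swizzle rotation (assumed to be v.s1230). */
static u4 rot1(u4 v) {
    u4 r = { v.s1, v.s2, v.s3, v.s0 };
    return r;
}

static uint32_t process_with_saved_copy(u4 v, u4 *after_loop) {
    const u4 saved = v;           /* copy taken BEFORE the loop */
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc = acc * 31u + v.s0;   /* placeholder, order-dependent body */
        v = rot1(v);              /* rotated vector is used only in here */
    }
    *after_loop = saved;          /* later code uses the copy, not v */
    return acc;
}
```

The extra copy is also where the possible increase in register pressure comes from: both `saved` and the rotating `v` are live across the loop.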
My point in this post is that I really wish AMD would add more unit testing and more "real codes" testing to their software releases for OpenCL and the underlying Catalyst driver, or release more of the infrastructure as open source so the community can augment AMD's QA/development efforts.
Finally, I have a workaround that, at the moment, doesn't seem to impact the performance of my code, but it could increase register pressure.
I'll second open sourcing the IL->ISA and OpenCL->IL compilers! I wouldn't mind being able to shut off some of the "optimizations" in the IL->ISA compiler... like its ludicrous use of registers. I was using between 25 and 30 registers, and the ISA code is using 51 (regardless of the 25-51 range). I'm pretty close to abandoning IL and writing ISA instead... it's ugly, and a pain, but I gotta wonder, is it worth it?
I found that I can also work around the bug by putting a "#pragma unroll 4" on the loop that contains the swizzle. Thus, I think this "swizzle bug" is another manifestation of the bug reported here:
"Loop not executed correct number of times unless unrolled"
I think the problem is that the Catalyst 11.6 driver introduced a bug in automatic loop unrolling that fails to keep track of some side effects, such as those from swizzle operations. Thus, if we give the IL->ISA compiler already-unrolled loops, we avoid the bug, and if we avoid relying on those side effects occurring the right number of times, we also avoid the bug.
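As a rough picture of what a correctly unrolled loop should reduce to, here is a hand-unrolled plain-C sketch with the rotations folded into direct element references. The names (`quad`, `sum_unrolled`) are hypothetical, the rotation direction is assumed to be the equivalent of `v = v.s1230`, and the trivial body is a placeholder; it is not taken from the real kernel or from the compiler's output.

```c
#include <stdint.h>

/* Plain-C stand-in for an OpenCL uint4 (hypothetical type name). */
typedef struct { uint32_t s0, s1, s2, s3; } quad;

/* Four iterations of "use s0, then rotate by one" written out by hand.
 * After k rotations, s0 holds the original element k, so each
 * iteration can reference that element directly. */
static uint32_t sum_unrolled(quad v, quad *after_loop) {
    uint32_t acc = 0;
    acc += v.s0;      /* iteration 0: no rotation yet          */
    acc += v.s1;      /* iteration 1: s0 after one rotation    */
    acc += v.s2;      /* iteration 2: s0 after two rotations   */
    acc += v.s3;      /* iteration 3: s0 after three rotations */
    *after_loop = v;  /* the fourth rotation restores the original order */
    return acc;
}
```

The observed failure is consistent with the last step being mishandled: references after the loop behave as if only three rotations had been applied instead of four.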
Catalyst 11.10 changes things, and for my simplified test case it initially appears to fix this bug. However, when I run my full program with Catalyst 11.10 and APP SDK 2.5, I get differently incorrect results.
Also, if I now use the "#pragma unroll" directives, my program crashes when I target a Cayman or a Turks GPU. When I target an RV730 GPU, everything is now correct, though very slow (about 70% slower than SDK 2.4 and Catalyst 11.5).
I don't have time right now to keep diagnosing this, and am very frustrated that one of my workarounds now causes a segmentation fault in the Catalyst compiler!
Maybe I can talk to some AMD developers at SC11. Cherry-picking versions of their software like this is not a viable way to do technical computing on GPUs. AMD really needs to apply some serious software engineering practices to their release process: unit testing, regression testing, and code review (hint: open source increases the number of eyeballs finding and fixing bugs). Using AMD GPUs for anything beyond games/entertainment is rapidly ceasing to be an option, even though the hardware is so fast.