I have a kernel which, unvectorized, works beautifully on I7, Xeon, nVidia 330m and AMD 5870.
Vectorized, it works beautifully on I7, Xeon, and the nVidia, but fails show-stoppingly on the AMD.
Attempting to write the output buffer … I've previously calculated r, g, and b values as floats (or float4s if vectorized), between 0 and 1 in value.
Nonvectorized I do this for the final assembly into the output buffer:
[ pif is a struct with float elements r, g, and b ]
__kernel void PaintCUltraQuickProxyKernel( __global varholder * vh, __global uint * obuf )
uint dout = (uint)round(pif.r*255.f) << 8 | (uint)round(pif.g*255.f) << 16 | (uint)round(pif.b*255.f) << 24;
obuf[offset] = dout;
SO, vectorized I try this (with obuf the same):
[ pif is now a struct with float4 elements r, g, and b ]
i0 = (uint)round(pif.r.s0*255.f) << 8 | (uint)round(pif.g.s0*255.f) << 16 | (uint)round(pif.b.s0*255.f) << 24;
i1 = (uint)round(pif.r.s1*255.f) << 8 | (uint)round(pif.g.s1*255.f) << 16 | (uint)round(pif.b.s1*255.f) << 24;
i2 = (uint)round(pif.r.s2*255.f) << 8 | (uint)round(pif.g.s2*255.f) << 16 | (uint)round(pif.b.s2*255.f) << 24;
i3 = (uint)round(pif.r.s3*255.f) << 8 | (uint)round(pif.g.s3*255.f) << 16 | (uint)round(pif.b.s3*255.f) << 24;
obuf[offset ] = i0;
obuf[offset+1] = i1;
obuf[offset+2] = i2;
obuf[offset+3] = i3;
… thus setting the values of all four pixels calculated by the vectorized form of this kernel.
The truly weird part is that I can do any three of the above assignments to obuf, and the kernel performs perfectly.
I can write the first three elements … I can write the last three … I can write any three as long as I do not attempt to write all four.
If I do try to write the fourth, then my system hangs on calling clFinish. Lion requires a hard shutdown and reboot.
(Don't worry; offset is in bounds at all times; note that I can write anywhere within the range, just so long as I don't write all four pixels calculated by the kernel!)
[ I have tried this with many different ways of assembling the data for writing to obuf . . .
I have tried this by constructing a single uint16 (with obuf type uint16);
I have tried this by writing 16 individual uchars (with obuf type uchar);
Tried vstores . . .
and other ways . . .
all with the same failure mode; they all work perfectly on I7, Xeon, and nVidia. ]
. . . Aaaand, I guess that's about it. I'm not asking for support from AMD, because I am using Lion on a Macbook Pro and on a Mac Pro, and I know that AMD does not support the Apple drivers. HOWEVER, I thought that this would be the place to ask!!! I've tried over on Khronos' forums, but have not scored a solution; I figure that here I will find more folks coding for AMD. Perhaps someone has run into this kind of problem before and can shine a flashlight in the right direction . . . .
Thank you for any attention!
Well, Micah, I think that it must be Apple-specific, or else there would be no way to assemble and write a 16-byte chunk to global memory, which would cripple vectorization of a kernel whose scalar-version work-item outputs, for example, a four-byte pixel.
Just for kicks, here's another way it fails: (oops, code chunk attached, removing it here)
... I can do all four assignments to .s0->.s3, without the vstore, and it runs. I can do the vstore, and any two non-adjacent assignments to the subelements, (0 & 1, 1 & 3, 0 & 3) and I get good data back. If I assign any three elements, or all four, Lion hangs. [Using separate writes to int4s I am able to do three of four writes.] Cannot be universal, or *everybody* would be complaining.
I don't know about the feasability of a test on Linux ... would need to grab a machine, then how do I execute my kernel in the absence of its extremely intimately connected support software, which itself is heavily dependent on Cocoa frameworks? [The heart of it is straight C and has been for decades, but the UI is extremely involved and then there's timing, file system, screen stuff....] It could be done, and nonsense inputs to the kernel dummied up, but for the time involved, I've calculated that I could just get a 12-core Xeon fruit crate and not need the GPU at all ... IF I wanna spend 5 grand of savings ... but might be quickest route to showtime, if I can't find a reasonable workaround for this.
I don't suppose that there is any way for me to shoehorn your newer AMD compiler into the guts of Xcode, is there??? I have done extremely low-level work and am not afraid of anything.
I guess a possible workaround is to use an obuf twice as wide as needed, not write every other pixel, and squish it on the CPU. *sigh*. I cringe a bit at the thought. But, what have I got to lose?
Thanks for your reply!
uint4 i4; i4.s0 = (uchar)round(pif.r.s0*255.f) << 8 | (uchar)round(pif.g.s0*255.f) << 16 | (uchar)round(pif.b.s0*255.f) << 24; i4.s1 = (uchar)round(pif.r.s1*255.f) << 8 | (uchar)round(pif.g.s1*255.f) << 16 | (uchar)round(pif.b.s1*255.f) << 24; i4.s2 = (uchar)round(pif.r.s2*255.f) << 8 | (uchar)round(pif.g.s2*255.f) << 16 | (uchar)round(pif.b.s2*255.f) << 24; i4.s3 = (uchar)round(pif.r.s3*255.f) << 8 | (uchar)round(pif.g.s3*255.f) << 16 | (uchar)round(pif.b.s3*255.f) << 24; vstore4( i4, 0, dad );
1) Dang, I forget what the first point was. May edit later.
2) Well, from what I've seen I should be able to strip the kernel down to 6 or 8 lines and still have it fail. A test for you would then be trivial. (And, for you, it will work!) Thank you for the offer. Tomorrow, perhaps.
3) Perhaps I shall try that workaround ... I'm pretty sure I'm not at all bound by memory bandwidth; I could have each vectorized kernel-pup write 32 bytes, of which only the odd ones will be meaningful. Should take me just minutes. [That will give me half the performance I need; I think I'll send half of the job per frame to the 4-core Xeon, and that will give me all I need.]
Micah, I thank you for your attention to a problem which looks to be ultimately outside of AMD's sphere of concern. It may, however, affect future sales of your GPUs to an unknown extent, so you should be able to justify the small time spent on it!
Well, that one was weird from nose to tail, let me tell you:
You'll recall that I was doing something like
uint i0,i1,i2,i3; float4 r,g,b; //obuf is __global uint *
i0 = (uint) round( r.s0 * 255.f ) << 8 | (uint) round( g.s0 * 255.f ) << 16 | (uint) round( b.s0 * 255.f ) << 24;
i1 = (uint) round( r.s1 * 255.f ) << 8 | (uint) round( g.s1 * 255.f ) << 16 | (uint) round( b.s1 * 255.f ) << 24;
i2 = (uint) round( r.s2 * 255.f ) << 8 | (uint) round( g.s2 * 255.f ) << 16 | (uint) round( b.s2 * 255.f ) << 24;
i3 = (uint) round( r.s3 * 255.f ) << 8 | (uint) round( g.s3 * 255.f ) << 16 | (uint) round( b.s3 * 255.f ) << 24;
obuf[offset ] = i0;
obuf[offset+1] = i1;
obuf[offset+2] = i2;
obuf[offset+3] = i3;
. . . what I could do without crashing was either: 1) all four of the assignments, but only three of the writes, or 2) all four of the writes but only three of the assignments. If I tried to do all four of both, Lion hung.
What fixed it?
Removing the call to round().
Makes perfect sense, doesn't it? Well, it might if I had access to the source code of the compiler, and I might also need detailed schematics of the 5870. And a month to bone up on chip design. But, I've got other things to do, so I'll let it remain a mystery for now. It turns out that the call to round() is unneeded anyway.
Of course now that it works it's been collapsed to a single uint4 operation on the float4s, followed by a single vstore4().
Micah, thanks for your help. It was your suggestion that led me to the solution. I tried 4 or 5 new ways to make it work since I last posted, but eventually I gave up and decided to make a little kernel for you to test. The littlest one worked, and I kept adding bits until I found the culprit. (The last bit of weirdness is that the call to round() was ok in the test kernel, and only failed when I moved that test line back into the real kernel.)
I want to add that I'm pretty pleased with the performance of the 5870; it about matches my chickenscratch calculations from before I bought it. I had a chance the other day to try this app on a Quadro FX 5600, which I think is like a $2000 card, and it wasn't nearly as fast. I don't have hard numbers but I think I may be getting something like 30-40% of the theoretical max.