I have a kernel which, unvectorized, works beautifully on Core i7, Xeon, the NVIDIA GT 330M, and the AMD Radeon HD 5870.
Vectorized, it works beautifully on the i7, Xeon, and the NVIDIA, but fails show-stoppingly on the AMD.
The failure occurs when writing the output buffer. By that point I have already calculated r, g, and b values as floats (or float4s when vectorized), each between 0 and 1.
Unvectorized, I do this for the final assembly into the output buffer:
[ pif is a struct with float elements r, g, and b ]
__kernel void PaintCUltraQuickProxyKernel( __global varholder * vh, __global uint * obuf )
uint dout = (uint)round(pif.r*255.f) << 8 | (uint)round(pif.g*255.f) << 16 | (uint)round(pif.b*255.f) << 24;
obuf[offset] = dout;
So, vectorized, I try this (with obuf the same):
[ pif is now a struct with float4 elements r, g, and b ]
uint i0 = (uint)round(pif.r.s0*255.f) << 8 | (uint)round(pif.g.s0*255.f) << 16 | (uint)round(pif.b.s0*255.f) << 24;
uint i1 = (uint)round(pif.r.s1*255.f) << 8 | (uint)round(pif.g.s1*255.f) << 16 | (uint)round(pif.b.s1*255.f) << 24;
uint i2 = (uint)round(pif.r.s2*255.f) << 8 | (uint)round(pif.g.s2*255.f) << 16 | (uint)round(pif.b.s2*255.f) << 24;
uint i3 = (uint)round(pif.r.s3*255.f) << 8 | (uint)round(pif.g.s3*255.f) << 16 | (uint)round(pif.b.s3*255.f) << 24;
obuf[offset ] = i0;
obuf[offset+1] = i1;
obuf[offset+2] = i2;
obuf[offset+3] = i3;
… thus setting the values of all four pixels calculated by the vectorized form of this kernel.
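In terms of memory layout, the four scalar word stores above should be indistinguishable from a single 16-byte store (the uint4/vstore4 style I also tried). A host-side model of the two strategies, with all names mine:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four scalar word stores, as in the unrolled kernel above. */
static void store_scalar(uint32_t *out, uint32_t i0, uint32_t i1,
                         uint32_t i2, uint32_t i3)
{
    out[0] = i0;
    out[1] = i1;
    out[2] = i2;
    out[3] = i3;
}

/* One 16-byte store, analogous to a vstore4 of a uint4. */
static void store_vector(uint32_t *out, uint32_t i0, uint32_t i1,
                         uint32_t i2, uint32_t i3)
{
    uint32_t lanes[4] = { i0, i1, i2, i3 };
    memcpy(out, lanes, sizeof lanes);
}
```

Both leave identical bytes in the buffer, which is why I would expect every variant below to behave the same way.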
The truly weird part is that I can do any three of the above assignments to obuf, and the kernel performs perfectly.
I can write the first three elements, or the last three, or any three at all, as long as I do not attempt to write all four.
If I do try to write the fourth, the system hangs at the clFinish call, and Lion requires a hard shutdown and reboot.
(Don't worry: offset is in bounds at all times. I can write anywhere within the range, just so long as I don't write all four pixels calculated by the kernel!)
[ I have tried many different ways of assembling the data for writing to obuf:
- constructing a single uint16 (with obuf typed uint16);
- writing 16 individual uchars (with obuf typed uchar);
- vstores;
- and other approaches.
All fail the same way on the AMD, and all work perfectly on the i7, Xeon, and NVIDIA. ]
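One detail worth double-checking in the uchar-at-a-time variant is byte order: on a little-endian device the low (zero) byte of each packed word comes first in memory. A quick host-side sketch of the byte sequence that variant must produce (word_to_bytes_le is my name, not from the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Split a packed pixel word into the byte sequence that a
   uchar-wise write must produce on a little-endian device. */
static void word_to_bytes_le(uint32_t word, uint8_t out[4])
{
    out[0] = (uint8_t)(word);        /* unused low byte: 0 */
    out[1] = (uint8_t)(word >> 8);   /* r */
    out[2] = (uint8_t)(word >> 16);  /* g */
    out[3] = (uint8_t)(word >> 24);  /* b */
}
```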
. . . And I guess that's about it. I'm not asking AMD for support, because I'm running Lion on a MacBook Pro and a Mac Pro, and I know AMD does not support Apple's drivers. However, I thought this would be the place to ask: I've tried the Khronos forums without finding a solution, and I figure more people here are coding for AMD hardware. Perhaps someone has run into this kind of problem before and can shine a flashlight in the right direction.
Thank you for any attention!