A most Vexing problem; highly strange difficulty writing output buffer....

Discussion created by Photovore on Nov 6, 2011
Latest reply on Nov 13, 2011 by Photovore
Can write any 3 of the 4 pixels computed from float4 values to the output buffer, just not all 4!


Hi folks.

I have a kernel that, unvectorized, works beautifully on a Core i7, a Xeon, an NVIDIA 330M, and an AMD 5870.

Vectorized, it works beautifully on the i7, the Xeon, and the NVIDIA, but fails show-stoppingly on the AMD.

The trouble comes when writing the output buffer. By that point I have already calculated the r, g, and b values as floats (or float4s if vectorized), each between 0 and 1.

Nonvectorized, I do this for the final assembly into the output buffer:

[ pif is a struct with float elements r, g, and b ]

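__kernel void PaintCUltraQuickProxyKernel( __global varholder * vh, __global uint * obuf )
{
    // ... (computation of pif and offset omitted) ...

    // Pack r, g, b (each in 0..1) into bits 8-15, 16-23, and 24-31 of one uint.
    uint dout = (uint)round(pif.r*255.f) << 8 | (uint)round(pif.g*255.f) << 16 | (uint)round(pif.b*255.f) << 24;

    obuf[offset] = dout;
}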


So, vectorized, I try this (with obuf declared the same way):

[ pif is now a struct with float4 elements r, g, and b ]

uint i0,i1,i2,i3;

i0 = (uint)round(pif.r.s0*255.f) << 8 | (uint)round(pif.g.s0*255.f) << 16 | (uint)round(pif.b.s0*255.f) << 24;
i1 = (uint)round(pif.r.s1*255.f) << 8 | (uint)round(pif.g.s1*255.f) << 16 | (uint)round(pif.b.s1*255.f) << 24;
i2 = (uint)round(pif.r.s2*255.f) << 8 | (uint)round(pif.g.s2*255.f) << 16 | (uint)round(pif.b.s2*255.f) << 24;
i3 = (uint)round(pif.r.s3*255.f) << 8 | (uint)round(pif.g.s3*255.f) << 16 | (uint)round(pif.b.s3*255.f) << 24;

obuf[offset  ] = i0;
obuf[offset+1] = i1;
obuf[offset+2] = i2;
obuf[offset+3] = i3;

This sets the values of all four pixels calculated by the vectorized form of the kernel.

The truly weird part is that I can do any three of the above assignments to obuf, and the kernel performs perfectly.

I can write the first three elements … I can write the last three … I can write any three as long as I do not attempt to write all four.

If I do try to write the fourth, my system hangs in the call to clFinish, and Lion then requires a hard shutdown and reboot.

(Don't worry; offset is in bounds at all times; note that I can write anywhere within the range, just so long as I don't write all four pixels calculated by the kernel!)
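One way to double-check the in-bounds claim from the host side is sketched below in plain C. It rests on an assumption that is not stated in the post: that work-item gid writes the `lanes` consecutive elements starting at obuf[gid * lanes]. The function name `writes_in_bounds` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption (not from the post): work-item gid writes
 * obuf[gid * lanes] .. obuf[gid * lanes + lanes - 1].
 * Returns true if every such write stays inside a buffer
 * of buf_elems elements. */
static bool writes_in_bounds(size_t global_size, size_t lanes, size_t buf_elems)
{
    if (global_size == 0 || lanes == 0) {
        return true; /* nothing is written */
    }
    /* Equivalent to global_size * lanes <= buf_elems,
     * written as a division so the product cannot overflow. */
    return global_size <= buf_elems / lanes;
}
```

Under that assumption, a 1024-element buffer is safe with 256 work-items writing 4 elements each, but not if the global size is left at 1024 after vectorizing.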

[ I have tried many different ways of assembling the data for writing to obuf:

  constructing a single uint16 (with obuf type uint16);

  writing 16 individual uchars (with obuf type uchar);

  vstores;

  and other ways,

  all with the same failure mode; they all work perfectly on i7, Xeon, and NVIDIA. ]

And I guess that's about it. I'm not asking for support from AMD, because I am using Lion on a MacBook Pro and on a Mac Pro, and I know that AMD does not support the Apple drivers. However, I thought that this would be the place to ask! I've tried the Khronos forums but have not found a solution, and I figure that here I will find more folks coding for AMD. Perhaps someone has run into this kind of problem before and can shine a flashlight in the right direction.

Thank you for any attention!