Hi,
just ran into an interesting problem. I realize that byte addressable store is an extension, but shouldn't the following code work regardless of the extension?
That is, is it really the case that vstore4 is illegal? Unfortunately I'm unable to look at a disassembly, but from the result I get, it looks like the vstore4 is implemented as a sequence of "1 byte" stores, which of course in reality is a series of 32 bit stores, so there's a lot of overwriting going on and the end result is 3 zero bytes and one original byte.
Looking at the standard I can see how you could interpret it either way in this case, but when the extension isn't supported, the result of the generated code will always be wrong in the case of small-datatype vectors, so it's a bit strange to allow it to go through with a successful compile.
Naturally, it works when compiling for CPU, but not GPU.
In case it isn't obvious, the reason this behavior surprised me is that it seems to me that the functionality of vstore must always be implementable on the byte level if it is possible to implement it with the bigger vectors. That is, it will always involve reading bigger aligned blocks, modifying them in some way, and writing them back, so read/write access is always needed. The only question is whether data merging is done in hardware or software.
And that, in a nutshell, is what I thought vload and vstore was for - specifying to the compiler that we're sorry for not respecting hardware alignments and request software intervention during vector loads and stores if needed.
kernel void CrossFade( float ratio, size_t width, size_t height, global uchar* pInput0, size_t inputStride0, global uchar* pInput1, size_t inputStride1, global uchar* pOutput, size_t outputStride ) { size_t x = get_global_id(0); size_t y = get_global_id(1); uchar4 argb0; uchar4 argb1; uchar4 mixed; argb0 = vload4( x, pInput0+y*inputStride0 ); argb1 = vload4( x, pInput1+y*inputStride1 ); mixed = convert_uchar4_sat_rte( mix( convert_float4(argb0), convert_float4(argb1), ratio ) ); vstore4(mixed, x, pOutput+y*outputStride ); }