cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Illusio
Journeyman III

vstoreN behavior with byte pointers

Hi,

just ran into an interesting problem. I realize that byte addressable store is an extension, but shouldn't the following code work regardless of the extension?

That is, is it really the case that vstore4 is illegal? Unfortunately I'm unable to look at a disassembly, but from the result I get, it looks like the vstore4 is implemented as a sequence of "1 byte" stores, which of course in reality is a series of 32 bit stores, so there's a lot of overwriting going on and the end result is 3 zero bytes and one original byte.

Looking at the standard I can see how you could interpret it either way in this case, but when the extension isn't supported, the result of the generated code will always be wrong in the case of small-datatype vectors, so it's a bit strange to allow it to go through with a successful compile.

Naturally, it works when compiling for CPU, but not GPU.

In case it isn't obvious, the reason this behavior surprised me is that it seems to me that the functionality of vstore must always be implementable on the byte level if it is possible to implement it with the bigger vectors. That is, it will always involve reading bigger aligned blocks, modifying them in some way, and writing them back, so read/write access is always needed. The only question is whether data merging is done in hardware or software.

And that, in a nutshell, is what I thought vload and vstore was for - specifying to the compiler that we're sorry for not respecting hardware alignments and request software intervention during vector loads and stores if needed.

 

kernel void CrossFade( float ratio, size_t width, size_t height, global uchar* pInput0, size_t inputStride0, global uchar* pInput1, size_t inputStride1, global uchar* pOutput, size_t outputStride ) { size_t x = get_global_id(0); size_t y = get_global_id(1); uchar4 argb0; uchar4 argb1; uchar4 mixed; argb0 = vload4( x, pInput0+y*inputStride0 ); argb1 = vload4( x, pInput1+y*inputStride1 ); mixed = convert_uchar4_sat_rte( mix( convert_float4(argb0), convert_float4(argb1), ratio ) ); vstore4(mixed, x, pOutput+y*outputStride ); }

0 Likes
3 Replies

Illusio,
The byte addressable extension is not supported on the GPU, so any writes to a char* pointer are currently illegal.

Edit:
Our newer releases will detect this case and report a proper error message.
0 Likes

Thanks Micah,

I suppose there is a serious race condition there if it was implemented in software without any locks as well.  Didn't really think it through. Even in the best case the overhead of software merging would be fairly high in the case where the output vector actually is unaligned.

I suppose the spec really is clear too, now that I read it again. The two-line intro to 6.8 m) deflected my skimming, so I took the restriction to be less severe than it really is.

 

0 Likes

Illusio,
You can play around with the experimental byte features as mentioned in other threads by setting GPU_BYTE_ADDRESSABLE_STORE=1, but this can cause very bad behavior in some situations.
0 Likes