Again, this concerns a memory-transpose kernel I am working on. I noticed that with the uchar16 data type the compiler generates 4 read and 4 write instructions to transfer one element (dest[idx] = src[idx2]), whereas declaring the pointers as float4 generates only one read and one write instruction for the same amount of data.
What prevents the compiler from doing the same operation for the uchar16 data type?
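For reference, a minimal sketch of the two variants I am comparing (kernel and argument names are made up for illustration; both move 16 bytes per work-item):

```
// uchar16 version: the compiler reportedly splits this single
// 16-byte copy into 4 loads and 4 stores.
__kernel void copy_uchar16(__global const uchar16 *src,
                           __global uchar16 *dst)
{
    size_t idx = get_global_id(0);
    dst[idx] = src[idx];   // 4 reads + 4 writes observed
}

// float4 version: same 16 bytes per work-item, but typed as float4,
// which compiles to one vector load and one vector store.
__kernel void copy_float4(__global const float4 *src,
                          __global float4 *dst)
{
    size_t idx = get_global_id(0);
    dst[idx] = src[idx];   // 1 read + 1 write observed
}
```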
Originally posted by: landmann
I realized that when using the uchar16 data type the compiler generates 4 read and 4 write instructions ...
May I ask which tool you use to get this information (num of reads/writes) ?
I am using Stream KernelAnalyzer 1.7. Although its numbers are sometimes questionable, I hope that at least the disassembly view is correct.
I am not very sure about this, and it would be nice to hear from others.
My feeling is that a processing element cannot process more than one vector component at a time. With float4 we can process four floats on four general-purpose processing elements, but with uchar16 only four uchars are processed at a time, so it should take about 4x as long.
Sure, but my question is "why" should I have to resort to these nasty tricks at all? My kernel does not even evaluate the memory contents; I simply started out with the native data type. Now that I am using float4 it looks much better.
I was looking for an explanation, to check what I did wrong, or, of course, hoping to read "will be fixed in 2.4".
Originally posted by: jeff_golds
If you feel you are input-bound, you could try something like:
as_uchar16(((uint4*)a)[idx]) in place of a[idx].
Jeff
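In context, Jeff's suggestion might look like this (a hypothetical kernel; as_uchar16 is a bitwise reinterpretation, not a value conversion, so no conversion instructions are involved):

```
__kernel void copy_cast(__global const uchar16 *a,
                        __global uchar16 *b)
{
    size_t idx = get_global_id(0);
    // Read 16 bytes through a uint4-typed pointer (one vector load),
    // then reinterpret the bits as uchar16 before storing.
    b[idx] = as_uchar16(((__global const uint4 *)a)[idx]);
}
```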
If I did something like that, how much overhead would it add?
Or a more general question: how much overhead does type casting have?
E.g. something like
int a = 13;
float b = (float) a;
Originally posted by: nou
But when I load a uchar4, why can't it be loaded as an int and then split into four registers?
It does that already, right? That's why uchar16 takes 4 loads.