Again, this concerns a memory-transpose kernel I am working on. I noticed that with the uchar16 data type the compiler generates 4 read and 4 write instructions to transfer one element (dest[idx] = src[idx2]), whereas declaring the pointers as float4 generates only one read and one write instruction for the same amount of data.
What prevents the compiler from doing the same operation for the uchar16 data type?
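For reference, a minimal sketch of the two variants I am comparing (kernel and argument names are made up for illustration; both move 16 bytes per work-item):

```
// uchar16 version: the compiler reportedly splits this single
// 16-byte copy into 4 loads and 4 stores.
__kernel void copy_uchar16(__global const uchar16 *src,
                           __global uchar16 *dst)
{
    size_t idx = get_global_id(0);
    dst[idx] = src[idx];   // 4 reads + 4 writes observed
}

// float4 version: same 16 bytes per work-item, but typed as float4,
// which compiles to one vector load and one vector store.
__kernel void copy_float4(__global const float4 *src,
                          __global float4 *dst)
{
    size_t idx = get_global_id(0);
    dst[idx] = src[idx];   // 1 read + 1 write observed
}
```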
Originally posted by: landmann
I realized that when using the uchar16 data type the compiler generates 4 read and 4 write instructions ...
May I ask which tool you use to get this information (num of reads/writes) ?
I am using Stream KernelAnalyzer 1.7. Although its numbers are sometimes questionable, I hope that at least the disassembly view is correct.
I am not very sure about this, and it would be nice to hear from others.
My feeling is that a processing element cannot process more than one vector component at a time. With float4 we can process four floats on four general-purpose processing elements, but with uchar16 only four uchars are processed at a time, so it should take about 4x as long.
Sure, but my question is "why" should I have to resort to these nasty tricks at all? My kernel does not even evaluate the memory contents; I simply started out with the native data type. Now that I am using float4 it looks much better.
I was looking for an explanation, to check what I did wrong, or, of course, hoping to read "will be fixed in 2.4".
Originally posted by: jeff_golds
If you feel you are input-bound, you could try something like:
as_uchar16(((uint4*)a)[idx]) in place of a[idx].
Jeff
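In context, Jeff's suggestion might look like this (a hypothetical kernel; as_uchar16 is a bitwise reinterpretation, not a value conversion, so no conversion instructions are involved):

```
__kernel void copy_cast(__global const uchar16 *a,
                        __global uchar16 *b)
{
    size_t idx = get_global_id(0);
    // Read 16 bytes through a uint4-typed pointer (one vector load),
    // then reinterpret the bits as uchar16 before storing.
    b[idx] = as_uchar16(((__global const uint4 *)a)[idx]);
}
```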
If I did something like that, how much overhead would it add?
Or a more general question: how much overhead does type casting have?
E.g. something like
int a = 13;
float b = (float) a;
Originally posted by: nou
But when I load a uchar4, why can't it be loaded as an int and then split into four registers?
It does that already, right? That's why uchar16 takes 4 loads.