12 Replies Latest reply on Feb 4, 2011 8:08 PM by MicahVillmow

    uchar16 vs. float4

    landmann

      Hi,

      Once again I am working on a kind of memory transpose kernel. I noticed that with the uchar16 data type the compiler generates 4 read and 4 write instructions to transfer one element (dest[idx] = src[idx2]), whereas declaring the pointers as float4 generates only one read and one write instruction to transfer the same amount of data.

      What prevents the compiler from doing the same operation for the uchar16 data type?

      Thanks!

      Joerg

        • uchar16 vs. float4
          FrodoTheGiant

           

          Originally posted by: landmann: I realized that when using the uchar16 data type the compiler generates 4 read and 4 write instructions ...


           

          May I ask which tool you use to get this information (number of reads/writes)?

            • uchar16 vs. float4
              landmann

              I am using Stream Kernel Analyzer 1.7. Although its numbers are not always trustworthy, I hope that at least the disassembly view is correct.

                • uchar16 vs. float4
                  himanshu.gautam

                  landmann,

                  I am not very sure about this, and it would be nice to hear from others.

                  My feeling is that a processing element cannot process more than one vector element at a time. With float4 we can process four floats on 4 general-purpose processing elements, but with uchar16 only four uchars are processed at a time. So it should take about 4x the time.

                   

                    • uchar16 vs. float4
                      jeff_golds

                      If you feel you are input-bound, you could try something like:

                      as_uchar16(((uint4*)a)[idx]) in place of a[idx].

                      Jeff

                        • uchar16 vs. float4
                          landmann

                          Sure, but my question is why I should have to do these nasty tricks at all. My kernel does not even evaluate the memory content; I simply started out with the native data type. Now that I am using float4 it looks much better.

                          I was looking for an explanation, to check what I did wrong or, of course, hoping to read "will be fixed in 2.4".

                          • uchar16 vs. float4
                            FrodoTheGiant

                             

                            Originally posted by: jeff_golds: If you feel you are input-bound, you could try something like:

                            as_uchar16(((uint4*)a)[idx]) in place of a[idx]. Jeff

                             

                             

                            If I did something like that, how much overhead would it add?

                            Or, more generally: how much overhead does type casting have?

                             

                            E.g. something like

                            int a = 13;

                            float b = (float) a;

                    • uchar16 vs. float4
                      MicahVillmow
                      Our hardware does not support uchar16 natively, so we emulate it with integers. The largest integer type we support natively is vec4, so the uchar16 gets broken down into vec4s, which is why you see 4x as many loads. This will not be fixed in 2.4.
                      • uchar16 vs. float4
                        MicahVillmow
                        FrodoTheGiant,
                        There is a difference between type casting and bit casting.

                        as_uchar16 is a bitcast, and its overhead is the unpacking of the char types from the uint4.
                        Type casting follows the OpenCL conversion rules and in some cases can be fairly expensive. Type casting of pointers has no overhead.

                        In the case of the as_uchar16 bitcast, you are explicitly doing what the compiler does implicitly. The only difference between the two code snippets is that a load through a uint4* is done in a single load, whereas a load of a uchar16 is done with 4 loads. Both approaches require unpacking the data into 32-bit registers.
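
To make the unpacking step concrete, here is a minimal host-side sketch in plain C rather than OpenCL C. The types `uint4_t` and `uchar16_t` and the functions `as_uchar16_model` and `unpack_byte` are hypothetical stand-ins invented for this illustration; they only model the idea of reading 128 bits as four 32-bit words and then extracting each byte with a shift and a mask. They say nothing about the instructions the AMD compiler actually emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the OpenCL vector types discussed above. */
typedef struct { uint32_t s[4]; } uint4_t;    /* models uint4:   4 x 32 bits */
typedef struct { uint8_t  s[16]; } uchar16_t; /* models uchar16: 16 x 8 bits */

/* Models as_uchar16(): the same 128 bits are reinterpreted, and no value
   is converted. Which byte lands in which slot depends on host endianness. */
uchar16_t as_uchar16_model(uint4_t v)
{
    uchar16_t out;
    assert(sizeof out == sizeof v);
    memcpy(&out, &v, sizeof out);
    return out;
}

/* Models the unpacking step: extract byte i (0..15) from the packed words
   with a shift and a mask, widening it to a full 32-bit value. Byte 0 is
   taken to be the least significant byte of word 0 (an assumption of this
   sketch). */
uint32_t unpack_byte(uint4_t v, unsigned i)
{
    return (v.s[i / 4] >> (8u * (i % 4))) & 0xFFu;
}
```

Calling `unpack_byte(v, i)` for i = 0..15 yields the sixteen components one at a time, each occupying a 32-bit value, which mirrors the point that both load variants still need the data unpacked into 32-bit registers.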
                        • uchar16 vs. float4
                          MicahVillmow
                          A bitcast is free; a typecast costs 1 instruction per component.
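
The difference between the two kinds of cast can be sketched in plain C on the host. The function names `convert_int_to_float` and `bitcast_int_to_float` are hypothetical and chosen only for this illustration; the second one models what OpenCL's `as_float()` does and assumes a 32-bit IEEE 754 `float`. The instruction costs quoted in the thread apply to the AMD GPU compiler and are not measured here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Type cast: converts the value, so the integer 13 becomes 13.0f, which
   has a different bit pattern than the integer it came from. */
float convert_int_to_float(int32_t a)
{
    return (float)a;
}

/* Bit cast: the 32 bits stay unchanged and only their interpretation
   changes. Assumes float is 32 bits wide (checked by the assert). */
float bitcast_int_to_float(int32_t a)
{
    float f;
    assert(sizeof f == sizeof a);
    memcpy(&f, &a, sizeof f);
    return f;
}
```

For instance, `convert_int_to_float(13)` gives 13.0f, while `bitcast_int_to_float(0x3F800000)` gives 1.0f because 0x3F800000 is the IEEE 754 encoding of 1.0. Applying the bit cast to 13 does not give 13.0f.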