3 Replies Latest reply on Jun 18, 2011 6:43 PM by katayama

    Question about dot prod intrinsic

    bubu

      Does the "dot" intrinsic take into consideration the .w component of a float4 register?

       

      For example,

       

      const float4 a = (float4)(1.0f,2.0f,3.0f,4.0f);

      const float4 b = (float4)(1.0f,1.0f,1.0f,1.0f);

      what's the result of dot(a,b)? 10 or 6?

       

      And what's more efficient?

      1. const float k = dot(a,b)

      or

      2. const float k = a.x*b.x + a.y*b.y + a.z*b.z

       

      thx

        • Question about dot prod intrinsic
          katayama

          Hi bubu,

           

          what's the result of dot(a,b)? 10 or 6?


           It will be 10.
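To make the arithmetic concrete, here is a minimal host-side C sketch of both forms from the question. The `vec4` struct and the `dot4`/`dot3` helpers are hypothetical stand-ins for OpenCL's `float4` and `dot`, not the real built-ins; they only mirror the math, not the hardware behaviour.

```c
/* Hypothetical host-side stand-in for OpenCL's float4; not the real type. */
typedef struct { float x, y, z, w; } vec4;

/* Four-component dot product: the sum that dot(float4, float4) computes,
   including the .w term. */
static float dot4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Three-component variant that ignores .w, like form 2 in the question. */
static float dot3(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With a = (1, 2, 3, 4) and b = (1, 1, 1, 1), `dot4` gives 1 + 2 + 3 + 4 = 10, while `dot3` drops the .w term and gives 6.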

           

          And what's more efficient?


          Since a and b are both const, k is computed at compile time, so there is no difference in efficiency.

          If either a or b is not const, the second form has one fewer add operation. It can also be written as

          float k = dot(a.xyz, b.xyz);

          or consider using float3 (an OpenCL 1.1 feature).

            • Question about dot prod intrinsic
              bubu

              I've heard float3 is very inefficient.

              #2 has fewer operations, but since the Radeon's SIMD favors float4 over scalar ops, I'm not completely sure...

               

                • Question about dot prod intrinsic
                  katayama

                  Sorry, I misunderstood something in my previous post.

                  Since there is a 'DOT4' instruction, dot(float4, float4) can be executed in 1 cycle, but it occupies all of the XYZW pipelines.

                  On the other hand, dot(float3, float3) needs 3 cycles (MUL, MULADD, MULADD) but occupies only one pipeline. So multiple dot(float3, float3) operations can be executed side by side in those 3 cycles (4x on 69xx, 5x on 68xx or older).

                  So the efficiency depends on your kernel code.
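The MUL, MULADD, MULADD sequence for dot(float3, float3) can be sketched in plain C as below. This is only an illustration under stated assumptions: `vec3`, `LANES` and `dot3_batch` are hypothetical names, the lane count of 4 is taken from the "4x on 69xx" figure in the post, and the loops emulate on the host what the post describes as independent pipelines. It is not actual ISA output.

```c
/* Hypothetical host-side stand-in for OpenCL's float3; not the real type. */
typedef struct { float x, y, z; } vec3;

/* Assumed number of independent pipelines, per the "4x on 69xx" figure. */
#define LANES 4

/* Computes LANES independent 3-component dot products in three steps.
   Each step applies one operation to every lane, mirroring the
   MUL, MULADD, MULADD sequence from the post. */
static void dot3_batch(const vec3 a[LANES], const vec3 b[LANES], float out[LANES]) {
    int i;
    for (i = 0; i < LANES; ++i) out[i] = a[i].x * b[i].x;           /* step 1: MUL    */
    for (i = 0; i < LANES; ++i) out[i] = a[i].y * b[i].y + out[i];  /* step 2: MULADD */
    for (i = 0; i < LANES; ++i) out[i] = a[i].z * b[i].z + out[i];  /* step 3: MULADD */
}
```

Each lane carries its own accumulator, so the three steps produce four results at once. A kernel that has several independent float3 dot products available can fill the pipelines this way, whereas a single isolated dot product cannot.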