
Question about dot prod intrinsic
katayama Jun 17, 2011 2:47 AM (in response to bubu)
Hi bubu,
what's the result of dot(a,b)? 10 or 6?
It will be 10.
And what's more efficient?
Since a and b are both const, k is computed at compile time, so there is no difference in efficiency.
If either a or b is not const, the second one has fewer add operations, and it can also be written as
float k = dot(a.xyz, b.xyz);
Or consider using float3 (an OpenCL 1.1 feature).
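
For readers without the original question at hand, here is a minimal plain-C sketch of the two forms being compared. The vec4 struct and the names dot4 and dot3 are hypothetical stand-ins for OpenCL's float4, dot(a, b) and dot(a.xyz, b.xyz); they are not the OpenCL built-ins themselves. The sample values in the note below are an assumption chosen to reproduce the 10 versus 6 from the question, not bubu's actual data.

```c
/* Hypothetical stand-in for OpenCL's float4 (plain C has no built-in vector type). */
typedef struct { float x, y, z, w; } vec4;

/* Mirrors dot(a, b) on float4: all four components contribute
   (4 multiplies, 3 adds). */
float dot4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Mirrors dot(a.xyz, b.xyz): the w component is ignored
   (3 multiplies, 2 adds). */
float dot3(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With the assumed inputs a = (1, 2, 3, 4) and b = (1, 1, 1, 1), dot4 gives 10 and dot3 gives 6, which is why dot(a,b) on float4 returns 10.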

bubu Jun 17, 2011 2:43 PM (in response to katayama)
I've heard float3 is very inefficient.
#2 has fewer operations but, since the Radeon's SIMD loves float4 rather than scalar ops, I'm not completely sure...

katayama Jun 18, 2011 6:43 PM (in response to bubu)
Sorry, I misunderstood something in my previous post.
Since there is a 'DOT4' instruction, dot(float4, float4) can be executed in 1 cycle, and it occupies all of the XYZW pipelines.
On the other hand, dot(float3, float3) needs 3 cycles (MUL, MULADD, MULADD) but occupies only one pipeline. So multiple independent dot(float3, float3) calls can be executed together in 3 cycles (4x on 69xx, 5x on 68xx or older).
So the efficiency depends on your kernel code.
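
To make that trade-off concrete, here is a rough back-of-the-envelope sketch in plain C using only the cycle counts quoted above. The function names are hypothetical, and the model assumes the compiler packs independent float3 dot products perfectly into the available pipelines (lanes >= 1); it ignores memory access, latency and every other instruction in the kernel, so treat it as an illustration rather than a performance prediction.

```c
/* Cycles to issue n independent dot(float4, float4):
   one DOT4 per cycle, each one filling all XYZW pipelines. */
unsigned cycles_dot4(unsigned n) {
    return n;
}

/* Cycles to issue n independent dot(float3, float3) on a unit with
   `lanes` pipelines (4 on 69xx, 5 on 68xx or older, per the post above).
   Each dot is a 3-instruction chain (MUL, MULADD, MULADD) in a single
   pipeline, and up to `lanes` chains run side by side.
   Assumes lanes >= 1 and perfect packing. */
unsigned cycles_dot3(unsigned n, unsigned lanes) {
    unsigned groups = (n + lanes - 1) / lanes;  /* ceil(n / lanes) */
    return 3 * groups;
}
```

Under these assumptions a single float3 dot costs 3 cycles against 1 for DOT4, but four independent float3 dots on a 69xx cost 3 cycles against 4, so which form wins depends on how much independent work the kernel offers.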

