Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Adept II

DP kernel vs FP one - strange results

There are 2 kernels that do essentioally the same (DP does it more accurate).

BUT on HD6950 that DP-capable DP kernel tokes more time (and its ISA much bigger) than DP-emulation one.

Any hints why so and what can be changed to speedup DP kernel on compatible hardware?

And another question:

ISA for DP emulation is different when double precision extention is enabled or not enabled (if DP not enabled ISA shorter and the kernel is faster !!!!).

Why single-precision kernel "feels" DP mode and why compiler makes it worse if DP allowed?

DP emulation:

__kernel __attribute__((vec_type_hint(float4)))

void CalcChirpData_kernel2_cl(const float4 chirp_rate, __global float4* cx_DataArray,

                                                                                             __global float4* cx_ChirpDataArray,__global float4* sample_rate) { 

          uint i= get_global_id(0);  

          float4 cx = cx_DataArray;

          float4 time = sample_rate;

          float sin1,sin2, cos1,cos2;

          float4 chirped;

          float tmp;

          float a = chirp_rate.x;

          float b = time.x;

          float a1= chirp_rate.z;

          float b1= time.z;

          float fa = chirp_rate.y;

          float fb = time.y;    

          float fa1 = chirp_rate.w;

          float fb1 = time.w;    

          float ang  = fract(2 * a * fract(b * fb,&tmp),&tmp) + fract(a * fb * fb,&tmp) +

                    fract(b * fract(b * fa,&tmp),&tmp) + fract(2 * b * fa * fb,&tmp) + fa * fb * fb;

          ang -= floor(ang);

          ang *= M_2PI;




          chirped.x = cx.x * cos1 - cx.y * sin1;

          chirped.y = cx.x * sin1 + cx.y * cos1;

          ang  = fract(2 * a1 * fract(b1 * fb1,&tmp),&tmp) + fract(a1 * fb1 * fb1,&tmp) +

                    fract(b1 * fract(b1 * fa1,&tmp),&tmp) + fract(2 * b1 * fa1 * fb1,&tmp) + fa1 * fb1 * fb1;

          ang -= floor(ang);

          ang *= M_2PI;




          chirped.z = cx.z * cos2 - cx.w * sin2;

          chirped.w = cx.z * sin2 + cx.w * cos2;

                    cx_ChirpDataArray = chirped;


DP kernel:


void CalcChirpData_kernel_dp_cl(const double chirp_rate, __global float4* cx_DataArray, __global float4* cx_ChirpDataArray,

                                                                                 const double recip_sample_rate)

                    uint i= get_global_id(0);

                     float4 cx=cx_DataArray;

                    float4 chirp;

                    float2 c, d;

                    float2 angf;

                    double2 time;

                    time.x = (2*i)*recip_sample_rate;

                    time.y= (2*i+1)*recip_sample_rate;

              double2 ang  = chirp_rate*(time*time);      

                    ang  = ang-floor(ang);

                    angf.x =(float)ang.x;

                    angf.y =(float)ang.y;



                              // real = cx.x * c - cx.y * d;

                    chirp.xz = cx.xz*c.xy-cx.yw*d.xy;

                              // imag = cx.x * d + cx.y * c;

                    chirp.yw = cx.xz*d.xy+cx.yw*c.xy;

                    cx_ChirpDataArray = chirp;


5 Replies
Adept III

The emulated one takes advantage of native_sin and native_cos. These are much faster (using SIN and COS instructions) than a fully accurate, double precision sincos in the double kernel. The accurate sincos uses a huge number of instructions. If you look at the ISA it's about 4x longer for the DP kernel. If you create a kernel that just reads numbers and uses the sincos, it looks like it makes up about 85% of the instructions in that kernel.


Though it was single precision sincos it responsible for many instructions indeed.

I replaced sincos with native sin and cos - now DP kernel consists of ~same 5 clauses only.And its execution time much closer to emulation one though still slightly longer... (accordingly KA).

But another questions arise:

   49  x: COS         R7.x,  R5.z     

         y: COS         ____,  R5.z     

         z: COS         ____,  R5.z     

     50  x: SIN         ____,  R8.x     

         y: SIN         ____,  R8.x     

         z: SIN         R5.z,  R8.x     

         w: MUL_e       R0.w,  PV49.x,  R9.z     

     51  x: SIN         ____,  R4.w     

         y: SIN         R8.y,  R4.w     

         z: SIN         ____,  R4.w     

         w: MUL_e       R4.w,  R9.x,  PV50.z     

     52  x: COS         ____,  R2.w     

         y: COS         ____,  R2.w     

         z: COS         ____,  R2.w     

         w: MUL_e       ____,  R9.z,  PV51.y     

Why 3 instruction slots busy with sin and cos? Is it required ?

And another question. In disassembly i see many conditional instructions though this code should be pure computational w/o any branching. Perhaps this comes from usage of floor() that tries to account for some edge cases like NaN, infinity and so on... Can it be so?

And if yes, what should I use instead? I see FRACT instruction in code - should smth like fract() be used instead of a-floor(a) ?

EDIT: commenting out string ang=ang-floor(ang); shortens kernel time by 2 fold .

How to replace floor() with something closer to hardware ? It's known that input data always ordinary numbers moreover, ang has always positive value.

What language construct can be used to simplify kernel in this case?


on 5xxx and earliers there was special unit t which computed cos, sin etc. so you always saw COS,SIN instructions in t: slot. on 6xxx card with 4D VLIW architecture all units are same. they compute sin,cos and DP operations together.


ermmm...probably completely off, but you could try these two methods:

1) use double->integer conversion routine (if supported )

double floor_emu(double val) ( convert_double( convert_long_rtn(val) ) );

or, if your code permits negative values to be floored towards zero, then you could try something this:

2) extracts the exponent, constructs a bitmask from this that is applied to the fraction to steadily truncate the fraction, simulating floor() for positive numbers and integer round towards zero for negative numbers. e.g. (UNTESTED, more pseudocode):

double RoundTowardsZero(double val)

   long bitmaskshift =  min(   max(1075 - as_long((as_ulong(val) >> 52) & 2047), 0)    ,    52);

   return as_double( as_ulong(val) & (0xFFFFFFFFFFFFFFFF << bitmaskshift) );

not sure any of this would help though...



I tried

ang  = ang-trunc(ang);


trunc() variant was faster accordingly KA