0 Replies Latest reply on Feb 24, 2009 5:42 PM by ryta1203

    Curious Case

    ryta1203

      I have a kernel that runs significantly faster if this part of the kernel:

       

      //g0to3.w=g0to3.w-0.0f*(g0to3.w-g8.z);

      g0to3.x=g0to3.x-1.64f*(g0to3.x-(-2.0f*g8.z+3.0f*(tmp0to3.w+tmp0to3.x)));

      g0to3.y=g0to3.y-1.54f*(g0to3.y-(g8.z-3.0f*(tmp0to3.w+tmp0to3.x)));

      //g0to3.z=g0to3.z-0.0f*(g0to3.z-g8.x);

      g4to7.w=g4to7.w-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.w+g8.x);

      //g4to7.x=g4to7.x-0.0f*(g4to7.x-g8.y);

      g4to7.y=g4to7.y-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.y+g8.y);

      g4to7.z=g4to7.z-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g4to7.z-(tmp0to3.w-tmp0to3.x));

      g8.w=g8.w-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g8.w-g8.x*g8.y);

       

      Is changed to this:

       

      g0to3.w=g0to3.w-ss[0]*(g0to3.w-g8.z);

      g0to3.x=g0to3.x-1.64f*(g0to3.x-(-2.0f*g8.z+3.0f*(tmp0to3.w+tmp0to3.x)));

      g0to3.y=g0to3.y-1.54f*(g0to3.y-(g8.z-3.0f*(tmp0to3.w+tmp0to3.x)));

      //g0to3.z=g0to3.z-0.0f*(g0to3.z-g8.x);

      g4to7.w=g4to7.w-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.w+g8.x);

      //g4to7.x=g4to7.x-0.0f*(g4to7.x-g8.y);

      g4to7.y=g4to7.y-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.y+g8.y);

      g4to7.z=g4to7.z-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g4to7.z-(tmp0to3.w-tmp0to3.x));

      g8.w=g8.w-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g8.w-g8.x*g8.y);

       

      Note that only the first line changes. Instead of using a constant 0.0f or commenting the line out altogether (both of which run at the same speed), I read the coefficient from a gather array (size 9) and index its first element. The version that uses the gather array runs much faster. Could this be caused by the cache?
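
      To make the difference between the two variants concrete, here is a minimal sketch in plain C rather than the kernel language. The function names and the scalar form are hypothetical stand-ins for a single component of the update. It only illustrates that a literal 0.0f coefficient leaves the value unchanged for finite inputs, which would be consistent with the 0.0f and commented-out versions timing the same, while ss[0] is only known at run time. It does not show why the gather version is faster.

```c
#include <assert.h>

/* Hypothetical scalar stand-ins for one component of the kernel update
   g = g - s*(g - target); function names are illustrative only. */

/* Coefficient is the literal 0.0f: for finite inputs the result equals g,
   so a compiler may be able to drop the statement (an assumption here). */
static float relax_literal_zero(float g, float target)
{
    return g - 0.0f * (g - target);
}

/* Coefficient is read from an array at run time, as with ss[0]. */
static float relax_gathered(float g, float target, const float *ss)
{
    return g - ss[0] * (g - target);
}

/* Small self-check of both variants on finite inputs. */
static void check_variants(void)
{
    float ss[9] = {0.0f};
    assert(relax_literal_zero(2.5f, 7.0f) == 2.5f);
    assert(relax_gathered(2.5f, 7.0f, ss) == 2.5f);
    ss[0] = 1.0f;
    assert(relax_gathered(2.5f, 7.0f, ss) == 7.0f);
}
```

      Calling check_variants() from a host program exercises both forms. Comparing the generated instructions for the two kernel versions would be one way to see whether the statement is actually removed in the 0.0f case.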