ryta1203

Curious Case

Discussion created by ryta1203 on Feb 24, 2009

I have a kernel that runs significantly faster if this part of the kernel:

 

//g0to3.w=g0to3.w-0.0f*(g0to3.w-g8.z);

g0to3.x=g0to3.x-1.64f*(g0to3.x-(-2.0f*g8.z+3.0f*(tmp0to3.w+tmp0to3.x)));

g0to3.y=g0to3.y-1.54f*(g0to3.y-(g8.z-3.0f*(tmp0to3.w+tmp0to3.x)));

//g0to3.z=g0to3.z-0.0f*(g0to3.z-g8.x);

g4to7.w=g4to7.w-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.w+g8.x);

//g4to7.x=g4to7.x-0.0f*(g4to7.x-g8.y);

g4to7.y=g4to7.y-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.y+g8.y);

g4to7.z=g4to7.z-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g4to7.z-(tmp0to3.w-tmp0to3.x));

g8.w=g8.w-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g8.w-g8.x*g8.y);

 

is changed to this:

 

g0to3.w=g0to3.w-ss[0]*(g0to3.w-g8.z);

g0to3.x=g0to3.x-1.64f*(g0to3.x-(-2.0f*g8.z+3.0f*(tmp0to3.w+tmp0to3.x)));

g0to3.y=g0to3.y-1.54f*(g0to3.y-(g8.z-3.0f*(tmp0to3.w+tmp0to3.x)));

//g0to3.z=g0to3.z-0.0f*(g0to3.z-g8.x);

g4to7.w=g4to7.w-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.w+g8.x);

//g4to7.x=g4to7.x-0.0f*(g4to7.x-g8.y);

g4to7.y=g4to7.y-(8.0f*(2.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f)))/(8.0f-(1.0f/(3.0f*(1.0f/6.0f)+.5f))))*(g4to7.y+g8.y);

g4to7.z=g4to7.z-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g4to7.z-(tmp0to3.w-tmp0to3.x));

g8.w=g8.w-(1.0f/(3.0f*(1.0f/6.0f)+.5f))*(g8.w-g8.x*g8.y);

 

Notice that only the first line is changed: instead of using a constant 0.0f or commenting out the line altogether (both of these run at the same speed), I use a gather array (size 9) and just index its first element. The version using the gather array runs much faster. Could this be caused by the cache?
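As a side check on the long constant expressions that appear unchanged in both snippets, here is a minimal sketch in plain C (not the kernel language; the function names `relax_factor` and `derived_factor` are made up for illustration). It only evaluates the literal sub-expressions, which work out to about 1.0 and 8/7, and says nothing about the timing difference itself.

```c
#include <stdio.h>

/* 1.0f/(3.0f*(1.0f/6.0f)+.5f): the factor applied to g4to7.z and g8.w.
   Hypothetical helper name; the expression is copied from the kernel. */
float relax_factor(void) {
    return 1.0f / (3.0f * (1.0f / 6.0f) + 0.5f);
}

/* 8.0f*(2.0f-s)/(8.0f-s) with s as above: the factor applied to
   g4to7.w and g4to7.y. Hypothetical helper name. */
float derived_factor(void) {
    float s = relax_factor();
    return 8.0f * (2.0f - s) / (8.0f - s);
}

/* Print both folded values so they can be compared with the literals. */
void print_factors(void) {
    printf("relax_factor   = %.7f\n", relax_factor());
    printf("derived_factor = %.7f\n", derived_factor());
}
```

Calling `print_factors()` from a small `main` shows the two values. Since both are built purely from literals, they are the same in the slow and fast versions, which leaves the first line as the only difference between them.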




