# Simulating a local array using shared memory, but no performance improvement?

Discussion created by codeboycjy on Jun 17, 2009

I'm currently building a kd-tree in Brook+, and here is the problem I've run into.
I need to pick the longest edge and split it at the median.

Here is the brute force way of doing it:

    if( splitAxis == 0 )
    {
        if( v1.x < splitPosition ) { ... }
        if( v2.x < splitPosition ) { ... }
        if( v3.x < splitPosition ) { ... }
    }
    else if( splitAxis == 1 )
    {
        if( v1.y < splitPosition ) { ... }
        if( v2.y < splitPosition ) { ... }
        if( v3.y < splitPosition ) { ... }
    }
    else if( splitAxis == 2 )
    {
        if( v1.z < splitPosition ) { ... }
        if( v2.z < splitPosition ) { ... }
        if( v3.z < splitPosition ) { ... }
    }

I assume there could be a lot of divergence in the above code.

If a float4 could be indexed like an array, so that
float4 data; data[0] = 1.0f;
is the same as
float4 data; data.x = 1.0f;

then the above code could be improved like this:
if( v1[splitAxis] < splitPosition ) { ... }
if( v2[splitAxis] < splitPosition ) { ... }
if( v3[splitAxis] < splitPosition ) { ... }
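As an aside, the effect of `v[splitAxis]` can also be approximated without any array indexing by multiplying each component with a 0/1 mask. The following is only a minimal sketch in plain C, not Brook+; `vec4`, `component` and `count_below` are hypothetical names introduced for illustration, and whether this beats the if/else chain on a given GPU has not been measured here.

```c
/* Hypothetical stand-in for a Brook+ float4; not a Brook+ type. */
typedef struct { float x, y, z, w; } vec4;

/* Return v.x, v.y or v.z for axis 0, 1 or 2 without an if/else chain.
   Each comparison evaluates to 0 or 1, so exactly one term survives.
   Assumes finite components: 0 * infinity would produce NaN. */
static float component(vec4 v, int axis)
{
    float mx = (float)(axis == 0);
    float my = (float)(axis == 1);
    float mz = (float)(axis == 2);
    return v.x * mx + v.y * my + v.z * mz;
}

/* Count how many of the three vertices lie below the split position. */
static int count_below(vec4 v1, vec4 v2, vec4 v3, int axis, float split)
{
    return (component(v1, axis) < split)
         + (component(v2, axis) < split)
         + (component(v3, axis) < split);
}
```

In a kernel the three mask values would be computed once per thread and reused for v1, v2 and v3, so the selection costs a few multiply-adds per vertex instead of a branch per axis.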

So I simulated this in an alternative way, using shared memory:

shared float4 lds;

lds[ 4 * instanceInGroup().x + 0 ] = float4( v1.x , v2.x , v3.x , 1.0f );
lds[ 4 * instanceInGroup().x + 1 ] = float4( v1.y , v2.y , v3.y , 1.0f );
lds[ 4 * instanceInGroup().x + 2 ] = float4( v1.z , v2.z , v3.z , 1.0f );

if( lds[ 4 * instanceInGroup().x + splitAxis ].x < splitPosition ) { ... }
if( lds[ 4 * instanceInGroup().x + splitAxis ].y < splitPosition ) { ... }
if( lds[ 4 * instanceInGroup().x + splitAxis ].z < splitPosition ) { ... }
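To make the layout above easier to follow, here is a small CPU-side emulation in plain C. It is a sketch under assumptions: `GROUP_SIZE`, `vec4`, `store_transposed` and `load_axis` are hypothetical names, the group size of 64 is a guess (the post does not state one), and `tid` stands in for `instanceInGroup().x`.

```c
/* Hypothetical stand-in for a Brook+ float4; not a Brook+ type. */
typedef struct { float x, y, z, w; } vec4;

enum { GROUP_SIZE = 64 };          /* assumed; not given in the post */

/* Emulates the shared array: 4 float4 slots per thread, slot 3 unused. */
static vec4 lds[4 * GROUP_SIZE];

/* Store the three vertices transposed: slot k holds component k of
   v1, v2 and v3, so one slot contains everything needed for one axis. */
static void store_transposed(int tid, vec4 v1, vec4 v2, vec4 v3)
{
    vec4 *slot = &lds[4 * tid];
    slot[0] = (vec4){ v1.x, v2.x, v3.x, 1.0f };
    slot[1] = (vec4){ v1.y, v2.y, v3.y, 1.0f };
    slot[2] = (vec4){ v1.z, v2.z, v3.z, 1.0f };
}

/* One dynamically indexed read replaces the three-way branch on the axis. */
static vec4 load_axis(int tid, int axis)
{
    return lds[4 * tid + axis];
}
```

Reading the slot once into a local `vec4` and then testing `.x`, `.y` and `.z` avoids repeating the `4 * tid + axis` address computation for each comparison.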

I expected the new code to be a big improvement, but when I compile it in the Kernel Analyzer, the performance is worse than the old version. I don't get it.
The bottleneck of the new code is ALU ops, yet the old code actually has three times as many ALU ops as the new one. Why is there no performance improvement?