0 Replies Latest reply on Jun 17, 2009 6:41 AM by codeboycjy

    Simulate local array using shared memory, but no performance improvement?

    codeboycjy

      Currently I'm building a kd-tree on Brook+, and here is the problem I've encountered.
      I need to pick the edge with the longest length and split at the median.

      Here is the brute force way of doing it:

      if( splitAxis == 0 )
      {
         if( v1.x < splitPosition ) { ... }
         if( v2.x < splitPosition ) { ... }
         if( v3.x < splitPosition ) { ... }
      }else if( splitAxis == 1 )
      {
         if( v1.y < splitPosition ) { ... }
         if( v2.y < splitPosition ) { ... }
         if( v3.y < splitPosition ) { ... }
      }else if( splitAxis == 2 )
      {
         if( v1.z < splitPosition ) { ... }
         if( v2.z < splitPosition ) { ... }
         if( v3.z < splitPosition ) { ... }
      }
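      To make the pattern concrete, here is a minimal host-side C model of the branchy selection above. The `{ ... }` bodies are elided in the original post, so as a hypothetical stand-in this version just counts how many of the three vertices lie below the split plane:

```c
#include <assert.h>

/* Minimal C model of the branchy kernel above. The original { ... }
 * bodies are elided in the post; as a hypothetical stand-in this
 * counts how many of the three vertices lie below the split plane. */
typedef struct { float x, y, z, w; } float4;

int count_below_branchy(float4 v1, float4 v2, float4 v3,
                        int splitAxis, float splitPosition)
{
    int count = 0;
    if (splitAxis == 0) {
        if (v1.x < splitPosition) count++;
        if (v2.x < splitPosition) count++;
        if (v3.x < splitPosition) count++;
    } else if (splitAxis == 1) {
        if (v1.y < splitPosition) count++;
        if (v2.y < splitPosition) count++;
        if (v3.y < splitPosition) count++;
    } else if (splitAxis == 2) {
        if (v1.z < splitPosition) count++;
        if (v2.z < splitPosition) count++;
        if (v3.z < splitPosition) count++;
    }
    return count;
}
```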

      I assume there could be a lot of branch divergence in the code above.

      If a float4 could be accessed like an array, so that
        float4 data; data[0] = 1.0f;
      were the same as
        float4 data; data.x = 1.0f;
      then the code above could be simplified to:

      if( v1[splitAxis] < splitPosition ) { ... }
      if( v2[splitAxis] < splitPosition ) { ... }
      if( v3[splitAxis] < splitPosition ) { ... }
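      The equivalence I'm after can be illustrated on the host side in plain C with a union (this is just an illustration, not Brook+ syntax; `float4u` and `component` are names I made up):

```c
#include <assert.h>

/* Hypothetical host-side illustration: a union lets the same float4
 * be read either by named component or by dynamic index, which is
 * exactly the access pattern the kernel would need. */
typedef union {
    struct { float x, y, z, w; } c;
    float v[4];
} float4u;

float component(float4u p, int axis)
{
    return p.v[axis];  /* p.v[0] aliases p.c.x, p.v[1] aliases p.c.y, ... */
}
```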

      Since Brook+ doesn't support that, I simulated it an alternative way, using shared memory:

      shared float4 lds[256];

      // Entry (4*tid + axis) holds the axis-th component of v1, v2, v3
      // in its x, y, z lanes, so one dynamically indexed LDS read can
      // replace the three-way branch on splitAxis.
      lds[ 4 * instanceInGroup().x + 0 ] = float4( v1.x , v2.x , v3.x , 1.0f );
      lds[ 4 * instanceInGroup().x + 1 ] = float4( v1.y , v2.y , v3.y , 1.0f );
      lds[ 4 * instanceInGroup().x + 2 ] = float4( v1.z , v2.z , v3.z , 1.0f );

      if( lds[ 4 * instanceInGroup().x + splitAxis ].x < splitPosition ) { ... }
      if( lds[ 4 * instanceInGroup().x + splitAxis ].y < splitPosition ) { ... }
      if( lds[ 4 * instanceInGroup().x + splitAxis ].z < splitPosition ) { ... }
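      Here is a host-side C model of that LDS layout, to check the indexing is right (assumption: 64 threads per group, 4 float4 slots per thread, hence the 256 entries declared above; the `{ ... }` bodies are again replaced by a hypothetical below-plane count):

```c
#include <assert.h>

/* Host-side model of the LDS staging above. stage_vertices plays the
 * role of one thread's writes; count_below_lds does the single
 * dynamically indexed read that replaces the branch on splitAxis. */
typedef struct { float x, y, z, w; } float4;

static float4 lds[256];  /* 64 threads * 4 slots, as declared above */

void stage_vertices(int tid, float4 v1, float4 v2, float4 v3)
{
    /* Slot (4*tid + axis) packs the axis-th component of each vertex. */
    lds[4 * tid + 0] = (float4){ v1.x, v2.x, v3.x, 1.0f };
    lds[4 * tid + 1] = (float4){ v1.y, v2.y, v3.y, 1.0f };
    lds[4 * tid + 2] = (float4){ v1.z, v2.z, v3.z, 1.0f };
}

int count_below_lds(int tid, int splitAxis, float splitPosition)
{
    float4 s = lds[4 * tid + splitAxis];
    return (s.x < splitPosition) + (s.y < splitPosition)
         + (s.z < splitPosition);
}
```

The model confirms the layout recovers the right components; whether it is faster on the GPU is exactly the open question.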

      I expected the current code to be a clear improvement, but when I compiled it in the Kernel Analyzer, the performance was worse than the old one... I don't get it. The bottleneck of the current code is ALU ops, yet the old code has roughly three times as many ALU ops as the current one. Why is there no performance improvement?
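      For comparison, one branchless alternative I've been considering (my own idea, not from the code above) is to select the component with a one-hot mask and a dot product, which costs a few ALU ops but needs no branch and no LDS traffic at all. A host-side C sketch:

```c
#include <assert.h>

/* Hypothetical branchless selection: multiply by a one-hot axis mask
 * and sum. On a GPU the comparisons below would typically compile to
 * selects rather than branches. */
typedef struct { float x, y, z, w; } float4;

float select_axis(float4 v, int splitAxis)
{
    float4 mask = { splitAxis == 0 ? 1.0f : 0.0f,
                    splitAxis == 1 ? 1.0f : 0.0f,
                    splitAxis == 2 ? 1.0f : 0.0f,
                    0.0f };
    /* dot(v, mask) picks out exactly the splitAxis component */
    return v.x * mask.x + v.y * mask.y + v.z * mask.z + v.w * mask.w;
}
```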