Archives Discussions

yours3lf · ‎02-06-2012

Hi,

I tried to get the culling working today, but I ran into the same freezing issue, but this time everything seems to be working the scene goes on with ~70fps, but after a few seconds it freezes (as opposed to immediate freezing...)

I'm not sure if it's related to the same problem because all the barriers must be hit by every work-item, since I used the method MicahVillmow showed me here: http://devgurus.amd.com/thread/158555

So this is how my kernel looks like:

__constant float far = -10000.0f; //far plane distance
__constant float near = -1.0f; //near plane distance
__constant sampler_t the_sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
__constant float cutoff = 0.25f; //0.005f
__constant int attenuation_type = 0; //linear or full attenuation?
float my_abs( float var ) //these floating point operations aren't supported in opencl 1.1
{
  if ( var < 0 )
  {
    return -var;
  }
  else
  {
    return var;
  }
}
float my_mix( float x, float y, float weigth )
{
  return x * ( 1.0f - weigth ) + y * weigth;
}
float3 my_reflect( float3 incident, float3 normal )
{
  return incident - 2.0f * dot( normal, incident ) * normal;
}
float3 decode_normals_spheremap( float4 n ) //decode normals from spheremap encoding
{
  float4 nn = n * ( float4 )( 2.0f, 2.0f, 0.0f, 0.0f ) + ( float4 )( -1.0f, -1.0f, 1.0f, -1.0f );
  float l = dot( nn.xyz, -nn.xyw );
  nn.z = l;
  nn.xy *= sqrt( l );
  return nn.xyz * 2.0f + ( float3 )( 0.0f, 0.0f, -1.0f );
}
float3 decode_linear_depth( float4 linear_depth, float4 position ) //decode linear depth into view space position
{
  return ( float3 )( position.xy * ( far / position.z ), far ) * linear_depth.x;
}
__kernel void main( __read_only image2d_t albedo, //diffuse surface color from the g-buffer
                    __read_only image2d_t normals, //normals encoded using spheremap encoding
                    __read_only image2d_t depth, //linear depth
                    __write_only image2d_t result, //the output buffer that stores lighting data
                    __global const float* far_plane, //the lower left and upper right corners of the far plane
                    __global const float* in_view_pos, //view space camera position
                    __global const float* in_lights, //1024 lights {light pos[3], diffuse_color[3], radius[1], specular intensity[1] }
                    __global const float* in_num_of_lights, //number of incoming lights (1024)
                    __global const float* in_projection_matrix ) //the projection matrix is used for frustum culling
{
  /*
   * Per pixel calculations (global)
   */
  int2 coords = ( int2 )( get_global_id( 0 ), get_global_id( 1 ) );
  float4 raw_albedo = read_imagef( albedo, the_sampler, coords );
  float4 raw_normal = read_imagef( normals, the_sampler, coords ); //this will store the decoded normals
  float4 out_color = ( float4 )( 0.0f ); //this will store the resulting color
  out_color.w = 1.0f;
  float4 raw_depth; //this will store the decoded view space position
  int2 global_size = ( int2 )( get_global_size( 0 ), get_global_size( 1 ) );
  /*
   * Per tile data
   */
  float3 view_pos; //this will store the view space position (uniform among the workgroups, but stored as local for speedup)
  int num_of_lights; //num of lights (same here)
  int2 local_coords;
  int2 local_size;
  int workgroup_index;
  __local int tile_lights[1024]; //index of the lights visible per tile
  __local int num_of_tile_lights; //number of lights per tile
  float2 tile_scale; //used for calculating frustum culling, taken from Intel's sample
  float2 tile_bias;
  float4 column_1;
  float4 column_2;
  float4 column_4;
  float4 frustum_planes[6];
  /*
   * Check for skybox
   */
  bool early_rejection = ( length( raw_normal.xy ) == 0.0f );
  if ( early_rejection )
  {
    out_color = raw_albedo;
  }
  else
  {
    local_coords = ( int2 )( get_local_id( 0 ), get_local_id( 1 ) );
    local_size = ( int2 )( get_local_size( 0 ), get_local_size( 1 ) );
    workgroup_index = local_coords.y * local_size.x + local_coords.x;
    float4 ll, ur;
    raw_depth = read_imagef( depth, the_sampler, coords );
    ll = ( float4 )( far_plane[0], far_plane[1], far_plane[2], 1.0f );
    ur = ( float4 )( far_plane[3], far_plane[4], far_plane[5], 1.0f );
    //texture coordinate [0...1] for input processing
    float2 texel = ( float2 )(( float )( coords.x ) / ( float )( global_size.x ), ( float )( coords.y ) / ( float )( global_size.y ) );
    raw_depth.xyz = decode_linear_depth( raw_depth, ( float4 )( my_mix( ll.x, ur.x, texel.x ), my_mix( ll.y, ur.y, texel.y ), ll.z, 1.0f ) );
    raw_normal.xyz = decode_normals_spheremap( raw_normal );
    view_pos = vload3( 0, in_view_pos );
    num_of_lights = ( int )in_num_of_lights[0];
    //I'm not sure if the maths here is correct due to OpenGL and DirectX using different matrices, but this shouldnt matter
    tile_scale = ( float2 )( global_size.x, global_size.y ) * ( 1.0f / ( float )( 2.0f * local_size.x ) );
    tile_bias = tile_scale - ( float2 )( local_coords.x, local_coords.y );
    column_1 = ( float4 )( in_projection_matrix[5] * tile_scale.x, 0.0f, tile_bias.x, 0.0f );
    column_2 = ( float4 )( 0.0f, -in_projection_matrix[10] * tile_scale.y, tile_bias.y, 0.0f );
    column_4 = ( float4 )( 0.0f, 0.0f, 1.0f, 0.0f );
    frustum_planes[0] = column_4 - column_1;
    frustum_planes[1] = column_4 + column_1;
    frustum_planes[2] = column_4 - column_2;
    frustum_planes[3] = column_4 + column_2;
    frustum_planes[4] = ( float4 )( 0.0f, 0.0f, -1.0f, near );
    frustum_planes[5] = ( float4 )( 0.0f, 0.0f, 1.0f, far );
    for ( int c = 0; c < 4; c++ ) //normalize frustum plane normals
    {
      frustum_planes.xyz *= 1.0f / length( frustum_planes.xyz );
    }
    /*
    * Per workgroup (tile) calculations (local)
    */
    if ( workgroup_index == 0 )
    {
      num_of_tile_lights = 0;
    }
  }
  barrier( CLK_LOCAL_MEM_FENCE );
  if ( !early_rejection )
  {
    for ( int c = workgroup_index; c < num_of_lights; c += local_size.x * local_size.y ) //cull each light per tile, each thread in a tile processes one light
    {
      if ( c < num_of_lights )
      {
        bool in_frustum = true;
        float attenuation_end = 0.0f;
        if ( attenuation_type == 0 )
        {
          attenuation_end = ( float )( in_lights[c * 8 + 6] ) / ( float )( cutoff ); //radius / cutoff
        }
        else
        {
          attenuation_end = ( float )( in_lights[c * 8 + 6] ); //radius
        }
        for ( int d = 0; d < 6; d++ ) //cull each light based on the distance where it will shine and the frustum defined by the tile
        {
          float e = dot( frustum_planes, ( float4 )( in_lights[c * 8 + 0], in_lights[c * 8 + 1], in_lights[c * 8 + 2], 1.0f ) );
          in_frustum = in_frustum && ( e >= -attenuation_end );
        }
        if ( in_frustum ) //if the light is in the frustum, then store its index (if I comment this out, the kernel runs, but doesn't cull lights)
        {
          int index = atomic_inc( &num_of_tile_lights );
          tile_lights[index] = c;
          if ( num_of_tile_lights != 0 )
          {
            out_color = ( float4 )( 0.0f, 1.0f, 0.0f, 1.0f );
          }
        }
      }
    }
  }
  barrier( CLK_LOCAL_MEM_FENCE );
  /*
   * Per light calculations
   */
  if ( !early_rejection )
  {
    for ( int c = 0; c < num_of_tile_lights; c++ ) //draw each light per tile
    {
      int index = tile_lights; //get back the light index
      float3 light_pos = ( float3 )( in_lights[index * 8 + 0], in_lights[index * 8 + 1], in_lights[index * 8 + 2] ); //gather light data using the index
      float light_radius = in_lights[index * 8 + 6];
      //calculate blinn-phong lighting with custom attenuation
      float3 light_dir = light_pos - raw_depth.xyz;
      float distance = length( light_dir );
      light_dir /= distance;
      float coeff, attenuation;
      if ( attenuation_type == 0 )
      {
        coeff = max( distance - light_radius, 0.0f ) / light_radius + 1.0f;
        attenuation = max(( 1.0f / ( coeff * coeff ) - cutoff ) / ( 1.0f - cutoff ), 0.0f );
      }
      else
      {
        attenuation = ( light_radius - distance ) / ( light_radius * 0.01f ) * 0.01f;
      }
      if ( attenuation > 0.0f )
      {
        float3 light_diffuse_color = ( float3 )( in_lights[index * 8 + 3], in_lights[index * 8 + 4], in_lights[index * 8 + 5] );
        float light_specular_power = ( float )in_lights[index * 8 + 3];
        float3 view_dir = normalize( view_pos - raw_depth.xyz );
        float3 half_vector = ( light_dir + view_dir ) * 0.5f;
        float n_dot_l = max( dot( raw_normal.xyz, light_dir ), 0.0f );
        out_color.xyz += raw_albedo.xyz * light_diffuse_color * n_dot_l * attenuation;
        float n_dot_h = pow( max( dot( raw_normal.xyz, half_vector ), 0.0f ), light_specular_power );
        out_color.xyz += light_diffuse_color * n_dot_h * attenuation;
      } 
    }
  }
  if ( coords.x < global_size.x && coords.y < global_size.y )
  {
    write_imagef( result, coords, out_color ); //write the calculated light data to the result buffer (texture)
  }
}

Best regards,

Yours3lf

yours3lf · ‎03-01-2012

ok, so I tried to make a small demo for every error prone thing, and I noticed the following:

if you create a kernel with 1 read only image2d_t and 1 write only image2d_t, and copy from the read only to the write only it works.

__kernel void main(__read_only image2d_t texture_read, __write_only image2d_t texture_write)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t the_sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    float4 texture_color = read_imagef(texture_read, the_sampler, coords);
    float4 final_color = texture_color;
    write_imagef(texture_write, coords, final_color); //fills the write only image2d_t with green
}

if you add another read ony image2d_t but you don't actually read in anything, (which might happen in the above kernel in case it draws the background) then it doesn't work, freezes the video output.

__kernel void main(__read_only image2d_t texture_read, __read_only image2d_t texture_read2, __write_only image2d_t texture_write)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t the_sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    float4 texture_color = read_imagef(texture_read, the_sampler, coords);
    float4 final_color = texture_color;
    write_imagef(texture_write, coords, final_color); //fails
}

if you don't actually include the 2nd image2d_t's value (that you read in) in the final output, that is you don't write it to the write only image2d_t, it fails again...

__kernel void main(__read_only image2d_t texture_read, __read_only image2d_t texture_read2, __write_only image2d_t texture_write)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t the_sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    float4 texture_color = read_imagef(texture_read, the_sampler, coords);
    float4 texture_color2 = read_imagef(texture_read2, the_sampler, coords);
    float4 final_color = texture_color;
    write_imagef(texture_write, coords, final_color); //fails
}

it only works if you include the second value as well

__kernel void main(__read_only image2d_t texture_read, __read_only image2d_t texture_read2, __write_only image2d_t texture_write)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t the_sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    float4 texture_color = read_imagef(texture_read, the_sampler, coords);
    float4 texture_color2 = read_imagef(texture_read2, the_sampler, coords);
    float4 final_color = texture_color + texture_color2;
    write_imagef(texture_write, coords, final_color); //works but the output is green again instead of white...
}

the first image2d_t is filled with purple (1, 0, 1)

the second image2d_t is filled with green (0, 1, 0)

the result is expected to be either green (only the first image2d_t is used in the final color), or white if both

is this a driver related issue (I'm using Catalyst 12.1 64 bit) or am I missing something from the OpenCL specifications?

notzed · ‎03-01-2012

if you don't actually include the 2nd image2d_t's value (that you read in) in the final output, that is you don't write it to the write only image2d_t, it fails again...

Well FWIW this will equate to your second example: the compiler will optimise out the unused read.

Also FWIW I tried running your cut-down code on it's own (second example), and nothing untoward happened, although there's more going on than just the kernel so I made some assumptions.

float my_abs( float var ) //these floating point operations aren't supported in opencl 1.1

sure they are, they're called fabs(), fmin()

yours3lf · ‎03-02-2012

thanks for the reply,

yeah I know this from GLSL, but shouldn't it give me an error when executing, or just simply don't freeze?

I've attached an example which freezes the video output for me. Be careful with it if you're not on Win 7.

https://docs.google.com/open?id=0B2XGeuoAIb6gYWxuM1F5cDZTSS1qSFVxZm50S3hrUQ

thanks for the fabs, I didn't notice these in the specs.

notzed · ‎03-02-2012

Usually crashes are bugs in your code, possibly elsewhere such as the gl

interop.

Sorry i'm on Linux and `don't do' c++ either, but maybe someone else can

look at the code. I did run it in virtualbox but it just hung the app;

so yeah i can confirm it crashes

If you're having crashes, try the cpu driver, at least it will just

crash and might help you debug it. (not sure how it works with opengl

stuff tho, been a while since i tried that).

yours3lf · ‎03-02-2012

the source is cross-platform, you only have to compile it and link the executable against: sfml-system sfml-window GLEW OpenCL

But make sure you save your work before running it, because you won't be able to do anything after the freeze but restart your computer.

Ok I'll try, but is this a driver related issue? I use the latest 12.1, but if it is, could someone tell the dev guys to look at it?

yours3lf · ‎03-02-2012

so I modified the little example to use my cpu, and surprisingly it didn't fail, so I suppose this is a driver issue...

AMD dev guys please fix this!

notzed · ‎03-02-2012

Not necessarily, because invalid memory reads/writes will necessarily

have a different effect. Just writing to an unused bit of the stack is

harmless for example. Even with valid memory accesses the parallelism

affects execution order and how conflicting memory accesses are resolved

which will affect program state. In short, just because it runs on a

cpu doesn't mean the code is correct or will run safely on a gpu. GPU

'cpu's don't have the exception handling of modern cpu's either, so it's

easier to hard-crash them (I would think).

FWIW I'm tracking down a hard crash too right now: cpu works fine, no

segfaults, I think the code is ok but it's complicated enough that I

can't be sure, and yesterday it would regularly hard lock X if I ran it

on a GPU every 2-3 runs. Then I went to try to track it down this

morning, and it just wouldn't die - only run super-slowly once in about

10 goes. After a bit of code redesign (unrelated to this routine), the

crash is back ...

Of course one can't rule out a driver bug.

yours3lf · ‎03-20-2012

Hi,

I've solved the freezing problem, and finished implementing the technique.

You can take a look at it here:

http://www.gamedev.net/page/community/iotd/index.html/_/tile-based-deferred-shading-via-opencl-r233

Archives Discussions

kernel freezes video output... again