
BUG: OpenCL 2 work_group_reduce is 18x slower

Question asked by tomer_gal on Oct 23, 2015
Latest reply on Oct 24, 2015 by ekondis

Hi,

If this is truly the case, it is disappointing, as it adds to the other bugs I have been encountering.

(There is another bug, which I haven't reported yet, where clCreateProgramWithBinary has to be disabled because it otherwise breaks some kernels.)

With a simple reduction kernel, I find that using the new work_group_reduce_max is about 18x slower than my previous implementation.

Previously in CodeXL, 100 invocations consumed 0.84030 ms.

Now in CodeXL, 100 invocations consume 14.38 ms.
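For context, this is roughly how the 100 invocations are enqueued. This is only a sketch: the helper name, the buffer setup and the work-group size of 256 are assumptions, and the kernel time itself comes from the CodeXL profiler, not from host timers.

#include <CL/cl.h>

/* Sketch of the measurement loop. The kernel, queue and buffers are assumed
 * to be created beforehand; CodeXL reports the accumulated kernel time. */
void run_reduction_100(cl_command_queue queue, cl_kernel kernel,
                       cl_mem inputBuf, cl_int length, cl_mem groupsResultBuf)
{
    size_t local  = 256;                                            /* assumed work-group size */
    size_t global = ((size_t)length + local - 1) / local * local;   /* rounded up to a multiple of local */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_int), &length);
    clSetKernelArg(kernel, 2, local * sizeof(cl_float), NULL);      /* local scratch buffer */
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &groupsResultBuf);

    for (int i = 0; i < 100; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);
}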

This is the new kernel:

kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
  const int globalId = get_global_id(0);
  const int localId = get_local_id(0);
  float maxValue = work_group_reduce_max(globalId<length ? input[globalId] : -INFINITY);

  if (localId==0)
  {
      groupsResult[get_group_id(0)] = maxValue;
  }
}
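One detail worth noting: work_group_reduce_max is an OpenCL C 2.0 built-in, so the program has to be built with the 2.0 language option. A minimal sketch of the build call follows; the program and device handles are assumed, and my actual option string may contain additional flags.

/* Build with the OpenCL C 2.0 language version so the work-group
 * reduction built-ins are available. */
clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);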

 

This is the previous kernel:

 

kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
  const int globalId = get_global_id(0);
  const int localId = get_local_id(0);

  if (globalId<length)
      data[localId] = input[globalId];
  else
      data[localId] = -INFINITY;

  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset=get_local_size(0)/2;offset>0;offset=offset/2)
  {
      if (localId<offset)
      {
          data[localId] = max(data[localId],data[localId+offset]);
      }
      barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (localId==0)
  {
      groupsResult[get_group_id(0)] = data[localId];
  }
}
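In both versions the kernel produces only one partial maximum per work-group; the final maximum over groupsResult is computed afterwards. As an illustration (an assumption about my setup; a second kernel pass would work equally well), the per-group maxima can simply be combined on the host:

#include <CL/cl.h>
#include <math.h>
#include <stdlib.h>

/* Hypothetical second stage: read back the per-group maxima and reduce
 * them on the host. numGroups = global size / local size. */
float finish_reduction(cl_command_queue queue, cl_mem groupsResultBuf, size_t numGroups)
{
    float *partial = (float *)malloc(numGroups * sizeof(float));
    clEnqueueReadBuffer(queue, groupsResultBuf, CL_TRUE, 0,
                        numGroups * sizeof(float), partial, 0, NULL, NULL);

    float result = -INFINITY;
    for (size_t i = 0; i < numGroups; ++i)
        result = fmaxf(result, partial[i]);

    free(partial);
    return result;
}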

 

Regards,

Tomer Gal, CTO at OpTeamizer
