tomer_gal
Adept I

BUG: OpenCL 2 work_group_reduce is 18x slower

Hi,

If this is truly the case, it is disappointing, as it adds to the other bugs I am encountering.

(There is another bug, which I haven't reported, where clCreateProgramWithBinary has to be disabled; otherwise it breaks some kernels.)

For a simple reduction kernel, I find that the new work_group_reduce_max is 18 times slower than the previous implementation.

Previously in CodeXL, 100 invocations consumed: 0.84030

Now in CodeXL, 100 invocations consume 14.38ms
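
As a quick sanity check on those two figures, here is a minimal sketch of the arithmetic. It assumes both totals are in milliseconds (the first one is quoted without a unit) and that both cover the same 100 invocations; the helper names are hypothetical.

```c
/* Assumption: both totals are in milliseconds and cover the same
   number of kernel invocations. Helper names are illustrative only. */

/* Ratio between two profiler totals measured over the same number of runs. */
static double slowdown(double old_total_ms, double new_total_ms)
{
    return new_total_ms / old_total_ms;
}

/* Average time of a single invocation. */
static double per_invocation_ms(double total_ms, int invocations)
{
    return total_ms / invocations;
}
```

Under that assumption, slowdown(0.84030, 14.38) comes out at roughly 17, the same order as the 18x quoted above, and per_invocation_ms(14.38, 100) is about 0.14 ms per launch.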

This is the new kernel:

kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
  const int globalId = get_global_id(0);
  const int localId = get_local_id(0);

  float maxValue = work_group_reduce_max(globalId<length ? input[globalId] : -INFINITY);

  if (localId==0)
  {
      groupsResult[get_group_id(0)] = maxValue;
  }
}

This is the previous kernel:

kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
  const int globalId = get_global_id(0);
  const int localId = get_local_id(0);

  if (globalId<length)
      data[localId] = input[globalId];
  else
      data[localId] = -INFINITY;

  barrier(CLK_LOCAL_MEM_FENCE);

  for (int offset=get_local_size(0)/2;offset>0;offset=offset/2)
  {
      if (localId<offset)
      {
          data[localId] = max(data[localId],data[localId+offset]);
      }
      barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (localId==0)
  {
      groupsResult[get_group_id(0)] = data[localId];
  }
}

Regards,

Tomer Gal, CTO at OpTeamizer

3 Replies
ekondis
Adept II

It seems that you have run into the same bug I reported in the thread "Poor workgroup reduction function performance (OpenCL 2.0)".

I still find workgroup reductions to be quite slow, producing a huge stream of instructions.

What type of GPU are you using?

Hi ekondis,

I am using a FirePro W7100.

Given how long SDK v3.0 has been in beta, I wonder what kind of testing was done that neglected performance to this degree.

Essentially, it's like getting an emulation of the OpenCL 2 features without being able to use them in a production environment.

Regards,

Tomer Gal, CTO at OpTeamizer

Hmmm... It's sad that even the W7100 exhibits such low performance. I was hoping that the Tonga-based W7100, which features the most recent instruction set (Volcanic Islands), would perform better, but that seems not to be the case. This eliminates the benefit of the workgroup reduction functions and forces programmers to use handcrafted implementations. GPU programming is about performance, after all.
