Hi,
If this is truly the case, it is getting disappointing, as it adds up with the other bugs I am encountering.
(There is another bug, which I haven't reported yet, where clCreateProgramWithBinary has to be disabled, otherwise it breaks some kernels.)
For a simple reduction kernel, the version using the new work_group_reduce_max is about 18x slower than my previous implementation.
Previously, CodeXL reported 0.84030 ms for 100 invocations; now it reports 14.38 ms.
This is the new kernel:
kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
    const int globalId = get_global_id(0);
    const int localId = get_local_id(0);

    /* Out-of-range work-items contribute -INFINITY so they never win the max.
       Note: the local buffer "data" is unused in this version. */
    float maxValue = work_group_reduce_max(globalId < length ? input[globalId] : -INFINITY);

    if (localId == 0)
    {
        groupsResult[get_group_id(0)] = maxValue;
    }
}
This is the previous kernel:
kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
    const int globalId = get_global_id(0);
    const int localId = get_local_id(0);

    /* Stage the input into local memory, padding with -INFINITY. */
    if (globalId < length)
        data[localId] = input[globalId];
    else
        data[localId] = -INFINITY;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory; assumes the local size is a power of two. */
    for (int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2)
    {
        if (localId < offset)
        {
            data[localId] = max(data[localId], data[localId + offset]);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (localId == 0)
    {
        groupsResult[get_group_id(0)] = data[localId];
    }
}
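For completeness: both kernel versions leave one partial maximum per work-group in groupsResult, so a final pass is still needed to obtain the global maximum. A minimal host-side sketch (assuming the partial results have already been read back with clEnqueueReadBuffer, and with the function name final_max invented for illustration) could look like this:

    #include <math.h>
    #include <stdio.h>

    /* Reduce the per-work-group partial maxima to a single value.
       Mirrors the kernel's padding convention: starts from -INFINITY. */
    static float final_max(const float *groupsResult, int numGroups)
    {
        float m = -INFINITY;
        for (int i = 0; i < numGroups; ++i)
            if (groupsResult[i] > m)
                m = groupsResult[i];
        return m;
    }

    int main(void)
    {
        /* Example partial results as four work-groups might produce them. */
        const float partial[4] = { 3.5f, -1.0f, 7.25f, 0.0f };
        printf("%g\n", final_max(partial, 4)); /* prints 7.25 */
        return 0;
    }

Alternatively, the kernel can be enqueued a second time over groupsResult, but for a small number of groups the host-side loop is usually cheaper.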
Regards,
Tomer Gal, CTO at OpTeamizer