Hi,
If this is truly the case, it is getting disappointing, as it adds up with the other bugs I am encountering.
(There is another bug, which I haven't reported yet, where clCreateProgramWithBinary has to be disabled, otherwise it breaks some kernels.)
For a simple reduction kernel, the version using the new work_group_reduce_max is about 18x slower than my previous implementation.
Previously, CodeXL reported 0.84030 ms for 100 invocations; now it reports 14.38 ms.
This is the new kernel:
kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
    const int globalId = get_global_id(0);
    const int localId = get_local_id(0);

    /* Out-of-range work-items contribute -INFINITY so they never win the max.
       Note: the local buffer "data" is unused in this version. */
    float maxValue = work_group_reduce_max(globalId < length ? input[globalId] : -INFINITY);

    if (localId == 0)
    {
        groupsResult[get_group_id(0)] = maxValue;
    }
}
This is the previous kernel:
kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
{
    const int globalId = get_global_id(0);
    const int localId = get_local_id(0);

    /* Stage the input into local memory, padding with -INFINITY. */
    if (globalId < length)
        data[localId] = input[globalId];
    else
        data[localId] = -INFINITY;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory; assumes the local size is a power of two. */
    for (int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2)
    {
        if (localId < offset)
        {
            data[localId] = max(data[localId], data[localId + offset]);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (localId == 0)
    {
        groupsResult[get_group_id(0)] = data[localId];
    }
}
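For completeness: both kernel versions leave one partial maximum per work-group in groupsResult, so a final pass is still needed to obtain the global maximum. A minimal host-side sketch (assuming the partial results have already been read back with clEnqueueReadBuffer, and with the function name final_max invented for illustration) could look like this:

    #include <math.h>
    #include <stdio.h>

    /* Reduce the per-work-group partial maxima to a single value.
       Mirrors the kernel's padding convention: starts from -INFINITY. */
    static float final_max(const float *groupsResult, int numGroups)
    {
        float m = -INFINITY;
        for (int i = 0; i < numGroups; ++i)
            if (groupsResult[i] > m)
                m = groupsResult[i];
        return m;
    }

    int main(void)
    {
        /* Example partial results as four work-groups might produce them. */
        const float partial[4] = { 3.5f, -1.0f, 7.25f, 0.0f };
        printf("%g\n", final_max(partial, 4)); /* prints 7.25 */
        return 0;
    }

Alternatively, the kernel can be enqueued a second time over groupsResult, but for a small number of groups the host-side loop is usually cheaper.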
Regards,
Tomer Gal, CTO at OpTeamizer