
    BUG: OpenCL 2 work_group_reduce is x18 slower

    tomer_gal

      Hi,

      If this is truly the case, it is getting disappointing, as it adds to the other bugs I am encountering.

      (There is another bug, which I haven't reported yet, where clCreateProgramWithBinary has to be disabled, otherwise it breaks some kernels.)

       

      For a simple reduction kernel, I am seeing that using the new work_group_reduce_max is about 18x slower than my previous implementation.

      Previously in CodeXL, 100 invocations consumed 0.84030 ms.

      Now in CodeXL, 100 invocations consume 14.38 ms.
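
      For anyone who wants to reproduce the numbers outside CodeXL, here is a minimal host-side timing sketch using OpenCL profiling events (not what I used, just for reference; queue, kernel, gws and lws are assumed to be set up elsewhere, and the queue must be created with CL_QUEUE_PROFILING_ENABLE):

      #include <CL/cl.h>

      /* Time a single kernel invocation in milliseconds via profiling events. */
      double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                            size_t gws, size_t lws)
      {
          cl_event evt;
          cl_ulong start = 0, end = 0;

          clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, &evt);
          clWaitForEvents(1, &evt);

          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                  sizeof(start), &start, NULL);
          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                  sizeof(end), &end, NULL);
          clReleaseEvent(evt);

          return (end - start) * 1e-6; /* nanoseconds -> milliseconds */
      }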

       

       

      This is the new kernel:

      kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
      {
        const int globalId = get_global_id(0);
        const int localId = get_local_id(0);

        // Out-of-range work-items contribute -INFINITY so they cannot affect the maximum.
        // Note: 'data' is unused here; it is only kept so both kernels share the same signature.
        float maxValue = work_group_reduce_max(globalId<length ? input[globalId] : -INFINITY);

        // The first work-item of each work-group writes out that group's maximum.
        if (localId==0)
        {
            groupsResult[get_group_id(0)] = maxValue;
        }
      }
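
      For reference, the new kernel only compiles when the program is built as OpenCL C 2.0, i.e. with the -cl-std=CL2.0 option, since work_group_reduce_max is an OpenCL C 2.0 built-in. A minimal build sketch, assuming program and device come from the usual clCreateProgramWithSource() / clGetDeviceIDs() setup:

      #include <CL/cl.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Build the program as OpenCL C 2.0 so that work_group_reduce_max is available,
         and print the build log on failure. */
      void build_cl20(cl_program program, cl_device_id device)
      {
          cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);
          if (err != CL_SUCCESS) {
              size_t log_size = 0;
              clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                                    0, NULL, &log_size);
              char *log = (char *)malloc(log_size);
              clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                                    log_size, log, NULL);
              fprintf(stderr, "build failed:\n%s\n", log);
              free(log);
          }
      }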

       

      This is the previous kernel:

       

      kernel void max_reduction(global float * input, int length, local float * data, global float * groupsResult)
      {
        const int globalId = get_global_id(0);
        const int localId = get_local_id(0);

        // Stage each work-item's value in local memory, padding out-of-range items with -INFINITY.
        if (globalId<length)
            data[localId] = input[globalId];
        else
            data[localId] = -INFINITY;

        barrier(CLK_LOCAL_MEM_FENCE);

        // Classic tree reduction in local memory (assumes a power-of-two work-group size).
        for (int offset=get_local_size(0)/2; offset>0; offset=offset/2)
        {
            if (localId<offset)
            {
                data[localId] = max(data[localId],data[localId+offset]);
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // The first work-item of each work-group writes out that group's maximum.
        if (localId==0)
        {
            groupsResult[get_group_id(0)] = data[localId];
        }
      }
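
      In both versions each work-group only writes one partial maximum to groupsResult, so the final maximum still has to be computed afterwards, either on the host or with a second kernel pass. A minimal host-side finish, as a sketch (queue, groupsResultBuf and numGroups are assumed to exist in the surrounding host code):

      #include <CL/cl.h>
      #include <math.h>
      #include <stdlib.h>

      /* Read back the per-work-group maxima and finish the reduction on the host. */
      float finish_max_reduction(cl_command_queue queue, cl_mem groupsResultBuf, size_t numGroups)
      {
          float *partial = (float *)malloc(numGroups * sizeof(float));
          clEnqueueReadBuffer(queue, groupsResultBuf, CL_TRUE, 0,
                              numGroups * sizeof(float), partial, 0, NULL, NULL);

          float result = -INFINITY;
          for (size_t i = 0; i < numGroups; ++i)
              result = fmaxf(result, partial[i]);

          free(partial);
          return result;
      }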

       

      Regards,

      Tomer Gal, CTO at OpTeamizer