3 Replies Latest reply on Feb 2, 2018 5:55 AM by dipak

    Very strange LDS access behavior, help needed.

    liwoog

      Hi,

       

      I have reduced my problem to the test below.

       

      The first version of this kernel executes 100x faster than the second.

      For testing's sake, and to execute only the minimum number of wavefronts, loadCount is sized to use the full LDS (32 KB).

       

      Any help would be greatly appreciated.

       

      Lionel

       

      PS: Stuffing the if statement with a register-based computation makes no difference.

       

      kernel __attribute__((reqd_work_group_size(64, 1, 1)))
      void testKernel(int toiCount,
                      local int *loadCount)
      {
        int lid = get_local_id(0);

        loadCount[lid] = 0;

        // Loop over all beams numbers
        for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
          if (0 == toiIndex) {
            // loadCount[lid] = 0;
          }
        }
      }

       

      kernel __attribute__((reqd_work_group_size(64, 1, 1)))
      void testKernel(int toiCount,
                      local int *loadCount)
      {
        int lid = get_local_id(0);

        // loadCount[lid] = 0;

        // Loop over all beams numbers
        for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
          if (0 == toiIndex) {
            loadCount[lid] = 0;
          }
        }
      }
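
      For reference, here is a minimal host-side sketch in plain C (not OpenCL) that emulates the two kernels above one work-item at a time and compares the resulting buffer contents. It assumes a 32 KB buffer of 4-byte ints and 64 work-items, as described in the post. The helper names (variant_store_before_loop, variant_store_in_loop, variants_match) are hypothetical and exist only for this illustration. It is a sequential CPU emulation, so it says nothing about timing. It only checks that the two versions write the same values whenever toiCount is at least 1, which suggests the slowdown comes from how the store is executed rather than from what is computed.

```c
#include <assert.h>
#include <string.h>

#define WG_SIZE  64                /* matches reqd_work_group_size(64, 1, 1) */
#define LDS_INTS (32 * 1024 / 4)   /* assumed: 32 KB of LDS viewed as 4-byte ints */

/* Variant 1: store before the loop, empty conditional inside the loop. */
static void variant_store_before_loop(int toiCount, int *loadCount, int lid)
{
    assert(lid >= 0 && lid < WG_SIZE);
    loadCount[lid] = 0;
    for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
        if (0 == toiIndex) {
            /* store intentionally omitted, as in the first kernel */
        }
    }
}

/* Variant 2: store inside the loop, guarded by the first-iteration test. */
static void variant_store_in_loop(int toiCount, int *loadCount, int lid)
{
    assert(lid >= 0 && lid < WG_SIZE);
    for (int toiIndex = 0; toiIndex < toiCount; ++toiIndex) {
        if (0 == toiIndex) {
            loadCount[lid] = 0;
        }
    }
}

/* Returns 1 if both variants leave identical buffer contents, else 0. */
static int variants_match(int toiCount)
{
    static int a[LDS_INTS];
    static int b[LDS_INTS];

    /* Fill both buffers with a non-zero pattern so a missing store is visible. */
    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);

    /* Emulate the 64 work-items of one work-group sequentially. */
    for (int lid = 0; lid < WG_SIZE; ++lid) {
        variant_store_before_loop(toiCount, a, lid);
        variant_store_in_loop(toiCount, b, lid);
    }
    return memcmp(a, b, sizeof a) == 0;
}
```

      Calling variants_match(1) or variants_match(100) reports a match, while variants_match(0) does not, because the second version never stores anything when the loop body does not run. Any comparison of the two kernels on the device should therefore use toiCount >= 1.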