3 Replies Latest reply on Feb 4, 2011 8:01 PM by nou

    barriers in kernels

    rick.weber

      Hello,

      I'm trying to implement radix sort along one dimension of a 2-dimensional array. So global_size(0) = 64, as does local_size(0), and global/local_size(1) = numRows. The second work dimension corresponds to the row that work-group should sort. Unfortunately, the length of my rows isn't guaranteed to be a multiple of 64, which causes heartache with respect to loops and barriers.

      Basically, I have this for loop with barriers (other code removed):

      for(unsigned int j = get_local_id(0); j < len; j += get_local_size(0))
      {
          barrier(CLK_LOCAL_MEM_FENCE);
          ...
      }

      where len = 100. The problem I'm having is that thread divergence causes some work-items to execute the barrier and others not to. I'm not sure whether this is a problem on the GPU, since threads execute in lockstep within a wavefront, but on the CPU it definitely is. I know the OpenCL spec says all work-items in a group must execute the barrier, but can I do the following? Each thread should execute the same number of barriers before moving past the if statement.

      unsigned int j = get_local_id(0);
      for(; j < len; j += get_local_size(0))
      {
          barrier(CLK_LOCAL_MEM_FENCE);
      }
      if(j >= len)
      {
          barrier(CLK_LOCAL_MEM_FENCE);
      }
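
      For illustration, the trip counts can be worked out on the host. This small C sketch (names are my own, not from the kernel) shows that with len = 100 and a work-group size of 64, different work-items reach the in-loop barrier a different number of times:

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Number of loop iterations (and in-loop barrier executions) one
         work-item performs in: for (j = lid; j < len; j += lsize) */
      static int trip_count(int lid, int len, int lsize)
      {
          int count = 0;
          for (int j = lid; j < len; j += lsize)
              count++;
          return count;
      }

      int main(void)
      {
          const int len = 100, lsize = 64;
          /* Work-items 0..35 iterate twice; work-items 36..63 iterate once.
             The barrier counts diverge, which the spec forbids. */
          printf("lid 0:  %d iterations\n", trip_count(0, len, lsize));
          printf("lid 63: %d iterations\n", trip_count(63, len, lsize));
          return 0;
      }
      ```

      Note this also shows why the trailing if(j >= len) barrier alone doesn't balance things: every work-item satisfies j >= len once the loop exits, so the faster work-items still end up one barrier short of the slower ones.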

        • barriers in kernels
          Meteorhead

          As far as I know it should work, although the code you wrote does not do what you described in the text.

          If it doesn't, you can definitely rearrange your code so that all work-items hit the same barrier.

          while( i < longestRun )
          {
              if( I_am_thread_to_do_work )
              { ...
              }

              i += whatever;
              barrier();
          }
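
          Applied to the loop above, the pattern might look like this in OpenCL C (a sketch only; the kernel arguments and the per-element work are placeholders, not rick.weber's actual code):

          ```
          __kernel void sort_rows(__global uint *data, uint len)
          {
              uint lid   = get_local_id(0);
              uint lsize = get_local_size(0);

              /* Every work-item executes the same number of iterations,
                 so every work-item reaches every barrier. */
              uint iters = (len + lsize - 1) / lsize;   /* ceil(len / lsize) */
              for (uint i = 0; i < iters; ++i)
              {
                  uint j = i * lsize + lid;
                  if (j < len)
                  {
                      /* ... do the work for element j ... */
                  }
                  barrier(CLK_LOCAL_MEM_FENCE);
              }
          }
          ```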

          Hope that helps.

            • barriers in kernels
              rick.weber

              Yeah, that's pretty much what I ended up doing. Unfortunately, my code has tons of barriers in the main loop, because I have to compute a scan in shared memory each iteration and update some indices. I can get rid of a bunch of them if I use a private register instead of a local memory variable. But this has the unfortunate side effect of hurting memory bandwidth when finding the pivot point. I guess I'll have to experiment to see which is the lesser evil.
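
              For context, the per-iteration scan described above is typically a Hillis-Steele style scan over local memory, which needs a barrier between every doubling step. This host-side C sketch (my own illustration, not the actual kernel) mimics the access pattern; each outer iteration stands in for one barrier:

              ```c
              #include <assert.h>
              #include <string.h>

              #define N 8  /* stand-in for the work-group size */

              /* Inclusive Hillis-Steele scan. In an OpenCL kernel each outer
                 iteration would end with barrier(CLK_LOCAL_MEM_FENCE), so a
                 64-wide work-group pays log2(64) = 6 barriers per scan. */
              static void inclusive_scan(unsigned buf[N])
              {
                  unsigned tmp[N];
                  for (int offset = 1; offset < N; offset <<= 1) {
                      for (int i = 0; i < N; i++)
                          tmp[i] = (i >= offset) ? buf[i] + buf[i - offset] : buf[i];
                      memcpy(buf, tmp, sizeof tmp);  /* the barrier point */
                  }
              }

              int main(void)
              {
                  unsigned buf[N] = {1, 1, 1, 1, 1, 1, 1, 1};
                  inclusive_scan(buf);
                  for (int i = 0; i < N; i++)
                      assert(buf[i] == (unsigned)(i + 1));
                  return 0;
              }
              ```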