5 Replies Latest reply on Jul 9, 2011 2:45 AM by maximmoroz

    Is a barrier allowed inside a loop?

    maximmoroz

      I have a kernel that produces slightly different results on each execution, which bothers me greatly :) I have already spent an enormous amount of time trying to figure out the problem, but with no success yet.

      The kernel has an outer loop of a fixed size, with a barrier(CLK_LOCAL_MEM_FENCE) inside the loop. A wild idea came to my mind: is a barrier allowed inside a loop? Honestly, I don't remember seeing other code with a barrier inside a loop.

      Below is a stripped-down version of the kernel. Any ideas are welcome.

      __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
      void ConvolutionRegister(
          const __global float * restrict input,
          __global float * restrict output,
          const __global float * restrict weights,
          const __global int * restrict weights_offsets,
          const __global float * restrict biases)
      {
          __local float input_buffer[IN_SIZE];
          __local float weight_buffer[W_SIZE];

          float sum = 0.0F;
          for(uint input_feature_map_id = 0; input_feature_map_id < INPUT_FEATURE_MAP_COUNT; input_feature_map_id++)
          {
              const int weights_offset = weights_offsets[input_feature_map_id];

              // fill local weight_buffer and input_buffer
              // ...
              // end of fill local weight_buffer and input_buffer

              barrier(CLK_LOCAL_MEM_FENCE);

              // update weighted sum
              // ...
              // end of update weighted sum
          }

          // write result to output feature map
          output[some_index] = sum;
      }

        • Is a barrier allowed inside a loop?
          tonyo_au

          I have done this with no problem. You only have to be careful that all work items reach the barrier.
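
          For illustration, here is a minimal, hypothetical sketch (not from this thread; the kernel name, the buffer names and the assumed work-group size of at least 32 are mine) of what "all work items reach the barrier" means in practice:

          // Toy kernel: the first 32 work items stage data in local memory,
          // then the whole work-group reads it back.
          __kernel void barrier_divergence_example(
              __global const float * restrict src,
              __global float * restrict dst)
          {
              __local float scratch[32];
              const uint lid = get_local_id(0);
              const uint gid = get_global_id(0);

              // WRONG: a barrier inside divergent control flow is undefined,
              // because only some work items would reach it.
              //
              // if (lid < 32)
              // {
              //     scratch[lid] = src[gid];
              //     barrier(CLK_LOCAL_MEM_FENCE);
              // }

              // RIGHT: the condition guards only the work; every work item
              // in the work-group reaches the barrier.
              if (lid < 32)
                  scratch[lid] = src[gid];
              barrier(CLK_LOCAL_MEM_FENCE);

              // now every work item can safely read what the first 32 wrote
              dst[gid] = scratch[lid % 32];
          }

          In the original kernel the loop trip count (INPUT_FEATURE_MAP_COUNT) is the same for every work item, so every work item executes the barrier the same number of times, which is why a barrier inside that loop is legal.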

            • Is a barrier allowed inside a loop?
              maximmoroz

              Oh, thanks. I will dig in another direction.

                • Is a barrier allowed inside a loop?
                  maximmoroz

                  Wow, I solved this very annoying problem. It is not enough to have a single barrier inside the loop. While one wavefront is still doing "update weighted sum" for iteration 1, another wavefront can already be filling local weight_buffer and input_buffer for iteration 2, ruining the work of the first wavefront.

                  Now that I see it, the solution looks obvious and the error silly, but it was not obvious at all until I found the solution (almost by accident). The loop itself looked so nice and compact that I assumed I only needed to take care of synchronization inside the loop body, as if the loop would take care of itself.

                  I am attaching the fixed code.

                  __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
                  void ConvolutionRegister(
                      const __global float * restrict input,
                      __global float * restrict output,
                      const __global float * restrict weights,
                      const __global int * restrict weights_offsets,
                      const __global float * restrict biases)
                  {
                      __local float input_buffer[IN_SIZE];
                      __local float weight_buffer[W_SIZE];

                      float sum = 0.0F;
                      for(uint input_feature_map_id = 0; input_feature_map_id < INPUT_FEATURE_MAP_COUNT; input_feature_map_id++)
                      {
                          const int weights_offset = weights_offsets[input_feature_map_id];

                          // fill local weight_buffer and input_buffer
                          // ...
                          // end of fill local weight_buffer and input_buffer

                          barrier(CLK_LOCAL_MEM_FENCE);

                          // update weighted sum
                          // ...
                          // end of update weighted sum

                          barrier(CLK_LOCAL_MEM_FENCE);
                      }

                      // write result to output feature map
                      output[some_index] = sum;
                  }
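
                  To make the role of each barrier easier to see outside the convolution code, here is a minimal, hypothetical kernel following the same fill / barrier / consume / barrier pattern (the kernel name tiled_sum, the 64-wide work-group and the tile_count argument are my own, not from the thread):

                  __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
                  void tiled_sum(
                      __global const float * restrict input,
                      __global float * restrict output,
                      const uint tile_count)
                  {
                      __local float tile[64];
                      const uint lid = get_local_id(0);

                      float sum = 0.0F;
                      for(uint t = 0; t < tile_count; t++)
                      {
                          // fill the shared tile (one element per work item)
                          tile[lid] = input[t * 64 + lid];

                          // barrier #1: the tile must be fully written
                          // before any work item starts reading it
                          barrier(CLK_LOCAL_MEM_FENCE);

                          // consume the tile (every work item reads all of it)
                          for(uint i = 0; i < 64; i++)
                              sum += tile[i];

                          // barrier #2: everyone must finish reading before the
                          // next iteration overwrites the tile; removing this
                          // barrier reintroduces the race described above
                          barrier(CLK_LOCAL_MEM_FENCE);
                      }

                      output[get_global_id(0)] = sum;
                  }

                  The second barrier in the fixed convolution kernel plays exactly this role: it protects the shared local buffers on both sides of the read.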