
maximmoroz
Journeyman III

Is barrier allowed inside a loop?

I have a kernel. The kernel produces slightly different results on each execution, which bothers me greatly 🙂 I have already spent an enormous amount of time trying to track down the problem, but with no success yet.

The kernel has an outer loop with a fixed trip count, and there is a barrier(CLK_LOCAL_MEM_FENCE) inside that loop. A wild idea came to my mind: is a barrier allowed inside a loop at all? Honestly, I don't remember seeing other code with a barrier inside a loop.

Below is a stripped-down version of the kernel. Any ideas are welcome.

__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void ConvolutionRegister(
    const __global float * restrict input,
    __global float * restrict output,
    const __global float * restrict weights,
    const __global int * restrict weights_offsets,
    const __global float * restrict biases)
{
    __local float input_buffer[IN_SIZE];
    __local float weight_buffer[W_SIZE];
    float sum = 0.0F;
    for (uint input_feature_map_id = 0; input_feature_map_id < INPUT_FEATURE_MAP_COUNT; input_feature_map_id++)
    {
        const int weights_offset = weights_offsets[input_feature_map_id];
        // fill local weight_buffer and input_buffer
        // ...
        // end of fill local weight_buffer and input_buffer
        barrier(CLK_LOCAL_MEM_FENCE);
        // update weighted sum
        // ...
        // end of update weighted sum
    }
    // write result to output feature map
    output[some_index] = sum;
}

0 Likes
5 Replies
tonyo_au
Journeyman III

I have done this with no problem. You only have to be careful that all work items reach the barrier.
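To expand on that: the OpenCL spec permits barrier() inside a loop, but every work item in the work-group must execute the barrier the same number of times; if control flow diverges around a barrier, behavior is undefined. A minimal sketch of the legal and illegal cases (OpenCL C, illustrative fragments only):

```c
// Legal: the loop bound is uniform across the work-group, so every
// work item executes the barrier the same number of times.
for (uint i = 0; i < INPUT_FEATURE_MAP_COUNT; i++)
{
    // ... all work items write to __local memory ...
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... all work items read from __local memory ...
}

// Illegal (undefined behavior): the condition depends on the work
// item's id, so only some work items reach the barrier.
if (get_local_id(0) < 8)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
```

In your kernel the loop bound INPUT_FEATURE_MAP_COUNT is the same for every work item, so the barrier placement itself is legal.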

0 Likes

Oh, thanks. I will dig in another direction.

0 Likes

Wow, I solved this most annoying problem. A single barrier inside the loop is not enough. While one wavefront is still doing "update weighted sum" in iteration 1, another wavefront may already be doing "fill local weight_buffer and input_buffer" in iteration 2, ruining the work of the first wavefront.

Now the solution looks obvious and the error silly, but it was not obvious at all until I found the fix (almost by accident). The loop looked so nice and compact that I assumed I only needed to take care of synchronization within an iteration, as if the loop boundary would take care of itself.

I am attaching the fixed code.

__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void ConvolutionRegister(
    const __global float * restrict input,
    __global float * restrict output,
    const __global float * restrict weights,
    const __global int * restrict weights_offsets,
    const __global float * restrict biases)
{
    __local float input_buffer[IN_SIZE];
    __local float weight_buffer[W_SIZE];
    float sum = 0.0F;
    for (uint input_feature_map_id = 0; input_feature_map_id < INPUT_FEATURE_MAP_COUNT; input_feature_map_id++)
    {
        const int weights_offset = weights_offsets[input_feature_map_id];
        // fill local weight_buffer and input_buffer
        // ...
        // end of fill local weight_buffer and input_buffer
        barrier(CLK_LOCAL_MEM_FENCE);
        // update weighted sum
        // ...
        // end of update weighted sum
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write result to output feature map
    output[some_index] = sum;
}

0 Likes

I believe minimizing barriers is a good thing. Think about algorithms that need the fewest barriers, even at the expense of more code. The idea is to get more concurrent execution and keep the GPU's compute units occupied.

0 Likes

I know. But if you can fit more than one work-group onto each compute unit, then barriers are not a big deal.

0 Likes