
maximmoroz
Journeyman III

Different results of subsequent kernel execution on the same data

I have a very strange problem: slightly different results from subsequent executions of the same kernel on the same data.

AMD R6950, AMD APP 2.4, ATI Catalyst 11.5, Windows 7 Ultimate.

The kernel does convolutions. It is compiled with the -cl-fast-relaxed-math option.

The kernel uses two local buffers: one for a region of the input feature map and another for the weights. The kernel contains a loop over input feature maps. On each iteration of the loop these two local buffers are filled from global memory, then a barrier with a local memory fence is issued. The calculations are then done on the data already sitting in the local buffers, with the results accumulated in several registers. At the end, the register contents are written to the global output buffer.

The problem is in the code that fills one of the local buffers.

The following code always works fine and gives identical results every run, but it is slow:

#pragma unroll
for(uint i = 0; i < max_iteration_count; i++)
{
 const bool input_in_bounds = (input_window_y < input_buffer_fill_max_y) && (input_window_x < input_buffer_fill_max_x);
 const uint offset = input_buffer_fill_input_base_offset + mad24(input_window_y, (uint)input_width, input_window_x);

 if (input_in_bounds)
 {
  input_buffer[local_it] = input[offset];
 }
 else
 {
  input_buffer[local_it] = 0.0F;
 }

 local_it += forward_register_local_work_width * forward_register_local_work_height;
 input_window_x += input_window_x_step;
 const bool is_next_line = input_window_x >= input_window_width;
 input_window_x -= is_next_line ? input_window_width : 0;
 input_window_y += is_next_line ? input_window_y_step + 1 : input_window_y_step;
}

If I replace the middle part with predicated code, I get a much faster kernel:

#pragma unroll
for(uint i = 0; i < max_iteration_count; i++)
{
 const bool input_in_bounds = (input_window_y < input_buffer_fill_max_y) && (input_window_x < input_buffer_fill_max_x);
 const uint offset = input_buffer_fill_input_base_offset + mad24(input_window_y, (uint)input_width, input_window_x);

 const uint actual_offset = input_in_bounds ? offset : dest_local;
 const float val = input[actual_offset];
 input_buffer[local_it] = input_in_bounds ? val : 0.0F;

 local_it += forward_register_local_work_width * forward_register_local_work_height;
 input_window_x += input_window_x_step;
 const bool is_next_line = input_window_x >= input_window_width;
 input_window_x -= is_next_line ? input_window_width : 0;
 input_window_y += is_next_line ? input_window_y_step + 1 : input_window_y_step;
}

This version has no conditional branches and is thus able to pack several VFETCH instructions into one TEX clause:

  51  x: ADD_INT     R1.x,  R16.z,  PV50.x     
09 TEX: ADDR(2066) CNT(3)
  52  VFETCH R2.x___, R0.w, fc153 
   FETCH_TYPE(NO_INDEX_OFFSET)
  53  VFETCH R3.x___, R3.w, fc153 
   FETCH_TYPE(NO_INDEX_OFFSET)
  54  VFETCH R4.x___, R4.w, fc153 
   FETCH_TYPE(NO_INDEX_OFFSET)
10 ALU: ADDR(284) CNT(79) KCACHE0(CB1:0-15)
  55  x: CNDE_INT    R0.x,  R0.y,  0.0f,  R2.x     
   y: AND_INT     R0.y,  R2.y,  1      VEC_120

But the problem is that this code gives different results each time I run the kernel over millions of work-items. "Different" here means that the average aggregated result (a single float) fluctuates within 0.01%.

The problem arises only when the work-group size is larger than 64 (that is, 128, 192 or 256). Yes, I do have barrier(CLK_LOCAL_MEM_FENCE).

There is a third version of the kernel that fills the local buffer from the global one even more efficiently (all 11 VFETCH instructions end up in a single TEX clause). It shows even greater fluctuations than the second version.

While 0.01% is not very important to me in itself, it is very annoying to see different results. Maybe I have some grave error in the algorithm or implementation and just don't see it because of these fluctuations? I feel very uncomfortable.

Please help me with any ideas; I have already spent a lot of time on this issue. I would be glad to hear either:
1) It is a known problem and will be fixed in the next driver release, OR
2) Hey, maximmoroz, you dummy. What were you thinking when you wrote that line of code?!
But I am a realist, and I would actually be glad to hear any ideas.

Thanks to all who read this far. I am attaching the full source code of the kernel. You will not be able to run it (the host code is large), but you can compile it in the Kernel Analyzer, for example.

 

#define output_width 92
#define output_height 92
#define output_feature_map_count 12
#define input_feature_map_count 2
#define source_width 5
#define source_height 5
#define input_width 96
#define input_height 96
#define output_entry_size 101568
#define input_entry_size 18432
#define forward_register_local_work_width 16
#define forward_register_local_work_height 16
#define forward_register_workload_width 3
#define forward_register_workload_height 3
#define testing_forward_register_reqd_work_group_size_qualifier __attribute__((reqd_work_group_size(16, 16, 1)))

__kernel testing_forward_register_reqd_work_group_size_qualifier void ConvolutionRegister(
 const __global float * restrict input,
 __global float * restrict output,
 const __global float * restrict weights,
 const __global int * restrict weights_offsets,
 const __global float * restrict biases)
{
 __local float weights_buffer[((source_width * source_height) + (forward_register_local_work_width * forward_register_local_work_height - 1)) / (forward_register_local_work_width * forward_register_local_work_height) * (forward_register_local_work_width * forward_register_local_work_height)];
 __local float input_buffer[((forward_register_local_work_width * forward_register_workload_width + source_width - 1) * (forward_register_local_work_height * forward_register_workload_height + source_height - 1) + (forward_register_local_work_width * forward_register_local_work_height - 1)) / (forward_register_local_work_width * forward_register_local_work_height) * (forward_register_local_work_width * forward_register_local_work_height)];

 const uint dest_x_base = mul24((uint)(get_group_id(0)), (uint)(forward_register_local_work_width * forward_register_workload_width));
 const uint dest_y_base = mul24((uint)(get_group_id(1)), (uint)(forward_register_local_work_height * forward_register_workload_height));
 const uint input_window_width = forward_register_local_work_width * forward_register_workload_width + source_width - 1;
 const uint input_window_height = forward_register_local_work_height * forward_register_workload_height + source_height - 1;
 const uint input_window_size = input_window_width * input_window_height;
 const uint fill_weights_buffer_iteration_count = ((source_width * source_height) + (forward_register_local_work_width * forward_register_local_work_height - 1)) / (forward_register_local_work_width * forward_register_local_work_height);
 const uint dd = get_global_id(2);
 const uint entry_id = dd / output_feature_map_count;
 const uint output_feature_map_id = dd % output_feature_map_count;
 const uint dest_x_local = get_local_id(0);
 const uint dest_y_local = get_local_id(1);
 const uint dest_local = mad24(dest_y_local, (uint)forward_register_local_work_width, dest_x_local);

 float sum[forward_register_workload_width * forward_register_workload_height];
 for(uint i = 0; i < forward_register_workload_height * forward_register_workload_width; i++)
 {
  sum[i] = 0.0F;
 }

 uint input_buffer_fill_input_base_offset = (entry_id * input_entry_size) + mad24(dest_y_base, (uint)input_width, dest_x_base);
 const uint input_window_y_backoff = dest_local / input_window_width;
 const uint input_window_x_backoff = dest_local % input_window_width;
 const uint max_iteration_count = (input_window_size + (forward_register_local_work_width * forward_register_local_work_height - 1)) / (forward_register_local_work_width * forward_register_local_work_height);
 const uint input_window_y_step = (forward_register_local_work_width * forward_register_local_work_height) / input_window_width;
 const uint input_window_x_step = (forward_register_local_work_width * forward_register_local_work_height) % input_window_width;
 const uint input_buffer_fill_max_y = input_height - dest_y_base;
 const uint input_buffer_fill_max_x = input_width - dest_x_base;
 const uint start_input_local = mad24(dest_y_local, (uint)(forward_register_workload_height * input_window_width), mul24(dest_x_local, (uint)forward_register_workload_width));
 const uint weights_offset_offset = mul24(output_feature_map_id, input_feature_map_count);

 #pragma unroll 1
 for(uint input_feature_map_id = 0; input_feature_map_id < input_feature_map_count; input_feature_map_id++)
 {
  const int weights_offset = weights_offsets[weights_offset_offset + input_feature_map_id];
  if (weights_offset >= 0)
  {
   if (dest_local < input_window_size)
   {
    uint input_window_y = input_window_y_backoff;
    uint input_window_x = input_window_x_backoff;
    uint local_it = dest_local;
    #pragma unroll
    for(uint i = 0; i < max_iteration_count; i++)
    {
     const bool input_in_bounds = (input_window_y < input_buffer_fill_max_y) && (input_window_x < input_buffer_fill_max_x);
     const uint offset = input_buffer_fill_input_base_offset + mad24(input_window_y, (uint)input_width, input_window_x);
     const uint actual_offset = input_in_bounds ? offset : dest_local;
     const float val = input[actual_offset];
     input_buffer[local_it] = input_in_bounds ? val : 0.0F;
     //if (input_in_bounds) { input_buffer[local_it] = input[actual_offset]; } else { input_buffer[local_it] = 0.0F; }
     local_it += forward_register_local_work_width * forward_register_local_work_height;
     input_window_x += input_window_x_step;
     const bool is_next_line = input_window_x >= input_window_width;
     input_window_x -= is_next_line ? input_window_width : 0;
     input_window_y += is_next_line ? input_window_y_step + 1 : input_window_y_step;
    }
   }

   uint ind = dest_local;
   for(uint fill_weights_buffer_iteration = 0; fill_weights_buffer_iteration < fill_weights_buffer_iteration_count; fill_weights_buffer_iteration++)
   {
    if (ind < source_width * source_height)
    {
     weights_buffer[ind] = weights[weights_offset + ind];
    }
    ind += forward_register_local_work_width * forward_register_local_work_height;
   }

   barrier(CLK_LOCAL_MEM_FENCE);

   uint weights_base_offset2 = 0;
   uint input_base_offset2 = start_input_local;
   #pragma unroll
   for(uint source_y = 0; source_y < source_height; source_y++)
   {
    #pragma unroll
    for(uint source_x = 0; source_x < source_width; source_x++)
    {
     const float w = weights_buffer[weights_base_offset2 + source_x];
     uint input_base_offset3 = input_base_offset2 + source_x;
     uint reg_base_offset = 0;
     #pragma unroll
     for(uint y = 0; y < forward_register_workload_height; y++)
     {
      #pragma unroll
      for(uint x = 0; x < forward_register_workload_width; x++)
      {
       sum[reg_base_offset + x] += input_buffer[input_base_offset3 + x] * w;
      }
      input_base_offset3 += input_window_width;
      reg_base_offset += forward_register_workload_width;
     }
    }
    weights_base_offset2 += source_width;
    input_base_offset2 += input_window_width;
   }
  }
  input_buffer_fill_input_base_offset += input_width * input_height;
 }

 const uint dest_x_initial = mul24((uint)(get_global_id(0)), (uint)forward_register_workload_width);
 const uint dest_y_initial = mul24((uint)(get_global_id(1)), (uint)forward_register_workload_height);
 const bool is_actual_neuron_initial = (dest_x_initial < output_width) && (dest_y_initial < output_height);
 if (is_actual_neuron_initial)
 {
  const uint max_y = output_height - dest_y_initial;
  const uint max_x = output_width - dest_x_initial;
  const uint initial_offset = (entry_id * output_entry_size) + (output_feature_map_id * (output_width * output_height)) + mad24(dest_y_initial, (uint)output_width, dest_x_initial);
  uint offset = initial_offset;
  uint buffer_offset = 0;
  const float bias = biases[output_feature_map_id];
  for(uint y = 0; y < forward_register_workload_height; y++)
  {
   const bool valid_y = (y < max_y);
   for(uint x = 0; x < forward_register_workload_width; x++)
   {
    const bool valid_full = valid_y && (x < max_x);
    const uint actual_offset = valid_full ? offset + x : initial_offset;
    const uint actual_buffer_offset = valid_full ? buffer_offset + x : 0;
    output[actual_offset] = sum[actual_buffer_offset] + bias;
   }
   offset += output_width;
   buffer_offset += forward_register_workload_width;
  }
 }
}

4 Replies
Jawed
Adept II

I suspect that the compiler is re-ordering your memory fetches. In the fast code I imagine it fetches some values from memory too early.

Try to test this by disabling loop unrolling with #pragma unroll 1. With a bit of luck the memory fetches will remain inside the loop and will therefore stay correctly ordered.


Nice.

Months ago I had to wait for a new SDK to fix a similar issue - a barrier that was being ignored 😞 So I'm scarred for life by memory-ordering problems.


What doesn't kill us makes us stronger, Jawed 🙂
