2 Replies Latest reply on Feb 23, 2015 7:26 PM by jason

    best strategy for many scattered reductions

    jason

      I'm doing several statistics/reductions on a labeled image to compute bounding box, area, maximum and minimum values of related pixels in a source image (4 million pixels for both label and source, each).  The maximum number of labels in general is close to a 10th of that.  Each entry in the label image can occur in any distribution but you will often get large portions of the image concentrating on a few labels.

       

      So far I have 2 implementations:

      1 using global datastore and using atomic max/mins/incs.  This achieves 30 ms/frame (sucks!)

       

      The other is in lds memory and assumes a small known upper bound and had the workgroup do those atomic reductions in lds.  Then the workgroup merges those to the gds via atomic maxs/mins/incs. Higher maximums will start to limit wavefronts and also increase the number of writes.  This achieves 2-3ms per frame with a maximum of 512 labels.

       

      So the small upper bound is achievable in some situations but really I'd like to know if there's a better strategy.  I'd rather not just go exploring willy nilly more than I already have without checking if there's any suggestions on dealing with this kind of scattered atomic write/inc problem.

       

      One idea I had was splitting things up spatially to distribute the atomic operations over different areas of memory.  This didn't help a bit though.  I tried varying with odd sizes the stride between elements also seeming to not effect anything.

        • Re: best strategy for many scattered reductions
          sudarshan

          Hi,

          It is difficult from your post to follow what your code is trying to do. It would be helpful if you can bring more clarity to it.

          LDS with atomics will be faster than on GDS.

           

          Thanks,

            • Re: Re: best strategy for many scattered reductions
              jason
              #include "common/common.h"
              #include "common/image.h"
              
              #ifndef PIXELT
              #define PIXELT int
              #endif
              #ifndef LABELT
              #define LABELT uint
              #endif
              #ifndef LDSPIXELT
              #define LDSPIXELT uint
              #endif
              #ifndef LDSLABELT
              #define LDSLABELT uint
              #endif
              
              typedef PIXELT PixelT;
              typedef LDSPIXELT LDSPixelT;
              typedef LABELT LabelT;
              typedef LDSLABELT LDSLabelT;
              
              #ifndef BG_VALUE
              #define BG_VALUE 0
              #endif
              
              #ifndef WORKGROUP_TILE_COLS
              #define WORKGROUP_TILE_COLS 32
              #endif
              #ifndef WORKGROUP_TILE_ROWS
              #define WORKGROUP_TILE_ROWS 8
              #endif
              
              #ifndef WORKITEM_REPEAT_COLS
              #define WORKITEM_REPEAT_COLS 1
              #endif
              #ifndef WORKITEM_REPEAT_ROWS
              #define WORKITEM_REPEAT_ROWS 16
              #endif
              
              //effective work TILEs
              #define EFFECTIVE_WORKGROUP_TILE_COLS (WORKGROUP_TILE_COLS * WORKITEM_REPEAT_COLS)
              #define EFFECTIVE_WORKGROUP_TILE_ROWS (WORKGROUP_TILE_ROWS * WORKITEM_REPEAT_ROWS)
              
              //LDS variant used on MAX_LABELS define
              
              struct __attribute__((packed)) stats_record_t{
                  int r_min;
                  int c_min;
                  int r_max;
                  int c_max;
                  int area;
                  int min_pixel;
                  int max_pixel;
              };
              
              #define pixel_at(type, basename, r, c) image_pixel_at(type, PASTE2(basename, _p), im_rows, im_cols, PASTE2(basename, _pitch), (r), (c))
              
              //requirement: stats_p has max_label + 1 elements
              
              //ldims: work efficient number of labels to process
              //gdims: total number of labels
              __kernel void init_blob_stats(
                  __global struct stats_record_t *stats_p, const uint max_labels
              ){
                  const size_t label = get_global_id(0);
                  struct stats_record_t stats;
                  stats.r_min = INT_MAX;
                  stats.c_min = INT_MAX;
                  stats.r_max = INT_MIN;
                  stats.c_max = INT_MIN;
                  stats.area = 0;
                  stats.min_pixel = INT_MAX;
                  stats.max_pixel = INT_MIN;
                  if(label < max_labels){
                      stats_p[label] = stats;
                  };
              
              }
              
              #define WORKITEM_REPEAT_FOR                                              \
              _Pragma("unroll")                                                        \
              for (int i = 0; i < WORKITEM_REPEAT_ROWS; ++i){                          \
                  _Pragma("unroll")                                                    \
                  for (int j = 0; j < WORKITEM_REPEAT_COLS; ++j){                      \
                      const uint wg_t_c = wg_tile_col + WORKGROUP_TILE_COLS * j;       \
                      const uint wg_t_r = wg_tile_row + WORKGROUP_TILE_ROWS * i;       \
                      const uint c = wg_t_c + effective_wg_tile_col_start;             \
                      const uint r = wg_t_r + effective_wg_tile_row_start;             \
                      const bool valid_pixel_task = (r < im_rows) & (c < im_cols);     \
                      (void) valid_pixel_task;
              
              #define WORKITEM_REPEAT_FOR_VALUES    \
                  WORKITEM_REPEAT_FOR               \
                  const PixelT pixel = pixels[i][j];\
                  const LabelT label = labels[i][j];\
                  (void) pixel;                     \
                  (void) label;
              
              
              #define INIT_LDS_FIELD(value)                                    \
                  for(uint l = tid; l < max_labels; l += wg_size){             \
                      stat_shm[l] = value;                                     \
                  }                                                            \
                  lds_barrier();                                               \
              
              #define SET_GDS_FIELD(reduction, field)                         \
                  for(uint l = tid; l < max_labels; l += wg_size){            \
                      __global struct stats_record_t *stats = stats_p + l;    \
                      reduction(&stats->field, stat_shm[l]);                  \
                  }
              
              #define WORKGROUP_TILE_HEADER                                                                                        \
                  const uint effective_wg_tile_col_blocksize = EFFECTIVE_WORKGROUP_TILE_COLS;                                      \
                  const uint effective_wg_tile_row_blocksize = EFFECTIVE_WORKGROUP_TILE_ROWS;                                      \
                  const uint effective_wg_tile_col_block = get_group_id(0) + get_global_offset(0) / get_local_size(0);             \
                  const uint effective_wg_tile_row_block = get_group_id(1) + get_global_offset(1) / get_local_size(1);             \
                  const uint wg_tile_col = get_local_id(0);                                                                        \
                  const uint wg_tile_row = get_local_id(1);                                                                        \
                                                                                                                                   \
                  uint effective_wg_tile_rows = effective_wg_tile_row_blocksize;                                                   \
                  uint effective_wg_tile_cols = effective_wg_tile_col_blocksize;                                                   \
                                                                                                                                   \
                  const uint effective_wg_tile_row_start = effective_wg_tile_row_block * effective_wg_tile_rows;                   \
                  const uint effective_wg_tile_col_start = effective_wg_tile_col_block * effective_wg_tile_cols;                   \
                  const uint effective_wg_tile_row_end = min(effective_wg_tile_row_start + effective_wg_tile_rows, (uint) im_rows);\
                  const uint effective_wg_tile_col_end = min(effective_wg_tile_col_start + effective_wg_tile_cols, (uint) im_cols);
              
              
              #ifdef MAX_LABELS
              //assumption: nlabels << pixels, uint[MAX_LABELS] fits in lds
              //ldims: work factor of EFFECTIVE_WORKGROUP_TILE_ROWSxEFFECTIVE_WORKGROUP_TILE_COLS according to the repeat dimensions: WORKGROUP_TILE_DIMS
              //gdims: total number of labels divUp(img_dims, TILE_DIMS) * ldims
              __attribute__((reqd_work_group_size(WORKGROUP_TILE_COLS, WORKGROUP_TILE_ROWS, 1)))
              __kernel void blob_stats(
                  const uint im_rows, const uint im_cols,
                  __global const PixelT *image_p, uint image_pitch,
                  __global const LabelT* labelim_p, const uint labelim_pitch,
                  __global struct stats_record_t *stats_p, const uint max_labels
              ){
                  const size_t tid = get_local_linear_id();
                  const size_t wg_size = get_workgroup_size();
                  WORKGROUP_TILE_HEADER
              
                  ////adjust to true tile dimensions
                  effective_wg_tile_rows = effective_wg_tile_row_end - effective_wg_tile_row_start;
                  effective_wg_tile_cols = effective_wg_tile_col_end - effective_wg_tile_col_start;
              
                  PixelT pixels[WORKITEM_REPEAT_ROWS][WORKITEM_REPEAT_COLS];
                  LabelT labels[WORKITEM_REPEAT_ROWS][WORKITEM_REPEAT_COLS];
              
                  WORKITEM_REPEAT_FOR
                          //const bool valid_pixel_task = (c < im_cols) & (r < im_rows);
              
                          PixelT pixel = 0;
                          LabelT label = max_labels;//1 past valid records
              
                          if(valid_pixel_task){
                              pixel = pixel_at(PixelT, image, r, c);
                              label = min(label, pixel_at(LabelT, labelim, r, c));
                          }
                          assert_val(label <= max_labels, label);//includes catchall label
                          assert_val(label <= MAX_LABELS, label);//includes catchall label
              
                          pixels[i][j] = pixel;
                          labels[i][j] = label;
                      }
              
                  }
              
                  __local int stat_shm[MAX_LABELS + 1];
              
                  //bounding box topleft r
                  INIT_LDS_FIELD(INT_MAX)
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_min(&stat_shm[label], valid_pixel_task ? (int) r : INT_MAX);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_min, r_min)
              
                  //bounding box topleft c
                  INIT_LDS_FIELD(INT_MAX)
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_min(&stat_shm[label], valid_pixel_task ? (int) c : INT_MAX);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_min, c_min)
              
                  //bounding box bottomright r
                  INIT_LDS_FIELD(INT_MIN)
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_max(&stat_shm[label], valid_pixel_task ? (int) r : INT_MIN);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_max, r_max)
              
                  //bounding box bottomright c
                  INIT_LDS_FIELD(INT_MIN)
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_max(&stat_shm[label], valid_pixel_task ? (int) c : INT_MIN);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_max, c_max)
              
                  //area of component
                  INIT_LDS_FIELD(0)
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_inc(&stat_shm[label]);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_add, area)
              
                  //max pixel of component
                  INIT_LDS_FIELD(INT_MIN)
                  //for(uint l = tid; l < max_labels; l += get_local_size()){
                  //    stat_shm[l] = paste(PIXELT, _MIN);
                  //}
                  //lds_barrier();
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_max(&stat_shm[label], valid_pixel_task ? (int) pixel : INT_MIN);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_max, max_pixel)
              
                  //min pixel of component
                  INIT_LDS_FIELD(INT_MAX)
                  //for(uint l = tid; l < max_labels; l += get_local_size()){
                  //    stat_shm[l] = paste(PIXELT, _MAX);
                  //}
                  //lds_barrier();
                  WORKITEM_REPEAT_FOR_VALUES
                          atomic_min(&stat_shm[label], valid_pixel_task ? (int) pixel : INT_MAX);
                      }
                  }
                  lds_barrier();
                  SET_GDS_FIELD(atomic_min, min_pixel)
              }
              #endif
              
              #ifdef MAX_LABELS
              __kernel void blob_stats_gds
              #else
              __kernel void blob_stats
              #endif
              (
                  const uint im_rows, const uint im_cols,
                  __global const PixelT *image_p, uint image_pitch,
                  __global const LabelT *labelim_p, const uint labelim_pitch,
                  __global struct stats_record_t *stats_p, const uint max_labels
              ){
                  const uint r = get_global_id(1);
                  const uint c = get_global_id(0);
                  const bool valid_pixel_task = (r < im_rows) & (c < im_cols);
              
                  if(valid_pixel_task){
                      const PixelT pixel = pixel_at(PixelT, image, r, c);
                      const LabelT label = min(max_labels, pixel_at(LabelT, labelim, r, c));
                      assert_val(label < max_labels, label);
                      __global struct stats_record_t *stats = stats_p + label;
              
                      atomic_min(&stats->r_min, (int) r);
                      atomic_min(&stats->c_min, (int) c);
                      atomic_max(&stats->r_max, (int) r);
                      atomic_max(&stats->c_max, (int) c);
                      atomic_inc(&stats->area);
                      atomic_max(&stats->max_pixel, pixel);
                      atomic_min(&stats->min_pixel, pixel);
                  }
              }
              
              

               

               

               

              OK there's the source.  MAX_LABELS, when defined really could be something like 1/10 the number of pixels in the general case which easily exceeds shared memory limitations.  So I was looking for other strategies than the one above.  The end goal, no matter how it's done, is to compute bounding boxes of these labeled regions, number of pixels with that label (area), and minimum and maximum response values (see Connected Components Labeling for the algorithm that generates the input to this algorithm).  Something that is significantly better than the 30ms necessary for the GDS version but can handle more labels than the LDS version is what I'm after.

               

              The first thing I did was break the problem up spatially for tiles for localizing the atomic updates but that didn't change the timings.