4 Replies Latest reply on Jul 19, 2010 6:50 PM by psath

    kernel creation help

    Chocrates

      I'm trying to figure out how to port some sequential code to a kernel so i can test the speedup, but im having a lot of trouble.  None of the tutorials ive found help with this aspect, only the setting up the kernel to run and recieve data.  So far i have this for multiplying polynomials together.

      sequential:

      int *mult(int size, int *a, int *b)
      {
          int *out, i, j;
          out = (int *)malloc(sizeof(int) * size * 2 - 1);
         
          for(i = 0; i < size; i++){
              for(j = 0; j < size; j++){
                  out[i + j] = (a * b[j]) + out[i + j];
              }
          }
          return out;
         
      }

       

      Naive Kernel:

       

      #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

      __kernel void poly_mul( __global int *a, __global int *b, __global int *c)
      {
        
          size_t i = get_global_id(0);
          size_t j = get_global_id(1);
        
              c[i + j] = (a * b[j]) + c[i + j];   
         
      }

      this doesnt return the correct result, i think this is because it of the c[i + j] on the end.  Anyone have any advice or a link to a good tutorial on this aspect of opencl?

       

      Thanks

        • kernel creation help
          mihzaha

          I'm a beginner, but it seams to me that you don't protect memory:

          for example, in your code both threads (1, 2) and (2, 1) write to the memory address c[3] at the same time different results, anything can end up there, or first result or second or a combination of the two (they are all in parallel so they do all the operations at the same time). So the result is just junk.

            • kernel creation help
              psath

              Yep, mihzaha is right. But he's missing an explanation of the underlying reason:

              Data layout in memory: I assume your source memory is a 2d array of ints, correct? Well in memory (and after passing to your kernel) these items have to be laid out sequentially, and in C its done in what's called "Row Major" format. You need to adjust your indexing from c[i+j] to something along the lines of c[i*rowSize+j] to switch from 2d indexing to 1d.

              Also, I presume you're doing a small problem size, so get_global_id is excusable, but you'll want to get used to using get_local_id and get_group_id. Once your problem size gets bigger you'll have to use multi-tiered indexing.

            • kernel creation help
              fpaboim

              If you want to use a 2d range, for this problem you have to be careful not to write to the same address concurrently with another kernel (race condition). I guess you could output a matrix size*size with:  matrix[i*size + j] = (a * b[j]) to perform all the multiplications in parallel then reduce the diagonals.. Its still a horrible algorithmn since you´ll have load balancing problems which arise given the nature of the polinomial multiplication producing more values in the midrange values (~principal diagonal) given the higher quantity of possible permutations, and therefore youll have a few units doing most of the work. Added to that youll have uncoalesced acesses along the diagonal, making the parallel reduction performance dismal. Developing an algorithmn to make this better than on a cpu - especially considering transfer costs - is not straightforward. Nevertheless, im sure you can find some literature online on this, since doing the same operation on openmp, for example, will produce the same race condition drawback.

                • kernel creation help
                  psath

                  Wow, I completely blew it on my first look at this problem. I blame lack of coffee

                  Algorithmically, your kernel is correct, but fpaboim hit the nail on the head. This problem results in a lot of threads trying to write to the same memory address simultaneously and bashing each other's data.

                  For polynomials of lower order you could probably get away with using atomic adds for the c[i + j] = (a * b[j]) + c[i + j]; calculation.

                  Is there any way you can map the algorithm so that each thread writes to a unique memory location and then afterwards call a reduction kernel which spawns a thread for each power? (AKA one thread sums up all the x^2 terms, another for x^3, another for x^4, ... x^n)