13 Replies Latest reply on Apr 2, 2013 10:31 PM by shreedhar_pawar

    How do I parallelize the following sequential code?


      I wanted to to make the following kernel code parallel,

        In the code below size of a is n, b and c is 8*n and of d is some value less than n(eg : 3*n/4)





           if (a[b[i]]!=a[c[i]])







      Since the number of elements of a and d aren't the same I am facing a problem to give i=get_global_id(0), since by doing this, in some elements of d there would be nothing placed if the 'if'' condition violates...! So how do I make parallel..?

      If not this then, is it possible to delete the "no value" positions of d in the kernel if I store the positions where the values are placed in d in a different array..?

        • How do I parallelize the following sequential code?

          you can try use for j atomic counter. AMD have extension which expose fast atomic counter which is about 5 times faster than normal atomic counter then look into that.

            • Re: How do I parallelize the following sequential code?

              Atomics is a good idea to use within a workgroup.


              However, since Shreedhar is looking for using "get_global_id()" for "i", he probably needs global atomics -- which is going to kill performance



              You need to allocate per-workgroup "D" buffer. Every workgroup will have its own D buffer that it will write to.

              Where to write into it, can be decided using some local-atomics among work-items.


              atomic_inc() will only increment the value so that you will only know how many values you are going to write.

              It will not tell you which index you need to write to.

              To get that point, you need to do a "prefix sum scan". please see below


              The best bet is to have a local array, and write the values on get_local_id()th entry out there.

              In the else part, you can probably write "-1" (or some invalid value) to signal that the entry is invalid.

              Once this is done, you need to copy this local array to per-workgroup D buffer.

              You have 2 options here:


              After the kernel is over, you can probably selectively copy the D buffer output into a CPU buffer.

              This work can be done by CPU.


              We can prune the local array (lets call this data array) and then copy it to the per-workgroup D buffer.

              Create another local array and fill it up with 1 or 0 depending whether you enter IF or ELSE.

              Perform a "prefix sum scan" on this array. Let us call this as "index array"

              Now, Copy the Local Data Array onto Per-workgroup D-buffer.

              Every workitem will look at value in the Data-Array... If this value is VALID then it will look into the "INDEX" array to find the location within the Per-workgroup D buffer and WRITE into it.


              I wll give a small example later.

                • Re: How do I parallelize the following sequential code?

                  Let us say

                  A = [10 , 14,  10, 12,  14,  89,  33,  89]

                  B = [0, 7, 3 ,4]

                  C = [2, 5, 4, 1]


                  Let us say 4 threads are working.

                  Workitem 0: A[0] == A[2]  => LOCAL_D[0] = 0,             LOCAL_INDEX_ARRAY[0] = 1

                  Workitem 1: A[7] == A[5] => LOCAL_D[1] = 7,              LOCAL_INDEX_ARRAY[1] = 1

                  Workitem2: A[3] != A[4]   => LOCAL_D[2] = INVALID, LOCAL_INDEX_ARRAY[2] = 0

                  Workitem3: A[4] == A[1] =>  LOCAL_D[3] = 4,              LOCAL_INDEX_ARRAY[3] = 1




                  Now, LOCAL_INDEX_ARRAY[] = (1, 1, 0, 1)

                  Perform Inclusive Scan Parallel:

                  LOCAL_INDEX_NEW_ARRAY[] = (0, 1, 2, 2)



                  if (LOCAL_D[get_local_id(0)] != INVALID)


                       PER_WORKGROUP_D[LOCAL_INDEX_NEW_ARRAY[get_local_id(0)]] = LOCAL_D[get_local_id(0)];



                  The sequential algorithm to perform "inclusive scan is as follows". There are ample examples out there that shows how to do this on GPU parallely.

                  B[0] = 0;

                  for (int i=1; i<N;i++)


                     B[i] = B[i-1] + A[i-1];




                    • Re: How do I parallelize the following sequential code?

                      Hey Himanshu,

                          I din't understand the explanation given by you about the prefix sum scan and even the example given by you.Some parts like you said "allocating per-workgroup "D" buffer" are really hard for me to grasp at this stage. . And if there is a way to increment the d buffer using atomic_inc, i'll declare all these arrays as local and then do the atomic operation, so that I gain some performance efficiency. I just want to know how this atomic_inc is used. By reading its explanation in specs, I seriously din't understand how to use it in my code, Please illustrate the atomic_inc operation by giving simple example code that involves increment as in my case ...!


                      Below is the parallel code which is giving a problem,


                      if (a[b[i]]!=a[c[i]])






                      I will give an eg: of what I exactly want to do.


                      a=[1, 6, 6 ,10 ,6, 20, 16, 20, 16, 20, 1, 6, 10]

                      b=[1, 2, 4, 6, 8,   9,   11, 13]

                      c=[3, 4, 6, 7 ,13 ,10   12  9]


                      Now here in this case a[b[0]]!=a[c[0]], so condition is satisfied and d[0]=b[0]=1; Now a[b[1]]==a[c[1]], so d[1] is left blank if I execute the parallel code. and a[b[2]]!=a[c[2]], so now d gets filled at its 2nd value i.e. in d[2], whereas I want it to be filled in d[1]... hence how do I make use of atomic_inc in this...?


                        • Re: How do I parallelize the following sequential code?



                          I understand your problem correctly.  And my answer perfectly answers what you are asking for.

                          Let me explain what i mean by "per-workgroup D buffer"


                          1. We don't know the size of "D" buffer because the actual size can be found only by running the kernel.

                              This is a Chicken-Egg Problem.


                          2. One way to solve this is to allocate a "D" buffer as big as "A" and divide it among the workgroups just like how you are doing it for A.


                          3. Now every workgroup works on a "Chunk" of D. This is what I meant by per-workgroup D-buffer.... Well, I have to accept that my nomenclature is very confusing. Sorry about that.


                          4. Now, Let us say that this per-workgroup D buffer (i.e D Chunk) has 256  elements (THus according to your code above, the workgroup size has to be 256)


                          5. Now, for any given workgroup -- it can populate anywhere between 0 to 256 elements into it. We will now this only by running the kernel.


                          6. Let us say you synchronize within the workgroup to create a local D array that holds the required elements.

                              Now, every workgroup will write consecutively in its D-chunk.

                              However, not all D-chunk will be utilized

                              So, there is a need to know which part of D-chunk has valid and which one has invalid.

                              This is easy to solve.

                              Every workgroup can first initialize their D-chunk to some invalid values like 0xDEADBEEF at start of day.

                              Note: 0xdeadbeef is a valid hexa decimal 32-bit unsigned integer.

                             After execution, you will find a D-chunk like this:

                             Valid, Valid, Valid, Valid ................., 0xdeadbeef, 0xdeadbeef, 0xdeadbeef


                          7. In the host code, you can copy out the D-array properly (from different D-chunks) into another buffer and copy that buffer to GPU.


                              In the original kernel itself, you can count the number of valid items (the last member of the prefix sum array + last element in Local_D_index array)

                              and write it in a per-workgroup integer (i.e. An integer array allocated by host -- whose size is equal to the number of workgroups spawned)

                              Once the kernel (Kerne A) is over, you can perform prefix-sum on this array using another kernel (Kernel B).

                              Once this new kernel (Kernel B) is over, you will know what is the actual size of "D".

                              Allocate a new buffer DNew

                             Now this prefix-sum array will have the indices in the new DNew where D-chunks need to be copied.

                             Now, Spawn another kernel (Kernel C) such that it has the same number of workgroups as Kernel A.

                             Each workgroup will look at the D-chunk output and read the index in DNew where it has to copy it and will do the copy.



                          Atomic Inc will not work for you. Because at global level, different workgroups can win over and your D array will not be written in proper order.

                          For more details on atomics, you can check the AtomicAdd and other atomic related samples in the APP SDK.

                          1 of 1 people found this helpful
                            • Re: How do I parallelize the following sequential code?

                              I came across a code which I feel solves this doubt, but I din't understand how is atomic_inc used here and what effect does it have on result array in the last 4 lines... Please Explain...!


                              __constant int tOffset = 8196; // one template size in memory (in bytes)
                              __kernel void matchImage6( __global unsigned char* image, // pointer to the base image
                                          int imgWidth, // base image width
                                          int imgHeight, // base image height
                                          int imgPitch, // base image pitch (in bytes)
                                          int imgBpp, // base image bytes per pixel
                                          __constant unsigned char* templates, // pointer to the array of templates
                                          int tWidth, // templates width (the same for all)
                                          int tHeight, // templates height (the same for all)
                                          int tPitch, // templates pitch (in bytes, the same for all)
                                          int tBpp, // templates bytes per pixel (the same for all)
                                          int diff, // max allowed difference of intensity
                                          int maxNonmatchQt, // max number of nonmatched pixels
                                          __global int* result, // results
                                                          ) {
                              int x0 = (int)get_global_id(0);
                              int y0 = (int)get_global_id(1);
                              if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
                              int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
                              for( int y = 0; y < tHeight; y++) {
                                  int ind = y * tPitch;
                                  int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
                                  for( int x = 0; x < tWidth; x++) {
                                      unsigned char c0 = image[baseImgInd];
                                      unsigned char c1 = image[baseImgInd + 1];
                                      unsigned char c2 = image[baseImgInd + 2];
                                      for( int i = 0; i < 6; i++)
                                          if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                                                          abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                                                          abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                                      ind += tBpp;
                                      baseImgInd += imgBpp;
                                  if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
                              for( int i = 0; i < 6; i++)
                                  if( nonmatchQt[i] < maxNonmatchQt) {
                                      unsigned int pos = atom_inc( &result[0]) * 3;
                                      result[pos + 1] = i;
                                      result[pos + 2] = x0;
                                      result[pos + 3] = y0;

                                • Re: How do I parallelize the following sequential code?

                                  Well, let me try to show you the issue here:


                                  for( int i = 0; i < 6; i++)

                                      if( nonmatchQt[i] < maxNonmatchQt) {

                                          unsigned int pos = atom_inc( &result[0]) * 3;

                                          result[pos + 1] = i;

                                          result[pos + 2] = x0;

                                          result[pos + 3] = y0;




                                  If you have say, 256 threads in total. All these threads have to go through above code, individually. Since the snippet has a atomic operation, all threads will be serialized, and this code will run serially.Now Performance may be good if you have 256 threads, but what if you have 8192 threads. That would result in massive serialization. So the approach above was suggested, to create small bins for result, and do atomics within a workgroup. This would enable different workgroups to run parallely to different CUs. And you would get better performance. But this would require another call later to accumulate the partial results.


                                  As about the current implementation, this would become slower and slower as problem size increase. And You still need to create the "d" array out of the "result" array, using a sorting algorithm.

                                  • Re: How do I parallelize the following sequential code?



                                    1. Global Atomics are slow as I pointed out above.


                                    2. In your code, Do you care about the order in which the D array is filled up? Because, when you do "atomics" like this, Any workitem could win the atomic increment and that thread would deposit his values out there. So, you cannot enforce the order if you do "atomics" like above. If you don't care about the order in which "D" array is filled, You can look at this kind of solution. but then, you should probably do a local atomics among workitems of a workgroup and then allow 1 work-item in a workgroup to perform global atomics. This will avoid the performance penalties

                                    1 of 1 people found this helpful
                                      • Re: How do I parallelize the following sequential code?

                                        Of course I care for the order of the D Array...  I also need to sort the D array ,... I have posted a radix sort code in another discussion, please have a look... Even there I face a same problem... so I have made the incrementing part on the host code and in the parallel code  I have inserted -1 values for the INVALID elements... but I think that going to the host code from a kernel and then again enqueing a new kernel also will degrade the performance...! Please look at that post of mine and tell whether I have written the code efficiently or no...?(the post is named as "Error :INVALID ADDRESS SPACE")