5 Replies Latest reply on Jun 2, 2013 8:55 AM by shreedhar_pawar

    Difficulty in making a minor code parallel.


      I have an array of 100 elements and I want to copy some of them into a different array inside the kernel. If 'p' is the array with 100 elements and 'q' has, say, 40 elements, p can be indexed with get_global_id(0), but how do I increment the index into q in the kernel code? When there are more than 60,000 elements or so, local atomics won't work, so how do I do this?


      To make my question clearer, here is an example. For convenience, p has 10 elements {44, 64, 58, 93, 21, 32, 9, 4, 26, 47} and I want to insert {44, 93, 9, 4, 26} into q, so I might do:



      if(some condition)

          q[ j ]= p[ i ];

      My question is: how can j be incremented?

         I also can't use i in place of j, because then, depending on the if condition, some elements of q would be left unwritten and the result would not be a contiguous array. I also don't want any part of this to be done sequentially (by going back to the host code), since I have to do it many times in the actual code, with more than 100,000 elements each time.


      Message was edited by: Shreedhar Pawar

        • Re: Difficulty in making a minor code parallel.

          The fastest way seems to be to use j = i in the code, which would result in non-contiguous result storage. Otherwise, you can look into global atomics (Atomics Counter Sample from the APP SDK).
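For readers following the global-atomics suggestion, here is a minimal host-side sketch in plain C of how an atomic counter can hand out the index j. It is not the APP SDK sample: C11 atomic_fetch_add stands in for an OpenCL atomic increment on a global counter, the loop stands in for the kernel launch, and keep(), work_item() and compact_atomic() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for "some condition"; here: keep even values. */
static int keep(int value) { return value % 2 == 0; }

/* Body of one work-item; i plays the role of get_global_id(0).
   atomic_fetch_add returns the old counter value, so every kept
   element reserves its own distinct slot j in q. */
static void work_item(size_t i, const int *p, int *q, atomic_uint *counter)
{
    if (keep(p[i])) {
        unsigned j = atomic_fetch_add(counter, 1u);
        q[j] = p[i];
    }
}

/* Host-side stand-in for the kernel launch: runs the work-items one
   after another and returns how many elements were written to q.
   q must have room for n elements in the worst case. */
static size_t compact_atomic(const int *p, size_t n, int *q)
{
    assert(p != NULL && q != NULL);
    atomic_uint counter;
    atomic_init(&counter, 0u);
    for (size_t i = 0; i < n; ++i)
        work_item(i, p, q, &counter);
    return atomic_load(&counter);
}
```

Run sequentially like this, q comes out in the same order as p. On a real device the work-items reach the counter in no fixed order, so q would be contiguous but not necessarily ordered. The final counter value is the number of elements in q.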

          • Re: Difficulty in making a minor code parallel.

            Buddy Shreedhar,


            We have seen this problem before too, and I wrote a long note on how an inclusive scan can solve it.

            All you need to do is:

            1. Allocate another array of the same size as "p". Let us call this array "t". In a first kernel, do:

            if (some condition is true) {

               t[get_global_id(0)] = 1;

            } else {

               t[get_global_id(0)] = 0;

            }

            2. Perform an inclusive scan on "t".

            Thus, if "t" was "1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0" before the scan, it will be "1, 1, 1, 2, 2, 2, 2, 3, 4, 5, 5" after the scan.

            3. Now launch another kernel, which will do:

            if (some condition is true) {

               q[t[get_global_id(0)] - 1] = p[get_global_id(0)];

            }

            The inclusive scan counts the element itself, so subtract 1 to get a zero-based index into "q".
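The three steps above can be sketched on the host in plain C as follows. This is only an illustration under assumptions: the loops stand in for the two kernels, the scan is written sequentially, keep() and compact_scan() are hypothetical names, and the scatter subtracts 1 because an inclusive scan counts the element itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "some condition"; here: keep even values. */
static int keep(int value) { return value % 2 == 0; }

/* Compacts the kept elements of p into q and returns how many were written.
   t is scratch space of n ints; q needs room for n ints in the worst case. */
static size_t compact_scan(const int *p, size_t n, int *t, int *q)
{
    assert(p != NULL && t != NULL && q != NULL);

    /* Step 1 (first kernel): one 0/1 flag per element,
       with i playing the role of get_global_id(0). */
    for (size_t i = 0; i < n; ++i)
        t[i] = keep(p[i]) ? 1 : 0;

    /* Step 2: inclusive scan of t, in place (sequential here). */
    for (size_t i = 1; i < n; ++i)
        t[i] += t[i - 1];

    /* Step 3 (second kernel): each kept element writes itself to its
       own slot; t[i] - 1 turns the inclusive count into a 0-based index. */
    for (size_t i = 0; i < n; ++i)
        if (keep(p[i]))
            q[t[i] - 1] = p[i];

    /* The last scanned value is the number of kept elements. */
    return n ? (size_t)t[n - 1] : 0;
}
```

Unlike an atomic counter, this keeps the elements of q in the same order as in p, because each output index depends only on how many kept elements come before it.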



            Note that "scan" is now available as part of the "Bolt" library. Bolt is going to be of a lot of use to you all.

              • Re: Difficulty in making a minor code parallel.

                Thanks, Himanshu, for your answer, but I don't know the size of q in advance, which makes it a problem to put the selected elements of p into it. How do I count the number of ones and zeroes in t? If I can count them, I will know the size of q.


                Also, the Bolt library is in C++ and I am not familiar with the C++ wrapper (in fact, I don't even know the C++ language, and I am really sorry for this). Could you please tell me how to do the scan myself?

                  • Re: Difficulty in making a minor code parallel.

                    Count the number of 1s and you will get the size of "q".

                    And that will also be the last value after the scan anyway.
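As a rough idea of how the scan could be written by hand, and of how the size of q falls out of it, here is a plain C sketch of a Hillis-Steele style inclusive scan. It is an illustration under assumptions, not what Bolt does internally: the name inclusive_scan_passes and the scratch buffer tmp are made up for this example, and the inner loop stands in for one work-item per element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Inclusive scan of t[0..n-1] in the Hillis-Steele style.
   tmp is scratch space of n ints. In each pass the inner loop only
   reads t and only writes tmp, so its iterations are independent and
   could each be handled by a separate work-item.
   Returns the last scanned value, which is the number of ones when t
   holds 0/1 flags, i.e. the required size of q. */
static int inclusive_scan_passes(int *t, int *tmp, size_t n)
{
    assert(t != NULL && tmp != NULL);
    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t i = 0; i < n; ++i)
            tmp[i] = (i >= offset) ? t[i] + t[i - offset] : t[i];
        /* All of tmp must be complete before the next pass reads it. */
        memcpy(t, tmp, n * sizeof *t);
    }
    return n ? t[n - 1] : 0;
}
```

On a device, every pass has to finish before the next one starts, so across work-groups each pass would typically be its own kernel launch. This form does more additions than a sequential scan (about n log2 n in total), which is one reason a library scan is usually the better choice for large arrays.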


                    Bolt is not that hard, man, and C++ is not magic. It is mostly C, with just a few pluses here and there.

                    Here is the smallest Bolt program to perform a scan, presented to you by Devaguru Bruhaspati in association with Himanshu Gautam.


                    Happy reading, Shreedharji.

                    // BoltScan.cpp : Defines the entry point for the console application.

                    #include "stdafx.h"
                    #include <bolt/cl/scan.h>
                    #include <bolt/PerformanceCounter.h>

                    #include <iostream>
                    #include <vector>
                    #include <numeric>

                    int _tmain(int argc, _TCHAR* argv[])
                    {
                        size_t length = 1024*1024*8;
                        std::cout << "\nScan EXAMPLE \n";

                        // Create a device_vector of `length` ints, each initialized to 1
                        bolt::cl::device_vector< int > boltInput( length, 1 );
                        // In-place inclusive scan: element i becomes i + 1
                        bolt::cl::device_vector< int >::iterator boltEnd = bolt::cl::inclusive_scan( boltInput.begin( ), boltInput.end( ), boltInput.begin( ) );
                        return 0;
                    }