      I'm currently working on an application that uses a "recursive" scheme, something like


      N(i) = f( N(i-1) )


      So a single sequence cannot be parallelized, but in my context I can use several 'N', i.e. I can start P threads, each computing its own sequence of N.

      The goal is to minimize the number of N, and I would like to know how many computations I can run in parallel.


      I.e., what is the number of threads (work-items) I can launch together before reusing the same set of N?
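To make the setup concrete, here is a minimal sketch in C of P independent sequences. The step function `f` (a simple linear congruential step) and the name `advance_chains` are hypothetical stand-ins, not taken from the actual application. Each chain is strictly sequential inside, but the chains do not depend on each other, so each iteration of the outer loop could map to one work-item.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical step function standing in for f in N(i) = f(N(i-1)). */
static uint32_t f(uint32_t n)
{
    return n * 1664525u + 1013904223u;
}

/* Advance p independent chains by `steps` iterations each.
 * state[k] holds the current N of chain k and is updated in place. */
void advance_chains(uint32_t *state, size_t p, size_t steps)
{
    for (size_t k = 0; k < p; ++k) {          /* independent: parallelizable */
        uint32_t n = state[k];
        for (size_t i = 0; i < steps; ++i)    /* dependent: must stay serial */
            n = f(n);
        state[k] = n;
    }
}
```

The smaller p is, the fewer N values have to be kept, which is the quantity the question wants to minimize.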

          The simple rule is: as many work-items as you can. For older cards, the absolute minimum number of work-items is the number of compute units * 64. For the new GCN architecture, such as the 7xxx series, it is the number of compute units * 64 * 4.

              Thanks nou,


              But the goal is to have the minimum number (this greatly boosts the performance of my application)! If possible, something that can be computed for both CPU and GPU!


              I don't want to hard-code this number! But I would already be happy with something like:


              int GetMinWorkSize()
              {
                int unitsCount = clGetInfo(Units); // pseudocode: is this information available?

                int rule = 64 * 4; // minimum work-items per compute unit on GCN

                return rule * unitsCount;
              }



              Maybe I also have to take the kind of device (CPU or GPU) into account, and maybe other information can help?
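For what it's worth, OpenCL does expose the compute-unit count: `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS` returns it as a `cl_uint`, and `CL_DEVICE_TYPE` tells CPU from GPU. Below is a minimal sketch of the rule quoted earlier in the thread. It takes the compute-unit count as a parameter so it compiles without the OpenCL headers; the names `min_work_items` and `gpu_generation` are hypothetical, and the thread gives no rule for CPU devices, so none is assumed here.

```c
#include <stddef.h>

/* Hypothetical tag for the two cases covered by the rule of thumb. */
enum gpu_generation { GPU_PRE_GCN, GPU_GCN };

/* Minimum work-item count according to the rule of thumb from the thread:
 *   older cards: compute_units * 64
 *   GCN (7xxx) : compute_units * 64 * 4
 * compute_units is expected to come from CL_DEVICE_MAX_COMPUTE_UNITS. */
size_t min_work_items(size_t compute_units, enum gpu_generation gen)
{
    const size_t wavefront = 64;                          /* work-items per wavefront */
    const size_t waves_per_cu = (gen == GPU_GCN) ? 4 : 1; /* factor from the rule */
    return compute_units * wavefront * waves_per_cu;
}
```

For example, with 32 compute units the GCN rule gives 32 * 64 * 4 = 8192 work-items, while the older rule gives 32 * 64 = 2048.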