12 Replies Latest reply on Jun 10, 2013 5:49 PM by realhet

    Useful method in CUDA '__any' : no equivalent in OpenCL ?

    spectral

      Hi,

      In CUDA there is an interesting method '__any' and I can't find anything equivalent in OpenCL.

      Here is what the documentation state :

      int __any(int predicate);

       

      "evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them."

      But it is difficult to have something equivalent in OpenCL ?



        • Useful method in CUDA '__any' : no equivalent in OpenCL ?
          Meteorhead

          You could make a __local int variable, that you atomically increase (__local atomic might not be as costly as __global) according to predicate, do a sync, ant after check value. It will only be non-zero, if somebody hit ATOMINC with predicate true.

          I don't know about relative speed, it might just be what the any() function does under CUDA, but it might be a bit slower. Anyhow... this should do pretty much the same.

            • Useful method in CUDA '__any' : no equivalent in OpenCL ?
              Fr4nz

               

              Originally posted by: Meteorhead You could make a __local int variable, that you atomically increase (__local atomic might not be as costly as __global) according to predicate, do a sync, ant after check value. It will only be non-zero, if somebody hit ATOMINC with predicate true.

               

              I don't know about relative speed, it might just be what the any() function does under CUDA, but it might be a bit slower. Anyhow... this should do pretty much the same.

               

              Good idea. And an OR, with the value 1 for example, should be enough instead of an increment (only to be done if a thread meets a specific condition).

              This also permits to avoid the need of an atomic OR write (we are only setting the first bit, so there's no information loss risk).

                • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                  spectral

                  An 'or' ? I don't see how and why it can avoid the use of an atomic operation !

                  '__any' is looking for the 32 (Warp = 32) values and check their state, so for me it should be something like this :

                   

                   



                  // CUDA : bool searchingLeaf = true; __local int searchingLeaf; if (lid < 1) searchingLeaf = 0; // CUDA : searchingLeaf = false; atomic_inc(&searchingLeaf); // CUDA : if (!__any(searchingLeaf)) break; if (searchingLeaf > 31) break;

                    • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                      Meteorhead

                      What Fr4anz is saying, is that if you have a control bit, and you're only interested in the fact, has anyone from the workgroup encountered this branch, than bitwise OR might be a good solution.

                      [See attached code here]

                      If you use a structure like this, you can tell whether any thread evaluated statement to be true between the two barriers. This isn't nice, but it sure is efficient.

                      __kernel void(...) { __local char ControlBit; barrier(CLK_LOCAL_MEM_FENCE); ... if(statement) ControlBit |= 0x1; ... barrier(CLK_LOCAL_MEM_FENCE); ... if(ControlBit) {...} else {...} ... }

                • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                  arsenm

                  I also am missing the warp voting functions from CUDA. Currently I need __all(), and I have a crude replacement where everything writes to a __local array, and then ands the pieces together. It also relies on the wavefront/warp lockstep behaviour to avoid using a barrier. OpenCL likes to hide these behind "workgroups" but it would be nice to have extensions to access features like this.

                    • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                      spectral

                      Thanks,

                      But "ControlBit |= 0x1;" is not atomic and you need something like to mimics __any :

                      ControlBit |= 1 << get_local_id(0);

                      Because it is not an atomic operation, the behavior is not determined and you can have lock in some cases ! So, I think that it's not really a solution :P

                        • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                          Fr4nz

                           

                          Originally posted by: viewon01 Thanks,

                           

                          But "ControlBit |= 0x1;" is not atomic and you need something like to mimics __any :

                           

                          ControlBit |= 1 << get_local_id(0);

                           

                          Because it is not an atomic operation, the behavior is not determined and you can have lock in some cases ! So, I think that it's not really a solution :P

                           

                           

                          I don't see what is the problem in avoiding the atomic op...as soon as:

                          1) you're only interested in the fact that at least one thread meets a given condition in a warp;

                          2) you're not interested in which was the last thread to meet the given condition;

                          3) every thread will write the same value (0x1) if it meets the condition;

                          you don't need an atomic access to the check variable, only a barrier before you check it.

                            • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                              galmok

                               

                              Originally posted by: Fr4nz 

                               

                              I don't see what is the problem in avoiding the atomic op...as soon as:

                               

                              1) you're only interested in the fact that at least one thread meets a given condition in a warp;

                               

                              2) you're not interested in which was the last thread to meet the given condition;

                               

                              3) every thread will write the same value (0x1) if it meets the condition;

                               

                              you don't need an atomic access to the check variable, only a barrier before you check it.

                               



                              I completely agree with this. I just have 1 question: Why use "ControlBit |= 0x1" and not simply "ControlBit = 1"? It would save 1 read/modify cycle and just uses the save cycle.

                                • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                                  antzrhere

                                  Basically...

                                   

                                  If you want to detect if ANY work item has completed then simply set ControlBit = 1, no atomics required. Then test if ControlBit==1

                                  You do not need to use atomics as a value of 1 will be set no matter which
                                  threads try to write and in any order, whether all threads write or just one.

                                   

                                   

                                   

                                   

                                  If you want to detect if ALL work items have completed a particular subtask then perform an atomic_inc(&ControlBit) and then test if ControlBit >= local_size.

                                  Atomics are required in the above case as two or more threads could read the same ControlBit value which *could* only result in an increment of +1 no matter how many threads write to this value.

                                   

                                   

                                  If you want to test if a particular work item have completed then use a
                                  32bit uint atomic_or(&ControlBit[work_item_id>>5],1<<(work_item_id&31))  and test if (ControlBit[work_item_id>>5]>>(work_item_id&31))&1 == 1

                                  Atomics are required in this case as threads could read the same value at the same time and only the thread that writes to memory last would have it's control bit set

                                   

                          • Useful method in CUDA '__any' : no equivalent in OpenCL ?
                            MicahVillmow
                            OpenCL does not have a concept of coding at the wavefront level, so this would not be efficient to do at the work-group level.

                            Basically you want to use an atomic_inc on the local variable and then check if it is >0 for __any and == get_group_size() for __all.

                            Micah
                              • Re: Useful method in CUDA '__any' : no equivalent in OpenCL ?
                                mannyk

                                Hi Micah,

                                 

                                I am interesting in using your recommendation. However the formatting of your response is hard for me to interpret:

                                "then check if it is >0 for __any and == get_group_size() for __all."

                                 

                                I assume '__any' is a local variable in LDS.  But what exact is the expression you suggest to check the result?

                                 

                                Thanks.

                                  • Re: Useful method in CUDA '__any' : no equivalent in OpenCL ?
                                    realhet

                                    Hi,

                                     

                                    "Basically you want to use an atomic_inc on the local variable"

                                    So you can have a local temp variable, and atomic_inc it if the predicate is true:

                                     

                                    local int tmp=0;

                                    if(predicate) atomic_inc(tmp);

                                     

                                    At this point you can check the value of tmp:

                                    "then check if it is >0 for __any"  -> means that if the value of tmp is greater than 0, then you've got the functionality of __any() on CUDA. Any workitem in the workgroup has predicate=TRUE.

                                    "== get_group_size() for __all." -> and this is when tmp==get_group_size(), so it's when all your workitems met the predicate. __all() on CUDA.