3 Replies Latest reply on Mar 2, 2015 5:54 AM by gpgpucoder

    Performance of work-item built-in functions

    gpgpucoder

I have questions on the performance of work-item built-in functions. I'm talking about these:

get_work_dim - Number of dimensions in use
get_global_size - Number of global work items
get_global_id - Global work-item ID value
get_local_size - Number of local work items
get_local_id - Local work-item ID
get_num_groups - Number of work groups
get_group_id - Work-group ID

Are these real function calls, or more like constants? I'm supposing the latter. I might typically write something like this for a starting location:

       

      int offset = get_group_id(0)*myitemsize*get_local_size(0);

      and initialize other similar offset variables for other things, also using the same built-ins. Or perhaps something like:

      int x = get_global_id(0);

If I were never to change x or offset, do I get a performance benefit by defining these variables in the first place? How are registers impacted? I was thinking about changing the above to something like this for readability, but I don't know the impact.

      #define x get_global_id(0)

        • Re: Performance of work-item built-in functions
          maxdz8

          gpgpucoder wrote:

          I might typically write something like this for a starting location:

          int offset = get_group_id(0)*myitemsize*get_local_size(0);

          and initialize other similar offset variables for other things, also using the same built-ins. Or perhaps something like:

          int x = get_global_id(0);

Leaving aside that you're not going to measure any performance difference in any case, as long as you call them only once...

           

It is my understanding that get_global_id(d) is in fact a register populated by the dispatcher before the kernel runs its first instruction. Feel free to rename it as you want; the compiler will happily detect this and remove the copy. I would suggest against calling it x, but in some cases it is acceptable.

           

          I haven't investigated the others but I'd expect them to come from registers as well.

           

Be sure to remember that int x = get_global_id(0) will most likely differ from get_group_id(0)*get_local_size(0)+get_local_id(0), because get_global_id includes the global work offset specified at dispatch (I often get lazy on this one).

          1 of 1 people found this helpful
            • Re: Performance of work-item built-in functions
              realhet

I think get_group_id(0) contains the base index too (the lower bound of the kernel domain). In the disassembly, get_group_id(0) uses one more instruction than get_global_id(0); I guess that extra instruction divides the groupsize-aligned lower bound by the group size.

Only get_local_id(0) can work directly from registers; all the others have to read extra data from memory (a buffer that contains all the info on the kernel domain).

In pure asm it is possible to calculate a 24-bit global id in one cycle with MAD24, but it is only 0-based.

               

Anyway, if the kernel is only a few instructions long this is a big penalty, but if the kernel is thousands of times bigger, these extra reads and calculations don't matter.

              • Re: Performance of work-item built-in functions
                gpgpucoder

Thank you for the insight. I didn't make it sufficiently clear that in some places I am presently calling those functions more than once, in order to initialize various indexes and offsets. So I wanted the best insight on how to streamline my usage, so as not to impact registers too much. In the brief experiments I've done on some very simplified code, I've seen no difference between using the function and a variable. I will review the generated IR code when I get a moment.