4 Replies Latest reply on Mar 13, 2017 10:59 AM by boxerab

    Global memory for work item cache

    boxerab

      Due to local memory limitations, I need to use global memory as a cache for my work items.

      Suppose I have 1000 work groups with 64 work items each. Each work item needs a 4K cache, which doesn't need to persist after the work item completes.

      I will allocate a single global memory buffer and assign one 4K chunk to each work item.

       

      (I am targeting AMD GPUs)

       

      What is the minimum buffer size I would need to guarantee that there are no concurrency issues between work items?

      Since AMD GPUs have <= 64 CUs, my guess is

      64 * 128 * 4000 bytes, using (global work item ID % (64*128)) to assign a cache chunk to each work item.
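
      As a point of comparison, here is a minimal sketch of the one layout that needs no scheduling assumptions at all: one chunk per work item across the whole NDRange. It assumes "4K" means 4096 bytes, and the helper names are hypothetical, chosen only for illustration.

      ```c
      #include <stdint.h>

      /* Assumption: "4K" in the post means 4096 bytes. */
      enum { CHUNK_BYTES = 4096 };

      /* Bytes needed when every work item in the NDRange owns its own chunk. */
      uint64_t per_item_cache_bytes(uint64_t num_groups, uint64_t group_size) {
          return num_groups * group_size * CHUNK_BYTES;
      }

      /* Byte offset of the chunk owned by a given global work item ID. */
      uint64_t chunk_offset(uint64_t global_id) {
          return global_id * CHUNK_BYTES;
      }
      ```

      With 1000 groups of 64 work items, per_item_cache_bytes(1000, 64) comes to 262,144,000 bytes (250 MiB). Any smaller buffer has to rely on knowing how many work items can be in flight at once.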

       

      Thanks,

      Aaron

        • Re: Global memory for work item cache
          dipak

          Hi Aaron,

          Sorry, I don't understand the above calculation. I think the exact number depends on the device's capabilities as well as on the application itself. You have to consider the maximum number of in-flight wavefronts to avoid any concurrency issues.

           

          Regards,

            • Re: Global memory for work item cache
              boxerab

              Thanks, Dipak. What would be a safe max for in-flight wavefronts on current hardware? Also, can I guarantee that work items are scheduled in order of their global work item id?

              If I can guarantee this, and M is the max number of in-flight wavefronts, each with 64 work items, then

               

              (global work item id) % (M*64)

               

              would give me a unique id into my cache that would prevent concurrency issues.

                • Re: Global memory for work item cache
                  dipak
                  What would be a safe max for in-flight wavefronts on current hardware?

                  On a GCN device, the theoretical limit on in-flight wavefronts is numOfCU * 4 * 10, since each CU has 4 SIMD units and each SIMD unit has an instruction buffer for 10 wavefronts.
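
                  To make the arithmetic concrete, here is a rough sketch of that limit and the cache size it implies. The constants come from the formula above plus the 64 work items per wavefront mentioned earlier in the thread; the function names and the chunk size passed in are assumptions for illustration only.

                  ```c
                  #include <stdint.h>

                  enum {
                      SIMDS_PER_CU   = 4,   /* SIMD units per GCN compute unit */
                      WAVES_PER_SIMD = 10,  /* wavefront slots per SIMD unit */
                      WAVEFRONT_SIZE = 64   /* work items per wavefront */
                  };

                  /* Theoretical maximum number of wavefronts in flight on the device. */
                  uint64_t max_inflight_wavefronts(uint64_t num_cu) {
                      return num_cu * SIMDS_PER_CU * WAVES_PER_SIMD;
                  }

                  /* Theoretical maximum number of work items in flight on the device. */
                  uint64_t max_inflight_work_items(uint64_t num_cu) {
                      return max_inflight_wavefronts(num_cu) * WAVEFRONT_SIZE;
                  }

                  /* Cache bytes needed if every in-flight work item holds one chunk. */
                  uint64_t worst_case_cache_bytes(uint64_t num_cu, uint64_t chunk_bytes) {
                      return max_inflight_work_items(num_cu) * chunk_bytes;
                  }
                  ```

                  For 64 CUs this gives 2560 wavefronts, or 163,840 work items, and 671,088,640 bytes (640 MiB) at 4096 bytes per chunk. That is more than the 64,000 work items in the original question, so for that problem size the theoretical limit does not reduce the buffer.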

                   

                  can I guarantee that work items are scheduled in order of their global work item id ? 

                  I don't think you can assume in-order execution unless you create your own chain of dependencies. In that case, your program may under-utilise the capability of the device it runs on.

                  Generally, an ND-range consists of multiple work-groups, and work-groups are assigned to multiple CUs. All work-items in a work-group are executed on the same CU; however, a CU can process multiple work-groups at a time. CUs operate independently of each other. It is also possible for different SIMDs within a CU to execute different instructions. As a result, work-items should not be expected to be processed in order of their global id.

                  If I can guarantee this, then if let's say M is max in-flight wavefronts, and each wavefront has 64 work items,  then (global work item id) % (M*64) would give me a unique id into my cache that would prevent concurrency issues.

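                  I think the answer is the same as above: since in-order scheduling can't be assumed, this indexing is not guaranteed to be free of conflicts.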
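
                  A small host-side sketch of why the modulo scheme depends on in-order scheduling. The function name is hypothetical, and the IDs in the note below are made-up examples; it only checks whether a given set of simultaneously in-flight global IDs would share a cache slot.

                  ```c
                  #include <stdbool.h>
                  #include <stddef.h>
                  #include <stdint.h>

                  /* Returns true if any two of the n in-flight global IDs map to the
                   * same slot under (id % num_slots). Simple O(n^2) pairwise check. */
                  bool modulo_slots_collide(const uint64_t *ids, size_t n, uint64_t num_slots) {
                      for (size_t i = 0; i < n; ++i) {
                          for (size_t j = i + 1; j < n; ++j) {
                              if (ids[i] % num_slots == ids[j] % num_slots) {
                                  return true;
                              }
                          }
                      }
                      return false;
                  }
                  ```

                  With 4 slots, the in-flight set {0, 1, 2, 3} has no collision, but {0, 1, 2, 4} does, because IDs 0 and 4 both map to slot 0. The second case can arise whenever a later work item starts while an earlier one with the same remainder is still running.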

                   

                  Regards,