12 Replies Latest reply on Apr 30, 2009 10:57 PM by eduardoschardong

    global GPR vs. global data store

    hbliao

      There are several AMD slides floating around that talk about the GDS (global data store) on 4850/4870 GPUs. I am not sure how to access it in the current 1.21 beta release.

      Another question: the header files in the 1.21 beta use another term, global GPR. How does it differ from GDS?

      BTW, I dumped the properties reported by the CAL package on a 4850. It said GDS is not supported and Global GPR is supported. What's wrong with my GPU?

       

        • global GPR vs. global data store
          MicahVillmow
          GDS is not currently supported in CAL even though the hardware does have this feature. This is mainly because there is no global locking mechanism to synchronize on, and therefore there is no good way of using it. A Global GPR is a register that is shared among threads with the same thread index within their respective wavefronts. These are the shared registers in IL.
          For example, suppose you have two wavefronts with thread IDs 0-63 and 64-127. By declaring 1 shared register in your IL kernel, threads n and n + 64 can both read/write sr0. Shared registers are guaranteed atomic access only within a single instruction. So you can do a rmw operation on this register and another wavefront will see the updated value. This is useful for doing simple reductions in compute shader. You can do a simple reduction of any size in 3 passes instead of the log(n) passes that are currently required.
          It would go something like this:
          first pass:
          run 1 thread per data point and have it update a globally shared register (either min, max, sum, etc.)
          second pass:
          run 1 wavefront per SIMD, use the LDS to share data between threads, accumulate the results of all threads into a single thread, and write the result out to the global buffer
          third pass:
          run 1 wavefront and have it reduce the data from the global buffer to a single point

          This can only be guaranteed to work if you use calCtxRunProgramGridArray and set the array to be three passes.
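
A minimal Python sketch of the three passes described above, simulated on the CPU. It is not IL or CAL code. The SIMD count of 10, the round-robin placement of wavefronts on SIMDs, and the name `three_pass_sum` are assumptions made for illustration only.

```python
WAVEFRONT_SIZE = 64   # threads per wavefront, as in the post
NUM_SIMDS = 10        # assumed SIMD count, for illustration


def three_pass_sum(data):
    """Simulate the three-pass sum reduction on the CPU."""
    # Pass 1: one thread per data point. Threads with the same lane index
    # on the same SIMD accumulate into the same shared register.
    shared_regs = [[0] * WAVEFRONT_SIZE for _ in range(NUM_SIMDS)]
    for i, value in enumerate(data):
        wavefront = i // WAVEFRONT_SIZE
        simd = wavefront % NUM_SIMDS       # hypothetical round-robin placement
        lane = i % WAVEFRONT_SIZE
        shared_regs[simd][lane] += value   # stands in for the rmw on sr0

    # Pass 2: one wavefront per SIMD folds its 64 lane values into one
    # result (the LDS step) and writes it to the global buffer.
    global_buffer = [sum(lanes) for lanes in shared_regs]

    # Pass 3: a single wavefront reduces the global buffer to one value.
    return sum(global_buffer)


if __name__ == "__main__":
    print(three_pass_sum(list(range(1000))))
```

The pass count stays at three regardless of input size because pass 1 always leaves at most `NUM_SIMDS * WAVEFRONT_SIZE` partial results behind.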
            • global GPR vs. global data store
              hbliao

              Which one should be interpreted as the replacement for CUDA's atomic operations?

              It looks like a shared register cannot be mapped directly onto CUDA's atomic operations, since it doesn't allow sharing across different thread indices. How about future GDS support?

               

              • global GPR vs. global data store
                helmutb
                Hi Micah,

                Just a few general questions regarding the reduction process you described in your last post:

                0: calCtxRunProgramGridArray() requires groupsize, LDS size and shared registers to be identical for all invocations, but the number of threads to launch can be different. Is that correct?

                1: Assume I have finished the first pass and have a result in a shared register sr0. If I want to run one group per SIMD in the second pass, I need to launch 10 groups with a groupsize of 1024 and an LDS size of 16 bytes. Since sr0 in thread n holds the same data as sr0 in thread n+64, n+128 and so on, I only need to reduce threads 0 to 63 to a single point. When doing so, I don't think I have to insert a fence instruction between the LDS read and write operations, since threads 0 to 63 are part of a single wavefront and are therefore executed simultaneously. Is this correct?

                2: Since I have only numberOfSIMDs results which need to be reduced to a single point, I am considering doing this on a per-thread level in subsequent shaders. I am thinking of doing this by attaching the global buffer as a constant buffer. Is this possible and does this make sense? As far as I know, constant buffer fetches can be handled within ALU clauses, so this could be very cheap if I have a lot of texture fetches as well.

                Thanks in advance, and sorry for being a little bit OT,
                Helmut
                • global GPR vs. global data store
                  vvolkov

                   

                  Originally posted by: MicahVillmow GDS is not currently supported in CAL even though the hardware does have this feature. This is mainly because there is no global locking mechanism to synchronize on, and therefore there is no good way of using it.


                  But we do not necessarily need locks to operate on shared memory. What may suffice is a memory fence to enforce consistency, such as we have with LDS. Do you mean there is no global memory fence?

                    • global GPR vs. global data store
                      MicahVillmow

                      There is no way to synchronize across SIMDs on the HD4XXX series of chips, so it is not possible to enforce any type of memory consistency.

                        • global GPR vs. global data store
                          vvolkov

                           

                          Originally posted by: MicahVillmow There is no way to synchronize across SIMDs on the HD4XXX series of chips, so it is not possible to enforce any type of memory consistency.


                          If global data store permits reads and writes from threads running on different SIMDs, you can implement global barrier synchronization in software, can't you?

                            • global GPR vs. global data store
                              MicahVillmow

                              vvolkov,

                              The problem is that you need some way to do a read-modify-write atomically in order to do software-based synchronization.

                              There are some uses for GDS on HD4XXX, but they are not generic enough to allow for general usage and as such are not exposed.

                                • global GPR vs. global data store
                                  vvolkov

                                   

                                  Originally posted by: MicahVillmow

                                  The problem is that you need some way to do a read-modify-write atomically in order to do software-based synchronization.



                                  A barrier can be implemented without using atomic updates. Such barriers are used on multicore CPUs, and prototypes exist for NVIDIA GPUs. The idea is to avoid a race condition on updating a shared variable by replicating it across the thread array, so that each thread updates only its private copy. Suppose we have N threads. Then threads 2, ..., N increment variables 2, ..., N respectively and busy-wait until variable 1 is incremented. Thread 1 busy-waits until all variables 2, ..., N are incremented and then increments variable 1. This signals the other threads to proceed, which implements a barrier. No atomic updates are necessary since there are no race conditions. You may need to substitute "thread block" for "thread" to implement it on a GPU.
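
The scheme above can be sketched with ordinary CPU threads in Python. This is only an illustration of the flag protocol, not GPU code: the names `run_with_barrier`, `flags` and `log` are hypothetical, thread 0 plays the role of "thread 1" in the description, and the sketch relies on all threads running concurrently.

```python
import threading
import time


def run_with_barrier(num_threads, num_phases):
    """Run num_phases steps on num_threads threads with a flag-based barrier."""
    flags = [0] * num_threads   # flags[i] is written only by thread i
    log = []                    # (phase, thread id) records, in arrival order

    def barrier(tid, phase):
        target = phase + 1      # counters only grow, so no reset is needed
        if tid == 0:
            # The master waits until every other thread has bumped its own
            # flag, then bumps flag 0 to release them all.
            for other in range(1, num_threads):
                while flags[other] < target:
                    time.sleep(0)   # yield while busy-waiting
            flags[0] = target
        else:
            flags[tid] = target     # announce arrival in a private slot
            while flags[0] < target:
                time.sleep(0)       # wait for the master's release

    def worker(tid):
        for phase in range(num_phases):
            log.append((phase, tid))   # the "work" for this phase
            barrier(tid, phase)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_with_barrier(4, 3))
```

Every slot has exactly one writer, so no read-modify-write is ever contended. A busy-wait barrier like this can only complete if every participant is actually running at the same time.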

                                    • global GPR vs. global data store
                                      MicahVillmow

                                      vvolkov,

                                      This could possibly be done, but I'd posit that it is way too inefficient to deal with. You can actually try this method now, using the global buffer as a replacement for GDS, to get some timing numbers, but I'm guessing it would take tens to hundreds of thousands of cycles for a single barrier, which would make it neither efficient nor useful. I'd like to be proved wrong though.

                                      It probably would be quicker to do something similar to what mcuda does and break the kernel into multiple kernels at each global barrier point.

                                        • global GPR vs. global data store
                                          vvolkov

                                           

                                          Originally posted by: MicahVillmow vvolkov,

                                          This could possibly be done, but I'd posit that it is way too inefficient to deal with. You can actually try this method now, using the global buffer as a replacement for GDS, to get some timing numbers, but I'm guessing it would take tens to hundreds of thousands of cycles for a single barrier, which would make it neither efficient nor useful. I'd like to be proved wrong though.

                                          It probably would be quicker to do something similar to what mcuda does and break the kernel into multiple kernels at each global barrier point.



                                          I have performance numbers for NVIDIA GPUs using their "global memory". Running many such barriers back to back results in ~1-2 microseconds per barrier. This is close to 4 memory latencies, which sounds optimal for this algorithm.

                                          1-2 microseconds is still less than the ~3-7 microseconds required to launch a new kernel in CUDA. There was a guy on the NVIDIA forum who claimed to get speedups using this technique. There is a problem with memory consistency though.

                                          I don't have solid results on AMD GPUs yet, but it seems that launching a new kernel using calCtxRunProgram costs around 10 microseconds, which is ~10,000 shader clock cycles. So, synchronizing via the global buffer may still be faster than breaking into multiple kernels.

                                          Synchronizing via GDS might be even faster since it is on-chip, so it is likely to have lower latency than the global buffer.

                                          Vasily
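
The conversion from launch time to clock cycles in the post above can be checked with a short Python calculation. The clock values used here (625 MHz and 750 MHz) are assumed reference engine clocks for the HD 4850 and HD 4870, not figures from the thread, and `microseconds_to_cycles` is a hypothetical helper.

```python
def microseconds_to_cycles(microseconds, clock_mhz):
    """Convert a duration to clock cycles for a given clock in MHz."""
    # 1 MHz is one cycle per microsecond, so the product is a cycle count.
    return microseconds * clock_mhz


if __name__ == "__main__":
    # Assumed reference clocks, for illustration only.
    for name, mhz in (("HD 4850", 625), ("HD 4870", 750)):
        print(name, microseconds_to_cycles(10, mhz), "cycles per 10 us launch")
```

Under these assumed clocks a 10 microsecond launch is several thousand cycles, the same order of magnitude as the rough figure quoted above.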

                                            • global GPR vs. global data store
                                              eduardoschardong

                                              Can a wavefront finish before another one from the same kernel even starts?

                                              If so, it won't be possible for the first thread to busy-wait on all the others, but it may be possible for each thread (say, the first of each wavefront) to busy-wait on the previous one. This wouldn't give a perfect memory barrier but, assuming wavefronts are started in order and stores are executed in order too, it may allow ordered, serialized read/write access to each global variable.

                                              This may avoid launching a new stream in some algorithms.
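
A minimal Python sketch of the chained wait suggested above, using CPU threads to stand in for the first thread of each wavefront. The names `ordered_updates`, `done` and `order` are hypothetical, and the sketch says nothing about whether real wavefronts start or store in order.

```python
import threading
import time


def ordered_updates(num_wavefronts):
    """Serialize one update per wavefront by waiting on the previous one."""
    done = [False] * num_wavefronts   # done[w] is written only by leader w
    order = []                        # stands in for the shared global variable

    def leader(w):
        # Wait only for the immediate predecessor, never for all others,
        # so an earlier wavefront may finish before a later one starts.
        if w > 0:
            while not done[w - 1]:
                time.sleep(0)         # yield while busy-waiting
        order.append(w)               # this wavefront's serialized access
        done[w] = True                # release the next wavefront

    threads = [threading.Thread(target=leader, args=(w,))
               for w in range(num_wavefronts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(ordered_updates(8))
```

Because each leader depends only on its predecessor, the chain can make progress even when later wavefronts have not been scheduled yet, as long as they are started in order.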

                                               

                              • global GPR vs. global data store
                                MicahVillmow
                                helmutb,
                                0) This is correct; the setup parameters must be equivalent so that you are guaranteed that the wavefront spawn pattern is the same.
                                1) In order to ensure correctness of code on all graphics cards that have compute shader, fence instructions must be put in between read and write operations. The compiler will optimize any fence instructions that are not required because of wavefront sizes.
                                2) This is possible, but dynamically fetched constant buffer accesses are done via the vertex cache and not the constant cache. This drastically slows down constant buffer access.