3 Replies Latest reply on Jan 6, 2014 7:03 PM by drallan

    Persistent __local variable  option as a fast communication between kernels

    tugrul_512bit

      For small patches of cloth animation, 2D fluid computation, or even some reduction techniques, will there be an option to keep __local variables resident until the next kernel execution, or at least when repeating the same kernel, without touching global memory?

       

      Basically, as a workaround for the ping-pong technique, so that everything is done purely on the GPU.
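For readers unfamiliar with the ping-pong technique mentioned above, here is a minimal CPU sketch in plain C. On the GPU the two arrays would be two __global buffers whose roles are swapped between kernel launches; the diffusion rule and all names here are illustrative, not from any particular library.

```c
#include <string.h>

/* One "kernel launch": a simple 1D diffusion step on a periodic grid,
   reading src and writing dst (never writing the buffer being read). */
void diffuse_step(const float *src, float *dst, int n) {
    for (int i = 0; i < n; i++) {
        float left  = src[(i - 1 + n) % n];
        float right = src[(i + 1) % n];
        dst[i] = 0.25f * left + 0.5f * src[i] + 0.25f * right;
    }
}

/* Ping-pong: run `steps` iterations, swapping buffer roles instead of
   copying data back after each step. The final result ends up in a. */
void run_pingpong(float *a, float *b, int n, int steps) {
    float *src = a, *dst = b;
    for (int s = 0; s < steps; s++) {
        diffuse_step(src, dst, n);
        float *tmp = src; src = dst; dst = tmp;  /* swap roles */
    }
    if (src != a) memcpy(a, src, (size_t)n * sizeof(float));
}
```

The point of the swap is that only pointers (or kernel arguments) change between launches; no data moves between host and device.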

       

      I'm an OpenCL beginner; I've already written some n-body and 2D fluid Java programs that harness the GPU via JOCL, and I'm looking for new optimizations.

       

      For example, how would 2D-fluid compute performance change if half of the local memory were dedicated to such communication between kernels? Does it drop due to poor utilization/occupancy, or does it increase because memory fetching is reduced by a good margin? For reference, I have an HD 7870 @ (1100/1200).

       

      It sounds weird, but if I have 1280 cores and n <= 1280 planets (64 computed per compute unit, updated by broadcasting), can I then do n-body calculations with this optimization using the full potential of 2.5 TFLOPS? Otherwise it doesn't put enough load on my GPU.

       

       

       

      Maybe multiplication of pre-cached matrices could be another example?
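Tiled matrix multiplication is the usual shape of that idea: each work-group pre-caches a tile of each input matrix in local memory before using it. A small CPU sketch of the blocking pattern, with the local arrays standing in for __local tiles (sizes and names are illustrative):

```c
#define N 4   /* matrix size */
#define T 2   /* tile size, playing the role of an LDS tile */

/* Blocked multiply C = A * B: copy T x T tiles of A and B into small
   "local" arrays before the inner product, mimicking how a work-group
   stages tiles in __local memory to cut global-memory traffic. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;
    for (int bi = 0; bi < N; bi += T)
        for (int bj = 0; bj < N; bj += T)
            for (int bk = 0; bk < N; bk += T) {
                float tA[T][T], tB[T][T];          /* pre-cached tiles */
                for (int i = 0; i < T; i++)
                    for (int k = 0; k < T; k++) {
                        tA[i][k] = A[bi + i][bk + k];
                        tB[i][k] = B[bk + i][bj + k];
                    }
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++)
                        for (int k = 0; k < T; k++)
                            C[bi + i][bj + j] += tA[i][k] * tB[k][j];
            }
}
```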

       

      Doable?

       

      Thanks for your time.

        • Re: Persistent __local variable  option as a fast communication between kernels
          drallan

          Although it's sometimes possible in practice, OpenCL does not support persistent variables or memory. On the other hand, OpenCL doesn't seem to clear private or local variables between runs. However, there are two substantial problems.

           

          1) Other programs may use the GPU and thus the memory.

          2) Work-groups do not run on the same physical compute unit (CU) on subsequent runs, so local ids are not fixed to the hardware locations containing the same memory. I think this is done to balance power consumption and thermal effects. The data does appear at the same address, just under the wrong ids.

           

          If you can overcome both of these, it can work and I have done it on occasion.

           

          First, your program must somehow identify the relevant memory; one way might be to leave a special tag that your program uses as its own group id. People who hack with assembly code can access hardware configuration registers that identify the CUs and the local memory layout, but using assembly code is not supported either and is a complex undertaking. Be forewarned.
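The tag idea above can be sketched on the CPU: simulate each CU's local memory as an array that survives between "launches", have each group stamp its data with a recognizable tag, and on the next launch read the tag to learn whose data the new group inherited. All names and the tag constant here are hypothetical, purely to illustrate the scheme.

```c
#define NUM_CU 4
#define TAG(gid) (0xC0DE0000u | (unsigned)(gid))

/* Simulated per-CU local memories that persist between "launches". */
static unsigned lds[NUM_CU][2];

/* Launch 1: a work-group with logical id gid, scheduled onto CU `cu`,
   stores its data prefixed with a tag encoding gid. */
void store_run(int cu, int gid, unsigned value) {
    lds[cu][0] = TAG(gid);
    lds[cu][1] = value;
}

/* Launch 2: a group lands on some CU and reads the tag to learn whose
   data it inherited; returns that logical group id, or -1 if the slot
   was clobbered (e.g. another program used the GPU in between). */
int identify(int cu, unsigned *value_out) {
    unsigned tag = lds[cu][0];
    if ((tag & 0xFFFF0000u) != 0xC0DE0000u) return -1;
    *value_out = lds[cu][1];
    return (int)(tag & 0xFFFFu);
}
```

The key point is that on the second launch the group trusts the tag it finds, not its own new group id, and then does the work belonging to the tagged data.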

           

          The other thing you need to do is make sure no other programs use the GPU between runs. Use a headless GPU card if possible; other programs should not be using the card.

           

          In some cases it might be worthwhile, particularly for short programs with a high ratio of data to compute (ALU operations). I found it useful for small, repetitious 1D and 2D wave functions that require only a few adds and multiplies.
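A 1D wave update is a good illustration of that data-to-ALU ratio: each point needs three neighbor-and-history reads but only a handful of arithmetic operations, so memory traffic dominates. A minimal leapfrog step in plain C (periodic boundary, names illustrative):

```c
/* One leapfrog step of the 1D wave equation on a periodic grid:
   next = 2*cur - prev + c2 * (laplacian of cur), where c2 = (c*dt/dx)^2.
   Only a few multiplies and adds per point, but three arrays touched,
   which is why keeping the data on-chip between steps pays off. */
void wave_step(const float *prev, const float *cur, float *next,
               int n, float c2) {
    for (int i = 0; i < n; i++) {
        float l = cur[(i - 1 + n) % n];
        float r = cur[(i + 1) % n];
        next[i] = 2.0f * cur[i] - prev[i] + c2 * (l - 2.0f * cur[i] + r);
    }
}
```

On the GPU, prev/cur/next would rotate roles between launches, the same ping-pong idea extended to three buffers.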

           

          VGPR registers can hold more data than LDS memory, but this will often fail, because the very intelligent OCL compiler may not actually keep your data in registers even though the program loads them.

           

          If not useful, it is still interesting to try.

            • Re: Persistent __local variable  option as a fast communication between kernels
              tugrul_512bit

              If the variables in one CU are unrelated to the contents of other CUs, and if even the most secure operating system lets OpenCL allocate a non-zeroed array (I don't know whether GPU memory has to be cleared on servers), and no other programs interfere, is it safe?

               

              What about VRAM corruption or bus errors? Do they tend to change the starting addresses of allocations?

               

              Also, those "VGPR" registers (private registers?) must be even faster than LDS, I assume. Then even bigger/more patches of cloth and fluid could be computed using both LDS and VGPRs (after disabling automatic register optimizations, if there is such an option).

               

              GPU-intensive kernels can overheat a GPU. Do I have to add a warning to the license file (of Khronos or some 3D engine) that says "Responsibility is yours, heats like FurMark!" (if I'm to distribute the OpenCL application for free)?

               

              Thank you.

                • Re: Persistent __local variable  option as a fast communication between kernels
                  drallan

                  tugrul_512bit wrote:

                   

                  If the variables in one CU are unrelated to the contents of other CUs, and if even the most secure operating system lets OpenCL allocate a non-zeroed array (I don't know whether GPU memory has to be cleared on servers), and no other programs interfere, is it safe?

                   

                  What about VRAM corruption or bus errors? Do they tend to change the starting addresses of allocations?

                   

                  Also, those "VGPR" registers (private registers?) must be even faster than LDS, I assume. Then even bigger/more patches of cloth and fluid could be computed using both LDS and VGPRs (after disabling automatic register optimizations, if there is such an option).

                   

                  GPU-intensive kernels can overheat a GPU. Do I have to add a warning to the license file (of Khronos or some 3D engine) that says "Responsibility is yours, heats like FurMark!" (if I'm to distribute the OpenCL application for free)?

                   

                   

                  Malloc does not apply to local or private variables in a CU; those variables must be declared and are static. The reason they are not cleared is probably that the programmer may not need them cleared, and clearing them takes time.

                   

                  As I mentioned, the OCL compiler is intelligent and not very controllable. Even if you declare an array of private (VGPR) registers, it may not use them if it finds a better way to compile the program, and then your data will be lost. One way to see what the code actually does is to look at the assembly in something like the Kernel Analyzer.

                   

                  Warning: the OCL compiler often changes with new versions of OpenCL, so there is a chance that program hacks like this will not work when distributed. Even changing the compiler optimization level can change the code. Instead of saying "Responsibility is yours, heats like FurMark!" you could say "Warning, works like Windows....."

                   

                  Yes, VGPRs are much faster; they are zero-wait. On a Tahiti 7970, VGPR bandwidth is ~23 terabytes/sec, while LDS reads are about ~3.8 terabytes/sec.

                   

                  Yes, a happy GPU can get hot.