21 Replies Latest reply on Apr 14, 2010 9:16 PM by MicahVillmow

    Some performance-related questions about current OpenCL implementation

    Raistmer
      1) Is it possible to have a private (per-thread) array stored in registers in OpenCL?
      It was not possible in Brook+.
      That is, if I write inside a kernel:
      float4 buf[32];
      will these 32 elements be placed into registers, or will they be spilled into global memory?

      2) If a write to a global memory buffer resides in the branch that is not taken (on a per-wavefront basis), will this write be avoided, or will zeros or junk be written anyway?
      Also, is such a write avoided on a per-wavefront or a per-thread basis?

      3) How many registers can be used per thread while still hiding global memory read latency more or less effectively? (Or: how many wavefronts per SIMD should be launched simultaneously to hide the read latency?)
        • Some performance-related questions about current OpenCL implementation
          nou

          1. Private arrays are placed in global memory; use local arrays instead. There was also a note that the devs are working on moving private arrays into registers.

          • Some performance-related questions about current OpenCL implementation
            MicahVillmow
            Raistmer,
            Our next release will move some private arrays into hardware-indexable temp registers. Depending on the size and usage, the compiler will determine whether the accesses are converted to registers or backed by global memory.
            • Some performance-related questions about current OpenCL implementation
              Raistmer
              Thanks.
              Could you give some insight into questions 2) and 3), please?
              • Some performance-related questions about current OpenCL implementation
                MicahVillmow
                Raistmer,
                2) if the branch is not taken, the write is not executed.
                3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.
                  • Some performance-related questions about current OpenCL implementation
                    Raistmer
                    Originally posted by: MicahVillmow

                    3) The register count affects the number of wavefronts that can execute on the same SIMD. The number of wavefronts that are required to hide the memory latency is algorithm/kernel dependent.


                    Consider this situation:
                    the kernel has memory load operations only at the very beginning; after that, all computation is carried out in registers.
                    That is, some values must be loaded from memory before computation can start, so each new wavefront issues its memory loads right after creation, no matter how long the whole kernel is. Computation can start only once the data is in registers, hence some latency from the memory read. As I understand it, in this situation the latency can't be hidden by computation inside the kernel itself (at least not until the very first wavefront has its data loaded); the only option is to keep the GPU busy by issuing new memory reads (issuing these probably takes some cycles too). How many wavefronts should be launched in this situation before the first one receives its data?
                    Or is it impossible to hide this latency at all, so the first few wavefronts will always suffer from the delay?
                      • Some performance-related questions about current OpenCL implementation
                        omkaranathan

                         

                        The first wavefront will always suffer from this latency. The concept of hiding memory latency by issuing more wavefronts becomes meaningful only after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.
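As a rough illustration of why the initial latency can only be overlapped, not removed: with assumed figures (not vendor-published numbers) for memory latency and per-wavefront issue cost, the back-of-the-envelope arithmetic might look like this sketch.

```python
import math

# All numbers below are illustrative assumptions, not specifications
# for any particular chip.
MEM_LATENCY_CYCLES = 400      # assumed global-memory read latency
ISSUE_CYCLES_PER_WAVEFRONT = 8  # assumed cycles a wavefront spends issuing
                                # its initial loads before it stalls

def wavefronts_to_cover_latency(latency, issue_cycles):
    """Wavefronts the scheduler must have in flight so that, by the time
    the last one has issued its loads, the first one's data is back.
    The first wavefront still pays the full latency regardless."""
    return math.ceil(latency / issue_cycles)

print(wavefronts_to_cover_latency(MEM_LATENCY_CYCLES, ISSUE_CYCLES_PER_WAVEFRONT))  # 50
```

With these assumed numbers, roughly 50 wavefronts of load-issue work would keep the memory pipeline busy through the first wavefront's wait; the point of the post above is that no count of wavefronts makes that first wait itself shorter.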



                          • Some performance-related questions about current OpenCL implementation
                            davibu

                             

                            Originally posted by: omkaranathan

                            The first wavefront will always suffer from this latency. The concept of hiding memory latency by issuing more wavefronts becomes meaningful only after the first wavefront starts executing. The initial setup time and the memory latency of the first wavefront will always be there.

                             

                            Hasn't Fermi introduced the capability to schedule multiple kernels concurrently exactly to solve this problem?

                             

                              • Some performance-related questions about current OpenCL implementation
                                _Big_Mac_

                                No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi. It's true, though, that the kernel concurrency they introduced is yet another step in pipelining.

                                As for moving arrays to registers - I presume indexing would have to be done entirely with literals in code, like array[2]. The moment you start using array[k], you force the compiler to assume arbitrary pointer arithmetic is involved, and I don't see how you could reliably implement pointer arithmetic over registers. What is register A plus 'k'? How can you take the address of a register?

                                  • Some performance-related questions about current OpenCL implementation
                                    davibu

                                     

                                    Originally posted by: _Big_Mac_ No matter how you play it, hiding latency by pipelining execution (i.e. switching to another wavefront/warp/kernel/context/whatever to do some meaningful work during the wait) will not shorten the initial wait. The requests for memory will not come back to the first requester any sooner, Fermi or no Fermi


                                     

                                    What if you use different kernels? Without concurrent kernels you pay the "initial wait" every time you switch to the execution of a new kind of kernel, while with them there is no such penalty. It looks like it can make a huge difference in any application chain running different kernels (i.e. quite a common case).

                                     

                                      • Some performance-related questions about current OpenCL implementation
                                        ryta1203

                                        Multiple kernels still won't really help with the initial wait.

                                        1st kernel, 1st warp: you still have to wait for the fetch units (which are busy) to finish.

                                        2nd kernel, 1st warp: you have to wait for fetches, but the fetch units are busy with the 1st kernel's 1st warp...

                                        ...so unless they have dedicated fetch units per kernel, which seems stupid.

                                        Multiple kernels mainly increase performance through better ALU utilization between two or more fetch-bound kernels.

                            • Some performance-related questions about current OpenCL implementation
                              MicahVillmow
                              _Big_Mac_,
                              Our hardware can index into registers via a special addressing mode; however, this will only occur with arrays that are fairly small and dynamically indexed (i.e. around 10 elements or less). Once the array gets above a certain size and requires too many registers, it gets pushed into memory.
                                • Some performance-related questions about current OpenCL implementation
                                  Raistmer
                                  Originally posted by: MicahVillmow

                                  _Big_Mac_,

                                  Our hardware can index into registers via a special addressing mode; however, this will only occur with arrays that are fairly small and dynamically indexed (i.e. around 10 elements or less). Once the array gets above a certain size and requires too many registers, it gets pushed into memory.


                                  That is very bad behavior when the big register file is used as a "cache" to avoid unneeded fetches from global memory.
                                  16K float4 registers per unit and only about 10 of them can be used as an array? Not good.
                                • Some performance-related questions about current OpenCL implementation
                                  MicahVillmow
                                  Raistmer,
                                  Each SIMD has 256x64 128-bit registers, so each thread in a wavefront has access to at most 256 registers. Some registers are reserved for temps, giving about 240 usable registers. These must be divided evenly between the wavefronts in a group. In OpenCL the default group size is 256 threads, or 4 wavefronts on the high-end chips. That leaves 60 registers per thread, so in this case 1/6th of the registers are used for this array. In practice 4 wavefronts is not enough to hide all the latency on the chip, so the compiler will attempt to place multiple groups on a single SIMD. If two groups are placed, then each thread gets 30 registers and 1/3 would go to indexing into an array. If three groups are scheduled on a SIMD, then half the available registers would be allocated to indexing into an array.

                                  So, as you can see, 10 vector elements can take up a fairly large amount of space in the register file.
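The arithmetic in the post above can be worked through explicitly. The 240-register figure, the 4-wavefronts-per-group default, and the 10-register array are taken from the post itself; the rest is plain integer arithmetic, not a hardware query.

```python
USABLE_REGS = 240          # ~256 registers minus those reserved for temps
WAVEFRONTS_PER_GROUP = 4   # default 256-thread group on the high-end chips
ARRAY_REGS = 10            # a 10-element float4 array, one register each

def regs_per_thread(groups_per_simd):
    """Registers left per thread when the register file is split evenly
    across all wavefronts resident on one SIMD."""
    return USABLE_REGS // (groups_per_simd * WAVEFRONTS_PER_GROUP)

for groups in (1, 2, 3):
    r = regs_per_thread(groups)
    print(f"{groups} group(s): {r} regs/thread, array takes {ARRAY_REGS}/{r}")
# 1 group  -> 60 regs/thread, the array takes 1/6 of them
# 2 groups -> 30 regs/thread, the array takes 1/3 of them
# 3 groups -> 20 regs/thread, the array takes 1/2 of them
```

This is why even a "small" 10-element dynamically indexed array becomes expensive once the compiler tries to co-schedule multiple groups per SIMD to hide latency.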
                                    • Some performance-related questions about current OpenCL implementation
                                      Raistmer
                                      Well, some tasks require more registers at the expense of the number of simultaneously running threads. This is even reflected in the OpenCL specs, where they talk about task-based parallelism.
                                      For example, if I need only a few threads, say 2 threads per SIMD (my GPU has 10 SIMDs, so it would be a kernel of 20 threads), is it possible to use 32*256 registers in one thread (per SIMD) and 32*256 registers in another?
                                      Or, no matter how many threads are available, will only 256 registers be allocated per wavefront?
                                      You gave an answer from the point of view of a fully used wavefront; I need as many free registers as possible per thread (work-item), even if there is only one thread per wavefront.

                                      EDIT:
                                      In short, what will happen if I write something like this in a kernel:

                                      float4 a0,a1,a2,a3,......a512;

                                      Will such a kernel fail? Will the compiler place most of the variables into global memory instead of registers? Is it possible to instruct the compiler to use more registers per thread than its default value (there is a corresponding compiler switch in CUDA; I don't know about OpenCL)?
                                      • Some performance-related questions about current OpenCL implementation
                                        Raistmer
                                        Originally posted by: MicahVillmow
                                        If three groups are scheduled on a SIMD, then half the available registers would be allocated to indexing into an array.


                                        To indexing?? That is, 10 registers are needed as overhead just to index into a single array, if it were implemented?? Why are so many registers used as an index???

                                        Maybe there is some misunderstanding here:
                                        float4 array[10];
                                        will actually eat 20 registers, not 10?
                                      • Some performance-related questions about current OpenCL implementation
                                        MicahVillmow
                                        Each thread is limited to 256 registers. Our register file is 64 wide, 256 deep and threads cannot access registers outside of their column.
                                        • Some performance-related questions about current OpenCL implementation
                                          MicahVillmow
                                          Raistmer,
                                          No, if you use more than 10 or so elements in your array (I'm not 100% sure about the cutoff limit), then you no longer get dynamic indexing into the register file; instead, the array will be pushed out to memory.
                                          • Some performance-related questions about current OpenCL implementation
                                            MicahVillmow
                                            Raistmer,
                                            If you use an array but don't dynamically index into it, the elements should be placed in registers, since an actual indexable array is not required. It is dynamic indexing into the array that causes the performance issues.
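The placement rules described across these replies can be summarized in one sketch. This is an illustrative model of the behavior discussed in the thread, not the actual compiler logic; `private_array_placement` and `CUTOFF` are hypothetical names, and the real cutoff is only "around 10 elements", not documented exactly.

```python
CUTOFF = 10  # assumed "around 10 elements or less" cutoff from the thread

def private_array_placement(num_elements, dynamically_indexed):
    """Where a private (per-thread) array lands, per the rules described
    in this thread (illustrative model only)."""
    if not dynamically_indexed:
        # Only literal indices (array[2]): each element can become an
        # ordinary register; no indexable array is needed at all.
        return "registers"
    if num_elements <= CUTOFF:
        # Small and dynamically indexed (array[k]): hardware-indexable
        # temp registers via the special addressing mode.
        return "indexed registers"
    # Too large to index within the register file: spilled to memory.
    return "memory"

print(private_array_placement(32, False))  # registers
print(private_array_placement(8, True))    # indexed registers
print(private_array_placement(32, True))   # memory
```

So a `float4 buf[32]` touched only with constant indices can still live entirely in registers; it is `buf[k]` with a runtime `k` that triggers the cutoff.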