22 Replies Latest reply on Feb 4, 2009 10:46 AM by ryta1203

    Performance in Brook+

    ryta1203
      A few questions and notes I have that I would like to confirm with everyone/AMD:

      1. Is anyone seeing a difference in a Multi-GPU system with CFX on or off?

      2. Is it beneficial to split a kernel up into multiple smaller kernels if the ALU:Fetch of the smaller kernels is 1.00 and the larger kernel ALU:Fetch is lower, say ~.83? For me, my results were worse with the multiple smaller kernels than the larger one, even though the ALU:Fetch was "better" for the smaller ones. Why is that?

      3. How expensive is branching? It seems to me that it's pretty expensive, but the KSA (while it does give you the CF instructions) doesn't really go into that.

      4. It's possible to write in Brook+ and then hand-tweak your IL kernels, right? This makes sense to me; I just haven't tried it yet. I'm asking because it seems that certain optimizations do no good in the KSA for the ALU:Fetch ratio but they DO reduce the number of IL instructions.

      This is all for now, I eagerly await responses.
        • Performance in Brook+
          ryta1203
          I also noticed that there are 4 components in a literal register; however, Brook+ does not group these literals when creating IL code. Would grouping them help to enhance performance?
          • Performance in Brook+
            MicahVillmow
            1) I'm not sure on this one.
            2) ALU:Fetch ratio is usually a good indicator of kernel performance, but there are other factors including control flow instructions, branching and memory access patterns that also affect the performance. The 1.0 ratio and the 0.83 ratio difference is fairly insignificant in comparison to the startup overhead of multiple smaller kernels.
            3) Branching is very expensive if done at a granularity smaller than a wavefront. If your whole wavefront all branches to one path, then every other path is skipped. If a single thread from a wavefront goes down a path, then all threads in that wavefront execute that path. So, say you have a 3 section if clause and 62 threads go down the first path, 1 thread goes down the second path and 1 thread goes down the third path. In this case, all threads go down every path. Branching is all or nothing on the GPU.
            4) With the 1.3 sdk this is easier to do, but certain things like the constant buffers are expected to be in specific locations.

            The literal registers are an IL language construct and they get inlined in the ISA. So grouping them will not enhance performance other than possibly being easier for the compiler to analyze for optimizations.
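The branching behaviour described in point 3 can be sketched as a toy model. This is only an illustration of the "all or nothing" rule as stated above, not a description of the real hardware; the function names and the per-path costs are made up for the example.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront on the HD 48xx series, per this thread


def paths_executed(path_per_thread):
    """Return the sorted branch paths a wavefront has to step through.

    path_per_thread holds one path id per thread. If any thread takes a
    path, the whole wavefront executes that path.
    """
    return sorted(set(path_per_thread))


def relative_cost(path_per_thread, path_cost):
    """Sum the cost of every path taken by at least one thread.

    path_cost maps path id -> cost in arbitrary, hypothetical units.
    """
    return sum(path_cost[p] for p in paths_executed(path_per_thread))


# The example from the post: 62 threads on the first path, 1 on the second,
# 1 on the third.
divergent = [0] * 62 + [1] + [2]
# A wavefront in which every thread takes the same path.
uniform = [0] * WAVEFRONT_SIZE
cost = {0: 10, 1: 10, 2: 10}  # made-up costs, equal for all three paths

print(paths_executed(divergent), relative_cost(divergent, cost))
print(paths_executed(uniform), relative_cost(uniform, cost))
```

In this model the divergent wavefront pays for all three paths even though only two threads leave the first one, while the uniform wavefront pays for a single path.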
              • Performance in Brook+
                ryta1203
                Originally posted by: MicahVillmow

                1) I'm not sure on this one.

                2) ALU:Fetch ratio is usually a good indicator of kernel performance, but there are other factors including control flow instructions, branching and memory access patterns that also affect the performance. The 1.0 ratio and the 0.83 ratio difference is fairly insignificant in comparison to the startup overhead of multiple smaller kernels.


                Micah,

                Thanks. I'm still a little confused about the ALU:Fetch ratio, is 1.0 = full occupancy OR > 1.0 = full occupancy?

                I ask because if you are ALU bound, then isn't that really the goal of doing GPGPU calculations? The problem really comes in when you are fetch bound no?

                Also, what about when you are Global Write bottlenecked?
              • Performance in Brook+
                MicahVillmow
                Ryta,
                A ratio of 4:1 or better is what you should be shooting for when looking at that ratio. The ALU:Fetch ratio itself will not tell you whether you are ALU bound or not; it only gives a rough estimate. Only a calculation of the time it takes to compute the ALU versus the time it takes to do the memory fetches can give you that information. A ratio of 1:1 usually means you are fetch bound.
                If you are ALU bound, then it will usually require better algorithms in order to increase performance.
                If you are global write bottlenecked, then you need to figure out how to burst your writes to achieve higher performance.
                  • Performance in Brook+
                    ryta1203
                    Micah,

                    Thanks again, sorry for all the questions. I'm just trying to get a clear idea of how I can use the KSA to help me increase performance.


                    On the GPU Tools forum, bpurnomo had this to say:

                    A texture fetch refers to a single memory access.

                    For the ALU:Fetch ratio, you want to be at ONE (it is a ratio). Yes, ONE means full occupancy.

                    High non-red numbers are bad as that means the system is not balanced. Red just means Fetch bound; it does not necessarily mean bad, and green means ALU bound. For example it is better at 0.9 red (close to balance) rather than 10.0 green.


                    Are you talking about the same ALU:Fetch ratio?

                    Also, can you point me to the documentation that talks about bursting memory accesses AND the documentation that talks about wavefronts and how it relates to the hardware? Thanks.
                  • Performance in Brook+
                    MicahVillmow
                    Sections 1.2 and 1.3 of the Stream Computing User Guide cover this information, along with formulas to calculate whether you are memory or ALU bound.
                    bursting memory: 1.2.5.4 Memory Stores
                    wavefronts: 1.2.4

                    As for bpurnomo, I'm not sure that he is correct. Full occupancy on the graphics card is determined by much more than just the ALU:Fetch ratio as the number of registers used and other resources are also taken into account. The way to have KSA help with performance is by letting you see what the ISA is and it can give you hints on where to start looking.

                    For example, matmult, which we know is a memory bound algorithm, has KSA reporting an ALU:Fetch ratio of 12.18 for the 4870. Even with this high ratio, the algorithm is still memory bound because the number of registers used does not allow enough wavefronts in flight to cover the latency of global buffer reads.

                    So, for rough estimates it is useful, but not for exact information.
                      • Performance in Brook+
                        ryta1203
                        Micah,

                        Thanks again for the reply, all this helps. I read that information in the SCUG, I was hoping for something more detailed, but I will take a look at the samples again and hopefully extract something from that.

                        It would be nice to have something that did give "full occupancy" details. For example, CUDA has both an Occupancy Calculator and a profiler, both of which can give you the occupancy of the GPU. The profiler also gives info like local/global/shared memory used and if it's coalesced or not. This is very useful information when attempting to optimize the kernels.

                        The KSA doesn't seem to have any of this information and the information it does give does not currently take into account the GPR usage, so even that information does not seem to be accurate.

                        I asked the GPU Tools about a profiler but they don't seem interested in it. Anyways, thanks again.

                          • Performance in Brook+
                            ryta1203
                            Where can I find the total number of registers (type and number) in a thread processor for a particular architecture?

                            Wouldn't this information give you vital intel on the number of threads that can be run by a SIMD engine?
                        • Performance in Brook+
                          MicahVillmow
                          http://www.anandtech.com/print...cle.aspx?i=3341


                          This has some information on registers on the RV770.
                          It is on page 5

                          Just do total number of registers per simd / simd thread width and you get max number of registers per thread
                            • Performance in Brook+
                              ryta1203
                              So help me along here please, I'm slow:

                              If there are 163,840 registers on the RV770 and it has 10 SIMD engines and my problem size is 1026,1026 (width is 1026) then:

                              163840/10 = 16384, and 16384/1026 = ~15.9, so 15??

                              This doesn't seem right, is it?

                            • Performance in Brook+
                              MicahVillmow
                              Ok,
                              So there are 163840 registers on the RV770. There are 10 SIMDs, so that gives us 16384 registers per SIMD, or 16K x 128bit as specified in the Registers per SIMD Core row.
                              Now, the article states right above the table that there are 64 threads per wavefront. So, 16384 / 64 gives you 256 registers per thread.
                              If you run a problem domain of 1026 * 1026, assuming 1 thread per location, that gives you 1,052,676 threads that need to be executed.
                              Divide that by the wavefront size, gives you 16449(must round up) wavefronts that will be spawned by the GPU for this domain.
                              Now, let's assume that you have 5 registers per thread (which can be determined from the KSA disassembly). This lets you run a MAX of (256/5) = 51 wavefronts in parallel per SIMD, or 510 at a time on the GPU.
                              So this means that you have enough wavefronts to fill up the GPU at least 32 times.

                              So, assuming that your application gets all of the resources on the chip, this is what you should expect. However, because of other constraints this is the best case scenario and not the average case. So this should give you some idea about what you can do.

                              Hope this helps. That review article is fairly well done, and if you analyze it with a compute mindset you can figure out a lot of things that our docs don't currently specify.
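The arithmetic in this post can be reproduced with a short script. It is a best-case estimate only, using the register count and wavefront size quoted above; the function name and the dictionary keys are made up for illustration, and it ignores every constraint other than registers.

```python
TOTAL_REGISTERS = 163_840  # RV770 register count quoted in this thread
NUM_SIMDS = 10
WAVEFRONT_SIZE = 64        # threads per wavefront on the HD 48xx series


def occupancy_estimate(width, height, regs_per_thread):
    """Best-case wavefront counts for a width x height domain, 1 thread per element."""
    regs_per_simd = TOTAL_REGISTERS // NUM_SIMDS            # 16384
    max_regs_per_thread = regs_per_simd // WAVEFRONT_SIZE   # 256
    threads = width * height
    # Round up: a partially filled wavefront still has to be launched.
    wavefronts = (threads + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
    wavefronts_per_simd = max_regs_per_thread // regs_per_thread
    wavefronts_per_gpu = wavefronts_per_simd * NUM_SIMDS
    return {
        "max_regs_per_thread": max_regs_per_thread,
        "threads": threads,
        "wavefronts": wavefronts,
        "wavefronts_per_simd": wavefronts_per_simd,
        "wavefronts_per_gpu": wavefronts_per_gpu,
        "gpu_fills": wavefronts / wavefronts_per_gpu,
    }


# The 1026 x 1026 domain with 5 registers per thread from the post.
for name, value in occupancy_estimate(1026, 1026, 5).items():
    print(name, value)
```

Raising `regs_per_thread` in this sketch shows how quickly the number of wavefronts in flight drops, which is the effect described earlier for matmult.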
                                • Performance in Brook+
                                  ryta1203
                                  Micah,

                                  Thanks again, I can see a little clearer now. I will read over the article very carefully, thanks for the link.
                                  • Performance in Brook+
                                    ryta1203
                                    Micah,

                                    Sorry for all the questions. I have another:

                                    Why is it registers per thread rather than per wavefront? How can you have 51*64 threads running per SIMD engine in parallel with only 16 thread processors per SIMD engine? I understand that each thread processor executes 2x2 (quad) threads over 4 cycles, thus really only 16 are "running" at one time, but that is still 51*16?

                                    If you have 5 regs/thread and 64 threads per wavefront, then that would be 5*64 registers needed, which exceeds the 256. Even if there are only 16 threads running at one time (16 thread processors/SIMD engine), then that would be 5*16 registers needed, which would leave some registers not in use?

                                    Also, how is the thread switching handled in the registers? Is the GPR count for a quad or for a single thread?

                                    Sorry, I'm just trying to get a crystal clear image so I can move forward with performance gains.
                                  • Performance in Brook+
                                    MicahVillmow
                                    Ryta,
                                    In my post above, I mentioned that there are 163840 registers per chip or 16384 per SIMD. With a wavefront size of 64, as it is on the HD48XX series, this gives a max of 256 registers per thread, not per wavefront. The register file is 256 deep and 64 wide. So, although according to the article it executes the 64 threads over 4 cycles with 16 running at once, threads in sequential cycles don't access the same column in the register file; they index into the columns of the register file by their position in the wavefront. So if you have 5 regs/thread and 64 threads/wavefront, that leaves you with 251 registers per thread for the rest of the wavefronts. So your single wavefront is using 320 registers in total, but each SIMD has 16K registers.
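One way to picture the 256-deep by 64-wide register file is as a grid in which the column is the thread's position in its wavefront and each resident wavefront owns a few rows. The sketch below assumes that each wavefront gets a contiguous band of rows; that allocation scheme is an assumption made for illustration and is not stated in the post, and the function names are hypothetical.

```python
DEPTH = 256  # rows: registers available to one thread position
WIDTH = 64   # columns: one per thread position in a wavefront


def register_slot(wavefront_index, thread_in_wavefront, reg, regs_per_thread):
    """Map (wavefront, thread, register) to a (row, column) slot.

    Assumption for illustration: wavefront k owns the contiguous rows
    [k * regs_per_thread, (k + 1) * regs_per_thread).
    """
    if not 0 <= thread_in_wavefront < WIDTH:
        raise ValueError("thread position must be in 0..63")
    if not 0 <= reg < regs_per_thread:
        raise ValueError("register index out of range for this kernel")
    base_row = wavefront_index * regs_per_thread
    if base_row + regs_per_thread > DEPTH:
        raise ValueError("register file is full; wavefront cannot be resident")
    return base_row + reg, thread_in_wavefront


def rows_left(resident_wavefronts, regs_per_thread):
    """Rows still free per thread position after the resident wavefronts."""
    return DEPTH - resident_wavefronts * regs_per_thread


regs = 5
print(regs * WIDTH)                  # registers one wavefront uses in total
print(rows_left(1, regs))            # rows left for other wavefronts
print(register_slot(50, 63, 4, regs))  # last register of the 51st wavefront
```

With 5 registers per thread, one wavefront occupies 5 rows across all 64 columns (320 registers) and leaves 251 rows, which matches the numbers in the post; a 52nd wavefront no longer fits in this model.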
                                      • Performance in Brook+
                                        ryta1203
                                        Originally posted by: MicahVillmow

                                        Ryta,

                                        In my post above, I mentioned that there are 163840 registers per chip or 16384 per SIMD. With a wavefront size of 64, as it is on the HD48XX series, this gives a max of 256 registers per thread, not per wavefront.


                                        I understand that, no problems there.

                                        The register file is 256 deep and 64 wide. So, although according to the article it executes the 64 threads over 4 cycles with 16 running at once, threads in sequential cycles don't access the same column in the register file; they index into the columns of the register file by their position in the wavefront. So if you have 5 regs/thread and 64 threads/wavefront, that leaves you with 251 registers per thread for the rest of the wavefronts.


                                        Do you mean 251 registers for the rest of the threads in the wavefront? Or for the rest of the wavefronts (64 threads each)? How can you have more than 1 wavefront running at a time when you have 4 threads running on each thread processor and you only have 16 thread processors, 16*4=64? In reality, you only have 16 threads running at any one instant since you only have 16 thread processors, correct?

                                        So your single wavefront is using in total 320 registers, but each simd has 16K registers.


                                        This doesn't explain how more than 16 threads can be running on a SIMD engine with only 16 thread processors at one time. If you only have 16 thread processors and each processor runs 1 thread at a time (switching between 4) then how can you get 51*64 threads running parallel?

                                      • Performance in Brook+
                                        MicahVillmow
                                        A wavefront executes a single instruction over all the threads in the wavefront in N cycles. Therefore, if it executes 16 threads per cycle, then according to the article a wavefront of size 64 executes an instruction in 4 cycles. The question is what happens when a clause is finished. This brings us back to Stream_Computing_User_Guide.pdf section 1.2.7, Stream Processor Scheduling. Figure 1.12 shows what happens when a thread, T0, finishes executing a clause and stalls for some reason, which in most cases is a memory access. T1 then executes until it stalls, followed by T2 and T3, all the way up to TN, where N is the number of threads able to execute with the available resources. This is a gross oversimplification, but it should emphasize the point. Replace T0 with Wavefront0 and you have a better idea of what our hardware does.
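The switching described here can be imitated with a very small simulation. This is a toy model of the oversimplified picture above, not of the actual scheduler: the cycle counts are invented, and every wavefront is assumed to alternate between a fixed-length ALU clause and a fixed-length memory stall.

```python
def simd_busy_fraction(num_wavefronts, alu_cycles, fetch_latency, clauses):
    """Fraction of cycles the SIMD spends on ALU work in a toy model.

    Each resident wavefront runs `clauses` ALU clauses of `alu_cycles` cycles
    and stalls for `fetch_latency` cycles after each one. Only one wavefront
    executes at a time; when it stalls, the next ready one takes over.
    All parameters are hypothetical numbers chosen by the caller.
    """
    ready_at = [0] * num_wavefronts         # cycle at which each wavefront may run again
    remaining = [clauses] * num_wavefronts  # clauses each wavefront still has to run
    now = 0
    busy = 0
    while any(remaining):
        runnable = [w for w in range(num_wavefronts)
                    if remaining[w] and ready_at[w] <= now]
        if not runnable:
            # Every unfinished wavefront is stalled: idle until one wakes up.
            now = min(ready_at[w] for w in range(num_wavefronts) if remaining[w])
            continue
        w = runnable[0]  # lowest-numbered ready wavefront, as in T0, T1, ...
        now += alu_cycles
        busy += alu_cycles
        remaining[w] -= 1
        ready_at[w] = now + fetch_latency
    return busy / now


# One resident wavefront cannot hide its own stalls; several can.
print(simd_busy_fraction(1, alu_cycles=4, fetch_latency=12, clauses=2))
print(simd_busy_fraction(4, alu_cycles=4, fetch_latency=12, clauses=2))
```

In this model the wavefronts are resident at the same time but take turns on the ALUs, so the number that fit in the register file determines how much of the memory latency can be covered.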
                                          • Performance in Brook+
                                            ryta1203
                                            I read the SCUG SPS, it's really straightforward; however, it gives an example of within a thread processor, which (according to the AMD docs) only runs a quad at a time (2x2 threads). Also, according to the AMD docs each thread processor can execute only 1 thread at 1 time and there are only 16 thread processors, so I'm still unsure of how you are executing multiple wavefronts at a time.

                                            I'm assuming there are things about the hardware AMD doesn't want to reveal. Micah, thanks again for your time and patience.
                                          • Performance in Brook+
                                            MicahVillmow
                                            The example might talk only about a thread processor, but it can be extrapolated and applied to all the thread processors on a SIMD.