38 Replies Latest reply on Aug 20, 2010 2:36 PM by ryta1203

    Increase GPR usage with new SDK and Driver?

    ryta1203

      I went from Catalyst 10.5 to 10.7 and SDK 2.1 to SDK 2.2 and now all my kernels have horrible performance and the register allocation is approximately DOUBLE!

      What happened?

        • Increase GPR usage with new SDK and Driver?
          ryta1203

          BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?

           

          BlackScholes example has gone from 16 to 31 GPRs? Is this correct?

            • Increase GPR usage with new SDK and Driver?
              ryta1203

              Also, this problem seems to only be with the 5870?

              For reported SKA GPR usage the 4870 is the same or better... ODD.

                • Increase GPR usage with new SDK and Driver?
                  Curiouscat

                  I'm using the SKA in SDK 2.2 to check kernels written for SDK 2.1, targeting the 5870. It's a mixed bag. Some kernels are reported to have better throughput, some worse, and some which were reported to use 0 scratch registers now use several (I'm seeing 9, 11 and 15) and have reduced throughput.

                  Another problem:

                  #pragma OPENCL EXTENSION cl_amd_fp64 : enable

                  yields the following error:

                  error: can't enable all OpenCL extensions or unrecognized OpenCL extension
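
                  One way to narrow this down is to check whether the device reports the extension at all. A minimal sketch in Python, assuming the space-separated string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS has already been fetched; the sample strings below are made up for illustration and do not come from any driver:

```python
def has_extension(extensions: str, name: str) -> bool:
    """Return True if `name` appears as a whole token in the
    space-separated CL_DEVICE_EXTENSIONS string."""
    return name in extensions.split()

# Hypothetical extension strings, for illustration only.
with_fp64 = "cl_khr_global_int32_base_atomics cl_amd_fp64 cl_khr_byte_addressable_store"
without_fp64 = "cl_khr_global_int32_base_atomics cl_khr_byte_addressable_store"

print(has_extension(with_fp64, "cl_amd_fp64"))
print(has_extension(without_fp64, "cl_amd_fp64"))
```

                  Splitting into tokens avoids a false match on a longer extension name that merely starts with the same text.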

                • Increase GPR usage with new SDK and Driver?
                  Raistmer
                  Originally posted by: ryta1203

                  BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?


                  My app has approximately the same performance under the new Catalyst + SDK 2.2.
                  On some workloads it became ~2% slower, on others even slightly faster.
                  Also, rebuilding with the new SDK had no effect on speed; the old binary and the new one execute at the same speed under the new SDK/driver.
                  But maybe my kernels just have no GPR pressure; I didn't check what happened with GPRs via SKA.

                  EDIT: BTW, I use an HD4870, so maybe my card isn't affected at all...
                    • Increase GPR usage with new SDK and Driver?
                      ryta1203

                       

                      Originally posted by: Raistmer
                      Originally posted by: ryta1203 BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?
                      My app has approximately the same performance under the new Catalyst + SDK 2.2. On some workloads it became ~2% slower, on others even slightly faster. Also, rebuilding with the new SDK had no effect on speed; the old binary and the new one execute at the same speed under the new SDK/driver. But maybe my kernels just have no GPR pressure; I didn't check what happened with GPRs via SKA. EDIT: BTW, I use an HD4870, so maybe my card isn't affected at all...


                      Yes, like I said, I'm not seeing a difference on the 4870 as far as GPR allocation is concerned (I haven't checked performance).

                      It's the 5870 (and probably the entire 58xx series) where my GPR usage has increased dramatically in most kernels.

                      EDIT: It's a concern for me since one of my kernels has gone from 31 GPRs to 50 GPRs. Simultaneous wavefronts dropped from 8 to 4, which is quite a difference in performance from using a "new" (and assumed "better") driver/SDK.
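
                      The drop from 8 to 4 is consistent with a simple occupancy estimate. A rough sketch in Python, assuming (this figure is not from this thread) that about 248 GPRs per thread are shared among the wavefronts resident on one SIMD, so the wavefront count is the integer quotient:

```python
# Assumption: roughly 248 GPRs per thread are available on a 58xx SIMD
# (256 minus a few reserved as temporaries). Not a figure from this thread.
AVAILABLE_GPRS = 248

def wavefronts_per_simd(gprs_per_thread: int) -> int:
    """Estimate how many wavefronts fit on one SIMD, limited by GPRs only."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return AVAILABLE_GPRS // gprs_per_thread

for gprs in (16, 31, 50):
    print(gprs, "GPRs ->", wavefronts_per_simd(gprs), "wavefronts")
```

                      Under that assumption, 31 GPRs gives 8 wavefronts and 50 GPRs gives 4, matching what I see. The estimate ignores every other limit on resident wavefronts, so treat it as an upper bound.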

                        • Increase GPR usage with new SDK and Driver?
                          Curiouscat

                          I've played some more with the SKA. An example of perplexing behaviour:

                          Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).

                          Now combine them to a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).

                          Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.
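
                          To make the expectation concrete, here is a small Python sketch of the formula above. The throughput values are hypothetical placeholders, not measurements from my kernels:

```python
def expected_combined_throughput(*throughputs: float) -> float:
    """Throughput of stages run back to back on each thread:
    the reciprocal of the summed per-thread times."""
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need at least one positive throughput")
    return 1.0 / sum(1.0 / t for t in throughputs)

# Hypothetical throughputs in MThreads/s, for illustration only.
thru_A, thru_B, thru_C = 300.0, 300.0, 300.0
print(expected_combined_throughput(thru_A, thru_B, thru_C))
```

                          With three kernels of equal throughput, the merged kernel should land at about one third of that figure; anything well below that points to extra cost such as scratch register traffic.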

                            • Increase GPR usage with new SDK and Driver?
                              ryta1203

                               

                              Originally posted by: Curious cat I've played some more with the SKA. An example of perplexing behaviour:

                              Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).

                              Now combine them to a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).

                              Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.

                              Have you looked at the ISA and played with moving instructions around?

                              It turns out that simply cascading kernels is not the best way to get good results. I won't get into this much, but it's not too hard to get the same register usage from the merged kernel as max(kernA, kernB, kernC); you will need to look at, and possibly move, the code.

                                • Increase GPR usage with new SDK and Driver?
                                  Curiouscat

                                   

                                  Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

                                  No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.
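
                                  A toy Python model of the expectation, under loud assumptions: each value is a made-up (first definition, last use) range, and register pressure is simply the peak number of overlapping ranges. It says nothing about what the CAL compiler actually does; it only shows why independent blocks run back to back should need max(A, B) registers, and how overlapping them raises the peak:

```python
def peak_pressure(ranges):
    """Peak number of simultaneously live values for inclusive (start, end) ranges."""
    events = []
    for start, end in ranges:
        events.append((start, 1))     # value becomes live at its definition
        events.append((end + 1, -1))  # value dies after its last use
    live = peak = 0
    for _, delta in sorted(events):   # at equal positions, deaths sort before births
        live += delta
        peak = max(peak, live)
    return peak

def shift(ranges, offset):
    """Move a block's live ranges later in the instruction stream."""
    return [(s + offset, e + offset) for s, e in ranges]

# Hypothetical live ranges for two independent blocks.
block_a = [(0, 3), (1, 3), (2, 3)]
block_b = [(0, 2), (1, 2)]

cascaded = block_a + shift(block_b, 4)  # B starts after A has finished
hoisted = block_a + shift(block_b, 1)   # B's definitions moved up into A

print(peak_pressure(block_a), peak_pressure(block_b))
print(peak_pressure(cascaded), peak_pressure(hoisted))
```

                                  In this model the cascaded order peaks at the larger of the two blocks, while the overlapped order peaks at their sum.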

                                    • Increase GPR usage with new SDK and Driver?
                                      genaganna

                                       

                                      Originally posted by: Curious cat
                                      Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

                                       

                                       

                                      No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

                                       

                                      Curious cat,

                                              Could you please post your three kernels here? That would help us see what is going wrong.

                                        • Increase GPR usage with new SDK and Driver?
                                          Curiouscat

                                           

                                          Curious cat,

                                                  Could you please post your three kernels here? That would help us see what is going wrong.

                                          No, but I could try creating an example and mailing it to you. It will have to wait a few days though (busy). If Micah Villmow still has the code I mailed him back in June when aticaldd was crashing (I don't), it might be enough to use the body of that kernel.

                                          • Increase GPR usage with new SDK and Driver?
                                            ryta1203

                                             

                                            Originally posted by: genaganna
                                            Originally posted by: Curious cat
                                            Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

                                             

                                             

                                            No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

                                             

                                            Curious cat,

                                                    Could you please post your three kernels here? That would help us see what is going wrong.

                                            Just take three samples and cascade them.

                                          • Increase GPR usage with new SDK and Driver?
                                            ryta1203

                                             

                                            Originally posted by: Curious cat
                                            Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

                                            No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

                                            Well, I agree with you, but that's not how it behaves, so what are you going to do?

                                            I'm pretty sure that AMD outsources their compiler (to VizExperts?), so that could help to explain things.

                                            As far as 10.7 goes, I just installed the 10.7 from the AMD.com drivers page. Is there a different one that I should be using?

                                            And Jawed, it's not just SKA, the PROFILER (stated previously) reports the exact same high GPR usage that the SKA reports.

                                • Increase GPR usage with new SDK and Driver?
                                  Jawed

                                  Are you using the update version of 10.7?

                                  update driver

                                    • Increase GPR usage with new SDK and Driver?
                                      Curiouscat

                                       

                                      Originally posted by: Jawed Are you using the update version of 10.7?

                                      update driver

                                      Yes, I have all the latest and greatest installed. Even did a complete uninstall and directory delete followed by reinstallation to be sure. No change. I now have kernels which used to have 0 scratch registers using 9, 11, 15 and 20 scratch registers, and feel like Sisyphus.

                                      It would be OK if performance improved by using more registers, but in all those cases it is reported to be down significantly.

                                      Does

                                      #pragma OPENCL EXTENSION cl_amd_fp64 : enable

                                      work for you?

                                      • Increase GPR usage with new SDK and Driver?
                                        ryta1203

                                         

                                        Originally posted by: Jawed Are you using the update version of 10.7?

                                        update driver

                                        Jawed,

                                        If you are talking to me then yes, per my original post.

                                          • Increase GPR usage with new SDK and Driver?
                                            Jawed

                                            There are two releases of Catalyst 10.7, so I can't tell which you are using.

                                            For this reason SKA cannot be relied upon, because when 10.7 is selected internally for compilations, you don't know which release of 10.7 is being used.

                                            (For the record: I've got no experience of any of this, as I haven't installed SDK 2.2, nor Catalyst 10.7, nor SKA 1.6).

                                          • Increase GPR usage with new SDK and Driver?
                                            davibu

                                             

                                            Originally posted by: Jawed Are you using the update version of 10.7?

                                             

                                            update driver

                                             

                                            Just for the record, this version of the drivers crashes the X server as soon as I start an OpenCL application (Ubuntu 10.04 64-bit, 5870+5850). Maybe it's something related to multi-GPU?

                                             

                                            Multi-GPU support seems to work really badly with standard Catalyst 10.7 since the introduction of SDK 2.2. I get the same performance with 1 or 2 GPUs.

                                             

                                              • Increase GPR usage with new SDK and Driver?
                                                davibu

                                                Here is the backtrace of the X server crash:

                                                 

                                                Backtrace:
                                                0: /usr/bin/X (xorg_backtrace+0x28) [0x4a3258]
                                                1: /usr/bin/X (0x400000+0x655bd) [0x4655bd]
                                                2: /lib/libpthread.so.0 (0x7f9e00f81000+0xf8f0) [0x7f9e00f908f0]
                                                3: /usr/lib/xorg/modules/drivers/fglrx_drv.so (0x7f9dfd6f2000+0x2af584) [0x7f9dfd9a1584]
                                                4: /usr/lib/xorg/modules/drivers/fglrx_drv.so (0x7f9dfd6f2000+0x2ad843) [0x7f9dfd99f843]
                                                5: /usr/bin/X (0x400000+0x30c3c) [0x430c3c]
                                                6: /usr/bin/X (0x400000+0x261aa) [0x4261aa]
                                                7: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7f9dffc78c4d]
                                                8: /usr/bin/X (0x400000+0x25d59) [0x425d59]
                                                Segmentation fault at address 0x8

                                                Caught signal 11 (Segmentation fault). Server aborting

                                                Please consult the The X.Org Foundation support
                                                         at http://wiki.x.org
                                                 for help.
                                                Please also check the log file at "/var/log/Xorg.0.log" for additional information.

                                                 

                                                  • Increase GPR usage with new SDK and Driver?
                                                    laobrasuca

                                                    Well, all I know is that my ALUBusy % dropped from 100% to 56% for one of my kernels, with no changes at all, not even in comments, if you see what I mean. This is really bizarre, not to mention that my kernel runs slower now! I can't tell about the GPRs because the SDK 2.1 profiler doesn't show this information in Visual Studio 2008 (I've got the Pro version). How can I check the GPRs with SDK 2.1? All I have is a figure of 50 GPRs from the new SDK 2.2 profiler.

                                                    One more thing: RAM<->VRAM transfers are slower with SDK 2.2 than with version 2.1, something like 20% slower (in both directions).

                                                     

                                                    PS: my card is an HD5770

                                                      • Increase GPR usage with new SDK and Driver?
                                                        ryta1203

                                                         

                                                        Originally posted by: laobrasuca Well, all I know is that my ALUBusy % dropped from 100% to 56% for one of my kernels, with no changes at all, not even in comments, if you see what I mean. This is really bizarre, not to mention that my kernel runs slower now! I can't tell about the GPRs because the SDK 2.1 profiler doesn't show this information in Visual Studio 2008 (I've got the Pro version). How can I check the GPRs with SDK 2.1? All I have is a figure of 50 GPRs from the new SDK 2.2 profiler.

                                                        One more thing: RAM<->VRAM transfers are slower with SDK 2.2 than with version 2.1, something like 20% slower (in both directions).

                                                         

                                                        PS: my card is an HD5770

                                                        I also have VS2008, check your profiler settings.

                                                        If that doesn't work, you can just dump the ISA and look at the bottom of that file. Or you can use the SKA to check the GPRs; just make sure that the SKA version matches your setup (for 2.1, the SKA that uses Catalyst 10.3, and for 2.2, the SKA that uses Catalyst 10.7).
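
                                                        If you want to script the ISA check, here is a minimal Python sketch. It assumes the dump ends with a resource summary containing a line of the form "NUM_GPRS = N"; the sample text is hypothetical and only illustrates the shape being matched, so adjust the pattern to whatever your dump actually contains:

```python
import re

def gpr_count(isa_text: str):
    """Return the last NUM_GPRS value found in the dump text, or None."""
    matches = re.findall(r"NUM_GPRS\s*=\s*(\d+)", isa_text)
    return int(matches[-1]) if matches else None

# Hypothetical tail of an ISA dump, for illustration only.
sample = """\
; resource summary (made-up example)
SQ_PGM_RESOURCES:NUM_GPRS     = 50
SQ_PGM_RESOURCES:STACK_SIZE   = 0
"""
print(gpr_count(sample))
```

                                                        Taking the last match mirrors looking at the bottom of the file; dump the ISA once per SDK/driver combination and compare the two numbers.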

                                                • Increase GPR usage with new SDK and Driver?
                                                  MicahVillmow
                                                  Ryta,
                                                  Our compiler stack for OpenCL is developed internally. However, the CAL compiler is fundamentally a graphics compiler, which has different requirements than a general purpose compute compiler. We are still working on fine tuning our stack for compute compiler loads and it looks like there are some cases where our tuning was less than optimal.
                                                    • Increase GPR usage with new SDK and Driver?
                                                      ryta1203

                                                       

                                                      Originally posted by: MicahVillmow Ryta, Our compiler stack for OpenCL is developed internally. However, the CAL compiler is fundamentally a graphics compiler, which has different requirements than a general purpose compute compiler. We are still working on fine tuning our stack for compute compiler loads and it looks like there are some cases where our tuning was less than optimal.


                                                      So you can confirm this (the dramatic increase in register usage)? I just want to know; it's not really affecting my work much, but I am still curious. Thanks.

                                                      It would be awfully difficult for developers to have to decide which SDK and drivers to use based on which ones perform better for their kernels. A slight +/- swing in performance is to be expected, but almost halving the performance of some kernels because the register allocation is broken makes things difficult.

                                                    • Increase GPR usage with new SDK and Driver?
                                                      MicahVillmow
                                                      Ryta,
                                                      Yeah, we see this internally. We are still looking into the root cause, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly what change, or series of changes, caused this to occur.
                                                        • Increase GPR usage with new SDK and Driver?
                                                          ryta1203

                                                           

                                                          Originally posted by: MicahVillmow Ryta, Yeah, we see this internally. We are still looking into the root cause, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly what change, or series of changes, caused this to occur.


                                                          Micah,

                                                            Ok, thanks again for confirming this, appreciate it.

                                                          • Increase GPR usage with new SDK and Driver?
                                                            ryta1203

                                                             

                                                            Originally posted by: MicahVillmow Ryta, Yeah, we see this internally. We are still looking into the root cause, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly what change, or series of changes, caused this to occur.


                                                            Sorry, so is this an SDK issue or a driver issue? I was just wondering if this might be fixable in 10.8 or 10.9 or will we have to wait for a new SDK version?

                                                          • Increase GPR usage with new SDK and Driver?
                                                            MicahVillmow
                                                            It is an issue with the CAL compiler which is shipped with the driver.