20 Replies Latest reply on Feb 21, 2013 7:15 AM by yurtesen

    OpenCL performance dropped down 12.10 >> 13.1

    darkmen

      Hi everyone.

      Today I updated the AMD Catalyst drivers to 13.1 and got a 20% performance loss on my HD7970.

      Has anyone else experienced the same?

      Also, what is the easiest way to roll back to 12.10? Uninstalling 13.1 and reinstalling 12.10 gives the same lower speed (OpenCL still reports the NEW runtime version).

        • Re: OpenCL performance dropped down 12.10 >> 13.1
          Claggy

          I reported that last week too:

           

          http://devgurus.amd.com/message/1286437#1286437

           

          I had to delete a whole lot of files to be able to reinstall Cat 12.8.

          Since then, an AMD Catalyst Un-install Utility has appeared on the AMD Game Driver download site:

           

          http://sites.amd.com/us/game/downloads/Pages/catalyst-uninstall-utility.aspx

           

          I haven't tried it properly yet, except that it didn't work on Vista; it says it is for Windows 7 only.

           

          Claggy

          • Re: OpenCL performance dropped down 12.10 >> 13.1
            darkhmz

            Hi!

             

            I have experienced the same issue with Catalyst 13.1. In my case the performance drop was around 39% on my HD5830. I've tested kernel performance with different versions of amdocl.dll and the OpenCL version shipped with Catalyst 13.1 was the worst. According to APP profiler, kernel execution times were ~17.51ms and ~24.38ms (12.10 vs 13.1).
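As a quick check of the figures in this post, here is a minimal sketch of the arithmetic. The function names are made up for illustration; only the two timings come from the post. It suggests the ~39% figure is the increase in kernel time, which corresponds to a smaller drop when expressed as throughput.

```python
# Kernel execution times reported by APP Profiler in the post above.
T_OLD_MS = 17.51  # Catalyst 12.10
T_NEW_MS = 24.38  # Catalyst 13.1


def time_increase_pct(old_ms, new_ms):
    """Percentage by which the kernel time grew (hypothetical helper)."""
    return (new_ms - old_ms) / old_ms * 100.0


def throughput_drop_pct(old_ms, new_ms):
    """Percentage of work per unit time lost (hypothetical helper)."""
    return (1.0 - old_ms / new_ms) * 100.0


if __name__ == "__main__":
    print(f"time increase:   {time_increase_pct(T_OLD_MS, T_NEW_MS):.1f}%")
    print(f"throughput drop: {throughput_drop_pct(T_OLD_MS, T_NEW_MS):.1f}%")
```

Which of the two numbers to quote depends on whether you measure time per kernel or work per second, so it helps to say which one is meant when comparing reports.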

            • Re: OpenCL performance dropped down 12.10 >> 13.1
              darkmen

              Hi, I have just tried the 13.2 version with OCL runtime 1124.2.

              Performance drops even further than with 13.1.

              It all comes down to the compiler. I am now comparing the ISA produced by 12.10 and 13.1 (btw, AMD APP KernelAnalyzer crashes on 13.2).

              There seem to be some changes around branches and/or loops.

               

              The source pseudo code:

              for (uint i = 0; i < STEP; i++) {

                  if (check_data(...))

                      output[0] = i;

              }

               

              12.10 ISA:

                s_mov_b64     exec, s[10:11]     

                s_addk_i32    s3, 0x001f         

                s_addk_i32    s2, 0x0001         

                s_cmp_ge_u32  s2, 0x00002100     

                s_cbranch_scc1  label_3CC4       

                s_branch      label_0707         

                s_getpc_b64   s[10:11]           

                s_sub_u32     s10, s10, 0x0000d6e4

                s_subb_u32    s11, s11, 0        

                s_setpc_b64   s[10:11]           

              label_3CC4:                        

               

              13.1 ISA:

                s_mov_b64     exec, s[10:11]     

                s_addk_i32    s3, 0x001f         

                s_addk_i32    s2, 0x0001         

                s_cmp_ge_u32  s2, 0x00002100     

                s_cbranch_scc0  label_3F7E       

                s_getpc_b64   s[10:11]           

                s_add_u32     s10, s10, 0x00000038

                s_addc_u32    s11, s11, 0        

                s_setpc_b64   s[10:11]           

              label_3F7E:                        

                s_getpc_b64   s[10:11]           

                s_sub_u32     s10, s10, 0x0000d19c

                s_subb_u32    s11, s11, 0        

                s_setpc_b64   s[10:11]           

                s_getpc_b64   s[10:11]           

                s_sub_u32     s10, s10, 0x0000d1b0

                s_subb_u32    s11, s11, 0        

                s_setpc_b64   s[10:11]           

               

              As you can see, the new compiler seems to emit more instructions for the same code.

                • Re: OpenCL performance dropped down 12.10 >> 13.1
                  realhet

                  Wow, that's funny code...

                    s_getpc_b64   s[10:11]       

                    s_add_u32     s10, s10, 0x00000038

                    s_addc_u32    s11, s11, 0       

                    s_setpc_b64   s[10:11]          

                  This could be done with a single "s_branch 0x000E" (0x000E comes from 0x0038/4; the /4 is because of dword alignment).

                  I guess they prepared the compiler for loops bigger than 128KB (which can't be encoded in s_branch), so they replaced almost every jump with these 4-cycle far jumps, even when the jump targets are well-known absolute locations within s_branch's reach.
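The offset arithmetic above can be sketched as follows. This is only an illustration, assuming that s_branch carries a signed 16-bit offset counted in dwords relative to the following instruction, and that the short branch would sit where s_getpc_b64 was. The helper name `s_branch_imm` is made up; the byte offsets are taken from the 13.1 listing earlier in the thread.

```python
DWORD_BYTES = 4
SIMM16_MIN = -(1 << 15)      # most negative signed 16-bit value
SIMM16_MAX = (1 << 15) - 1   # most positive signed 16-bit value


def s_branch_imm(byte_offset):
    """Return the dword offset a short branch would need,
    or None if it does not fit in 16 signed bits (hypothetical helper)."""
    if byte_offset % DWORD_BYTES != 0:
        raise ValueError("branch targets are assumed to be dword aligned")
    imm = byte_offset // DWORD_BYTES
    return imm if SIMM16_MIN <= imm <= SIMM16_MAX else None


# Forward far jump from the 13.1 listing: s_add_u32 s10, s10, 0x00000038
forward = s_branch_imm(0x38)
# Backward far jump from the 13.1 listing: s_sub_u32 s10, s10, 0x0000d19c
backward = s_branch_imm(-0xD19C)
# Reach of the short form in KiB, under the assumption above.
reach_kib = (SIMM16_MAX + 1) * DWORD_BYTES // 1024

if __name__ == "__main__":
    print(hex(forward), backward, reach_kib)
```

Under these assumptions both jumps in the listing fit a short branch, which is consistent with the point that the far-jump sequence is not needed here.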

                   

                  (Btw: 64KByte overflows GCN's 32KByte code cache! You should keep that loop below 32K.)
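A rough way to estimate the loop size is to read it off the backward far jumps in the listings. The sketch below assumes the back-jump distance approximates the size of the loop body, and takes the 32 KByte cache size from the post above; the names are made up for illustration.

```python
# Back-jump distances in bytes, copied from the s_sub_u32 lines
# in the 12.10 and 13.1 listings earlier in the thread.
BACK_JUMP_BYTES = {"12.10": 0xD6E4, "13.1": 0xD19C}

# Code cache size as stated in the post above (not independently verified).
ICACHE_BYTES = 32 * 1024


def cache_ratio(loop_bytes, cache_bytes=ICACHE_BYTES):
    """How many times larger the loop is than the code cache (hypothetical helper)."""
    return loop_bytes / cache_bytes


if __name__ == "__main__":
    for driver, nbytes in BACK_JUMP_BYTES.items():
        print(f"{driver}: ~{nbytes / 1024:.1f} KiB, "
              f"{cache_ratio(nbytes):.2f}x the code cache")
```

By this estimate the loop is somewhat over 50 KiB with either driver, so it would not fit in a 32 KByte cache in both cases.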

                   

                  Though I think the performance issue is more likely inside the check_data(...) region, not in this rarely executed loop management code.

                    • Re: OpenCL performance dropped down 12.10 >> 13.1
                      darkmen

                      Well, I agree: of course this will not cause a 20% perf loss.

                       

                      I can also see some improvements (at least in theory):

                      • Loops are unrolled even more now
                      • exec mask instructions are more efficient (I can even see fewer branches in the code):

                      12.10 ISA:

                        s_mov_b64     s[48:49], exec                             

                        s_andn2_b64   exec, s[48:49], s[46:47]                   

                        s_andn2_b64   s[44:45], s[44:45], exec                   

                        s_cbranch_scc0  label_086E                               

                        s_andn2_b64   exec, s[48:49], exec                       

                        s_mov_b64     exec, s[48:49]                             

                        s_mov_b64     exec, s[44:45]                             

                        s_branch      label_0838                                 

                      label_086E:

                       

                      13.1 ISA:

                        s_mov_b64     vcc, exec                                  

                        s_andn2_b64   exec, vcc, s[46:47]                        

                        s_andn2_b64   s[44:45], s[44:45], exec                   

                        s_cbranch_scc0  label_0C76                               

                        s_mov_b64     exec, s[44:45]                             

                        s_branch      label_0C42                                 

                      label_0C76:

                       

                      So the question is still open: what makes it slower?