10 Replies Latest reply on May 7, 2016 6:38 AM by meriken

    90+% Performance Reduction of OpenCL Application with AMD Radeon Software Crimson Edition

    meriken

      With the latest AMD Radeon Software Crimson Edition, I am seeing a 90+% performance reduction in the tripcode generator I developed, and other users have reported the same problem. The following are relevant links:

       

      http://meriken.ygch.net/programming/merikens-tripcode-engine-english/

      https://github.com/meriken/merikens-tripcode-engine

      (Please make sure to uncomment "// #define ENGLISH_VERSION" in "/MerikensTripcodeEngine/Source Files/MERIKENsTripcodeEngine.h" when you build this application.)

       

      When I change OpenCL build options in "/MerikensTripcodeEngine/Source Files/OpenCL12.cpp" from "-O1 -cl-mad-enable" to "-O5 -cl-mad-enable", I either get corrupt results (7970/7990) or the same slow speed (290X). I was able to reproduce the problem with the following drivers:

       

      radeon-crimson-15.12-with-dotnet45-win7-64bit

      non-whql-64bit-radeon-software-crimson-16.1.1-win10-win8.1-win7-jan30

      non-whql-64bit-radeon-software-crimson-16.2-win10-win8.1-win7-feb23

      non-whql-64bit-radeon-software-crimson-16.3-win10-win8.1-win7-march9
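      For readers unfamiliar with the build-option change described above, here is a minimal sketch of how such an option string might be composed before it is passed to clBuildProgram(). The helper name makeBuildOptions is hypothetical and is not taken from OpenCL12.cpp; only the two option strings quoted above come from the post.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the project): composes the string that would
// be handed to clBuildProgram() as its `options` argument.
// "-cl-mad-enable" is a standard OpenCL build option; the "-O<n>" levels are
// the vendor-specific flags quoted in the post ("-O1", "-O5").
std::string makeBuildOptions(int optLevel, bool madEnable) {
    assert(optLevel >= 0);  // negative levels make no sense
    std::string options = "-O" + std::to_string(optLevel);
    if (madEnable) {
        options += " -cl-mad-enable";
    }
    return options;
}

// Intended use (requires the OpenCL headers and a valid program/device,
// so it is left as a comment here):
//   std::string opts = makeBuildOptions(1, true);
//   clBuildProgram(program, 1, &device, opts.c_str(), nullptr, nullptr);
```

      Switching between the two configurations from the post then amounts to changing the first argument from 1 to 5.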

       

      This application was working just fine up until Catalyst 15.11.1 Beta. My system configuration is as follows:

       

      GPU0: DIAMOND Radeon HD 7990

      GPU1: Gigabyte Radeon HD 7990

      GPU2: Sapphire Radeon R9-290X

      GPU3: Gigabyte Radeon R9-290X

      CPU: Intel Core i7-4770

      MB: ASUS Maximus VI Extreme

      PSU: Corsair AX1200

      OS: Microsoft Windows 7 64bit SP1 English

       

      I would really appreciate it if you could work on this issue ASAP.

        • Re: 90+% Performance Reduction of OpenCL Application with AMD Software Crimson Edition
          meriken

          "I either get corrupt results (7970/7990)"

          I forgot to mention that you need to specify the "-b 1" option when you run MerikensTripcodeEngine.exe in order to see an error message for this particular problem. The error message would look like this: "A generated tripcode was corrupt."

           

          I have encountered similar bugs in AMD's OpenCL drivers over the last 3 years or so, and I was able to work around them by tweaking either my OpenCL code or the OpenCL build options. This time, however, I have exhausted all available means, and I am rather desperate. If this bug is not fixed, I will either have to rewrite the OpenCL kernels in GCN assembly (very time-consuming) or drop support for AMD in favor of NVIDIA.

          • Re: 90+% Performance Reduction of OpenCL Application with AMD Radeon Software Crimson Edition
            meriken

            Just to give an update, I ended up rewriting the OpenCL kernel in question in GCN assembly:

            https://github.com/meriken/merikens-tripcode-engine/blob/master/SourceFiles/OpenCL/bin/OpenCL10GCN.asm

            Now my application works reliably across different versions of AMD display drivers. Luckily, I was able to cover GCN 1.0/1.1/1.2 with the same kernel at the source code level.

            By the way, CLRadeonExtender is a wonderful toolkit with a GCN assembler and disassembler that really should be part of the AMD APP SDK. I cannot recommend it highly enough.

              • Re: 90+% Performance Reduction of OpenCL Application with AMD Radeon Software Crimson Edition
                realhet

                Great job!

                 

                Your function calls are awesome! I can only guess that if you unrolled it, it might not fit into the ICache. So is dealing with s_xxxxPC_b64 worth the effort?

                What tool did you use to generate the asm sources? Your own code generator, right?

                 

                Sorry for asking so many questions; I'm excited because I rarely see cool GCN ASM nowadays.

                  • Re: 90+% Performance Reduction of OpenCL Application with AMD Radeon Software Crimson Edition
                    meriken

                    Thank you. It's actually quite nice to have somebody who understands my code.

                    "I can only guess that if you unrolled it, it might not fit into the ICache."

                    That is exactly right. With Bitslice DES, it is crucial that the main loop fits in the instruction cache, no matter which architecture you are dealing with.
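                    For readers new to bitslicing, the following is a minimal sketch of the idea, not code from the tripcode engine. Each 32-bit word holds one bit position of 32 independent inputs, so a single logic instruction evaluates a gate for all 32 inputs at once. The 4-bit width and the toy gate function are assumptions chosen for illustration; a real DES S-box expands into a much longer gate sequence, which is why an unrolled main loop can grow past the instruction cache.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Bitsliced layout: slice[i] holds bit i of 32 independent inputs,
// one input per bit lane of the 32-bit word.
constexpr int kBits = 4;  // illustrative width, not the DES block size
using Sliced = std::array<std::uint32_t, kBits>;

// Transpose 32 small values into bitsliced form.
Sliced toBitslice(const std::array<std::uint8_t, 32>& values) {
    Sliced s{};
    for (int lane = 0; lane < 32; ++lane) {
        for (int bit = 0; bit < kBits; ++bit) {
            std::uint32_t b = (values[lane] >> bit) & 1u;
            s[bit] |= b << lane;
        }
    }
    return s;
}

// Toy boolean function (NOT a DES S-box): (a AND b) XOR (c OR NOT d),
// evaluated for all 32 lanes with four word-wide logic operations.
std::uint32_t toyGateSliced(const Sliced& s) {
    return (s[0] & s[1]) ^ (s[2] | ~s[3]);
}

// Scalar reference for a single input, used to check the sliced version.
unsigned toyGateScalar(std::uint8_t v) {
    unsigned a = v & 1u;
    unsigned b = (v >> 1) & 1u;
    unsigned c = (v >> 2) & 1u;
    unsigned d = (v >> 3) & 1u;
    return (a & b) ^ (c | (d ^ 1u));
}
```

                    Bit `lane` of the word returned by toyGateSliced matches toyGateScalar applied to the input in that lane, so the sliced version does the work of 32 scalar calls.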

                    "So is dealing with s_xxxxPC_b64 worth the effort?"

                    Absolutely. With these lightweight function calls, the performance gain was around 100% compared to the unrolled version.

                    "What tool did you use to generate the asm sources?"

                    It's a combination of a custom code generator and CLRadeonExtender.

                    I started with an OpenCL kernel. I created GCN bytecode with CodeXL for GCN 1.0/1.1/1.2, disassembled it with CLRadeonExtender, and used diff to see how the OpenCL compiler handled the three different GCN architectures. Once I had a functional kernel written in GCN assembly, I analyzed its register usage, generated an optimized version of the main loop with a custom code generator, and merged it with the disassembled code.

                    "Sorry for asking so many questions; I'm excited because I rarely see cool GCN ASM nowadays."

                    Oh, no problem. Your work was my original inspiration, so I am honored to answer your questions.