26 Replies Latest reply on Jan 20, 2019 3:05 AM by webmaster128

    OpenCL compilation hangs forever

    webmaster128

      Hi all,

       

      I am trying to compile this project for an AMD GPU: GitHub - webmaster128/lisk-vanity: A tool to generate short Lisk addresses with GPU support

       

      The .cl files are in lisk-vanity/src/opencl at master · webmaster128/lisk-vanity · GitHub  which are concatenated as follows:

      lisk-vanity/gpu.rs at 9515c00c01adbc1eb8c68d3b2acf245b99c675ca · webmaster128/lisk-vanity · GitHub

       

      Unfortunately the compilation never terminates.

       

      - It works fine for NVIDIA GPU and Apple/Intel CPU
      - It shows proper errors when there are syntax errors
      - It compiles when I comment out enough code. Which code does not matter.

       

      This is how to reproduce on Ubuntu:

       

      # Install rust
      curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly
      source $HOME/.cargo/env
      
      git clone https://github.com/webmaster128/lisk-vanity && cd lisk-vanity
      
      export RUSTFLAGS='-L /opt/rocm/opencl/lib/x86_64/'
      cargo build
      ./target/debug/lisk-vanity --gpu --threads 0
      
      

       

      System/Driver is AMD-ROCm1.9.224+TensorFlow1.10 Ubuntu16.04 x64

        • Re: OpenCL compilation hangs forever
          dipak

          Thank you for reporting this compilation issue. We will check it and get back to you.

          Please share the GPU details where you observed the above problem.

           

          P.S. I have whitelisted you.

            • Re: OpenCL compilation hangs forever
              webmaster128

              The issue occurs for those 3 GPUs from GPUEater.com:
              - Radeon RX 580 (8G)
              - Radeon RX Vega 56 (8GB)
              - Radeon Vega Frontier Edition (16G)

               

              If you need more details, please let me know

                • Re: OpenCL compilation hangs forever
                  webmaster128

                  Running the script lisk-vanity/get-ocl-line.py at master · webmaster128/lisk-vanity · GitHub after checking out the above git repo allows merging all code into one .cl file. Maybe that helps for debugging.

                    • Re: OpenCL compilation hangs forever
                      dipak

                      Thanks for your inputs. I just did a quick check with CodeXL and observed a build error for those devices. I will do some more tests and let you know my findings.

                        • Re: OpenCL compilation hangs forever
                          dipak

                          I observed a build error for any CI+ devices. For example, CodeXL reports the following build error for Vega (gfx900):

                           

                          Error in hsa_operand section, at offset 397288:

                          Address offset exceeds variable size

                          LLVM ERROR:

                          Brig container validation has failed in BRIGAsmPrinter.cpp

                           

                          I checked with the compiler team about the error message. They suspect a user-side error, because this message occurs when the source statically indexes an array out of bounds. So I would suggest checking the kernel files for any such statically addressed out-of-bounds array access.

                           

                           

                          Thanks.

                            • Re: OpenCL compilation hangs forever
                              webmaster128

                              Thanks for the feedback! I will try to use CodeXL on my own to see if it can help me find the place. hsa_operand and BRIGAsmPrinter.cpp is nothing that is in my code. If offset 397288 is a number of bytes in the kernel.cl, I do not see anything there.

                               

                              But even if there was an issue in my program, the compiler must terminate and show an error message. My clGetProgramBuildInfo with CL_PROGRAM_BUILD_STATUS remains at CL_BUILD_IN_PROGRESS.

                                • Re: OpenCL compilation hangs forever
                                  dipak
                                  hsa_operand and BRIGAsmPrinter.cpp is nothing that is in my code. If offset 397288 is a number of bytes in the kernel.cl, I do not see anything there.

                                  That error message includes information about the compiler's internal section which identified and triggered the error. Please ignore those details. As I said earlier, essentially the error message indicates a statically addressed out of bound array access.

                                   

                                  Actually, I got the error on a Win10 setup. I don't have a ROCm setup to verify it. On Windows, the compiler tool-chain (HSAIL) is different from the ROCm one. The above error message is from the HSAIL compiler, which diagnosed the out-of-bounds access early. As per the compiler team, the ROCm tool-chain has no such diagnostic, but the error is still there, so a hang is a normal outcome.

                                   

                                  I agree with you that the compiler should not crash. However, please note that compiler tool-chain on ROCm is new and still improving. So I think it will have the fix in future. Anyway, I've reported it to the concerned team.

                                   

                                  Thanks.

                                    • Re: OpenCL compilation hangs forever
                                      webmaster128

                                      I tried to reproduce the error message using CodeXL on Windows. I use the Analyze feature from CodeXL, is that correct?

                                       

                                      After running for 22 devices, I get either build success or "Error: OpenCL offline compilation for the detected target GPU is not supported: gfx900 (Vega)" (codexl_analyze_log.txt · GitHub). Does this mean I cannot analyze the gfx900 bug without a Windows machine with GPU?

                                        • Re: OpenCL compilation hangs forever
                                          dipak

                                          My observation was different when I checked with CodeXL.  Please find the attached codeXL build reports (for 64bit gpu build) generated on below setup:

                                           

                                          Windows 10 (64bit) + latest adrenalin 18.12.3 (18.50.03.05-181217a-337288E) + latest CodeXL 2.6.361 + Hawaii XT (R9 290X) 

                                           

                                          If you are using a different version of CodeXL and driver, please try with the latest ones.

                                           

                                          Information about the attached files:

                                          CodeXL_build_report_old.txt ---> based on older cl files

                                          CodeXL_build_report_new.txt ---> all the cl files are same except curve25519.cl. It was replaced by newer one available here: lisk-vanity/curve25519.cl at 933217d618160d80ab5658019fc491ba2bcdaa97 · webmaster128/lisk-vanity · GitHub (modified 3days ago)

                                           

                                          Another point to note, the compilation was successful for devices with graphics IP v6. These devices belong to 1st generation GCN family and a different compiler tool-chain is used for these devices. HSA tool-chain is mainly used for devices from 2nd gen GCN and newer families.

                                           

                                          By the way, as far as I know, CodeXL depends on Radeon GPU Analyzer (rga) for offline compilation. When compiling with rga, please choose the command-line options carefully so that the proper compiler tool-chain is invoked. Otherwise, the observed behavior may vary. Here is a related thread: Offline compile with CodeXL

                                           

                                          Thanks.

                                            • Re: OpenCL compilation hangs forever
                                              webmaster128

                                              Thanks!

                                               

                                              Given that the error message does not point to a specific piece of code, is there a best practice strategy to find the issue?

                                                • Re: OpenCL compilation hangs forever
                                                  dipak

                                                  Just to inform you, I've already forwarded your query to the compiler team. Once I get any feedback, I'll share with you.

                                                   

                                                  Thanks.

                                                  • Re: OpenCL compilation hangs forever
                                                    dipak

                                                    One point to note. As I observed, the kernels seem to build fine for all the devices if optimization is disabled (with build flag "-O0"). I'll share this observation with the compiler team for clarification.

                                                    Meanwhile, could you please try to build and run (using both the compiler toolchains) the kernels without optimization and let me know your findings?

                                                     

                                                    Thanks.

                                                      • Re: OpenCL compilation hangs forever
                                                        webmaster128

                                                        Just to inform you, I've already forwarded your query to the compiler team. Once I get any feedback, I'll share with you.

                                                        Great, thanks

                                                        Meanwhile, could you please try to build and run (using both the compiler toolchains) the kernels without optimization and let me know your findings?

                                                        I tried disabling optimization from time to time since I read about optimization-related issues in other threads but never observed a notable difference. However, I did not do a meticulous analysis

                                                          • Re: OpenCL compilation hangs forever
                                                            dipak

                                                            I observed this difference when compiled on Windows (using CodeXL as well as using a simple OpenCL project). I don't know about the same on ROCm because I couldn't test it there.

                                                            Another point is, I am not sure whether the compiled code would work or not. That's why I suggested you to try it on your setup.

                                                             

                                                            Thanks.

                                                              • Re: OpenCL compilation hangs forever
                                                                webmaster128

                                                                Okay, here we go. I now installed ROCm on Ubuntu 16.04

                                                                1.

                                                                /opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 kernel.cl

                                                                Default optimization; hangs forever as initially reported


                                                                2.

                                                                /opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -O0 kernel.cl

                                                                leads to the error

                                                                ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-b2acb4.o

                                                                >>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-b2acb4.o

                                                                >>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-b2acb4.o

                                                                >>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-b2acb4.o

                                                                >>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)
                                                                [...]

                                                                 

                                                                3.

                                                                After adding -fPIC as suggested

                                                                /opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -O0 -fPIC kernel.cl

                                                                the error remains

                                                                ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-8ddcbb.o

                                                                >>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-8ddcbb.o

                                                                >>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-8ddcbb.o

                                                                >>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-8ddcbb.o

                                                                >>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)

                                                                 

                                                                ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_neg; recompile with -fPIC

                                                                >>> defined in /tmp/kernel-8ddcbb.o

                                                                >>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
                                                                [...]

                                                                 

                                                                Installation looks good. /opt/rocm/bin/rocminfo and /opt/rocm/opencl/bin/x86_64/clinfo show the hardware, and the OpenCL compiler is

                                                                /opt/rocm/opencl/bin/x86_64/clang --version

                                                                clang version 8.0

                                                                Target: amdgcn-unknown-amdhsa

                                                                Thread model: posix

                                                                InstalledDir: /opt/rocm/opencl/bin/x86_64

                                                                 

                                                                  • Re: OpenCL compilation hangs forever
                                                                    webmaster128

                                                                    Ah wait, a linking error also means that the compilation succeeded.

                                                                     

                                                                    The second command of those produces a compilation result. The first one hangs.

                                                                    /opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -c kernel.cl

                                                                    /opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -c -O0 kernel.cl

                                                                    I attached the (unchanged) kernel.cl for debugging.
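Since the hang also shows up with the standalone compiler, a candidate kernel.cl can be checked under a time limit, which makes the "comment out code until it compiles" approach scriptable. Below is a minimal POSIX C sketch of that idea. The helper name run_with_timeout and its return codes are hypothetical and not part of ROCm or OpenCL, and the timeout is only accurate to about one second.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with the given arguments and give up
   after roughly timeout_s seconds.
   Returns the child's exit code (127 if exec failed), -1 if the child was
   killed because of the timeout, or -2 on fork/wait errors or if the
   child ended abnormally. */
static int run_with_timeout(char *const argv[], unsigned timeout_s) {
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) return -2;
    if (pid == 0) {
        /* Own process group, so helper processes started by the compiler
           driver can be killed together with it. */
        setpgid(0, 0);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }
    setpgid(pid, pid); /* same call from the parent; errors are ignored */

    const struct timespec tick = {0, 100 * 1000 * 1000}; /* 100 ms */
    const time_t deadline = time(NULL) + (time_t)timeout_s;
    for (;;) {
        int status = 0;
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) {
            return WIFEXITED(status) ? WEXITSTATUS(status) : -2;
        }
        if (done < 0) return -2;
        if (time(NULL) >= deadline) {
            kill(-pid, SIGKILL);      /* kill the whole process group */
            waitpid(pid, &status, 0); /* reap the child */
            return -1;
        }
        nanosleep(&tick, NULL);
    }
}
```

Called with an argv that holds one of the clang command lines above, a return value of -1 would mark that variant of kernel.cl as "hangs", while 0 would mark it as "compiles". A suitable time limit is a guess and depends on how long a successful build takes on the machine.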

                                                                      • Re: OpenCL compilation hangs forever
                                                                        dipak

                                                                        Thank you for sharing above findings. I'll get back to you shortly.

                                                                         

                                                                        Thanks.

                                                                        • Re: OpenCL compilation hangs forever
                                                                          dipak

                                                                          Since the compiler front-end does not issue any warning and the error occurs only when optimization is enabled, the compiler team suspects that the memory overwrite is not obvious from the source and only becomes visible during the optimization steps. In that case, it might be very hard to associate an IR-level (for example, HSAIL) variable with a source variable, even if we are able to dump an error trace.

                                                                          From their reply, it looks like the programmer needs to manually review the code to identify the erroneous memory usages.

                                                                           

                                                                          Btw, when I was making some small changes in the file "curve25519.cl" to suppress a few warnings, I accidentally came across the lines of code below, which look erroneous to me. I just wanted to point this out in case it helps you. The code block might not be related to the actual error though.

                                                                           

                                                                          ge25519_scalarmult_base_niels(...)

                                                                          ...

                                                                          //memset(r->z, 0, sizeof(bignum25519));

                                                                          for (size_t n = 0; n < sizeof(bignum25519); n++) r->z[n] = 0;  ---> seems out-of-bound array access

                                                                          ...
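To illustrate why that loop looks out of bounds: sizeof(bignum25519) is a size in bytes, but n is used as an element index. Here is a minimal sketch in plain C that bounds the loop by the element count instead. It assumes bignum25519 is an array of ten uint32_t limbs: the element type matches the compiler diagnostics in this thread, but the length of ten is an assumption. The struct ge25519_sketch and the helper clear_z are hypothetical stand-ins, not the project's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: ten 32-bit limbs; check the real typedef in curve25519.cl. */
typedef uint32_t bignum25519[10];

/* Number of elements in a bignum25519, not its size in bytes. */
#define BIGNUM25519_LIMBS (sizeof(bignum25519) / sizeof(uint32_t))

static_assert(sizeof(bignum25519) == 10 * sizeof(uint32_t),
              "sizeof counts bytes, not limbs");

/* Hypothetical stand-in for the point struct used by the kernel. */
typedef struct {
    bignum25519 x, y, z, t;
} ge25519_sketch;

/* Zero r->z limb by limb without writing past the end of the array. */
static void clear_z(ge25519_sketch *r) {
    for (size_t n = 0; n < BIGNUM25519_LIMBS; n++) {
        r->z[n] = 0;
    }
}
```

With the assumed typedef, sizeof(bignum25519) is 40, so the loop quoted above would write 40 elements into a 10-element array. Dividing by the element size keeps all writes inside r->z.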

                                                                           

                                                                          Thanks.

                                            • Re: OpenCL compilation hangs forever
                                              webmaster128

                                              I can reproduce the issue from a different Linux machine using rga-2.0.1: when I use the default language level, I get a bunch of errors regarding __generic address space. However, the code was designed for OpenCL 1.2 where no __generic exists. Setting the language level to 1.2 leads to the behaviour described in the original post: compiler hangs forever.

                                               

                                              Default

                                               

                                              ./rga -s rocm-cl -c gfx900 --isa test_isa.txt --livereg regs.txt kernel.cl 
                                              
                                              Target GPU detected:
                                              
                                              gfx900 (Vega)
                                                  Radeon (TM) Pro WX 9100
                                                  Radeon Instinct MI25
                                                  Radeon Instinct MI25 MxGPU
                                                  Radeon Pro SSG
                                                  Radeon RX Vega
                                                  Radeon Vega Frontier Edition
                                              
                                              Building for gfx900... failed.
                                              
                                              Error (reported by the ROCm OpenCL Compiler):
                                              kernel.cl:26418:17: error: passing '__generic uint32_t *' (aka '__generic unsigned int *') to parameter of type 'uint32_t *' (aka 'unsigned int *') changes address space of pointer
                                                      curve25519_mul(r->x, p->x, p->t);
                                                                    ^~~~
                                              kernel.cl:25888:28: note: passing argument to parameter 'out' here
                                              curve25519_mul(bignum25519 out, const bignum25519 a, const bignum25519 b) {
                                                                        ^
                                              kernel.cl:26418:23: error: passing 'const __generic uint32_t *' (aka 'const __generic unsigned int *') to parameter of type 'const uint32_t *' (aka 'const unsigned int *') changes address space of pointer
                                                      curve25519_mul(r->x, p->x, p->t);
                                                                          ^~~~
                                              


                                              The same happens when explicitly adding --OpenCLoption "-cl-std=CL2.0".

                                               

                                              OpenCL 1.2

                                               

                                              ./rga -s rocm-cl -c gfx900 --OpenCLoption "-cl-std=CL1.2" --isa test_isa.txt --livereg regs.txt kernel.cl 
                                              
                                              Target GPU detected:
                                              
                                              gfx900 (Vega)
                                                  Radeon (TM) Pro WX 9100
                                                  Radeon Instinct MI25
                                                  Radeon Instinct MI25 MxGPU
                                                  Radeon Pro SSG
                                                  Radeon RX Vega
                                                  Radeon Vega Frontier Edition
                                              
                                              

                                               

                                              No more output for minutes