OpenCL compilation hangs forever

webmaster128 · ‎12-17-2018

Hi all,

I am trying to compile this project for an AMD GPU: GitHub - webmaster128/lisk-vanity: A tool to generate short Lisk addresses with GPU support

The .cl files are in lisk-vanity/src/opencl at master · webmaster128/lisk-vanity · GitHub and are concatenated as follows:

lisk-vanity/gpu.rs at 9515c00c01adbc1eb8c68d3b2acf245b99c675ca · webmaster128/lisk-vanity · GitHub

Unfortunately, the compilation never terminates.

- It works fine for NVIDIA GPUs and Apple/Intel CPUs
- It shows proper errors when there are syntax errors
- It compiles when I comment out enough code; which code I remove does not matter.

This is how to reproduce on Ubuntu:

# Install rust

curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly

source $HOME/.cargo/env

git clone https://github.com/webmaster128/lisk-vanity && cd lisk-vanity

export RUSTFLAGS='-L /opt/rocm/opencl/lib/x86_64/'

cargo build

./target/debug/lisk-vanity --gpu --threads 0

System/Driver is AMD-ROCm1.9.224+TensorFlow1.10 Ubuntu16.04 x64

dipak · ‎12-18-2018

Thank you for reporting this compilation issue. We will check it and get back to you.

Please share the GPU details where you observed the above problem.

P.S. I have whitelisted you.

webmaster128 · ‎12-18-2018

The issue occurs for these three GPUs from GPUEater.com:
- Radeon RX 580 (8 GB)
- Radeon RX Vega 56 (8 GB)
- Radeon Vega Frontier Edition (16 GB)

If you need more details, please let me know.

webmaster128 · ‎12-18-2018

Running the script lisk-vanity/get-ocl-line.py at master · webmaster128/lisk-vanity · GitHub after checking out the above git repo allows merging all code into one .cl file. Maybe that helps for debugging.
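
For illustration only, a merge step of this kind might look like the Python sketch below. It is not the repository's get-ocl-line.py, whose behaviour is not shown in this thread. The function names merge_sources and locate are hypothetical, and the file order is assumed to be supplied by the caller. Besides concatenating, the sketch records where each file starts, so that a line number reported for the merged file can be traced back to its source file.

```python
from pathlib import Path


def merge_sources(paths):
    """Concatenate files in the given order.

    Returns the merged text and an index of (first_merged_line, path)
    entries, one per file, using 1-based line numbers.
    """
    merged_lines = []
    index = []
    for path in paths:
        lines = Path(path).read_text().splitlines()
        index.append((len(merged_lines) + 1, str(path)))
        merged_lines.extend(lines)
    return "\n".join(merged_lines) + "\n", index


def locate(index, merged_line):
    """Map a 1-based line of the merged file back to (path, line_in_file)."""
    owner = None
    for first, path in index:
        if first <= merged_line:
            # Later files overwrite earlier matches, so the last file that
            # starts at or before merged_line wins.
            owner = (path, merged_line - first + 1)
    if owner is None:
        raise ValueError("line number precedes the first file")
    return owner
```

With such an index, a diagnostic that names a line in the merged kernel.cl can be looked up with locate(index, line_number) instead of searching the merged file by hand.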

dipak · ‎12-18-2018

Thanks for your inputs. I just did a quick check with CodeXL and observed a build error for those devices. I will do some more tests and let you know my findings.

dipak · ‎12-19-2018

I observed a build error for all CI+ devices. For example, CodeXL reports the following build error for Vega (gfx900):

Error in hsa_operand section, at offset 397288:
Address offset exceeds variable size
LLVM ERROR:
Brig container validation has failed in BRIGAsmPrinter.cpp

When I asked the compiler team about the error message, they suspected a user-side error, because this message occurs when the source code statically addresses an array out of bounds. So I would suggest checking the kernel files for any such statically addressed out-of-bounds array access.

Thanks.

webmaster128 · ‎12-19-2018

Thanks for the feedback! I will try to use CodeXL on my own to see if it can help me find the place. hsa_operand and BRIGAsmPrinter.cpp is nothing that is in my code. If offset 397288 is a number of bytes in the kernel.cl, I do not see anything there.

But even if there was an issue in my program, the compiler must terminate and show an error message. My clGetProgramBuildInfo with CL_PROGRAM_BUILD_STATUS remains at CL_BUILD_IN_PROGRESS.

dipak · ‎12-20-2018

"hsa_operand and BRIGAsmPrinter.cpp is nothing that is in my code. If offset 397288 is a number of bytes in the kernel.cl, I do not see anything there."

That error message includes details about the compiler-internal section that detected and reported the error. Please ignore those details. As I said earlier, the message essentially indicates a statically addressed out-of-bounds array access.

Actually, I got the error on a Windows 10 setup; I don't have a ROCm setup to verify it. On Windows, the compiler tool-chain (HSAIL) is different from the ROCm one. The above error message comes from the HSAIL compiler, which diagnosed the out-of-bounds access early. According to the compiler team, the ROCm tool-chain has no such diagnostic, but the error is still there, so a hang is an expected outcome.

I agree with you that the compiler should not crash. However, please note that the compiler tool-chain on ROCm is new and still improving, so I expect a fix in a future release. In any case, I've reported it to the team concerned.

Thanks.

webmaster128 · ‎12-22-2018

I tried to reproduce the error message using CodeXL on Windows. I used the Analyze feature of CodeXL; is that correct?

After running for 22 devices, I get either build success or "Error: OpenCL offline compilation for the detected target GPU is not supported: gfx900 (Vega)" (codexl_analyze_log.txt · GitHub). Does this mean I cannot analyze the gfx900 bug without a Windows machine with GPU?

dipak · ‎12-24-2018

My observation was different when I checked with CodeXL. Please find attached the CodeXL build reports (for the 64-bit GPU build), generated on the setup below:

Windows 10 (64-bit) + latest Adrenalin 18.12.3 (18.50.03.05-181217a-337288E) + latest CodeXL 2.6.361 + Hawaii XT (R9 290X)

If you are using a different version of CodeXL or the driver, please try the latest ones.

Information about the attached files:

CodeXL_build_report_old.txt ---> based on the older .cl files

CodeXL_build_report_new.txt ---> all the .cl files are the same except curve25519.cl, which was replaced by the newer one available here: lisk-vanity/curve25519.cl at 933217d618160d80ab5658019fc491ba2bcdaa97 · webmaster128/lisk-vanity · G... (modified 3 days ago)

Another point to note: the compilation was successful for devices with graphics IP v6. These devices belong to the first-generation GCN family, and a different compiler tool-chain is used for them. The HSA tool-chain is mainly used for devices from second-generation GCN and newer families.

By the way, as far as I know, CodeXL depends on Radeon GPU Analyzer (rga) for offline compilation. When compiling with rga, please choose the command-line options carefully so that the proper compiler tool-chain is invoked. Otherwise, the observed behavior may vary. Here is a related thread: Offline compile with CodeXL

Thanks.

webmaster128 · ‎12-29-2018

Thanks!

Given that the error message does not point to a specific piece of code, is there a best-practice strategy for finding the issue?
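
One generic strategy, offered here as a sketch rather than as advice given in this thread, is to shrink the kernel automatically: split the source into chunks (for example, one function each) and keep dropping chunks for as long as the failure still reproduces. The Python sketch below shows the greedy loop. The names reduce_chunks and still_fails are hypothetical; still_fails is assumed to rebuild the candidate source and return True only when the original failure (the hang or the specific error) still occurs.

```python
def reduce_chunks(chunks, still_fails):
    """Greedily drop chunks while the failure keeps reproducing.

    chunks      -- list of source fragments, in file order
    still_fails -- callable taking a list of chunks and returning True
                   if the original failure still occurs for that subset
    Returns a subset from which no single chunk can be removed without
    losing the failure.
    """
    kept = list(chunks)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            # Chunk i is not needed to reproduce the failure: drop it and
            # retry the same position, which now holds the next chunk.
            kept = candidate
        else:
            # Removing chunk i hides the failure, so keep it and move on.
            i += 1
    return kept
```

Removing a function that other code still calls usually produces an ordinary compile error instead of the original failure. As long as still_fails returns False in that case, such chunks are simply kept, and the result stays a buildable reproduction.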

dipak · ‎01-03-2019

Just to inform you, I've already forwarded your query to the compiler team. Once I get any feedback, I'll share with you.

Thanks.

dipak · ‎01-03-2019

One point to note: as far as I observed, the kernels seem to build fine for all the devices if optimization is disabled (with the build flag "-O0"). I'll share this observation with the compiler team for clarification.

Meanwhile, could you please try to build and run (using both the compiler toolchains) the kernels without optimization and let me know your findings?

Thanks.

webmaster128 · ‎01-03-2019

"Just to inform you, I've already forwarded your query to the compiler team. Once I get any feedback, I'll share with you."

Great, thanks.

"Meanwhile, could you please try to build and run (using both the compiler toolchains) the kernels without optimization and let me know your findings?"

I tried disabling optimization from time to time, since I had read about optimization-related issues in other threads, but I never observed a notable difference. However, I did not do a meticulous analysis.

dipak · ‎01-04-2019

I observed this difference when compiled on Windows (using CodeXL as well as using a simple OpenCL project). I don't know about the same on ROCm because I couldn't test it there.

Another point: I am not sure whether the compiled code would work or not. That's why I suggested that you try it on your setup.

Thanks.

webmaster128 · ‎01-06-2019

Okay, here we go. I now installed ROCm on Ubuntu 16.04

1.

/opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 kernel.cl

Default optimization; hangs forever, as initially reported

2.

/opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -O0 kernel.cl

leads to the error

ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC
>>> defined in /tmp/kernel-b2acb4.o
>>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC
>>> defined in /tmp/kernel-b2acb4.o
>>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC
>>> defined in /tmp/kernel-b2acb4.o
>>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC
>>> defined in /tmp/kernel-b2acb4.o
>>> referenced by /tmp/kernel-b2acb4.o:(ge25519_scalarmult_base_choose_niels)
[...]

3.

After adding -fPIC as suggested

/opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -O0 -fPIC kernel.cl

the error remains

ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC
>>> defined in /tmp/kernel-8ddcbb.o
>>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_move_conditional_bytes; recompile with -fPIC
>>> defined in /tmp/kernel-8ddcbb.o
>>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC
>>> defined in /tmp/kernel-8ddcbb.o
>>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol curve25519_swap_conditional; recompile with -fPIC
>>> defined in /tmp/kernel-8ddcbb.o
>>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol curve25519_neg; recompile with -fPIC
>>> defined in /tmp/kernel-8ddcbb.o
>>> referenced by /tmp/kernel-8ddcbb.o:(ge25519_scalarmult_base_choose_niels)
[...]

The installation looks good. /opt/rocm/bin/rocminfo and /opt/rocm/opencl/bin/x86_64/clinfo show the hardware, and the OpenCL compiler is:

/opt/rocm/opencl/bin/x86_64/clang --version
clang version 8.0
Target: amdgcn-unknown-amdhsa
Thread model: posix
InstalledDir: /opt/rocm/opencl/bin/x86_64

webmaster128 · ‎01-07-2019

Ah wait, a linking error also means that the compilation succeeded.

Of the following two commands, the second (with -O0) produces a compilation result. The first one hangs.

/opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -c kernel.cl
/opt/rocm/opencl/bin/x86_64/clang -include/opt/rocm/opencl/include/opencl-c.h -cl-std=CL2.0 -c -O0 kernel.cl

I attached the (unchanged) kernel.cl for debugging.
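
A comparison like this can be scripted so that a hanging build is cut off instead of blocking the terminal. The Python sketch below is one way to do it under stated assumptions: the helper names run_with_timeout and compare_opt_levels are hypothetical, and the timeout value is arbitrary and has to be chosen by the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'error', or 'timeout'.

    Returns (status, stderr_text). On timeout the stderr text is empty.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # compiler does not keep a CPU core busy afterwards.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stderr


def compare_opt_levels(base_cmd, source, timeout_s):
    """Build source once with default optimization and once with -O0."""
    results = {}
    for label, extra in (("default", []), ("-O0", ["-O0"])):
        status, _ = run_with_timeout(base_cmd + extra + [source], timeout_s)
        results[label] = status
    return results
```

For the two commands above, base_cmd would be the clang path followed by the -include, -cl-std=CL2.0 and -c arguments exactly as written in this post, and source would be kernel.cl. A result of "timeout" for the default build and "ok" for -O0 would match the behaviour described here.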

dipak · ‎01-07-2019

Thank you for sharing above findings. I'll get back to you shortly.

Thanks.

dipak · ‎01-09-2019

Since the compiler front end does not issue any warning and the error occurs only when optimization is enabled, the compiler team suspects that the memory overrun is not obvious from the source and only becomes visible after the optimization steps. In that case, it might be very hard to associate an IR variable (for example, in HSAIL) with a source variable, even if we were able to dump an error trace.

From their reply, it looks like the programmer needs to review the code manually to identify the erroneous memory accesses.

By the way, while making some small changes in the file "curve25519.cl" to suppress a few warnings, I happened to notice the lines of code below, which look erroneous to me. I just wanted to point this out in case it helps you. The code block might not be related to the actual error, though.

ge25519_scalarmult_base_niels(...)
...
//memset(r->z, 0, sizeof(bignum25519));
for (size_t n = 0; n < sizeof(bignum25519); n++) r->z = 0; ---> seems out-of-bound array access
...

Thanks.
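
To see why the loop bound looks wrong, assume that the loop body is meant to index r->z with n and that bignum25519 is an array of ten uint32_t limbs. The uint32_t element type matches the compiler diagnostics quoted elsewhere in this thread; the length of ten is an assumption taken from the 32-bit ed25519-donna layout. Under these assumptions, sizeof(bignum25519) counts bytes, not elements, so the loop runs four times too far. The Python sketch below uses ctypes only to mirror the C size arithmetic; all names in it are hypothetical.

```python
import ctypes

LIMBS = 10  # assumption: bignum25519 is uint32_t[10]
Bignum25519 = ctypes.c_uint32 * LIMBS


def byte_bound():
    """Loop bound used in the quoted code: sizeof(bignum25519), in bytes."""
    return ctypes.sizeof(Bignum25519)


def element_bound():
    """Loop bound that stays in range: number of elements in the array."""
    return ctypes.sizeof(Bignum25519) // ctypes.sizeof(ctypes.c_uint32)


def out_of_bounds_indices():
    """Indices the byte-bounded loop would touch past the end of the array."""
    return [n for n in range(byte_bound()) if n >= element_bound()]
```

In C terms, the corresponding bound would be sizeof(bignum25519) / sizeof(uint32_t) elements, which is what a memset over sizeof(bignum25519) bytes covers.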

webmaster128 · ‎01-14-2019

Thank you so much, dipak, for spotting this. This is a serious issue in the code, and you're absolutely right about this being an out-of-bounds array access. This is now fixed in lisk-vanity and nano-vanity. Thanks!

Still, the problem of the compiler not terminating remains. However, since we have the same issue in nano-vanity, I now have a better way to reproduce it:

Compilation issue with modern AMD + optimization enabled · Issue #33 · PlasmaPower/nano-vanity · Git... I attached the smaller, updated kernel here.

Additional observation: during the compilation the CPU is at 100% consistently but the memory use stays very low. So the compiler must be doing something (that it only does with optimization enabled).

dipak · ‎01-16-2019

I'll check and get back to you. By the way, is this issue related to the earlier one, or is it something new?

webmaster128 · ‎01-16-2019

Most likely it is the exact same issue. lisk-vanity is a fork of nano-vanity, and the two share a lot of common CL code (ed25519). Since the behaviour is the same for both code bases (compiler hanging at 100% CPU when optimization is enabled), I think the issue is in the common code. I chose to use nano-vanity for debugging now since it has far fewer lines of code.

dipak · ‎01-17-2019

The new kernel seems to build fine (with and without optimization) for all the devices on Windows (with the HSAIL compiler). Please find the attached CodeXL build report. The problem may be specific to the ROCm compiler. However, I currently don't have a ROCm setup to check it myself.

Also, as this issue looks different, it would be helpful if you opened a new thread describing the problem and attached the reproducible kernel file. A new thread for each separate problem helps us track them. I'll ask our verification team to reproduce it.

By the way, the ROCm GitHub site is the best place to report any issue or query related to ROCm. So I would suggest reporting the problem here: Issues · RadeonOpenCompute/ROCm · GitHub

Thanks.

webmaster128 · ‎01-20-2019

Thanks dipak, let me quickly summarize what we have achieved so far:

1. ROCm compiler bug found
2. Cannot be reproduced in CodeXL because of a different bug
3. Different bug (out-of-bounds access) detected and fixed thanks to a good eye in code review
4. The bug from 1. remains, but now we have a reproducible example that compiles cleanly with CodeXL

I'll move this over to GitHub, including the findings from here. This thread can be closed.

webmaster128 · ‎12-21-2018

I can reproduce the issue from a different Linux machine using rga-2.0.1: when I use the default language level, I get a bunch of errors regarding __generic address space. However, the code was designed for OpenCL 1.2 where no __generic exists. Setting the language level to 1.2 leads to the behaviour described in the original post: compiler hangs forever.

Default

./rga -s rocm-cl -c gfx900 --isa test_isa.txt --livereg regs.txt kernel.cl 
Target GPU detected:
gfx900 (Vega)
    Radeon (TM) Pro WX 9100
    Radeon Instinct MI25
    Radeon Instinct MI25 MxGPU
    Radeon Pro SSG
    Radeon RX Vega
    Radeon Vega Frontier Edition
Building for gfx900... failed.
Error (reported by the ROCm OpenCL Compiler):
kernel.cl:26418:17: error: passing '__generic uint32_t *' (aka '__generic unsigned int *') to parameter of type 'uint32_t *' (aka 'unsigned int *') changes address space of pointer
        curve25519_mul(r->x, p->x, p->t);
                      ^~~~
kernel.cl:25888:28: note: passing argument to parameter 'out' here
curve25519_mul(bignum25519 out, const bignum25519 a, const bignum25519 b) {
                          ^
kernel.cl:26418:23: error: passing 'const __generic uint32_t *' (aka 'const __generic unsigned int *') to parameter of type 'const uint32_t *' (aka 'const unsigned int *') changes address space of pointer
        curve25519_mul(r->x, p->x, p->t);
                            ^~~~

The same happens when explicitly adding --OpenCLoption "-cl-std=CL2.0".

OpenCL 1.2

./rga -s rocm-cl -c gfx900 --OpenCLoption "-cl-std=CL1.2" --isa test_isa.txt --livereg regs.txt kernel.cl

Target GPU detected:
gfx900 (Vega)
    Radeon (TM) Pro WX 9100
    Radeon Instinct MI25
    Radeon Instinct MI25 MxGPU
    Radeon Pro SSG
    Radeon RX Vega
    Radeon Vega Frontier Edition

No more output for minutes

webmaster128 · ‎12-21-2018

After annotating some function argument pointers (Comparing master...opencl2.0 · webmaster128/lisk-vanity · GitHub ) the compile errors for OpenCL 2.0 disappear and I have the same problem for 1.2 and 2.0: the compiler hangs forever.

I will try Windows and its compiler tool-chain (HSAIL) and hope for some hint about which part of the code contains the "statically addressed out of bound array access".
