OpenCL

Wedge009
Adept II

amdgpu-pro 20.45: ROCr vs PAL OpenCL breaks BOINC GPU processing

Greetings.

After corresponding with a fellow BOINC user I decided to make a post here to highlight some difficulties I've been having - in particular, I was advised @bridgman may be able to assist. I made a search for existing posts that may be related to my specific issue, but I didn't find any.

Between the two of us we've established the following:

  1. amdgpu-pro 20.40 is incompatible with Ubuntu Linux kernels after 5.4.0-54.
  2. amdgpu-pro 20.45 displaces PAL OpenCL in favour of a ROCr implementation.

For point 1, this was resolved with amdgpu-pro 20.45.
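For anyone wanting to check whether a given machine falls on the wrong side of point 1, here is a minimal sketch. It assumes Ubuntu-style release strings such as "5.4.0-54-generic"; the function names and the LAST_WORKING constant are made up for illustration, and the 5.4.0-54 cut-off is simply the one observed above, not an official compatibility statement.

```python
import platform
import re

# Last Ubuntu kernel observed (in this thread) to work with amdgpu-pro 20.40.
# This is an observation from the post, not an official AMD figure.
LAST_WORKING = (5, 4, 0, 54)


def parse_release(release):
    """Turn an Ubuntu-style release like '5.4.0-54-generic' into (5, 4, 0, 54)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release string: %r" % release)
    return tuple(int(part) for part in match.groups())


def newer_than_last_working(release):
    """True if the kernel is newer than the last one seen working with 20.40."""
    return parse_release(release) > LAST_WORKING


if __name__ == "__main__":
    current = platform.release()
    try:
        print(current, "newer than 5.4.0-54:", newer_than_last_working(current))
    except ValueError as exc:
        # Non-Ubuntu kernels may use a different release format.
        print(exc)
```

On a non-Ubuntu system the release string may not match this pattern, in which case the script just reports that it could not parse it.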

For point 2, this appears only to be an issue for GPUs based on Vega and later. For example, my BOINC comrade is only running Polaris GPUs under amdgpu-pro and thus can get by with the 'legacy' OpenCL installation. I have confirmed this by similarly running amdgpu-pro with 'legacy' OpenCL on Fiji GPUs successfully.

However, my problems lie with amdgpu-pro on Vega 10 and Vega 20 GPUs (Vega 64 and Radeon VII respectively). I have been running these GPUs with PAL OpenCL just fine for several months now, but with the switch to ROCr, all BOINC processing on GPUs fails. I have confirmed this with two separate projects - keeping in mind that GPU processing works just fine with PAL OpenCL.

Example 1:

Error: Failed to compile opencl source (from CL or HIP source to LLVM IR).

clBuildProgram() failed with error (-11) 

Error -11. Processing Aborted.

Example 2 (gfx900 = Vega 64):

boinc_get_opencl_ids returned [0x17d61c0 , 0x7f9070b41cd0] 
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183769
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]

Both of these error logs appear to point to problems with using OpenCL devices under ROCr: even though BOINC does detect the OpenCL devices correctly, the project applications apparently cannot work properly. I experienced this with both the November 2020 release of amdgpu-pro 20.45 (build 1164792) and the December 2020 release (build 1188099).
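For reference, the numeric codes in the two logs are standard OpenCL error codes: -11 is CL_BUILD_PROGRAM_FAILURE and -6 is CL_OUT_OF_HOST_MEMORY in the Khronos CL/cl.h header. The 2013 returned by initialize_ocl looks like an application-level code rather than an OpenCL one, so it is not decoded here. Below is a rough sketch of a helper that pulls the code out of log lines like those above and names it. The helper name, the regular expression, and the choice of which codes to include are all assumptions made for illustration; only the code-to-name mapping comes from the OpenCL header.

```python
import re

# Small subset of OpenCL error codes, as defined in the Khronos CL/cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}

# Matches "error (-11)", "(error: -6)" and "Error -11." as seen in the logs.
# Only negative codes are captured, so "error [2013]" is deliberately ignored.
_ERROR_CODE = re.compile(r"[Ee]rror:?\s*\(?(-\d+)")


def explain_opencl_error(line):
    """Return the OpenCL error name found in a log line, or None if there is none."""
    match = _ERROR_CODE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return OPENCL_ERRORS.get(code, "unknown OpenCL error %d" % code)


if __name__ == "__main__":
    log_lines = [
        "clBuildProgram() failed with error (-11) ",
        "Couldn't create OpenCL command queue (error: -6)!",
        "initialize_ocl returned error [2013]",
    ]
    for log_line in log_lines:
        print(log_line.strip(), "->", explain_opencl_error(log_line))
```

Knowing the names does not explain why ROCr fails where PAL works, but it does show that the two projects fail at different stages: one at kernel compilation, the other at command queue creation.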

I understand the aim of unifying software stacks and in principle agree it's a good idea. However, not at the expense of breaking existing functionality. One might argue that individual BOINC projects will have to adapt to using ROCr, but I think it would be better to work out why ROCr breaks the processing in the first place and tackle the problem at the root.

For the time being I am stuck with staying on amdgpu-pro 20.40 and because of aforementioned point 1, this also means I am stuck on an out-of-date Linux kernel. Obviously this is not a tenable long-term position - is there anyone who has had similar experience and/or can assist with BOINC GPU processing under ROCr?

Appreciate any help or assistance. Thanks.

PS I love amdgpu in general. In terms of stability and ease of use it's a big step up from the old fglrx days, so thanks to everyone involved with its development and maintenance. The fact that it has worked so well for me until now has me surprised that I'm currently having this issue.

1 Solution

It turns out that the struggles I was having with Vega20 were apparently due to a hardware fault with one of the cards - it seems it's not reliable but neither is it 100% broken. Unfortunate for me, but fortunate in terms of the software support situation. Closing this as resolved with the release of amdgpu 21.40.1.40501.

(This post kept being blocked as 'spam' for some reason - something wrong with the forum's filtering?)

(Attempting to post again for the umpteenth time...)

