After corresponding with a fellow BOINC user I decided to make a post here to highlight some difficulties I've been having - in particular, I was advised @bridgman may be able to assist. I made a search for existing posts that may be related to my specific issue, but I didn't find any.
Between the two of us we've established the following:
For point 1, this was resolved with amdgpu-pro 20.45.
For point 2, this appears only to be an issue for GPUs based on Vega and later, e.g. my BOINC comrade is only running Polaris GPUs under amdgpu-pro and thus can get by with the 'legacy' OpenCL installation. I have confirmed this by similarly running amdgpu-pro with 'legacy' OpenCL on Fiji GPUs successfully.
However, my problems lie with amdgpu-pro on Vega 10 and Vega 20 GPUs (Vega 64 and Radeon VII respectively). I have been running these GPUs with PAL OpenCL just fine for several months now, but with the switch to ROCr, all BOINC processing on GPUs fails. I have confirmed this with two separate projects - keeping in mind that GPU processing works just fine with PAL OpenCL.
Error: Failed to compile opencl source (from CL or HIP source to LLVM IR).
clBuildProgram() failed with error (-11)
Error -11. Processing Aborted.
Example 2 (gfx900 = Vega 64):
boinc_get_opencl_ids returned [0x17d61c0 , 0x7f9070b41cd0]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183769
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error 
Both of these error logs appear to point to problems with using OpenCL devices under ROCr: even though BOINC does detect the OpenCL devices correctly, the project applications apparently cannot run on them. I experienced this with both the November 2020 release of amdgpu-pro 20.45 (build 1164792) and the December 2020 release (build 1188099).
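For anyone reading these logs, the numeric codes can be mapped back to their standard OpenCL names. The following is only a rough sketch: the table is a small subset of the status codes defined in the OpenCL headers, and the helper names `describe_opencl_error` and `extract_error_codes` are made up for illustration (they are not part of BOINC or any AMD tool). The regular expression is an assumption tuned to the two log excerpts above and may not match other applications' output.

```python
import re

# Small subset of the status codes defined in the standard OpenCL header (CL/cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}


def describe_opencl_error(code: int) -> str:
    """Return the symbolic name for a code, or a placeholder if it is not in the subset."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error ({code})")


def extract_error_codes(log_text: str) -> list[int]:
    """Hypothetical helper: find negative numbers that directly follow the word 'error'.

    Duplicates are dropped while keeping first-seen order.
    """
    found = re.findall(r"error\W*(-\d+)", log_text, flags=re.IGNORECASE)
    return list(dict.fromkeys(int(code) for code in found))


if __name__ == "__main__":
    log = (
        "clBuildProgram() failed with error (-11) Error -11. Processing Aborted.\n"
        "Couldn't create OpenCL command queue (error: -6)!"
    )
    for code in extract_error_codes(log):
        print(code, describe_opencl_error(code))
```

Read this way, the first project fails while building its kernel source (CL_BUILD_PROGRAM_FAILURE) and the second fails when creating a command queue (CL_OUT_OF_HOST_MEMORY), so the two applications break at different stages of OpenCL setup.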
I understand the aim of unifying software stacks and in principle agree it's a good idea. However, not at the expense of breaking existing functionality. One might argue that individual BOINC projects will have to adapt to using ROCr, but I think it would be better to work out why ROCr breaks the processing in the first place and tackle the problem at the root.
For the time being I am stuck with staying on amdgpu-pro 20.40 and because of aforementioned point 1, this also means I am stuck on an out-of-date Linux kernel. Obviously this is not a tenable long-term position - is there anyone who has had similar experience and/or can assist with BOINC GPU processing under ROCr?
Appreciate any help or assistance. Thanks.
PS I love amdgpu in general, in terms of stability and ease of use it's a big step up from the old fglrx days, so thanks to everyone involved with its development and maintenance. The fact that it has worked so well for me until now has me surprised that I'm currently having this issue.
It turns out that the struggles I was having with Vega20 were apparently due to a hardware fault with one of the cards - it seems it's not reliable, but neither is it 100% broken. Unfortunate for me, but fortunate in terms of the software support situation. Closing this as resolved with release of amdgpu 220.127.116.11501.
(This post kept being blocked as 'spam' for some reason - something wrong with the forum's filtering?)
(Attempting to post again for the umpteenth time...)
Since this seems to be specifically an OpenCL issue, try posting your question at the AMD Forum OpenCL. You must first post here to get access to that forum: https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forum
Here are some current threads at AMD Forum OpenCL: https://community.amd.com/t5/opencl/bd-p/opencl-discussions
The one I circled seems to be something similar to your situation.
I am not familiar with your topic, but if ROCr is the same as ROCm then you need to post it at GitHub: https://github.com/RadeonOpenCompute/ROCm/issues
Thanks for the reply. The driver installation specifically said it would install ROCr in lieu of PAL OpenCL. I'm aware of ROCm, and the two seem related though I'm not entirely sure how. I have tried to use ROCm instead of amdgpu-pro once in the past but I can't recall if BOINC detected OpenCL devices correctly (if it did, then GPU processing would have failed in a similar manner otherwise I wouldn't have gone back to amdgpu-pro).
I'll try to get attention to my issue at the OpenCL forum. I read the post you referred me to and it's not clear if the user ever used amdgpu-pro as the discussion is specific to ROCm.
Thanks for letting us know about this.
It is basically the same code being used in "the ROCm stack" and AMDGPU-PRO with --opencl=rocr, although we did a lot of additional testing/fixing as part of using the code in AMDGPU-PRO and there is some additional visibility during the transition. I'll try to get this raised as an internal issue.
I didn't see any mention of the transition to ROCr in the release notes, and the installation instructions still discuss using --opencl=pal vs --opencl=rocr (completely understand that documentation often takes longer to update).
As I mentioned, I did try a ROCm installation quite some time ago (possibly early 2020) but didn't have any success with BOINC applications so I reverted to amdgpu-pro PAL OpenCL. If you're recording this as an internal (presumably developer) issue, is there anywhere I can track this, perhaps provide additional information and testing feedback, if required?
Thanks again to you and your team for your efforts. In the meantime I'll stay on amdgpu-pro 20.40 and await an update on this.
>I didn't see any mention of the transition to ROCr in the release notes, and the installation instructions still discuss using --opencl=pal vs --opencl=rocr (completely understand that documentation often takes longer to update).
Auggh... I know we already fixed that... sorry, don't know what happened there.
I don't know if we have a suitable customer-facing bug tracker but will try to find out.
EDIT - the documentation included in the downloaded package does properly mention the --opencl=rocr option, but it still is not very clear - it kinda suggests you can use either (--opencl=rocr,legacy) while in reality you use legacy for Polaris and earlier, rocr for Vega and later.
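To make that rule of thumb concrete, here is a minimal sketch that picks the `--opencl=` value from a list of GPU families. It assumes only what is stated in this thread (legacy for Polaris and earlier, rocr for Vega and later, and that the two can be combined as `rocr,legacy`). The function name `opencl_flag` and the family tables are hypothetical and cover only the families mentioned here; anything else is rejected rather than guessed.

```python
# Families named in this thread only; these tables are illustrative, not exhaustive.
LEGACY_FAMILIES = {"fiji", "polaris"}   # "Polaris and earlier" -> legacy
ROCR_FAMILIES = {"vega10", "vega20"}    # "Vega and later" -> rocr


def opencl_flag(families: list[str]) -> str:
    """Hypothetical helper: build the --opencl= option for the installed GPU families."""
    if not families:
        raise ValueError("no GPU families given")

    # Normalise names such as "Vega 20" to "vega20".
    normalised = [name.lower().replace(" ", "") for name in families]
    unknown = [n for n in normalised if n not in LEGACY_FAMILIES | ROCR_FAMILIES]
    if unknown:
        raise ValueError(f"family not covered by this sketch: {unknown}")

    backends = []
    if any(n in ROCR_FAMILIES for n in normalised):
        backends.append("rocr")
    if any(n in LEGACY_FAMILIES for n in normalised):
        backends.append("legacy")
    return "--opencl=" + ",".join(backends)


if __name__ == "__main__":
    print(opencl_flag(["Vega 20"]))          # Vega-only system
    print(opencl_flag(["Polaris"]))          # Polaris-only system
    print(opencl_flag(["Fiji", "Vega 10"]))  # mixed system
```

A mixed system ends up with both backends, which matches the `rocr,legacy` combination the packaged documentation mentions. Whether that combination is actually the right choice for a given setup is not something this sketch can confirm.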
With the edit referring to the internal documentation and choosing between rocr and legacy options, I just remembered something that may be relevant.
The way BOINC detects OpenCL devices for GPU processing isn't super-reliable (was very finicky in fglrx-era) and I recall when I moved to Vega GPUs that instead of just specifying --opencl=pal, I had to install both PAL and legacy options together: legacy for BOINC to detect the GPUs, and PAL to run OpenCL on Vega.
The reason I bring this up is that I still used the --opencl=pal,legacy option when installing for Vega 20, which internally becomes --opencl=rocr,legacy under amdgpu-pro 20.45. Legacy would allow BOINC to detect the OpenCL devices, but I assume ROCr would be used for the actual processing on Vega and later GPUs. I just wonder if there's some interconnect between PAL and legacy OpenCL that isn't present between ROCr and legacy OpenCL.
It looks like my original post - which contained the bulk of the information - was lost in the administrative move from Drivers and Software Support. The original post was to https://community.amd.com/t5/drivers-software/amdgpu-pro-20-45-rocr-vs-pal-opencl-breaks-boinc-gpu-p....
To summarise, I cannot run GPGPU applications via BOINC using amdgpu-pro 20.45, apparently because of the switch from PAL OpenCL to ROCr for Vega GPUs onwards (older GPUs run okay with 'legacy' OpenCL).
Reverting to amdgpu-pro 20.40 on Ubuntu means that I am stuck on an out-of-date Linux kernel for the time being.