Wedge009
Adept II

amdgpu-pro 20.45: ROCr vs PAL OpenCL breaks BOINC GPU processing

Greetings.

After corresponding with a fellow BOINC user I decided to make a post here to highlight some difficulties I've been having - in particular, I was advised @bridgman may be able to assist. I made a search for existing posts that may be related to my specific issue, but I didn't find any.

Between the two of us we've established the following:

  1. amdgpu-pro 20.40 is incompatible with Ubuntu Linux kernels after 5.4.0-54.
  2. amdgpu-pro 20.45 displaces PAL OpenCL in favour of a ROCr implementation.

For point 1, this was resolved with amdgpu-pro 20.45.

For point 2, this appears to be an issue only for GPUs based on Vega and later. For example, my BOINC comrade runs only Polaris GPUs under amdgpu-pro and thus can get by with the 'legacy' OpenCL installation. I have confirmed this by similarly running amdgpu-pro with 'legacy' OpenCL on Fiji GPUs successfully.

However, my problems lie with amdgpu-pro on Vega 10 and Vega 20 GPUs (Vega 64 and Radeon VII respectively). I have been running these GPUs with PAL OpenCL just fine for several months now, but with the switch to ROCr, all BOINC processing on GPUs fails. I have confirmed this with two separate projects, and in both cases GPU processing works just fine with PAL OpenCL.

Example 1:

Error: Failed to compile opencl source (from CL or HIP source to LLVM IR).

clBuildProgram() failed with error (-11) 

Error -11. Processing Aborted.

Example 2 (gfx900 = Vega 64):

boinc_get_opencl_ids returned [0x17d61c0 , 0x7f9070b41cd0] 
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183769
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]

Both error logs point to problems using OpenCL devices under ROCr: although BOINC detects the OpenCL devices correctly, the project applications apparently cannot run on them. I experienced this with both the November 2020 release of amdgpu-pro 20.45 (build 1164792) and the December 2020 release (build 1188099).
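For reference, the numeric codes in the two logs map to standard OpenCL error names from the Khronos CL/cl.h header: -11 is CL_BUILD_PROGRAM_FAILURE and -6 is CL_OUT_OF_HOST_MEMORY. The value 2013 returned by initialize_ocl is specific to the project application and is not covered here. Below is a minimal Python sketch of such a lookup; the helper name describe_cl_error is made up for illustration and only a subset of codes is listed.

```python
# Subset of OpenCL error codes as defined in the Khronos CL/cl.h header.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for an OpenCL error code (hypothetical helper)."""
    return CL_ERRORS.get(code, f"unknown OpenCL error ({code})")


if __name__ == "__main__":
    # The two codes reported in the logs above.
    for code in (-11, -6):
        print(code, describe_cl_error(code))
```

The names only say which API call failed and how; they do not by themselves explain why the ROCr runtime rejects a kernel build or a command queue that PAL accepts.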

I understand the aim of unifying software stacks and in principle agree it's a good idea. However, not at the expense of breaking existing functionality. One might argue that individual BOINC projects will have to adapt to using ROCr, but I think it would be better to work out why ROCr breaks the processing in the first place and tackle the problem at the root.

For the time being I am stuck with staying on amdgpu-pro 20.40 and because of aforementioned point 1, this also means I am stuck on an out-of-date Linux kernel. Obviously this is not a tenable long-term position - is there anyone who has had similar experience and/or can assist with BOINC GPU processing under ROCr?

Appreciate any help or assistance. Thanks.

PS I love amdgpu in general, in terms of stability and ease of use it's a big step up from the old fglrx days, so thanks to everyone involved with its development and maintenance. The fact that it has worked so well for me until now has me surprised that I'm currently having this issue.

1 Solution

It turns out that the struggles I was having with Vega20 were apparently due to a hardware fault with one of the cards - it seems it's not reliable but neither is it 100% broken. Unfortunate for me, but fortunate in terms of the software support situation. Closing this as resolved with release of amdgpu 21.40.1.40501.

(This post kept being blocked as 'spam' for some reason - something wrong with the forum's filtering?)

(Attempting to post again for the umpteenth time...)

19 Replies

Since this seems to be specifically an OpenCL issue, try posting your question at the AMD OpenCL forum. You must first post here to get access to that forum: https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forum

Here are some current threads at AMD Forum OpenCL: https://community.amd.com/t5/opencl/bd-p/opencl-discussions

Screenshot 2021-02-20 105920.png

The one I circled seems to be something similar to your situation.

I am not familiar with your topic, but if ROCr is the same as ROCm then you need to post it on GitHub: https://github.com/RadeonOpenCompute/ROCm/issues


Wedge009
Adept II

Thanks for the reply. The driver installation specifically said it would install ROCr in lieu of PAL OpenCL. I'm aware of ROCm, and the two seem related though I'm not entirely sure how. I have tried to use ROCm instead of amdgpu-pro once in the past but I can't recall if BOINC detected OpenCL devices correctly (if it did, then GPU processing must have failed in a similar manner, otherwise I wouldn't have gone back to amdgpu-pro).

I'll try to get attention to my issue at the OpenCL forum. I read the post you referred me to and it's not clear if the user ever used amdgpu-pro as the discussion is specific to ROCm.

Thanks for letting us know about this.

It is basically the same code being used in "the ROCm stack" and AMDGPU-PRO with --opencl=rocr, although we did a lot of additional testing/fixing as part of using the code in AMDGPU-PRO and there is some additional visibility during the transition. I'll try to get this raised as an internal issue.

Thanks, bridgman.

I didn't see any mention of the transition to ROCr in the release notes, and the installation instructions still discuss using --opencl=pal vs --opencl=rocr (completely understand that documentation often takes longer to update).

As I mentioned, I did try a ROCm installation quite some time ago (possibly early 2020) but didn't have any success with BOINC applications so I reverted to amdgpu-pro PAL OpenCL. If you're recording this as an internal (presumably developer) issue, is there anywhere I can track this, perhaps provide additional information and testing feedback, if required?

Thanks again to you and your team for your efforts. In the meantime I'll stay on amdgpu-pro 20.40 and await an update on this.

>I didn't see any mention of the transition to ROCr in the release notes, and the installation instructions still discuss using --opencl=pal vs --opencl=rocr (completely understand that documentation often takes longer to update).

Auggh... I know we already fixed that... sorry, don't know what happened there.

I don't know if we have a suitable customer-facing bug tracker but will try to find out.

EDIT - the documentation included in the downloaded package does properly mention the --opencl=rocr option, but it still is not very clear - it kinda suggests you can use either (--opencl=rocr,legacy) while in reality you use legacy for Polaris and earlier, rocr for Vega and later.
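The rule of thumb in the edit above (legacy for Polaris and earlier, rocr for Vega and later) can be written down as a small lookup. This is only a sketch: the function name opencl_option is hypothetical, the family list is limited to GPUs named in this thread, and placing Navi 10 under rocr is an assumption based on "Vega and later".

```python
# GPU families mentioned in this thread, split according to the rule of thumb
# "legacy for Polaris and earlier, rocr for Vega and later".
LEGACY_FAMILIES = {"fiji", "polaris"}
ROCR_FAMILIES = {"vega10", "vega20", "navi10"}  # navi10 placement is an assumption


def opencl_option(family: str) -> str:
    """Return the installer --opencl option for a GPU family (hypothetical helper)."""
    key = family.lower().replace(" ", "")
    if key in LEGACY_FAMILIES:
        return "--opencl=legacy"
    if key in ROCR_FAMILIES:
        return "--opencl=rocr"
    raise ValueError(f"unknown GPU family: {family!r}")


if __name__ == "__main__":
    for name in ("Fiji", "Polaris", "Vega 10", "Vega 20"):
        print(name, "->", opencl_option(name))
```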

No worries, I'll just keep watch for new releases and do my testing then.

Thanks again.

With the edit referring to the internal documentation and choosing between rocr and legacy options, I just remembered something that may be relevant.

The way BOINC detects OpenCL devices for GPU processing isn't super-reliable (was very finicky in fglrx-era) and I recall when I moved to Vega GPUs that instead of just specifying --opencl=pal, I had to install both PAL and legacy options together: legacy for BOINC to detect the GPUs, and PAL to run OpenCL on Vega.

The reason I bring this up is because I still used the --opencl=pal,legacy option when installing for Vega 20, which internally becomes --opencl=rocr,legacy under amdgpu-pro 20.45. Legacy would allow BOINC to detect the OpenCL devices, but I assume ROCr would be used for the actual processing on Vega and later GPUs. I just wonder if there's some interconnect between PAL and legacy OpenCL that isn't present between ROCr and legacy OpenCL.

It looks like my original post - which contained the bulk of the information - was lost in the administrative move from Drivers and Software Support. The original post was to https://community.amd.com/t5/drivers-software/amdgpu-pro-20-45-rocr-vs-pal-opencl-breaks-boinc-gpu-p....

To summarise, I cannot run GPGPU applications via BOINC using amdgpu-pro 20.45 apparently because of the switch from PAL OpenCL to ROCr for Vega GPUs onwards (older GPUs run okay with 'legacy' OpenCL).

Reverting to amdgpu-pro 20.40 on Ubuntu Linux Kernels means that I am stuck on an out-of-date kernel for the time being.

Hi @Wedge009 ,

I have whitelisted you and moved the post to the OpenCL forum.

(P.S. I mistakenly moved only part of the message; I have now merged it with the original post.)

Thanks.


A minor update.

I discovered that a user with Navi 10 GPU is apparently running BOINC okay with ROCr-based OpenCL from amdgpu-pro 20.45.

https://einsteinathome.org/goto/comment/184228

The difference between that set-up and mine is that the user has Navi 10 instead of Vega 20 GPU. I doubt it's relevant, but they are also running a mixed set-up, with both ROCr and 'legacy' OpenCL installed.

The aforementioned user didn't appear to have success with ROCr-based OpenCL after all.

I have just tested with Ubuntu kernel 5.8.0-45 with amdgpu-pro 20.50 and there's no apparent change with ROCr-based OpenCL. I gather the 20.50 release was focused more on supporting the newly-released Radeon 6700 XT anyway.

Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx906" by: Advanced Micro Devices, Inc.
Max allocation limit: 14360458035
Global mem size: 17163091968
OpenCL device has FP64 support
...
Warning:  Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).
...
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx906+sram-ecc" by: Advanced Micro Devices, Inc.
Max allocation limit: 14588628168
Global mem size: 17163091968
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]
OCL context null
OCL queue null
Error generating generic FFT context object [5]

To clarify, this is what is reported with ROCr-based OpenCL:

OpenCL: AMD/ATI GPU 0: Vega 20 [Radeon VII] (driver version 3224.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 13832 GFLOPS peak)
OpenCL: AMD/ATI GPU 1: Vega 20 [Radeon VII] (driver version 3224.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 13832 GFLOPS peak)

I've since reverted to kernel 5.4.0-54 with amdgpu-pro 20.40 in order to get PAL-based OpenCL back:

OpenCL: AMD/ATI GPU 0: AMD Radeon VII (driver version 3180.7 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (3180.7), 16368MB, 16368MB available, 13832 GFLOPS peak)
OpenCL: AMD/ATI GPU 1: AMD Radeon VII (driver version 3180.7 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (3180.7), 16368MB, 16368MB available, 13832 GFLOPS peak)
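The two sets of BOINC detection lines differ in the tag that follows the driver version: (HSA1.1,LC) under ROCr and (PAL,HSAIL) under PAL. A minimal Python sketch for telling them apart is below. The function name opencl_backend is hypothetical, and the rule "first tag is PAL, or starts with HSA" is an assumption drawn only from the sample lines in this thread, not from any documented format.

```python
import re

# Matches e.g. "driver version 3224.0 (HSA1.1,LC)" in a BOINC detection line.
DRIVER_RE = re.compile(r"driver version ([\d.]+) \(([^)]+)\)")


def opencl_backend(line: str):
    """Return (driver_version, backend) for a BOINC OpenCL line, or None if no match."""
    match = DRIVER_RE.search(line)
    if match is None:
        return None
    version = match.group(1)
    runtime = match.group(2).split(",")[0]  # first tag names the runtime layer
    if runtime == "PAL":
        backend = "PAL"
    elif runtime.startswith("HSA"):
        backend = "ROCr"
    else:
        backend = "unknown"
    return version, backend


if __name__ == "__main__":
    sample = ("OpenCL: AMD/ATI GPU 0: Vega 20 [Radeon VII] (driver version 3224.0 "
              "(HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, "
              "13832 GFLOPS peak)")
    print(opencl_backend(sample))
```

Only the first tag is inspected because the second PAL tag, HSAIL, also begins with "HSA" and would otherwise be misread as ROCr.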

Is there no possibility of bringing back PAL-based OpenCL, even as a 'legacy' option?

Reporting no change in situation with recent amdgpu-pro 21.10 release: ROCr-based OpenCL is still not compatible with BOINC-based GPU processing. Tested on Ubuntu kernel 5.8.0-50.

Reporting for the recently released amdgpu-pro 21.20. A bit of a surprise, since the release notes show the main change was only to add support for RHEL and SLED. The application I'm running is not crashing at start of a GPU-based task as it did for amdgpu-pro 20.45 through to 21.10 (inclusive), where ROCr-based OpenCL was enforced.

However, it was too early to celebrate, as it appears that instead of crashing the application was just stuck in the initialisation phase of a job. With the old PAL-based OpenCL a single job would complete in around 3-4 minutes. After 35 minutes, the task hadn't progressed past initialisation at all, so I gave up and reverted to amdgpu-pro 20.40 yet again.

Still not at a stage where I could consider ROCr a functional replacement for PAL-based OpenCL, but I suppose being stuck is a marginal improvement over immediate crashing. Or, from the perspective of responsiveness, a stall could be considered worse than an immediate crash.

Tested on Ubuntu 20.04.2 hwe kernel 5.8.0-55.

Reporting no apparent difference with amdgpu-pro 21.30 (which only appears to be updated for Ubuntu 20.04.3). That is, ROCr-based OpenCL still stalls (though not crashes) any attempts to use OpenCL under BOINC.

Still reverting to amdgpu-pro 20.40 as the last PAL-based OpenCL package.

At long last I managed to get BOINC running with ROCm (instead of amdgpu-pro). Specifically ROCm 4.3 on Vega10 with Ubuntu kernel 5.11.0-27.

In order to get clinfo (and BOINC) to recognise the GPU I had to manually edit /etc/OpenCL/vendors/amdocl64_40300.icd to contain the absolute path of /opt/rocm/opencl/lib/libamdocl64.so (setting LD_LIBRARY_PATH=/opt/rocm/opencl/lib was not sufficient).
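The OpenCL ICD loader on Linux reads each .icd file in /etc/OpenCL/vendors and loads the library named inside it, which is why pointing the file at an absolute path helps when the library is not on the loader's default search path. A minimal Python sketch of the manual edit described above follows. The function name write_icd is hypothetical; the file name and library path in the usage comment are the ones from this post, and writing to /etc requires root.

```python
from pathlib import Path


def write_icd(vendors_dir: str, icd_name: str, library_path: str) -> Path:
    """Write an ICD file containing one absolute library path (hypothetical helper)."""
    if not Path(library_path).is_absolute():
        raise ValueError("library_path must be an absolute path")
    icd_file = Path(vendors_dir) / icd_name
    # An ICD file is plain text: a single line naming the vendor library.
    icd_file.write_text(f"{library_path}\n")
    return icd_file


# Usage matching the post above (run as root):
# write_icd("/etc/OpenCL/vendors", "amdocl64_40300.icd",
#           "/opt/rocm/opencl/lib/libamdocl64.so")
```

After such a change, clinfo should list the platform if the library itself loads correctly.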

Unfortunately, my experience seems to match what I get with the ROCr-based OpenCL in amdgpu-pro 21.20 and 21.30: stalled execution, no actual GPU usage. I've seen reports that ROCm works okay on Polaris GPUs, but I don't have any of those...

I recently heard of a user who has successfully got BOINC working. They have the same Threadripper 3960X CPU and the same Radeon VII GPU; the difference is that they are using Arch Linux with self-compiled ROCm 4.3.1, while I'm using Ubuntu without success with amdgpu-pro 21.30 or official ROCm 4.3.0 packages. I've yet to get any feedback on how they got their ROCm set-up working but I wonder if there's an issue somewhere in the Ubuntu packaging for amdgpu-pro and/or ROCm...

What a pleasant surprise. I got ROCr-based OpenCL from amdgpu-install 21.40.1 running successfully on one host with Vega10, no hacks or work-arounds needed. All I did was remove amdgpu-pro 20.40, update to current HWE kernel, and run

amdgpu-install --opencl=rocr

(with reboots in between).

Hopefully that bodes well for my other two hosts (one with Vega10, the other with Vega20) still stuck on kernel 5.4. I'll be sure to mark this as solved once I get around to updating them, if successful.
