I've been having stability problems with my 5700XT (VisionTek, blower-style) since I got it about 6 months ago. I run very recent kernels (currently 5.9.x) with amdgpu plus only the compute (OpenCL) part of amdgpu-pro, not the full amdgpu-pro stack. I also run BOINC 24/7.
The primary symptom is artifacts that appear after anywhere from 12 to 96 hours of running. Once that happens, the artifacts may persist on the desktop display or go away, but BOINC GPU work stalls or jobs continuously error until I reboot. I also sometimes get gnome-shell crashes and, very rarely, full system crashes.
I replaced my memory, water cooler, mainboard, and UPS before I finally accepted that it might be a temperature-related problem. Once I started monitoring junction temperature, things improved quickly. I wrote a script to throttle GPU computing when the junction temperature gets too high, and that seems to have reduced the frequency of problems. However, the recurrence rate has increased recently, and that correlates with a new BOINC workload. I'm lowering the temperature threshold a bit each time I have a problem.
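For anyone who wants to try the same approach, here is a rough sketch of what such a throttle script could look like. It is not my exact script. It assumes the amdgpu hwmon interface exposes a sensor labelled "junction" under /sys/class/drm/card*/device/hwmon/, and that boinccmd is on the PATH and allowed to talk to the local client (it may need to be run from the BOINC data directory to pick up the RPC password). The two thresholds and all function names are placeholders I made up for illustration; tune the numbers for your own card.

```python
import glob
import os
import subprocess
import time

# Placeholder thresholds in degrees Celsius (assumptions, not recommendations).
THROTTLE_AT_C = 95.0  # suspend GPU work at or above this junction temperature
RESUME_AT_C = 85.0    # resume GPU work at or below this junction temperature


def find_junction_input(hwmon_glob="/sys/class/drm/card*/device/hwmon/hwmon*"):
    """Return the temp*_input path whose label reads 'junction', or None."""
    for hwmon_dir in sorted(glob.glob(hwmon_glob)):
        for label_path in sorted(glob.glob(os.path.join(hwmon_dir, "temp*_label"))):
            with open(label_path) as f:
                if f.read().strip() == "junction":
                    return label_path[: -len("_label")] + "_input"
    return None


def read_celsius(input_path):
    """hwmon reports temperatures in millidegrees Celsius."""
    with open(input_path) as f:
        return int(f.read().strip()) / 1000.0


def next_state(suspended, temp_c, throttle_at=THROTTLE_AT_C, resume_at=RESUME_AT_C):
    """Hysteresis: suspend when hot, resume only after cooling below resume_at."""
    if temp_c >= throttle_at:
        return True
    if temp_c <= resume_at:
        return False
    return suspended  # between the thresholds, keep the current state


def set_boinc_gpu(suspended):
    """Ask the BOINC client to stop or resume GPU work."""
    mode = "never" if suspended else "auto"
    subprocess.run(["boinccmd", "--set_gpu_mode", mode], check=False)


def main(poll_seconds=5):
    """Poll the junction sensor forever and toggle BOINC GPU work as needed."""
    path = find_junction_input()
    if path is None:
        raise SystemExit("no sensor labelled 'junction' found")
    suspended = False
    while True:
        wanted = next_state(suspended, read_celsius(path))
        if wanted != suspended:
            set_boinc_gpu(wanted)
            suspended = wanted
        time.sleep(poll_seconds)
```

Calling main() starts the polling loop. The gap between the two thresholds is there so the GPU work does not flap on and off when the temperature hovers around a single cutoff.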
Since this has persisted across multiple AMD OpenCL and amdgpu versions, I suspect it is not just a software stability problem. However, it would be nice if gfx1010 were supported in the fully open source stack.
Does anyone with a 5700XT see this when they do BOINC or other GPU computing?
Do you think I'm on the right track with the temperature?