5700XT Stability & Artifact Issues (Linux, OpenCL, BOINC)

I've been having stability problems with my 5700XT (VisionTek, blower-style) since I got it about 6 months ago.  I run very recent kernels (currently 5.9.x) with amdgpu plus only the compute part of amdgpu-pro, not the full amdgpu-pro stack.  I also run BOINC 24/7.

The primary symptom is artifacts that appear after anywhere from 12 to 96 hours of running.  Once that happens, the artifacts may persist on the desktop display or go away, but BOINC GPU work stalls or jobs error out continuously until I reboot.  I also sometimes get gnome-shell crashes and, very rarely, full system crashes.

I replaced my memory, water cooler, mainboard, and UPS before I finally accepted that it might be a temperature-related problem.  Once I started monitoring junction temperature, things improved quickly.  I wrote a script to throttle GPU computing when junction temperature gets too high.  That seems to have reduced the frequency of problems.  However, the recurrence rate has increased recently, and that correlates with a new BOINC workload.  I lower the temperature threshold a little each time I have a problem.
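For anyone wanting to try the same approach, here is a minimal sketch of such a throttle script. It is not the script described above. It assumes the card is `card0`, that amdgpu exposes a hwmon sensor labelled `junction` (values in millidegrees C), and that `boinccmd` is on the PATH and allowed to control the local client. The two thresholds and the hysteresis between them are illustrative assumptions, not recommended values.

```python
import glob
import subprocess
import time
from pathlib import Path

# Illustrative thresholds only; these are assumptions, not recommended values.
THROTTLE_AT_C = 95.0  # suspend GPU work at or above this junction temperature
RESUME_AT_C = 85.0    # resume only once the junction has cooled below this


def read_junction_temp_c(hwmon_dir):
    """Return the junction temperature in degrees C, or None if no such sensor."""
    for label_file in Path(hwmon_dir).glob("temp*_label"):
        if label_file.read_text().strip() == "junction":
            input_name = label_file.name.replace("_label", "_input")
            # hwmon reports temperatures in millidegrees Celsius.
            return int((label_file.parent / input_name).read_text()) / 1000.0
    return None


def should_suspend(temp_c, currently_suspended):
    """Decide the next state; between the two thresholds, keep the current one."""
    if temp_c >= THROTTLE_AT_C:
        return True
    if temp_c < RESUME_AT_C:
        return False
    return currently_suspended


def set_boinc_gpu(suspended):
    """Ask the local BOINC client to stop or resume GPU work."""
    mode = "never" if suspended else "auto"
    subprocess.run(["boinccmd", "--set_gpu_mode", mode], check=False)


if __name__ == "__main__":
    # Assumes the 5700XT is card0; adjust if the system has several GPUs.
    hwmon_dirs = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")
    if not hwmon_dirs:
        raise SystemExit("no amdgpu hwmon directory found")
    suspended = False
    while True:
        temp_c = read_junction_temp_c(hwmon_dirs[0])
        if temp_c is not None:
            wanted = should_suspend(temp_c, suspended)
            if wanted != suspended:
                set_boinc_gpu(wanted)
                suspended = wanted
        time.sleep(5)
```

The gap between the two thresholds is there so the script does not flip BOINC on and off every few seconds when the temperature hovers near the limit.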

Since this has persisted across multiple AMD OpenCL and amdgpu versions, I suspect it is not just a software stability problem.  However, it would be nice if gfx1010 were supported in the fully open source stack.

Does anyone with a 5700XT see this when they do BOINC or other GPU computing?

Do you think I'm on the right track with the temperature?

Reducing the junction temperature threshold at which I throttle GPU work has improved stability.  I've had only one video problem since then, and it was while gaming.  The evidence strongly points to a thermal problem.

However, I'm keeping the room quite cold.  The last crash occurred while ambient temperatures were below 20 C.  It's very cold; I'm slightly uncomfortable in the room.  I don't see any way I can improve the conditions.

Could there be something wrong with the blower or thermal contact?
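One way to gather evidence on that question is to log all the card's temperature sensors together with the fan speed and look at how they move under load. The sketch below assumes the amdgpu hwmon layout (`temp*_label` / `temp*_input` in millidegrees C, `fan1_input` in RPM) and that the card is `card0`; the column names are made up for illustration. What counts as a normal gap between edge and junction is not something this sketch can tell you, so treat the log as data to compare against other owners' numbers rather than as a verdict.

```python
import glob
import time
from pathlib import Path


def read_sensors(hwmon_dir):
    """Return a dict such as {"edge_c": 70.0, "junction_c": 93.0, "fan_rpm": 2100}."""
    hwmon = Path(hwmon_dir)
    readings = {}
    for label_file in sorted(hwmon.glob("temp*_label")):
        label = label_file.read_text().strip()
        input_file = hwmon / label_file.name.replace("_label", "_input")
        # Temperatures are reported in millidegrees Celsius.
        readings[label + "_c"] = int(input_file.read_text()) / 1000.0
    fan_file = hwmon / "fan1_input"
    if fan_file.exists():
        readings["fan_rpm"] = int(fan_file.read_text())
    return readings


def format_row(readings, keys):
    """Format one CSV row; sensors that are missing become empty fields."""
    return ",".join(str(readings.get(key, "")) for key in keys)


if __name__ == "__main__":
    # Assumes the card is card0; adjust on multi-GPU systems.
    hwmon_dirs = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")
    if not hwmon_dirs:
        raise SystemExit("no amdgpu hwmon directory found")
    columns = ["edge_c", "junction_c", "mem_c", "fan_rpm"]
    print("time," + ",".join(columns))
    while True:
        row = format_row(read_sensors(hwmon_dirs[0]), columns)
        print(time.strftime("%H:%M:%S") + "," + row, flush=True)
        time.sleep(10)
```

Redirecting the output to a file during a BOINC run gives a record to check, for example, whether fan RPM actually rises as junction temperature climbs.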


If you can test the card in a Windows PC, that would help show whether it does the same thing there.

You may just have a problem card, and should contact the manufacturer's support department for advice and a possible RMA.


I would if I could, but I don't see a way to make that happen.  It would take something like 4 weeks of continuous running to get a comparable result, which isn't realistic, even if I had a copy of Windows.

It's not clear what that would show, either.  It would only be relevant if the two drivers had different thermal profiles or thermal feedback controls.  Is that possible?  I think the thermal control is in the firmware, isn't it?

The evidence is pretty clear that it is not related to a driver bug, at least not above and beyond the typical quality of AMD drivers.  A driver bug could be a factor, but even if it is, it is dwarfed by the thermal issue.  So, unless the driver is responsible for thermal response, a Windows test doesn't seem diagnostic.  Perhaps you are saying you know that the driver does affect thermals.  If so, please let me know, as I'm not aware of that.  You could be right; I'm just guessing.

Regarding the support department, I'm pretty worried they'll ignore me.  It's not like it's under warranty.  But, perhaps that is worth a try.