After 3 or 4 months of tinkering, I think I have finally got a mostly stable card. I have been subject to the infamous black screen of death during various gaming loads, which forced a power cycle and restart with depressing frequency. The card would perform beautifully, right up until the world went black. Each time, the Windows Event Viewer showed Event 41 (unexpected power down).
Since it seems to have been a variety of different things, I figured I would document the settings and learnings a little, in case it helped anybody else.
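If you want to check whether you are seeing the same Event 41 entries, here is a minimal sketch that asks the built-in Windows `wevtutil` tool for the most recent ones. The function names are just ones I made up for illustration, and the provider filter assumes the entries come from the usual Microsoft-Windows-Kernel-Power source.

```python
import shutil
import subprocess

# XPath filter for Event ID 41 from the Kernel-Power provider (assumed source).
EVENT_41_QUERY = (
    "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] "
    "and (EventID=41)]]"
)


def build_event41_command(count=10):
    """Build a wevtutil command listing the newest `count` Event 41 entries."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{EVENT_41_QUERY}",
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # newest first
        "/f:text",       # human-readable output
    ]


def recent_event41(count=10):
    """Return wevtutil's text output, or None if wevtutil is not available."""
    if shutil.which("wevtutil") is None:
        return None  # not on Windows, or the tool is not on PATH
    result = subprocess.run(
        build_event41_command(count), capture_output=True, text=True
    )
    return result.stdout
```

Calling `print(recent_event41(5))` from a Windows prompt should list the last five unexpected shutdowns with their timestamps, which makes it easier to match them against gaming sessions.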
System, more or less:
1.) Found that enabling SAM (Smart Access Memory) made the crashes significantly worse. It's technically supported by the motherboard and card, but at least on my system, it causes a crash within a few minutes.
2.) A slightly higher voltage on the CPU VCCIO (1.25 V-ish or 1.3 V) and the SA voltage (around 1.3 V) seemed to make the system more stable. After an update to the latest BIOS (version 1402), the motherboard's auto settings seem to use higher voltages for both of these than older versions did, so both are set to Auto now.
3.) RAM speeds (XMP I, XMP II, no XMP) didn't seem to make any difference.
4.) CPU overclock settings (auto, Intel boost, no boost, MCE, no MCE): no discernible difference.
1.) Apparently this problem is often caused by a power supply OCP (over-current protection) trip, so I swapped out my Corsair power supply for a new Dark Power 1200 (which is beautiful, but a little pricey). That seems to have made the problem a little bit better, but did not resolve it completely. I still get trips.
1.) Running it with Radeon Chill and turning it down to between 60 and 100 FPS seemed to make the problem go away in *most* games. Perfectly playable, but a bit of a waste of a 6900 XT. Cyberpunk 2077 seems to be the worst culprit, and would crash even at a locked 60 FPS, with very low temperatures, and only about 100 W of consumed power.
2.) My particular card seems to have a default max frequency of 2509 MHz, which seems absurdly high for an air-cooled card. I found that I could get it mostly stable even without Radeon Chill by applying a slight undervolt and setting the max to 2340 MHz, which looked like the published Rage Mode max for that card. (Except in Cyberpunk. Which still died.)
3.) Temperatures overall looked fine, but the junction temp would get about 30 to 35 degrees above the primary GPU temp that controlled the fans, which seemed high. After some arguing with myself about how excited I was to take apart a spendy video card, I finally took the plunge and removed the backplate and cooler. The thermal paste was pretty dry and not super even, and the pad over the VRM was super dry and crappy, so I replaced the thermal pads with new ones and repasted the die with some decent Thermal Grizzly paste. Crossed fingers and reinstalled. About a 15 degree improvement overall, and far less difference between the controlling temp and the max temp. This alone seems to have made the card far more stable. Even at default speeds I wasn't getting crashes in most games afterward. (Still died in Cyberpunk tho. Sigh.)
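If you log your temps with a monitoring tool, the gap I am talking about is just junction temp minus the fan-controlling temp for each sample. A rough sketch of that arithmetic is below; the function names and the sample readings are made up for illustration and are not measurements from my card.

```python
from statistics import mean


def junction_deltas(samples):
    """Return junction-minus-edge differences for (edge_c, junction_c) pairs."""
    return [junction - edge for edge, junction in samples]


def summarize(samples):
    """Summarize how far the junction temp runs above the edge temp."""
    deltas = junction_deltas(samples)
    return {"mean_delta": mean(deltas), "max_delta": max(deltas)}


# Hypothetical readings in degrees Celsius: (edge temp, junction temp).
example = [(60, 92), (62, 95), (65, 100)]
summary = summarize(example)
print(f"mean gap: {summary['mean_delta']:.1f} C, max gap: {summary['max_delta']} C")
```

Comparing the summary before and after a repaste gives a single number to judge whether the work actually helped.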
Normally I run two cheap 1080p monitors in addition to my gaming monitor for office work. Today I turned those two off and ran Cyberpunk for a couple of hours straight with just the one monitor connected, with no crashes. I am not sure I understand the mechanism for this, or if I just got lucky today. Usually it would take less than 10 minutes to crash Cyberpunk tho, so I am hopeful it repeats. Anybody have a similar experience or know what could be going on?
So overall I couldn't get stable performance until I did *all* of the following:
1.) Disabled SAM, at least on Intel with my hardware. Hopefully that comes around with better BIOS and driver support. We'll see, I guess.
2.) Got a wompin' big power supply. The 1000 watt recommendation is no joke. Lord knows what the 7000 series will pull.
3.) Latest motherboard BIOS, for the higher CPU voltages (and who knows what else).
4.) Clean, re-apply thermal paste, and replace thermal pads. (Your mileage may vary. Warranty risks and all the rest apply. A little scary.) The ASUS factory application was pretty bad in my case, but it looked like the worst part was the VRM thermal pads, which were pretty much dried to powder.
5.) Turn off all other monitors? I would love to hear if anybody else has seen this. Seems odd to me. Might just be that Cyberpunk is still a buggy mess.
I maybe should just RMA it, but it's been a learning experience, if nothing else, and hopefully it can help somebody.
Good luck all!