Hello,
I admin four Supermicro EPYC machines (dual 7F72 or dual 7252) that are about 3 years old.
They are now plagued by spontaneous reboots that leave no error message in the OS logs (CentOS 7.7).
Nothing was changed in the environment recently and the machines share the same networks and storage space with other machines that do not have any issues.
The support case with Supermicro is progressing slowly.
The only thing the EPYC machines have in common is ECC memory warnings in the BMC log at the time of each reboot. The DIMMs are Hynix 3200 MHz DDR4 in 16 GB modules.
When a machine reboots, all 16 modules report warnings on the 7F72 machines, and 2 out of 8 report warnings on the 7252 machines (P2 DIMM C1 and E1).
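To cross-check the BMC warnings from the OS side, the corrected and uncorrected error counters could be read from the Linux EDAC sysfs tree. This is only a minimal sketch: it assumes an EDAC driver is loaded on these machines (which I have not confirmed for this kernel) and that the counters sit under /sys/devices/system/edac/mc. The function name read_edac_counts is made up for illustration.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    Assumes the standard EDAC sysfs layout (mc0, mc1, ...). A counter
    that cannot be read or parsed is reported as None.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as fh:
                    entry[name] = int(fh.read().strip())
            except (IOError, OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    # Prints nothing if no EDAC memory controllers are exposed.
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Logging this periodically (for example from cron) would show whether corrected errors accumulate before a reboot, since the counters reset when the machine restarts.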
Questions:
* Do you know of anyone who has experienced something similar?
* Hynix acknowledged some manufacturing issues around that time, according to The Register (https://www.theregister.com/2021/06/09/sk_hynix_dram_defects/ ). Would there be any way to check whether that could be the cause?
* Any other ideas?
Thanks for any input!
Philippe