
Issues with Samsung SSDs on Epyc

Question asked by shusky2812 on May 25, 2020
Latest reply on Jun 30, 2020 by hardcoregames™

As requested in the previous email conversation with AMD technical support and Hanaro/Samsung, I am creating this forum thread.

 

We are experiencing severe issues with AMD Epyc CPUs in combination with Supermicro mainboards and Samsung 860 series SSDs. According to Samsung's EU support, provided by Hanaro Europe B.V., the issue is due to AMD's refusal to meet S-ATA requirements in the manufacture of its motherboards. According to Hanaro, starting with the 860 SSD series (which includes the 860 EVO we use), Samsung implemented a "feature" that generates error messages on AMD controllers to show that the AMD S-ATA ports do not fulfill the S-ATA requirements.

 

Hardware Setup:

1)

CPU: 1x AMD EPYC 7401P
RAM: 8x 32 GB Samsung M393A2K40CB2-CTD
MB: Supermicro H11SSL-i Rev 1.0
SSD: Samsung MZ-76E1T0B/EU


Mainboard / IPMI Firmware:
Firmware Revision: 01.39
Firmware Build Time: 10/09/2018
BIOS Version: 1.0c
BIOS Build Time: 10/04/2018
Redfish Version: 1.0.1
CPLD Version: 02.b1.00
AGESA: 1.0.0.9 -  AMI CRB_019


OS - Kernel: Debian 9 - 5.0.17

 

LSPCI S-ATA-Controller:
07:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
42:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
62:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

 

2)
CPU: 2x AMD EPYC 7282
RAM: 16x 32 GB Samsung M393A4K40CB2-CTD
MB: Supermicro H11DSi-NT Rev 2.0
SSD: Samsung MZ-76E2T0B/EU

 

Mainboard / IPMI Firmware:
Firmware Revision: 01.52.00
Firmware Build Time: 11/18/2019
BIOS Version: 2.1
BIOS Build Time: 02/21/2020
Redfish Version: 1.0.1
CPLD Version: 04.00.14
AGESA: 1.0.0.5 - 5.14_RomeCrb_0ACMK013

 

OS - Kernel: Debian 10 - 5.3.13 / 5.4.41

 

LSPCI S-ATA-Controller:

23:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
24:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
46:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
47:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

 

During some higher I/O workloads, the following kernel messages are logged:

May 17 06:54:41 m13970 kernel: [364273.528676] ata7: hard resetting link
May 17 06:54:42 m13970 kernel: [364274.008322] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 17 06:54:42 m13970 kernel: [364274.008615] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.011645] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.014371] ata7.00: configured for UDMA/133
May 17 06:54:42 m13970 kernel: [364274.014384] sd 6:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014386] sd 6:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014387] sd 6:0:0:0: [sdb] tag#14 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014389] sd 6:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 35 e5 e9 90 00 00 78 00
May 17 06:54:42 m13970 kernel: [364274.014461] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981621760 size=61440 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014471] sd 6:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014472] sd 6:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014473] sd 6:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014476] sd 6:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 35 e5 e8 98 00 00 f8 00
May 17 06:54:42 m13970 kernel: [364274.014523] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981494784 size=126976 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014532] ata7: EH complete
May 17 06:54:42 m13970 kernel: [364274.014602] ata7.00: Enabling discard_zeroes_data

-> This causes SMART attribute 199 (CRC errors) to increase on the SSD and causes problems with the ZFS RAID, which ejects the SSD because of the errors logged above.
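For anyone reading the log above, the following is a minimal sketch (not part of the original report) that decodes the two `CDB: Read(10)` lines. The function name `decode_rw10` is hypothetical, and the 512-byte logical block size is an assumption about how the drive is addressed. It shows that both failed commands are reads, even though the reported sense text is "Unaligned write command", and that the decoded transfer sizes match the `size=` values in the corresponding `zio` lines.

```python
def decode_rw10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex bytes.

    block_size is an assumption (512-byte logical sectors).
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"unsupported opcode 0x{cdb[0]:02x}")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {
        "op": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


# The two CDBs from the kernel log (tag#14 and tag#31).
first = decode_rw10("28 00 35 e5 e9 90 00 00 78 00")
second = decode_rw10("28 00 35 e5 e8 98 00 00 f8 00")

print(first["op"], first["blocks"], first["bytes"])     # READ(10) 120 61440
print(second["op"], second["blocks"], second["bytes"])  # READ(10) 248 126976
```

The byte counts 61440 and 126976 are the same as the `size=` fields ZFS logs for the two failed I/Os, so the errors ZFS sees correspond directly to these two read commands.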

 

The issue has been known for more than 1.5 years but is still not resolved:

201693 – Samsung 860 EVO NCQ Issue with AMD SATA Controller 

 

Because disabling NCQ is incompatible with the ZFS RAID we use, we also cannot apply the workaround of setting the queue depth for those devices to 1.
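For readers unfamiliar with that workaround, here is a rough sketch of what setting the queue depth to 1 usually means on Linux: writing to the device's `queue_depth` attribute in sysfs. The helper name `set_queue_depth` and the `sysfs_root` parameter are hypothetical and only there for illustration. Writing under the real `/sys` needs root, and the setting does not survive a reboot.

```python
from pathlib import Path


def set_queue_depth(device: str, depth: int = 1, sysfs_root: str = "/sys") -> int:
    """Set the queue depth of a SCSI/SATA block device and return the old value.

    device is a kernel name such as "sdb"; sysfs_root is a hypothetical
    parameter so the function can be pointed at a test directory.
    """
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    previous = int(path.read_text().strip())  # current depth, e.g. 32 with NCQ
    path.write_text(f"{depth}\n")             # depth 1 means no queued commands
    return previous
```

Calling `set_queue_depth("sdb")` as root would apply this to the `sdb` device from the log. As stated above, this is not an option in our setup.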

 

We never experienced any issues with these SSDs on our previously used Intel platforms. It therefore appears to be an issue caused by AMD, and we need it resolved soon so that we can roll out the new AMD platform in our data centers.
