Server Gurus Discussions

shusky2812
Journeyman III

Issues with Samsung SSDs on Epyc

As requested in our previous email conversation with the technical support of AMD and Hanaro/Samsung, I am creating this forum thread.

We are experiencing severe issues with AMD Epyc CPUs in combination with Supermicro mainboards and Samsung 860 series SSDs. According to Samsung's EU support, provided by Hanaro Europe B.V., the issue is due to AMD's refusal to meet the S-ATA requirements in the manufacture of its motherboards. According to Hanaro, starting with the 860 SSD series, which includes the 860 EVO we use, Samsung implemented a "feature" that generates error messages on AMD controllers to show that the AMD S-ATA ports do not fulfill the S-ATA requirements.

Hardware Setup:

1)

CPU: 1x AMD EPYC 7401P
RAM: 8x 32 GB Samsung M393A2K40CB2-CTD
MB: Supermicro H11SSL-i Rev 1.0
SSD: Samsung MZ-76E1T0B/EU


Mainboard / IPMI Firmware:
Firmware Revision: 01.39
Firmware Build Time: 10/09/2018
BIOS Version: 1.0c
BIOS Build Time: 10/04/2018
Redfish Version: 1.0.1
CPLD Version: 02.b1.00
AGESA: 1.0.0.9 -  AMI CRB_019


OS - Kernel: Debian 9 - 5.0.17

LSPCI S-ATA-Controller:
07:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
42:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
62:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

2)
CPU: 2x AMD EPYC 7282
RAM: 16x 32 GB Samsung M393A4K40CB2-CTD
MB: Supermicro H11DSi-NT Rev 2.0
SSD: Samsung MZ-76E2T0B/EU

Mainboard / IPMI Firmware:
Firmware Revision: 01.52.00
Firmware Build Time: 11/18/2019
BIOS Version: 2.1
BIOS Build Time: 02/21/2020
Redfish Version: 1.0.1
CPLD Version: 04.00.14
AGESA: 1.0.0.5 - 5.14_RomeCrb_0ACMK013

OS - Kernel: Debian 10 - 5.3.13 / 5.4.41

LSPCI S-ATA-Controller:

23:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
24:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
46:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
47:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

Under heavier I/O workloads, the following kernel messages are logged:

May 17 06:54:41 m13970 kernel: [364273.528676] ata7: hard resetting link
May 17 06:54:42 m13970 kernel: [364274.008322] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 17 06:54:42 m13970 kernel: [364274.008615] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.011645] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.014371] ata7.00: configured for UDMA/133
May 17 06:54:42 m13970 kernel: [364274.014384] sd 6:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014386] sd 6:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014387] sd 6:0:0:0: [sdb] tag#14 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014389] sd 6:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 35 e5 e9 90 00 00 78 00
May 17 06:54:42 m13970 kernel: [364274.014461] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981621760 size=61440 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014471] sd 6:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014472] sd 6:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014473] sd 6:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014476] sd 6:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 35 e5 e8 98 00 00 f8 00
May 17 06:54:42 m13970 kernel: [364274.014523] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981494784 size=126976 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014532] ata7: EH complete
May 17 06:54:42 m13970 kernel: [364274.014602] ata7.00: Enabling discard_zeroes_data

-> These errors increment SMART attribute 199 (CRC errors) on the SSD and cause problems with the ZFS RAID, which ejects the SSD because of the errors logged above.
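
As a rough cross-check of the log excerpt above, the two Read(10) CDBs can be decoded by hand: bytes 2 to 5 hold the start LBA and bytes 7 and 8 the transfer length in sectors. The sketch below does this in POSIX shell. The function name decode_read10 is made up for illustration, and it assumes 512-byte logical sectors, which the log itself does not state.

```shell
# Decode a SCSI Read(10) CDB as printed by the kernel,
# e.g. "28 00 35 e5 e9 90 00 00 78 00".
# Prints "<start LBA> <sector count> <bytes>".
# Assumption: 512-byte logical sectors.
decode_read10() {
    # Split the space-separated hex bytes into positional parameters.
    set -- $1
    [ "$#" -eq 10 ] && [ "$1" = "28" ] || {
        echo "not a Read(10) CDB" >&2
        return 1
    }
    lba=$(( 0x$3$4$5$6 ))   # bytes 2-5: logical block address, big-endian
    count=$(( 0x$8$9 ))     # bytes 7-8: transfer length in sectors
    echo "$lba $count $(( count * 512 ))"
}
```

For the two CDBs in the log, `decode_read10 '28 00 35 e5 e9 90 00 00 78 00'` prints `904259984 120 61440` and `decode_read10 '28 00 35 e5 e8 98 00 00 f8 00'` prints `904259736 248 126976`. The byte counts match the size=61440 and size=126976 fields of the two zio lines, so the failed SCSI commands and the ZFS errors describe the same two reads.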

The issue has been known for more than 1.5 years but is still not resolved:

201693 – Samsung 860 EVO NCQ Issue with AMD SATA Controller 

Because disabling NCQ is incompatible with the ZFS RAID we use, we also cannot apply the workaround of setting the queue depth for those devices to 1.
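
For context, the workaround referred to here is usually applied in one of two ways: at boot with the libata.force=noncq kernel parameter, or at runtime by writing 1 to the disk's queue_depth attribute in sysfs. A minimal sketch of the runtime variant follows. The helper name set_queue_depth and its optional third argument (an alternative sysfs root, only there so the function can be tried against a scratch directory) are assumptions for illustration, and writing to the real attribute requires root.

```shell
# Set the queue depth of one disk; a depth of 1 effectively disables NCQ for it.
# $1 = block device name (e.g. sdb)
# $2 = desired queue depth
# $3 = sysfs root, defaults to /sys (hypothetical override for dry runs)
set_queue_depth() {
    attr="${3:-/sys}/block/$1/device/queue_depth"
    [ -w "$attr" ] || {
        echo "cannot write $attr" >&2
        return 1
    }
    echo "$2" > "$attr"
    # Read the value back so the caller sees what is now in effect.
    cat "$attr"
}
```

On a live system this would be called as `set_queue_depth sdb 1`. The setting does not persist across reboots.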

We never experienced any issues with these SSDs on the Intel platforms we used previously. It therefore appears to be an issue caused by AMD, and we need it resolved soon in order to roll out the new AMD platform in our data centers.

mbaker_amd
Staff

Re: Issues with Samsung SSDs on Epyc

Hello shusky2812,

We are looking into this and will get back to you.

shusky2812
Journeyman III

Re: Issues with Samsung SSDs on Epyc

Hello,

For your convenience, here is the FIO command with which we were able to reproduce the issue on all of our Epyc servers running Debian 10:

fio --name=randrw --rw=randrw --size=32GB --direct=1 --ioengine=libaio --iodepth=32

The errors should occur within a few seconds.
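
To tell whether a run of the command above triggered the problem, the kernel log can be checked for the two messages quoted in the first post. A minimal sketch, assuming the log text is piped in on stdin (for example from dmesg or journalctl -k); the function name count_sata_errors is hypothetical. Also note that, as far as I know, fio creates its test file in the current working directory when no --filename is given, so the command has to be started from a directory on the affected SSD.

```shell
# Count kernel log lines that indicate the problem described in this thread.
# Reads log text from stdin and prints the number of matching lines.
count_sata_errors() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function usable in scripts running under "set -e".
    grep -c -e 'hard resetting link' -e 'Unaligned write command' || true
}
```

Running `dmesg | count_sata_errors` before and after the fio run and comparing the two numbers shows whether new link resets or failed commands were logged.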

mbaker_amd
Staff

Re: Issues with Samsung SSDs on Epyc

Thanks. We are waiting to receive some drives so we can try to reproduce this issue. They are expected soon, and we will be in touch.

hardcoregames_
Big Boss

Re: Issues with Samsung SSDs on Epyc

I have found that Intel's SSD products work fine with the AMD servers I have worked on. I use them in my own shop as well; they are reliable and I have never had problems with them.

I am not sure why Samsung's SSD products do not like AMD servers, but I have seen countless complaints about that problem myself.

mbaker_amd
Staff

Re: Issues with Samsung SSDs on Epyc

Hello hardcoregames™,

Typically Samsung SSDs (SATA and NVMe attached) work great. This seems to be a corner case that we're looking into for shusky2812. Hopefully we will have an update soon.

hardcoregames_
Big Boss

Re: Issues with Samsung SSDs on Epyc

mbaker_amd wrote:

Hello hardcoregames™

Typically Samsung SSD's (SATA and NVMe attached) work great.  This seems to be a corner case that we're looking into for shusky2812.  Hopefully an update soon.

I have not used M.2 SATA SSDs, as they tend to be very poor performers compared with M.2 NVMe, which is what I see in servers.

The difference in IOPS is astounding.

mbaker_amd
Staff

Re: Issues with Samsung SSDs on Epyc

Hello shusky2812,

We had to order some of the Samsung 860 drives that you're using. While we were waiting on the drives, we looked into the issue you linked to on bugzilla.kernel.org and discussed it with Samsung. Samsung indicated that this issue may have been fixed in firmware RVT02B6Q. The Bugzilla reporter was running RVT01B6Q, and we assumed you were as well. Our drives came pre-loaded with RVT02B6Q, and we ran FIO on them for 3 days without error.

Looking again at the Bugzilla entry, there are indications that some people had to disable NCQ on the drives. We did not have to do that.

Our suggestion is to get the latest firmware and try again.

shusky2812
Journeyman III

Re: Issues with Samsung SSDs on Epyc

Hello mbaker_amd,

We are actually running the firmware versions RVT02B6Q, RVT03B6Q and RVT04B6Q and are still experiencing these issues.
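
Since the discussion hinges on which firmware each drive runs, it can help to read the revision straight from the drive. A small sketch, assuming smartmontools is installed and that smartctl -i prints a "Firmware Version:" line, as it does for ATA devices; the function name firmware_version is made up here.

```shell
# Extract the firmware revision from `smartctl -i` output read on stdin.
# Prints only the revision string, e.g. RVT02B6Q.
firmware_version() {
    sed -n 's/^Firmware Version:[[:space:]]*//p'
}
```

A typical call would be `smartctl -i /dev/sdb | firmware_version`, repeated for each SSD in the pool.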


Regarding NCQ:

As stated in my first post, we cannot disable NCQ because the ZFS file system we use on these drives has compatibility issues with a queue depth of 1.

Regarding the hardware, I can also confirm the same issues with new HPE DL325 Gen10 Plus and HPE DL385 Gen10 Plus servers, as long as the onboard S-ATA ports are used and no hardware RAID controller is installed. As far as we could verify, it only occurs on the first four S-ATA ports, both on the HPE systems and on the Supermicro ones.

If necessary, we would be able to provide you with some test systems with SSH access for further investigation.

mbaker_amd
Staff

Re: Issues with Samsung SSDs on Epyc

Hello shusky2812,

Have you contacted Samsung about this?  We cannot recreate the issue, and Samsung tells us the issue is resolved with their updated firmware.
