
Server Gurus Discussions

shusky2812
Journeyman III

Issues with Samsung SSDs on Epyc

Following our earlier email conversation with AMD technical support and Hanaro/Samsung, I am creating the requested forum thread.

We are experiencing severe issues with AMD Epyc CPUs in combination with SuperMicro mainboards and Samsung 860 Series SSDs. According to Samsung's EU support, provided by Hanaro Europe B.V., the issue is due to AMD's refusal to meet S-ATA requirements in the manufacture of its motherboards. According to Hanaro, starting with the 860 SSD Series, which includes the 860 EVO we use, Samsung implemented a "feature" that generates error messages on AMD controllers to show that the AMD S-ATA ports do not fulfill the S-ATA requirements.

Hardware Setup:

1)

CPU: 1x AMD EPYC 7401P
RAM: 8x 32 GB Samsung M393A2K40CB2-CTD
MB: Supermicro H11SSL-i Rev 1.0
SSD: Samsung MZ-76E1T0B/EU


Mainboard / IPMI Firmware:
Firmware Revision: 01.39
Firmware Build Time: 10/09/2018
BIOS Version: 1.0c
BIOS Build Time: 10/04/2018
Redfish Version: 1.0.1
CPLD Version: 02.b1.00
AGESA: 1.0.0.9 -  AMI CRB_019


OS - Kernel: Debian 9 - 5.0.17

LSPCI S-ATA-Controller:
07:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
42:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
62:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

2)
CPU: 2x AMD EPYC 7282
RAM: 16x 32 GB Samsung M393A4K40CB2-CTD
MB: Supermicro H11DSi-NT Rev 2.0
SSD: Samsung MZ-76E2T0B/EU

Mainboard / IPMI Firmware:
Firmware Revision: 01.52.00
Firmware Build Time: 11/18/2019
BIOS Version: 2.1
BIOS Build Time: 02/21/2020
Redfish Version: 1.0.1
CPLD Version: 04.00.14
AGESA: 1.0.0.5 - 5.14_RomeCrb_0ACMK013

OS - Kernel: Debian 10 - 5.3.13 / 5.4.41

LSPCI S-ATA-Controller:

23:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
24:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
46:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
47:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

During some higher I/O workloads, the following kernel messages are logged:

May 17 06:54:41 m13970 kernel: [364273.528676] ata7: hard resetting link
May 17 06:54:42 m13970 kernel: [364274.008322] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 17 06:54:42 m13970 kernel: [364274.008615] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.011645] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.014371] ata7.00: configured for UDMA/133
May 17 06:54:42 m13970 kernel: [364274.014384] sd 6:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014386] sd 6:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014387] sd 6:0:0:0: [sdb] tag#14 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014389] sd 6:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 35 e5 e9 90 00 00 78 00
May 17 06:54:42 m13970 kernel: [364274.014461] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981621760 size=61440 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014471] sd 6:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014472] sd 6:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014473] sd 6:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014476] sd 6:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 35 e5 e8 98 00 00 f8 00
May 17 06:54:42 m13970 kernel: [364274.014523] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981494784 size=126976 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014532] ata7: EH complete
May 17 06:54:42 m13970 kernel: [364274.014602] ata7.00: Enabling discard_zeroes_data

-> This increases SMART attribute 199 (CRC error count) on the SSD and causes problems with the ZFS RAID, which ejects the SSD because of the errors logged above.
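
To make it easier to compare affected hosts, the pattern in the log above can be counted automatically. The following is only a minimal sketch: the regular expressions are derived from the excerpt above and are an assumption about how other kernel versions word these messages, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above; other kernels may word these differently.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
FAILED_RE = re.compile(r"\[(sd[a-z]+)\] tag#\d+ FAILED Result")
ZIO_RE = re.compile(r"zio pool=(\S+) vdev=(\S+) error=(\d+)")


def summarize(lines):
    """Count link resets per ATA port, failed commands per disk, and ZFS I/O errors per vdev."""
    resets, failed, zio_errors = Counter(), Counter(), Counter()
    for line in lines:
        if m := RESET_RE.search(line):
            resets[m.group(1)] += 1
        elif m := FAILED_RE.search(line):
            failed[m.group(1)] += 1
        elif m := ZIO_RE.search(line):
            zio_errors[m.group(2)] += 1
    return resets, failed, zio_errors
```

It can be fed any iterable of log lines, for example an open file containing the output of `dmesg` or `journalctl -k`.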

The issue has been known for more than 1.5 years but is still not resolved:

201693 – Samsung 860 EVO NCQ Issue with AMD SATA Controller 

Because disabling NCQ is incompatible with the ZFS RAID we use, we also cannot apply the workaround of setting the queue depth for those devices to 1.
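
For reference, on Linux the queue-depth workaround mentioned above is normally applied per device through sysfs. The sketch below assumes the usual path `/sys/block/<device>/device/queue_depth`; the helper names and the `sysfs_root` parameter are hypothetical and exist only so the code can be tried against a scratch directory. Writing to the real file requires root, and a depth of 1 effectively disables NCQ for that device, which is exactly what does not work for us.

```python
from pathlib import Path


def queue_depth_path(device, sysfs_root="/sys"):
    """Return the sysfs file that holds the queue depth of a block device such as 'sdb'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def get_queue_depth(device, sysfs_root="/sys"):
    """Read the current queue depth as an integer."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth; depth=1 means one command in flight at a time."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
```

A setting made this way does not survive a reboot. As far as we know, the boot-time alternative is the `libata.force=noncq` kernel parameter.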

We never experienced any issues with those SSDs on our previously used Intel platforms. Therefore, it seems like an issue caused by AMD, which we would need resolved shortly in order to be able to roll out the new AMD platform in our data centers.

0 Likes
10 Replies
Anonymous
Not applicable

Hello shusky2812‌,

We are looking into this and will get back with you.

0 Likes
shusky2812
Journeyman III

Hello,

For your convenience, below is the command with which we were able to reproduce the issue on all of our Epyc servers, running FIO on Debian 10:

fio --name=randrw --rw=randrw --size=32GB --direct=1 --ioengine=libaio --iodepth=32

The errors should occur within a few seconds.
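
If the test has to be repeated on several machines, the same command line can be generated from a small script. This is a minimal sketch: the options are exactly those of the command above, while the function name and the optional `directory` argument (passed to fio's `--directory` option so the test file lands on the pool under test) are additions for illustration and not part of our original run.

```python
import shlex


def fio_repro_command(directory=None, size="32GB", iodepth=32):
    """Build the fio argument list from the post; `directory` is an optional addition."""
    argv = [
        "fio",
        "--name=randrw",
        "--rw=randrw",
        f"--size={size}",
        "--direct=1",
        "--ioengine=libaio",
        f"--iodepth={iodepth}",
    ]
    if directory is not None:
        # Not in the original command: place the test file in a chosen directory.
        argv.append(f"--directory={directory}")
    return argv


def as_shell(argv):
    """Render the argument list as a single shell command line."""
    return shlex.join(argv)
```

The list returned by `fio_repro_command()` can be passed directly to `subprocess.run`, and `as_shell` gives the equivalent one-line command for pasting into a terminal.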

0 Likes
Anonymous
Not applicable

Thanks. We are waiting to receive some drives to try and reproduce this issue. They are expected soon, and we will be in touch.

0 Likes
Anonymous
Not applicable

Hello shusky2812‌,

We had to order some of the Samsung 860 drives that you're using. While we were waiting on the drives, we looked into the issue you linked to on bugzilla.kernel.org and discussed it with Samsung. Samsung indicated that this issue may have been fixed in firmware RVT02B6Q. The bugzilla reporter, and we assume you as well, was running RVT01B6Q. Our drives came pre-loaded with RVT02B6Q, and we ran FIO on them for 3 days without error.

Looking again at the bugzilla, there are indications that some people had to disable NCQ on the drives. We did not have to do that.

Our suggestion is to get the latest firmware and try again.

0 Likes

Hello mbaker_amd‌,

We are actually running the firmware versions RVT02B6Q, RVT03B6Q and RVT04B6Q and are still experiencing these issues.


Regarding NCQ:

As stated in my first post, we cannot disable NCQ because the ZFS file system used on these drives has compatibility issues with a queue depth of 1.

Regarding the hardware, I can also confirm the same issues with new HPE DL325 Gen10 Plus and HPE DL385 Gen10 Plus servers whenever the onboard S-ATA ports are used without a hardware RAID controller. As far as we could verify, it only occurs on the first four S-ATA ports, both on the HPE systems and the SuperMicro ones.

If necessary, we would be able to provide you some test systems with SSH access for further investigation.

0 Likes
Anonymous
Not applicable

Hello shusky2812,

Have you contacted Samsung about this?  We cannot recreate the issue, and Samsung tells us the issue is resolved with their updated firmware.

0 Likes

Dear Mr. XXX,

Thank you for contacting Samsung Memory Support.
We are sorry that you are having problems using the SSD 860 EVO on your computer with an AMD controller.

Please note that some Samsung SSDs like the SSD 860 EVO may have compatibility issues with AMD controllers and also with some ASMedia memory controllers.
This incompatibility may cause certain functions on the SSD to remain inactive, such as Rapid Mode, Secure Erase, etc., and may cause the computer to freeze or prevent the SSD from working optimally or at all.
This problem is due to AMD's refusal to meet SATA requirements in the manufacture of its motherboards.

This problem is not apparent on earlier SSD models because Samsung has not generated an error code that could be displayed when its SATA drives are attached to the AMD controller.
This decision was made with the hope that AMD will respect the requirements and act accordingly. Unfortunately this was not the case so far.
Starting with the SSD 860 series, Samsung decided to generate an error code.

Unfortunately, there is no information from AMD about the planned remedial actions and we do not know exactly when this will happen.
In the hope that AMD will release an update that can fix the problem, you can try the following solution:
- Disable NCQ (Native Command Queue) in your SATA driver
- If you use the default Storahci MS driver, add [HKEY_LOCAL_MACHINE \ SYSTEM \ CurrentControlSet \ Services \ storahci \ Parameters \ Device] to the registry. "NcqDisabled" = dword: 00000001 or "SingleIO" = hex (7): 2a, 00,00 00,00
- If you are using the AMD SATA driver, add the following instead: [HKEY_LOCAL_MACHINE \ SYSTEM \ CurrentControlSet \ services \ amd_sata \ Parameters \ Device] "AmdSataNCQDisabled" = dword: 0000000F or "AmdSataQueueDepth" = dword: 00000001
- You can also switch your SATA controller into IDE mode. Note, however, that the performance will be slower than if you disable NCQ as indicated above.

If the above suggestions are not helpful, it is recommended to use the SSD with an Intel controller or a SATA-AHCI controller (the standard Microsoft controller).
Proceed as follows;
- Device Manager> IDE ATA / ATAPI Controller> AMD SATA Controller> Right Click> Update Driver> Browse Computer for Drives> select from the list> Default SATA AHCI Controller> Next> OK

Also make sure that the chipset is up to date.

We hope that we have supported you adequately.

Should you have any further questions, please do not hesitate to contact us.

Sincerely yours / Best regards / Met vriendelijke groeten
 
Mirian
Customer Support Team

This was their reply to our initial request, which is why we contacted you for further assistance: their solutions either do not work with our setup or drastically degrade the performance of the SSDs.

As you can see below, the issue still occurred during regular usage on both newer firmware versions:

Firmware RVT03B6Q:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-4-contabo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 2TB
Serial Number:    S45KNB0M601XXX
LU WWN Device Id: 5 002538 e49686fb9
Firmware Version: RVT03B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Aug 28 14:35:58 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2474
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       9
177 Wear_Leveling_Count     0x0013   041   041   000    Pre-fail  Always       -       881
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   072   068   000    Old_age   Always       -       28
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       297
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       6
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       709051075056

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         2         -

Firmware RVT04B6Q:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-4-contabo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 2TB
Serial Number:    S4X1NJ0N20XXXXX
LU WWN Device Id: 5 002538 e302322d7
Firmware Version: RVT04B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Aug 28 14:53:35 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1297
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       6
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       31
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   061   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       534
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       2
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       36663014043

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
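
To compare the two drives above, the CRC error count can be put in relation to the power-on hours. The sketch below is a rough normalization only: it assumes the ten-column attribute table printed by `smartctl -A` as shown above, the function names are made up for illustration, and it says nothing about when the errors actually accumulated.

```python
def parse_attributes(text):
    """Return {attribute_name: raw_value} from the rows of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns; RAW_VALUE is the last.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def crc_errors_per_1000_hours(text):
    """CRC_Error_Count scaled to 1000 power-on hours."""
    attrs = parse_attributes(text)
    return 1000 * attrs["CRC_Error_Count"] / attrs["Power_On_Hours"]
```

Applied to the two outputs above, this gives roughly 120 CRC errors per 1000 power-on hours for the RVT03B6Q drive (297 in 2474 hours) and roughly 412 for the RVT04B6Q drive (534 in 1297 hours).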

0 Likes

I found Intel's SSD products work fine with AMD servers that I have worked on. I use them in my own shop as well as they are reliable and I have never had problems with them.

I am not sure why Samsung's SSD products do not like AMD servers, but I have seen countless complaints about that problem myself.

0 Likes
Anonymous
Not applicable

Hello hardcoregames™‌, 

Typically Samsung SSD's (SATA and NVMe attached) work great.  This seems to be a corner case that we're looking into for shusky2812‌.  Hopefully an update soon.

0 Likes

mbaker_amd wrote:

Hello hardcoregames™

Typically Samsung SSD's (SATA and NVMe attached) work great.  This seems to be a corner case that we're looking into for shusky2812.  Hopefully an update soon.

I have not used M.2 SATA SSD as they tend to be very poor performers as opposed to M.2 NVMe which is what I see in servers.

The difference in IOPS is astounding.

0 Likes