As agreed in our earlier email conversation with AMD technical support and Hanaro/Samsung, I am opening the requested forum thread.
We are experiencing severe issues with AMD Epyc CPUs in combination with SuperMicro mainboards and Samsung 860 Series SSDs. According to Samsung's EU support, provided by Hanaro Europe B.V., the issue is due to AMD's refusal to meet S-ATA requirements in the manufacture of its motherboards. According to Hanaro, starting with the 860 SSD series (which includes the 860 EVO we use), Samsung implemented a "feature" that generates error messages on AMD controllers to show that the AMD S-ATA ports do not fulfill the S-ATA requirements.
Hardware Setup:
1)
CPU: 1x AMD EPYC 7401P
RAM: 8x 32 GB Samsung M393A2K40CB2-CTD
MB: Supermicro H11SSL-i Rev 1.0
SSD: Samsung MZ-76E1T0B/EU
Mainboard / IPMI Firmware:
Firmware Revision: 01.39
Firmware Build Time: 10/09/2018
BIOS Version: 1.0c
BIOS Build Time: 10/04/2018
Redfish Version: 1.0.1
CPLD Version: 02.b1.00
AGESA: 1.0.0.9 - AMI CRB_019
OS - Kernel: Debian 9 - 5.0.17
LSPCI S-ATA-Controller:
07:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
42:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
62:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
2)
CPU: 2x AMD EPYC 7282
RAM: 16x 32 GB Samsung M393A4K40CB2-CTD
MB: Supermicro H11DSi-NT Rev 2.0
SSD: Samsung MZ-76E2T0B/EU
Mainboard / IPMI Firmware:
Firmware Revision: 01.52.00
Firmware Build Time: 11/18/2019
BIOS Version: 2.1
BIOS Build Time: 02/21/2020
Redfish Version: 1.0.1
CPLD Version: 04.00.14
AGESA: 1.0.0.5 - 5.14_RomeCrb_0ACMK013
OS - Kernel: Debian 10 - 5.3.13 / 5.4.41
LSPCI S-ATA-Controller:
23:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
24:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
46:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
47:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
a4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c3:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
c4:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
During some higher I/O workloads, the following kernel messages are logged:
May 17 06:54:41 m13970 kernel: [364273.528676] ata7: hard resetting link
May 17 06:54:42 m13970 kernel: [364274.008322] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 17 06:54:42 m13970 kernel: [364274.008615] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.011645] ata7.00: supports DRM functions and may not be fully accessible
May 17 06:54:42 m13970 kernel: [364274.014371] ata7.00: configured for UDMA/133
May 17 06:54:42 m13970 kernel: [364274.014384] sd 6:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014386] sd 6:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014387] sd 6:0:0:0: [sdb] tag#14 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014389] sd 6:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 35 e5 e9 90 00 00 78 00
May 17 06:54:42 m13970 kernel: [364274.014461] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981621760 size=61440 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014471] sd 6:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 17 06:54:42 m13970 kernel: [364274.014472] sd 6:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
May 17 06:54:42 m13970 kernel: [364274.014473] sd 6:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
May 17 06:54:42 m13970 kernel: [364274.014476] sd 6:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 35 e5 e8 98 00 00 f8 00
May 17 06:54:42 m13970 kernel: [364274.014523] zio pool=datastore vdev=/dev/disk/by-id/ata-Samsung_SSD_860_EVO_2TB_S3YVNB0MXXXXXXX-part4 error=5 type=1 offset=442981494784 size=126976 flags=40080cb0
May 17 06:54:42 m13970 kernel: [364274.014532] ata7: EH complete
May 17 06:54:42 m13970 kernel: [364274.014602] ata7.00: Enabling discard_zeroes_data
-> This causes SMART attribute 199 (CRC error count) to increase on the SSD and causes issues with the ZFS RAID, which ejects the SSD because of the errors logged above.
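The two failing commands can be cross-checked against the zio lines in the log above. The following is only a minimal sketch: `decode_read10` is a hypothetical helper name, and it assumes 512-byte logical sectors, as smartctl reports for these drives. It decodes the logged Read(10) CDBs into a start LBA and a transfer size in bytes.

```python
def decode_read10(cdb_hex: str, sector_size: int = 512) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * sector_size}


# The two CDBs from the kernel log above (tag#14 and tag#31).
for cdb in ("28 00 35 e5 e9 90 00 00 78 00", "28 00 35 e5 e8 98 00 00 f8 00"):
    print(decode_read10(cdb))
```

Under that assumption the decoded sizes are 61440 and 126976 bytes, which matches the size= values of the two zio errors. Both failed commands are reads, even though the sense data is reported as "Unaligned write command".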
The issue has been known for more than 1.5 years but is still not resolved:
201693 – Samsung 860 EVO NCQ Issue with AMD SATA Controller
Because disabling NCQ is incompatible with the ZFS RAID we use, we also cannot apply the workaround of setting the queue depth for those devices to 1.
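For readers who want to check what the kernel is currently using: on Linux the per-device queue depth is exposed at /sys/block/<dev>/device/queue_depth, and the workaround referred to above amounts to writing 1 there (or booting with libata.force=noncq). The following is a minimal sketch that only reads the current values. The function name and the `sysfs_root` parameter are illustrative and not part of any existing tool.

```python
from pathlib import Path


def read_queue_depths(sysfs_root: str = "/sys/block") -> dict:
    """Return {device name: queue depth} for every sd* block device found."""
    depths = {}
    for dev in sorted(Path(sysfs_root).glob("sd*")):
        qd_file = dev / "device" / "queue_depth"
        if qd_file.is_file():
            depths[dev.name] = int(qd_file.read_text().strip())
    return depths


if __name__ == "__main__":
    for name, depth in read_queue_depths().items():
        # A queue depth of 1 means commands are issued one at a time (no NCQ).
        note = "queue depth 1, NCQ effectively off" if depth == 1 else "queued commands allowed"
        print(f"{name}: queue_depth={depth} ({note})")
```

This only reports the setting; it does not change it.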
We never experienced any issues with those SSDs on our previously used Intel platforms. Therefore, it seems like an issue caused by AMD, which we would need resolved shortly in order to be able to roll out the new AMD platform in our data centers.
Hello shusky2812,
We are looking into this and will get back with you.
Hello,
for your convenience, please find below the command with which we were able to reproduce the issue on all of our Epyc servers running FIO on Debian 10:
fio --name=randrw --rw=randrw --size=32GB --direct=1 --ioengine=libaio --iodepth=32
The errors should occur within a few seconds.
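To make a reproduction run easier to compare between systems, the kernel log can be summarized per port and per disk. The following is a rough sketch, not an established tool: the function name and regular expressions are my own and only match the message formats shown in the log excerpt in the first post.

```python
import re
from collections import Counter

# Patterns for the two message types seen in the kernel log excerpt.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
FAILED_RE = re.compile(r"\[(sd[a-z]+)\] tag#\d+ FAILED")


def summarize_kernel_log(lines):
    """Count link resets per ATA port and failed commands per disk."""
    resets, failed = Counter(), Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
        match = FAILED_RE.search(line)
        if match:
            failed[match.group(1)] += 1
    return resets, failed


# Three lines taken from the log excerpt in the first post.
SAMPLE = [
    "May 17 06:54:41 m13970 kernel: [364273.528676] ata7: hard resetting link",
    "May 17 06:54:42 m13970 kernel: [364274.014384] sd 6:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
    "May 17 06:54:42 m13970 kernel: [364274.014471] sd 6:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
]

resets, failed = summarize_kernel_log(SAMPLE)
print("link resets per port:", dict(resets))
print("failed commands per disk:", dict(failed))
```

In practice the lines would come from the kernel log captured while fio is running rather than from a hard-coded sample.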
Thanks. We are waiting to receive some drives to try to reproduce this issue. They are expected soon, and we will be in touch.
Hello shusky2812,
We had to order some of the Samsung 860 drives that you're using. While we were waiting on the drives, we looked into the issue you linked to on bugzilla.kernel.org and discussed it with Samsung. Samsung indicated that this issue may have been fixed in firmware RVT02B6Q. The bugzilla reporter (and, we assume, you as well) was running RVT01B6Q. Our drives came pre-loaded with RVT02B6Q, and we ran FIO on them for 3 days without error.
Looking again at the bugzilla, there are indications that some people had to disable NCQ on the drives. We did not have to do that.
Our suggestion is to get the latest firmware and try again.
Hello mbaker_amd,
We are actually running the firmware versions RVT02B6Q, RVT03B6Q and RVT04B6Q and are still experiencing these issues.
Regarding NCQ:
As stated in my first post, we cannot disable NCQ because the ZFS file system used on these drives has compatibility issues with a queue depth of 1.
Regarding the hardware used, I can also confirm the same issues with new HPE DL325 Gen10 Plus and HPE DL385 Gen10 Plus servers, as long as the onboard S-ATA ports are used and no hardware RAID controller is installed. As far as we could verify, it only occurs on the first four S-ATA ports, both on the HPE systems and on the SuperMicro ones.
If necessary, we would be able to provide you some test systems with SSH access for further investigation.
Hello shusky2812,
Have you contacted Samsung about this? We cannot recreate the issue, and Samsung tells us the issue is resolved with their updated firmware.
Dear Mr. XXX,
Thank you for contacting Samsung Memory Support.
We are sorry that you are having problems using the SSD 860 EVO on your computer with an AMD controller.
Please note that some Samsung SSDs like the SSD 860 EVO may have compatibility issues with AMD controllers and also with some ASMedia memory controllers.
This incompatibility may cause certain functions on the SSD to remain inactive, such as Rapid Mode, Secure Erase, etc., and may cause the computer to freeze or prevent the SSD from working optimally or at all.
This problem is due to AMD's refusal to meet SATA requirements in the manufacture of its motherboards.
This problem is not apparent on earlier SSD models because Samsung has not generated an error code that could be displayed when its SATA drives are attached to the AMD controller.
This decision was made with the hope that AMD will respect the requirements and act accordingly. Unfortunately this was not the case so far.
Starting with the SSD 860 series, Samsung decided to generate an error code.
Unfortunately, there is no information from AMD about the planned remedial actions and we do not know exactly when this will happen.
In the hope that AMD will release an update that can fix the problem, you can try the following solution:
- Disable NCQ (Native Command Queue) in your SATA driver
- If you use the default Storahci MS driver, add [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\storahci\Parameters\Device] to the registry. "NcqDisabled"=dword:00000001 or "SingleIO" = hex (7): 2a, 00,00 00,00
- If you are using the AMD SATA driver, add the following instead: [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\amd_sata\Parameters\Device] "AmdSataNCQDisabled"=dword:0000000F or "AmdSataQueueDepth"=dword:00000001
- You can also switch your SATA controller into IDE mode. Note, however, that the performance will be slower than if you disable NCQ as indicated above.
If the above suggestions are not helpful, it is recommended to use the SSD with an Intel controller or a SATA-AHCI controller (the standard Microsoft controller).
Proceed as follows:
- Device Manager > IDE ATA / ATAPI Controller > AMD SATA Controller > Right Click > Update Driver > Browse Computer for Drivers > select from the list > Default SATA AHCI Controller > Next > OK
Also make sure that the chipset is up to date.
We hope that we have supported you adequately.
Should you have any further questions, please do not hesitate to contact us.
Sincerely yours / Best regards / Met vriendelijke groeten
Mirian
Customer Support Team
This was their reply to our initial request, which is why we contacted you for further assistance: their solutions either do not work with our setup or drastically degrade the performance of the SSDs.
As you can see below, the issue still occurred during regular usage on both newer firmware versions:
Firmware RVT03B6Q:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-4-contabo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 2TB
Serial Number: S45KNB0M601XXX
LU WWN Device Id: 5 002538 e49686fb9
Firmware Version: RVT03B6Q
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Aug 28 14:35:58 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 2474
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 9
177 Wear_Leveling_Count 0x0013 041 041 000 Pre-fail Always - 881
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 072 068 000 Old_age Always - 28
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 297
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 6
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 709051075056
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2 -
Firmware RVT04B6Q:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.13-4-contabo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 2TB
Serial Number: S4X1NJ0N20XXXXX
LU WWN Device Id: 5 002538 e302322d7
Firmware Version: RVT04B6Q
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Aug 28 14:53:35 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1297
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 6
177 Wear_Leveling_Count 0x0013 098 098 000 Pre-fail Always - 31
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 062 061 000 Old_age Always - 38
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 534
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 2
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 36663014043
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3 -
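To track the CRC counter over time without reading the full dumps, the raw values can be pulled out of the attribute table. The following is a minimal sketch under the assumption that each attribute row has the ten whitespace-separated columns shown above; `smart_raw_values` is a hypothetical helper name, not part of smartmontools.

```python
def smart_raw_values(smartctl_output: str) -> dict:
    """Map attribute name -> raw value from a smartctl attribute table."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


# Two rows copied from the RVT04B6Q output above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1297
199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 534
"""

raw = smart_raw_values(SAMPLE)
print("CRC errors:", raw["CRC_Error_Count"])
print("CRC errors per power-on hour:", raw["CRC_Error_Count"] / raw["Power_On_Hours"])
```

Comparing the counter before and after a fio run would show whether a given firmware still accumulates CRC errors.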
I found Intel's SSD products work fine with AMD servers that I have worked on. I use them in my own shop as well as they are reliable and I have never had problems with them.
Not sure why Samsung's SSD products do not like AMD servers, I have seen countless complaints over that problem myself.
Hello hardcoregames™,
Typically Samsung SSD's (SATA and NVMe attached) work great. This seems to be a corner case that we're looking into for shusky2812. Hopefully an update soon.
mbaker_amd wrote:
Hello hardcoregames™,
Typically Samsung SSD's (SATA and NVMe attached) work great. This seems to be a corner case that we're looking into for shusky2812. Hopefully an update soon.
I have not used M.2 SATA SSDs, as they tend to be very poor performers compared to M.2 NVMe, which is what I see in servers.
The difference in IOPS is astounding.