FPGA vs. GPU Computational Storage Acceleration: Performance/Power Consideration

amd_adaptivecomputing · ‎11-11-2022

This article was published on July 08th, 2021.

In most situations, advanced programmable hardware—mainly GPUs and FPGAs—is the primary source of acceleration. By using this advanced hardware, enterprises are gaining computational advantages; however, there are still reasonable concerns around programming difficulty.

Figure 1: Analytics/AI Pipeline Components

Hardware manufacturers are now applying acceleration to computational storage, which is storage specifically designed to incorporate an in-line computational element. This approach has been shown to deliver high performance for analytics and AI applications (Figure 1). Data collection, analysis with or without machine learning, and verification can all be accelerated using computational storage devices. The key advantage of these devices is that costly computations are offloaded to the storage device rather than run on the server CPU. Compared to standard storage/CPU methods, computational storage offers these advantages:

1. Achieving faster performance by customizing the programmable hardware with application-specific programming

2. Freeing up CPU resources by offloading computation from the server to the storage device

3. Co-locating data and compute, which reduces the need to transfer data

This novel approach is promising; however, you should assess it for your specific use case, considering performance, cost, power consumption, and ease of use. Performance/price and performance/power are key ratios to evaluate when choosing acceleration hardware. In this post, we’ll explore the performance/power ratio (here’s another article that discusses performance/price).

Computational Storage Power Comparison Overview

About the 3 Systems

In this scenario, we're comparing three tools on CSV data read use cases: NVIDIA GPUDirect Storage, NVIDIA RAPIDS, and the Samsung SmartSSD powered by Xilinx. CSV read is a crucial step in compute-intensive pipelines (see Figure 1).

In what follows, we define performance as the rate at which CSV data is processed, that is, the “bandwidth” of the processing. Here’s a quick refresher on how the three systems work.

NVIDIA GPUDirect Storage

Addresses analytics and AI end-to-end
Uses the GPU as a computational element placed next to an NVMe-based storage device (GPUDirect)
Leverages CUDA for programming (RAPIDS)

NVIDIA applied its technology to CSV data reads to measure the performance gain over a standard SSD. The results, plotted alongside the SmartSSD results in Figure 2, show a throughput of 4 to 23 GB/s for 1 to 8 accelerators.

Samsung SmartSSD Drive

Uses a Xilinx FPGA as the computational element
Resides in-line with the storage logic on the same internal PCIe interconnect
Performs computation on the storage platform itself by programming the FPGA

Xilinx worked with Samsung to design an accelerator for Apache Spark, including IP for CSV and Parquet processing. For this comparison, the SmartSSD was tested using the CSV parsing engine in stand-alone mode. The results in Figure 2 show a throughput of 4 to 23 GB/s for 1 to 12 accelerators, plotted along with the NVIDIA results (for 1 to 8 accelerators). Please note that all results in this discussion are plotted against the number of accelerator cards employed, shown on the x-axis.

These outcomes are promising; however, be sure to consider power consumption when choosing your solution.

Figure 2: SmartSSD Drive Performance Results for CSV Parsing
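
The throughput metric used in these results is the amount of CSV data processed per unit of time. As a rough illustration of how that metric is computed, here is a minimal Python sketch that times a host-CPU parse of in-memory CSV text. It is not the benchmark behind Figure 2: the function name `csv_parse_throughput` and the sample data are our own assumptions, and no storage device or accelerator is involved.

```python
import csv
import io
import time


def csv_parse_throughput(csv_text: str) -> tuple[int, float]:
    """Parse csv_text once and return (row_count, throughput in GB/s)."""
    size_bytes = len(csv_text.encode("utf-8"))
    start = time.perf_counter()
    # Consume every row so that the whole input is actually parsed.
    row_count = sum(1 for _ in csv.reader(io.StringIO(csv_text)))
    elapsed = time.perf_counter() - start
    # Guard against a zero timer reading on very small inputs.
    elapsed = max(elapsed, 1e-9)
    return row_count, size_bytes / elapsed / 1e9


# Hypothetical sample data: 100,000 rows of three integer columns.
sample = "".join(f"{i},{i * 2},{i * 3}\n" for i in range(100_000))
rows, gb_per_s = csv_parse_throughput(sample)
print(f"parsed {rows} rows at {gb_per_s:.3f} GB/s")
```

The number printed depends entirely on the machine running the sketch, so it should only be read as an example of the bytes-per-second calculation, not as a point of comparison with either accelerator.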

The Power-Performance Comparison

Figure 3 shows the results once power consumption is taken into account. They are presented as performance achieved per unit of power, using the following assumptions, which are based on the related material cited in the discussion above:

Tesla V100 GPU: 250-watt max power
SmartSSD Drive FPGA: 30-watt max power

Figure 3: Bandwidth per Watt Comparison for CSV Parsing

In this scenario, calculations show almost 25x higher performance/power for the SmartSSD than for GPUDirect Storage when each uses eight accelerators.
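
The bandwidth-per-watt metric can be sketched as throughput divided by the combined maximum power of the accelerators in use. The Python below is a minimal illustration under that assumption; the function name `bandwidth_per_watt` is ours, and the example input of 4 GB/s for a single accelerator of each type assumes that the lower end of the quoted 4 to 23 GB/s ranges corresponds to one accelerator. The per-count throughput values behind Figure 3 are not listed in the text, so this sketch does not reproduce the 25x figure.

```python
V100_MAX_WATTS = 250.0     # Tesla V100 GPU max power, as assumed in the article
SMARTSSD_MAX_WATTS = 30.0  # SmartSSD Drive FPGA max power, as assumed in the article


def bandwidth_per_watt(throughput_gb_s: float, accelerators: int, max_watts: float) -> float:
    """Return GB/s per watt, charging every accelerator at its max power."""
    if accelerators < 1 or max_watts <= 0:
        raise ValueError("need at least one accelerator and positive power")
    return throughput_gb_s / (accelerators * max_watts)


# Assumed example point: one accelerator of each type at 4 GB/s.
gpu_eff = bandwidth_per_watt(4.0, 1, V100_MAX_WATTS)
fpga_eff = bandwidth_per_watt(4.0, 1, SMARTSSD_MAX_WATTS)
ratio = fpga_eff / gpu_eff

print(f"GPU:  {gpu_eff:.4f} GB/s per watt")
print(f"FPGA: {fpga_eff:.4f} GB/s per watt")
print(f"FPGA/GPU ratio: {ratio:.2f}x")
```

When both devices deliver the same throughput with the same number of accelerators, the ratio reduces to the power ratio alone (250 W / 30 W, roughly 8.3x). Any difference in throughput at a given accelerator count scales that ratio up or down.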

FPGA vs. GPU: Power/Performance Final Thoughts

The advantages of computational storage can enhance the performance of data analytics and AI applications. However, for the approach to be practical and useful for deployment, evaluations must consider power consumption.

We have presented throughput results normalized by power for two different computational storage approaches to CSV data parsing. The results show that, for a like number of accelerators, the SmartSSD drive outperforms the GPUDirect Storage approach in terms of performance/power.

GPUDirect is a research system from NVIDIA to be made available via the NVIDIA DGX-2 appliance platform.

The Samsung SmartSSD Drive is a deployable production PCIe-pluggable platform, shipping and available now through Xilinx and distributors.

For more information, check out:

The Samsung SmartSSD page for workload benefits using Samsung SmartSSD Drive.