Transforming AI Networks with AMD Pensando™ Pollara 400

jason_gmitter · ‎10-10-2024

The advent of generative AI and large language models (LLMs) has created unprecedented challenges for traditional Ethernet networks in AI clusters. These advanced AI/ML models demand intense communication capabilities, including tightly coupled parallel processing, rapid data transfers, and low-latency communication. Conventional Ethernet, designed for general-purpose computing, has historically struggled to meet these requirements. Even so, Ethernet remains the preferred network technology in AI clusters because of its widespread adoption and the abundance of operational expertise. Its limitations in supporting specialized AI workloads, however, have become increasingly apparent.

The AMD Pensando™ Pollara 400 emerges as a significant advancement in AI networking, specifically designed to address these issues. Pollara 400 optimizes performance to meet the requirements of modern AI environments while allowing customers to leverage familiar Ethernet-based fabrics. The Pollara 400 effectively bridges the gap between Ethernet's broad compatibility and the specialized demands of AI workloads, providing a solution that combines the best of both worlds. By addressing the specific communication needs of AI/ML models, the Pollara 400 enables organizations to fully harness the potential of their AI workloads without sacrificing the benefits of Ethernet infrastructure. This innovative approach represents a crucial step forward in adapting networking technology to the evolving landscape of AI computing.

Agile and Efficient Distribution in an Open Environment

What is AMD Pensando™ Pollara 400? The Pollara 400 is a specialized network accelerator designed to optimize data transfer within back-end AI networks for GPU-to-GPU communication. It is a fully programmable 400 Gigabit per second (Gbps) RDMA Ethernet Network Interface Card (NIC) that enhances the efficiency of AI workloads. Traditional data center (DC) Ethernet typically focuses on services such as user access, segmentation, and multi-tenancy, and often falls short of the demanding requirements of modern AI workloads. These workloads need high bandwidth, low latency, and efficient communication patterns that conventional Ethernet designs do not prioritize. Meeting them requires a network that supports distributed computing over multiple GPU nodes with low jitter and RDMA. The Pollara 400 is designed to manage the unique communication patterns of AI workloads and to offer high throughput across all available links, along with congestion avoidance, reduced tail latency, scalable performance, and fast job completion times. It also provides an open environment that does not limit customers to specific vendors, giving them more flexibility.

Standout Capabilities

P4 Programmability: The P4 programmable architecture allows the Pollara 400 to be versatile, enabling it to introduce innovations today while remaining adaptable to evolving standards in the future, such as those set by the Ultra Ethernet Consortium (UEC). This programmability ensures that the AMD Pensando™ Pollara 400 can adapt to new protocols and requirements, future-proofing AI infrastructure investments. By leveraging P4, AMD enables customers to customize network behavior, implement bespoke RDMA transports, and optimize performance for specific AI workloads, all while maintaining compatibility with future industry standards.

Multipathing & Intelligent Packet Spraying: Pollara 400 supports advanced adaptive packet spraying, which is crucial for meeting the high-bandwidth and low-latency requirements of AI models. This technology fully utilizes available bandwidth, particularly in Clos fabric architectures, resulting in fast message completion times and lower tail latency. Pollara 400 integrates seamlessly with AMD Instinct™ Accelerator and AMD EPYC™ CPU infrastructure, providing reliable, high-speed connectivity for GPU-to-GPU RDMA communication. By intelligently spraying the packets of a Queue Pair (QP) across multiple paths, it minimizes the chance of creating hot spots and congestion in AI networks. The Pollara 400 allows customers to choose their preferred Ethernet switching vendor, with either a lossy or a lossless implementation. Importantly, because it does not require a lossless network, the Pollara 400 drastically reduces network configuration and operational complexity. This flexibility and efficiency make the Pollara 400 a powerful solution for enhancing AI workload performance and network reliability.
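To make the hot-spot argument concrete, here is a minimal Python sketch that compares per-flow path selection (every packet of a flow pinned to one path) with per-packet spraying. It is a toy model under stated assumptions, not the Pollara 400 algorithm: the function names are hypothetical, the paths are assumed to be equal cost, and the spraying is a plain round-robin rather than the adaptive scheme described above.

```python
from collections import Counter


def per_flow_path(flow_id: int, num_paths: int) -> int:
    """Per-flow selection: every packet of a flow follows the same path."""
    return flow_id % num_paths


def sprayed_path(flow_id: int, seq: int, num_paths: int) -> int:
    """Per-packet spraying: successive packets of a flow rotate over all paths."""
    return (flow_id + seq) % num_paths


def path_load(num_flows: int, packets_per_flow: int, num_paths: int, spray: bool) -> list[int]:
    """Count how many packets land on each path under the chosen policy."""
    load = Counter()
    for flow in range(num_flows):
        for seq in range(packets_per_flow):
            if spray:
                path = sprayed_path(flow, seq, num_paths)
            else:
                path = per_flow_path(flow, num_paths)
            load[path] += 1
    return [load[p] for p in range(num_paths)]


# Two large flows over four equal-cost paths (illustrative numbers).
print(path_load(2, 1000, 4, spray=False))  # [1000, 1000, 0, 0]
print(path_load(2, 1000, 4, spray=True))   # [500, 500, 500, 500]
```

With only a few large flows, which is typical of collective operations between GPUs, per-flow selection leaves half the paths idle in this example, while spraying spreads the same traffic evenly.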

In-Order Message Delivery: The Pollara 400 offers advanced capabilities for handling out-of-order packet arrivals, a frequent occurrence with multipathing and packet spraying. The receiving Pollara 400 can efficiently process data packets that arrive in a different sequence than they were transmitted, placing them directly into GPU memory without delay while the message as a whole is still delivered in order. By managing this complexity at the NIC level, the system maintains high performance and data integrity without placing an additional burden on the GPU. This packet handling contributes to reduced latency and improved overall system efficiency.
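One way to picture this is a receiver that writes each packet at the offset implied by its sequence number and reports the message complete only once every packet has arrived. The sketch below is illustrative only: the class name, the fixed payload size per packet, and the use of a Python bytearray to stand in for GPU memory are all assumptions, not details of the Pollara 400 design.

```python
class MessageReceiver:
    """Places out-of-order packets directly at their final offset."""

    def __init__(self, total_len: int, mtu: int) -> None:
        self.buffer = bytearray(total_len)      # stands in for the destination memory
        self.mtu = mtu                          # payload bytes per packet (assumed fixed)
        self.num_packets = -(-total_len // mtu)  # ceiling division
        self.received: set[int] = set()

    def on_packet(self, seq: int, payload: bytes) -> bool:
        """Store one packet; return True once the whole message has arrived."""
        offset = seq * self.mtu
        self.buffer[offset:offset + len(payload)] = payload
        self.received.add(seq)
        return len(self.received) == self.num_packets


message = b"abcdefghij"
rx = MessageReceiver(total_len=len(message), mtu=4)

# Packets arrive as 2, 0, 1 instead of 0, 1, 2.
print(rx.on_packet(2, b"ij"))    # False
print(rx.on_packet(0, b"abcd"))  # False
print(rx.on_packet(1, b"efgh"))  # True
print(bytes(rx.buffer))          # b'abcdefghij'
```

No reorder buffer is needed in this model: data is never copied twice, and only the completion signal waits for the last packet.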

Fast Loss Recovery with Selective Retransmission: The Pollara 400 enhances network performance through in-order message delivery and selective acknowledgment (SACK) retransmission. Unlike RoCEv2's Go-back-N mechanism, which resends all packets from the point of failure, SACK allows the Pollara 400 to identify and retransmit only lost or corrupted packets. This targeted approach optimizes bandwidth utilization, reduces latency in packet loss recovery, and minimizes redundant data transmission. By combining efficient in-order delivery with SACK retransmission, the AMD Pensando™ Pollara 400 enables smooth data flow and optimal resource utilization. These features result in faster job completion times, lower tail latencies, and more efficient bandwidth use, making it well suited to demanding AI networks and large-scale machine learning operations.
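The difference in retransmitted volume can be sketched in a few lines of Python. This is a simplified single-round model, assuming the whole window is already in flight when the loss is detected and that retransmissions themselves are not lost; the function names and the example numbers are hypothetical.

```python
def go_back_n_resends(window: int, lost: set[int]) -> list[int]:
    """Go-back-N: resend everything from the first lost packet onward."""
    if not lost:
        return []
    return list(range(min(lost), window))


def selective_resends(window: int, lost: set[int]) -> list[int]:
    """Selective retransmission: resend only the packets reported missing."""
    return sorted(seq for seq in lost if 0 <= seq < window)


# A 100-packet window in which packets 10 and 57 were dropped.
lost = {10, 57}
print(len(go_back_n_resends(100, lost)))  # 90
print(len(selective_resends(100, lost)))  # 2
```

In this example Go-back-N resends 90 packets to repair 2 losses, which is the redundant traffic that selective retransmission avoids.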

Path-Aware Congestion Control: The Pollara 400 employs real-time telemetry and network-aware algorithms to manage network congestion effectively, including incast scenarios. Unlike RoCEv2, which relies on PFC and ECN in a lossless network, the AMD UEC-ready RDMA transport offers a more sophisticated approach:

Maintains per-path congestion status
Dynamically avoids congested paths using adaptive packet-spraying
Sustains near wire-rate performance during transient congestion
Optimizes packet flow across multiple paths without requiring PFC
Implements per-flow congestion control to prevent interference between data flows

These features simplify configuration, reduce operational overhead, and avoid common issues like congestion spreading, deadlock, and head-of-line blocking. The path-aware congestion control enables deterministic performance across the network, which is crucial for large-scale AI operations. By intelligently handling congestion without a fully lossless network, AMD Pensando™ Pollara 400 reduces network complexity, streamlining deployment in AI-driven data centers.
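The first two items in the list above, per-path congestion status and steering the spray away from congested paths, can be sketched as follows. This is a hedged illustration, not the AMD transport: the class name, the exponentially weighted moving average, the `alpha` and `threshold` parameters, and the boolean congestion signal are all assumptions chosen to keep the example small.

```python
class PathAwareSprayer:
    """Round-robin packet spraying that skips paths currently marked congested."""

    def __init__(self, num_paths: int, alpha: float = 0.5, threshold: float = 0.5) -> None:
        self.score = [0.0] * num_paths  # per-path congestion estimate in [0, 1]
        self.alpha = alpha              # weight given to the newest feedback sample
        self.threshold = threshold      # paths at or above this score are avoided
        self._next = 0                  # where the round-robin scan resumes

    def on_feedback(self, path: int, congested: bool) -> None:
        """Fold one congestion signal for a path into its moving average."""
        sample = 1.0 if congested else 0.0
        self.score[path] = (1 - self.alpha) * self.score[path] + self.alpha * sample

    def pick_path(self) -> int:
        """Return the next uncongested path, or the least congested if none qualify."""
        n = len(self.score)
        for i in range(n):
            candidate = (self._next + i) % n
            if self.score[candidate] < self.threshold:
                self._next = (candidate + 1) % n
                return candidate
        best = min(range(n), key=self.score.__getitem__)
        self._next = (best + 1) % n
        return best


sprayer = PathAwareSprayer(num_paths=4)
sprayer.on_feedback(2, congested=True)         # path 2 reports congestion
print([sprayer.pick_path() for _ in range(8)])  # [0, 1, 3, 0, 1, 3, 0, 1]
sprayer.on_feedback(2, congested=False)        # congestion on path 2 clears
print(sprayer.pick_path())                      # 2
```

Traffic keeps flowing on the remaining paths while path 2 is avoided, and the path rejoins the rotation once its score decays, all without pausing the sender the way PFC would.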

Rapid Fault Detection in High-Performance AI Networks: High-performance networks are crucial for efficient data synchronization in AI GPU clusters. AMD Pensando™ Pollara 400 employs several methods for rapid fault detection, which is essential for maintaining optimal performance. The timeout mechanisms of standard protocols are often too slow for AI applications. These applications need aggressive fault detection to reduce idle GPU time and increase the throughput of AI training and inference tasks, which ultimately decreases job completion time.

AMD Pensando™ Pollara 400 Rapid Fault Detection includes Sender-Based ACK Monitoring, which leverages the sender's ability to track acknowledgments (ACKs) across multiple network paths.
AMD Pensando™ Pollara 400 Receiver-Based Packet Monitoring is another technique. It works from the receiver's perspective, monitoring incoming packet flows. The receiver tracks packet reception on each distinct network path and identifies a potential fault if packets stop arriving on a path for a specified duration.
AMD Pensando™ Pollara 400 Probe-Based Verification is employed when a fault is suspected (triggered by either of the above methods). A probe packet is transmitted on the suspected faulty path, and if no response to the probe is received within a specified timeframe, the path is confirmed as failed. This additional step helps distinguish transient network issues from actual path failures.
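The interplay between receiver-based monitoring and probe-based verification can be sketched as a small per-path state machine. The Python below is an illustrative model only: the class and method names, the integer microsecond timestamps, the timeout values, and the choice to treat any later arrival as recovery are assumptions of this sketch, not documented Pollara 400 behavior.

```python
from enum import Enum


class PathStatus(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"  # idle too long, probe outstanding
    FAILED = "failed"    # probe went unanswered


class PathMonitor:
    """Tracks one path: an idle timeout raises suspicion, a probe confirms failure."""

    def __init__(self, idle_timeout_us: int, probe_timeout_us: int) -> None:
        self.idle_timeout_us = idle_timeout_us
        self.probe_timeout_us = probe_timeout_us
        self.status = PathStatus.HEALTHY
        self.last_rx_us = 0
        self.probe_sent_us = 0

    def on_packet(self, now_us: int) -> None:
        """Any arrival on the path (data or probe response) marks it healthy."""
        self.last_rx_us = now_us
        self.status = PathStatus.HEALTHY

    def tick(self, now_us: int) -> bool:
        """Advance the state machine; return True when a probe should be sent."""
        if self.status is PathStatus.HEALTHY:
            if now_us - self.last_rx_us >= self.idle_timeout_us:
                self.status = PathStatus.SUSPECT
                self.probe_sent_us = now_us
                return True
        elif self.status is PathStatus.SUSPECT:
            if now_us - self.probe_sent_us >= self.probe_timeout_us:
                self.status = PathStatus.FAILED
        return False


monitor = PathMonitor(idle_timeout_us=1000, probe_timeout_us=500)
monitor.on_packet(0)
print(monitor.tick(1000), monitor.status)  # True PathStatus.SUSPECT
monitor.on_packet(1200)                    # probe answered: a transient issue
print(monitor.tick(2200), monitor.status)  # True PathStatus.SUSPECT
print(monitor.tick(2700), monitor.status)  # False PathStatus.FAILED
```

The two-stage design keeps the idle timeout short without declaring a path dead on a brief stall, since only an unanswered probe moves the path to the failed state.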

Rapid fault detection mechanisms offer significant advantages. By identifying issues in milliseconds, they enable near-instantaneous failover, minimizing GPU idle time. Swift detection and isolation of faulty paths optimize network resource allocation, ensuring uninterrupted AI workloads on healthy paths. This approach enhances overall AI performance, potentially reducing training times and improving inference accuracy.

Final Thoughts: The AMD Pensando™ Pollara 400 is more than just a network card; it is a foundational component of a robust AI infrastructure. It addresses the limitations of traditional RoCEv2 Ethernet networks by offering features such as real-time telemetry, adaptive packet spraying with intelligent path-aware congestion control to alleviate incast scenarios, selective acknowledgment, and robust error detection. AI workloads require networks that support bursty data flows, minimal jitter, noise isolation, and high bandwidth to ensure optimal GPU performance. When paired with "best of breed" standards-compliant Ethernet switches, AMD Pensando™ Pollara 400 forms the backbone of a high-efficiency, low-latency AI cloud environment.

With its ability to deliver high throughput, low latency, and exceptional scalability, combined with the flexibility of P4 programmability, AMD Pensando™ Pollara 400 is an essential tool in the arsenal of any AI cloud infrastructure. This programmable approach not only enhances the NIC's versatility but also allows for rapid deployment of new networking features, ensuring that AI infrastructures can evolve as quickly as the AI technologies they support.