
AMD Files Patent for Its Own GPU Chiplet Implementation

AMD has filed for a patent on a chiplet-based approach to GPU design. One of the key goals of this approach is to create larger GPU configurations than are possible with a single, monolithic die.

AMD is the third company to share a little information on how it might approach multi-chip GPU design, though that's probably stretching the definition of "sharing" a bit. You can find the patent here. We'll briefly look at what Intel and Nvidia have proposed before we talk about AMD's patent filing.

Intel has previously stated that its Ponte Vecchio data center GPU would use a new memory architecture (Xe-MF) along with EMIB and Foveros. EMIB is a technique for connecting different chips on the same package, while Foveros stacks dies vertically, using through-silicon vias to connect them with effectively on-die bandwidth and latency. This approach relies on packaging and interconnect technology Intel has designed specifically for its own use.

Nvidia proposed what it called a Multi-Chip-Module GPU, or MCM-GPU, which resolved the problems intrinsic to distributing workloads across multiple GPUs by treating the package as a NUMA system, with additional features, such as an L1.5 cache, intended to reduce on-package bandwidth usage. Nvidia acknowledged that the design carried unavoidable latency penalties when hopping between the interconnected GPU modules.

AMD’s method envisions a GPU chiplet organized somewhat differently from what we’ve seen from the 7nm CPUs it has launched to date. Organizing a GPU into an effective chiplet design can be difficult due to restrictions on inter-chiplet bandwidth. This is less of a problem with CPUs, where cores don’t necessarily communicate all that much, and there aren’t nearly as many of them. A GPU has thousands of cores, while even the largest x86 CPUs have just 64.
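
To put rough numbers on that asymmetry, here is a back-of-envelope Python sketch. Every figure in it is invented for illustration; the point is only that aggregate cross-chiplet demand scales with core count.

```python
# Back-of-envelope sketch with invented numbers: why splitting a GPU
# across chiplets stresses the inter-chiplet link far more than
# splitting a CPU does. None of these figures come from AMD.

CPU_CORES_PER_CHIPLET = 8          # e.g. one Zen compute die
GPU_CORES_PER_CHIPLET = 2_560      # shader ALUs on a hypothetical GPU chiplet

# Hypothetical per-core cross-chiplet traffic (GB/s). Each GPU core
# generates far less traffic, but there are hundreds of times more of them.
CPU_TRAFFIC_PER_CORE = 0.5
GPU_TRAFFIC_PER_CORE = 0.1

cpu_demand = CPU_CORES_PER_CHIPLET * CPU_TRAFFIC_PER_CORE
gpu_demand = GPU_CORES_PER_CHIPLET * GPU_TRAFFIC_PER_CORE

print(f"CPU chiplet cross-link demand: {cpu_demand:.0f} GB/s")
print(f"GPU chiplet cross-link demand: {gpu_demand:.0f} GB/s")
# Output: 4 GB/s vs. 256 GB/s -- sheer core count dominates.
```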

One of the problems Nvidia highlighted in its 2017 paper was the need to take pressure off the limited bandwidth available for module-to-module communication. The L1.5 cache it proposed is meant to alleviate exactly this problem.
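
As a rough illustration of why such a cache helps, here is a minimal Python model of a hypothetical L1.5 filtering remote traffic. The access count, line size, and hit rates are assumptions, not figures from Nvidia's paper.

```python
# Minimal sketch of the idea behind the proposed L1.5 cache: a cache
# sitting between each GPU module's L1 and the package-level
# interconnect, so repeated remote accesses are served locally.
# All parameters below are hypothetical.

def remote_traffic_gb(accesses, line_bytes, l15_hit_rate):
    """Bytes that must cross the inter-module link, given that a
    fraction `l15_hit_rate` of remote accesses hit the local L1.5."""
    misses = accesses * (1.0 - l15_hit_rate)
    return misses * line_bytes / 1e9

ACCESSES = 2_000_000_000   # remote-data accesses during a workload (made up)
LINE = 128                 # cache line size in bytes (assumed)

for hit_rate in (0.0, 0.5, 0.8):
    gb = remote_traffic_gb(ACCESSES, LINE, hit_rate)
    print(f"L1.5 hit rate {hit_rate:.0%}: {gb:.0f} GB over the link")
# Every hit in the L1.5 is one cache line that never touches the
# bandwidth-limited inter-module interconnect.
```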

The implementation AMD describes in its patent is organized differently from what Nvidia envisions. AMD ties both the work group processors (WGPs, the shader cores) and the GFX blocks (fixed-function units) directly to the L1 cache. The L1 cache is itself connected to a Graphics Data Fabric (GDF), which links it to the L2. The L2 cache is coherent within any single chiplet, and any WGP or GFX block can read data from any part of it.
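
Here is a toy Python model of that intra-chiplet read path as we understand the patent. The class names (L1Cache, GraphicsDataFabric, L2Bank) and the address-interleaving policy are our own inventions for illustration, not AMD's.

```python
# Toy model of the intra-chiplet path the patent describes: WGPs and
# GFX blocks read through a private L1, which reaches the chiplet's
# L2 banks over the Graphics Data Fabric (GDF). Names are ours.

class L2Bank:
    def __init__(self):
        self.lines = {}          # address -> data, coherent chiplet-wide

class GraphicsDataFabric:
    """Routes L1 misses to whichever L2 bank owns the address."""
    def __init__(self, banks):
        self.banks = banks
    def read(self, addr):
        bank = self.banks[addr % len(self.banks)]   # simple address interleave
        return bank.lines.get(addr)

class L1Cache:
    def __init__(self, gdf):
        self.lines = {}
        self.gdf = gdf
    def read(self, addr):
        if addr in self.lines:               # L1 hit: stays next to the WGP
            return self.lines[addr]
        data = self.gdf.read(addr)           # L1 miss: cross the GDF to L2
        self.lines[addr] = data
        return data

# Any WGP or GFX block on the chiplet sees the same coherent L2 image:
banks = [L2Bank() for _ in range(4)]
banks[2].lines[2] = "vertex data"   # addr 2 interleaves to bank 2
gdf = GraphicsDataFabric(banks)
wgp_l1 = L1Cache(gdf)
gfx_l1 = L1Cache(gdf)
print(wgp_l1.read(2), gfx_l1.read(2))   # both read the same L2 line
```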

In order to wire multiple GPU chiplets into a cohesive GPU processor, AMD first connects each chiplet's L2 cache banks to the HPX passive crosslink through a Scalable Data Fabric (SDF). That crosslink handles the job of inter-chiplet communication: the SDF on each chiplet is wired to the others through the HPX passive crosslink, which the patent's figures draw as a single long link spanning the chiplets. The crosslink also attaches to the L3 cache banks on each chiplet, and in this implementation, the GDDR memory lanes are wired to the L3 cache.

AMD's patent assumes that only one GPU chiplet communicates with the CPU, with the passive crosslink tying the rest together behind a large, shared L3 cache. Nvidia's MCM-GPU doesn't use an L3 in this fashion.
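
The sketch below extends that toy model to the package level as we read the patent: each chiplet's SDF reaches a passive crosslink, the crosslink reaches every chiplet's L3 banks, and CPU traffic enters through a single chiplet. All names and the routing policy are our simplification, not logic taken from the patent.

```python
# Sketch of the inter-chiplet wiring as we read the patent: L2 misses
# leave a chiplet via its SDF, cross the passive crosslink, and land
# in whichever chiplet's L3 bank owns the address.

class L3Bank:
    def __init__(self, chiplet_id):
        self.chiplet_id = chiplet_id
        self.lines = {}                      # backed by this chiplet's GDDR lanes

class PassiveCrosslink:
    """Passive wiring between chiplets: routes a request to the L3
    bank that owns the address, wherever that bank physically sits."""
    def __init__(self):
        self.l3_banks = []
    def read(self, addr):
        bank = self.l3_banks[addr % len(self.l3_banks)]
        return bank.chiplet_id, bank.lines.get(addr)

class Chiplet:
    def __init__(self, chiplet_id, crosslink):
        self.chiplet_id = chiplet_id
        self.crosslink = crosslink
        self.l3 = L3Bank(chiplet_id)
        crosslink.l3_banks.append(self.l3)   # L3 banks are visible package-wide
    def sdf_read(self, addr):
        # An L2 miss leaves the chiplet through the SDF and crosslink.
        return self.crosslink.read(addr)

crosslink = PassiveCrosslink()
chiplets = [Chiplet(i, crosslink) for i in range(2)]
chiplets[1].l3.lines[3] = "texture tile"

# The CPU only ever addresses chiplet 0; the crosslink makes the
# whole package's L3 look like one shared pool behind it.
owner, data = chiplets[0].sdf_read(3)
print(f"CPU -> chiplet 0 -> crosslink -> L3 on chiplet {owner}: {data}")
```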

Theoretically, this is all very interesting, and we’ve already seen AMD ship a GPU with a big honkin’ L3 on it, courtesy of RDNA2’s Infinity Cache. Whether AMD will actually ship a part using GPU chiplets is a very different question from whether it wants patents on various ideas it might want to use.

Decoupling the CPU and GPU essentially reverses the work that went into combining them in the first place. One of the basic challenges the GPU chiplet approach must overcome is the intrinsically higher latencies created by moving these components away from each other.
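
A tiny worked example makes the shape of the problem visible. These latency figures are invented for illustration; none of them come from AMD.

```python
# Illustrative arithmetic (numbers invented, not from the patent) on
# the latency cost of pulling components apart: every hop off the die
# adds interconnect latency on top of the cache access itself.

ON_DIE_L2_NS = 20       # hypothetical on-die L2 access latency
CROSSLINK_HOP_NS = 15   # hypothetical one-way cost of the inter-chiplet hop
REMOTE_L3_NS = 40       # hypothetical access to an L3 bank on another chiplet

local = ON_DIE_L2_NS
remote = CROSSLINK_HOP_NS + REMOTE_L3_NS + CROSSLINK_HOP_NS   # there and back

print(f"local L2 read : {local} ns")
print(f"remote L3 read: {remote} ns ({remote / local:.1f}x)")
# The chiplet design has to hide or amortize this gap, e.g. with the
# large shared L3 keeping hot data close to each chiplet.
```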

Multi-chip GPUs are a topic that AMD and Nvidia have both been discussing for years. This patent doesn't confirm that any products will hit the market in the near term, or even that AMD will ever build a GPU this way.

 