Apple’s New M1 Ultra Packs a Revolutionary GPU

Apple’s new M1 Ultra SoC, announced yesterday, appears to be a genuine breakthrough. The M1 Ultra is built from two M1 Max chips and uses a GPU integration approach never before seen in a shipping product. While the SoC contains two GPUs, one per M1 Max die, games and applications running on the M1 Ultra see a single GPU.

During the unveiling, Apple acknowledged that the M1 Max SoC has a feature the company didn’t disclose last year: from the beginning, the M1 Max was designed to support a high-speed die-to-die interconnect via silicon interposer. According to Apple, this inter-chip network, dubbed UltraFusion, can provide 2.5TB/s of low-latency bandwidth. The company claims this is “more than 4x the bandwidth of the leading multi-chip interconnect technology.”

Apple doesn’t appear to be referencing CPU interconnects here. AMD’s Epyc CPUs use Infinity Fabric, which supports a maximum of 204.8GB/s of bandwidth across the entire chip when paired with DDR4-3200. Intel’s Skylake Xeons use Ultra Path Interconnect (UPI), with a 41.6GB/s connection between two sockets. Neither of these is anywhere close to the roughly 625GB/s implied by Apple’s claim (2.5TB/s divided by four). Apple may instead be referencing Nvidia’s GA100, which offers ~600GB/s of bandwidth via NVLink 3.0. If NVLink 3.0 is the appropriate comparison, Apple is claiming its new desktop SoC offers more than 4x the inter-chip bandwidth of Nvidia’s top-end server GPU.
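
As a quick sanity check on that arithmetic, here’s a back-of-envelope sketch in Python. All figures are the vendor numbers quoted above, not measurements:

```python
# Quoted interconnect bandwidths, in GB/s (vendor/press figures, not measurements).
interconnects_gbps = {
    "Apple UltraFusion (claimed)": 2500.0,           # 2.5TB/s
    "Nvidia NVLink 3.0 (GA100)": 600.0,              # ~600GB/s
    "AMD Infinity Fabric (Epyc, DDR4-3200)": 204.8,
    "Intel UPI (Skylake Xeon)": 41.6,
}

ultrafusion = interconnects_gbps["Apple UltraFusion (claimed)"]
for name, bandwidth in interconnects_gbps.items():
    if name.startswith("Apple"):
        continue
    ratio = ultrafusion / bandwidth
    print(f"{name}: {bandwidth:,.1f} GB/s -> UltraFusion claim is {ratio:.1f}x higher")
```

2,500 divided by 600 comes out to roughly 4.2x, which squares with Apple’s “more than 4x” wording if NVLink 3.0 is the comparison point.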

According to Apple, providing such a massive amount of bandwidth allows the M1 Ultra to behave, and be recognized by software, as a single chip with a unified 128GB memory pool shared between the CPU and GPU. The company claims there’s never been anything like it, and it might be right. Nvidia and AMD have both done work on the concept of disaggregated GPUs, but neither company has ever brought a product to market.

The Long Road to GPU Chiplets

The concept of splitting GPUs into discrete blocks and aggregating them together on-package predates common usage of the word “chiplet”, even though that’s what we’d call this approach today. Nvidia performed a study on the subject several years ago.

GPUs are some of the largest chips manufactured on any given process node. The same economies of scale that make CPU chiplets affordable and efficient could theoretically benefit GPUs as well. The problem with GPU chiplets is that scaling workloads across multiple GPUs typically requires a great deal of fabric bandwidth between the chips themselves. The more chiplets you want to combine, the harder it is to wire them all together without hurting sustained performance.
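
To see why, consider the simplest possible topology math: fully connecting n chiplets requires n(n-1)/2 point-to-point links, so wiring complexity grows quadratically with chiplet count. A minimal sketch (illustrative only; real designs use mesh or switched fabrics rather than full all-to-all wiring):

```python
# Link count for a fully connected (all-to-all) chiplet topology.
# Each pair of chiplets gets one dedicated point-to-point link.
def all_to_all_links(chiplets: int) -> int:
    return chiplets * (chiplets - 1) // 2

for n in (2, 4, 8, 16):
    print(f"{n} chiplets -> {all_to_all_links(n)} links")
# 2 -> 1, 4 -> 6, 8 -> 28, 16 -> 120
```

Two dies need only a single link, which makes a two-die part like the M1 Ultra the easy case; the wiring problem compounds quickly beyond that.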

Memory bandwidth and latency limitations are part of why AMD, Intel, and Nvidia have never shipped a dual-graphics solution that could easily take advantage of the integrated GPU built into many modern CPUs. Apple has apparently found a way around this problem where PC manufacturers haven’t. The reason may have more to do with the company’s addressable market than with design shortcomings at Intel or AMD.

Apple Has Unique Design Incentives

Both Intel and AMD manufacture chips for other people to build things with; Apple builds components only for itself. Intel and AMD maintain and contribute to manufacturing ecosystems for desktops and laptops, and their customers value flexibility.

Companies like Dell, HP, and Lenovo want to be able to combine CPUs and GPUs in various ways to hit price points and appeal to customers. From Apple’s perspective, however, the money customers fork over for a third-party GPU represents revenue it could have earned for itself. While both Apple and PC OEMs earn additional profit when they sell a system with a discrete GPU, sharing that profit with AMD, Nvidia, or Intel is the price OEMs pay for not doing the GPU R&D themselves.

A PC customer who builds a 16-core desktop almost certainly expects to be able to upgrade the GPU over time. Some high-core-count customers don’t care much about GPU performance, but for those who do, the ability to upgrade a system over time is a major feature. Apple, in contrast, has long downplayed system upgradability.

The closest x86 analogs to the M1 Ultra are the SoCs inside the Xbox Series X and PlayStation 5. While neither console features on-package RAM, both integrate powerful GPUs directly on-die in systems meant to sell for $500. One reason we don’t see such chips in the PC market is that OEMs value flexibility and modularity more than the ability to standardize on a handful of chips for years at a time.

It may be that one reason we haven’t seen this kind of chip from AMD, Intel, or Nvidia is that none of them has had a particular incentive to build it.

How Apple’s M1 Max Uses Memory Bandwidth

When Apple’s M1 Max shipped, tests showed that its CPU cores can’t access the system’s full memory bandwidth. Of the 400GB/s of theoretical bandwidth available to the M1 Max, the CPU cluster can only use ~250GB/s.

The rest of the bandwidth is allocated to other blocks of the SoC. AnandTech measured the GPU pulling ~90GB/s of bandwidth and the rest of the fabric consuming another 40-50GB/s during heavy use.
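
A rough accounting of those figures, using the numbers cited above (they were measured under different workloads, so they don’t represent a simultaneous total):

```python
# Rough M1 Max bandwidth budget, in GB/s, from the figures cited above.
theoretical_peak = 400.0        # LPDDR5 theoretical maximum

observed = {
    "CPU cores": 250.0,         # measured CPU-side ceiling
    "GPU": 90.0,                # measured under heavy GPU load
    "rest of fabric": 45.0,     # midpoint of the 40-50GB/s estimate
}

total = sum(observed.values())
print(f"Sum of observed peak consumers: {total:.0f} GB/s")
print(f"Theoretical peak:               {theoretical_peak:.0f} GB/s")
print(f"Unaccounted headroom:           {theoretical_peak - total:.0f} GB/s")
```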

Given these kinds of specs, simply laying down two chips side by side, each with its own RAM pool, doesn’t automatically sound like an enormous achievement. AMD already mounts eight compute chiplets plus an I/O die on a single package in a 64-core Epyc CPU. But this is where Apple’s scaling claims carry weight.

For the M1 Ultra’s GPU to work as a unified solution, both GPUs must share data and memory addresses across the two physical dies. In a conventional multi-GPU setup, a pair of cards with 16GB of VRAM each appears as two 16GB cards, not as a single card with 32GB of VRAM. Nvidia’s NVLink allows two or more GPUs to pool VRAM, but the degree of performance improvement varies considerably depending on the workload.
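
To make that distinction concrete, here’s a toy allocation model. This is purely illustrative, not Apple’s or Nvidia’s actual memory manager; it only shows why a single large allocation fails across split pools but fits in a unified one:

```python
GB = 1024**3

def try_alloc(pools: list[int], size: int) -> bool:
    """Grab memory from the first pool with room; no allocation may span pools."""
    for i, free in enumerate(pools):
        if free >= size:
            pools[i] -= size
            return True
    return False

split_pools = [16 * GB, 16 * GB]   # conventional dual GPU: two separate 16GB pools
unified_pool = [32 * GB]           # M1 Ultra-style: one logical 32GB pool

request = 20 * GB
print("split pools: ", try_alloc(split_pools, request))    # False: no single pool fits 20GB
print("unified pool:", try_alloc(unified_pool, request))   # True: fits with room to spare
```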

As for what kind of GPU performance customers should expect? That’s unclear. The M1 Max performs well in video-processing workloads but is a mediocre gaming GPU. The M1 Ultra should see strong scaling from doubling its GPU resources, but Apple’s lackluster support for Mac gaming could undercut any performance advantage the hardware delivers.

Apple’s big breakthrough here is in building a GPU from two distinct slices that apparently behaves like a single logical card. AMD and Nvidia have continued working on graphics chiplets over the years, implying we’ll see discrete chiplet solutions from both companies in the future. We’ll have more to say about the performance ramifications of Apple’s design once benchmarks show us how well it scales.
