The Apple M1 Ultra Crushes Intel in Computational Fluid Dynamics Performance

It’s surprisingly hard to pin down exactly how Apple’s M1 compares to Intel’s x86 processors. While the chip family has been widely reviewed in a number of common consumer applications, inevitable differences between macOS and Windows, the impact of emulation, and varying degrees of optimization between x86 and M1 all make precise measurement more difficult.

An interesting new benchmark result and accompanying review from app developer and engineer Craig Hunter shows the M1 Ultra absolutely destroying every Intel x86 CPU in the field. It’s not even a fair fight. According to Hunter’s results, an M1 Ultra running six threads matches the performance of a 28-core Xeon workstation from 2019.

Any lingering hopes that the M1 Ultra suffers a sudden and unexplained scaling calamity above six cores are dashed once we extend the graph’s y-axis high enough to accommodate the data.

This is an enormous win for the M1. Apple’s new CPU is more than 2x faster than the 28-core Mac Pro’s highest result. But what do we know about the test itself?

Hunter benchmarks USM3D, which NASA describes as “a tetrahedral unstructured flow solver that has become widely used in industry, government, and academia for solving aerodynamic problems. Since its first introduction in 1989, USM3D has steadily evolved from an inviscid Euler solver into a full viscous Navier-Stokes code.”

As previously noted, this is a computational fluid dynamics test, and CFD tests are notoriously sensitive to memory bandwidth. We’ve never tested USM3D at ExtremeTech and it isn’t an application that I’m familiar with, so we reached out to Hunter for some additional clarification on the test itself and how he compiled it for each platform. There has been some speculation online that the M1 Ultra hit these performance levels thanks to advanced matrix extensions or another, unspecified optimization that was not in play for the Intel platform.

According to Hunter, that’s not true.

“I didn’t link to any Apple frameworks when compiling USM3D on M1, or attempt to tune or optimize code for Accelerate or AMX,” the engineer and app developer said. “I used the stock USM3D source with gfortran and did a fairly standard compile with -O3 optimization.”

“To be honest, I think this puts the M1 USM3D executable at a slight disadvantage to the Intel USM3D executable,” he continued. “I’ve used the Intel Fortran compiler for over 30 years (it was DEC Fortran then Compaq Fortran before becoming Intel Fortran) and I know how to get the most out of it. The Intel compiler does some aggressive vectorization and optimization when compiling USM3D, and historically it has given better performance on x86-64 than gfortran. So I expect I left some performance on the table by using gfortran for M1.”

We asked Hunter what he felt explained the M1 Ultra’s performance relative to the various Intel systems. The engineer has decades of experience evaluating CFD performance on various platforms, ranging from desktop systems like the Mac Pro and Mac Studio to actual supercomputers.

“Based on all the testing past and present, I feel like it’s the SoC architecture that is making the biggest difference here with the Apple Silicon machines, and as we invoke more cores into the computation, system bandwidth is going to be the main driver for performance scaling. The M1 Ultra in the Studio has an insane amount of system bandwidth.”
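Hunter’s scaling argument can be illustrated with a deliberately simple model. The sketch below assumes each core can stream a fixed amount of data per second until the shared memory system saturates; the function names and every number in it are hypothetical placeholders chosen for illustration, not measurements from Hunter’s tests or from any real machine.

```python
import math


def bandwidth_limited_throughput(cores, per_core_gbs, system_gbs):
    """Aggregate memory throughput in GB/s for `cores` active cores.

    Each core is assumed to pull `per_core_gbs` until the shared
    memory system tops out at `system_gbs`.
    """
    return min(cores * per_core_gbs, system_gbs)


def saturation_cores(per_core_gbs, system_gbs):
    """Smallest core count at which the memory system is saturated."""
    return math.ceil(system_gbs / per_core_gbs)


# Hypothetical figures, for illustration only.
PER_CORE_GBS = 20.0      # assumed per-core streaming rate
NARROW_SYSTEM = 140.0    # assumed system with modest memory bandwidth
WIDE_SYSTEM = 800.0      # assumed system with very high memory bandwidth

for cores in (1, 4, 8, 16):
    narrow = bandwidth_limited_throughput(cores, PER_CORE_GBS, NARROW_SYSTEM)
    wide = bandwidth_limited_throughput(cores, PER_CORE_GBS, WIDE_SYSTEM)
    print(f"{cores:2d} cores: narrow {narrow:6.1f} GB/s, wide {wide:6.1f} GB/s")
```

Under these assumptions the narrow system stops scaling at seven cores, while the wide system keeps scaling until forty. Real solvers are messier than this, but the shape of the curve is the point: once the memory system is full, extra cores add little.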

“The benchmark is based on the NASA USM3D CFD code, which is available to US Citizens by request at software.nasa.gov. It comes as source code and will need to be compiled with a Fortran compiler (you also will need to build OpenMPI with matching compiler support). The makefiles are setup for macOS or Linux using the Intel Fortran compiler, which creates a highly optimized executable for x86-64. You could also use gfortran (what I used for the arm-64 Apple M1 systems) but I’d expect the performance to be lower than what ifort can enable on x86-64.”

What These Results Say About the x86 / M1 Matchup

It’s not exactly surprising that an SoC with more memory bandwidth than any previous CPU would perform well in a bandwidth-constrained environment. What’s interesting about these results is that they don’t necessarily depend on any particular facet of ARM versus x86. Give an AMD or Intel CPU as much memory bandwidth as Apple is fielding here, and performance might improve similarly.
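The ISA-independence of that claim is easiest to see in a roofline-style estimate, where attainable performance is the lesser of peak compute and memory bandwidth multiplied by the kernel’s arithmetic intensity. The following is a minimal sketch under stated assumptions: the function name and all values are hypothetical, and nothing here is derived from USM3D itself.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: performance is capped by whichever is lower,
    peak compute or (memory bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical values, for illustration only.
PEAK_GFLOPS = 2000.0   # assumed peak compute, identical for both chips
INTENSITY = 0.25       # assumed flops per byte for a bandwidth-bound kernel

low_bw = attainable_gflops(PEAK_GFLOPS, 100.0, INTENSITY)
high_bw = attainable_gflops(PEAK_GFLOPS, 400.0, INTENSITY)

print(f"100 GB/s: {low_bw:.1f} GFLOP/s")
print(f"400 GB/s: {high_bw:.1f} GFLOP/s")
print(f"speedup from bandwidth alone: {high_bw / low_bw:.1f}x")
```

Nothing in the model refers to the instruction set. With compute held constant, quadrupling bandwidth quadruples the estimate for a kernel this far below the compute roof, which is the sense in which an x86 part with comparable bandwidth might see similar gains.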

In my article “RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs,” I spent some time discussing how Intel won the ISA wars decades ago not because x86 was intrinsically the best instruction set architecture, but because it could leverage an array of continuous manufacturing improvements while iteratively improving x86 from generation to generation. Here, we see Apple arguably doing something similar. The M1 Ultra isn’t trashing every Intel x86 CPU because it’s magic, but because integrating DRAM on-package in the way Apple did unlocked tremendous performance improvements. There is no reason x86 CPUs can’t take advantage of these gains as well. The fact that this benchmark is so limited by memory bandwidth does suggest that top-end Alder Lake systems might match or exceed older Xeons like the one in the 28-core Mac Pro, but they still wouldn’t match the M1 Ultra for sheer bandwidth between the SoC and main memory.

In fact, we do see x86 CPUs taking baby steps towards integrating more high-speed memory directly on package, but Intel is keeping this technology focused on servers for now, with Sapphire Rapids and its on-package HBM2 memory (available on some future SKUs). Neither Intel nor AMD has built anything like the M1 Ultra, however, at least not yet. Thus far, AMD has focused on integrating larger L3 caches rather than moving towards on-package DRAM. Any such move would require buy-in from OEMs and multiple other players in the PC manufacturing space.

I don’t expect either x86 manufacturer to rush to adopt technology just because Apple is using it, but the M1 puts up some extraordinary performance in certain tests, at excellent performance per watt. You can bet every aspect of the Cupertino company’s approach to manufacturing and design has been put under a (likely literal) microscope at AMD and Intel. That especially applies to gains that aren’t tied to any particular ISA or manufacturing technology.
