Linus Torvalds isn’t happy with the way Intel has treated support for Error Correcting Code (ECC) memory, and he blames the silicon giant for essentially killing the technology outside of servers. ECC memory is used to catch and correct single-bit errors in memory. It can’t correct multi-bit errors, but just fixing single-bit errors can make a significant difference to system stability.
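To make the single-bit versus multi-bit distinction concrete, here is a minimal sketch of an extended Hamming code that protects 4 data bits with 4 check bits. It is a toy illustration only: real ECC DIMMs use wider codes (typically 8 check bits per 64 data bits), and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d     # data lives at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flip. The syndrome is the position of
        # the flipped bit (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped. Detected, not fixable.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                  # simulate a single-bit memory error
print(decode(word))           # the original value comes back, flagged as corrected
word[6] ^= 1                  # a second flip in the same word
print(decode(word))           # detected, but the data cannot be recovered
```

The same trade-off applies at DIMM scale: one flipped bit per protected word is repaired silently, while two flipped bits can only be reported.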
There was a time when you could buy ECC support on mainstream chipsets, but Intel phased out that capability on non-Xeon platforms a number of years ago. The 975X may have been the last consumer Intel platform to support it, and that family launched 15 years ago. The Xeon 3450 chipset was cross-compatible with certain high-end CPUs in the Nehalem family, but that’s still a Xeon chipset — not a mainstream part.
As a result, both support for ECC in consumer products and the availability of ECC RAM for those products fell off a cliff. Linus summarizes his case in a rather lengthy post, citing the continued persistence of Rowhammer and the fact that single-bit errors have never gone away to declare Intel’s ECC policies “bad and misguided.” He actually takes on the entire DRAM industry, writing:
The memory manufacturers claim it’s because of economics and lower power. And they are lying bastards – let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an “attack”, when it always was “we’re cutting corners”.
Torvalds also refers to numerous incidents of kernel “oopsies” that he feels may be better explained by a hardware error. While objective data on this kind of thing is hard to come by, a 2009 Google report on memory errors provides some evidence he’s right, though obviously a 2009 paper may have limited applicability to DDR4 RAM in 2020.
Google’s conclusion from 2009 was straightforward: “We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported… Memory errors are not rare events.” The team detected error rates that it describes as “orders of magnitude higher than previously reported.”
They conclude: “error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors.”
AMD’s Current Support of Limited Value
On paper, AMD’s Ryzen family supports ECC unofficially (Threadripper has official ECC support). As Ian Cutress points out later in the thread, however, just because a motherboard claims ECC support doesn’t mean that support is actually enabled. We don’t run into this situation very often, but CPUs and motherboards report their feature sets via registers, which applications like CPUID read to determine and report which features a chip supports. An application that claims to verify support for a given feature (SSE, AVX, ECC, etc.) can only report what the CPU or motherboard says about itself via register flags. It can’t confirm that the feature actually works unless it contains a functional test, such as a small benchmark that literally can’t run unless AVX support is functional.
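As a rough illustration of the "reporting" side of that distinction, the sketch below parses the feature flags that an x86 Linux kernel exposes in `/proc/cpuinfo`. It assumes a Linux system and the x86 `flags` line format, and the helper name `reported_flags` is made up for this example. ECC status is generally not reported through this list, so the example only shows flag-based reporting for CPU features such as SSE and AVX. Note what it does not do: it never executes an AVX instruction, so it can only repeat what the hardware claims.

```python
def reported_flags(cpuinfo_text):
    """Return the set of feature flags listed in /proc/cpuinfo-style text.

    This reflects what the CPU reports about itself. It does not prove that
    any listed feature actually works.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (for example, on a non-x86 system)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = reported_flags(f.read())
    except OSError:
        flags = set()  # not Linux, or the file is unreadable
    for feature in ("sse2", "avx", "avx2"):
        print(feature, "claimed" if feature in flags else "not claimed")
```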
Because AMD’s support is unofficial, no one is standing over OEMs with a whip to make sure they implement the feature properly, and they aren’t testing to make sure it actually works. Because it’s possible to set the “Supports ECC” bit in a motherboard register without implementing functional ECC, some motherboards claim to support the standard, and appear to do so when scanned with a utility, but don’t actually implement ECC at all. The only way to be sure that ECC works on an AMD Ryzen motherboard is to run a utility that forces an ECC error.
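On Linux, one way to see whether a forced error was actually caught is to look at the kernel's EDAC counters before and after the test. The sketch below is a minimal reader for those counters, assuming a Linux system with an EDAC driver loaded for the memory controller. The function name `read_edac_counts` is made up for this example, and the script does not inject errors itself.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}} from EDAC sysfs.

    An empty result means no memory controller is registered with EDAC, which
    is consistent with (but does not prove) ECC reporting being inactive.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for name, c in result.items():
        print(name, c)
```

If the corrected-error count rises after a run that deliberately provokes a memory error, that is evidence the board is doing real correction rather than just setting a flag.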
As for whether we’ll see the feature make a return to Intel desktops or officially debut for Ryzen, that’s unclear. It would require buy-in from memory manufacturers, and it’s not clear very many people in the PC market would spring for it. Most people buy on price, and since you never know about the PC crashes you don’t have, it’s hard to sell people on the benefit. Then again, we’re going to see the x86 CPU manufacturers facing much stiffer challenges from ARM over the next 2-5 years than we’ve ever seen before. It wouldn’t be surprising to see Intel and/or AMD “rediscover” some features, especially if those features allow them to claim increased stability compared to previous products.
Feature image shows registered DDR4-2133 DIMMs. Registered DIMMs often also support ECC, but it’s possible to find unbuffered ECC RAM as well.