Linus Torvalds isn’t happy with the way Intel has treated support for Error Correcting Code (ECC) memory, and he blames the silicon giant for essentially killing the technology outside of servers. ECC memory is used to catch and correct single-bit errors in memory. It can’t correct multi-bit errors, but just fixing single-bit errors can make a significant difference to system stability.
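To make the single-bit versus multi-bit distinction concrete, here is a minimal sketch of a single-error-correct, double-error-detect (SECDED) code in Python. It is a toy: it protects 4 data bits with 4 check bits using an extended Hamming code, whereas ECC DIMMs typically protect 64 data bits with 8 check bits. The function names `encode` and `decode` are made up for this illustration and do not describe how any particular memory controller is implemented.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits. It is zero
    # for a valid codeword and equals the flipped position after one flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2

    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even total parity but a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # simulate a single-bit memory error
    print(decode(word))
    word[6] ^= 1                 # a second flip in the same word
    print(decode(word))
```

The first flip is repaired silently, while the second can only be reported. That mirrors the behavior described above: single-bit errors are fixed, and multi-bit errors are at best detected.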
There was a time when you could buy ECC support on mainstream chipsets, but Intel phased out that capability on non-Xeon platforms a number of years ago. The 975X may have been the last consumer Intel platform to support it, and that family launched 15 years ago. The Xeon 3450 chipset was cross-compatible with certain high-end CPUs in the Nehalem family, but that’s still a Xeon chipset — not a mainstream part.
As a result, support for ECC in consumer products, and the availability of ECC RAM for those products, fell off a cliff. Linus lays out his case in a rather lengthy post, citing the continued persistence of Rowhammer and the fact that single-bit errors have never gone away as grounds for declaring Intel’s ECC policies “bad and misguided.” He actually takes on the entire DRAM industry, writing:
The memory manufacturers claim it’s because of economics and lower power. And they are lying bastards – let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an “attack”, when it always was “we’re cutting corners”.
Torvalds also refers to numerous incidents of kernel “oopsies” that he feels may be better explained by a hardware error. While objective data on this kind of thing is hard to come by, a 2009 Google report on memory errors provides some evidence he’s right, though obviously a 2009 paper may have limited applicability to DDR4 RAM in 2020.
Google’s conclusion from 2009 was straightforward: “We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported… Memory errors are not rare events.” The team detected error rates that it describes as “orders of magnitude higher than previously reported.”
They conclude: “error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors.”
AMD’s Current Support of Limited Value
On paper, AMD’s Ryzen family supports ECC unofficially (Threadripper has official ECC support). As Ian Cutress points out later in the thread, however, just because a motherboard claims ECC support doesn’t mean that support is actually enabled. We don’t run into this situation very often, but CPUs and motherboards report their feature sets via registers, which applications like CPUID then read to determine and report which features a chip supports. An application that claims to verify a given feature (SSE, AVX, ECC, etc.) can only report what the CPU or motherboard says about itself via register flags. It can’t confirm that the feature actually works unless it contains a functional test, such as a small benchmark that literally can’t run unless AVX support is functional.
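The "reported, not tested" half of that distinction is easy to see in a few lines of Python. This sketch assumes an x86 Linux system, where the kernel lists the CPU's advertised feature flags on the `flags` line of `/proc/cpuinfo`; the helper name `advertised_flags` is invented for illustration. Note that it only repeats what the hardware claims and runs no instruction that would prove a feature works.

```python
def advertised_flags(cpuinfo_text):
    """Return the set of feature flags listed on the first 'flags' line.

    This only reflects what the CPU reports about itself; it does not
    execute anything that would prove a given feature is functional.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    flags = advertised_flags(text)
    for feature in ("sse2", "avx", "avx2"):
        state = "advertised" if feature in flags else "not advertised"
        print(feature, state)
```

A functional check would have to go one step further and actually execute code that depends on the feature, which is the kind of test described above.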
Because AMD’s support is unofficial, it means no one is standing over OEMs with a whip to make sure they properly implement the feature, and they aren’t testing to make sure the feature actually works. Because it’s possible to set the bit for “Supports ECC” in a motherboard register without actually implementing functional ECC, there are motherboards out there that claim to support the standard and appear to do so if you scan them with a utility, but don’t actually implement ECC at all. The only way to guarantee that ECC compatibility works on an AMD Ryzen motherboard is to run a utility that forces an ECC error.
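On Linux, one place to look for evidence that ECC is doing something is the kernel's EDAC subsystem, which exposes per-memory-controller counters for corrected and uncorrected errors in sysfs. The sketch below reads those counters. It assumes the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout and a loaded EDAC driver for the platform; the function name `read_edac_counts` is hypothetical. It doesn't force an error itself, so a zero count proves nothing on its own. The counters only become meaningful once an error has been deliberately provoked by some other means.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count is the number of corrected errors and ue_count the number of
    uncorrected errors the kernel has logged for that memory controller.
    An empty dict means no EDAC memory controller was found under root.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found; ECC reporting is not active.")
    for name, entry in found.items():
        print(name, "corrected:", entry["ce_count"],
              "uncorrected:", entry["ue_count"])
```

If the corrected-error counter increments after a forced error, the platform is detecting and fixing errors rather than merely advertising that it can.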
As for whether we’ll see the feature make a return to Intel desktops or officially debut for Ryzen, that’s unclear. It would require buy-in from memory manufacturers, and it’s not clear very many people in the PC market would spring for it. Most people buy on price, and since you never know about the PC crashes you don’t have, it’s hard to sell people on the benefit. Then again, we’re going to see the x86 CPU manufacturers facing much stiffer challenges from ARM over the next 2-5 years than we’ve ever seen before. It wouldn’t be surprising to see Intel and/or AMD “rediscover” some features, especially if those features allow them to claim increased stability compared to previous products.
Feature image shows registered DDR4-2133 DIMMs. Registered DIMMs often also support ECC, but it’s possible to find unbuffered ECC RAM as well.