Amazon’s New World Beta Didn’t Directly Kill EVGA RTX 3090s

Earlier this summer, Amazon released its MMO, New World, for beta testing. Reports of failed RTX 3090 GPUs soon began popping up online, and the failures were concentrated specifically in EVGA’s RTX 3090 cards. Back when Ampere launched, some wondered whether the different capacitors used on some models were a potential source of instability.

The strange part of the story was that software almost never kills hardware. There are two exceptions to this rule: firmware updates (which stretch the definition of “software”) and applications that somewhat mimic thermal viruses like Furmark and Prime95. Even then, well-cooled components with adequate VRM cooling typically do not fail. And the bare handful of applications that can cause component failure almost never do so without contributing factors like the age of the equipment and its operating temperature.

EVGA pledged to replace the affected cards, analyzed 24 of them, and shared the results with PCWorld. According to Gordon Mah Ung: “Under an X-ray analysis, they [the affected RTX 3090s] appear to have ‘poor workmanship’ on soldering around the card’s MOSFET circuits that powered the impacted cards.” The failures were confined to the RTX 3090 family, and EVGA says it shipped new cards to gamers immediately rather than waiting for units to be returned.

MOSFETs (Metal Oxide Semiconductor Field Effect Transistors) are a critical component in power circuitry. Poor soldering around the MOSFETs themselves could cause failure by increasing electrical resistance, and with it temperature, to the point that the hardware fails.
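To see why a little extra resistance matters, here is a minimal sketch of resistive (Joule) heating, P = I²R. The current and resistance figures are hypothetical, chosen only for illustration. They are not measurements from EVGA’s analysis or from any real card.

```python
def joule_heating_watts(current_amps, resistance_ohms):
    """Power dissipated as heat in a resistance: P = I^2 * R."""
    return current_amps ** 2 * resistance_ohms


# Hypothetical values for illustration only (not from EVGA or PCWorld).
PHASE_CURRENT_A = 40.0    # assumed current through one power phase
GOOD_JOINT_OHMS = 0.0005  # assumed resistance of a sound solder joint
POOR_JOINT_OHMS = 0.0050  # assumed resistance of a poorly soldered joint

good = joule_heating_watts(PHASE_CURRENT_A, GOOD_JOINT_OHMS)
poor = joule_heating_watts(PHASE_CURRENT_A, POOR_JOINT_OHMS)
print(f"good joint: {good:.1f} W, poor joint: {poor:.1f} W")
```

With these assumed numbers, a tenfold increase in joint resistance means ten times the heat in the same tiny spot (0.8 W versus 8.0 W), and the heat scales with the square of the current, so heavier loads make the gap worse.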

Why Did RTX 3090s Fail When Running New World, Specifically?

Here’s where we have to split a hair. The EVGA cards died because they suffered from manufacturing defects, not because of something wrong with New World. But New World exposed the failure point because of design decisions Amazon made.

GPUs are enormous parallel calculation machines, but they don’t have a subjective understanding of “enough.” To a GPU, “enough” is either “the frame rate I’ve been ordered to deliver” or “as fast as I possibly can without violating one of my pre-programmed constraints on temperature, voltage, and current.” There’s nothing in between.
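The two meanings of “enough” can be sketched as a render loop with an optional frame cap. This is a simplified illustration, not New World’s code or any engine’s actual limiter; `run_frames`, `render_frame`, and `fps_cap` are hypothetical names, and the clock and sleep functions are passed in only so the loop is easy to test.

```python
import time


def run_frames(render_frame, frame_count, fps_cap=None,
               clock=time.perf_counter, sleep=time.sleep):
    """Render frame_count frames and return the total seconds spent idle.

    With fps_cap=None the loop never waits: each frame starts the moment
    the previous one finishes ("as fast as I possibly can").
    With a cap, any time left in the frame budget is spent sleeping.
    """
    frame_budget = None if fps_cap is None else 1.0 / fps_cap
    total_idle = 0.0
    for _ in range(frame_count):
        start = clock()
        render_frame()
        if frame_budget is not None:
            remaining = frame_budget - (clock() - start)
            if remaining > 0:
                sleep(remaining)  # the hardware gets a rest here
                total_idle += remaining
    return total_idle
```

In the uncapped case the idle time is always zero, no matter how trivial the frame is. That is the situation a very light scene creates for a very fast GPU.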

It’s possible that the reason the RTX 3090s died is that rendering a menu doesn’t typically require a lot of RAM to store the data. If the rendering workload can be held within on-die L2 or requires minimal memory access, it can be fed to the execution units as quickly as those units can fetch it. A simple workload without bottlenecks is more likely to scale. In a more complicated scenario (i.e., actual gameplay), the GPU’s execution efficiency falls and frame rates drop correspondingly. This limits the heat build-up on the chip.
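A toy model makes the argument concrete. It assumes each frame costs some time doing arithmetic plus some time stalled on memory, and it ignores clocks, power limits, and everything else a real GPU does. The function name and all the numbers are hypothetical.

```python
def frame_stats(compute_ms, memory_stall_ms):
    """Return (frames per second, fraction of time the execution units are busy).

    Toy model: frame time is compute time plus time stalled on memory.
    """
    frame_ms = compute_ms + memory_stall_ms
    fps = 1000.0 / frame_ms
    busy_fraction = compute_ms / frame_ms
    return fps, busy_fraction


# Hypothetical workloads, for illustration only.
menu_fps, menu_busy = frame_stats(compute_ms=0.5, memory_stall_ms=0.0)
game_fps, game_busy = frame_stats(compute_ms=6.0, memory_stall_ms=6.0)
print(f"menu:     {menu_fps:.0f} fps, execution units busy {menu_busy:.0%}")
print(f"gameplay: {game_fps:.0f} fps, execution units busy {game_busy:.0%}")
```

Under these assumptions the menu runs at thousands of frames per second with the execution units never waiting, while gameplay leaves them idle half the time. If heat tracks how busy the chip is, the simpler workload is the hotter one.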

The fact that the overwhelming majority of RTX 3090 cards didn’t die is proof that Amazon’s uncapped frame rate wasn’t enough of a problem to harm most cards. It killed a handful of GPUs that had pre-existing manufacturing defects, however, because doing nearly nothing as quickly as possible likely generated more heat than doing lots of complex things at a much lower frame rate.

I have a theory that this is a larger problem for GPUs than CPUs because of fundamental differences in their design. A GPU is designed to perform one operation across a huge range of cores simultaneously. A CPU is designed to execute complex, branchy code as quickly as possible. If you write a simplistic application on the CPU, what you’ll get is a single-core, single-thread program with SSE2 optimization — or x87, if you really want to get historic. If you want to run that code at peak performance, you have to optimize it for AVX2 or AVX-512. It needs to support multiple cores. The naive implementation on x86 doesn’t lend itself to high execution efficiency across a modern, multi-core chip. Storing the code in L1 or L2 will make it execute more quickly, but without multi-core support, you’ll still only be stressing a single core.
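The CPU side of that comparison can be sketched in Python rather than x86 assembly, so treat it as an analogy. The naive version below is an ordinary loop that keeps one core busy; loading every core takes extra, deliberate work to split the job and hand the pieces to separate processes. All function names here are hypothetical.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(bounds):
    """Naive, single-core work: sum i*i for i in range(start, stop)."""
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i * i
    return total


def split_range(n, parts):
    """Split range(0, n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def sum_of_squares_all_cores(n, workers=None):
    """Same result, but the chunks run in separate processes, one per core."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split_range(n, workers)))
```

Calling `sum_of_squares((0, n))` is the program most people would write first, and it stresses a single core. `sum_of_squares_all_cores(n)` gives the same answer while loading every core, but only because the chunking and process pool were added by hand (and, on platforms that spawn worker processes, it has to be called from under an `if __name__ == "__main__":` guard). A GPU workload is spread across the chip by default.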

This is not to say that one cannot write an x86 power virus, or make coding mistakes that drive up power consumption, but GPUs seem to offer a swifter path to the same outcome. “Perform this minimally bottlenecked operation across the entire chip as quickly as you can” is an easier way to stress a GPU because GPUs are built to run that kind of operation to begin with.

Purely a theory on my part, but it would also explain why specific feature tests and sometimes older titles are a good way to test maximum power consumption on GPUs. Asking the card to do less sometimes encourages it to run hotter than asking it to do more.
