AMD-Powered Exascale Supercomputer Has a System Failure Every Few Hours

It’s a fact of life that keeping your PC running in tip-top shape requires constant vigilance. Now imagine trying to keep a supercomputer with millions of parts running smoothly. That’s the task facing Justin Whitt, Program Director at Oak Ridge National Laboratory (ORNL). The lab’s new Frontier supercomputer is the world’s first exascale computer, and it debuted in June 2022 in the #1 spot on the TOP500 list of the world’s most powerful computers. In a recent interview, Whitt described some of Frontier’s teething problems, saying the machine is currently experiencing system failures every few hours.

Frontier was under construction for three years, at a reported cost of $600 million. It runs on the HPE Cray EX235a architecture with AMD EPYC 64C processors. It features 74 cabinets with more than 9,400 AMD-powered nodes for a total of 606,208 CPU cores. Its 37,888 AMD Instinct MI250X GPUs give it 8,335,360 GPU cores as well. It uses over 90 miles of cabling to connect everything.
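The headline numbers above can be cross-checked with simple division. The following is a rough sketch rather than an official breakdown: it assumes one 64-core CPU per node (the article only says "64C" and "more than 9,400" nodes), and the variable names are ours.

```python
# Figures quoted in the article.
CPU_CORES_TOTAL = 606_208
GPUS_TOTAL = 37_888
GPU_CORES_TOTAL = 8_335_360
CORES_PER_CPU = 64  # from "AMD EPYC 64C"

# Assumption (not stated in the article): each node holds exactly one CPU.
nodes, cpu_remainder = divmod(CPU_CORES_TOTAL, CORES_PER_CPU)

# Derived ratios; a zero remainder means the quoted totals divide evenly.
gpus_per_node, gpu_remainder = divmod(GPUS_TOTAL, nodes)
cores_per_gpu, core_remainder = divmod(GPU_CORES_TOTAL, GPUS_TOTAL)

print(f"nodes implied by CPU core count: {nodes:,}")
print(f"GPUs per node: {gpus_per_node}")
print(f"'GPU cores' counted per GPU: {cores_per_gpu}")
print(f"all totals divide evenly: {cpu_remainder == gpu_remainder == core_remainder == 0}")
```

If the one-CPU-per-node assumption holds, the quoted totals are mutually consistent: the implied node count sits just above the "more than 9,400" figure, and each node carries a whole number of GPUs.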

Although maintaining a system like this sounds onerous, that’s Whitt’s job, and he says it’s been an interesting journey so far. He described the issues the team is facing in an interview with InsideHPC: “We are working through issues in hardware and making sure that we understand (what they are) because you’re going to have failures at this scale.” He summarized the situation by saying, “Mean time between failure on a system this size is hours, it’s not days.”

Whitt declined to go into detail about which hardware was experiencing issues. InsideHPC reports that earlier problems involved the HPE Slingshot fabric, the interconnect of 64-port, high-speed switches that links the blades within the system. Those issues were apparently resolved, as the system was able to run the High-Performance Linpack benchmark. Currently, some of the problems are related to the AMD Instinct GPU accelerators. “The issues span lots of different categories, the GPUs are just one,” said Whitt. He said the trouble is pretty evenly spread out among Frontier’s various hardware. According to the report, the issues arise when the computer is executing extremely demanding workloads. Whitt says running a benchmark is a different can of worms compared with running scientific applications.

Regardless, the system isn’t expected to be fully operational for scientific tasks until January 2023, so Whitt and his team still have some time to troubleshoot. Even so, the team has its work cut out for it. Whitt said going a single day without a failure “would be outstanding.” He then said the goal for uptime “is still hours,” which is apparently longer than the current interval between failures. Though that sounds like a difficult situation, Frontier has 60 million parts, so it’s not surprising there are some “hiccups,” according to Whitt. Despite these issues and COVID-19-related supply chain delays, Whitt says the lab is still on track for the rollout date, when Frontier will begin its actual job of running user programs and not just benchmarks.
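To get a feel for why a failure-free day "would be outstanding," here is a back-of-the-envelope sketch. It rests on a deliberately crude textbook assumption that is ours, not Whitt's: every part fails independently at the same constant rate, and any single part failure counts as a system failure, so system MTBF equals part MTBF divided by the part count. The 60 million figure comes from the article; the system MTBF targets of 4 and 24 hours are hypothetical stand-ins for "hours" and "a single day."

```python
HOURS_PER_YEAR = 24 * 365
PART_COUNT = 60_000_000  # part count quoted in the article


def required_part_mtbf_hours(system_mtbf_hours: float, part_count: int) -> float:
    """Per-part MTBF needed to reach a given system MTBF.

    Simplified series model (our assumption): identical parts, independent
    constant failure rates, and any part failure brings the system down,
    so system_mtbf = part_mtbf / part_count.
    """
    return system_mtbf_hours * part_count


# Hypothetical targets: "hours" (4 h) versus a full failure-free day (24 h).
for system_mtbf in (4.0, 24.0):
    part_mtbf = required_part_mtbf_hours(system_mtbf, PART_COUNT)
    years = part_mtbf / HOURS_PER_YEAR
    print(f"system MTBF {system_mtbf:>4.0f} h -> each part must average "
          f"{years:,.0f} years between failures")
```

Real machines are not this fragile, since many parts are passive and some failures are tolerated without taking the system down. Even so, the model shows how the sheer part count turns individually rare faults into routine events at the system level.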
