Just past 4 PM on August 14, 2003, a widespread power outage rippled across the Northeastern and Midwestern United States as well as throughout Ontario. By the time the events had run their course, more than 55 million people across two countries had lost power. In some cases, service took weeks to restore. Initially, it wasn’t clear what had gone wrong. Forensic analysts spent weeks poring over logs from multiple power plants across the region to see how the failure had developed and metastasized.
What makes the entire affair remarkable is that it happened due to a simple software bug. Power distribution grids must maintain a very close alignment between the load on the network, its supply, and its ability to transmit power from Point A to B. At 12 PM, incorrect telemetry data took a power flow monitoring tool offline at the Midwest Independent Transmission System Operator. The telemetry problem was corrected, but the monitor wasn’t restarted. Meanwhile, at FirstEnergy, a race condition in the company’s software kept alarms from sounding and prevented the power monitoring team from seeing any data that would have shown them that the power distribution grid had begun to fail.
The main problem kicked off when high-voltage transmission lines began to sag under heavy load and made contact with trees. As power lines heat up, the metal inside them expands, causing them to sag. In this case, the lines hit tree branches that hadn’t been properly trimmed. Contact like this causes a transient flashover event as current carried by the line is transferred to other objects. The grid is designed to deal with this kind of problem, but at the time, it dealt with it by disconnecting the lines in question, which increased the load on other high-voltage lines in the same area. And as the current flowing through those lines started to increase, the same problem happened again. And again. And again. Other plant operators began to notice the problem, but they focused on fixing issues in their local network rather than informing other plant operators.
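To make the feedback loop concrete, here is a minimal toy sketch of a cascading overload. It is not a model of the real grid: the line names, loads, and capacities are made up, and the rule that a tripped line's load is split evenly among the surviving lines is a simplifying assumption (real power flow redistributes according to the physics of the network). The function name `simulate_cascade` is hypothetical as well.

```python
from collections import deque


def simulate_cascade(loads, capacities, first_failure):
    """Toy cascade model (illustrative assumption, not real power-flow physics).

    When a line trips, its load is split evenly across the surviving lines.
    Any surviving line pushed above its capacity is queued to trip next.
    Returns the lines in the order they tripped and the surviving loads.
    """
    loads = dict(loads)  # work on a copy so the caller's data is untouched
    tripped = []
    pending = deque([first_failure])

    while pending:
        line = pending.popleft()
        if line not in loads:
            continue  # already tripped
        shed = loads.pop(line)
        tripped.append(line)
        if not loads:
            break  # nothing left to carry the load: total collapse

        # Assumption: shed load is shared equally by the remaining lines.
        share = shed / len(loads)
        for name in loads:
            loads[name] += share

        # Queue every line that is now overloaded.
        for name in loads:
            if loads[name] > capacities[name] and name not in pending:
                pending.append(name)

    return tripped, loads


# Hypothetical numbers: four lines, each rated for 100 units.
capacities = {"A": 100, "B": 100, "C": 100, "D": 100}

# Lightly loaded network: losing line A is absorbed by the others.
light = {"A": 80, "B": 70, "C": 60, "D": 50}
print(simulate_cascade(light, capacities, "A")[0])  # ['A']

# Heavily loaded network: losing line A takes every other line with it.
heavy = {"A": 90, "B": 85, "C": 80, "D": 60}
print(simulate_cascade(heavy, capacities, "A")[0])  # ['A', 'B', 'C', 'D']
```

The two runs differ only in how close the lines were to their limits before the first fault. With headroom, a single failure stays local; without it, each disconnection pushes more load onto fewer lines, which is the pattern described above.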
The first line in Northeast Ohio failed at 2:02 PM. The problem built from there, until fifteen 138 kV lines failed nearly at once at 3:41 PM in northern Ohio. A post-mortem report on the blackout describes what came next:
At approximately 16:10, something remarkable happened: the power began to flow backwards through the system. Power began to flow in a counterclockwise fashion, from Pennsylvania, through New York, into Ontario, and then into Michigan and Ohio. Typically, the 3700 MW of load served by this rever[s]ed power flow was not met by Ontario, but Michigan and Ohio had lost all of their other major tie lines.
Wikipedia’s timeline of the event offers additional color. As failures cascaded across the system, power networks began taking automatic actions designed to preserve local stability. In this case, however, those actions — like cutting lines and diverting power to still-functioning ties — only served to heighten the failure as surges hit some areas and under-voltage conditions struck others.
As lines failed, some areas became isolated. This sudden mismatch between power demand and supply caused power plants to automatically drop offline to prevent damage. Once 4:10 PM hit, the failures accelerated. I know that the below is a giant wall of text, but give it a skim to see how the collapse built.
The Atlantic has put together a photo essay of the event and how it impacted New Yorkers, including those stuck in the city’s transit systems. The blackout was a telling example of how vulnerable modern infrastructure can be to disruption and a reminder that our sophisticated failure prevention systems are only as effective as the scenarios they contemplate. Long term, the blackout led to the designation of the North American Electric Reliability Council (NERC) as the Electric Reliability Organization (ERO), whose job was to develop and enforce mandatory reliability standards for the national power grid. Previously, compliance with NERC policies on grid reliability and stability was strongly encouraged, but voluntary.
Feature image by Wikimedia