Analyzing Power Outages
Created by Rachel Boeke (boeker@umich.edu) for EECS 398 at the University of Michigan.
Introduction
How can we predict a power outage and its severity?
Using power outage data provided by researchers at Purdue University, I wanted to answer the question:
- What are the factors that indicate a severe power outage may occur?
- In other words, what risk factors may an energy company want to look into when predicting the location and severity of its next major power outage?
About the Dataset:
- Number of observations/rows: 1534
- Number of columns: 55
Some of the relevant factors that may affect the severity of an outage include:
- Location:
- U.S._STATE: The U.S. state where the outage occurred
- POSTAL.CODE: The two-letter postal abbreviation for the U.S. state where the outage occurred
- NERC.REGION: The North American Electric Reliability Corporation (NERC) regions involved in the outage event
- Time:
- OUTAGE.START.DATE: This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region)
- OUTAGE.START.TIME: This variable indicates the time of the day when the outage event started (as reported by the corresponding Utility in the region)
- Climate:
- CLIMATE.REGION: U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)
- ANOMALY.LEVEL: This represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W)
- CLIMATE.CATEGORY: This represents the climate episodes corresponding to the years. The categories—“Warm”, “Cold” or “Normal” episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
- Land-use characteristics
- PCT_LAND: Percentage of land area in the U.S. state as compared to the overall land area in the continental U.S. (in %)
- AREAPCT_URBAN: Percentage of the land area of the U.S. state represented by the land area of the urban areas (in %)
- PCT_WATER_TOT: Percentage of water area in the U.S. state as compared to the overall water area in the continental U.S. (in %)
- Population:
- POPULATION: Population in the U.S. state in a year
- TOTAL.CUSTOMERS: Annual number of total customers served in the U.S. state
Relevant columns that characterize the severity of an outage include:
- OUTAGE.DURATION: Duration of outage events (in minutes)
- DEMAND.LOSS.MW: Amount of peak demand lost during an outage event (in Megawatts) [but in many cases, total demand is reported]
- CUSTOMERS.AFFECTED: Number of customers affected by the power outage event
Additional fields and field descriptions are provided here.
Data Cleaning and Exploratory Data Analysis
Data Cleaning
To clean the data, I completed the following steps:
- Combined OUTAGE.START.DATE and OUTAGE.START.TIME into one pd.Timestamp column, OUTAGE.START
- Combined OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into one pd.Timestamp column, OUTAGE.RESTORATION
- Converted numeric fields from object to float data types
- Replaced missing “NA” values with np.nan
- Retained only relevant columns
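A minimal sketch of these cleaning steps in pandas; the raw values below are illustrative stand-ins for the actual file, and the full project retains more columns than shown here:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw Purdue outage data
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "NA"],
    "OUTAGE.START.TIME": ["17:00:00", "NA"],
    "OUTAGE.DURATION": ["3060", "NA"],
})

# Replace the "NA" sentinel strings with proper missing values
df = raw.replace("NA", np.nan)

# Combine date and time into one pd.Timestamp column
df["OUTAGE.START"] = pd.to_datetime(
    df["OUTAGE.START.DATE"] + " " + df["OUTAGE.START.TIME"]
)

# Convert numeric fields from object to float
df["OUTAGE.DURATION"] = df["OUTAGE.DURATION"].astype(float)

# Retain only relevant columns
df = df[["OUTAGE.START", "OUTAGE.DURATION"]]
```

The same pattern applies to OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME.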
The first five rows of the cleaned data are shown below:
U.S._STATE | POSTAL.CODE | NERC.REGION | MONTH | OUTAGE.START | CLIMATE.REGION | ANOMALY.LEVEL | CLIMATE.CATEGORY | PCT_LAND | AREAPCT_URBAN | PCT_WATER_TOT | RES.PRICE | COM.PRICE | IND.PRICE | TOTAL.PRICE | RES.SALES | COM.SALES | IND.SALES | TOTAL.SALES | OUTAGE.DURATION | DEMAND.LOSS.MW | CUSTOMERS.AFFECTED | POPULATION | TOTAL.CUSTOMERS | YEAR | CAUSE.CATEGORY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Minnesota | MN | MRO | 7 | 2011-07-01 17:00:00 | East North Central | -0.3 | normal | 91.5927 | 2.14 | 8.40733 | 11.6 | 9.18 | 6.81 | 9.28 | 2.33292e+06 | 2.11477e+06 | 2.11329e+06 | 6.56252e+06 | 3060 | nan | 70000 | 5.34812e+06 | 2.5957e+06 | 2011 | severe weather |
Minnesota | MN | MRO | 5 | 2014-05-11 18:38:00 | East North Central | -0.1 | normal | 91.5927 | 2.14 | 8.40733 | 12.12 | 9.71 | 6.49 | 9.28 | 1.58699e+06 | 1.80776e+06 | 1.88793e+06 | 5.28423e+06 | 1 | nan | nan | 5.45712e+06 | 2.64074e+06 | 2014 | intentional attack |
Minnesota | MN | MRO | 10 | 2010-10-26 20:00:00 | East North Central | -1.5 | cold | 91.5927 | 2.14 | 8.40733 | 10.87 | 8.19 | 6.07 | 8.15 | 1.46729e+06 | 1.80168e+06 | 1.9513e+06 | 5.22212e+06 | 3000 | nan | 70000 | 5.3109e+06 | 2.5869e+06 | 2010 | severe weather |
Minnesota | MN | MRO | 6 | 2012-06-19 04:30:00 | East North Central | -0.1 | normal | 91.5927 | 2.14 | 8.40733 | 11.79 | 9.25 | 6.71 | 9.19 | 1.85152e+06 | 1.94117e+06 | 1.99303e+06 | 5.78706e+06 | 2550 | nan | 68200 | 5.38044e+06 | 2.60681e+06 | 2012 | severe weather |
Minnesota | MN | MRO | 7 | 2015-07-18 02:00:00 | East North Central | 1.2 | warm | 91.5927 | 2.14 | 8.40733 | 13.07 | 10.16 | 7.74 | 10.43 | 2.02888e+06 | 2.16161e+06 | 1.77794e+06 | 5.97034e+06 | 1740 | 250 | 250000 | 5.48959e+06 | 2.67353e+06 | 2015 | severe weather |
Univariate Analysis
I looked at the distribution of relevant variables. To start, I examined the severity metrics DEMAND.LOSS.MW, OUTAGE.DURATION, and CUSTOMERS.AFFECTED to see what might be considered a “severe” outage:
Statistic | Customers Affected (Thousands) | Outage Duration (Hours) | Demand Loss (MW) |
---|---|---|---|
count | 1091 | 1476 | 829 |
mean | 143.456 | 43.7566 | 536.287 |
std | 286.986 | 99.0414 | 2196.45 |
min | 0 | 0 | 0 |
25% | 9.65 | 1.70417 | 3 |
50% | 70.135 | 11.6833 | 168 |
75% | 150 | 48 | 400 |
max | 3241.44 | 1810.88 | 41788 |
Based on the distribution of these severity measures, an outage may be considered “severe” if it affects more than 150,000 customers, lasts longer than 48 hours, or causes more than a 400 MW loss in demand (roughly the 75th percentile of each metric).
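One way to operationalize this definition, a sketch using the 75th-percentile cutoffs above (note that OUTAGE.DURATION is stored in minutes, so 48 hours is 2,880 minutes):

```python
def is_severe(row):
    """Flag an outage as 'severe' if it exceeds any 75th-percentile cutoff:
    more than 150k customers affected, longer than 48 hours (2880 minutes),
    or more than 400 MW of demand loss."""
    return (
        row["CUSTOMERS.AFFECTED"] > 150_000
        or row["OUTAGE.DURATION"] > 48 * 60
        or row["DEMAND.LOSS.MW"] > 400
    )
```

This could be applied row-wise to the cleaned DataFrame to label severe outages.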
Next, I looked at outages by state:
Outages seem to most commonly occur in California, Texas, Washington, Michigan, New York, Maryland, Pennsylvania, Illinois, Florida, and Ohio.
Bivariate Analysis
I moved on to bivariate analysis next. First, I examined the distribution of climate anomaly levels in each region during reported outages:
Across all regions, it seems that outages tend to occur when the climate is colder than normal (i.e., ANOMALY.LEVEL < 0).
Second, I looked at mean climate anomaly levels across states during outages:
Mean anomaly level appears to vary more across states than across regions; when predicting the possibility of an outage, it may be better to look at anomaly levels at the state level rather than the regional level.
Interesting Aggregates
I was interested in looking at outage severity by state, so I created the pivot table below showing the mean DEMAND.LOSS.MW, OUTAGE.DURATION, and PERCENT.POP.IMPACTED (CUSTOMERS.AFFECTED / POPULATION * 100) by state:
U.S._STATE | DEMAND.LOSS.MW | OUTAGE.DURATION | PERCENT.POP.IMPACTED |
---|---|---|---|
District of Columbia | 1280 | 2755.33 | 27.019 |
West Virginia | 700 | 9576 | 14.2758 |
Hawaii | 536 | 845.4 | 11.119 |
New Mexico | 346.667 | 158.333 | 9.15146 |
Oklahoma | 197.143 | 3095.86 | 8.33663 |
South Carolina | 1699.71 | 3237.86 | 6.05849 |
North Dakota | 155 | 720 | 5.0341 |
Nebraska | 492.667 | 3221.33 | 4.91067 |
Iowa | 337.5 | 3055.5 | 3.13314 |
Kansas | 175 | 7296.5 | 2.77462 |
The top 10 states with the most severe outages on average are listed above.
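The aggregation behind this table can be sketched as follows; the rows here are a hypothetical subset of the cleaned data, not the real values:

```python
import pandas as pd

# Hypothetical subset of the cleaned outage data
df = pd.DataFrame({
    "U.S._STATE": ["Hawaii", "Hawaii", "Kansas"],
    "DEMAND.LOSS.MW": [536.0, 536.0, 175.0],
    "OUTAGE.DURATION": [840.0, 850.8, 7296.5],
    "CUSTOMERS.AFFECTED": [150000, 160000, 40000],
    "POPULATION": [1400000, 1400000, 2900000],
})

# Derived severity measure: share of state population affected
df["PERCENT.POP.IMPACTED"] = df["CUSTOMERS.AFFECTED"] / df["POPULATION"] * 100

# Mean severity metrics per state, sorted by population impact
severity_by_state = (
    df.pivot_table(
        index="U.S._STATE",
        values=["DEMAND.LOSS.MW", "OUTAGE.DURATION", "PERCENT.POP.IMPACTED"],
        aggfunc="mean",
    )
    .sort_values("PERCENT.POP.IMPACTED", ascending=False)
)
```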
Though my main focus was not on causes of outages, I was still curious about this part of the data and decided to examine the causes of outages over the years:
Imputation
I decided not to impute any missing values, mainly because most columns had little missing data, so the missing values had limited impact on the analysis. The two columns with the most missing data, DEMAND.LOSS.MW and CUSTOMERS.AFFECTED, are both measures of outage severity.
A summary of missing values across all columns of the dataset is below:
Column | Percent Data Missing |
---|---|
U.S._STATE | 0.0 % |
POSTAL.CODE | 0.0 % |
NERC.REGION | 0.0 % |
MONTH | 0.59 % |
OUTAGE.START | 0.59 % |
CLIMATE.REGION | 0.39 % |
ANOMALY.LEVEL | 0.59 % |
CLIMATE.CATEGORY | 0.59 % |
PCT_LAND | 0.0 % |
AREAPCT_URBAN | 0.0 % |
PCT_WATER_TOT | 0.0 % |
RES.PRICE | 1.43 % |
COM.PRICE | 1.43 % |
IND.PRICE | 1.43 % |
TOTAL.PRICE | 1.43 % |
RES.SALES | 1.43 % |
COM.SALES | 1.43 % |
IND.SALES | 1.43 % |
TOTAL.SALES | 1.43 % |
OUTAGE.DURATION | 3.78 % |
DEMAND.LOSS.MW | 45.96 % |
CUSTOMERS.AFFECTED | 28.88 % |
POPULATION | 0.0 % |
TOTAL.CUSTOMERS | 0.0 % |
YEAR | 0.0 % |
CAUSE.CATEGORY | 0.0 % |
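The percentages above can be computed directly from the cleaned DataFrame; here is a sketch with a toy frame (the real data has 1,534 rows):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned outage data
df = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Texas", "Hawaii", "Kansas"],
    "DEMAND.LOSS.MW": [np.nan, 250.0, np.nan, 175.0],
})

# Percent of values missing per column, rounded to two decimal places
pct_missing = (df.isna().mean() * 100).round(2)
```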
Framing a Prediction Problem
I decided to build a model that predicts the severity of an outage in terms of demand loss (MW). I chose this metric because demand loss captures both customers affected and outage duration. This is a regression problem because DEMAND.LOSS.MW is a continuous, numeric variable.
At the time of prediction, some factors available might be:
- U.S._STATE
- POPULATION
- ANOMALY.LEVEL
- PCT_LAND
- AREAPCT_URBAN
- PCT_WATER_TOT
These are all known characteristics of a state or region that can be used to predict outage severity in the future.
I used mean squared error (MSE) to evaluate the model; MSE is a standard metric for regression models and penalizes large errors more heavily than small ones.
Baseline Model
Anomaly level, percent urban area, percent water total, percent land, and U.S. state seemed to be possible factors having some relationship with demand loss. I started by using these variables (ANOMALY.LEVEL, AREAPCT_URBAN, PCT_WATER_TOT, PCT_LAND, and U.S._STATE) as features.
ANOMALY.LEVEL, AREAPCT_URBAN, PCT_WATER_TOT, and PCT_LAND are all quantitative features. U.S._STATE is a nominal feature and required One Hot Encoding.
The initial model I used was sklearn’s basic LinearRegression model.
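A sketch of this baseline pipeline; the scikit-learn classes are the ones named above, while the toy data and exact preprocessing wiring are assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

quantitative = ["ANOMALY.LEVEL", "AREAPCT_URBAN", "PCT_WATER_TOT", "PCT_LAND"]

# One-hot encode the nominal state column; pass numeric features through
preprocess = ColumnTransformer([
    ("state", OneHotEncoder(handle_unknown="ignore"), ["U.S._STATE"]),
    ("num", "passthrough", quantitative),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Toy data standing in for the cleaned outage frame
X = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Texas", "Minnesota", "Texas"],
    "ANOMALY.LEVEL": [-0.3, 0.5, -1.5, 1.2],
    "AREAPCT_URBAN": [2.14, 3.5, 2.14, 3.5],
    "PCT_WATER_TOT": [8.4, 1.2, 8.4, 1.2],
    "PCT_LAND": [91.6, 98.8, 91.6, 98.8],
})
y = [0.0, 250.0, 50.0, 300.0]
baseline.fit(X, y)
```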
The baseline model’s MSE was 884,973. I also evaluated the model by looking at the difference between actual and predicted DEMAND.LOSS.MW:
Statistic | Difference Between Actual and Predicted Demand Loss (MW) |
---|---|
count | 171 |
mean | 8.48 |
std | 13.23 |
min | 0.04 |
25% | 2.2 |
50% | 4.95 |
75% | 9.8 |
max | 129.91 |
Most predictions fall within about 10 MW of the actual demand loss, with 75% of the differences below 9.8 MW. In terms of actual power, 10 MW is substantial, so this is not a great model.
Final Model
I used the following plots to determine which features might be valuable to add to the model:
Several of these plots looked like they might fit a normal distribution; PCT_WATER_TOT looked like it might fit 1/x, and PCT_LAND looked polynomial.
Following this analysis, I added a QuantileTransformer to the quantitative columns, a 1/x FunctionTransformer to PCT_WATER_TOT, and a PolynomialFeatures transformer to PCT_LAND. I kept the OneHotEncoder used for U.S._STATE.
I used GridSearchCV to tune two hyperparameters, the polynomial degree and the number of quantiles, which was more efficient than searching for the optimal values manually.
Finally, I tested four different sklearn regression models: LinearRegression, Ridge, Lasso, and ElasticNet.
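A sketch of the final pipeline and grid search, shown here with ElasticNet; the transformer and model names are scikit-learn's, but the grid values, the cross-validation setting, and the exact column groupings are illustrative assumptions rather than the project's actual configuration:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    PolynomialFeatures,
    QuantileTransformer,
)

preprocess = ColumnTransformer([
    # One-hot encode the nominal state column
    ("state", OneHotEncoder(handle_unknown="ignore"), ["U.S._STATE"]),
    # 1/x transform for PCT_WATER_TOT
    ("water", FunctionTransformer(lambda x: 1.0 / x), ["PCT_WATER_TOT"]),
    # Polynomial features for PCT_LAND; degree is tuned by the search
    ("land", PolynomialFeatures(include_bias=False), ["PCT_LAND"]),
    # Quantile transform for the remaining quantitative columns
    ("quant", QuantileTransformer(), ["ANOMALY.LEVEL", "AREAPCT_URBAN"]),
])

model = Pipeline([("preprocess", preprocess), ("regressor", ElasticNet())])

# Illustrative grid; the project's actual grid may have differed
search = GridSearchCV(
    model,
    param_grid={
        "preprocess__land__degree": [1, 2, 3, 4],
        "preprocess__quant__n_quantiles": [4, 7, 8],
    },
    scoring="neg_mean_squared_error",
    cv=3,
)
```

Swapping ElasticNet for LinearRegression, Ridge, or Lasso in the `regressor` step reproduces the other three candidates.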
The results were:
Model | MSE | Optimal Degree | Optimal Number of Quantiles |
---|---|---|---|
LinearRegression | 767545 | 4 | 1 |
Ridge | 767545 | 4 | 7 |
Lasso | 946649 | 1 | 4 |
ElasticNet | 758063 | 1 | 8 |
The ElasticNet model performed the best, with an MSE of 758,063 at a polynomial degree of 1 and 8 quantiles. All models except Lasso improved on the baseline, but there is still room for improvement.