This analysis uses data from the EPA National Emissions Inventory, with pollutants, emissions, sources, year. The data was curated and provided as part of an Exploraroty Data Analysis course taught by Roger Peng. The data includs PM2.5 type emissions data, (air pollutants that are less than 2.5 microns) from 1998 to 2011. A second database was included to provide more detailed information for each emission measurement included in the inventory. The initial goals for the analysis were to create simple exploratory graphs..
The data provided was currated and quite clean in an rds format.
root <- "~/Documents/Coursera/jhsph/exploratory/data"
NEI <- readRDS(paste0(root,"/summarySCC_PM25.rds"))
SCC <- readRDS(paste0(root,"/Source_Classification_Code.rds"))
The SCC database contains metadata for the types of observations, and can be joined on the SCC key. I merge the Sector and Data Categories from the SCC database into the NEI database. This is a fairly time consuming operation.
# Merge in the Sector and Data Category variables
NEI<-merge(NEI,SCC[,c("SCC","EI.Sector","Data.Category")],by="SCC")
A summary of the NEI data:
summary(NEI)
## SCC fips Pollutant
## Length:6497651 Length:6497651 Length:6497651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Emissions type year
## Min. : 0.0 Length:6497651 Min. :1999
## 1st Qu.: 0.0 Class :character 1st Qu.:2002
## Median : 0.0 Mode :character Median :2005
## Mean : 3.4 Mean :2004
## 3rd Qu.: 0.1 3rd Qu.:2008
## Max. :646952.0 Max. :2008
##
## EI.Sector
## Mobile - On-Road Diesel Heavy Duty Vehicles :1309512
## Mobile - Non-Road Equipment - Gasoline :1113076
## Mobile - On-Road Gasoline Light Duty Vehicles:1109010
## Mobile - Non-Road Equipment - Diesel : 713697
## Mobile - On-Road Diesel Light Duty Vehicles : 493285
## Mobile - Non-Road Equipment - Other : 492000
## (Other) :1267071
## Data.Category
## Biogenic: 0
## Event : 5743
## Nonpoint: 484450
## Nonroad :2293459
## Onroad :3183602
## Point : 530397
##
This National Emissions Inventory data set is just PM 2.5 Emissions measurements from disparate sources between 1999 and 2008. Note that there are no negative Emissions values, and no NA’s. It’s pretty clean data. The number of observations made each year seems to be increasing. If we segment the histogram on the Data Category, we see the increase in observations for the different types. The Point type observations increased roughly 5-fold in 2008.
library(ggplot2)
qplot(factor(year), fill=Data.Category, data=NEI, ylab="Number of Observations",xlab="Year",main="NEI PM2.5 Observations")
From the summary notice that the maximum emission measurement is ~6 orders of magnitude larger than the mean and median. Apparently the data is very right skewed, probaboy a power law distribution. Is this reasonable? First, lets just count frequency of observations in various segments, for measurements from 0 on up.
sum(NEI$Emissions<=1)
## [1] 5952119
sum(NEI$Emissions>1 & NEI$Emissions<=10)
## [1] 375313
sum(NEI$Emissions>10 & NEI$Emissions<=100)
## [1] 134493
sum(NEI$Emissions>100 & NEI$Emissions<=1000)
## [1] 33055
sum(NEI$Emissions>1000)
## [1] 2671
We can also look at a density estimation for each of these segments. The distribution for each of these segments takes the form of a power law distribution with a long tail.
plot(density(NEI[NEI$Emissions<=1,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 0 to 1 tons")
plot(density(NEI[NEI$Emissions<=10 & NEI$Emissions>1,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 1 to 10 tons")
plot(density(NEI[NEI$Emissions<=100 & NEI$Emissions>10,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 10 to 100 tons")
plot(density(NEI[NEI$Emissions<=1000 & NEI$Emissions>100,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 100 to 1000 tons")
plot(density(NEI[NEI$Emissions<=10000 & NEI$Emissions>1000,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 1000 to 10000 tons")
plot(density(NEI[NEI$Emissions>10000 ,"Emissions"]), xlab="PM2.5 Emissions Observations (tons)",main="Density Estimation for PM2.5: 10000 tons and higher")
What is giving the largest emissions observations? Let’s try to find out by subsetting, sorting and then querying the SCC database.
NEIbig<-NEI[NEI$Emissions>10000,]
head(NEIbig[order(NEIbig$Emissions, decreasing=T),c("Emissions","SCC")],10)
## Emissions SCC
## 6176734 646951.97 2810001000
## 6174112 112619.81 2810001000
## 6176473 77387.86 2810001000
## 6175741 66696.32 2810001000
## 6197154 58896.10 2810090000
## 6171481 34515.36 2810001000
## 6174905 29487.50 2810001000
## 6174381 27585.05 2810001000
## 6173402 24287.35 2810001000
## 5873474 20799.70 2280003200
Notice that only a few SCC’s are responsible for the largest observations, and they are:
sccs<-unique(NEIbig[order(NEIbig$Emissions, decreasing=T),"SCC"][1:20])
SCC[SCC$SCC %in% sccs, c("SCC","Data.Category","EI.Sector")]
## SCC Data.Category EI.Sector
## 4 10100202 Point Fuel Comb - Electric Generation - Coal
## 2238 2280003200 Nonpoint Mobile - Commercial Marine Vessels
## 4495 2810001000 Event Fires - Wildfires
## 4528 2810090000 Nonpoint Miscellaneous Non-Industrial NEC
The large emitters are coming from specific and sensbible categories. The four largest PM2.5 observations are from wildfire events. Interesting.
The NEI data also has 59 unique sector categories which can also be used to segment the data.
unique(SCC$EI.Sector)
## [1] Fuel Comb - Electric Generation - Coal
## [2] Fuel Comb - Electric Generation - Oil
## [3] Fuel Comb - Electric Generation - Natural Gas
## [4] Fuel Comb - Electric Generation - Other
## [5] Fuel Comb - Electric Generation - Biomass
## [6] Fuel Comb - Industrial Boilers, ICEs - Coal
## [7] Fuel Comb - Industrial Boilers, ICEs - Oil
## [8] Fuel Comb - Industrial Boilers, ICEs - Natural Gas
## [9] Fuel Comb - Industrial Boilers, ICEs - Other
## [10] Fuel Comb - Industrial Boilers, ICEs - Biomass
## [11] Fuel Comb - Comm/Institutional - Coal
## [12] Fuel Comb - Comm/Institutional - Oil
## [13] Fuel Comb - Comm/Institutional - Natural Gas
## [14] Fuel Comb - Comm/Institutional - Other
## [15] Fuel Comb - Comm/Institutional - Biomass
## [16] Industrial Processes - NEC
## [17] Fuel Comb - Residential - Other
## [18] Fuel Comb - Residential - Oil
## [19] Fuel Comb - Residential - Natural Gas
## [20] Fuel Comb - Residential - Wood
## [21] Mobile - On-Road Gasoline Light Duty Vehicles
## [22] Mobile - On-Road Gasoline Heavy Duty Vehicles
## [23] Mobile - On-Road Diesel Light Duty Vehicles
## [24] Mobile - On-Road Diesel Heavy Duty Vehicles
## [25] Mobile - Non-Road Equipment - Gasoline
## [26] Mobile - Non-Road Equipment - Other
## [27] Mobile - Non-Road Equipment - Diesel
## [28] Mobile - Aircraft
## [29] Mobile - Commercial Marine Vessels
## [30] Mobile - Locomotives
## [31] Dust - Paved Road Dust
## [32] Dust - Unpaved Road Dust
## [33] Industrial Processes - Chemical Manuf
## [34] Commercial Cooking
## [35] Industrial Processes - Non-ferrous Metals
## [36] Industrial Processes - Ferrous Metals
## [37] Industrial Processes - Petroleum Refineries
## [38] Industrial Processes - Oil & Gas Production
## [39] Dust - Construction Dust
## [40] Industrial Processes - Mining
## [41] Solvent - Non-Industrial Surface Coating
## [42] Solvent - Industrial Surface Coating & Solvent Use
## [43] Solvent - Degreasing
## [44] Solvent - Dry Cleaning
## [45] Solvent - Graphic Arts
## [46] Solvent - Consumer & Commercial Solvent Use
## [47] Industrial Processes - Storage and Transfer
## [48] Miscellaneous Non-Industrial NEC
## [49] Bulk Gasoline Terminals
## [50] Gas Stations
## [51] Waste Disposal
## [52] Agriculture - Livestock Waste
## [53] Agriculture - Crops & Livestock Dust
## [54] Fires - Agricultural Field Burning
## [55] Agriculture - Fertilizer Application
## [56] Fires - Wildfires
## [57] Fires - Prescribed Fires
## [58] Industrial Processes - Cement Manuf
## [59] Industrial Processes - Pulp & Paper
## 59 Levels: Agriculture - Crops & Livestock Dust ...
For instance, we can drill down to the sector level and investigate the increases in Point Type measurements in 2008. Segmenting on year and sector and plotting a histogram vs. Sector. Unfortunately I’m not certain what NEC is in the Industrial Process Sector. My best guess is that it’s the National Electric Code pertaining to installation of electric wiring.
library(plyr)
NEIp<-NEI[NEI$Data.Category=="Point", ] # Subset by Point
# Here I remove the less used Sectors for clarity when plotting
SF<-count(NEIp,"EI.Sector")
Sectors<-SF[SF$freq>2000,"EI.Sector"]
NEIp<-droplevels(NEIp[NEIp$EI.Sector %in% Sectors,])
q<-qplot(factor(EI.Sector),data=NEIp,fill=factor(year),
ylab="Frequency",main="Point Type Emissions Measurements by Sector")
q+theme(axis.title.x=element_blank(),axis.text.x=element_text(angle=60, hjust=1))
Let’s ask and try to answer some fairly basic questions about this data. First, are there any obvious trends in the total US emissions per year? We could break this down by category or sector. There does appear to be an overal downard trend of the total over the four aggregated time slices. (Note: I’m just going to sum the data. I will not try to figure out if the sum is a representative statistic.) There also are differences from Event type emissions
Now lets look for any trends in the aggregated Emissions data.
NEItotal<-aggregate(Emissions ~ year + Data.Category, NEI, sum)
qplot(factor(year), Emissions, fill=Data.Category, data=NEItotal, geom="bar",
stat="identity", xlab="PM2.5 Emissions (in tons)", ylab="Year", main="Total PM2.5 Emissions")
It turns out that all of the event type emissions are coming for wildfires. And, oddly enough, there are no wildfire events in 2008. Clearly, even this “clean data” has some serious issues. Some googling will show that the EPA Emissions Inventory did cover wildfires in 2008.
NEIe<-droplevels(NEI[NEI$Data.Category=="Event",])
unique(NEIe$year)
## [1] 2005 1999 2002
str(NEIe)
## 'data.frame': 5743 obs. of 8 variables:
## $ SCC : chr "2810001000" "2810001000" "2810001000" "2810001000" ...
## $ fips : chr "32001" "48481" "31081" "36019" ...
## $ Pollutant : chr "PM25-PRI" "PM25-PRI" "PM25-PRI" "PM25-PRI" ...
## $ Emissions : num 32.73 2.08 1.37 74.38 225.41 ...
## $ type : chr "NONPOINT" "NONPOINT" "NONPOINT" "NONPOINT" ...
## $ year : int 2005 1999 1999 1999 1999 1999 2002 1999 1999 2002 ...
## $ EI.Sector : Factor w/ 1 level "Fires - Wildfires": 1 1 1 1 1 1 1 1 1 1 ...
## $ Data.Category: Factor w/ 1 level "Event": 1 1 1 1 1 1 1 1 1 1 ...
Next we drill into emissions trends for areas which can be selected on fips. E.g for Baltimore with fips 24510, and Seattle with fips 53033. There do appear to be qualitative differences in the PM2.5 emissions trends from these two areas.
NEIcities<-aggregate(Emissions ~ year + Data.Category + fips,
NEI[NEI$fips=='24510' | NEI$fips=='53033',], sum)
qplot(factor(year), Emissions, data=NEIcities, facets=.~fips, fill=Data.Category, geom="bar", stat="Identity", ylab="Emissions (in tons)", xlab="Year",
main="Baltimore (24510) and Seattle (53033) PM2.5 Emissions")
What’s going on with the Nonpoint emissions sources in Seattle?
Drilling down into the Nonpoint sources for Seattle and segmenting by Sector, it appears that wood burning stoves are a primary source for increases in the 2008 emissions.
NEIsea<-aggregate(Emissions ~ year + EI.Sector,
NEI[NEI$fips=='53033' & NEI$Data.Category=='Nonpoint',], sum)
q<-qplot(factor(EI.Sector),Emissions, data=NEIsea[NEIsea$Emissions>50,],
geom="bar",stat="Identity", facets=year~.,
fill=EI.Sector, ylab="Emissions (in tons)",
main="Seattle Nonpoint PM2.5 Emissions")
q+theme(axis.title.x=element_blank(),axis.text.x=element_text(angle=60, hjust=1))
What’s going on with Point sources for Baltimore? Drilling down into the Point sources for Baltimore and segmenting on Sector, it seems the NEC Industrial Process generally dominates the Point Emissions. That and Natural Gas Industrial Boilers apparentally are the cause for the Point Emissions increases in 2005.
NEIbm<-aggregate(Emissions ~ year + EI.Sector,
NEI[NEI$fips=='24510' & NEI$Data.Category=='Point',], sum)
q<-qplot(factor(EI.Sector),Emissions, data=NEIbm[NEIbm$Emissions>2,],
geom="bar",stat="Identity", facets=year~.,
fill=EI.Sector, ylab="Emissions (in tons)",
main="Baltimore Point PM2.5 Emissions")
q+theme(axis.title.x=element_blank(),axis.text.x=element_text(angle=60, hjust=1))
What about emissions for “Coal Combustion related sources”? We can find sources from the SCC database that match this query, and use those to subset the data, and see that there is likely a downward trend for coal related sources.
# I found this to be a reasonable query of the SCC database
coalSCCs <- SCC[grepl("Coal",SCC$Short.Name),]
coalSCCs <- coalSCCs[grepl("Combustion",coalSCCs$SCC.Level.One),"SCC"]
# Subset the NEI data then aggregate on year
NEIcoal <- NEI[NEI$SCC %in% coalSCCs,]
NEIcoaltotal <- aggregate(Emissions ~ year + Data.Category, data=NEIcoal, sum)
qplot(factor(year), Emissions, fill=Data.Category, data=NEIcoaltotal, geom="bar",
stat="identity", ylab="Emissions (in tons)", xlab="Year",
main="Coal Combustion PM2.5 Emissions")
What about trends in motor vehicle sources? Selecting on road motor vehicle sources can be done in a single simple query to the SCC Database. Here I compare a few different FIPS. Los Angelas is 06037, Baltimore is 24510, Seattle is 53033 and DC is 11001.
# Want to select on road motor vehicle emissions sources
mvs <- SCC[grepl("Mobile - On-Road",SCC$EI.Sector),"SCC"]
# Subset by fips, then SCC's
NEImv <- NEI[NEI$fips=='24510' | NEI$fips=='11001' |
NEI$fips=='53033' | NEI$fips=="06037",]
NEImv <- NEImv[NEImv$SCC %in% mvs,]
# aggregate total Emissions by year and fips
NEImvtotal <- aggregate(Emissions ~ year + fips, NEImv, sum)
qplot(factor(year),Emissions, data=NEImvtotal, fill=fips, geom="bar",
stat="identity",position="dodge",ylab="Emissions PM2.5 (tons)",
xlab="Year", main="Motor Vehicle PM2.5 Emissions")
The National Emissions Inventory data is interesting. It’s amazing how much data we collect in our world today, and this is only expected to grow. I hope that data like this will be used ultimately to make better informed policy decisions which will make our world a better place. As an assignment for this Data Analysis course, it was an appropriately challenging and a really fun data set to dig into. I ultimately found the barplot’s from ggplot2 to offer the best and easiest visualization for these simple questions.