Introduction

This tutorial covers the essential mapping functions for data visualization using ggplot2. I cover how to load and work with shapefiles downloaded from the internet (e.g., from the Census) and how to create maps from these files. I then show how to merge data into shapefiles to create maps for applied work in political science and economics (and other disciplines), using replication data from published research.

Finally, I show how to work with R packages that have built-in map data for creating choropleths at the state and congressional district level.

More tutorials and code, as well as my published work, are available at my website joshuamccrain.com.


Working with maps in R and ggplot

First let’s load in required packages – make sure to install these first if you don’t already have them.
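
If needed, they can all be installed in one go:

install.packages(c("sf", "ggplot2", "tidyverse", "rgdal", "maps")) # one-time installation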

library(sf)
library(ggplot2)
library(tidyverse)
library(rgdal)
library(maps)

Now, we’re going to load in a non-standard shapefile: Designated Market Areas (DMAs). DMAs are a geographic construction created by Nielsen to designate media markets. They are widely recognized by advertisers, broadcasters, and the FCC. They’re also used in a variety of political science and economics research, including in one of my papers: “Local News and National Politics” (with Gregory Martin, APSR, 2019).

Download the dma_2008.zip file from this dataverse repository: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IVXEHT and extract its contents to the same folder as your R file (make sure to have the working directory set to that folder as well!).
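
If you’d rather handle the extraction from R, here’s a minimal sketch, assuming the zip file has already been downloaded into your working directory (the file name dma_2008.zip comes from the repository):

unzip("dma_2008.zip") # extracts DMAs.shp and its companion files into the working directory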

There are many different packages available to read in shapefiles. Here I’m using the rgdal package because it works well when converting to a dataframe, which we need to do in order to merge in additional data later.

dma <- readOGR("DMAs.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "J:\Dropbox\Dropbox\tutorials\maps\DMAs.shp", layer: "DMAs"
## with 206 features
## It has 19 fields
dma.df <- fortify(dma, region = "DMA")
dma.df <- rename(dma.df, DMA = id)

We’ve read the shapefile in and used fortify() to convert it from a complex spatial object into a simple dataframe, which is much easier to work with.
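
A quick look at the result shows what fortify() produced – each row is one vertex of a polygon, with long/lat coordinates, a group identifier, and the DMA name we set as the region (renamed to DMA above):

head(dma.df) # one row per polygon vertex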

We can go ahead and plot this using ggplot and the geom_polygon() function. This function requires that the aesthetics for x and y have been set, which in most fortified shapefiles will be the long and lat columns. NB: the group=group argument here is important – try running the code without it to see why.

We also add coord_map() to this code to get a better-looking projection – map projections are a deep rabbit hole to go down if you want to know why.

ggplot(dma.df, aes(x=long, y=lat, group=group)) + 
  geom_polygon() +
  coord_map()

We already have a pretty standard map. Now all I’m going to do is make some simple aesthetic adjustments: drawing the outlines of the DMAs and removing the map’s background color, grid lines, and x and y axes.

ggplot(dma.df, aes(x=long, y=lat, group=group)) +
  geom_polygon(color="black", size=.5, fill="white") +
  coord_map() +
  theme_void()

Great! This is something we can work with. To make this practical and applicable to real-world research, let’s make some choropleths. I’m going to work with replication data from my paper, linked above, on the effects of Sinclair’s expansion. Sinclair is a conglomerate owner of local television news stations that famously had its local anchors read the same scripts night after night. In this paper, we coded which stations they owned across the country.

Merging and plotting real data to make choropleths

Download the replication file (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G3X4EW) and extract tveyes_stations.RData. We’re now going to load that into our workspace and take a look at its variables:

load("tveyes_stations.RData")

names(stations)
##  [1] "callsign"             "affiliation"          "sinclair2017"        
##  [4] "station_id"           "fcc_id"               "street"              
##  [7] "city"                 "state"                "zip"                 
## [10] "countyfips"           "county"               "dma"                 
## [13] "dmacode"              "age0_9"               "age10_19"            
## [16] "age20_29"             "age30_39"             "age40_49"            
## [19] "age50_59"             "age60_69"             "age70_79"            
## [22] "age80"                "edu_no_hs"            "edu_hs_grad"         
## [25] "edu_some_college"     "edu_college_grad"     "edu_grad_deg"        
## [28] "inc_10k"              "inc_10k_20k"          "inc_40k_50k"         
## [31] "inc_50k_60k"          "inc_60k_75k"          "inc_75k_100k"        
## [34] "inc_100k_125k"        "inc_125k_150k"        "inc_150k_200k"       
## [37] "inc_200k"             "inc_20k_30k"          "inc_30k_40k"         
## [40] "race_white"           "race_black"           "race_asian"          
## [43] "race_other"           "total_race"           "total_income"        
## [46] "total_age"            "total_edu"            "age0_9_pct"          
## [49] "age10_19_pct"         "age20_29_pct"         "age30_39_pct"        
## [52] "age40_49_pct"         "age50_59_pct"         "age60_69_pct"        
## [55] "age70_79_pct"         "age80_pct"            "edu_no_hs_pct"       
## [58] "edu_hs_grad_pct"      "edu_some_college_pct" "edu_college_grad_pct"
## [61] "edu_grad_deg_pct"     "inc_10k_pct"          "inc_10k_20k_pct"     
## [64] "inc_40k_50k_pct"      "inc_50k_60k_pct"      "inc_60k_75k_pct"     
## [67] "inc_75k_100k_pct"     "inc_100k_125k_pct"    "inc_125k_150k_pct"   
## [70] "inc_150k_200k_pct"    "inc_200k_pct"         "inc_20k_30k_pct"     
## [73] "inc_30k_40k_pct"      "race_white_pct"       "race_black_pct"      
## [76] "race_asian_pct"       "race_other_pct"       "total_pop"           
## [79] "dem_vote"             "rep_vote"             "third_vote"          
## [82] "other_vote"           "total_vote"           "dem_vote_pct"        
## [85] "rep_vote_pct"         "third_vote_pct"       "other_vote_pct"      
## [88] "college"              "age60plus"            "inc100kplus"         
## [91] "total_pop_MM"         "sinclair"

Great – this dataset has one observation per television station, but our map data is at the DMA level. So let’s condense it down to one observation per DMA, noting how many stations Sinclair owns in each DMA, how many were acquired by Sinclair in 2017, and how the DMA voted for Trump in 2016. This is pretty straightforward dplyr usage, so if you’re not familiar with it, take a look at one of the many tutorials available.

dma.stations <- stations %>% 
  ungroup() %>% 
  group_by(dmacode) %>% 
  summarize(sinclair_total = sum(sinclair),
            sinclair2017 = sum(sinclair2017),
            rep_vote_pct = max(rep_vote_pct)) %>% 
  mutate(sinclair_total = sinclair_total + sinclair2017)

head(dma.stations)
## # A tibble: 6 x 4
##   dmacode sinclair_total sinclair2017 rep_vote_pct
##   <chr>            <dbl>        <dbl>        <dbl>
## 1 500                  2            0        0.428
## 2 501                  0            0        0.348
## 3 502                  0            0        0.526
## 4 503                  1            0        0.542
## 5 504                  0            0        0.368
## 6 505                  0            0        0.419

That gives us a dataframe with four variables that we can then merge into the dataframe used to create the DMA map above. Let’s do that by merging on the three-digit DMA code:

dma.df <- dma.df %>% 
  left_join(dma.stations, by = c("DMA" = "dmacode"))
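
Before plotting, a quick sanity check on the join is worthwhile – DMA codes that didn’t match will come through as NA (this check is my addition, not part of the replication code):

sum(is.na(dma.df$sinclair_total)) # rows (polygon vertices) with no match in dma.stations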

Now let’s do the same plot as above, but fill in each DMA based on how many stations Sinclair owns: 0, 1, or 2. The main difference in this code is that I’ve added fill=sinclair_total to the aes() call, telling ggplot to use the sinclair_total variable to fill in each polygon (DMA).

ggplot(dma.df, aes(x=long, y=lat, group=group, fill=sinclair_total)) +
  geom_polygon(color="black", size=.5) +
  scale_fill_gradient(low = "white", high="red") + # this allows you to control the gradient
  coord_map() +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position="none") +
  ggtitle("Sinclair Ownership as of 2017")

This plot is nearly identical to plots in our published paper. We can also fill in each DMA by other variables, in this case the relative Republican voteshare:

ggplot(dma.df %>% mutate(rep_vote_pct = rep_vote_pct^2), # squaring the voteshare to highlight differences
       aes(x=long, y=lat, group=group, fill=rep_vote_pct)) +  # changing the `fill` variable
  geom_polygon(color="black", size=.5) +
  scale_fill_gradient(low = "white", high="purple")+
  coord_map() +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position="none") +
  ggtitle("DMAs by Republican Voteshare")

Let’s make a separate map that colors in DMAs based on how many stations are owned by Sinclair and whether Sinclair acquired a station in 2017. To do this, I’m creating a new dataframe that keeps only the DMAs where Sinclair had such an acquisition. I then add this in as another layer in the ggplot call:

new.sinclair <- dma.df %>% 
  filter(sinclair2017 > 0)

ggplot(dma.df, aes(x=long, y=lat, group=group, fill=sinclair_total)) +
  geom_polygon(color="black", size=.5) +
  scale_fill_gradient(low = "white", high="red")+
  geom_polygon(data = new.sinclair, 
               aes(x=long, y=lat, group=group), 
               fill = "dodgerblue", color="black", size=.5) + # new layer for 2017 acquisitions
  coord_map() +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position="none") +
  ggtitle("Newly Acquired Sinclair Stations")

Off the shelf map data: States and Congressional districts

If you’re lucky, you can work with geographic data that is pre-formatted and available in packages. ggplot2’s map_data() function provides access to quite a few map objects (drawn from the maps package), one of which is “state”. Take a look at the documentation using ?map_data to see other options. Let’s load that into a dataframe object and plot it as a map:

states_map <- map_data("state")

ggplot(states_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black", size=.5) +
  coord_map()

Very easy and flexible. Now let’s overlay these state boundaries on top of the DMA map and Sinclair ownership information from above. This is intuitive once you understand that ggplot works in layers, and all we’re doing is adding additional layers:

ggplot(dma.df, aes(x=long, y=lat, group=group, fill=sinclair_total)) + 
  geom_polygon(color="#666666", size=.5) +
  scale_fill_gradient(low = "white", high="red")+
  geom_polygon(data = states_map, aes(x = long, y = lat, group = group), fill=NA, color="black") + # states' borders
  geom_polygon(data = new.sinclair, 
               aes(x=long, y=lat, group=group), 
               fill = "dodgerblue", color="#666666", size=.5) + # newly acquired stations
  coord_map() +
  theme_void() +
  theme(legend.position="none") 

Congressional districts

Finally, let’s take a quick look at working with congressional districts using the USAboundaries package. Unfortunately this only has post-2010 redistricting boundaries available, and pre-2010 districts are surprisingly difficult to come by. However, if you find the shapefile you can work with it in the same way as the DMA data from above.
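
As a sketch of what that would look like – the file name and the region field below are hypothetical placeholders for whatever shapefile you track down:

old.cd <- readOGR("old_districts.shp")            # hypothetical pre-2010 district shapefile
old.cd.df <- fortify(old.cd, region = "DISTRICT") # use the shapefile's district identifier field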

Load in the library, create a dataframe using the key function us_congressional(), and then plot the basic map:

library(USAboundaries)

#?us_congressional to see other options

cong.map <- us_congressional(resolution = "high")

ggplot(cong.map) +
  geom_sf()

That doesn’t look great. The issue is that this package includes all congressional districts, including those in Alaska, Hawaii, and the territories. Let’s filter the dataframe down to the US mainland (sorry!):

cong.map <- filter(cong.map, state_name %in% state.name & !(state_name %in% c("Hawaii", "Alaska")))

ggplot(cong.map) +
  geom_sf(color="black", fill="white") +
  theme_void()

There we go. Finally, just to show the potential flexibility of these kinds of maps, you can filter to certain states (or districts) easily. You can also color each district or state based on whatever criteria you choose (or data you’ve merged in):

ggplot(cong.map %>% filter(state_name %in% c("North Carolina", "South Carolina", "Virginia")),
       aes(fill = state_name)) +
  geom_sf(color="black") +
  theme_void() +
  theme(legend.position = "none")

head(cong.map)
## Simple feature collection with 6 features and 11 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.4096 ymin: 33.87981 xmax: -114.0428 ymax: 42.00105
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   statefp cd115fp      affgeoid geoid lsad cdsessn        aland     awater
## 1      06      40 5001500US0640  0640   C2     115    149408224    1743217
## 2      06      02 5001500US0602  0602   C2     115  33546252741 4122497535
## 3      06      31 5001500US0631  0631   C2     115    565262021    6412076
## 4      06      10 5001500US0610  0610   C2     115   4714659697   54177484
## 5      06      09 5001500US0609  0609   C2     115   3245118661  134846563
## 6      32      04 5001500US3204  3204   C2     115 132082704590  628524273
##   state_name state_abbr jurisdiction_type                       geometry
## 1 California         CA             state MULTIPOLYGON (((-118.281 34...
## 2 California         CA             state MULTIPOLYGON (((-122.4463 3...
## 3 California         CA             state MULTIPOLYGON (((-117.7044 3...
## 4 California         CA             state MULTIPOLYGON (((-121.557 37...
## 5 California         CA             state MULTIPOLYGON (((-121.8513 3...
## 6     Nevada         NV             state MULTIPOLYGON (((-119.4398 3...

The flexibility becomes clear when you examine the dataframe: statefp and cd115fp are standardized FIPS codes. This makes it very straightforward to merge in additional data, say from DW-NOMINATE or the Census.
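
For example, a minimal sketch of such a merge – the ideology dataframe and its column names here are hypothetical:

# hypothetical district-level dataframe with columns state_fips and district_fips
cong.map <- left_join(cong.map, ideology,
                      by = c("statefp" = "state_fips", "cd115fp" = "district_fips"))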

Conclusion

This tutorial just scratches the surface of GIS in R and ggplot. Hopefully you can see how easy and flexible this workflow is, and how many ways there are to add your own data for your own visualization goals. There are lots of shapefiles available out there, and R works well with most of them. Enjoy!