Introduction

This tutorial provides a brief introduction to the purrr package, focusing on what I find to be the most useful functions and how they combine with dplyr to make your life easier. The purrr package is incredibly versatile and can get very complex depending on your application. Here, my goal is to build intuition around particularly the map family of functions by showing real-world applications, including modeling and visualization.

If you’re familiar with the logic behind base R’s apply family of packages, this intuition should be familiar. purrr is also meant to replace the now deprecated dplyr::do.


Introduction to purrr

First, let’s prepare our workspace. You’ll need to install tidyverse if you haven’t already.

library(tidyverse)

map

Now let’s look at how some of the key purrr functions work with toy examples. The most commonly used function is map:

x <- seq(5, 10)
map(x, ~ .* 5)
## [[1]]
## [1] 25
## 
## [[2]]
## [1] 30
## 
## [[3]]
## [1] 35
## 
## [[4]]
## [1] 40
## 
## [[5]]
## [1] 45
## 
## [[6]]
## [1] 50

The map function needs two arguments: a vector or list and a function or formula.

In this basic case I’m passing map a vector, x, which is simply a sequence of numbers. I am then performing a basic operation to each number in that sequence, multiplying it by 5. The map function returns a list so that’s what we get. We can also use variations of map_ which allow returning other types of data outputs. I’ll cover more of those later, but this is a basic example using map_dbl:

map_dbl(x, ~ .*5)
## [1] 25 30 35 40 45 50

Other common variants include map_lgl, map_int, and map_chr.

Note you can also pipe in your data:

x %>% map_dbl(~ .*5)
## [1] 25 30 35 40 45 50

You can also construct your own functions and pass those through map:

custom_formula <- function(value){
  round(value^2 * rnorm(1, 5, 15), 1)
}

output <- map_dbl(x, ~custom_formula(.))

output
## [1]  115.9 1443.9  127.2 1114.0 -457.3 1235.8

An additional twist on map is map2. This is useful when you have two vectors or list that you want to combine in one single formula. An important note here is that both must be the same length. If not, things get complicated:

x <- seq(2000, 2010)
y <- seq(10, 20)

map2_dbl(x, y, ~ .x + .y)
##  [1] 2010 2012 2014 2016 2018 2020 2022 2024 2026 2028 2030

Note that map functions work just as well with other types of data, such as strings or characters:

movies <- c("A New Hope",
            "The Empire Strikes Back",
            "Return of the Jedi",
            "Phantom Menace",
            "Attack of the Clones",
            "Revenge of the Sith",
            "The Force Awakens",
            "The Last Jedi",
            "Rise of Skywalker")

years <- c(1977, 1980, 1983, 1999, 2002, 2005, 2015, 2017, 2019)

map2_chr(movies, years, ~paste(.x, .y, sep=": "))
## [1] "A New Hope: 1977"              "The Empire Strikes Back: 1980"
## [3] "Return of the Jedi: 1983"      "Phantom Menace: 1999"         
## [5] "Attack of the Clones: 2002"    "Revenge of the Sith: 2005"    
## [7] "The Force Awakens: 2015"       "The Last Jedi: 2017"          
## [9] "Rise of Skywalker: 2019"

Now let’s say you wanted to perform an operation to each of these strings:

map2_chr(movies, years, ~paste(.x, .y, sep=": ")) %>% 
  map_chr(~str_to_upper(.))
## [1] "A NEW HOPE: 1977"              "THE EMPIRE STRIKES BACK: 1980"
## [3] "RETURN OF THE JEDI: 1983"      "PHANTOM MENACE: 1999"         
## [5] "ATTACK OF THE CLONES: 2002"    "REVENGE OF THE SITH: 2005"    
## [7] "THE FORCE AWAKENS: 2015"       "THE LAST JEDI: 2017"          
## [9] "RISE OF SKYWALKER: 2019"

pluck

If you’ve ever worked with lists in R you know the syntax is somewhat counter intuitive. Instead of having to do things like list[[1]][2] to recover specific elements from lists we can use the very useful pluck function:

Here’s the basic functionality:

example <- list(movies, years,
                preference = c(2, 1, 3, 7, 8, 9, 4, 6, 5))

example
## [[1]]
## [1] "A New Hope"              "The Empire Strikes Back"
## [3] "Return of the Jedi"      "Phantom Menace"         
## [5] "Attack of the Clones"    "Revenge of the Sith"    
## [7] "The Force Awakens"       "The Last Jedi"          
## [9] "Rise of Skywalker"      
## 
## [[2]]
## [1] 1977 1980 1983 1999 2002 2005 2015 2017 2019
## 
## $preference
## [1] 2 1 3 7 8 9 4 6 5
example[[1]][5]
## [1] "Attack of the Clones"
example[[2]][5]
## [1] 2002
example[[3]][5]
## [1] 8


With pluck:

pluck(example, 1)
## [1] "A New Hope"              "The Empire Strikes Back"
## [3] "Return of the Jedi"      "Phantom Menace"         
## [5] "Attack of the Clones"    "Revenge of the Sith"    
## [7] "The Force Awakens"       "The Last Jedi"          
## [9] "Rise of Skywalker"
pluck(example, 1, 5)
## [1] "Attack of the Clones"
pluck(example, 2, 5)
## [1] 2002
pluck(example, 3, 5)
## [1] 8

Why might you want to use this other than easier syntax? Maybe something like this, where we want to extract the last word of each movie title:

example %>% 
  pluck(1) %>% 
  map_chr(~ word(., -1))
## [1] "Hope"      "Back"      "Jedi"      "Menace"    "Clones"    "Sith"     
## [7] "Awakens"   "Jedi"      "Skywalker"

What’s going on here?

  • example %>% pluck(1): pipe in our list and grab the first list element, the movie title;
  • map_chr(~ word(., 1)): take each movie title and extract the last word from it using the word function.

Here’s how we could still do it with pipes but without the purrr functionality:

example %>% 
  .[[1]] %>% 
  word(-1)
## [1] "Hope"      "Back"      "Jedi"      "Menace"    "Clones"    "Sith"     
## [7] "Awakens"   "Jedi"      "Skywalker"

Not too bad, but you can see how this could get complex and hard to read quickly with things like .[[1]].

Ok, now we’ll go onto something more complex using some real data.




Applications of purrr with replication data

I’m going to use a really nice replication file available for download at Harvard’s Dataverse here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CI2EPI. I spend quite a bit of time talking about and using this dataset in my comprehensive R for Congressional Data guide. Let’s download this file and load it into your workspace:

df <- read.csv("allCongressDataPublishV2.csv")

This is a great dataset that has the congressional district-Congress as its unit of observation (so one row is one congressional district per term of Congress).

Let’s jump right in. One common task, especially for description, is running the same model specification over various subsets of the data. This is quite easy with purrr and replaces the do function from dplyr. You can also do this with the apply family of functions, but if you ever scale this up and want to run it on especially large data you’ll notice performance issues. purrr solves this problem.


Multiple models:

Let’s say we want to run a certain model for each state in the data. A model of interest might be how bill introduction is correlated with distrinct income. And maybe we also want to control for some obvious factors, such as ideological extremity and the member of Congress’ age (older members may be near retirement and not as interested in introducing new legislation).

The baseline model might look like this (after some basic data cleaning):

df <- filter(df, !is.na(medianIncome) & 
               !is.na(state) &
               !is.na(dwnom1) &
               !is.na(age)) %>% 
  mutate(log_income = log(medianIncome),
         abs_dwnom = abs(dwnom1))

summary(lm(numSpon ~ log_income + abs_dwnom + age, data=df))
## 
## Call:
## lm(formula = numSpon ~ log_income + abs_dwnom + age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.179  -7.049  -2.517   3.824 107.431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.47993    2.64380  -0.938  0.34827    
## log_income   0.69062    0.26401   2.616  0.00892 ** 
## abs_dwnom    0.64276    0.68857   0.933  0.35061    
## age          0.15535    0.01268  12.255  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.84 on 7035 degrees of freedom
## Multiple R-squared:  0.02465,    Adjusted R-squared:  0.02424 
## F-statistic: 59.27 on 3 and 7035 DF,  p-value: < 2.2e-16

Given this naive model, there’s a positive correlation between district median income and bill introduction that is statistically significant. Now how do we run this same model over all 50 states? This is pretty easy using map_df and the tidy function from the broom package:

library(broom)

model_function <- function(state.abb){
  
  model <- lm(numSpon ~ log_income + abs_dwnom + age,
              data=df %>% filter(state == state.abb))
  
  output <- tidy(model) %>% 
    filter(term %in% c("log_income", "abs_dwnom", "age")) %>% 
    select(term, estimate, std.error) %>% 
    mutate(state = state.abb)
  
  return(output)
}

There are slightly shorter ways of doing this with longer piping procedures, but I prefer this way for readability and debugging.

What’s going on here?

First, we create the model specification and assign it to the object model. The key difference between this and the above is that we filter the data down to the specific state of interest. Next, we run the model and tidy the model output, which returns a dataframe. Then we simply clean up the dataframe, keeping the coefficients and statistics of interest. You can see what this looks like for one state:

model_function("NC")
## # A tibble: 3 x 4
##   term       estimate std.error state
##   <chr>         <dbl>     <dbl> <chr>
## 1 log_income    1.62     1.50   NC   
## 2 abs_dwnom     2.12     3.48   NC   
## 3 age           0.328    0.0719 NC

Now let’s turn this into a dataframe and plot the results!

result_df <- map_df(unique(df$state), ~model_function(.))

As you can see this returns a dataframe that essentially stacks each individual state’s results on top of eachother. This is because we used map_df instead of regular map, which would have returned a dataframe of lists. That is also fine, and you now know how to work with those, but this format makes it easier to visualize our results!

Let’s visualize this as a coefficient plot for log_income. I like using the dotwhisker package (https://cran.r-project.org/web/packages/dotwhisker/vignettes/dotwhisker-vignette.html which makes things much easier. I cover this in greater detail here.

library(dotwhisker)

result_df %>% 
  filter(term == "log_income") %>% 
  select(-term) %>% 
  rename(term = state) %>%
  arrange(estimate) %>% 
  dwplot() +
    theme_minimal() +
    geom_vline(xintercept = 0, alpha=.5, linetype=2) +
    theme(legend.position = 'none') +
  xlab("Estimate") + ylab("State")

Not bad. Let’s try one more complicated example, creating maps with estimated results that vary over time. To do this let’s create a new time period variable based on the congressional term:

df$time.group <- NA
df$time.group[df$congNum >= 98 & df$congNum <102] <- "1983-1990"
df$time.group[df$congNum >= 102 & df$congNum < 107] <- "1991-2000"
df$time.group[df$congNum >= 107 & df$congNum <= 113] <- "2001-2012"

Now let’s try out the nest function, which essentially collapses your data based on a group_by argument into a dataframe of dataframes:

group.df <- df %>% 
  group_by(time.group, state) %>% 
  nest

What we can do now is run the same model function over each state and each time period using map, tidy, and unnest.

group.results <- group.df %>% 
  mutate(output = map(data, ~lm(numSpon ~ log_income + abs_dwnom + age, data=.) %>% 
                        tidy %>% 
                        filter(term == "log_income") %>% 
                        select(estimate))) %>%
  unnest(output) %>%
  select(state, time.group, estimate)

head(group.results)
## # A tibble: 6 x 3
## # Groups:   state, time.group [56]
##   state time.group estimate
##   <fct> <chr>         <dbl>
## 1 AL    1983-1990      6.86
## 2 AR    1983-1990    -28.7 
## 3 AZ    1983-1990    -81.2 
## 4 CA    1983-1990    -11.2 
## 5 CO    1983-1990    -25.4 
## 6 CT    1983-1990    -60.8

The logic of what’s going on here is as follows:

  1. Take the grouped dataframe and pipe it into a mutate;
  2. The mutate takes the data column for each time period by state grouping and applies the map function to it.
  3. The map function passes the same model specification as above, tidys it and then grabs just the estimate for log_income.
  4. We then unnest it, which turns the dataframe of dataframes back into something we can work with.
  5. Finally we select the columns of interest.

Making maps of model results

Now let’s plot this on a map, again using the map function. I go into greater detail about making maps here: http://congressdata.joshuamccrain.com/visualization.html#3_creating_maps.

states_map <- map_data("state")

states_map <- states_map %>% 
  left_join(data.frame(region = str_to_lower(state.name), state=state.abb)) 
## Joining, by = "region"
states_map.long <- left_join(states_map, group.results)
## Joining, by = "state"

Now let’s create our visualization function so we can pass it to map. You’ll see why we do it this way next.

make.map <- function(time){
  ggplot(states_map.long %>% filter(time.group == time),
         aes(x=long, y=lat, group=group, fill=estimate)) +
  geom_polygon() +
  geom_polygon(data = states_map, aes(x = long, y = lat, group=group),
               fill=NA, color="black", size=.5, inherit.aes=F) +
  scale_fill_gradient() +
  coord_map() +
  ggtitle(time) +
  theme_void() +
  theme(legend.position = "none")
}

maps <- map(unique(states_map.long$time.group), ~make.map(.))

The reason creating a list of ggplot objects is useful is we can then call each one separately if we’d like:

maps[[3]]




Additional resources

This has really been a crash course on the intuition of purrr and its most commonly used functions. There is a lot of flexibility here and tons of other applications. Here are some good resources: