This tutorial provides a brief introduction to the purrr
package, focusing on what I find to be the most useful functions and how they combine with dplyr
to make your life easier. The purrr
package is incredibly versatile and can get very complex depending on your application. Here, my goal is to build intuition around particularly the map
family of functions by showing real-world applications, including modeling and visualization.
If you’re familiar with the logic behind base R’s apply
family of packages, this intuition should be familiar. purrr
is also meant to replace the now deprecated dplyr::do
.
purrr
First, let’s prepare our workspace. You’ll need to install tidyverse
if you haven’t already.
map
Now let’s look at how some of the key purrr
functions work with toy examples. The most commonly used function is map
:
## [[1]]
## [1] 25
##
## [[2]]
## [1] 30
##
## [[3]]
## [1] 35
##
## [[4]]
## [1] 40
##
## [[5]]
## [1] 45
##
## [[6]]
## [1] 50
The map
function needs two arguments: a vector or list and a function or formula.
In this basic case I’m passing map
a vector, x
, which is simply a sequence of numbers. I am then performing a basic operation to each number in that sequence, multiplying it by 5. The map
function returns a list so that’s what we get. We can also use variations of map_
which allow returning other types of data outputs. I’ll cover more of those later, but this is a basic example using map_dbl
:
## [1] 25 30 35 40 45 50
Other common variants include map_lgl
, map_int
, and map_chr
.
Note you can also pipe in your data:
## [1] 25 30 35 40 45 50
You can also construct your own functions and pass those through map:
custom_formula <- function(value){
round(value^2 * rnorm(1, 5, 15), 1)
}
output <- map_dbl(x, ~custom_formula(.))
output
## [1] 115.9 1443.9 127.2 1114.0 -457.3 1235.8
An additional twist on map
is map2
. This is useful when you have two vectors or list that you want to combine in one single formula. An important note here is that both must be the same length. If not, things get complicated:
## [1] 2010 2012 2014 2016 2018 2020 2022 2024 2026 2028 2030
Note that map
functions work just as well with other types of data, such as strings or characters:
movies <- c("A New Hope",
"The Empire Strikes Back",
"Return of the Jedi",
"Phantom Menace",
"Attack of the Clones",
"Revenge of the Sith",
"The Force Awakens",
"The Last Jedi",
"Rise of Skywalker")
years <- c(1977, 1980, 1983, 1999, 2002, 2005, 2015, 2017, 2019)
map2_chr(movies, years, ~paste(.x, .y, sep=": "))
## [1] "A New Hope: 1977" "The Empire Strikes Back: 1980"
## [3] "Return of the Jedi: 1983" "Phantom Menace: 1999"
## [5] "Attack of the Clones: 2002" "Revenge of the Sith: 2005"
## [7] "The Force Awakens: 2015" "The Last Jedi: 2017"
## [9] "Rise of Skywalker: 2019"
Now let’s say you wanted to perform an operation to each of these strings:
## [1] "A NEW HOPE: 1977" "THE EMPIRE STRIKES BACK: 1980"
## [3] "RETURN OF THE JEDI: 1983" "PHANTOM MENACE: 1999"
## [5] "ATTACK OF THE CLONES: 2002" "REVENGE OF THE SITH: 2005"
## [7] "THE FORCE AWAKENS: 2015" "THE LAST JEDI: 2017"
## [9] "RISE OF SKYWALKER: 2019"
pluck
If you’ve ever worked with lists in R you know the syntax is somewhat counter intuitive. Instead of having to do things like list[[1]][2]
to recover specific elements from lists we can use the very useful pluck
function:
Here’s the basic functionality:
## [[1]]
## [1] "A New Hope" "The Empire Strikes Back"
## [3] "Return of the Jedi" "Phantom Menace"
## [5] "Attack of the Clones" "Revenge of the Sith"
## [7] "The Force Awakens" "The Last Jedi"
## [9] "Rise of Skywalker"
##
## [[2]]
## [1] 1977 1980 1983 1999 2002 2005 2015 2017 2019
##
## $preference
## [1] 2 1 3 7 8 9 4 6 5
## [1] "Attack of the Clones"
## [1] 2002
## [1] 8
With pluck
:
## [1] "A New Hope" "The Empire Strikes Back"
## [3] "Return of the Jedi" "Phantom Menace"
## [5] "Attack of the Clones" "Revenge of the Sith"
## [7] "The Force Awakens" "The Last Jedi"
## [9] "Rise of Skywalker"
## [1] "Attack of the Clones"
## [1] 2002
## [1] 8
Why might you want to use this other than easier syntax? Maybe something like this, where we want to extract the last word of each movie title:
## [1] "Hope" "Back" "Jedi" "Menace" "Clones" "Sith"
## [7] "Awakens" "Jedi" "Skywalker"
What’s going on here?
example %>% pluck(1)
: pipe in our list and grab the first list element, the movie title;map_chr(~ word(., 1))
: take each movie title and extract the last word from it using the word
function.Here’s how we could still do it with pipes but without the purrr
functionality:
## [1] "Hope" "Back" "Jedi" "Menace" "Clones" "Sith"
## [7] "Awakens" "Jedi" "Skywalker"
Not too bad, but you can see how this could get complex and hard to read quickly with things like .[[1]]
.
Ok, now we’ll go onto something more complex using some real data.
purrr
with replication dataI’m going to use a really nice replication file available for download at Harvard’s Dataverse here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CI2EPI. I spend quite a bit of time talking about and using this dataset in my comprehensive R for Congressional Data guide. Let’s download this file and load it into your workspace:
This is a great dataset that has the congressional district-Congress as its unit of observation (so one row is one congressional district per term of Congress).
Let’s jump right in. One common task, especially for description, is running the same model specification over various subsets of the data. This is quite easy with purrr
and replaces the do
function from dplyr
. You can also do this with the apply
family of functions, but if you ever scale this up and want to run it on especially large data you’ll notice performance issues. purrr
solves this problem.
Let’s say we want to run a certain model for each state in the data. A model of interest might be how bill introduction is correlated with distrinct income. And maybe we also want to control for some obvious factors, such as ideological extremity and the member of Congress’ age (older members may be near retirement and not as interested in introducing new legislation).
The baseline model might look like this (after some basic data cleaning):
df <- filter(df, !is.na(medianIncome) &
!is.na(state) &
!is.na(dwnom1) &
!is.na(age)) %>%
mutate(log_income = log(medianIncome),
abs_dwnom = abs(dwnom1))
summary(lm(numSpon ~ log_income + abs_dwnom + age, data=df))
##
## Call:
## lm(formula = numSpon ~ log_income + abs_dwnom + age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.179 -7.049 -2.517 3.824 107.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.47993 2.64380 -0.938 0.34827
## log_income 0.69062 0.26401 2.616 0.00892 **
## abs_dwnom 0.64276 0.68857 0.933 0.35061
## age 0.15535 0.01268 12.255 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 7035 degrees of freedom
## Multiple R-squared: 0.02465, Adjusted R-squared: 0.02424
## F-statistic: 59.27 on 3 and 7035 DF, p-value: < 2.2e-16
Given this naive model, there’s a positive correlation between district median income and bill introduction that is statistically significant. Now how do we run this same model over all 50 states? This is pretty easy using map_df
and the tidy
function from the broom package:
library(broom)
model_function <- function(state.abb){
model <- lm(numSpon ~ log_income + abs_dwnom + age,
data=df %>% filter(state == state.abb))
output <- tidy(model) %>%
filter(term %in% c("log_income", "abs_dwnom", "age")) %>%
select(term, estimate, std.error) %>%
mutate(state = state.abb)
return(output)
}
There are slightly shorter ways of doing this with longer piping procedures, but I prefer this way for readability and debugging.
What’s going on here?
First, we create the model specification and assign it to the object model
. The key difference between this and the above is that we filter the data down to the specific state of interest. Next, we run the model and tidy
the model output, which returns a dataframe. Then we simply clean up the dataframe, keeping the coefficients and statistics of interest. You can see what this looks like for one state:
## # A tibble: 3 x 4
## term estimate std.error state
## <chr> <dbl> <dbl> <chr>
## 1 log_income 1.62 1.50 NC
## 2 abs_dwnom 2.12 3.48 NC
## 3 age 0.328 0.0719 NC
Now let’s turn this into a dataframe and plot the results!
As you can see this returns a dataframe that essentially stacks each individual state’s results on top of eachother. This is because we used map_df
instead of regular map
, which would have returned a dataframe of lists. That is also fine, and you now know how to work with those, but this format makes it easier to visualize our results!
Let’s visualize this as a coefficient plot for log_income
. I like using the dotwhisker
package (https://cran.r-project.org/web/packages/dotwhisker/vignettes/dotwhisker-vignette.html which makes things much easier. I cover this in greater detail here.
library(dotwhisker)
result_df %>%
filter(term == "log_income") %>%
select(-term) %>%
rename(term = state) %>%
arrange(estimate) %>%
dwplot() +
theme_minimal() +
geom_vline(xintercept = 0, alpha=.5, linetype=2) +
theme(legend.position = 'none') +
xlab("Estimate") + ylab("State")
Not bad. Let’s try one more complicated example, creating maps with estimated results that vary over time. To do this let’s create a new time period variable based on the congressional term:
df$time.group <- NA
df$time.group[df$congNum >= 98 & df$congNum <102] <- "1983-1990"
df$time.group[df$congNum >= 102 & df$congNum < 107] <- "1991-2000"
df$time.group[df$congNum >= 107 & df$congNum <= 113] <- "2001-2012"
Now let’s try out the nest
function, which essentially collapses your data based on a group_by
argument into a dataframe of dataframes:
What we can do now is run the same model function over each state and each time period using map
, tidy
, and unnest
.
group.results <- group.df %>%
mutate(output = map(data, ~lm(numSpon ~ log_income + abs_dwnom + age, data=.) %>%
tidy %>%
filter(term == "log_income") %>%
select(estimate))) %>%
unnest(output) %>%
select(state, time.group, estimate)
head(group.results)
## # A tibble: 6 x 3
## # Groups: state, time.group [56]
## state time.group estimate
## <fct> <chr> <dbl>
## 1 AL 1983-1990 6.86
## 2 AR 1983-1990 -28.7
## 3 AZ 1983-1990 -81.2
## 4 CA 1983-1990 -11.2
## 5 CO 1983-1990 -25.4
## 6 CT 1983-1990 -60.8
The logic of what’s going on here is as follows:
data
column for each time period by state grouping and applies the map
function to it.map
function passes the same model specification as above, tidy
s it and then grabs just the estimate
for log_income
.unnest
it, which turns the dataframe of dataframes back into something we can work with.Now let’s plot this on a map, again using the map
function. I go into greater detail about making maps here: http://congressdata.joshuamccrain.com/visualization.html#3_creating_maps.
states_map <- map_data("state")
states_map <- states_map %>%
left_join(data.frame(region = str_to_lower(state.name), state=state.abb))
## Joining, by = "region"
## Joining, by = "state"
Now let’s create our visualization function so we can pass it to map
. You’ll see why we do it this way next.
make.map <- function(time){
ggplot(states_map.long %>% filter(time.group == time),
aes(x=long, y=lat, group=group, fill=estimate)) +
geom_polygon() +
geom_polygon(data = states_map, aes(x = long, y = lat, group=group),
fill=NA, color="black", size=.5, inherit.aes=F) +
scale_fill_gradient() +
coord_map() +
ggtitle(time) +
theme_void() +
theme(legend.position = "none")
}
maps <- map(unique(states_map.long$time.group), ~make.map(.))
The reason creating a list of ggplot objects is useful is we can then call each one separately if we’d like:
This has really been a crash course on the intuition of purrr
and its most commonly used functions. There is a lot of flexibility here and tons of other applications. Here are some good resources: