Introduction

This tutorial is meant as an introduction to RSelenium for webscraping. If you’ve come across a website you can’t figure out how to scrape – typically because the normal downloading of HTML doesn’t produce what you want – RSelenium is more often than not your solution.

This tutorial assumes basic knowledge of R, rvest, and tidyverse functionality (specifically, pipes: %>%). However, these packages are used in a fairly straightforward way here so that the utility of RSelenium comes through clearly.

For more tutorials and code, see my website joshuamccrain.com.

Why RSelenium?

Many websites are difficult to scrape because they dynamically pull data from databases using JavaScript and jQuery. For example, on common social media sites such as LinkedIn or Facebook, as you scroll down the page new content is loaded and the URL doesn’t change. These websites are much more difficult to scrape. An easy scraping task is when we can adjust the URL to load a new page based on some systematic pattern.

For example, if we look at OpenSecrets’ website, we see that the URLs change systematically, such as at https://www.opensecrets.org/federal-lobbying/top-spenders. If we wanted to scrape the top spender’s (the US Chamber of Commerce) total lobbying spending over multiple years, we can go to its page for 2019 and see a URL ending in summary?cycle=2019&id=D000019798&name=US+Chamber+of+Commerce. We first notice that this URL has a particular construction, including cycle=2019. An easy way to scrape each year of lobbying expenditures for the Chamber of Commerce would be to loop through a vector of years from, e.g., 2000 to 2019. This is an easy scraping task for which there are many existing tutorials.
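As a rough sketch of that kind of loop (not run here; the base URL path is an assumption pieced together from the query string above, and which table you keep depends on the page’s layout), the pattern looks something like this:

library(rvest)
library(purrr)

# Sketch only: swap each year into the cycle= parameter and pull the page's tables.
# The base URL path is assumed; the id/name parameters come from the URL fragment above.
years <- 2000:2019

lobbying <- map(years, function(yr) {
  paste0("https://www.opensecrets.org/federal-lobbying/clients/summary",
         "?cycle=", yr, "&id=D000019798&name=US+Chamber+of+Commerce") %>%
    read_html() %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) # a list of the page's tables; keep the one you need
})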

A more complicated example

If you find a website that you’d like to scrape that does not change its URL for each individual page, that is a different problem. Here’s a simple example of such a website, one that also has applied uses for economics and political science research (I am currently using data scraped from it in multiple working papers): the FCC’s television broadcast signal strength data, https://www.fcc.gov/media/engineering/dtvmaps

This website has a simple form in which you enter a zip code, and it gives you the local TV stations available in that zip code and their signal strength (see image below). You’ll also notice the URL stays fixed, presenting us with a dilemma.

Example 1

The problem is now clear, especially if you want to scrape the signal strength for every zip code in a state or the whole country (tens of thousands of zip codes). Let’s look at how to automate this using RSelenium.

Using RSelenium

As you’ll see, this exercise also relies on rvest, the most commonly used package in R for scraping. We also load the tidyverse, with which I assume some familiarity.

The way I think about RSelenium and rvest interacting is that RSelenium is used first, to load the page we want to scrape and to download the HTML from that page. From there, we can rely on the typical scraping tools and concepts provided by rvest.

RSelenium is particularly useful when scraping something behind a login or in other settings where it is hard to automate or simulate human behavior on a website (Note: these statements assume you have permission to scrape a given website).
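For instance, once the browser client (remDr, created in the next section) is running, a login step is usually just a matter of finding the right fields and buttons. Here is a purely hypothetical sketch, where the URL and the element IDs are placeholders you would replace after inspecting the real login page:

# Hypothetical login sketch: the URL and the IDs "username", "password", and
# "login-btn" are placeholders, not a real site.
remDr$navigate("https://example.com/login")
remDr$findElement(using = "id", value = "username")$sendKeysToElement(list("my_user"))
remDr$findElement(using = "id", value = "password")$sendKeysToElement(list("my_password"))
remDr$findElement(using = "id", value = "login-btn")$clickElement()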

Getting started

First let’s install the required packages and load them into the workspace.

install.packages("RSelenium")
install.packages("rvest")
install.packages("tidyverse")
library(RSelenium)
library(rvest)
library(tidyverse)

Opening up a browser

Now, we use RSelenium to open up a browser on your computer, which it will then control and to which we can pass commands. I prefer Firefox for this, for simplicity and speed.

rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]

You’ll notice a lot happens in your console when you run this. Among other things, this line of code downloads the driver binaries RSelenium needs (for example, a Selenium server and geckodriver) so that it can launch and control Firefox. If this worked successfully, you should now have a blank Firefox browser window open on your computer.

A couple of notes:

  • I am not getting into the technical details of how Selenium works, but you can find more info in the RSelenium package documentation and vignettes.
  • If you get an error based on the port option in this function call, try changing the port number to something else until you don’t get one.
  • I have not used this on Mac OS, but the documentation appears to be the same.
  • You can do the rest of the steps here without physically opening up a browser through the use of Docker. I’ve had problems with this in the past, but it’s worth learning; a rough sketch follows at the end of this section.

The second line in the snippet above assigns the browser client to an object, remDr.
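For reference, here is a rough sketch of that Docker route (assuming Docker is installed and using the standard selenium/standalone-firefox image; your port numbers may differ):

# Rough sketch, assuming Docker is installed. First, in a terminal:
#   docker run -d -p 4445:4444 selenium/standalone-firefox
# Then connect to the container from R instead of calling rsDriver():
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()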


Working with RSelenium

Now we’re ready to do some scraping. First, let’s navigate to the page of interest: https://www.fcc.gov/media/engineering/dtvmaps

remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")

If this worked, the newly-opened Firefox browser should have loaded this page!

Our next step is to feed a zip code into the form. First, we need to find information for the form using HTML, CSS or possibly XPath. If you’re not familiar with how this works, I’d recommend starting with some simple scraping tutorials with rvest. It’s not necessary to fully understand this to move forward, but it does help in scraping generally.

For those experienced with scraping, you’ll know that right click > inspect on the web page element of interest is the best way to figure out this information. So, right click on the text box of the form and click inspect. You’ll see something similar to the image below.

What we want to do is uniquely identify this form, which we can do using the form’s id, in this case startpoint (evident in the highlighted portion of the inspect window).

Next, we want the browser to send a string of text into that form (in this case, a zip code). Let’s try it out:

zip <- "30308"
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
# other possible values for using: "xpath", "css selector", "id", "name", "tag name", "class name", "link text", "partial link text"

What this snippet of code is doing is assigning a zip code to the object zip, which we are then passing through a Selenium function. This function, findElement takes two parameters, using and value. In this case we’re using the unique ID for the form field identified through the right click > inspect, which has the value of “startpoint”. Note that there are other possible ways to identify web page elements with Selenium, most of which are straightforward if you’re otherwise familiar with scraping – I’ve found myself frequently using xpath, link text, and partial link text depending on the setting.

The next function that is processed, after Selenium finds the element of interest, is sendKeysToElement, which takes a list as its argument. This function does exactly what it sounds like: it simulates keyboard entry and sends it to the element we identified.
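Because sendKeysToElement() takes a list, you can also mix plain text with named special keys (RSelenium stores the available key names in its selKeys object). As an aside, you could in principle type the zip code and press Enter in one call, though below we’ll click the page’s own submit button instead:

# Aside (not used below): send the zip followed by the Enter key in one call.
# Available key names are listed in RSelenium::selKeys.
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip, key = "enter"))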

NB: Selenium works in a classic object-oriented programming fashion. First, we assign the browser to an object (remDr). Then we tell that object to do things, in this case find certain elements (findElement()). Once an element is found, we can have Selenium perform tasks on it.
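To make that structure concrete, the chained call above can be split into two steps; this sketch is equivalent to the one-liner we just ran (zip_box is simply a name chosen here):

# Equivalent, unchained version of the zip-code entry above:
zip_box <- remDr$findElement(using = "id", value = "startpoint") # a webElement object
zip_box$sendKeysToElement(list(zip))                             # call methods on that object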

If this code successfully ran, you should now see the value of zip – 30308 – appear in the form field. Great! Now we need to submit the form and have it load the page with the stations and signal strength data. This is the same task as above, but we now need to find the unique identifier for the submit button and then tell Selenium to click it.

By right clicking on the form submit button (“Go!”), we see that its id is btnSub. Again, we tell Selenium to find that element, and now we add an additional task which is to click it (clickElement):

remDr$findElements("id", "btnSub")[[1]]$clickElement()

If this code runs successfully, you should see it load the same page as in the first screenshot above! Now we can get the HTML and process it however we like, just as in a more traditional scraping situation.

Extracting data from HTML

First, save the HTML to an object. This is the last time we need a Selenium function for a while, as everything else is done with rvest and the tidyverse.

Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]

You’ll see that html is a large character object with lots of HTML, CSS and JavaScript – something R knows how to work with (with some help).

Now, let’s extract the table of stations and their signal strength. To identify it, first right click > inspect.

signals <- read_html(html) %>% # parse HTML
  html_nodes("table.tbl_mapReception") %>% # extract table nodes with class = "tbl_mapReception"
  .[3] %>% # keep the third of these tables
  .[[1]] %>% # keep the first element of this list
  html_table(fill=T) # have rvest turn it into a dataframe
View(signals)

This is where the art of webscraping comes in. When we inspect this table, we see three tables with a class of tbl_mapReception. If you’re familiar with CSS, you know that a class is not a unique identifier. However, in this case we can still use it by grabbing the third such table on the page and using rvest functions to turn it into a dataframe (html_table()).

When we View this new dataframe (signals) we see it needs some cleaning.

names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2") # rename columns

signals <- signals %>%
  slice(2:n()) %>% # drop unnecessary first row
  filter(callsign != "") %>% # drop blank rows
  select(callsign:band) # drop unnecessary columns

Now we have a nicely formatted dataframe of stations and their callsigns with some extra information:

head(signals)
##   callsign network ch_num band
## 1  WXIA-TV     NBC     11 Hi-V
## 2  WUVG-DT    UNIV     34  UHF
## 3  WGCL-TV     CBS     46  UHF
## 4     WUPA     THE     69  UHF
## 5  WPCH-TV     IND     17  UHF
## 6  WAGA-TV     FOX      5  UHF

Finally, let’s add in the actual signal strength. This is a bit more complicated to extract. I want a continuous variable here, which is available when we click on an individual callsign, e.g.:

As before, right click on the link for any of the stations and click inspect to see the relevant surrounding HTML. There are a couple of ways of going about this, but here what I’m going to do is use the rvest function html_attr() to grab the text of each link’s onclick attribute, which contains the signal strength number.

To see this, we’ll first run part of the code without saving to an object:

read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick")

In the console this brings up a vector of strings in this format:

# [1] "getdetail(29421,51163,'WXIA-TV Facility ID: 51163 <br>WXIA-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=51163 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/51163 target=_new>Public File</a>)<br>City of License: ATLANTA, GA<br>RF Channel: 10<br>RX Strength: 109 dbuV/m<br>Tower Distance: 3 mi; Direction: 105°','WXIA-TV<br>Distance to Tower: 3 miles<br>Direction to Tower: 105 deg',33.7566666666667,-84.3319444444444,'WXIA-TV')"                                                                                                                          
 # [2] "getdetail(28291,48813,'WUVG-DT Facility ID: 48813 <br>WUVG-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=48813 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/48813 target=_new>Public File</a>)<br>City of License: ATHENS, GA<br>RF Channel: 18<br>RX Strength: 111 dbuV/m<br>Tower Distance: 4 mi; Direction: 48°<br>Repacked Channel: 18<br>Repacking Dates: 8/3/2019 to 9/6/2019','WUVG-DT<br>Distance to Tower: 4 miles<br>Direction to Tower: 48 deg',33.8073333333333,-84.3393055555556,'WUVG-DT')"

Each element of this vector contains the signal strength for one station. So all we have to do is figure out a way to extract that specific number, which is feasible with a simple regular expression. Let’s add that line to the above code:

read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick") %>% 
  str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")
##  [1] "109" "111" "110" "110" "110" "110" "109" "108" "106" "101" "100"
## [12] "100" "99"  "97"  "95"  "81"  "47"  "56"  "53"  "30"  "30"

Great! Now we have a vector of signal strength for each of the stations in the dataframe. Let’s save it to an object and add it to our existing dataframe.

strength <- read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick") %>% 
  str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")

signals <- cbind(signals, strength)
signals
##    callsign network ch_num band strength
## 1   WXIA-TV     NBC     11 Hi-V      109
## 2   WUVG-DT    UNIV     34  UHF      111
## 3   WGCL-TV     CBS     46  UHF      110
## 4      WUPA     THE     69  UHF      110
## 5   WPCH-TV     IND     17  UHF      110
## 6   WAGA-TV     FOX      5  UHF      110
## 7   WHSG-TV    TRIN     63  UHF      109
## 8      WATL    MY N     36  UHF      108
## 9    WSB-TV     ABC      2  UHF      106
## 10     WPBA     PBS     30  UHF      101
## 11  WATC-DT     ETV     57  UHF      100
## 12  WIRE-CD                 UHF      100
## 13  WYGA-CD                 UHF       99
## 14  WANN-CD                 UHF       97
## 15  WPXA-TV     ION     14  UHF       95
## 16     WGTV     PBS      8 Hi-V       81
## 17  WNGH-TV     PBS     18 Lo-V       47
## 18  WKTB-CD                 UHF       56
## 19  WSKC-CD                 UHF       53
## 20  WJSP-TV     PBS     28 Lo-V       30
## 21     WCIQ     PBS      7 Hi-V       30

Now we have a dataframe we can save in any format! That’s all there is to it. I’ll wrap up with some possible extensions for this type of project.


Extensions: Iteration

In reality, it’s unlikely you would want to do something like this for one zip code at a time. In other words, you’d like to pass a vector (or dataframe) of zip codes into a function that iterates the entire process from above. I’ll walk through a couple of additions needed to make this work before presenting the whole code.

Step 1: submitting a new zip code and preparing for iteration

Try running this code from above, now with a new zip code:

zip <- "27511"
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))

What you’ll see is the search box now has a messy string: “3030827511”! This won’t work, so we first have to clear the existing text from the box before continuing:

remDr$findElement("id", "startpoint")$clearElement()

What if the new zip code you pass to it is invalid? This kind of thing happens frequently, either because the raw data are wrong or because the website isn’t working properly. Let’s find out:

zip <- "27511111"
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
remDr$findElements("id", "btnSub")[[1]]$clickElement()

We get an alert window telling us “Address Not Found”. How can we deal with this error elegantly? There is no single correct solution, but I propose the following: first, check whether an alert appears; if so, create a blank dataframe and move on to the next zip code.

alert <- try(remDr$getAlertText(), silent=T) # check if there is an alert window

if(class(alert) != "try-error") { # if an alert window is present, do the following
  
  signals <- data.frame(callsign = NA, network = NA, ch_num = NA, band = NA, strength = NA, cont.strength = NA)
  remDr$acceptAlert()
  remDr$findElement("id", "startpoint")$clearElement()
  
} else { # if no alert, continue on as normal
  
  # normal scraping procedure code here
  
}

Putting it into a function

So we have some basic error checking and we now know how to pass new zip codes into the search box. Let’s read in a dataframe of zip codes, then wrap everything in a function in order to iterate:

zips.df <- read.csv("zip_code_data.csv") # csv of zip codes

rD <- rsDriver(browser="firefox", port=4557L)
remDr <- rD[["client"]]

remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")

scrape.zips <- function(zip){ # our scraping function
  
  remDr$findElement("id", "startpoint")$sendKeysToElement(list(zip))
  remDr$findElements("id", "btnSub")[[1]]$clickElement()
  
  alert <- try(remDr$getAlertText(), silent=T)
  
  if(class(alert) != "try-error") {
    
    signals <- data.frame(callsign = NA, network = NA, ch_num = NA, band = NA, strength = NA, cont.strength = NA)
    remDr$acceptAlert()
    remDr$findElement("id", "startpoint")$clearElement()
    
  } else {
    Sys.sleep(2)
    
    html <- remDr$getPageSource()[[1]]
    
    cont.strength <- read_html(html) %>% 
      html_nodes(".callsign") %>% 
      html_attr("onclick") %>% 
      str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")
    
    signals <- read_html(html) %>%
      html_nodes("table.tbl_mapReception") %>%
      .[3] %>%
      .[[1]] %>%
      html_table(fill=T)
    
    names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2")
    
    signals <- signals %>%
      slice(2:n()) %>%
      filter(callsign != "") %>%
      select(callsign:band)
    
    strength <- read_html(html) %>%
      html_nodes("table.tbl_mapReception:nth-child(3) .ae-img") %>% # each station's strength icon
      html_attr("src") # the icon's src filename encodes a categorical strength
    
    if(length(strength)==0) { strength <- "none" } # guard against pages with no results
    if(length(cont.strength)==0) { cont.strength <- "none" }
    
    signals <- cbind(signals, strength) %>% cbind(cont.strength)
    
    signals <- mutate(signals, strength = strength %>% str_extract("strength.")) # keep just the "strength" label (plus one character) from the src
  }
  
  remDr$findElement("id", "startpoint")$clearElement() # clear the search box for the next zip
  
  Sys.sleep(runif(1, 1, 3)) # pause for a random 1-3 seconds (this has to come before return())
  
  return(signals)

}

Finally, let’s create a “safe” scrape function that elegantly handles errors that we haven’t anticipated:

scrape_safe <- function(zip){
  
  result <- try(scrape.zips(zip))
  
  if (class(result) == "try-error") { # if any error is caught, return a blank dataframe and keep going
    cat("Error encountered for zip:", zip, "\n")
    Sys.sleep(runif(1, 1, 3)) # pause before moving on (this has to come before return())
    return(data.frame())
  } else { # if no error, return the result and move on to the next zip
    return(result)
  }
}

Then all that is left is to iterate over all of the zips. You can do this a number of ways including with purrr, for loops, or other tidyverse functions. Here I’m going to use do():

zips.df <- zips.df %>%
  group_by(zip) %>%
  do(scrape_safe(.$zip))
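If you prefer purrr, a minimal sketch of the same iteration with map_dfr() might look like this (the .id argument keeps each zip code as a column of the combined result):

# purrr sketch of the same loop: map_dfr() row-binds each zip's dataframe,
# and .id = "zip" records which zip each set of rows came from.
results <- zips.df$zip %>%
  as.character() %>%
  set_names() %>%
  map_dfr(scrape_safe, .id = "zip")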

Conclusion

RSelenium offers lots of other functionality not covered here. This post is meant as an introduction to how Selenium works and how it interacts with other common R packages such as rvest. Among other features, Selenium can take screenshots, click on specific links or parts of the page, scroll down pages, and send any keyboard stroke to any part of a web page. It’s very versatile when combined with classic scraping techniques and makes virtually any website scrapable.
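For instance, a few of those extras look like this (the “Next” link text is just a placeholder for whatever link you would actually target):

# A few of Selenium's extras, sketched with placeholder targets:
remDr$screenshot(file = "page.png") # save a screenshot of the current page to disk
remDr$findElement(using = "css selector", value = "body")$sendKeysToElement(list(key = "end")) # scroll to the bottom
remDr$findElement(using = "link text", value = "Next")$clickElement() # click a link by its visible text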