I’m covering a popular coding task for my first post: scraping the web for information. There are many good tutorials on how to do this out there (e.g., here and here). For this post I tried to keep things pretty simple and detailed since I’m still a bit low on the scraping learning curve, and I thought I might be able to help from my perspective as a beginner.

I also wanted to focus on something unique that I could do with the results. After all, why scrape unless it can get interesting data more efficiently than visiting the website(s) and just viewing it?

You’ll find some expandable R coding notes below that may help you with this often long and rather messy process (I actually think web scraping is an excellent way to test your coding skills for that very reason). Feel free to ignore them and skip ahead to the results, which I hope you find worthwhile!

I. Load R packages

library(robotstxt)  # check website scraping rules
library(readr)  # reading data
library(dplyr)  # "wrangling" data
library(rvest)  # web scraping
library(knitr)  # HTML table creation
library(kableExtra)  # additional HTML table formatting
library(rworldmap)  # world choropleth mapping

I do respect the rules contained in robots.txt (if it is present). For example, one of my first ideas for this post was to scrape Google search results for “plastic straws” to analyze the frequency and types of news hits on this hot topic. Unfortunately, this does not appear to be OK with Google:

rt_google.com <- robotstxt(domain = "google.com")
head(rt_google.com$permissions) # head() provides first 6 rows of permissions
##      field useragent value
## 1 Disallow         * /search
## 2    Allow         * /search/about
## 3    Allow         * /search/static
## 4    Allow         * /search/howsearchworks
## 5 Disallow         * /sdch
## 6 Disallow         * /groups

Row 1 of the output basically says that no type of automated scraping of the “/search” pages is allowed. You should also always read the website’s terms of service (e.g., www.policies.google.com/terms).
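The same package also has a handy paths_allowed() function that answers the allow/disallow question directly for a given path; based on the rules above, it should return FALSE for the search pages:

paths_allowed(paths = "/search", domain = "google.com", bot = "*")
## [1] FALSE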

There’s a lot out there about the legality and ethics of web scraping, and I’m no expert, but you can find plenty (including diverging opinions) by searching the web. Here is a great blog post with a lot of interesting discussion in the comments.

After more searching, I finally found another site aligned with my interests that appeared to welcome (or at least not disallow) scraping, the Global Invasive Species Database1 (www.iucngisd.org/gisd/). In fact, the site as of this writing apparently has no robots.txt file:

rt_iucngisd.org <- robotstxt(domain = "iucngisd.org")
##get_robotstxt(): iucngisd.org; HTTP status: 404

Regardless of this, I still checked the site’s terms of service, where my scraping and subsequent analysis appears to fall under Derivative works. Next, I needed to determine the scraping pathway; that is, how are the individual species’ data accessed on the site? 

Luckily this was simply a base URL for the site with the genus and species name appended to the end of it separated by a “+”:

“http://www.iucngisd.org/gisd/speciesname/”  & “Genus+species”
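For example, pasting these two pieces together in R gives a complete page address:

paste0("http://www.iucngisd.org/gisd/speciesname/", "Aegilops+triuncialis")
## [1] "http://www.iucngisd.org/gisd/speciesname/Aegilops+triuncialis"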

Now all I needed was a list of species names. For this I used the Advanced Search Options to download a .csv text file of all the Liliopsida species (113) in the database. I picked this taxonomic group (flowering plants) in part because of the fundamental ways some of these species can alter ecosystems where they become invasive, and also because of their number which would make manually visiting each species’ page to gather information very inefficient. 

Once saved to my computer, I was ready to use R to process the file, scrape the data, and analyze it:

II. Scraping sequence

Step 1. Load the list of 113 Liliopsida species from the GISD search.
SppList <- read_delim("export_gisd_Liliopisda.csv", ";", escape_double = FALSE, trim_ws = TRUE) 
# to display some of this loaded data frame as an HTML table: 
kable(SppList[1:5,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
  1. The file has the extension “.csv” and the separator is a semicolon (a standard often used in Europe) not a comma, so the read_delim() function is used instead of read_csv().
  2. The result of the kable() and kable_styling() functions chained together with %>% (a “pipe” from the magrittr package that gets loaded by the dplyr package) produces a clean-looking HTML table output.
Species               Kingdom  Phylum         Class       Order      Family     System
Aegilops triuncialis  Plantae  Magnoliophyta  Liliopsida  Cyperales  Poaceae    Terrestrial
Agapanthus praecox    Plantae  Magnoliophyta  Liliopsida  Liliales   Liliaceae  Terrestrial
Agave americana       Plantae  Magnoliophyta  Liliopsida  Liliales   Agavaceae  Terrestrial
Agave sisalana        Plantae  Magnoliophyta  Liliopsida  Liliales   Agavaceae  Terrestrial
Agrostis capillaris   Plantae  Magnoliophyta  Liliopsida  Cyperales  Poaceae    Terrestrial
Step 2. Make a list of URLs using the genus + species names concatenated.
As usual there are many ways to do this in R, but I decided to add the URLs to the existing SppList data frame for access later, and generate separate genus and species name columns in the process:
SppList <- SppList %>% 
  tidyr::separate(Species, c("genus", "species"), sep = " ", remove = FALSE) %>% 
  tidyr::unite(c(genus, species), col="genspp", sep = "+", remove = FALSE) %>% 
  mutate(url=paste0("http://www.iucngisd.org/gisd/speciesname/", genspp)) %>% 
  select(-genspp)

SppList[1:5,]$url  # display the first 5 URLs

The tidyr package separate() and unite() functions are used within the dplyr chain to (1) split the Species column names by the space " " between them into new genus and species columns; and (2) combine these columns into a new genspp column of the names joined by a "+". The mutate() function then creates a new url column of the base URL combined with the genspp name (utilizing the paste0() function) and the genspp column is dropped by select().

##[1]"http://www.iucngisd.org/gisd/speciesname/Aegilops+triuncialis"
##[2]"http://www.iucngisd.org/gisd/speciesname/Agapanthus+praecox"
##[3]"http://www.iucngisd.org/gisd/speciesname/Agave+americana"
##[4]"http://www.iucngisd.org/gisd/speciesname/Agave+sisalana"
##[5]"http://www.iucngisd.org/gisd/speciesname/Agrostis+capillaris

This resulted in 113 URLs of pages that I wanted to scrape information from. But what information? I am always interested in geographic trends and patterns so I settled on getting the lists of countries where the species are from (their origin or “native range”) and the lists of countries where they ended up (their destination or “alien range”). 

This data is contained in two HTML lists on the DISTRIBUTION tab of a page. For example, Aegilops triuncialis is considered an alien species in Japan and the U.S. and a native species in many Eurasian countries.

Step 3. Examine the HTML of a page.
Finding the path to this particular data is greatly facilitated by inspecting the HTML, and I’ve provided a screenshot of the Chrome browser inspector (accessed by right-clicking the page) to show how I did it:
[Screenshot: inspecting the page HTML in the Chrome browser inspector]
Step 4. Connect to the site and extract information from each page.

The two XPaths of interest were //*[@id="l-1st-step"] for the alien range countries and //*[@id="nr-col"] for the native range countries (these are the identifiers used in the loop code below). I used them within the structure of a loop to iterate through each URL and collect the countries in two separate lists.
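Before running the full loop, it’s worth a quick one-off test on a single page to make sure an XPath returns what you expect; here is a minimal check (of my own devising) using the first URL and the native range list:

test_page <- read_html(SppList$url[1])
test_page %>% html_nodes(xpath='//*[@id="nr-col"]') %>% 
  html_nodes("li") %>% html_text() %>% head()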

Loops can be a bit tricky to understand if you haven’t tried one before, but I will emphasize that:

  1. I used a small subset of URLs (1:5) in testing the loop (I left in the brackets to demonstrate this).
  2. I preallocated the lists (a_ranges and n_ranges) to hold the data using the vector() function, and named them using the names() function.
  3. The cat() function is a simple trick to make a counter to view the progress of the loop.
  4. I added a slight delay of 5 seconds to each “scrape” using the Sys.sleep() function to avoid overwhelming the site’s server (some robots.txt files specify the amount of delay). If I were scraping much more than 100 pages, I would also try to contact the site administrator to make sure this would be OK.
urls <- SppList[1:113,]$url
sppnames <- SppList[1:113,]$Species

a_ranges <- vector(mode = "list", length = length(urls))
n_ranges <- vector(mode = "list", length = length(urls))

names(a_ranges) <- sppnames
names(n_ranges) <- sppnames

for (i in 1:length(urls)) {

  cat(". ")

  tryCatch(
  page <- urls[i] %>% read_html(), error = function(e){NA})

  tryCatch(
  a_range <- page %>% html_nodes(xpath='//*[@id="l-1st-step"]') %>%
  html_nodes("li") %>% html_text(), error = function(e){NA})

  tryCatch(
  n_range <- page %>% html_nodes(xpath='//*[@id="nr-col"]') %>%
  html_nodes("li") %>% html_text(), error = function(e){NA})

  a_ranges[[i]] <- a_range
  n_ranges[[i]] <- n_range

  Sys.sleep(5)
}

To deconstruct this loop a bit:

  1. The for() function defines "i" to represent the sequence of URLs in the vector named urls created in the first line of the code block above.
  2. The tryCatch() functions help keep the loop running in case errors are encountered during the scraping.
  3. Each scrape (done by rvest package functions) consists of:
    • Reading the page (e.g., for "i"=1, urls[i] %>% read_html() reads the first URL’s page).
    • Extracting the country names in the HTML nodes (html_nodes() and html_text() gets the text from the list elements in the XPath).
    • Adding the country names to the preallocated and named lists (the double-brackets [[i]] achieves this)
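As an aside, the same extraction can be written without an explicit for() loop. Here is a rough purrr-based sketch (not the approach I actually used; the function and object names are just for illustration) that reuses the urls and sppnames vectors and the same XPaths:

library(purrr)  # map() for functional iteration

scrape_ranges <- function(url) {
  Sys.sleep(5)  # same polite delay as in the loop above
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(list(ar = NA, nr = NA))
  list(ar = page %>% html_nodes(xpath='//*[@id="l-1st-step"]') %>%
              html_nodes("li") %>% html_text(),
       nr = page %>% html_nodes(xpath='//*[@id="nr-col"]') %>%
              html_nodes("li") %>% html_text())
}

ranges <- map(urls, scrape_ranges) %>% setNames(sppnames)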
Step 5. Process the lists and clean things up.

Great, the scraping is done! The next step is to convert the lists to data frames for analysis. You will also often find that some cleanup of the text is needed. In this case the countries in the a_ranges list have a bracketed number before them (as in “[1] Japan”), and this unwanted text, along with extra whitespace, is removed using the gsub() function nested in the trimws() function.

This cleanup is important for any subsequent task where the country names might need to be joined to other data properly, like I did for the rworldmap package mapping function below.
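To see what this cleanup does to a single raw entry (a made-up example string):

trimws(gsub("[[:digit:]]|\\[|\\]", "", "[1] japan  "))
## [1] "japan"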

a_ranges_df <- data.frame(Species = rep(names(a_ranges), sapply(a_ranges, length)),
                 AR = unlist(a_ranges), row.names = NULL, stringsAsFactors = FALSE)

a_ranges_df$AR <- trimws(gsub("[[:digit:]]|\\[|\\]","", a_ranges_df$AR))

a_ranges_df <- inner_join(a_ranges_df, SppList[, 1:9], by = "Species")

n_ranges_df <- data.frame(Species = rep(names(n_ranges), sapply(n_ranges, length)),
                 NR = unlist(n_ranges), row.names = NULL, stringsAsFactors = FALSE)

n_ranges_df <- inner_join(n_ranges_df, SppList[, 1:9], by = "Species")
  1. The data.frame() function creates a data frame with a Species column of the list item name repeated (the rep() function) for as many rows as there are countries for a species (accomplished by the sapply() function) and an AR/NR column of country names from converting the list to a vector (the unlist() function).
  2. The gsub() function cleaning up the AR data replaces any number ([[:digit:]]) and the two brackets ([ and ]) with nothing (""). Note the OR operator | separating these items as well as the \\ which “escapes” the brackets as special characters.
  3. The dplyr package inner_join() functions join SppList data back to each new data frame (useful later on for data summaries by Family, System, etc.).

III. Results

I first calculated the frequencies of each country across the 113 species accounts to provide a relative (and very rough) indication of which countries may be impacted the most by these species or be the greatest potential source of them:
tbl_a <- a_ranges_df %>% group_by(AR) %>% summarise(count = n()) %>% 
  arrange(-count) %>% slice(1:5) %>% rename(Country=AR, `Species Count` = count)

tbl_n <- n_ranges_df %>% group_by(NR) %>% summarise(count = n()) %>% 
  arrange(-count) %>% slice(1:5) %>% rename(Country=NR, `Species Count` = count)

col <- tbl_df(cbind(rep("",5)))
tbl_an <- cbind(tbl_a, col, tbl_n) 
colnames(tbl_an)[3] <- " "

tbl_an %>% kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>% 
  column_spec(c(1,4),  width = "10em") %>% 
  column_spec(c(2,3,5),  width = "8em") %>% 
  add_header_above(c("Top Alien Range Countries" = 2, " " = 1, 
                     "Top Native Range Countries" = 2))
  1. The data summaries are obtained by dplyr chains that assign country names as a grouping variable (the group_by() function), count them (the n() function within the summarise() function), sort the counts in descending order (the arrange() function), take the top five (the slice() function), and finally rename the columns to something nicer for display (the rename() function).
  2. To display them side by side, the cbind() function is used to bind them together (with a blank column, named with a single space, in between to produce some separation).
  3. Functions in the kableExtra package set the HTML table column widths and add an additional header (note the middle blank column gets " " white space for a header; using "" results in an error).
Top Alien Range Countries           Top Native Range Countries
Country            Species Count    Country             Species Count
united states      84               china               25
australia          61               india               23
new zealand        58               italy               20
new caledonia      39               russian federation  20
french polynesia   38               spain               19

I emphasize rough for a few reasons: (1) as useful as I believe these accounts are, the information they contain is unlikely to be perfect, and some countries may be missing from the lists; (2) just because a country makes either list doesn’t mean that the species is known to be harmful in that country (e.g., it may be merely established) or that the country is a known source; and (3) the will and the resources each country has to gather this information vary widely.

That said, a whopping 84 of the Liliopsida species are considered alien to the U.S., followed by Australia with 61 and New Zealand with 58. China is included in the native ranges of the most species (25), followed closely by India with 23.

I next calculated the frequency of each country pair across the joined alien and native range data, both as a global total and considering only the U.S. as a source, to get an idea of the magnitude of the potential (not proven) connections between a country as a source and a country as a destination:

an_ranges_df <- inner_join(a_ranges_df, n_ranges_df, by = "Species")

tbl_gl <- an_ranges_df %>% count(NR, AR) %>% arrange(-n) %>% 
  slice(1:5) %>% rename(`Native Range`=NR, `Alien Range`=AR) 

tbl_us <- an_ranges_df %>% filter(NR=="united states") %>% count(NR, AR) %>% 
  arrange(-n) %>% slice(1:5) %>% rename(`Native Range`=NR, `Alien Range`=AR)

tbl_glus <- cbind(tbl_gl, col, tbl_us) 
colnames(tbl_glus)[4] <- " "

tbl_glus %>% kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, position = "center") %>% 
  column_spec(c(1,2,5,6),  width = "10em") %>%
  column_spec(c(3,4,7),  width = "5em") %>% 
  add_header_above(c("Top (Potential) Global Connections" = 3, " " = 1, 
                     "Top (Potential) U.S. Origin Connections" = 3))
  1. After joining the alien range and native range data, the count() function in the dplyr chains is a handy way to get a count of the country pairs.
  2. The filter() function in the tbl_us chain provides a U.S. native range-filtered total (note the use of ==).
Top (Potential) Global Connections             Top (Potential) U.S. Origin Connections
Native Range        Alien Range     n          Native Range     Alien Range     n
china               united states   23         united states    new zealand     8
india               united states   21         united states    united states   8
italy               united states   19         united states    australia       7
russian federation  united states   18         united states    ecuador         4
spain               united states   18         united states    fiji            4

Not surprisingly, given their prior rankings, the highest pair of countries for the global total is China and the U.S. While this does not prove that China is the actual source of most of the alien occurrences of these species in the U.S. (India is a close second, for example), it could indicate something about the species of that region and/or implicate factors such as trade (a powerful driver of species introductions) that may increase their chances of becoming established here.

Although the numbers of connections are relatively low for the U.S. filtered data it is interesting that South Pacific countries (New Zealand, Australia, and Fiji) are prominent. But what about the U.S. being paired with itself? Mostly this is because some species native to one area of the U.S. have become established in another (e.g., Spartina alterniflora is native on the east coast and alien on the west coast); a more likely occurrence for larger countries. Several species are in both lists for rather complex taxonomic reasons (e.g., if interested, check out Phragmites australis).

The rworldmap mapping procedure first requires that the countries and their counts get matched to the map via the aptly-named joinCountryData2Map() function. This function also provides statistics on the matching success, here for the alien range countries:

a_ranges_df_c <- a_ranges_df %>% group_by(AR) %>% summarise(count = n())

gtdMapAR <- joinCountryData2Map(a_ranges_df_c, 
                               nameJoinColumn="AR",
                               joinCode="NAME")
##187 codes from your data successfully matched countries in the map
##45 codes from your data failed to match with a country code in the map
##56 codes from the map weren't represented in your data

The matching was a success overall; most of the 45 codes from the scraped and cleaned AR data that failed to match were broader geographic entities, such as Africa, that I didn’t attempt to fix.
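If you want to see exactly which entries failed to join, my understanding is that joinCountryData2Map() has a verbose argument that, when set to TRUE, prints them (output not shown):

joinCountryData2Map(a_ranges_df_c, 
                    nameJoinColumn="AR",
                    joinCode="NAME", 
                    verbose=TRUE)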

Producing the choropleth map of the species counts was done using the mapCountryData() function with some customization of the color ramp and legend:

colfunc_a <- colorRampPalette(c("white", "red"))

map1Params <- mapCountryData(gtdMapAR, 
                nameColumnToPlot = "count", 
                catMethod = c(0,1,5,10,15,20,25,85), 
                colourPalette=c(colfunc_a(7)[1:6],"darkred"),
                mapTitle = "Liliopsida Species Counts: Alien Range Countries",
                addLegend = FALSE)
map1Params$legendText <- c("0", "1-5", "6-10", "11-15", "16-20", "20-25", ">25")
do.call(addMapLegendBoxes, c(map1Params, x = "left", horiz = FALSE, cex = 0.8, 
                             title = "# Species"))
  1. Assigning the colorRampPalette() function to the colfunc_a object creates a flexible way to generate a number of color values along a gradient from white to red (see the quick example after these notes).
  2. The mapCountryData() function takes the base map with the count data attached that was created previously (gtdMapAR), and symbolizes it, adds a title, etc.
    • I chose breaks of 5 for the count categories (catMethod parameter), with 0 and >25 also represented (0 is actually NA in the base map data, since the counts are >0).
    • Note the use of colfunc_a() to generate 7 color values, the last of which is replaced manually with “darkred” to set the >25 category apart a bit (the first value, white, is assigned to NA).
    • The addLegend parameter is FALSE because I wanted to replace the default continuous bar legend with a box legend.
  3. The box legend is added with the do.call() function (after the vector of category labels is added first to the map object by accessing its legendText slot) and formatted to be vertical, left, etc. with additional parameters.
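As a quick illustration of the palette function mentioned in note 1, calling it with a number returns that many hex color codes grading from white to red, something like:

colfunc_a(3)
## [1] "#FFFFFF" "#FF7F7F" "#FF0000"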
[Map: Liliopsida Species Counts: Alien Range Countries]
One geographic trend evident in this map is the high counts for North America, the South Pacific, and South Africa. Some of the zero counts are “holes” that are a bit suspect (e.g., for Mongolia in between Russia and China). Interestingly, there is even an occurrence for Antarctica (Poa pratensis).
Moving on to the native range map, the country name matching was also generally successful:
n_ranges_df_c <- n_ranges_df %>% group_by(NR) %>% summarise(count = n())

gtdMapNR <- joinCountryData2Map(n_ranges_df_c, 
                               nameJoinColumn="NR",
                               joinCode="NAME")
##170 codes from your data successfully matched countries in the map
##29 codes from your data failed to match with a country code in the map
##73 codes from the map weren't represented in your data
colfunc_n <- colorRampPalette(c("white", "blue"))

map2Params <- mapCountryData(gtdMapNR, 
                nameColumnToPlot="count", 
                catMethod=c(0,1,5,10,15,20,25), 
                colourPalette=colfunc_n(6),
                mapTitle = "Liliopsida Species Counts: Native Range Countries", 
                addLegend = FALSE)
map2Params$legendText <- c("0", "1-5", "6-10", "11-15", "16-20", "20-25")
do.call(addMapLegendBoxes, c(map2Params, x = "left", horiz = FALSE, cex = 0.8, 
                             title = "# Species"))
[Map: Liliopsida Species Counts: Native Range Countries]

A somewhat reversed geographic trend in the northern hemisphere is evident in the native range map compared to the alien range map, with higher counts of species in Asia and Europe on the whole. Could this difference between the maps be a reflection of some historical Old World-New World pathway for these species?

Right as I was finishing up this post (honestly!), I found a published journal article2 that resembles what I did (they even used the rworldmap package). Please read it for a much more in-depth and sophisticated interpretation of data in the GISD and other sources than I could hope to do here.

So that basically wraps it up. Anyone reading this with coding expertise (perhaps even in another language like Python) might do things better, or differently; I would love to hear about it, or anything else relevant to this process or subject matter, in the comments. Best of luck with your scraping!

1 Invasive Species Specialist Group (ISSG) 2015. The Global Invasive Species Database. Version 2015.1.

2 Turbelin, A. J., Malamud, B. D., & Francis, R. A. (2016). Mapping the global state of invasive alien species: patterns of invasion and policy responses. Global Ecology and Biogeography, 26(1), 78-92. https://onlinelibrary.wiley.com/doi/abs/10.1111/geb.12517


Jim Sheehan

Ecologist, GIS and remote sensing analyst, developing data scientist, and founder of Natural Data Solutions.
