I’m covering a popular coding task for my first post: scraping the web for information. There are many good tutorials on how to do this out there (e.g., here and here). For this post I tried to keep things pretty simple and detailed since I’m still a bit low on the scraping learning curve, and I thought I might be able to help from my perspective as a beginner.
I also wanted to focus on something unique that I could do with the results. After all, why scrape unless it can get interesting data more efficiently than visiting the website(s) and just viewing it?
You’ll find some expandable R coding notes below that may help you with this often long and rather messy process. I actually think web scraping is an excellent way to test your coding skills because of this. Feel free to ignore them and get more quickly to the results, which I hope you find worthwhile!
I. Load R packages
library(robotstxt)  # check website scraping rules
library(readr)      # reading data
library(dplyr)      # "wrangling" data
library(rvest)      # web scraping
library(knitr)      # HTML table creation
library(kableExtra) # additional HTML table formatting
library(rworldmap)  # world choropleth mapping
I do respect the rules contained in robots.txt (if it is present). For example, one of my first ideas for this post was to scrape Google search results for “plastic straws” to analyze the frequency and types of news hits on this hot topic. Unfortunately, this does not appear to be OK with Google:
rt_google.com <- robotstxt(domain = "google.com")
head(rt_google.com$permissions)  # head() provides first 6 rows of permissions
##      field useragent                  value
## 1 Disallow         *                /search
## 2    Allow         *          /search/about
## 3    Allow         *         /search/static
## 4    Allow         * /search/howsearchworks
## 5 Disallow         *                  /sdch
## 6 Disallow         *                /groups
Row 1 of the output basically says that no type of automated scraping of the “/search” pages is allowed. You should also always read the website’s terms of service (e.g., www.policies.google.com/terms).
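If you just want a quick yes/no answer for a specific path, the robotstxt package also has a `paths_allowed()` function. Here is a small check along those lines (just an illustration, using the path and bot from the table above):

```r
# is automated scraping of /search allowed for any bot ("*")?
# given the Disallow rule in row 1 above, this should come back FALSE
paths_allowed(paths = "/search", domain = "google.com", bot = "*")
```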
There is a lot written about the legality and ethics of web scraping, and I’m no expert, but you can find plenty (including diverging opinions) by searching the web. Here is a great blog post with a lot of interesting discussion in the comments.
After more searching, I finally found another site aligned with my interests that appeared to welcome (or at least not disallow) scraping, the Global Invasive Species Database1 (www.iucngisd.org/gisd/). In fact, the site as of this writing apparently has no robots.txt file:
rt_iucngisd.org <- robotstxt(domain = "iucngisd.org")
## get_robotstxt(): iucngisd.org; HTTP status: 404
Regardless of this, I still checked the site’s terms of service, where my scraping and subsequent analysis appears to fall under Derivative works. Next, I needed to determine the scraping pathway; that is, how are the individual species’ data accessed on the site?
Luckily, this was simply the site’s base URL with the genus and species names appended to the end of it, separated by a “+”:
“http://www.iucngisd.org/gisd/speciesname/” & “Genus+species”
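For instance, here is a quick sketch of that pattern for a single species (just swapping the space in the name for a “+” and pasting it onto the base URL):

```r
base_url <- "http://www.iucngisd.org/gisd/speciesname/"
paste0(base_url, sub(" ", "+", "Agave americana"))
## [1] "http://www.iucngisd.org/gisd/speciesname/Agave+americana"
```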
Now all I needed was a list of species names. For this I used the Advanced Search Options to download a .csv text file of all the Liliopsida species (113) in the database. I picked this taxonomic group (the monocots, a class of flowering plants) in part because of the fundamental ways some of these species can alter ecosystems where they become invasive, and also because of their number, which would make manually visiting each species’ page to gather information very inefficient.
Once saved to my computer, I was ready to use R to process the file, scrape the data, and analyze it:
II. Scraping sequence
Step 1. Load the list of 113 Liliopsida species from the GISD search.

SppList <- read_delim("export_gisd_Liliopisda.csv", ";",
                      escape_double = FALSE, trim_ws = TRUE)

# to display some of this loaded data frame as an HTML table:
kable(SppList[1:5,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
- The file has the extension “.csv” and the separator is a semicolon (a standard often used in Europe), not a comma, so the `read_delim()` function is used instead of `read_csv()`.
- The result of the `kable()` and `kable_styling()` functions chained together with `%>%` (a “pipe” from the magrittr package that gets loaded by the dplyr package) produces a clean-looking HTML table output.
| Species | Kingdom | Phylum | Class | Order | Family | System |
|---|---|---|---|---|---|---|
| Aegilops triuncialis | Plantae | Magnoliophyta | Liliopsida | Cyperales | Poaceae | Terrestrial |
| Agapanthus praecox | Plantae | Magnoliophyta | Liliopsida | Liliales | Liliaceae | Terrestrial |
| Agave americana | Plantae | Magnoliophyta | Liliopsida | Liliales | Agavaceae | Terrestrial |
| Agave sisalana | Plantae | Magnoliophyta | Liliopsida | Liliales | Agavaceae | Terrestrial |
| Agrostis capillaris | Plantae | Magnoliophyta | Liliopsida | Cyperales | Poaceae | Terrestrial |
SppList <- SppList %>%
  tidyr::separate(Species, c("genus", "species"), sep = " ", remove = FALSE) %>%
  tidyr::unite(c(genus, species), col = "genspp", sep = "+", remove = FALSE) %>%
  mutate(url = paste0("http://www.iucngisd.org/gisd/speciesname/", genspp)) %>%
  select(-genspp)

SppList[1:5,]$url  # display the first 5 URLs
The tidyr package `separate()` and `unite()` functions are used within the dplyr chain to (1) split the Species column names by the space `" "` between them into new genus and species columns; and (2) combine these columns into a new genspp column of the names joined by a `"+"`. The `mutate()` function then creates a new url column of the base URL combined with the genspp name (utilizing the `paste0()` function), and the genspp column is dropped by `select()`.
## | [1] | "http://www.iucngisd.org/gisd/speciesname/Aegilops+triuncialis" | |
## | [2] | "http://www.iucngisd.org/gisd/speciesname/Agapanthus+praecox" | |
## | [3] | "http://www.iucngisd.org/gisd/speciesname/Agave+americana" | |
## | [4] | "http://www.iucngisd.org/gisd/speciesname/Agave+sisalana" | |
## | [5] | "http://www.iucngisd.org/gisd/speciesname/Agrostis+capillaris |
This resulted in 113 URLs of pages that I wanted to scrape information from. But what information? I am always interested in geographic trends and patterns so I settled on getting the lists of countries where the species are from (their origin or “native range”) and the lists of countries where they ended up (their destination or “alien range”).
This data is contained in two HTML lists on the DISTRIBUTION tab of a species’ page. For example, Aegilops triuncialis is considered an alien species in Japan and the U.S. and a native species in many Eurasian countries.
The two XPaths of interest were `//*[@id="ar-col"]` for the alien range countries and `//*[@id="nr-col"]` for the native range countries. I used these identifiers within the structure of a loop to iterate through each URL and collect the countries in two separate lists.
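Before committing to the full loop, it can help to test the XPath on a single species page. Here is a minimal check with rvest (the species URL and object name are just for illustration):

```r
# sanity check: pull the native range list for one species
test_page <- read_html("http://www.iucngisd.org/gisd/speciesname/Agave+americana")
test_page %>%
  html_nodes(xpath = '//*[@id="nr-col"]') %>%
  html_nodes("li") %>%
  html_text()
```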
Loops can be a bit tricky to understand if you haven’t tried one before, but I will emphasize that:
- I used a small subset of URLs (1:5) in testing the loop (I left in the brackets to demonstrate this).
- I preallocated the lists (a_ranges and n_ranges) to hold the data using the `vector()` function, and named them using the `names()` function.
- The `cat()` function is a simple trick to make a counter to view the progress of the loop.
- I added a slight delay of 5 seconds to each “scrape” using the `Sys.sleep()` function to avoid overwhelming the site’s server (some robots.txt files specify the amount of delay). If I were scraping much more than 100 pages, I would also try to contact the site administrator to make sure this would be OK.
urls <- SppList[1:113,]$url
sppnames <- SppList[1:113,]$Species

a_ranges <- vector(mode = "list", length = length(urls))
n_ranges <- vector(mode = "list", length = length(urls))
names(a_ranges) <- sppnames
names(n_ranges) <- sppnames

for (i in 1:length(urls)) {
  cat(". ")
  tryCatch(
    page <- urls[i] %>% read_html(),
    error = function(e){NA})
  tryCatch(
    a_range <- page %>%
      html_nodes(xpath = '//*[@id="l-1st-step"]') %>%
      html_nodes("li") %>%
      html_text(),
    error = function(e){NA})
  tryCatch(
    n_range <- page %>%
      html_nodes(xpath = '//*[@id="nr-col"]') %>%
      html_nodes("li") %>%
      html_text(),
    error = function(e){NA})
  a_ranges[[i]] <- a_range
  n_ranges[[i]] <- n_range
  Sys.sleep(5)
}
To deconstruct this loop a bit:
- The `for()` function defines `"i"` to represent the sequence of URLs in the vector named urls created on line 1.
- The `tryCatch()` functions help keep the loop running in case errors are encountered during the scraping.
- Each scrape (done by rvest package functions) consists of:
  - Reading the page (e.g., for `"i" = 1`, `urls[i] %>% read_html()` reads the first URL’s page).
  - Extracting the country names in the HTML nodes (`html_nodes()` and `html_text()` get the text from the list elements in the XPath).
  - Adding the country names to the preallocated and named lists (the double brackets `[[i]]` achieve this).
Great, the scraping is done! The next step is to convert the lists to data frames for analysis. You will also often find that some cleanup of the text is needed. In this case the countries in the a_ranges list have a bracketed number before them (as in “[1] Japan”), and this unwanted text along with extra white space is removed using the `gsub()` function nested in the `trimws()` function.
This cleanup is important for any subsequent task where the country names might need to be joined to other data properly, like I did for the rworldmap package mapping function below.
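As a quick illustration of that cleanup pattern (the input string here is made up to mimic a scraped entry):

```r
# strip digits and square brackets, then trim the leftover whitespace
trimws(gsub("[[:digit:]]|\\[|\\]", "", "[1] japan "))
## [1] "japan"
```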
a_ranges_df <- data.frame(Species = rep(names(a_ranges), sapply(a_ranges, length)),
                          AR = unlist(a_ranges),
                          row.names = NULL, stringsAsFactors = FALSE)
a_ranges_df$AR <- trimws(gsub("[[:digit:]]|\\[|\\]", "", a_ranges_df$AR))
a_ranges_df <- inner_join(a_ranges_df, SppList[, 1:9], by = "Species")

n_ranges_df <- data.frame(Species = rep(names(n_ranges), sapply(n_ranges, length)),
                          NR = unlist(n_ranges),
                          row.names = NULL, stringsAsFactors = FALSE)
n_ranges_df <- inner_join(n_ranges_df, SppList[, 1:9], by = "Species")
- The `data.frame()` function creates a data frame with a Species column of the list item name repeated (the `rep()` function) for as many rows as there are countries for a species (accomplished by the `sapply()` function), and an AR/NR column of country names from converting the list to a vector (the `unlist()` function).
- The `gsub()` function cleaning up the AR data replaces any number (`[[:digit:]]`) and the two brackets (`[` and `]`) with nothing (`""`). Note the OR operator `|` separating these items, as well as the `\\` which “escapes” the brackets as special characters.
- The dplyr package `inner_join()` functions join SppList data back to each new data frame (useful later on for data summaries by Family, System, etc.).
III. Results
tbl_a <- a_ranges_df %>%
  group_by(AR) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  slice(1:5) %>%
  rename(Country = AR, `Species Count` = count)

tbl_n <- n_ranges_df %>%
  group_by(NR) %>%
  summarise(count = n()) %>%
  arrange(-count) %>%
  slice(1:5) %>%
  rename(Country = NR, `Species Count` = count)

col <- tbl_df(cbind(rep("", 5)))
tbl_an <- cbind(tbl_a, col, tbl_n)
colnames(tbl_an)[3] <- " "

tbl_an %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(c(1,4), width = "10em") %>%
  column_spec(c(2,3,5), width = "8em") %>%
  add_header_above(c("Top Alien Range Countries" = 2, " " = 1,
                     "Top Native Range Countries" = 2))
- The data summaries are obtained by dplyr chains that assign country names as a grouping variable (the `group_by()` function), count them (the `n()` function within the `summarise()` function), sort the counts in descending order (the `arrange()` function), take the top five (the `slice()` function), and finally rename the columns to something nicer for display (the `rename()` function).
- To display them side by side, the `cbind()` function is used to bind them together (with a blank, empty space-named column in between them to produce some separation).
- Functions in the kableExtra package set the HTML table column widths and add an additional header (note the middle blank column gets `" "` white space for a header; using `""` results in an error).
| Top Alien Range Countries |  |  | Top Native Range Countries |  |
|---|---|---|---|---|
| Country | Species Count |  | Country | Species Count |
| united states | 84 |  | china | 25 |
| australia | 61 |  | india | 23 |
| new zealand | 58 |  | italy | 20 |
| new caledonia | 39 |  | russian federation | 20 |
| french polynesia | 38 |  | spain | 19 |
These counts give a rough picture of where the species come from and where they end up. I emphasize rough for a few reasons: (1) as useful as I believe these accounts are, the information they contain is unlikely to be perfect and some countries may be missing from the lists; (2) just because a country makes either list doesn’t mean that the species is known to be harmful in that country (e.g., it may be merely established) or that the country is a known source; and (3) the will and the resources each country has to gather this information can vary widely.
That said, a whopping 84 of the Liliopsida species are considered alien to the U.S., followed by Australia with 61 and New Zealand with 58. China includes the native ranges of the most species at 25, followed closely by India with 23.
I next calculated the frequency of each country pair across the joined alien and native range data, both as a global total and considering the U.S. as a source alone, to get an idea of the magnitude of the potential (not proven) connections between a country as a source and a country as a destination:
an_ranges_df <- inner_join(a_ranges_df, n_ranges_df, by = "Species")

tbl_gl <- an_ranges_df %>%
  count(NR, AR) %>%
  arrange(-n) %>%
  slice(1:5) %>%
  rename(`Native Range` = NR, `Alien Range` = AR)

tbl_us <- an_ranges_df %>%
  filter(NR == "united states") %>%
  count(NR, AR) %>%
  arrange(-n) %>%
  slice(1:5) %>%
  rename(`Native Range` = NR, `Alien Range` = AR)

tbl_glus <- cbind(tbl_gl, col, tbl_us)
colnames(tbl_glus)[4] <- " "

tbl_glus %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "center") %>%
  column_spec(c(1,2,5,6), width = "10em") %>%
  column_spec(c(3,4,7), width = "5em") %>%
  add_header_above(c("Top (Potential) Global Connections" = 3, " " = 1,
                     "Top (Potential) U.S. Origin Connections" = 3))
- After joining the alien range and native range data, the `count()` function in the dplyr chains is a handy way to get a count of the country pairs.
- The `filter()` function in the tbl_us chain provides a U.S. native range-filtered total (note the use of `==`).
| Top (Potential) Global Connections |  |  |  | Top (Potential) U.S. Origin Connections |  |  |
|---|---|---|---|---|---|---|
| Native Range | Alien Range | n |  | Native Range | Alien Range | n |
| china | united states | 23 |  | united states | new zealand | 8 |
| india | united states | 21 |  | united states | united states | 8 |
| italy | united states | 19 |  | united states | australia | 7 |
| russian federation | united states | 18 |  | united states | ecuador | 4 |
| spain | united states | 18 |  | united states | fiji | 4 |
Not surprisingly given their prior rankings, the highest pair of countries for the global total is China and the U.S. While this does not prove that China is the actual source of most of the alien occurrences of these species in the U.S. (e.g., India is a close second), it could indicate something about the species in the China region and/or implicate factors such as trade (a powerful driver of species introductions) that may increase their chances of becoming established here.
Although the numbers of connections are relatively low for the U.S.-filtered data, it is interesting that South Pacific countries (New Zealand, Australia, and Fiji) are prominent. But what about the U.S. being paired with itself? Mostly this is because some species native to one area of the U.S. have become established in another (e.g., Spartina alterniflora is native on the east coast and alien on the west coast), a more likely occurrence for larger countries. Several species are in both lists for rather complex taxonomic reasons (e.g., if interested, check out Phragmites australis).
The rworldmap mapping procedure first requires that the countries and their counts get matched to the map via the aptly named `joinCountryData2Map()` function. This function also provides statistics on the matching success, here for the alien range countries:
a_ranges_df_c <- a_ranges_df %>%
  group_by(AR) %>%
  summarise(count = n())

gtdMapAR <- joinCountryData2Map(a_ranges_df_c, nameJoinColumn = "AR", joinCode = "NAME")
## 187 codes from your data successfully matched countries in the map
## 45 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
The matching was a success overall; most of the 45 codes from the scraped and cleaned AR data that failed were for geographic entities such as Africa that I didn’t attempt to look into further to try and fix.
Producing the choropleth map of the species counts was done using the `mapCountryData()` function with some customization of the color ramp and legend:
colfunc_a <- colorRampPalette(c("white", "red"))

map1Params <- mapCountryData(gtdMapAR, nameColumnToPlot = "count",
                             catMethod = c(0,1,5,10,15,20,25,85),
                             colourPalette = c(colfunc_a(7)[1:6], "darkred"),
                             mapTitle = "Liliopsida Species Counts: Alien Range Countries",
                             addLegend = FALSE)

map1Params$legendText <- c("0", "1-5", "6-10", "11-15", "16-20", "21-25", ">25")

do.call(addMapLegendBoxes, c(map1Params, x = "left", horiz = FALSE,
                             cex = 0.8, title = "# Species"))
- Assigning the `colorRampPalette()` function to the colfunc_a object creates a flexible way to generate a number of color values along a gradient from white to red.
- The `mapCountryData()` function takes the base map with the count data attached that was created previously (gtdMapAR), and symbolizes it, adds a title, etc.
  - I chose breaks of 5 for the count categories (catMethod parameter), with 0 and >25 also represented (0 is actually NA in the base map data, since the counts are >0).
  - Note the use of `colfunc_a()` to generate 7 color values, the last of which is replaced manually with “darkred” to set the >25 category apart a bit (the first value, white, is assigned to NA).
  - The addLegend parameter is FALSE because I wanted to replace the default continuous bar legend with a box legend.
- The box legend is added with the `do.call()` function (after the vector of category labels is added first to the map object by accessing its legendText slot) and formatted to be vertical, left, etc. with additional parameters.
n_ranges_df_c <- n_ranges_df %>%
  group_by(NR) %>%
  summarise(count = n())

gtdMapNR <- joinCountryData2Map(n_ranges_df_c, nameJoinColumn = "NR", joinCode = "NAME")
## 170 codes from your data successfully matched countries in the map
## 29 codes from your data failed to match with a country code in the map
## 73 codes from the map weren't represented in your data
colfunc_n <- colorRampPalette(c("white", "blue"))

map2Params <- mapCountryData(gtdMapNR, nameColumnToPlot = "count",
                             catMethod = c(0,1,5,10,15,20,25),
                             colourPalette = colfunc_n(6),
                             mapTitle = "Liliopsida Species Counts: Native Range Countries",
                             addLegend = FALSE)

map2Params$legendText <- c("0", "1-5", "6-10", "11-15", "16-20", "20-25")

do.call(addMapLegendBoxes, c(map2Params, x = "left", horiz = FALSE,
                             cex = 0.8, title = "# Species"))
A somewhat reversed geographic trend in the northern hemisphere is evident in the native range map compared to the alien range map, with higher counts of species in Asia and Europe on the whole. Could this difference between the maps be a reflection of some historical Old World-New World pathway for these species?
Right as I was finishing up this post (honestly!), I found a published journal article2 that resembles what I did (they even used the rworldmap package). Please read it for a much more in-depth and sophisticated interpretation of data in the GISD and other sources than I could hope to do here.
So that basically wraps it up. Anyone reading this with coding expertise (perhaps even in another language like Python) might do things better, or differently. I would love to hear about it, or anything else relevant to this process or particular subject matter, in the comments. Best of luck with your scraping!
1 Invasive Species Specialist Group ISSG 2015. The Global Invasive Species Database. Version 2015.1
2 Turbelin, A. J., Malamud, B. D., & Francis, R. A. (2016). Mapping the global state of invasive alien species: patterns of invasion and policy responses. Global Ecology and Biogeography, 26(1), 78-92. https://onlinelibrary.wiley.com/doi/abs/10.1111/geb.12517
Ecologist, GIS and remote sensing analyst, developing data scientist, and founder of Natural Data Solutions.