Interacting with open-source databases through R

Accessing data through R

  • Option 1: Download from a repository, read into R.
# Navigate to website:
# https://figshare.com/articles/dataset/When_and_where_plant-soil_feedback_may_promote_plant_coexistence_A_meta-analysis_Supplementary_Table_1_File_containing_raw_data_and_citations_for_the_meta-analysis/7985195/1
# Download and save file
data <- read_csv("/path/to/file.csv")

Accessing data through R

  • Option 2: Use R to download from a repository, and proceed with your analysis
curl::curl_download("https://figshare.com/ndownloader/files/14874749", destfile = "../crawford-supplement.xlsx")

data <- read_excel("../data/crawford-supplement.xlsx", sheet = "Data") 

Accessing data through R

Web-scraping through R

library(tidyverse)
library(rvest)

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section[[1]]
## {html_node}
## <section>
## [1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
## [2] <p>\nReleased: 1999-05-19\n</p>
## [3] <p>\nDirector: <span class="director">George Lucas</span>\n</p>
## [4] <div class="crawl">\n<p>\nTurmoil has engulfed the Galactic Republic. The ...

We can now extract data from the html output.

movies <- section |> html_element("h2") |> html_text2()
movies
## [1] "The Phantom Menace"      "Attack of the Clones"   
## [3] "Revenge of the Sith"     "A New Hope"             
## [5] "The Empire Strikes Back" "Return of the Jedi"     
## [7] "The Force Awakens"

directors <- section |> html_element(".director") |> html_text2()
directors
## [1] "George Lucas"     "George Lucas"     "George Lucas"     "George Lucas"    
## [5] "Irvin Kershner"   "Richard Marquand" "J. J. Abrams"

crawl <- section |> html_element(".crawl") |> html_text2()
crawl
## [1] "Turmoil has engulfed the Galactic Republic. The taxation of trade routes to outlying star systems is in dispute.\n\nHoping to resolve the matter with a blockade of deadly battleships, the greedy Trade Federation has stopped all shipping to the small planet of Naboo.\n\nWhile the Congress of the Republic endlessly debates this alarming chain of events, the Supreme Chancellor has secretly dispatched two Jedi Knights, the guardians of peace and justice in the galaxy, to settle the conflict…."           
## [2] "There is unrest in the Galactic Senate. Several thousand solar systems have declared their intentions to leave the Republic.\n\nThis separatist movement, under the leadership of the mysterious Count Dooku, has made it difficult for the limited number of Jedi Knights to maintain peace and order in the galaxy.\n\nSenator Amidala, the former Queen of Naboo, is returning to the Galactic Senate to vote on the critical issue of creating an ARMY OF THE REPUBLIC to assist the overwhelmed Jedi…."             
## [3] "War! The Republic is crumbling under attacks by the ruthless Sith Lord, Count Dooku. There are heroes on both sides. Evil is everywhere.\n\nIn a stunning move, the fiendish droid leader, General Grievous, has swept into the Republic capital and kidnapped Chancellor Palpatine, leader of the Galactic Senate.\n\nAs the Separatist Droid Army attempts to flee the besieged capital with their valuable hostage, two Jedi Knights lead a desperate mission to rescue the captive Chancellor…."                     
## [4] "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.\n\nDuring the battle, Rebel spies managed to steal secret plans to the Empire’s ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.\n\nPursued by the Empire’s sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy…."
## [5] "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy.\n\nEvading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth.\n\nThe evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space…."                 
## [6] "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt.\n\nLittle does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star.\n\nWhen completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy…"                                          
## [7] "Luke Skywalker has vanished. In his absence, the sinister FIRST ORDER has risen from the ashes of the Empire and will not rest until Skywalker, the last Jedi, has been destroyed. With the support of the REPUBLIC, General Leia Organa leads a brave RESISTANCE. She is desperate to find her brother Luke and gain his help in restoring peace and justice to the galaxy. Leia has sent her most daring pilot on a secret mission to Jakku, where an old ally has discovered a clue to Luke’s whereabouts…."

Accessing data through R

Potential limitations of the preceding options:

  • Static data product
  • Run into some limitations with large files
  • If data download is happening outside of a script, limited reproducibility

Alternative Request data from dynamic databases

  • Data product need not be static (e.g. regularly updated database of all records in iNaturalist)
  • Queries are happening through systems designed for data, so better for large files
  • Reproducible data provenance, as the exact query can be recreated

Accesing databases with using an API

Application programming interfaces (APIs) are mechanisms for communicating between two computers

  • You use APIs on a daily basis
  • e.g. Weather app on your phone is connecting with a remote database through an API
  • Several types of APIs, we’ll focus on “REST”1 APIs.
  • APIs as an alternative to “GUIs” (Graphical user interfaces, where you point-and-click on options)
  • To interact with an API, we need a properly formatted http request.

Example

Accessing US Demographic data from the Census Bureau

Using R to access an API, with the httr and jsonlite packages.

# If needed, install two useful packages 
# pak::pak("httr")
# pak::pak("jsonlite")

library("tidyverse")
library("httr")
library("jsonlite")

# Path to US Census API
path <- 'https://api.census.gov/data/2018/acs/acs5'

# To use APIs, we need a good handle on the expected query parameters
# Here, we are building a query for the B19013_001E database
# and asking for data from all counties in state ID 55 (Wisconsin)
# State codes are enumerated at
# https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
query_params <- list('get' = 'NAME,B19013_001E', 
                     'for' = 'county:*',
                     'in' = 'state:55') 

# UNCOMMENT the following line to run the API search
# response <- GET(path, query = query_params)
# saveRDS(response, file = "../10-databases/mdresponse.rds")
response <- readRDS("mdresponse.rds")
response |>
 content(as = 'text') |> 
 fromJSON() |> 
  as_tibble()
## # A tibble: 73 × 4
##    V1                           V2          V3    V4    
##    <chr>                        <chr>       <chr> <chr> 
##  1 NAME                         B19013_001E state county
##  2 Iron County, Wisconsin       40801       55    051   
##  3 Clark County, Wisconsin      51872       55    019   
##  4 St. Croix County, Wisconsin  81124       55    109   
##  5 Oconto County, Wisconsin     57105       55    083   
##  6 Lincoln County, Wisconsin    56086       55    069   
##  7 Waupaca County, Wisconsin    57680       55    135   
##  8 Barron County, Wisconsin     50903       55    005   
##  9 Sawyer County, Wisconsin     44555       55    113   
## 10 Green Lake County, Wisconsin 53260       55    047   
## # ℹ 63 more rows

# We can now progress with this analysis as we choose.

Programmatically download data

# NOTE that this code block is not executed.
# Retrieve data for four states (ID 1, 12, 32, 55)
multi_state <- 
tibble(state_ids = sprintf("%02d", c(1, 12,32, 55))) |>
  mutate(path = "https://api.census.gov/data/2018/acs/acs5",
         query_params = map(state_ids, \(x) 
                     list('get' = 'NAME,B19013_001E', 
                          'for' = 'county:*',
                          'in' = paste0('state:',x)))) |> 
  mutate(response = map2(path, query_params,
                         \(x,y)
                         GET(x, query = y)))
# saveRDS(multi_state, file = "10-databases/multistate-database.rds")

Programmatically download data

multi_state <- readRDS("multistate-database.rds")

multi_state <-
  multi_state |> 
  mutate(data = map(response, \(x)
                    x |> 
                      content(as = "text") |> 
                      fromJSON() |> 
                      as_tibble() |> 
                      janitor::row_to_names(1))) |> 
  unnest(data) |> 
  mutate(state_name = str_extract(NAME, "\\s(\\w+)$")) 

Programmatically download data

multi_state |> 
  count(state_name)
## # A tibble: 4 × 2
##   state_name       n
##   <chr>        <int>
## 1 " Alabama"      67
## 2 " Florida"      67
## 3 " Nevada"       17
## 4 " Wisconsin"    72

multi_state |> 
  rename(medinc = B19013_001E) |> 
  mutate(medinc = as.numeric(medinc)) |> 
  ggplot(aes(x = medinc)) + 
  geom_density() + 
  facet_wrap(.~state_name)

Another example

  • GBIF has an incredible database for species occurrence data
  • Collates information from various sources - e.g. eBird, iNaturalist, and much more (see here) for complete list
  • Data are accessible through GBIF’s API
  • Different APIs for accessing occurrence data, retrieving individual records, etc.

Another example

  • Using GBIF to look for all plants in the family Melastomataceae recorded in Ecuador between 500-750 meters
  • See documentation for details on how to use this API.
path <- 'https://api.gbif.org/v1/occurrence/search'

# 
query_params <- list('country' = 'EC', 
                     'elevation' = '500,750',
                     'acceptedTaxonKey' = '6683') # See https://www.gbif.org/species/ to find taxonKey for your group

# UNCOMMENT the following line to run the API search
# response <- GET(path, query = query_params)
# saveRDS(response, file = "10-databases/melastome-database.rds")

View the output (restricted to twenty records for now)

readRDS(file = "melastome-database.rds") |>
# response |> 
 content(as = 'text') |> 
 fromJSON() |> 
  pluck("results") |> 
  as_tibble() |> 
  select(where(is.character)) |> #View()#colnames()
  select(12,27,33,40, 43) |> 
  knitr::kable() |> 
  kableExtra::kable_styling(font_size = 10) |> 
  # kableExtra::kable_paper("hover", full_width = F) 
  kableExtra::scroll_box(width = "1000px", height = "500px") 
acceptedScientificName datasetName country locality gbifID
Melastomataceae Tropicos Ecuador Along road from Lita to Baboso, along Río Baboso, near bridge. 1257919934
Melastomataceae Tropicos Ecuador Distrito Metropolitano de Quito, carretero Mashpi-Los Bancos. 1257738373
Melastomataceae Tropicos Ecuador Distrito Metropolitano de Quito, carretero Mashpi-Los Bancos. 4061104558
Melastomataceae Tropicos Ecuador Distrito Metropolitano de Quito, carretero Mashpi-Los Bancos. 4032067677
Melastomataceae Tropicos Ecuador Distrito Metropolitano de Quito, carretero Mashpi-Los Bancos. 4061087222
Melastomataceae Colección Cientifica Herbario Universidad Técnica del Norte Ecuador Lita 4525407940
Melastomataceae Vascular Ecuador Cristóbal Colón. Junto a Parcelas permanentes de Proyecto Ecomadera 4067891159
Melastomataceae Tropicos Ecuador Cristobal Colón. Junto a Parcelas Permanentes de Projecto ECOMADERA 1257937089
Melastomataceae Tropicos Ecuador Cristobal Colón. Junto a Parcelas Permanentes de Projecto ECOMADERA 4031742182
Melastomataceae Tropicos Ecuador Cordillera de Cutucú. Parroquia Yaupi. Centro Shuar Tumpaim. Sendero a lo largo del Río Satapa, afluyente del Río Yaupi. Bosque maduro en suelo aluvial. 1258048075
Melastomataceae Tropicos Ecuador Cordillera de Cutucú. Parroquia Yaupi. Centro Shuar Tumpaim. Sendero a lo largo del Río Satapa, afluyente del Río Yaupi. Bosque maduro en suelo aluvial. 4061178559
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 4032024280
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 4032014293
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 4032081305
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 4032127250
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Pendiente del Cerro Wisui. Bosque maduro no intervenido; árboles hasta 30 m. Suelo negro con piedras calizas. 4031737251
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Pendiente del Cerro Wisui. Bosque maduro no intervenido; árboles hasta 30 m. Suelo negro con piedras calizas. 4031644560
Melastomataceae Tropicos Ecuador Parroquia Macuma. Centro Shuar Wisui, en la base nororiental de la Cordillera de Cutucú. Bosque maduro, no intervenido, en la planada del Centro Wisui. Suelo arcilloso. 4031728384
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 1260844193
Melastomataceae Tropicos Ecuador Cordillera de Cutucú, vertiente nororiental. Centro Shuar Wisui, parroquia Macuma. Bosque maduro no intervenido, en peqeueñas colinas; árboles hasta 30 m, suelo amarillento. 1260844198

Guidelines for APIs

  • Well-maintained APIs have documentation on query parameters - get familiar with these!
  • e.g. US EPA API guidlines here


  • Many APIs require that you sign up for a key before making requests (e.g. for downloading records from GBIF)
  • This to ensure that API servers aren’t “spammed” by requests

R Packages for working with APIs

  • You don’t always have to “Hand-code” API queries

  • In some cases, others have built packages to facilitate API searches.

  • e.g. tidycensus for accessing US Census API; rgbif for the GBIF (Global Biodiversity Information Facility) API

  • Browse a large list of packages for accessing eco-related APIs through R here

Exercise

Work through the rGBIF tutorial, which will introduce you to the use of various useful packages for biodiversity informatics in R.