A Brief Introduction to Cloud Native Approaches for Big Data Analysis

March 20, 2023

Monitoring technologies (e.g., satellites, smart phones, acoustic recorders) are creating vast amounts of data about our earth every day. These data hold promise to provide global insights on everything from biodiversity patterns to vegetation change at increasingly fine spatial and temporal resolution. Leveraging this information often requires us to work with data that is too big to fit in our computer's "working memory" (RAM) or even to download to our computer's hard drive. Some data - like high resolution remote sensing imagery - might be too large to download even when we have access to big compute infrastructure.

In this post, I will walk through some tools, terms, and examples to get started with cloud native workflows. Cloud native workflows allow us to remotely access and query large data from online resources or web services (i.e., the “cloud”), all while skipping the need to download large files. We'll touch on a couple of common data file formats (e.g., Parquet), object storage services (e.g., S3), cataloging structures (e.g., STAC), and tools (e.g., Apache Arrow and GDAL's /vsicurl/) for querying these online data. We'll focus on an example of accessing big tabular data but will also touch on spatial data. I’ll be sharing resources along the way for deeper dives on each of the topics and tools.

Example code will be provided in R and case studies focused on environmental data, but these concepts are not specific to a programming language or research domain. I hope that this post provides a starting point for anyone interested in working with big data using cloud native approaches!

Tabular data: Querying files from cloud storage with Apache Arrow

Example: Plotting billions of biodiversity observations through time

The Global Biodiversity Information Facility (GBIF) synthesizes biodiversity data derived from a variety of sources, ranging from museum specimens to smartphone photos recorded on participatory science platforms. The database includes over a billion species occurrence records - probably not too big to download onto your machine's hard drive, but likely too big to read into your working memory. Downloading the whole database and then figuring out how to work with it is truly a hassle (you can trust me on that, or you can try it yourself!), so we will walk through a cloud native approach to collect and download only the information we need from this database.

Step 1. Finding and accessing the data 

There are snapshots (i.e., versions) of the entire GBIF database on Amazon Web Services (AWS) S3. You can check them out and pick a version here. A lot of data is stored on AWS or other S3 compatible platforms (not just observation records of plants and animals), so the following workflow is not specific to GBIF.

What is S3? Simple Storage Service (S3) is object storage offered through a web service interface. S3 compatibility underpins the cloud native workflows we are working towards here: cloud native applications use the S3 API to communicate with object storage. The GBIF snapshot we leverage is stored on Amazon Web Services (AWS) S3, but there are other S3 compatible object storage platforms (e.g., MinIO) that you can query with the same methods we will explore below.
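As a quick, concrete illustration of talking to S3 object storage from R, the sketch below uses the arrow package (introduced properly in the next section) to connect to the public GBIF bucket and list the snapshot folders. The bucket name is the one from the GBIF snapshot used below; treat the anonymous-access setting and the exact prefix layout as assumptions worth checking against what the listing returns.

library(arrow)

# Connect to the public GBIF bucket on AWS S3 (no credentials required)
bucket <- s3_bucket("gbif-open-data-us-east-1", anonymous = TRUE)

# List the contents under the "occurrence" prefix -- each folder is a snapshot
bucket$ls("occurrence")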

First, we will grab the S3 Uniform Resource Identifier (URI) for a recent GBIF snapshot (note the s3:// prefix at the start of the path):

gbif_snapshot <- "s3://gbif-open-data-us-east-1/occurrence/2022-11-01/occurrence.parquet"

You might notice that this is a .parquet file. 

What is a Parquet file? Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. You can read all about it here. Put simply, it is an efficient way to store tabular data - what you might otherwise store in a .csv file.
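If you want a feel for the format before heading to the cloud, here is a minimal local sketch using arrow to write a data frame to Parquet and read it back (the data frame and file name are just illustrative):

library(arrow)

# Write a regular data frame to a Parquet file...
write_parquet(mtcars, "mtcars.parquet")

# ...and read it back; columns and types round-trip intact
read_parquet("mtcars.parquet")

# Because Parquet is column-oriented, we can read just the columns we need
read_parquet("mtcars.parquet", col_select = c("mpg", "cyl"))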

We'll use the R package arrow, which supports reading datasets from cloud storage without having to download them (allowing large queries to run locally by only downloading the parts of the dataset that are necessary). You can read more about Apache Arrow if you are interested in digging deeper into other applications (or in interfacing with arrow from other programming languages like Python). Note that arrow doesn't require the data to be a Parquet file - it also works with "csv", "text", and "feather" file formats.
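For example, the same open_dataset() call introduced just below can point at a folder of CSV files by setting the format argument (the path here is hypothetical):

# Open a directory of CSV files as a single queryable dataset (hypothetical path)
csv_ds <- open_dataset("data/observations/", format = "csv")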

library(arrow)
library(tidyverse)

The open_dataset function in the arrow package connects us to the online parquet file (in this case, the GBIF snapshot from above).

Notice that this doesn't load the entire dataset into your computer's memory, but, conveniently, we can still take a look at the variable names (and even take a glimpse at a small subset of the data!)

db <- open_dataset(gbif_snapshot)

db # take a look at the variables we can query
glimpse(db) # see what a subset of the data looks like

Step 2. Querying the data

The arrow package provides a dplyr interface to arrow datasets. We’ll use this interface to perform the query. Note that this approach is a bit limited, since only a subset of dplyr verbs are available to query arrow dataset objects (there is a nice resource here). Verbs like filter, group_by, summarise, select are all supported.
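For instance, if you reach for a verb that arrow can't translate (slice_sample() is used here purely as an illustration; to my knowledge it isn't translated), you can collect() a filtered subset first and then continue with regular dplyr in memory:

# Unsupported verbs can run after collect() brings a filtered subset into memory
db |>
  filter(countrycode == "US", year == 2020) |>
  collect() |>
  slice_sample(n = 100)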

So, let’s filter GBIF to all observations in the United States and count the observations in each kingdom per year. We use the collect() function to pull the result of our query into our R session.

gbif_US <- db |>
  filter(countrycode == "US") |>
  group_by(kingdom, year) |>
  count() |>
  collect()

Within minutes we can plot the number of observations by kingdom through time! 

gbif_US |>
  drop_na() |>
  filter(year > 1990) |>
  ggplot(aes(x = year,
             y = log(n),
             color = kingdom)) +
  geom_line() +
  theme_bw()

There are many other applications of arrow that we won't get into here (e.g., zero-copy R and Python data sharing). Check out this cheat sheet if you are interested in exploring more!

Raster data: STAC and vsicurl

Cloud native workflows are not limited to querying tabular data (e.g., Parquet, CSV files) as explored above; they can also be helpful for working with spatiotemporal data. We'll focus on reading and querying data using the SpatioTemporal Asset Catalog (STAC). STAC is simply a common language for describing geospatial information so that it can more easily be worked with, indexed, and discovered (more info here).

A lot of spatiotemporal data is indexed by STAC -- for example, Microsoft's Planetary Computer data catalog includes petabytes of environmental monitoring data indexed in STAC format.

We’ll use the rstac library to make requests to the Planetary Computer's STAC API (and, like our tabular example, similar workflows are available in Python).

library(tidyverse)
library(rstac)
library(terra)
library(sf)

Now, we search for data available on the Planetary Computer!

We can set a bounding box for our data search - let's say, the San Francisco Bay Area!
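The search below assumes we have a STAC connection object (s_obj) and that bounding box (bbox) in hand. Here is a minimal sketch of how they might be created; the Planetary Computer STAC endpoint is the real one, but the coordinates are just an illustrative approximation of the Bay Area:

# Connect to the Planetary Computer STAC API
s_obj <- stac("https://planetarycomputer.microsoft.com/api/stac/v1")

# Rough bounding box for the San Francisco Bay Area (xmin, ymin, xmax, ymax)
bbox <- c(-123.0, 37.2, -121.6, 38.2)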

And we can take a look at a given "collection" of data (e.g., Landsat data):

it_obj <- s_obj |>
  stac_search(collections = "landsat-c2-l2",
              bbox = bbox) |>
  get_request() |>
  items_sign(sign_fn = sign_planetary_computer())

Instead of downloading the raster, we prepend the /vsicurl/ prefix (GDAL's virtual file system for reading data over HTTP) to the asset URL from the search above and pass it directly to the rast() function from the terra package, which reads in the raster.

url <- paste0("/vsicurl/", it_obj$features[[1]]$assets$blue$href)
data <- rast(url)
plot(data)

We can do computation over the network too - for example, calculating a vegetation index (NDVI) from a subset of bands in Landsat imagery, so that only the results end up in memory.
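Here is a minimal sketch of that idea, assuming the red and near-infrared assets in this collection are named "red" and "nir08" (worth confirming against the item's asset list before running):

# Read the red and near-infrared bands over the network
red <- rast(paste0("/vsicurl/", it_obj$features[[1]]$assets$red$href))
nir <- rast(paste0("/vsicurl/", it_obj$features[[1]]$assets$nir08$href))

# NDVI = (NIR - red) / (NIR + red)
ndvi <- (nir - red) / (nir + red)
plot(ndvi)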

Conclusions

Cloud native workflows can be incredibly convenient and powerful when working with large data. Here, we walked through cloud native workflows for both tabular and spatial data, using Apache Arrow and GDAL's /vsicurl/. This is a very brief introduction, but I hope the links provided help jumpstart your ability to use these tools in your own work!