How do we deal with datasets that are larger than our computer's memory? And do we still need to learn Structured Query Language (SQL) to create and manage a database?
As an incipient data analyst, one of my first major challenges was to build and manage a spatial database using PostGIS, an open-source extension that adds support for geographic objects to PostgreSQL relational databases. I was given several text files on a hard drive that weighed approximately 10 GB each! My first reaction was to double-click on the first text file that I saw… but this was clearly not going to work, because my computer crashed every single time. I read up on querying, indexing, pg_dumping, and vacuuming, but these concepts were so foreign to me that I didn't even know what to ask Google -- a feeling that almost anyone starting to work with a new data language can relate to.
All I knew was that I had to create a replicable, automated process for building a data warehouse, and that I would have to optimize the import process so that my computer would not crash. After spending several weeks on Stack Overflow, I discovered that the most efficient way to import the data into the spatial database was the \copy command. As shown in figure 1, through many iterations and failures, I found that the most efficient workflow was 1) creating a spatial database, 2) creating the spatial extensions, 3) importing the data with the \copy command, using the split command to partition large files and GNU parallel to run the imports concurrently, 4) automating the process of cleaning the tables, and finally, 5) creating a spatial environment by specifying the spatial reference system, adding the geometry type, and creating a spatial index.
Figure 1: Methodology for building an automated and replicable PostGIS data warehouse for U.S. business establishments.
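For the database-side steps (1, 2, and 5), the underlying commands are plain SQL, and they can be issued from R through DBI. The snippet below is only a minimal sketch under assumed names: the database (business_db), table (establishments), and coordinate columns (lon, lat) are hypothetical, and EPSG:4326 is simply a common choice of spatial reference system.

library(DBI)
# Hypothetical connection to the target PostgreSQL database
con <- dbConnect(RPostgres::Postgres(), dbname = "business_db")
# Step 2: enable the spatial extension
dbExecute(con, "CREATE EXTENSION IF NOT EXISTS postgis;")
# Step 5: add a geometry column with an explicit spatial reference system,
# populate it from longitude/latitude columns, and build a spatial index
dbExecute(con, "ALTER TABLE establishments ADD COLUMN geom geometry(Point, 4326);")
dbExecute(con, "UPDATE establishments SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);")
dbExecute(con, "CREATE INDEX establishments_geom_idx ON establishments USING GIST (geom);")
# Reclaim space and refresh planner statistics after the bulk load
# (VACUUM must run outside an explicit transaction)
dbExecute(con, "VACUUM ANALYZE establishments;")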
From this endeavor, I learned that the practical limit for loading a single file into PostgreSQL was approximately 11 GB, and that using the split command on larger files improved import performance. I also found that parallelizing the import improved performance by roughly 4x, using 8 of the CPU cores of the server I was working with. In particular, GNU parallel gave me better control and was more efficient than xargs on the server, since it keeps the CPUs busy by launching a new process as soon as another one finishes, which also allowed me to clean the tables more efficiently.
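To give a flavor of the split-and-parallelize import (step 3), here is a rough sketch of how it might be driven from R with system(). It assumes a Unix-like machine with GNU coreutils, GNU parallel, and psql on the PATH; the file, database, and table names are again made up for illustration.

# Partition the ~10 GB extract into ~1 GB pieces without breaking lines
system("split --line-bytes=1G establishments.txt part_")
# Run one \copy per piece, keeping 8 jobs alive so the cores stay busy;
# GNU parallel launches a new job as soon as a previous one finishes
import_cmd <- paste0(
  "ls part_* | parallel -j 8 ",
  shQuote("psql -d business_db -c '\\copy establishments FROM {} WITH (FORMAT text)'")
)
system(import_cmd)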
Yet it turns out that dbplyr, a relatively recent (2017) R package from the tidyverse stack, works with SQL in the backend and converts dplyr code into SQL. Although dplyr allows us to import, clean, and analyze data in R, it does not allow us to modify and manage a database. dbplyr, however, lets us connect to any remote or local database and manage it without writing any SQL! There is no need to translate our syntax by hand because dbplyr does it for us: it translates our R code into SQL, submits it to the database, and returns the query results as an R data frame. This makes it easy to use our well-known dplyr pipes to work with the tables: select, filter, mutate, joins, you name the function! In the example below, taken from Data Carpentry, we query a table from a database using dplyr verbs and read it as a data frame instead of writing SQL.
Using SQL code:
tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))
Using dbplyr code:
surveys <- tbl(mammals, "surveys")
surveys %>%
  select(year, species_id, plot_id)
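We can even peek at the SQL that dbplyr generates for us with show_query(); the exact quoting and layout of the output depend on the backend and dbplyr version.

# assumes library(dplyr) and library(dbplyr) are loaded, and surveys from above
surveys %>%
  select(year, species_id, plot_id) %>%
  show_query()
# prints something like: SELECT year, species_id, plot_id FROM surveys
# (identifier quoting varies by backend)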
The reason dbplyr handles large amounts of data so well is that it takes a "lazy" approach to reading tables from the database: it translates dplyr verbs into SQL queries but defers running them, letting the database execute them in an optimized way. The lazy approach never retrieves entire tables from the database unless we explicitly ask for the full query results with collect(), at which point we can work on them without being connected to the database. As a side note, dtplyr (the data.table back-end for dplyr) has also made our workflows more efficient, translating dplyr verbs into equivalent data.table calls backed by highly optimized C code (and data.table's fread is similarly efficient for reading large files). As with dbplyr, dtplyr is lazy, performing no computation until we request the result with as.data.table(), as.data.frame(), or as_tibble().
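As a small illustration of the lazy approach, reusing the surveys table from above (the filter condition is just an example), nothing is pulled into R until we call collect(), and dtplyr behaves analogously with lazy_dt():

# Builds a query but retrieves no rows from the database yet
recent <- surveys %>%
  filter(year > 2000) %>%
  select(year, species_id, plot_id)
# Only now is the query executed on the database and returned as a tibble
recent_df <- collect(recent)

library(dtplyr)
# Wrap a data frame for lazy data.table evaluation (names are illustrative)
recent_dt <- lazy_dt(recent_df)
recent_dt %>%
  count(species_id) %>%
  as_tibble()    # the data.table computation happens only at this step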
When it comes to installing dbplyr, keep in mind that it automatically installs DBI (the common database interface), but we must install a specific backend package for each database we want to connect to. For example, to connect to MySQL, Postgres, SQLite, or Google BigQuery, we would also have to install the "RMariaDB", "RPostgres", "RSQLite", or "bigrquery" package, respectively.
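For instance, a minimal installation-and-connection sketch might look like the following; the file name and connection details are placeholders rather than real credentials.

install.packages(c("dbplyr", "RSQLite", "RPostgres"))
library(DBI)
# A local SQLite file, like the Data Carpentry portal database
mammals <- dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite")
# A remote PostgreSQL/PostGIS database
# con <- dbConnect(RPostgres::Postgres(), dbname = "business_db",
#                  host = "localhost", port = 5432,
#                  user = "analyst", password = rstudioapi::askForPassword())
dbListTables(mammals)    # see which tables dplyr/dbplyr can work with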
After all the hours spent on Stack Overflow, I really came to enjoy managing data with SQL. Although this R package means you don't have to become an SQL expert, learning SQL is extremely useful and a valuable skill in any data scientist's toolkit. It will inevitably help us debug and deepen our understanding of data analytics, even if we are not software developers. For those already acquainted with dplyr syntax, this package is a great starting point for learning more about databases and how to manage large-scale data whenever it doesn't fit in our local environment!