Mapping Census Data with tidycensus

November 6, 2023

Data from the U.S. Census Bureau are widely used by researchers and policymakers. Although a variety of census data products are available online, there are substantial barriers to efficiently downloading and using those datasets for research purposes. Luckily for us, there is a much easier way to download, process, and even quickly map Census data, all within the R software environment! While this blog post is aimed toward current users of R, researchers with less experience in using R, or programming languages in general, can also view this as a potential shortcut for more quickly downloading Census data.

Why Use Census Data?

The U.S. Census Bureau provides the most complete and comprehensive array of information on the U.S. population. In addition to the full count of the U.S. population conducted every ten years (the most recent of which took place in 2020), the U.S. Census Bureau also administers a number of surveys that provide crucial demographic information. These types of data are used across many different domains, determining the shape of U.S. congressional districts, generating standards and funding formulas for government programs, and measuring phenomena such as racial segregation and neighborhood change. With so many possible applications of Census data within both policy and research, an efficient and intuitive method for downloading and analyzing these data is essential.

Introducing… tidycensus!

tidycensus is a software package for the R statistical software, created by Professor Kyle Walker at Texas Christian University. The package provides a shortcut to the ordinarily tedious process of downloading Census data by allowing the user to directly query the U.S. Census Bureau’s API (Application Programming Interface), thereby removing the need to go to the census website itself and download individual spreadsheets. An API provides a protocol for interactions between computers or applications, and is often used in the context of online systems to query online databases. The Census Bureau’s web API allows registered users to make custom requests to its public-facing database of demographic and economic estimates, downloading subsets of census data directly onto their own computers. Whereas most API tools have a somewhat steep learning curve, tidycensus may be used by anyone familiar with the fundamentals of the R programming language.
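To make the idea of "querying an API" a little more concrete, here is a rough sketch of the kind of web address a Census API request boils down to. It only builds the address as text and does not contact the Census Bureau. The exact request tidycensus assembles internally may differ; the key shown is a placeholder, and the variable code and state/county codes are the ones used later in this post.

```r
# Pieces of a Census Data API request for the 2019 5-year ACS.
base_url  <- "https://api.census.gov/data/2019/acs/acs5"

# In the raw API, "E" marks an estimate and "M" its margin of error.
variables <- c("NAME", "B19013_001E", "B19013_001M")

query <- c(
  "get" = paste(variables, collapse = ","),
  "for" = "tract:*",               # every census tract...
  "in"  = "state:06 county:001",   # ...in Alameda County, California
  "key" = "YOUR_API_KEY_HERE"      # placeholder, not a real key
)

# Join name=value pairs, encoding spaces so the address is valid.
pairs <- paste0(names(query), "=", gsub(" ", "%20", query, fixed = TRUE))
request_url <- paste0(base_url, "?", paste(pairs, collapse = "&"))

print(request_url)
```

tidycensus hides all of this: you describe what you want with ordinary R arguments, and the package handles the request and tidies the response into a table.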

Step 1: Request a Census API Key

Before you can use the tidycensus package, you first need to request an API key from the U.S. Census Bureau. An API key is a unique identifier that is sent along with every data request made through tidycensus, which allows the Census Bureau to verify who is requesting its data. The Census API key is entirely free and may be obtained by entering your organization and email address on the Census website. Once you have received your API key, save it somewhere easily accessible on your computer. I recommend putting it in a text file saved in your home directory, or wherever you store important files.

Step 2: Install tidycensus

Once you have your API key, you are ready to start using the tidycensus package! To install tidycensus, open up RStudio and run the following command: 

install.packages("tidycensus")

You will only need to run this command once on your computer. Once it has been installed, that installation will remain permanently available. Next, you can load the package into your R session by running:

library(tidycensus)

Finally, you will need to load in your API key, which will allow you to download multiple Census datasets without having to supply the key each time. This can be accomplished using the function census_api_key(). There are two methods for doing this. The easiest strategy is to simply copy your API key into the census_api_key() call. However, if you are sharing your code publicly, you may not want others to know your key. For that reason, you can also save your key in a separate text file, and read in that text file each time you want to run your code. Both options are shown here:

## Option 1: Copy key directly

census_api_key("YOUR API KEY HERE")


## Option 2: Load key from elsewhere

census_api_key(readLines("path/to/file", n = 1))  ## readLines() returns the key as plain text
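Here is a minimal sketch of the file-based approach using only base R. The temporary file and placeholder key are stand-ins so the example runs anywhere; in practice you would point readLines() at your own key file. The final lines are left as comments because they need tidycensus loaded and a real key.

```r
# Stand-in for a real key file: write a placeholder key to a temporary file.
key_file <- tempfile(fileext = ".txt")
writeLines("  YOUR_API_KEY_HERE  ", key_file)

# Read the first line and strip stray spaces so the key is one clean string.
api_key <- trimws(readLines(key_file, n = 1, warn = FALSE))
print(api_key)

# With tidycensus loaded, the key could then be registered for the session:
# census_api_key(api_key)
#
# census_api_key() also accepts install = TRUE, which saves the key to your
# .Renviron file so that future R sessions can find it automatically:
# census_api_key(api_key, install = TRUE)
```

Trimming whitespace is a small safeguard: a stray space or blank at the end of the file is an easy way to end up with a key the API rejects.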

Step 3: Explore Census Variables

We’re almost ready to start downloading Census data! First, we need to figure out how to identify the variable(s) we want to download. tidycensus provides a handy tool to explore the available variables and the codes we can use to identify them: the load_variables() function. We will provide this function with the year of data we’re interested in and which Census dataset we want to pull from. 

tidycensus currently supports data from two primary sources: the decennial census, and the American Community Survey (ACS). The decennial census is collected every 10 years and includes information about a limited number of characteristics for the entire U.S. population, including age, sex, and race. The ACS is a survey conducted every year on a sample of the U.S. population and asks a wider variety of questions of its respondents. However, because each year's survey only includes a small proportion of the population, researchers often use estimates derived from five years of survey results aggregated together.

Let’s say we’re interested in understanding the incomes of different neighborhoods here in Alameda County. One common way to answer this question is to look at the median household income. For this example, we will look at the median household income using the 2019 5-year ACS estimates. In this case, we can use the load_variables() function, inputting our year of interest (2019) and the survey identifier "acs5" to tell it we want the 5-year estimates from the American Community Survey (ACS). Several other survey identifier options are available, including "acs1" for single-year ACS estimates, "acs5/profile" and "acs5/subject" for special collections of variables provided by the U.S. Census Bureau, and "sf1" and "sf3" for different summary files from the decennial census.

View(load_variables(year = 2019, dataset = "acs5"))

This produces a table that we can then explore to find our variable(s) of interest! In this case, we’re interested in median household income. When we search for the phrase "median household income" in the search bar on the top right, we narrow the total number of variables from the original 27,040 to 33. We will use the variable from the first row, which uses the variable code "B19013_001" and estimates the median household income in 2019 inflation-adjusted dollars for the entire population. There are many other potential options, however, including the median household income by race and household size. Depending on your interest, you could download one or multiple of these variables!
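If you prefer to search in code rather than in the viewer, the same filtering can be done with base R. The sketch below uses a tiny hand-typed stand-in for the table (three real variable codes, with labels and concepts paraphrased rather than copied), since the real one requires a call to the Census API. It assumes the table has "name", "label", and "concept" columns, which is how load_variables() output is typically structured.

```r
# Hand-typed stand-in for a few rows of load_variables(2019, "acs5").
# The wording of label/concept is approximate and for illustration only.
vars <- data.frame(
  name    = c("B01003_001", "B19013_001", "B19013A_001"),
  label   = c("Estimate!!Total",
              "Estimate!!Median household income (2019 inflation-adjusted dollars)",
              "Estimate!!Median household income (2019 inflation-adjusted dollars)"),
  concept = c("TOTAL POPULATION",
              "MEDIAN HOUSEHOLD INCOME",
              "MEDIAN HOUSEHOLD INCOME (WHITE ALONE HOUSEHOLDER)"),
  stringsAsFactors = FALSE
)

# Keep rows whose concept mentions the phrase, ignoring upper/lower case.
matches <- vars[grepl("median household income", vars$concept,
                      ignore.case = TRUE), ]

print(matches$name)
```

With the real table you would replace the stand-in with vars <- load_variables(year = 2019, dataset = "acs5") and keep the filtering line as it is.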

Step 4: Download Census Data

Finally, we’re ready to download our Census data! We will accomplish this using the function get_acs(), which pulls ACS data directly from the Census API. If we were interested in downloading data from the decennial census, we would instead use the function get_decennial(). While get_acs() accepts a variety of arguments, four essential questions must be answered:

  1. Which tables or variables? Use the "variables" argument to specify the variables that you are interested in downloading. This could be a single variable string, or multiple variable strings combined into a character vector, as in: c("variable1", "variable2").

  2. What is the scale of geography? Use the "geography" argument to specify the scale at which you want your data. Options include "state", "county", "place", "tract", and more!

  3. Which state(s) and/or counties? If you are interested in multiple states, you may pass a vector of state names to the "state" argument. If you are interested in multiple counties, you may likewise pass a vector of county names to the "county" argument. However, if you are including multiple states you may not specify any counties, and if you are including multiple counties you can only specify one state.

  4. Which year of data? Decennial Census data are available only for 2000, 2010, and 2020. American Community Survey (ACS) data are available for each year between 2005 and 2022, with a few important exceptions: five-year estimates are only available starting in 2009, and one-year estimates from 2020 are not available due to the Covid-19 pandemic.
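The availability rules for ACS years are easy to misremember, so here is a small sketch that encodes them as described above. Note that acs_year_available() is a hypothetical helper written for this post, not a tidycensus function, and the year ranges simply mirror the text (they will shift as the Census Bureau publishes new releases).

```r
# Hypothetical helper: does an ACS survey/year combination exist,
# according to the rules described in this post?
acs_year_available <- function(year, survey = c("acs5", "acs1")) {
  survey <- match.arg(survey)
  if (survey == "acs5") {
    # Five-year estimates start with the 2009 release.
    year >= 2009 & year <= 2022
  } else {
    # One-year estimates start in 2005; 2020 was not released.
    year >= 2005 & year <= 2022 & year != 2020
  }
}

print(acs_year_available(2019, "acs5"))
print(acs_year_available(2020, "acs1"))
print(acs_year_available(2005:2009, "acs5"))
```

A check like this can be handy before looping over many years, since a request for a year that does not exist will simply fail.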

The example below shows how you can answer each of these questions for our query:

  1. We are interested in the variable for median household income, which we now know is represented by the variable "B19013_001". 
  2. We are interested in data at the geographic level of the "census tract", which we will use to represent neighborhoods. 
  3. We want data for Alameda County in California.
  4. We are specifically interested in data from 2019.

We will also specify that we want to include "geometry" information, which will allow us to create a map of our data later on!

df <- get_acs(variables = "B19013_001",
              geography = "tract",
              state = "California",
              county = "Alameda",
              year = 2019,
              geometry = TRUE)

With all of these parameters specified, we can download the requested data from the Census API. Note that requests for large amounts of data, such as requests with multiple variables or many different geographies, may take a little while to download. Including the geometry also makes the download process take substantially longer; if you don’t need spatial information, you may set geometry to FALSE.
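One way to soften the cost of geometry downloads is caching. tidycensus relies on the tigris package to fetch boundary files, and tigris checks an R option named tigris_use_cache; when it is TRUE, shapes downloaded once are reused on later requests. The sketch below only sets and checks that option. The get_acs() call is left as a comment because it needs an API key and an internet connection.

```r
# Ask tigris to keep downloaded boundary files and reuse them later.
options(tigris_use_cache = TRUE)

# Confirm the option is set for this R session.
print(getOption("tigris_use_cache"))

# If you only need the numbers and not a map, skip the shapes entirely:
# df_table <- get_acs(variables = "B19013_001",
#                     geography = "tract",
#                     state = "California",
#                     county = "Alameda",
#                     year = 2019,
#                     geometry = FALSE)
```

The option lasts only for the current session, so it is usually placed near the top of a script, right after the library() calls.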

Step 5: Explore and Map Your Data!

Let’s take a look at our newly downloaded census data table!

View(df)

Our table contains six columns:

  • GEOID: a standard identifier, made up of digits, for every single geography used by the census. It is stored as text rather than as a number so that leading zeros are preserved. In this case, the first two digits indicate the code for the state ("06" = California), the next three digits indicate the code for the county ("001" = Alameda County), and the final six digits represent the code for the specific census tract.

  • NAME: The official name of the census tract, including its numeric identifier and the county and state in which it is located.

  • variable: The code for the census variable represented in this row. In our table this will be the same for every row, but it would be different if we specified more than one variable in our API request.

  • estimate: The actual value of our variable(s) – in this case, the estimated median household income in 2019 dollars.

  • moe: The "margin of error" for the census estimate, indicating the uncertainty surrounding the estimate. ACS margins of error are reported at the 90 percent confidence level. In the first row of our table, for example, the census estimates a median household income of $110,761, but also states that the likely range of values is +/- $21,966 of that value, or between $88,795 and $132,727. The uncertainty surrounding these estimates increases for geographies with smaller populations, which means that estimates for states and counties are often much more reliable than those at smaller scales.

  • geometry: This column, which is only included if we set "geometry = TRUE", provides information about the shape of the geography, which will allow us to create a map! In this case, the geometry is stored as a long list of coordinates tracing the outline of each tract.
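Two of these columns are worth unpacking in code. The sketch below splits a tract GEOID into its state, county, and tract parts, and then turns an estimate and its margin of error into a range. The GEOID is an illustrative 11-digit value chosen to match the Alameda County pattern described above; the estimate and margin of error are the first-row figures quoted in the "moe" bullet.

```r
# Illustrative tract GEOID, kept as text so the leading zero survives.
geoid <- "06001400100"

state_fips  <- substr(geoid, 1, 2)    # first two digits: state
county_fips <- substr(geoid, 3, 5)    # next three digits: county
tract_code  <- substr(geoid, 6, 11)   # final six digits: tract

print(c(state = state_fips, county = county_fips, tract = tract_code))

# Range implied by an estimate and its margin of error.
estimate <- 110761
moe      <- 21966

lower <- estimate - moe   # 88795
upper <- estimate + moe   # 132727

print(c(lower = lower, upper = upper))
```

Converting a GEOID to a number (for example with as.numeric()) would drop the leading zero and break joins with other census tables, which is why it is best left as text.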

Finally, we are able to map our data! There are several different ways to create maps of spatial data in R, but one of the most flexible and intuitive approaches involves the use of several additional packages: ggplot2, sf, and scales. These packages may be installed using the same install.packages() function used above; once installed, they can be loaded into your R environment using library(). Each package's documentation has more information about its format and syntax. The code below loads all three packages and draws the map:

library(ggplot2)  ## Package for creating plots
library(sf)       ## Package for interpreting spatial data
library(scales)   ## Package for formatting legend labels (in dollar amounts)

ggplot(data = df) +
   geom_sf(aes(fill = estimate)) +
   scale_fill_viridis_c(name = "Median Household Income", labels = dollar) +
   theme_void()

And with that, we have created a map showing median household income by neighborhood in Alameda County! Already, we can see very distinct patterns of areas where incomes are lower (dark blue) and higher (green and yellow). These results could be combined with additional census variables or with other external datasets (such as school quality or environmental hazards) in order to answer many different types of questions!