Exploring Population Data with IPUMS
Last month, demographer and historian Steve Ruggles was awarded a prestigious MacArthur Foundation Fellowship for his work developing IPUMS, a harmonized database of individual and family responses to large-scale domestic and international surveys. With some samples going as far back as the 18th century, IPUMS can offer key insights into changing demographics, norms, and decision-making over time. In this post, I’ll first introduce IPUMS and give an overview of the data available. Next, I’ll explain how to gather, download, and process your data. I’ll end with an in-depth example for R users.
Note: IPUMS was originally called “Integrated Public Use Microdata Series,” but adopted the acronym when some restricted and aggregated data were added to the database. The vast majority of the IPUMS data are still publicly available microdata.
Opportunities for Researchers
IPUMS offers researchers an opportunity to download linked data from otherwise decentralized government surveys, both in the United States and internationally. Each sample joins survey responses that were previously stored in different formats, linking them at the appropriate level of observation. For example, the USA sample contains detailed de-identified demographic information merged together from 16 federal surveys.
Given the richness of its data, IPUMS gives researchers opportunities to perform micro-, meso-, and macro-level analyses of individuals, families, and communities. In my own work as a labor researcher, I draw on the Current Population Survey (CPS) sample, which links monthly U.S. Census data to survey responses from the Bureau of Labor Statistics. Here, the inclusion of employment variables enables me to connect repeated cross-sections of individual demographic and financial data to larger-scale changes in occupational working conditions over a 40-year period. Other research has investigated state-level schooling requirements and student outcomes, gendered care differences within households, and domestic migration differences across 30 Asian countries.
Gathering your Data
IPUMS stores its collections in separate databases depending on which surveys can be linked. For example, the IPUMS USA sample harmonizes Census data with the American Community Survey, while the IPUMS CPS sample merges a different sample of individuals. As such, each sample is its own self-contained data set.
Using the IPUMS CPS sample as an example, this section walks step by step through downloading the relevant data from the site and loading it into your software program of choice. It ends with an example for R users.
Navigating the IPUMS Database
To prepare your data on the website, follow the four steps below. Before you start, decide which variables and time periods you want to include.
First, create a free account on the IPUMS website to gain access to the IPUMS samples.
Next, select your appropriate sample.
Select relevant time periods and variables (refer to documentation for more information on particular variables). Your selections should be reflected in the “Data Cart” at the upper right of the page.
Create your custom data extract. Depending on the size of your data, it might take some time for the database to compile your extract.
Downloading your Data
Finally, download both your raw data extract (“Download.DAT”) and the “DDI” file, which contains documentation for your data set. Your software program will use the DDI file to attach variable names and values to your data extract.
IPUMS only produces raw text files, but fortunately provides accompanying documentation for R, SAS, Stata, and SPSS users. The next section will walk through this process for R in greater detail, but it will look slightly different for other software programs.
Note: As is the case with most large-scale records files, the data may still require manipulation and cleaning before analysis. Beyond data wrangling, it is also important to weight your observations correctly. IPUMS provides the weights and pre-selects them as part of any data extract. The documentation for each weighting variable explains when you should use it.
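To see why weighting matters, here is a minimal sketch in base R. The four-row data frame is made up for illustration, and EMPLOYED is a hypothetical indicator rather than an actual IPUMS variable. WTFINL is used as the weight name because it is the final basic weight in IPUMS CPS, but check the documentation to confirm which weight fits your own extract.

```r
# Toy data: made-up values for illustration only, not real CPS records.
# EMPLOYED is a hypothetical 0/1 indicator; WTFINL stands in for the
# person-level weight delivered with a CPS extract.
toy <- data.frame(
  EMPLOYED = c(1, 0, 1, 1),
  WTFINL   = c(1500, 3000, 500, 1000)
)

# Unweighted share: treats every respondent as representing one person.
unweighted_share <- mean(toy$EMPLOYED)

# Weighted share: each respondent counts in proportion to the number of
# people in the population they represent.
weighted_share <- weighted.mean(toy$EMPLOYED, w = toy$WTFINL)

print(c(unweighted = unweighted_share, weighted = weighted_share))
```

In this toy example the unweighted share is 0.75 while the weighted share is 0.5, because the one non-employed respondent carries half of the total weight. Standard errors for survey data need more care than this and are outside the scope of the sketch.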
Extended Example: Reading Files into R
After following the steps above, you should have downloaded two files: one with a .dat.gz and another with a .xml extension. You will need to unzip your compressed .dat.gz file, which contains the raw data. Do not change the names of your files.
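If you would rather unzip the file from within R than with your operating system's tools, the sketch below shows one way to do it using only base R. The helper name decompress_gz is my own (hypothetical), and the demo at the bottom builds a tiny stand-in file in a temporary folder so that the code runs without a real extract.

```r
# Hypothetical helper: decompress a .gz file next to the original,
# dropping the ".gz" suffix and leaving the rest of the name unchanged.
decompress_gz <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  con_in <- gzfile(gz_path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)

  # Copy in chunks so a large extract is never held in memory all at once.
  repeat {
    chunk <- readBin(con_in, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, con_out)
  }
  invisible(out_path)
}

# Demo with a made-up two-line file standing in for a real extract.
demo_lines <- c("2022031045", "2022032029")
demo_gz <- file.path(tempdir(), "toy_extract.dat.gz")
con <- gzfile(demo_gz, open = "wt")
writeLines(demo_lines, con)
close(con)

demo_dat <- decompress_gz(demo_gz)
print(readLines(demo_dat))
```

With a real extract, you would call the helper on your own download, for example decompress_gz("cps0001.dat.gz"), substituting your actual file name and path.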
First, install and load the ipumsr package into your R session. This package contains functions that will transform raw text files into a tabular data frame (i.e., one with rows and columns), as well as load variable names and values. We’ll use some of these functions below.
# Install ipumsr package (only needed once)
install.packages("ipumsr")
# Load ipumsr package
library(ipumsr)
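To get a feel for why both files are needed, here is a rough base R sketch of the general idea. It is not how ipumsr is implemented. The raw extract is fixed-width text with no column names, and the documentation file says where each variable starts and ends and what its codes mean. The two-line "extract", the column widths, and the layout below are all made up for illustration; the SEX codes (1 = Male, 2 = Female) follow the usual IPUMS CPS coding.

```r
# A made-up fixed-width "extract": no header, no separators.
raw_lines <- c(
  "2022031045",
  "2022032029"
)

# Stand-in for what the documentation supplies: variable names and widths.
# These positions are hypothetical, not taken from a real DDI file.
layout <- data.frame(
  name  = c("YEAR", "MONTH", "SEX", "AGE"),
  width = c(4, 2, 1, 3)
)

# Slice each line into columns according to the layout.
toy <- read.fwf(
  textConnection(raw_lines),
  widths    = layout$width,
  col.names = layout$name
)

# Attach value labels so that codes become readable categories.
toy$SEX_label <- factor(toy$SEX, levels = c(1, 2), labels = c("Male", "Female"))

print(toy)
```

A real extract can contain hundreds of variables, which is why it helps to let the package read the layout and labels from the documentation file instead of typing them out by hand.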
Next, we will create an object containing the documentation and metadata from the .xml file. The read_ipums_ddi function automatically loads the IPUMS documentation to ease interpretability. We then pass this object through the read_ipums_micro function, which will automatically load the data from your .dat file.
Be sure to adjust your file path and file name to your system! Note that IPUMS automatically uses the same name (e.g., cps0001.xml and cps0001.dat) for the documentation file and the data set. As long as both files are in the same folder, the code below will run smoothly.
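# Create ddi object of documentation (replace the file name and path with your own)
ddi <- read_ipums_ddi("cps0001.xml")
# Pass ddi object through read function
raw_data <- read_ipums_micro(ddi)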
The above chunks of code ultimately create an R data frame called raw_data.