Handling Missing Data

May 4, 2021

I recently started working with a set of eviction data for a project on housing precarity at the Urban Displacement Project. As I began exploring the dataset, I was excited to find that it appeared to contain a wealth of historical data we could use to train a robust model for predicting eviction rates in urban neighborhoods. However, my initial excitement soon had to be scaled back when a standard check for missing data revealed that many of the observations lacked values for precisely the variable we aimed to predict. I was now faced with the problem of what to do about this sizable hole at the very center of an otherwise promising dataset.

If you’re engaged in data-intensive research, you’ve probably faced a similar problem or will sometime soon. Maybe some of the respondents to your survey declined to answer a question they found too sensitive. Or perhaps there was a temporary malfunction in the sensor you’re using to measure the flow of traffic through an intersection. As my former teacher, Matt Brems, put it recently, “When doing any sort of data science problem, we will inevitably run into missing data.” The question, then, is not how we can avoid the problem, but rather what we can do about it.

What too many of us tend to do is just drop the missing data, cut our losses, and move on without giving it a second thought. Some of us may put more effort in by making a simple imputation, such as filling in the missing values with the mean of their respective variables. However, both these common approaches are potentially problematic. If our missing data differ systematically from our observed data, simply dropping them could introduce significant bias into our sample. Mean imputation, on the other hand, is likely to distort the relationships between our variables by reducing their variance.
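
To make the point about mean imputation concrete, here is a minimal sketch on synthetic data (the variables x and y and the 30% missingness rate are illustrative assumptions, not taken from the eviction dataset). It fills the gaps in y with the observed mean and compares the variance and the correlation with x before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y depends linearly on x plus noise.
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Remove 30% of the y values completely at random.
missing = rng.random(n) < 0.3
y_observed = np.where(missing, np.nan, y)

# Mean imputation: fill every gap with the mean of the observed values.
y_filled = np.where(missing, np.nanmean(y_observed), y_observed)

# Compare the spread of y, and its relationship with x, before and after.
var_observed = np.nanvar(y_observed, ddof=1)
var_filled = np.var(y_filled, ddof=1)
corr_observed = np.corrcoef(x[~missing], y[~missing])[0, 1]
corr_filled = np.corrcoef(x, y_filled)[0, 1]

print(f"variance of y:  observed={var_observed:.2f}  mean-imputed={var_filled:.2f}")
print(f"corr(x, y):     observed={corr_observed:.2f}  mean-imputed={corr_filled:.2f}")
```

Because every imputed value sits exactly at the mean, the filled-in variable has less spread than the observed one, and its correlation with x should come out weaker as well.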

More rigorous methods require determining what statisticians call the “missingness mechanism”, which amounts to figuring out why the data is missing in the first place before deciding what to do about it. Although there are various ways to categorize missingness mechanisms, the most common is a three-fold typology:

First, if the probability of missingness is the same for all observations, regardless of any observed or unobserved values, then the data is Missing Completely At Random (MCAR). For example, imagine that survey respondents decided whether or not to answer your question based on the roll of a die. In this case, the missing data can be dropped without introducing bias.

Second, if the probability of missingness depends only on other variables that you have observed, and not on the missing values themselves, the data is Missing At Random (MAR). For example, your traffic sensor stopped working on days when there were snow storms and other extreme weather. In this case, the missing data can be safely dropped only if your model controls for the relevant variables.

Third, if the probability of missingness depends on the unobserved value of the variable in question itself, then the data is Not Missing At Random (NMAR, also written MNAR). For example, survey respondents with very low incomes were more likely not to answer your question about income. In this case, the missing data cannot be dropped without introducing bias.
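
The three mechanisms can be simulated to see how dropping incomplete rows plays out under each. The sketch below is only an illustration under assumed numbers: the income-like variable, the always-observed covariate, and the missingness probabilities are all made up. For each mechanism it drops the rows flagged as missing and reports the mean of what remains, along with the slope from a regression that controls for the covariate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical data: an always-observed covariate and an income-like
# variable that depends on it (true slope = 10, true mean = 50).
covariate = rng.normal(size=n)
income = 50 + 10 * covariate + rng.normal(scale=5, size=n)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# True means "this income value is missing".
masks = {
    # MCAR: the same 30% chance of missingness for every observation.
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on the observed covariate
    # (more likely when the covariate is low).
    "MAR": rng.random(n) < sigmoid(-1 - 1.5 * covariate),
    # NMAR: missingness depends on income itself
    # (more likely when income is low).
    "NMAR": rng.random(n) < sigmoid(-(income - 45) / 5),
}

results = {}
print(f"full data: mean income = {income.mean():.2f}")
for name, is_missing in masks.items():
    kept = ~is_missing
    # Regress income on the covariate using complete cases only.
    slope, intercept = np.polyfit(covariate[kept], income[kept], 1)
    results[name] = {"mean": income[kept].mean(), "slope": slope}
    print(f"{name}: mean after dropping = {results[name]['mean']:.2f}, "
          f"slope = {slope:.2f}")
```

In this simulation, dropping under MCAR should leave both the mean and the slope close to their true values. Under MAR the raw mean shifts, but the slope from the model that controls for the covariate should stay close to 10. Under NMAR both are pulled away from the truth.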

Even in cases where dropping the missing data would not create bias (i.e., MCAR, and MAR once the relevant variables are controlled for), it is often preferable to impute the missing values instead, because this prevents any reduction in the size of the dataset. However, the pitfalls of simple imputation should still be avoided if possible by using more advanced methods such as multiple regression imputation, in which a series of models is fit on the observed data to predict the missing values.
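
As a rough sketch of regression-based imputation, the example below uses Scikit-Learn’s IterativeImputer on synthetic data. The three columns and the 25% missingness rate are assumptions for illustration and have nothing to do with the eviction dataset. Note that the estimator has to be enabled through an explicit experimental import before it can be used.

```python
import numpy as np
# This import is required to unlock the experimental estimator.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two predictors and a target column.
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 3 * x1 - 2 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2, target])

# Remove 25% of the target values completely at random.
mask = rng.random(n) < 0.25
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Each column with gaps is modeled as a function of the other columns,
# and the fitted model's predictions fill in the missing entries.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Since the data are synthetic, imputed values can be checked against the truth.
rmse = np.sqrt(np.mean((X_imputed[mask, 2] - X[mask, 2]) ** 2))
print(f"RMSE of imputed target values: {rmse:.2f}")
print(f"variance of target: true={X[:, 2].var():.2f}, "
      f"imputed={X_imputed[:, 2].var():.2f}")
```

With the default settings the imputed values are point predictions, so they tend to have less spread than real observations would. IterativeImputer also accepts sample_posterior=True, which draws imputations from the predictive distribution instead. Whether that helps in a given case is something to test rather than assume.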

In the case of the eviction dataset, I found that there were actually multiple missingness mechanisms at work. The majority of the missing data had a clear pattern: entire years of data were missing for all neighborhoods in one particular county. It turns out that the researchers who collected the data were only able to access that county’s eviction records for some years and not others. The relatively small number of remaining values were missing for unknown reasons that appear unrelated to the eviction rate variable itself. So, in the language of missingness mechanisms, most of the missing data was MAR based on known variables, while a smaller portion was also MAR but based on unknown variables.

Armed with this understanding of the missingness mechanisms in the dataset, I tried both imputing and dropping the missing data. First, I tried multiple regression imputation using Scikit-Learn’s IterativeImputer, which I should note is still designated as “experimental” in the Scikit-Learn documentation. However, even this advanced imputation method was unable to avoid distorting the variance of the eviction rate variable, and it resulted in a model that performed poorly on unseen test data. Then I tried dropping the missing data and adding the recommended dummy variables to control for county and year (on the assumption that the unknown factors behind the rest of the missing data were covered by the other independent variables in my model). Adding these extra variables also led to an overfit model, but controlling for county and year generated new insights into the factors that drive neighborhood eviction rates.
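
For readers who want to see the mechanics of the second approach, here is a minimal sketch using pandas. The toy table and its column names (county, year, median_rent, eviction_rate) are hypothetical stand-ins rather than the actual eviction data. It counts the missing values, drops the rows that lack the target, and encodes county and year as dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and values are made up.
df = pd.DataFrame({
    "county": ["A", "A", "B", "B", "C", "C"],
    "year": [2015, 2016, 2015, 2016, 2015, 2016],
    "median_rent": [1200.0, 1250.0, 900.0, 950.0, 1500.0, 1550.0],
    "eviction_rate": [2.1, 2.4, np.nan, 3.0, 1.2, np.nan],
})

# A standard first check: count the missing values in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Drop only the rows where the target variable is missing.
complete = df.dropna(subset=["eviction_rate"])

# Encode county and year as dummy variables so a model can control for them.
# drop_first=True omits one reference level per variable to avoid
# perfectly collinear columns.
design = pd.get_dummies(
    complete[["county", "year", "median_rent"]],
    columns=["county", "year"],
    drop_first=True,
    dtype=float,
)
target = complete["eviction_rate"]
print(design)
```

Each dummy column adds a parameter to the model, so with many counties and years the design matrix can grow quickly relative to the number of complete rows.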

These were decidedly mixed results. There’s no silver bullet for completely eliminating the bias and distortion that missing data can introduce into our models. The best we can do is to try to mitigate these effects with the thoughtfulness and rigor that the problem deserves.

I’m grateful to Tim Thomas (Urban Displacement Project) and Cari Kaufman (Department of Statistics) for sharing their thoughts and suggestions on this case.

References and Resources:

Scikit-Learn documentation for IterativeImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html