Six months ago, the D-Lab community made possible a connection between the UC Berkeley School of Information, D-Lab Data Science Fellows, and the Urban Displacement Project (UDP). A summer of brainstorming, collaboration, and many Zoom sessions later, the Project HOME team is excited to present our 5th Year Master of Information and Data Science (MIDS) capstone project. We worked under the mentorship of Timothy Thomas at UDP to develop Project HOME, a finalist for the 2021 5th Year MIDS Capstone Award. In this post, we'll walk through why and how we predicted and mapped eviction rates across the state of California.
Project HOME Team:
We embarked on Project HOME for a number of reasons, but to really share the story, we'll start with this statistic: one-third of U.S. counties lack any form of annual eviction data. This means there are no available records of who has been evicted in a given county, where, or how many, let alone at the census tract or neighborhood level.
Overall, eviction data is sparse, unvalidated, and often inaccurate. In California specifically, eviction records are highly protected by the courts, and even those records typically underestimate true eviction rates.
Why do we need eviction data?
Access to tract- or neighborhood-level eviction rates helps policymakers understand where resources are needed, make sound policy decisions, and develop impactful public assistance programs. This matters especially now: in 2021, during the COVID-19 pandemic and the California eviction moratorium (meant to prevent mass evictions of tenants falling behind on rent), knowing where mass evictions may occur post-moratorium is vital for allocating resources such as rent relief. Since we don't have California eviction data, what we can do is predict the eviction rate for each tract in California.
The Urban Displacement Project has worked with local organizations, collected data itself, and applied NLP methods to court records to develop what we know to be the most validated and robust national dataset on evictions, which includes counts of eviction filings in 16 metros across the United States. The overarching method: build a model on the labeled data from those 16 metros, then use that model to predict eviction rates in California!
Acquire Data: We sourced data from the American Community Survey (ACS), the CDC, and the USDA to predict eviction rates. ACS Census data covers variables such as household income, property values, rent burden, and the racial breakdown of each tract. The CDC's 500 Cities Project supplied tract-level data on the prevalence of various health indicators such as smoking, obesity, and senior health. Finally, the USDA Food Access Research Atlas includes tract-level data on poverty rates, use of SNAP benefits, distance to supermarkets, and so forth.
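Combining these sources comes down to joining tract-level tables on a shared census tract identifier (GEOID). A minimal sketch with pandas, using made-up column names and values (the real ACS/CDC/USDA schemas differ):

```python
import pandas as pd

# Hypothetical tract-level extracts; the column names and values are
# illustrative, not the actual ACS/CDC/USDA schemas.
acs = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "median_income": [85000, 62000],
    "rent_burden_pct": [32.0, 41.5],
})
cdc = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "smoking_prev": [11.2, 14.8],
})
usda = pd.DataFrame({
    "GEOID": ["06001400100", "06001400200"],
    "poverty_rate": [9.5, 18.1],
})

# Join all three sources on the census tract identifier.
features = acs.merge(cdc, on="GEOID").merge(usda, on="GEOID")
```

Joining on GEOID keeps every feature aligned to the same tract, so each row of `features` describes one census tract.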
Train Models: We trained supervised learning models on the 16 metros' data from 2016, the most recent validated eviction data available (via UDP). The outcome variable we're predicting is the eviction rate among renters in a tract, computed as the proportion of evictions to the total number of renting households. We then binned this rate into 3 categories: less than 2%, 2-5%, and greater than 5%. We found this to be a natural grouping of the data as well as an interpretable result for policymakers.
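The rate computation and binning above can be sketched as a small function. The exact handling of the 2% and 5% boundaries is our assumption; the post doesn't specify which bin they fall into:

```python
def eviction_rate_bin(evictions: int, renting_households: int) -> str:
    """Compute a tract's eviction rate among renters and bin it into
    the three outcome categories used for modeling."""
    rate = evictions / renting_households
    if rate < 0.02:
        return "<2%"
    elif rate <= 0.05:  # boundary handling is illustrative
        return "2-5%"
    return ">5%"
```

For example, a tract with 30 evictions among 1,000 renting households (a 3% rate) falls in the middle bin.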
Validate: Next, we went through a validation step. We had split the 16 metros data into train and validation sets. We used the validation set to assess model performance.
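The split itself can be sketched with the standard library; the 80/20 fraction and fixed seed here are our assumptions, not stated in the post:

```python
import random

def train_validation_split(rows, val_fraction=0.2, seed=42):
    """Shuffle labeled tract rows and split them into train and
    validation sets, so the validation set stays held out."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

train, val = train_validation_split(range(100))
```

Fixing the seed makes the split reproducible, so every model is scored on the same held-out tracts.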
Predictions: Once we chose the best performing model on the validation set, we used it to predict eviction rates in California using data from 2019. We chose 2019 rather than 2020 or 2021 for two reasons: 1) a full year's worth of 2021 data isn't yet available, and 2) 2020 is an outlier year due to the pandemic, so a model trained on 2016 data likely won't perform well on it.
Build a Map: Finally, once we produced the predictions, we built a map! We coded it from scratch using R's Leaflet library.
The domain baseline for predicting eviction rates with machine learning is an accuracy of around 60%. Predicting evictions is hard: it is a heavily nuanced topic, with varying local laws, economic conditions, predatory actors, property sales, environmental changes, and more all impacting a neighborhood's eviction rate. We tested a variety of models, including logistic regression, K-nearest neighbors, and gradient boosted trees. The best performing model turned out to be the gradient boosted tree, with an accuracy of about 72% on the validation set; this is the model we used to predict California eviction rates.
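A minimal sketch of this train-and-validate step using scikit-learn's gradient boosted trees, on synthetic data standing in for the real tract features (default hyperparameters; the team's actual settings and data are not shown in the post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 16-metro feature matrix: three classes
# mirror the <2%, 2-5%, and >5% eviction-rate bins.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0,
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on the held-out validation set guides model selection.
accuracy = model.score(X_val, y_val)
```

The same fitted model would then be applied to the California feature matrix via `model.predict(...)` to produce a binned eviction-rate estimate per tract.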
Caveat: Conservative Estimates
We want to point out an important caveat to our findings. There are a number of steps leading up to a forceful eviction. First, a landlord gives the tenant a 3-60 day written notice. If the tenant doesn't comply, the landlord can file a legal complaint and potentially notify the Sheriff's department, which then carries out a forceful eviction. But at any point during this process, a tenant may leave... which experts also count as an eviction.
However, our training data is based on the very last step: the rate of legally documented, forced evictions. Because of this, all of our estimates are conservative and very likely underestimate the true number of residents made to leave their homes. We point this out to convey that the impacts of mass eviction and displacement are likely to be much higher than what our final product shows.
Without further ado, check out our final product through the following links: