Exploring Rental Affordability in the San Francisco Bay Area Neighborhoods with R

November 5, 2024

Rental affordability continues to be a critical issue in many American cities, particularly in the San Francisco Bay Area. However, discussions about how expensive rent is often rely on personal experiences or media anecdotes. What if we want a more data-driven approach to understanding the cost of rent across different neighborhoods and uncovering patterns in affordability? Furthermore, what if we want to explore how neighborhood characteristics, such as demographics, relate to rental affordability? Thankfully, researchers can rely on computational tools in R to easily access, analyze, and visualize these data.

In this post, I will walk through how we can use R and its packages to assess rental affordability across neighborhoods in the San Francisco Bay Area. While there are many ways to measure rental affordability, we will create a simplified rental affordability index by comparing the median rent of an area to its median monthly household income. Although this measure isn’t perfect, it serves as a straightforward introduction to spatial analysis for our purposes. This post will demonstrate how easy it can be to incorporate spatial data analysis into housing research using R and how computational tools can provide meaningful insights into rental affordability. The code used in this blog post can be found in this GitHub repository. Let’s dive in!

Importing U.S. Census Data Using tidycensus

As explained in a previous D-Lab blog post written by former D-Lab Data Science Fellow Alex Ramiller, tidycensus is a package that allows us to easily download U.S. Census data using R. This blog post assumes that you already know the basics of how to use tidycensus to import census data into your R environment. As a reminder, you should have requested a Census API key and installed the tidycensus package into your R environment!
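For reference, the one-time key setup might look like the sketch below. The key string is a placeholder rather than a real key, and storing the key with install = TRUE is just one common choice.

```r
library(tidycensus)

# "YOUR_API_KEY" is a placeholder; substitute the key you received from the Census Bureau.
# install = TRUE saves the key to your .Renviron file so future sessions can find it.
census_api_key("YOUR_API_KEY", install = TRUE)

# Reload .Renviron so the key is also available in the current session.
readRenviron("~/.Renviron")
```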

Once the initial setup is complete, we can load the tidycensus package into our R environment by calling library(tidycensus). Next, we will download several variables from the census using the get_acs() function in the tidycensus package. To do so, we will need to supply several arguments to this function. Here are some things you might want to consider before importing your dataset:

First, what variables do you want to import? We use the variables “B25064_001”, “B19013_001”, “B03002_001”, and “B03002_003”, which correspond to median gross rent, median household income, total population, and non-Hispanic white population. You can find more details on how to look up a census variable in a previous blog post.
Second, what geographic unit do you want to pull data from? For the list of geographies available, you can refer to this tidycensus documentation. We all have very different ideas about what counts as a neighborhood, but let’s rely on Census tracts, a widely used proxy for neighborhoods in research. Census tracts are relatively small geographic units that together make up a county, much like the pieces of a jigsaw puzzle.
Third, what state and counties do we want our data for? Despite varying definitions, the San Francisco Bay Area often refers to nine counties around San Francisco. We will need to specify these nine counties in our function call.
Fourth, what year do we want the data for? We will choose 2022, which is the most recent year for which the data are available at this time. By default, get_acs() returns five-year American Community Survey estimates, so this corresponds to the 2018-2022 period.
Lastly, we will specify the geometry argument as TRUE so that the function downloads the geometry data for the Census tracts as well.

# Uncomment the next line if you haven't installed tidycensus yet.
# install.packages("tidycensus")

library(tidycensus)

tract_data <- get_acs(
  variables = c(
    "median_rent"   = "B25064_001", # median gross rent
    "median_income" = "B19013_001", # median household income
    "pop_total"     = "B03002_001", # total population
    "pop_white"     = "B03002_003"  # non-Hispanic white population
  ),
  geography = "tract",
  state = "California",
  county = c("Alameda", "Contra Costa", "Marin",
             "Napa", "San Francisco", "San Mateo",
             "Santa Clara", "Solano", "Sonoma"),
  year = 2022,
  geometry = TRUE
)

Let’s check that our dataset has been properly downloaded! You can print the data frame directly by using the print() function.

print(tract_data)

In the data summary section, you can see that tract_data contains 7088 features and 5 fields, meaning the dataset includes 7088 observations (rows) and 5 columns (variables). Each row describes a specific Census tract: GEOID is a unique identifier for the tract and NAME is its name or label. The variable column lists the names of the Census variables, and the estimate column shows their corresponding values. The moe column gives the margin of error for each estimate, which indicates how reliable it is, but we will not use this information in our exercise. You will also see a geometry column, which means that our data frame is a spatial vector object with a geometry (e.g. point, line, polygon) assigned to each observation.

However, the current data frame is organized in long format, where each row represents a unique combination of a Census tract and a specific variable. While long format is useful for storing and reshaping data efficiently, data analysis and visualization typically require the data to be in wide format. In wide format, each variable is presented as a separate column, making it easier to perform calculations, generate visualizations, and apply spatial analysis techniques. For more information on the differences between long and wide data formats, you can refer to this post.
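To make the difference concrete, here is a minimal sketch on a toy table. The tract IDs and values are made up for illustration and do not come from the Census; only the column names mirror what get_acs() returns.

```r
library(tidyr)

# Toy long-format table: one row per tract-variable pair (values are made up).
toy_long <- tibble::tibble(
  GEOID    = c("A", "A", "B", "B"),
  variable = c("median_rent", "median_income", "median_rent", "median_income"),
  estimate = c(2000, 96000, 3000, 180000)
)

# Wide format: one row per tract, one column per variable.
toy_wide <- pivot_wider(toy_long, names_from = "variable", values_from = "estimate")
print(toy_wide)
```

The four long rows collapse into two wide rows, one per toy tract, with median_rent and median_income as separate columns.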

We can pivot the data frame into a “wide” format using the following code:

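# Load the packages that provide select() and pivot_wider()
library(dplyr)
library(tidyr)

# Pivot the data frame from long to wide format
tract_data <- tract_data |>
  select(-moe) |> # Remove the margin of error column
  pivot_wider(names_from = "variable",
              values_from = "estimate")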

Once we finish running the code, we can print the data frame again to check that it has been properly rearranged. Good news: it has! Now we’re ready to start mapping our data.

Mapping Rental Affordability Using ggplot2 and sf

Using ggplot2 and sf packages for mapping

We can now use the formatted data frame to produce a map using the ggplot2 and sf packages. ggplot2 is a widely used R package for data visualization and sf is a package with many useful functions for analyzing spatial vector data, such as the data we just downloaded using tidycensus. Remember that our tract_data data frame had a geometry column? The spatial information in this column is what will allow us to map our data.

Let’s create a simple map visualizing the median rent across all census tracts in our data frame. This can be achieved using the following code. First, we use the ggplot() function to specify that we are working with the tract_data data frame for the visualization. Then, we add geom_sf() to tell R that we are mapping spatial data from the data frame. Within geom_sf(), we specify the aesthetic mapping by setting fill = median_rent, which allows us to color each tract based on its median rent value.

library(ggplot2)
library(sf)

map_median_rent <- tract_data |>
  ggplot() + geom_sf(aes(fill = median_rent))

print(map_median_rent)

Once you print map_median_rent, you should see the map displayed in your RStudio Plots panel.
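If you want to experiment with geom_sf() without downloading anything, the same pattern can be tried on a toy spatial object. This is only a sketch: make_square() is a hypothetical helper written for this illustration, and the two "tracts" and their rents are made up.

```r
library(sf)
library(ggplot2)

# Hypothetical helper: a unit square whose lower-left corner sits at (x0, 0).
make_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two made-up "tracts" side by side, each with a made-up median rent.
toy_tracts <- st_sf(
  median_rent = c(1800, 3200),
  geometry    = st_sfc(make_square(0), make_square(1))
)

# Same pattern as above: geom_sf() draws the polygons, fill colors them by rent.
toy_map <- ggplot(toy_tracts) + geom_sf(aes(fill = median_rent))
print(toy_map)
```

The geometry column plays the same role here as it does in tract_data: it tells geom_sf() what shape to draw for each row.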

Congratulations, our first map is complete! However, there is still room for improvement. First, the map needs better annotations: we should indicate what area we are mapping and provide clear labels to describe what is being visualized. Second, it’s not easy to differentiate neighborhoods with high median rent from those with lower rent, because the current color gradient and the neighborhood boundaries make the data hard to interpret. Finally, the grey background panel and the longitude/latitude tick marks aren’t aesthetically pleasing, at least to my eye.

Now, we will run an updated version of the code as follows:

First, I specify the color = NA argument inside geom_sf(). This removes the tract boundary lines.
Second, I add the scale_fill_gradientn() function to specify the spectrum of colors I want to use to visualize median rent. I also set labels = scales::dollar_format() so that the legend shows median rent in dollars.
Third, I use the labs() function to include useful information about the map. What is the map showing? Where is it? What data is it using?
Lastly, I use theme_void() to make the map cleaner, removing the background panel and the tick marks.

map_median_rent2 <- tract_data |>
  ggplot() +
  geom_sf(color = NA, aes(fill = median_rent)) +
  scale_fill_gradientn(
    colors = c("lightgreen", "green", "yellow", "orange", "red"),
    labels = scales::dollar_format()
  ) +
  labs(
    title = "Median Rent by Census Tract (2022)",
    subtitle = "San Francisco Bay Area",
    fill = "Median Rent",
    caption = "Data: American Community Survey (2018-2022)"
  ) +
  theme_void()

print(map_median_rent2)

Once you print the updated map, you’ll see that most Bay Area neighborhoods have high median rents over $2,000, except for some of the more rural outskirts. Good work!

Creating and mapping rental affordability index

We just mapped the median rent levels across different Bay Area neighborhoods. But what if we want to look at how different neighborhoods fare in terms of their rent levels relative to how much income households make every month? Let’s create a new rent_to_income variable to capture rental affordability in a neighborhood. You can run the following code to create the new variable, dividing median rent by monthly median household income (that is, annual median household income divided by 12).

tract_data <- tract_data |>
  mutate(rent_to_income = median_rent / (median_income / 12))
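As a quick sanity check on the arithmetic, here is the same calculation for a single hypothetical tract. The rent and income figures are made up for illustration.

```r
# Made-up numbers for a single hypothetical tract.
median_rent   <- 2500    # dollars per month
median_income <- 120000  # dollars per year

monthly_income <- median_income / 12            # 10,000 dollars per month
rent_to_income <- median_rent / monthly_income  # 0.25, i.e. 25% of monthly income
print(rent_to_income)

# If either input is missing, the ratio is missing too.
missing_ratio <- median_rent / (NA_real_ / 12)
print(missing_ratio)
```

A ratio of 0.25 means that a household earning the median income would spend a quarter of it on the median rent.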

Next, we can quickly check what the distribution of this new rental affordability index looks like, using a histogram.

tract_data |>
  ggplot() + geom_histogram(aes(x = rent_to_income))

Once you visualize the histogram, you will see that most neighborhoods have a rent-to-income ratio of less than 25%, although in some places this ratio is greater than 75%!
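A histogram gives a visual impression; a few summary numbers can back it up. The sketch below uses a made-up vector in place of tract_data$rent_to_income, so the printed values say nothing about the real Bay Area distribution.

```r
# Made-up ratios standing in for tract_data$rent_to_income;
# NA marks a tract with missing rent or income.
ratios <- c(0.15, 0.18, 0.22, 0.24, 0.31, 0.45, 0.80, NA)

# Quartiles of the distribution, ignoring missing values.
print(quantile(ratios, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))

# Share of tracts with a ratio below 25% and above 75%.
share_below_25 <- mean(ratios < 0.25, na.rm = TRUE)
share_above_75 <- mean(ratios > 0.75, na.rm = TRUE)
print(c(below_25 = share_below_25, above_75 = share_above_75))
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is a compact way to ask "what share of tracts meet this condition?"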

Using this information on the variable distribution, let’s group our continuous rent_to_income variable into a new categorical variable with four levels. You can do this by creating a new rent_to_income_grouped variable using the mutate() and case_when() functions.

tract_data <- tract_data |>
  mutate(rent_to_income_grouped = case_when(
    rent_to_income < 0.2 ~ "< 20%",
    rent_to_income < 0.3 ~ "< 30%",
    rent_to_income < 0.4 ~ "< 40%",
    rent_to_income >= 0.4 ~ ">= 40%",
    TRUE ~ NA_character_
  ))
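One detail worth noting: case_when() checks its conditions in order and uses the first one that matches, so the "< 30%" label effectively means "at least 20% but under 30%". The sketch below runs the same conditions on a few made-up ratios to show how each one is classified.

```r
library(dplyr)

# Made-up ratios, one per category, plus a missing value.
toy <- tibble(rent_to_income = c(0.15, 0.25, 0.35, 0.55, NA))

toy <- toy |>
  mutate(grouped = case_when(
    rent_to_income < 0.2  ~ "< 20%",
    rent_to_income < 0.3  ~ "< 30%",   # only reached when the ratio is >= 0.2
    rent_to_income < 0.4  ~ "< 40%",   # only reached when the ratio is >= 0.3
    rent_to_income >= 0.4 ~ ">= 40%",
    TRUE ~ NA_character_               # missing ratios stay missing
  ))
print(toy)
```

Because of this ordering, rearranging the conditions would change the result; for example, putting the "< 40%" line first would absorb every ratio below 0.4.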

Let’s map this newly created rental affordability index. Our code will be very similar to the code we used to create map_median_rent2 for visualizing median rent. But this time, our rental affordability index is a categorical variable, so we will need to replace the scale_fill_gradientn() function with the scale_fill_manual() function, where we specify the fill color for each category. We will also need a new title for the map.

map_affordability_ratio <- tract_data |>
  ggplot() +
  geom_sf(color = NA, aes(fill = rent_to_income_grouped)) +
  scale_fill_manual(
    values = c("< 20%" = "green", "< 30%" = "yellow",
               "< 40%" = "orange", ">= 40%" = "red")
  ) +
  labs(
    title = "Median Rent to Median Monthly Household Income Ratio by Census Tract (SF Bay Area)",
    fill = "Ratio",
    caption = "Data: American Community Survey (2018-2022)"
  ) +
  theme_void()

print(map_affordability_ratio)

We see that this new map has a different legend from our median rent map, because the fill variable is now categorical. We can also see more clearly what share of its income a typical household in each neighborhood might be paying in rent. For example, many of the Silicon Valley neighborhoods in the South Bay have very high median rents, but their households are, on average, spending a much smaller share of their income on rent. On the other hand, new pockets of orange and red pop up, particularly in East Bay cities like Oakland, Berkeley, and Vallejo. These neighborhoods may have high rent levels, but their residents may not earn as much as those in the expensive South Bay neighborhoods.
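To complement the visual reading of the map, we could also tabulate how many tracts fall into each category. The sketch below uses a made-up table in place of tract_data; with the real spatial data frame, it would probably make sense to call sf::st_drop_geometry(tract_data) first so the polygons are not carried through the summary.

```r
library(dplyr)

# Made-up stand-in for the non-spatial columns of tract_data.
toy <- tibble(
  GEOID = c("A", "B", "C", "D", "E"),
  rent_to_income_grouped = c("< 20%", "< 30%", "< 30%", ">= 40%", NA)
)

# Number and share of tracts in each affordability category.
category_counts <- toy |>
  count(rent_to_income_grouped) |>
  mutate(share = n / sum(n))
print(category_counts)
```

Tracts with a missing ratio show up as their own NA row, which is a useful reminder of how many tracts the map leaves uncolored.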

Through this exercise, we learned how useful mapping is for understanding the spatial patterns of rental affordability in the Bay Area. The majority of neighborhoods, particularly in the core urban areas, had a median rent of $3,000 or above, suggesting how expensive rental costs are. When we looked at our custom-made rental affordability index, we found that many of these rent-expensive neighborhoods also had high median household income. On the other hand, we also start seeing that households living in areas with lower nominal rents may experience a greater burden due to their lower income levels—areas such as East Oakland. From a policy-oriented perspective, these neighborhoods might be the places that we would want to prioritize in terms of renter protections and rent stabilization.
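The pop_total and pop_white variables we downloaded at the start are not used in the maps above. As one possible next step, and only as a rough sketch, they could be turned into a demographic measure and compared with the affordability ratio. The share_nonwhite variable is an assumption introduced here for illustration, the numbers are made up, and a correlation like this is descriptive only.

```r
library(dplyr)

# Made-up tracts; with real data these columns would come from tract_data.
toy <- tibble(
  pop_total      = c(4000, 5000, 3000, 6000),
  pop_white      = c(3000, 2500, 600, 1200),
  rent_to_income = c(0.18, 0.24, 0.36, 0.33)
)

# Share of residents who are not non-Hispanic white (illustrative variable).
toy <- toy |>
  mutate(share_nonwhite = 1 - pop_white / pop_total)

# Pearson correlation, skipping tracts with missing values.
r <- cor(toy$share_nonwhite, toy$rent_to_income, use = "complete.obs")
print(r)
```

On real tract data, a scatter plot of the two variables would be a sensible companion to the single correlation number.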

It’s also important to recognize that these traditional maps have several weaknesses. First, we might find it difficult to discern granular geographic patterns in a huge area like the Bay Area. On the map above, it’s really difficult to see all the small neighborhoods and which cities they belong to. Additionally, we might want to know precisely what the rental affordability ratio for each neighborhood is, which is simply impossible to visualize on a traditional map. I plan to address these issues in a future blog post where I will introduce interactive mapping tools using the leaflet package!

References

Ramiller, A. 2024. Mapping Census Data with tidycensus. UC Berkeley D-Lab, Medium. https://medium.com/@dlab-berkeley/mapping-census-data-with-tidycensus-492453d2ecf7 (Accessed November 4, 2024).
Walker, K. & Herman, M. Basic usage of tidycensus. https://walker-data.com/tidycensus/articles/basic-usage.html (Accessed November 4, 2024).
