Analyzing the Bay Area Commute Network with Geopandas and Networkx

February 12, 2021

Hi everyone! I'm one of the D-Lab Data Science Fellows who joined the D-Lab this year. I'm in my second semester of the MCP/MS (City Planning / Transportation Engineering) program. My academic background is actually in physics, and I was doing research on radiation detection in urban areas before I decided to come back to school. I hope to bring my physics background and computational skills to the field of urban planning, to better understand and model urban/regional systems using complex systems and computational methods, and to bridge the divide between data science and the social sciences. I'm also very interested in the emergence and evolution of social complexity, urbanism, and regional systems/networks of cities.

One of my projects has been to look into the commuting patterns of the Bay Area. The project was inspired by this paper (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166083) by Dash Nelson and Rae, who ran a network clustering algorithm on the commuting network of the United States to partition (divide up) the lower 48 states of the US into commuter sheds surrounding each major city or region. They visualized their results in figures like the following one...

Figure from Dash Nelson and Rae (2016):  (missing figure) 

5-year averages of commuter data for each census tract in the US are publicly available and can be pulled from the American Community Survey (ACS) done by the Census Bureau. The Dash Nelson and Rae paper actually provides the 2005-2010 data in parsed shapefiles so you don't have to pull the data directly from the US Census databases.

With Python and the geopandas package, one can easily load geographical data into a DataFrame structure and browse it in Jupyter(Lab) notebooks. The data can then be parsed with the networkx package into a graph/network, which makes it easy to investigate the network properties of the commuter flows between census tracts.

I'm hoping that this research will help us divide the Bay Area into transit agency service areas that make more sense than the current ones. (Did you know that there are currently 27 transit agencies that serve the 9-county Bay Area?) It could also tell us how to better draw boundaries for fare zones, the fare-unifying mechanism suggested by Seamless Bay Area in their new Integrated Transit Fares proposal (https://www.seamlessbayarea.org/integrated-fare-vision).
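To make the partitioning idea concrete, here is a minimal sketch on a made-up toy network. The node names and weights are hypothetical, and Louvain modularity clustering is used only as a readily available stand-in in networkx (version 2.7 or later); it is not necessarily the algorithm used in the paper.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical toy network: six "tracts" forming two tightly linked groups
# joined by one weak link. The weights stand in for commuter counts.
toy_flows = [
    ("A1", "A2", 500), ("A2", "A3", 400), ("A1", "A3", 450),
    ("B1", "B2", 600), ("B2", "B3", 350), ("B1", "B3", 300),
    ("A3", "B1", 20),  # weak link between the two groups
]
G = nx.Graph()
G.add_weighted_edges_from(toy_flows)

# Group nodes so that most of the flow stays inside each group.
communities = louvain_communities(G, weight="weight", seed=0)

# Sort for a stable, readable result: one list of tracts per "commuter shed".
sheds = sorted(sorted(c) for c in communities)
print(sheds)
```

On real commute data each group would be a candidate commuter shed, which is the kind of region that could then be compared against existing service areas or fare zones.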

Some examples of the code to follow:

Reading a shapefile is as easy as

```
import geopandas as gpd

# Load the shapefile into a GeoDataFrame (displayed inline in a notebook)
gpd.read_file("cb_2018_06_tract_500k.shp")
```

where `cb_2018_06_tract_500k.shp` is one of the set of files that constitute a shapefile.

And converting that into a networkx graph would simply be:

```
import networkx as nx

# Build a directed graph with one weighted edge per origin-destination pair
DiG = nx.from_pandas_edgelist(df, source="OFIPS", target="DFIPS",
                              edge_attr="weight", create_using=nx.DiGraph)
```

`OFIPS`, `DFIPS`, and `weight` are just the column names of `df` that I would like to input into the networkx DiGraph (Di for directed, to preserve the commute flow direction from the residence to the workplace).
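As a quick check of what the directed graph holds, here is a small self-contained sketch. The tract IDs and commuter counts are made up for illustration; only the column names match the ones described above.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for `df`: made-up tract IDs and commuter counts.
df = pd.DataFrame({
    "OFIPS": ["tract_a", "tract_a", "tract_b", "tract_c"],
    "DFIPS": ["tract_b", "tract_c", "tract_a", "tract_a"],
    "weight": [120, 30, 45, 200],
})

DiG = nx.from_pandas_edgelist(df, source="OFIPS", target="DFIPS",
                              edge_attr="weight", create_using=nx.DiGraph)

# Direction matters: a -> b and b -> a are separate edges with their own flows.
flow_ab = DiG["tract_a"]["tract_b"]["weight"]
flow_ba = DiG["tract_b"]["tract_a"]["weight"]
print(DiG.number_of_nodes(), DiG.number_of_edges(), flow_ab, flow_ba)
```

Keeping the two directions as separate edges is what lets you tell a residential tract (mostly outgoing flow) from an employment centre (mostly incoming flow).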

The following are two example figures that I created while exploring the data. (missing figure) 

The first is a visualization of the undirected commute flow graph (i.e. commute flows in both directions are included). Each node's size scales linearly with the total flow incident on it, and its colour indicates the county it is in. Only links with a total flow greater than 100 are drawn. The plot on the left pins the nodes to their geographical locations, whereas the plot on the right positions the nodes with the Fruchterman-Reingold force-directed algorithm (as implemented by networkx), weighted by the total flow on each link and initialized to the nodes' geographical locations.
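For anyone who wants to try something similar, here is a rough sketch of one way the pieces could fit together. It is not necessarily how the figures were made. The flows and coordinates are made up, and the code stops short of drawing so that it runs without matplotlib. `nx.spring_layout` is networkx's Fruchterman-Reingold implementation.

```python
import networkx as nx

# Hypothetical directed flows and (longitude, latitude) positions.
DiG = nx.DiGraph()
DiG.add_weighted_edges_from([
    ("a", "b", 80), ("b", "a", 70),   # total flow 150
    ("a", "c", 40), ("c", "a", 30),   # total flow 70
    ("b", "c", 130),                  # one direction only, total flow 130
])
geo_pos = {"a": (-122.42, 37.77), "b": (-122.27, 37.80), "c": (-121.89, 37.34)}

# Collapse to an undirected graph, summing the flows of both directions.
# (DiGraph.to_undirected() keeps one direction's attributes; it does not add them.)
G = nx.Graph()
G.add_nodes_from(DiG.nodes)
for u, v, w in DiG.edges(data="weight"):
    if G.has_edge(u, v):
        G[u][v]["weight"] += w
    else:
        G.add_edge(u, v, weight=w)

# Only links with a total flow above 100 would be drawn.
strong = [(u, v) for u, v, w in G.edges(data="weight") if w > 100]

# Force-directed layout, weighted by flow and started from the geographic positions.
pos = nx.spring_layout(G, pos=geo_pos, weight="weight", seed=0)
print(strong)
```

From here, `nx.draw_networkx_nodes(G, pos)` and `nx.draw_networkx_edges(G, pos, edgelist=strong)` would draw the force-directed version, and passing `geo_pos` instead of `pos` would give the geographically pinned one.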

The nodes representing census tracts in downtown San Francisco are clearly pulled to the right towards the East Bay, telling us that a huge number of commuters are coming into downtown San Francisco from the East Bay every day. It also seems that downtown San Francisco exhibits a dual core structure, thus warranting a closer look. (missing figure) 

These plots show us the degree of the nodes (census tracts) in our graph. The three plots on the left show the number of census tracts that each tract is linked to (i.e. the two tracts form a commute origin-destination pair). The three plots on the right show the number of people who commute to/from each census tract each day. The redder the colour, the higher the number. The plots clearly show that the employment centres in the Bay Area are concentrated along the coastline in the southern half of the bay.
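The two kinds of degree described above can be computed along these lines. This is a minimal sketch on a made-up graph; the node names and flows are hypothetical.

```python
import networkx as nx

# Hypothetical commute flows: (residence, workplace, commuters).
DiG = nx.DiGraph()
DiG.add_weighted_edges_from([
    ("home_1", "downtown", 300),
    ("home_2", "downtown", 250),
    ("home_1", "home_2", 20),
    ("downtown", "home_1", 10),
])

# Number of distinct tracts each tract is linked to, in either direction.
n_links = {n: len(set(DiG.predecessors(n)) | set(DiG.successors(n)))
           for n in DiG}

# Number of commuters arriving at and leaving each tract (weighted degree).
inflow = dict(DiG.in_degree(weight="weight"))
outflow = dict(DiG.out_degree(weight="weight"))
total = {n: inflow[n] + outflow[n] for n in DiG}

print(n_links)
print(inflow)
print(outflow)
```

In this toy example `downtown` has by far the largest inflow, which is the kind of signal that marks an employment centre in the maps.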