Creating County Files from the Microsoft US Building Footprints Dataset

Microsoft recently made a dataset of building footprints for the entire United States freely available. Thank you, Microsoft! These footprints were generated using machine learning on satellite imagery. Although not perfect (buildings may be missing or imperfectly represented), this is a unique and valuable dataset that attempts to map every building in the US. It’s also really cool to explore and visualize. You can read more about the data and how it was generated on the Bing Blog and in this New York Times article, the latter of which includes some great maps of the data.

The data are made available as GeoJSON files, one per state. For some states, such as CA, FL, IL, MI, NY, PA, TX, OH, and a few others, these files are huge (1 GB or larger). Files this large are extremely difficult to work with on an average personal computer, whether with desktop software like ArcGIS or QGIS or programmatically in R or Python. Moreover, many folks who work with geospatial data have more experience with ESRI Shapefiles than with GeoJSON.

Consequently, there are a number of web posts discussing the desire for, and ways of, wrangling these data into smaller county-based files. This is also a task that I was asked to help with, to make these data more widely available to the UC Berkeley campus community.

Unfortunately, none of the suggested approaches for splitting the data into county files worked for me. So I took advantage of some time off due to a broken ankle and came up with an approach that leverages PostGIS, the geospatial extension to the fantastic free and open source database software PostgreSQL.

What I liked about this approach is that it gave me an opportunity to get reacquainted with PostGIS, a tool that I used all the time in my previous position but haven't used for a while. It's a great tool for scaling up geospatial data management and analysis. As a new twist, I brought Python to the party and interacted with PostGIS using the psycopg2 database adapter.
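To give a flavor of what a PostGIS-based split can look like from Python, here is a minimal sketch. It is not the code from the notebook: the table names (footprints, counties), the column names (geom, geoid), and the choice to assign each building to the county containing its centroid are all assumptions made for illustration. The function accepts any DB-API connection, such as one returned by psycopg2.connect(...).

```python
import json

# Hypothetical schema (an assumption, not taken from the notebook):
#   footprints(geom)      -- one row per building polygon
#   counties(geoid, geom) -- one row per county polygon
# A building is assigned to the county that contains its centroid, so a
# footprint straddling a county line is not written to two files.
COUNTY_SQL = """
    SELECT ST_AsGeoJSON(b.geom)
    FROM footprints AS b
    JOIN counties AS c
      ON ST_Within(ST_Centroid(b.geom), c.geom)
    WHERE c.geoid = %s
"""


def county_feature_collection(conn, county_geoid):
    """Return a GeoJSON FeatureCollection (as a dict) for one county.

    `conn` is any DB-API connection, e.g. from psycopg2.connect(...).
    """
    cur = conn.cursor()
    try:
        # The county id is passed as a query parameter, never formatted
        # into the SQL string by hand.
        cur.execute(COUNTY_SQL, (county_geoid,))
        features = [
            {"type": "Feature", "properties": {}, "geometry": json.loads(row[0])}
            for row in cur.fetchall()
        ]
    finally:
        cur.close()
    return {"type": "FeatureCollection", "features": features}
```

With a real connection, the result could be written out with json.dump, one file per county id. For very large counties, fetching all rows at once may use a lot of memory, so a server-side cursor or batched fetches would likely be a better fit.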

My steps are detailed in the Jupyter notebook in this GitHub repository. Even if you don’t program in Python, you can read through the notebook to better understand the challenges of this task as well as its potential shortcomings. And remember, this is a draft! It ain't pretty, nor is it perfect, but it works! You can check out the processing results for California counties in this Google Drive folder. Given that this is the holiday season, I call it the ugly sweater of workflows. Feel free to send me an email if you have suggested improvements for any of the steps!

Happy Holidays!




Patty Frontiera

Dr. Patty Frontiera is the D-Lab Data Services Lead and a geospatial data scientist. She is the official campus representative for ICPSR, the Roper Center, and the Census State Data Center network, and serves as the Co-Director of the Berkeley Federal Statistical Research Data Center (FSRDC). Patty also develops the geospatial workshop curriculum, teaches workshops, and consults on geospatial topics. Patty has been with the D-Lab since 2014 and served as the Academic Coordinator through Spring 2017. Patty received her Ph.D.