Data Science

Digitization of Historical Maps in the Age of AI

December 3, 2025
by Elena Stacy. Researchers today increasingly have access to a wealth of tools to streamline or automate labor-intensive data processing and generation tasks. When it comes to mapping, progress has been slower. This blog post details the author's experience tackling the digitization of a historical map in the age of AI.

A Practical Guide to Shift-Share Instruments (and What I Learned Replicating the China Shock)

November 26, 2025
by Jiayu Lai. Shift-share instruments are among the most widely used tools in applied economics, appearing in labor, trade, immigration, and policy evaluation research. But despite their popularity, many researchers still use them as black boxes — and risk invalid instruments as a result. In this blog post, I unpack how shift-share IVs actually work, why their validity depends on both the “shifts” and the “shares,” and what practical steps researchers should take to check assumptions. I also walk through how I used the Borusyak–Hull–Jaravel (2022, 2025) framework to reproduce the seminal Autor, Dorn, and Hanson (2013) China shock analysis.

Sahiba Chopra

Data Science Fellow 2024-2025
Haas School of Business

I'm a PhD student in the Management and Organizations (Macro) group at Berkeley Haas. I have a diverse professional background, primarily as a data scientist across numerous industries, including fintech, cleantech, and media. I hold a BA in Economics from the University of Maryland, an MS in Applied Economics from the University of San Francisco, and an MS in Business Administration from UC Berkeley.

My research focuses on the intersection of inequality, technology, and the labor market. I am particularly interested in understanding how to reduce inequality in...

A Participant-Centered, GIS-Based Approach to Improving Contextual Measurement

November 19, 2025
by Sarah Daniel. Researchers increasingly recognize that neighborhoods profoundly shape life outcomes, yet measuring them remains challenging. A common approach uses administrative boundaries, such as census tracts, as proxies for neighborhoods, but this method presents three key challenges. First, administrative boundaries may fail to capture residents’ lived experiences, a limitation that is particularly concerning in marginalized communities; second, they can misrepresent contextual effects; and third, they may produce inconsistent findings. To address these issues, I advocate for the use of self-defined neighborhood boundaries as an alternative measure. I compare GIS- and non-GIS-based methods and propose that GIS-based methods offer the strongest potential for more valid measurement.

Beyond the Hype: How We Built AI Tools That Actually Support Learning

November 12, 2025
by Weiying Li. What does genuine partnership look like when building AI for education? Working with middle school teachers and computer scientists, we co-designed AI dialogs in which teachers are valued contributors who refine what the AI treats as valuable thinking. Through iterative refinement, teachers identified precursor ideas and observations that predicted future learning, and refined the design of guidance in the dialog. Our AI dialog sees learning the way teachers do, built through genuine collaboration in which model development, learning sciences theories, and teachers' classroom expertise work together from the start, not just at the end.

In Silico Approach to Mining Viral Sequences from Bulk RNA-Seq Data

October 28, 2025
by Carly Karrick. Viruses play important roles in evolution and influence ecosystems and host health. However, isolating and studying them can be difficult. Rather than relying on resource-intensive methods to concentrate viruses into a “virome,” bulk sequencing captures data from all biological entities present in a sample. In this tutorial, we explore an approach to mine viral sequences from publicly available bulk RNA-Seq data. The output from this analysis paves the way for future statistical analyses comparing viral communities in different contexts. This approach can be applied to other datasets, including studies of human health.

A brief primer on Hidden Markov Models

April 25, 2022
by Amy Van Scoyoc. For many data science problems, there is a need to estimate unknown information from a sequence of observed events. There are many ways to tackle these types of sequential input problems. In the data science world, there is a tendency to use machine learning approaches to search for relations in the dataset. But in many cases, we don’t have enough data or the sequences are too long to train RNNs effectively. In such cases, simpler is better. Enter the Hidden Markov Model.

How to Get Involved in Computing Research as an Undergrad at UC Berkeley

October 15, 2025
by Abby O'Neill. Are you an undergrad interested in getting involved in CS/DS research? This blog post gives some advice for navigating the Berkeley research landscape. It includes mentions of structured programs like DARE, URAP, and Data Science Discovery, as well as cold emailing strategies and using office hours effectively. The main takeaway: Know your why, don't filter yourself out, and focus on finding people and projects that align with your goals.

Python Introduction to Machine Learning: Parts 1-2

October 21, 2021, 1:00pm
This workshop introduces students to scikit-learn, the popular machine learning library in Python, as well as the auto-ML library built on top of scikit-learn, TPOT. The focus will be on scikit-learn syntax and available tools to apply machine learning algorithms to datasets. No theory instruction will be provided.

R Fundamentals: Parts 1-4

October 25, 2021, 9:00am
This workshop is a four-part introductory series that will teach you R from scratch with clear introductions, concise examples, and support documents. You will learn how to download and install the open-source RStudio software, understand data and basic manipulations, import and subset data, explore and visualize data, and understand the basics of automation in the form of loops and functions. After completing this workshop, you will have the foundational understanding needed to create, organize, and use workflows for your personal research.