Visualization

Mapping Census Data with tidycensus

November 6, 2023
by Alex Ramiller. The U.S. Census Bureau provides a rich source of publicly available data for a wide variety of research applications. However, the traditional process of downloading these data from the census website is slow and cumbersome. The R package “tidycensus” gives researchers a tool to overcome these challenges, providing a streamlined process for quickly downloading numerous datasets directly from the census API (Application Programming Interface). This blog post provides a basic workflow for using the tidycensus package, from installing the package and identifying variables to efficiently downloading and mapping census data.

Hate Speech

The hate speech measurement project began in early 2017 at UC Berkeley’s D-Lab. Our research project applies data science techniques such as machine learning to track changes in hate speech over time and across social media platforms. After three years, we have now published our groundbreaking method that measures hate speech with precision while mitigating the influence of human bias. Read the manuscript here.

Introduction to Item Response Theory

October 24, 2023
by Mingfeng Xue. Measurements (e.g., tests, surveys, questionnaires) inevitably involve various sources of error. Among many psychometric theories, item response theory (IRT) stands out for its capacity for detailed analysis at the item level and its potential to reduce some measurement error. This post first discusses the limitations of conventional sum and average scores, which motivate IRT models, and then introduces a basic form of the Rasch model, including its expression, its underlying assumptions, some of its advantages, and software packages. Some code is also provided.
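As a rough illustration of the Rasch model mentioned above (not the code from the post itself), the probability that a person with ability theta answers an item of difficulty b correctly is exp(theta - b) / (1 + exp(theta - b)). The sketch below is a minimal Python version; the function name and the ability and difficulty values are hypothetical, chosen only for illustration.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta: person ability, b: item difficulty (both on the logit scale).
    Equivalent to exp(theta - b) / (1 + exp(theta - b)).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical abilities and difficulties, for illustration only.
abilities = [-1.0, 0.0, 1.0]
difficulties = [-0.5, 0.0, 0.5]

for theta in abilities:
    # One row per person: probability of success on each item.
    row = [round(rasch_probability(theta, b), 3) for b in difficulties]
    print(f"theta={theta:+.1f}: {row}")
```

When ability equals difficulty the probability is exactly 0.5, and it rises as ability exceeds difficulty, which is the core idea behind placing people and items on one common scale.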

Using Forest Plots to Report Regression Estimates: A Useful Data Visualization Technique

October 17, 2023
by Sharon Green. Regression models help us understand relationships between two or more variables. In many cases, results are summarized in tables that present coefficients, standard errors, and p-values. Reading these can be a slog. Figures such as forest plots can help us communicate results more effectively and may lead to a better understanding of the data. This blog post is a tutorial on two different approaches to creating high-quality and reproducible forest plots, one using ggplot2 and one using the forestplot package.
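To give a sense of what a forest plot encodes, here is a small sketch in plain Python rather than the ggplot2 or forestplot approaches the post covers. It assumes the common normal-approximation 95% interval (estimate ± 1.96 × standard error) and draws each coefficient as a line of text. The variable names and the coefficient values are made up for illustration and do not come from any real model.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Normal-approximation interval: estimate +/- z * standard error."""
    return (estimate - z * std_error, estimate + z * std_error)


def to_column(value, lo, hi, width):
    """Map a value to a column index in [0, width - 1], clipping to [lo, hi]."""
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (width - 1))


def forest_line(estimate, std_error, lo=-1.0, hi=1.0, width=41):
    """Render one coefficient as text: '-' spans the interval, 'o' marks the estimate."""
    lower, upper = confidence_interval(estimate, std_error)
    chars = [" "] * width
    for col in range(to_column(lower, lo, hi, width),
                     to_column(upper, lo, hi, width) + 1):
        chars[col] = "-"
    chars[to_column(0.0, lo, hi, width)] = "|"  # reference line at zero
    chars[to_column(estimate, lo, hi, width)] = "o"
    return "".join(chars)


# Hypothetical (label, coefficient, standard error) rows, for illustration only.
rows = [("age", 0.30, 0.10), ("income", -0.15, 0.20), ("education", 0.55, 0.12)]

for label, est, se in rows:
    print(f"{label:>10} {forest_line(est, se)}")
```

An interval that crosses the zero reference line corresponds to a coefficient that is not statistically distinguishable from zero at the 5% level, which is the comparison a forest plot lets readers make at a glance.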

Twitter Text Analysis: A Friendly Introduction, Part 2

March 7, 2023
by Mingyu Yuan. This blog post is the second part of “Twitter Text Analysis”. The goal is to use language models such as BERT to build a classifier on tweets. Word embedding, training and test splitting, model implementation, and model evaluation are introduced in this post.

Can Machine Learning Models Predict Reality TV Winners? The Case of Survivor

March 14, 2023
by Kelly Quinn. Reality television shows are notorious for tipping the scales to favor certain players they want to see win, but could producers also be spoiling the results in the process? Drawing on data about Survivor, I attempt to predict the likelihood of a contestant making it far into the game based on editing and production decisions, as well as demographic information. This post describes the model used to classify player outcomes and other potential ways to leverage data about reality TV shows for prediction.

From paper to vector: converting maps into GIS shapefiles

April 11, 2023
by Madeleine Parker. GIS is incredibly powerful: you can transform, overlay, and analyze data with a few clicks. But sometimes the challenge is getting your data into a form you can use with GIS. Have you ever found a PDF or even paper map of what you needed? Or googled your topic with “shapefile” after it to no avail? The process of transforming a PDF, paper, or even hand-drawn map with boundaries into a shapefile for analysis is straightforward but involves a few steps. I walk through the stages of digitization, georeferencing, and drawing, from an image to a vector shapefile ready to be used for visualization and spatial analysis.

Mapping Time-Series Satellite Images with Google Earth Engine API

July 17, 2023
by Meiqing Li. Remote sensing imagery has the potential to reveal land use patterns and human activities at a planetary scale. For example, nighttime light intensity extracted from satellite imagery can shed light on spatial patterns of human activities and settlements, especially in places where traditional data are scarce. This blog post introduces Google Earth Engine (GEE) as a general-purpose tool to extract time-series remote sensing data from the GEE data catalog. I walk through using GEE to obtain data, filter by time and geographic region, and visualize it on static and interactive maps.

Twitter Text Analysis: A Friendly Introduction

October 25, 2022

Read part 2 here.

Introduction

Text analysis techniques, including sentiment analysis, topic modeling, and named entity recognition, have been increasingly used to probe patterns in a variety of text-based documents, such as books, social media posts, and others. This blog post introduces Twitter text analysis, but is not intended to cover all of the aforementioned topics. The tutorial is broken down into two parts. In this very first post, I...

Peter Amerkhanian

Graduate Student Researcher (GSR), Instructor
Goldman School of Public Policy (GSPP)

I’m a D-Lab GSR and a graduate student in The Goldman School’s Master of Public Policy/The I School’s Graduate Certificate in Applied Data Science. I have 5 years of experience working on data problems in government and nonprofits. I’m interested in social policy, program evaluation, and computational methods. Python is my principal language, but I’ve developed experience using and teaching a variety of other tools, including R, Excel, Tableau, and JavaScript. I deeply enjoy teaching data science methods and am excited to be a part of the D-Lab.