Blog post

Is your Random Sample Really Random?

January 20, 2022

One common way people run into random numbers is through their research. We often hear the term “random sample,” or a “randomized” assignment to control. Sometimes we randomly select a certain number of rows or columns from a dataset to perform an analysis on a representative snapshot of the data. Additionally, for many of us from a natural science or engineering background, random numbers are often used in simulations or optimization models. Given the wide variety of uses for random numbers in Data Science, I thought it would be interesting to take an...

Big datasets, small code chunks, and why I use Google Earth Engine

December 17, 2021

Have you ever found yourself in the midst of an analysis when suddenly, out of nowhere, it happens? That tiny, dreaded pinwheel appears, indicating that an error has occurred. Yes, that's right, they call it the spinning wheel of death. Your application freezes. Everything fades. Did it save?! You clutch your stress ball, watching helplessly as your computer approaches molten temperatures and begins to sputter uncanny, otherworldly sounds. WHIRRRRRRR. Your fate seems to rest on that...

Working with State-of-the-Art NLP Models: A Friendly Introduction to Hugging Face

December 13, 2021

We often read about the many new advancements being made in the field of Natural Language Processing (NLP). Each month, leading organizations release new models that seem like magic to us, such as models that can write their own code based on user prompts [1] or help answer our queries when we use Google Search [2]. Large AI research groups like OpenAI and Google spend many years and pour millions of...

Working with spatial networks using NetworkX

December 7, 2021

I have always been interested in working with spatial networks. My first introduction to spatial network modeling was in Prof. John Radke’s Geographic Information Systems class, where I learned about building and analyzing spatial networks using the Network Analyst extension in ArcMap. This extension provides powerful tools to solve common network problems, such as finding the best route across a city, finding the closest...

Resisting our Data Doppelgangers: A Proposal for Unpacking the Dangers of Data-Driven Fertility Advertising With Data Science Tools

December 7, 2021

Introduction

When Janet Vertesi, a sociologist of technology and professor at Princeton, learned of her pregnancy, she decided to conduct a personal experiment. She hid her pregnancy from the internet for nine months. This meant only sharing her pregnancy with close friends and family, using her own personal server while making purchases on Amazon, and even opting to use cash for many of her transactions. During this time, Amazon mistook her for a “suspicious customer” (Vertesi 2014, Gray 2014). Recall another incident of how Target found out about a...

Analyzing the Bay Area Commute Network with GeoPandas and NetworkX

February 12, 2021

Hi everyone! I'm one of the D-Lab Data Science Fellows that joined the D-Lab this year. I'm in my second semester of the MCP/MS (City Planning / Transportation Engineering) program. My academic background is actually in Physics, and I did research on radiation detection in urban areas before deciding to come back to school. I hope to bring my physics background and computational skills to the field of urban planning, to better understand and model urban/regional systems using complex systems and computational methods, and to bridge the divide between data science and...

7 Steps to a strong survey tool with Qualtrics

November 30, 2021

When creating a survey for an audience, it is important to make your survey tool accessible, succinct, and understandable. The following 7-step guide gives you a toolkit to improve your survey response and completion rates and to get clear results.

1: Set the stage on your intro page.

Inform the respondent of the purpose of the survey on the title page of your survey. The purpose of this page is to build trust with the audience and provide the necessary information...

A Beginner’s Guide to the Bootstrap

November 22, 2021

What is the bootstrap method?

If you take a quantitative methods course here at Berkeley, chances are that you will learn how to perform a bootstrap. As an introductory data science instructor, it’s one of my favorite topics to teach, not just because it’s a powerful and useful tool, but also because it’s incredibly intuitive. In short, the bootstrap -- also known as resampling with replacement -- allows us to generate a distribution of sample statistics given only a single sample, estimating sampling error. The name of this method...
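
As a rough illustration of the idea in this excerpt, the following minimal sketch resamples a single sample with replacement and uses the spread of the resampled means as an estimate of sampling error. It is not taken from the post itself: the function name `bootstrap_means`, the data values, the number of resamples, and the seed are all assumptions made up for the example.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the means of n_resamples resamples drawn with replacement.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Each resample has the same size as the original sample.
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


# Hypothetical data, invented for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
means = bootstrap_means(sample)

# The standard deviation of the resampled means estimates the
# standard error of the sample mean.
standard_error = statistics.stdev(means)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"bootstrap standard error: {standard_error:.2f}")
```

The same pattern should work for other statistics, such as the median, by swapping out `statistics.mean` inside the helper.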

Stumbling Upon Data Sonification When I Fused My Passion for Music with Coding

November 16, 2021

Like many graduate students from the MIDS program who are also full-time working professionals, I return to campus to seek knowledge and satisfy my intellectual curiosity in information and data science. It has become a part of a lifelong learning pursuit that enables me to constantly apply what I learn back into the real world. Along the way, I never forget that it is also important to have fun with science by combining new knowledge with my own passions in arts and music in whatever ways possible. For nearly a decade, I have been helping clients in...

Rural vs. Urban: Using Python to Explore Legislative Data

November 8, 2021

Before COVID-19, becoming a data scientist was never on my radar. As a policy analyst for the California Research Bureau, a legislative research and reference section of the California State Library, I’ve worked on a variety of projects and requests. For the last 8 years, my work has focused on producing timely, confidential ...