Data Science

Excel Fundamentals: Lookups with INDEX-MATCH-MATCH

April 18, 2022

Last week marked the D-Lab’s inaugural “Excel Fundamentals” workshop, and to celebrate I am sharing one of my favorite Excel functions: INDEX-MATCH-MATCH. By combining the INDEX and MATCH functions, we can create a faster and more flexible lookup than the typical approach with VLOOKUP.

First, let’s explore the INDEX function and its three arguments: INDEX(where, down, across). It returns the value of a single cell within a block of data. It knows which cell we are...
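As a rough illustration of the idea in this excerpt, here is a minimal Python sketch of how an INDEX(where, down, across) lookup can be driven by two MATCH calls. It is not Excel's implementation. The helper names (`index`, `match`, `index_match_match`) and the sample table are assumptions made up for illustration, and only exact matching is modeled.

```python
def index(block, down, across):
    """Mimic INDEX(where, down, across) using 1-based positions, as Excel does."""
    return block[down - 1][across - 1]


def match(value, values):
    """Mimic an exact-match MATCH: return the 1-based position of value."""
    return values.index(value) + 1


# Hypothetical sample data: row labels, column labels, and the block of values.
row_labels = ["Alice", "Bob", "Cara"]
col_labels = ["Q1", "Q2", "Q3"]
block = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]


def index_match_match(row_key, col_key):
    """Find the row and column positions with match, then fetch the cell with index."""
    down = match(row_key, row_labels)
    across = match(col_key, col_labels)
    return index(block, down, across)


result = index_match_match("Bob", "Q3")
print(result)
```

One MATCH finds how far down to go and the other finds how far across, so the lookup keeps working if rows or columns are reordered.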

dbplyr: do we still need to learn SQL to create and manage databases?

April 11, 2022

How do we deal with datasets that are larger than our computer’s memory? Do we still need to learn Structured Query Language (SQL) to create and manage a database?

As an incipient data analyst, one of my first major challenges was to build and manage a spatial database using PostGIS, an open-source extension that adds geographic support to PostgreSQL relational databases. I was given several text files on a hard drive that weighed approximately 10 GB each! My first reaction was to double-click on the first text file that I saw… but this was clearly...

What can state government do…to attract a data scientist like YOU?

March 29, 2022

By Kellie Hogue

What’s your next move? When I was in grad school, one of my professors told me that regardless of the job I am currently in, I should always be planning the next step in my career.

At the time, it made sense–academic appointments in my discipline were few and far between, and I wouldn’t get one without some major strategic networking and planning. Simply a case of too much supply, not enough demand....

Predicting Madness: This March Madness, you can be your friend group’s resident Bracketologist.

March 7, 2022

On Selection Sunday, a twelve-member NCAA committee kicks off March Madness by picking America’s best college basketball teams. Each year, millions of people build their brackets based on records, school allegiances, favorite colors, and weirdest mascots. The national college basketball event, which pits the top 64 Division I teams in the country against one another in a knockout-style tournament, is one of the largest betting events in sports. Over the course of 68 games, an estimated 40 million bets totaling more than $8.5 billion are placed, both legally and illegally (Odds Shark, 2021). ...

Twitter data extraction with Selenium

March 1, 2022

Introduction

With online communities and social networks serving as important sites for computational social science research, Twitter has quickly become a popular data source for researchers (Frey et al., 2020; Kusen et al., 2017; Rao et al., 2010; Ru et al., 2021). This blog post will demonstrate one way to extract Twitter data without using the Twitter API. This is especially useful for researchers who are new to exploring the use of Twitter data in their research, looking to develop a baseline corpus for a research question they are newly...

Getting Started with the NYT API

March 1, 2022

Introduction

The web is chock full of valuable troves of data that can spawn an infinite number of social science research projects. However, not all data is easily accessible! While some data can be easily downloaded, access to other sources of data is dictated by what is known as an API. Short for application programming interface, an API is a set of defined protocols governing the terms of access to software and servers from programs created...

Enumeration of Informal Work

March 1, 2022

The first time that I mapped out poverty statistics at a municipal scale, my mind was completely blown (figure 1). Looking at the spatial inequities from a bird’s-eye view drove my desire to find more granular data on social indicators to better understand intra-urban socioeconomic inequities. Spatial data techniques help us find patterns and anomalies across data that improve our understanding of people’s lives in cities, raising new questions about urban infrastructure in terms of public goods provision, land use, and access. However, finding granular socioeconomic...

Ian Castro

D-Lab Alumni
School of Information

Ian is a graduate student in the Master of Information Management and Systems program at the School of Information with a focus on applied data science. He earned his B.A. in Media Studies and B.S. in Microbial Biology from UC Berkeley, and his research interests and work experience are in STEM education. He focuses on building courses and academic programs to make data and computing accessible to historically marginalized students and those without prior exposure to the field.

PoliPy: A Python Library for Scraping and Analyzing Privacy Policies

February 8, 2022

In light of recent scandals involving the misuse and improper handling of personal data by large corporations, advocacy groups and regulators alike have given increased attention to the issue of consumer privacy [e.g., 1, 2, 3, 4, 5]. National and local governments have been enacting privacy legislation that requires companies to minimize the amount of data they collect, deters the collection of sensitive data, limits the purposes for which the data are used, and critically, gives users more transparency into data collection and use.

As part...