Data Manipulation and Cleaning

R Data Wrangling and Manipulation

September 17, 2021, 2:00pm
It is said that 80% of data analysis is spent on the process of cleaning and preparing the data for exploration, visualization, and analysis. This R workshop will introduce the dplyr and tidyr packages to make data wrangling and manipulation easier. Participants will learn how to use these packages to subset and reshape data sets, do calculations across groups of data, clean data, and other useful tasks.

Python Data Wrangling and Manipulation with Pandas

November 1, 2021, 12:00pm
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with 'relational' or 'labeled' data both easy and intuitive. It enables doing practical, real world data analysis in Python. In this workshop, we'll work with example data and go through the various steps you might need to prepare data for analysis.

GPT Fundamentals

April 17, 2024, 3:00pm
This workshop offers a general introduction to the GPT (Generative Pretrained Transformers) model. We will explore how they reflect and shape our cultural narratives and social interactions, and which drawbacks and constraints they have.

Python Web Scraping & APIs

June 29, 2022, 1:00pm
In this workshop, we cover how to extract data from the web using Python. We focus on two approaches to extracting data from the web: leveraging application programming interfaces (APIs) and web scraping.

Python Web APIs

March 14, 2023, 2:00pm
In this workshop, we cover how to extract data from the web with APIs using Python. APIs are often official services offered by companies and other entities, which allow you to directly query their servers in order to retrieve their data. Platforms like The New York Times, Twitter and Reddit offer APIs to retrieve data.

Grace Hu

Data Science for Social Justice Fellow 2024
Bioengineering

Grace is a 3rd year Bioengineering PhD candidate in the joint UC Berkeley-UCSF Graduate Program. Her research lies at the nexus of computational design and 3D-bioprinting to advance tissue engineering for regenerative medicine. She previously studied Materials Science and Engineering (B.S.) and Computer Science (M.S.) at Stanford University, where she investigated printable batteries to power an ultra-affordable scanning electron microscope and explored computer science education research by developing AI models to augment teaching ability.

In her free time she...

Hugh Kadhem

Mathematics

Hugh Kadhem is a Ph.D. student in Applied Mathematics, with broad research interests in computational quantum physics and high-performance scientific computing.

Introduction to Propensity Score Matching with MatchIt

April 1, 2024
by Alex Ramiller. When working with observational (i.e. non-experimental) data, it is often challenging to establish the existence of causal relationships between interventions and outcomes. Propensity Score Matching (PSM) provides a powerful tool for causal inference with observational data, enabling the creation of comparable groups that allow us to directly measure the impact of an intervention. This blog post introduces MatchIt – a software package that provides all of the necessary tools for conducting Propensity Score Matching in R – and provides step-by-step instructions on how to conduct and evaluate matches.

What Are Vowels Made Of? Graphing a Classic Dataset with R

February 13, 2024
by Anna Björklund. Vowels are all around us. Mainstream US English has around twelve unique vowels. How can our brains tell these sounds apart? This blog post will help you answer this question by plotting vowel data from a classic American English dataset by Peterson and Barney (1952).

How can we use big data from iNaturalist to address important questions in Entomology?

February 26, 2024
by Leah Lee. Large-scale geographic data over time on insect diversity can be used to answer important questions in Entomology. Open-source, open-access citizen science platforms like iNaturalist generate huge amounts of data on species diversity and distribution at accelerating rates. However, unstructured citizen science data contain inherent biases and need to be used with care. One of the efforts to validate big data from iNaturalist is to cross-check with systematically collected data, such as museum specimens.