Data Science

FSRDC 2023 Annual Meeting and Research Conference

October 2, 2023
by Renee Starowicz. Renee Starowicz, Co-Executive Director of the Berkeley Federal Statistical Research Data Center, shares the takeaways from the 2023 Annual Federal Statistical Research Data Center Business Meeting and Annual Conference. She gives a brief overview of the Berkeley FSRDC, then describes the priorities for collaboration across national directors to improve transparency and outreach to diverse researchers. She also points out the other key topics of conversation at this year’s meeting.

Adopting Data Science Curricula: A Student Centric Evaluation

Susan Wang
Vandana Janeja
David Harding, Ph.D.
Claudia von Vacano, Ph.D.
Daniel Lobo
2023

With the advent of data science as a new discipline with high demand for a skilled workforce, educators are increasingly recognizing the value of translating courses and programs that have been shown to be successful and sharing lessons learned in increasing diversity in data science education. In this paper, we describe and analyze our experiences translating a lower-division data science curriculum from one university, University of California, Berkeley’s Data8 course, to other settings with very different student populations and institutional contexts at University of Maryland,...

Critical Faculty and Peer Instructor Development: Core Components for Building Inclusive STEM Programs in Higher Education

Claudia von Vacano, Ph.D.
Michael Ruiz
Renee Starowicz, Ph.D.
Seyi Olojo
Arlyn Y. Moreno Luna
Evan Muzzall, Ph.D.
Rodolfo Mendoza-Denton, Ph.D.
David Harding, Ph.D.
2022

First-generation college students and those from ethnic groups such as African Americans, Latinx, Native Americans, or Indigenous Peoples in the United States are less likely to pursue STEM-related professions. How might we develop conceptual and methodological approaches to understand instructional differences between various undergraduate STEM programs that contribute to racial and social class disparities in psychological indicators of academic success such as learning orientations and engagement? Within social psychology, research has focused mainly on student-level mechanisms...

Testing for Measurement Invariance using Lavaan (in R)

February 7, 2023
by Enrique Valencia López. Establishing measurement invariance has increasingly become a prerequisite for examining whether survey items that measure an underlying concept have the same meaning across different cultural and linguistic groups. While there are different ways to examine measurement invariance, the most common approach is a method known as Multigroup Confirmatory Factor Analysis (MGCFA). In this blog post, I discuss how to conduct an MGCFA using lavaan in R and the different levels needed to establish measurement invariance.

Twitter Text Analysis: A Friendly Introduction, Part 2

March 7, 2023
by Mingyu Yuan. This blog post is the second part of “Twitter Text Analysis”. The goal is to use language models such as BERT to build a classifier on tweets. Word embedding, training and test splitting, model implementation, and model evaluation are introduced in this post.
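
As a rough illustration of two of the steps named in this teaser (training and test splitting, and model evaluation), here is a minimal sketch in plain Python. It is not taken from the post: the toy tweets, the helper names, and the keyword rule that stands in for a BERT classifier are all assumptions made for illustration.

```python
import random


def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def keyword_classifier(text):
    """Placeholder model (NOT BERT): predicts 1 if the tweet contains 'great'."""
    return 1 if "great" in text.lower() else 0


# Hypothetical toy data: (tweet text, label) pairs invented for illustration.
tweets = [
    ("What a great day", 1),
    ("Great game last night", 1),
    ("This is terrible", 0),
    ("Worst service ever", 0),
    ("great news everyone", 1),
    ("I am so tired", 0),
    ("Feeling great today", 1),
    ("Traffic was awful", 0),
]

# Hold out a quarter of the tweets, then score the placeholder model on them.
train, test = split_train_test(tweets, test_fraction=0.25, seed=0)
predictions = [keyword_classifier(text) for text, _ in test]
labels = [label for _, label in test]
print(f"train={len(train)} test={len(test)} accuracy={accuracy(predictions, labels):.2f}")
```

In a real workflow the placeholder rule would be replaced by a model fitted on the training split only, so that the held-out tweets give an honest estimate of performance.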

Can Machine Learning Models Predict Reality TV Winners? The Case of Survivor

March 14, 2023
by Kelly Quinn. Reality television shows are notorious for tipping the scales to favor certain players they want to see win, but could producers also be spoiling the results in the process? Drawing on data about Survivor, I attempt to predict the likelihood of a contestant making it far into the game based on editing and production decisions, as well as demographic information. This post describes the model used to classify player outcomes and other potential ways to leverage data about reality TV shows for prediction.

Acquiring Genomic Data from NCBI

April 4, 2023
by Monica Donegan. Genomic data is essential for studying evolutionary biology, human health, and epidemiology. Public agencies, such as the National Center for Biotechnology Information (NCBI), offer excellent resources and access to vast quantities of genomic data. This blog post introduces a brief workflow to download genomic data from public databases.

A Brief Introduction to Cloud Native Approaches for Big Data Analysis

March 20, 2023
by Millie Chapman. Satellites, smartphones, and other monitoring technologies are creating vast amounts of data about our earth every day. These data hold promise to provide global insights on everything from biodiversity patterns to human activity at increasingly fine spatial and temporal resolution. But leveraging this information often requires us to work with data that are too big to fit in our computer's "working memory" (RAM) or even to download to our computer's hard drive. In this post, I walk through tools, terms, and examples to get started with cloud native workflows. These workflows allow us to remotely access and query large data from online resources or web services, all while skipping the download step!

From paper to vector: converting maps into GIS shapefiles

April 11, 2023
by Madeleine Parker. GIS is incredibly powerful: you can transform, overlay, and analyze data with a few clicks. But sometimes the challenge is getting your data into a form to be able to use with GIS. Have you ever found a PDF or even paper map of what you needed? Or googled your topic with “shapefile” after it to no avail? The process of transforming a PDF, paper, or even hand-drawn map with boundaries into a shapefile for analysis is straightforward but involves a few steps. I walk through the stages of digitization, georeferencing, and drawing, from an image to a vector shapefile ready to be used for visualization and spatial analysis.