Data Science

Acquiring Genomic Data from NCBI

April 4, 2023
by Monica Donegan. Genomic data is essential for studying evolutionary biology, human health, and epidemiology. Public agencies, such as the National Center for Biotechnology Information (NCBI), offer excellent resources and access to vast quantities of genomic data. This blog post introduces a brief workflow to download genomic data from public databases.

A Brief Introduction to Cloud Native Approaches for Big Data Analysis

March 20, 2023
by Millie Chapman. Satellites, smartphones, and other monitoring technologies are creating vast amounts of data about our earth every day. These data hold promise to provide global insights on everything from biodiversity patterns to human activity at increasingly fine spatial and temporal resolution. But leveraging this information often requires us to work with data that is too big to fit in our computer's "working memory" (RAM) or even to download to our computer's hard drive. In this post, I walk through tools, terms, and examples to get started with cloud native workflows. These workflows allow us to remotely access and query large data from online resources or web services, all while skipping the download step!

From paper to vector: converting maps into GIS shapefiles

April 11, 2023
by Madeleine Parker. GIS is incredibly powerful: you can transform, overlay, and analyze data with a few clicks. But sometimes the challenge is getting your data into a form you can use with GIS. Have you ever found a PDF or even paper map of what you needed? Or googled your topic with “shapefile” after it to no avail? The process of transforming a PDF, paper, or even hand-drawn map with boundaries into a shapefile for analysis is straightforward but involves a few steps. I walk through the stages of digitization, georeferencing, and drawing, from an image to a vector shapefile ready to be used for visualization and spatial analysis.

Why We Need Digital Hermeneutics

July 13, 2023
by Tom van Nuenen. Tom van Nuenen discusses the sixth iteration of his course named Digital Hermeneutics at Berkeley. The class teaches the practices of data science and text analysis in the context of hermeneutics, the study of interpretation. In the course, students analyze texts from Reddit communities, focusing on how these communities make sense of the world. This task combines both close and distant readings of texts, as students employ computational tools to find broader patterns and themes. The article reflects on the rise of AI language models like ChatGPT, and how these machines interpret human interpretations. The popularity and profitability of language models present an issue for the future of open research, due to the monetization of social media data.

Introduction to Field Experiments and Randomized Controlled Trials

July 24, 2023
by Leena Bhai. This blog post provides an introduction to field experimentation and its significance in understanding cause and effect. It explains how randomized experiments represent an unbiased method for determining what works. It delves into essential features of experiments such as intervention, excludability, and non-interference. It then works through a fictional example of a randomized controlled trial of the efficacy of an experimental drug, Covi-Mapp.
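
To make the "unbiased method" claim concrete, here is a minimal sketch in Python of a simulated randomized experiment. It is not taken from the post itself: the function name, the outcome distribution, and the built-in treatment effect of 2.0 are all illustrative assumptions. It shows that when units are assigned to treatment at random, the simple difference in group means recovers the true effect on average.

```python
import random
import statistics


def simulate_rct(n=1000, true_effect=2.0, seed=0):
    """Simulate one randomized trial and return the difference in means.

    All numbers here are hypothetical; this is an illustration, not the
    fictional Covi-Mapp example from the post.
    """
    rng = random.Random(seed)

    # Untreated (baseline) outcome for every unit, drawn from an assumed
    # normal distribution with mean 10 and standard deviation 3.
    baseline = [rng.gauss(10.0, 3.0) for _ in range(n)]

    # Random assignment: shuffle the unit ids and treat the first half.
    ids = list(range(n))
    rng.shuffle(ids)
    treated = set(ids[: n // 2])

    # Treated units receive the baseline outcome plus the true effect.
    treated_outcomes = [baseline[i] + true_effect for i in range(n) if i in treated]
    control_outcomes = [baseline[i] for i in range(n) if i not in treated]

    # Difference-in-means estimate of the average treatment effect.
    return statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)


# Any single trial is noisy, but the average over many simulated trials
# should sit close to the true effect built into the simulation.
estimates = [simulate_rct(seed=s) for s in range(200)]
average_estimate = statistics.mean(estimates)
```

Individual trials will land above or below 2.0 by chance; unbiasedness is a statement about the average across repeated randomizations, which is what `average_estimate` approximates.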

Mapping Time-Series Satellite Images with Google Earth Engine API

July 17, 2023
by Meiqing Li. Remote sensing imagery has the potential to reveal land use patterns and human activities at a planetary scale. For example, nighttime light intensity extracted from satellite imagery can shed light on spatial patterns of human activities and settlements, especially in places where traditional data are scarce. This blog post introduces Google Earth Engine (GEE) as a general purpose tool to extract time-series remote sensing data from the GEE data catalog. I walk through using GEE to obtain data, filter by time and geographic region, and visualize it on static and interactive maps.

D-Lab & Graduate Division create inclusive data science summer program

August 9, 2023
by Vanessa Navarro Rodriguez. UC Berkeley's Social Sciences D-Lab and Graduate Division created the Data Science for Social Justice Program to address underrepresentation in data science. The program teaches diverse students critical data analysis and its applications in addressing societal injustices. The 8-week free summer course for admitted University of California students focuses on Python programming, Natural Language Processing, and value-informed data practices. It aims to empower students from underrepresented backgrounds and to bridge STEM with social justice. This blog post elaborates on the program's creation and features one of the DSSJ students, Robin López, and his reasons for participating.

The Geography of Cannabis: Does California’s dual licensing program (de)criminalize cannabis and drive unnecessary anthropogenic activity in remote rural environments?

August 29, 2023
by Chevon Holmes. When California voters (de)criminalized cannabis production, the state’s dual licensure requirement forced local jurisdictions to create permitting programs or uphold prohibition. Many counties developed ersatz zoning ordinances to regulate cannabis activities and hired staff to administer local permits. As an inspector, administrator, and project planner for Mendocino County from 2017 to 2021, I visited hundreds of cultivation sites and production facilities where I learned first-hand how two legal pathways impacted the ways in which operators could transition their businesses. This post details a dataset created to track, aggregate, and analyze the relationship between cannabis infrastructure and licensing.

Unlock the Joy and Power of Reading in Language Learning

August 21, 2023
by Bowen Wang-Kildegaard. I share my story of how reading for pleasure transformed my English speaking and writing skills. This experience inspired my passion to promote the joy and power of reading to all language learners. Using natural language processing techniques, I dive into the Language Learning subreddit, revealing a trend: Learners are often highly anxious about output practices, but are generally positive about input methods like reading and listening. I then distill complex language learning theories into actionable language learning tips, emphasizing the value of extensive reading for pleasure, pointing to potential methods like using ChatGPT for customization of reading materials, and advocating for joy in the learning journey.

My Summer Exploring Data Science for Social Justice: Learnings, Tensions & Recommendations

September 5, 2023
by Genevieve Smith. This summer I joined the D-Lab-hosted Data Science for Social Justice workshop at UC Berkeley, diving into Python – including TF-IDF, sentiment analysis, word embeddings, and more – with a lens towards leveraging data science for social justice. My team explored a Reddit channel on abortion and used computational analysis to answer key questions about abortion access before versus after Roe v. Wade was overturned. Computational social science is incredibly powerful, but I continue to grapple with tensions, particularly around employing machine learning and large language models in international research, and I end with key recommendations for CSS practitioners.
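
For readers unfamiliar with TF-IDF, one of the techniques named above, here is a minimal sketch in plain Python. It is an assumption-laden illustration rather than the workshop's code: the `tf_idf` function and the three toy sentences are made up, and it uses the textbook unsmoothed formula (term frequency times log of inverse document frequency), whereas libraries such as scikit-learn apply a smoothed variant by default.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (textbook TF-IDF)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Term frequency scaled by log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, not real Reddit data.
docs = ["the clinic closed", "the clinic opened", "the law changed"]
scores = tf_idf(docs)
```

In this toy corpus, "the" appears in every document and so scores zero, while "closed" appears in only one document and scores higher than "clinic", which appears in two. That is the sense in which TF-IDF surfaces the words that distinguish one text from the rest.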