The D-Lab has partnered with several organizations, both on and off campus, to collaborate on high-impact research projects. If you're interested in working with us, please email dlab@berkeley.edu.

Our project seeks to help explain why many displaced persons from Syria, Afghanistan and other war-torn regions fail to access critical protections and benefits in host states. We are currently building an online platform to fill information gaps and correct the types of miscommunication that prompt refugees to distrust host governments and their services. The website will feature intuitive, interactive data visualization tools to help local stakeholders visualize important trends thematically, spatially, and temporally. Synthesizing information in this way provides a critical resource to asylum seekers, governments and aid organizations, as communication barriers have impeded effective refugee policy. The results of this work could serve as an exemplar for efforts to improve the large communication gap separating asylum seekers and host communities and combat the influence of web-based misinformation campaigns.

- For an article providing background research, please click here.

- Visit our website at digitalrefuge.berkeley.edu


Boalt Law School; BIMI
Katerina Linos, Law School (PI); Laura Jakli; Patty Frontiera, D-Lab; Melissa Carlson; Jasmijn Slotjes, D-Lab/BIMI

The College Futures Foundation seeks to better understand and characterize feeder patterns among educational settings and segments as California students move from high school through receipt of their four-year degrees. D-Lab data scientists are working with College Futures to gather, process, and visualize educational data in order to better target mechanisms for improving college-going and completion, and to aid efforts to understand common educational pathways in various California geographic regions.

College Futures Foundation
Jon Stiles; Patty Frontiera; Chris Hench

Please visit: https://hatespeech.berkeley.edu/

Measuring hate speech by integrating ordinal, multitask deep learning with faceted Rasch modeling

Outcome phenomena are typically measured at the binary level: a comment is toxic or not, an MRI scan shows cancer or is clear, a patient is diagnosed as having a disease or not. But underlying that dichotomization there is often a continuous spectrum or latent variable. Physical quantities such as temperature and weight can be measured as interval variables where magnitudes are meaningful. How can we achieve that same interval measurement for arbitrary outcomes - creating continuous scales with magnitudes?

We propose a general method for measuring complex phenomena as continuous, interval variables by integrating deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT). We decompose the target construct into multiple constituent components that are labeled as ordinal survey items, and are transformed via an IRT nonlinear activation into a high-quality continuous measure. We estimate the survey interpretation bias of the human labelers and eliminate its influence on the final construct when creating a training dataset, which removes labeling bias and supersedes the use of inter-rater reliability as a quality diagnostic. We further estimate the response quality of each individual labeler using faceted IRT, allowing responses from low-quality labelers to be removed.
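To make the labeler-adjustment idea concrete, here is a minimal sketch of how a faceted Rasch model can assign probabilities to the categories of one ordinal survey item. It assumes a rating-scale parameterization in which the log-odds of moving up one category equal the comment's latent score minus the item difficulty, the labeler's severity, and a step threshold. The function names, parameter names, and numeric values are hypothetical illustrations, not the project's actual specification.

```python
import math


def faceted_rasch_probs(theta, item_difficulty, rater_severity, thresholds):
    """Category probabilities for one ordinal response (categories 0..K).

    Assumed rating-scale form: the adjacent-category log-odds for step j are
    theta - item_difficulty - rater_severity - thresholds[j].
    """
    # Category 0 corresponds to an empty sum of step logits.
    logits = [0.0]
    for tau in thresholds:
        step = theta - item_difficulty - rater_severity - tau
        logits.append(logits[-1] + step)
    # Softmax over the cumulative step logits (shifted for numerical stability).
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Expected ordinal rating implied by the category probabilities."""
    return sum(k * p for k, p in enumerate(probs))


if __name__ == "__main__":
    thresholds = [-1.0, 0.0, 1.0]  # hypothetical step thresholds (4 categories)
    # Same comment and item, rated by a lenient and a severe labeler.
    lenient = faceted_rasch_probs(1.0, 0.0, -0.5, thresholds)
    severe = faceted_rasch_probs(1.0, 0.0, 0.5, thresholds)
    print(expected_score(lenient), expected_score(severe))
```

Under this assumed form, two labelers rating the same comment differ only through their severity term, which is why an estimated severity can be separated from the comment's latent score instead of contaminating it.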

Our faceted Rasch scaling procedure corresponds naturally to a multitask, weight-sharing deep learning architecture in which our theorized components of the target outcome are used as supervised, ordinal latent variables for the neural networks’ internal concept learning, improving sample efficiency and promoting generalizability. The architecture combines a proportional odds activation function and quadratic weighted kappa loss function designed for ordinal outcomes. This leads to a new form of model explanation because each continuous prediction can be directly explained by the constituent ordinal components in the penultimate layer.
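The two ordinal building blocks named above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the proportional odds activation is written in its standard cumulative-logit form with hypothetical cutpoints, and quadratic weighted kappa is computed as an agreement statistic on hard labels, whereas a training loss would need a differentiable variant operating on predicted probabilities.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def proportional_odds_probs(z, cutpoints):
    """Map a scalar score z to ordinal class probabilities.

    Assumes increasing cutpoints c_1 < ... < c_{K-1} and
    P(Y <= k) = sigmoid(c_k - z); class probabilities are the differences
    between consecutive cumulative probabilities.
    """
    cumulative = [sigmoid(c - z) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two integer label sequences with quadratic penalties.

    Undefined (division by zero) when both sequences use a single, identical class.
    """
    n = len(y_true)
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    disagreement, chance = 0.0, 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[i][j]
            chance += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - disagreement / chance


if __name__ == "__main__":
    cutpoints = [-1.0, 0.0, 1.0]  # hypothetical cutpoints for 4 ordered classes
    print(proportional_odds_probs(0.3, cutpoints))
    print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 3, 3], 4))
```

Because a single score z is shared across all cutpoints, raising z shifts probability mass toward higher categories in order, which is what makes this activation suited to ordinal outcomes.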

We demonstrate our method on a new dataset of 50,000 social media comments labeled to measure a spectrum from hate speech to counterspeech, and sourced from YouTube, Twitter, and Reddit. We evaluated Universal Sentence Encoders, BERT, RoBERTa, ALBERT, and T5 as contextual representation models for the comment text, and benchmarked our predictive accuracy against Google Jigsaw’s Perspective API models.

Chris Kennedy (Lead), Alexander Sahn, Nora Broege, Claudia von Vacano (PI)

SAGE Publications is partnering with the D-Lab as one of its first data science online course developers and providers of learner support for SAGE Campus. The partnership has yielded a series of modules that introduce applied data science to social scientists. This series of learning modules demystifies the tools and methods of an emerging field that is changing the way we collect, process, and analyze information. The learning modules make extensive use of interactive programming in Jupyter notebooks, a part of the larger Jupyter Project. We use a JupyterHub to host the materials, and students program directly in the web browser from any device.

SAGE Publications
Claudia von Vacano; Christopher Hench; Geoff Bacon; Adam Anderson; Evan Muzzall; Patty Frontiera; Rebecca Dizon; Keeley Takimoto

The Louisiana Slave Conspiracies Project (LSC) aims to make source materials from two slave conspiracies in 1791 and 1795 accessible to interested researchers. A collaborative multidisciplinary team, led by Professor Bryan Wagner in the UC Berkeley English Department, is developing an interactive digital archive of these materials. Our focus will be the testimonies taken from slaves and their allies in these conspiracies. We are transcribing, translating, tagging, and collating these testimonies, along with other archival documents related to the conspiracies, for the first time. We want to ground our project in the aspirations and actions of the enslaved, as they are described in their own words.

Patty Frontiera, D-Lab geospatial data scientist, and Stacy Reardon, the Literatures and Digital Humanities Librarian, are working with Professor Wagner to manage a team of postdocs, graduate and undergraduate student researchers, and outside consultants engaged in this endeavor. One goal of the LSC digital archive is to present these French and Spanish manuscripts alongside original transcriptions and English translations. We are developing a website to provide access to these primary materials for scholars and the interested public alike. The site will also feature interactive historical maps that provide location-based access to the collection and help explore essential but still unresolved questions about the organization of social relations and the circulation of ideas in these conspiracies. We plan to use geospatial analysis and network analysis to explore new ways to visualize and gain insight into these historical events.

The Louisiana Slave Conspiracies project is generously supported by the Digital Humanities at Berkeley Collaborative Research grant program. Additional support has come from the Office of Digital Humanities at the National Endowment for the Humanities; the UC Consortium for Black Studies in California; the University of California Humanities Research Institute; the Doreen B. Townsend Center for the Humanities; and the Center for the Gulf South at Tulane University.

Department of English, D-Lab, UC Berkeley Library
Bryan Wagner (PI); Patty Frontiera; Stacy Reardon; Susan Powell; Amani Morrison; Shad Small; Adam Anderson

Trademarks are an important component of companies’ intellectual property. In general, companies register trademarks with the United States Patent and Trademark Office (USPTO) and maintain an active status by paying registration and maintenance fees. However, it is not uncommon for public companies to abandon existing trademarks and register new ones. This process is usually called rebranding or re-trademarking. There are many possible reasons behind companies' rebranding/re-trademarking decisions, such as emphasizing a new product line or market, or evading the impact of negative news. This project aims to pinpoint a few reasons for rebranding/re-trademarking and explore how those reasons may explain how often, and how likely, public companies are to make rebranding decisions. Specifically, this project will combine longitudinal data drawn from the U.S. Patent and Trademark Office's Trademark Electronic Search System (TESS) with company metadata that have been disclosed publicly. For the purpose of data collection, we will also conduct text analysis of trademark applications and public companies’ SEC disclosures.

Berkeley School of Law
Su Li, Law School (PI)