Text Analysis

PoliPy: A Python Library for Scraping and Analyzing Privacy Policies

February 8, 2022

In light of recent scandals involving the misuse and improper handling of personal data by large corporations, advocacy groups and regulators alike have given increased attention to the issue of consumer privacy [e.g., 1, 2, 3, 4, 5]. National and local governments have been enacting privacy legislation that requires companies to minimize the amount of data they collect, deters the collection of sensitive data, limits the purposes for which the data are used, and, critically, gives users more transparency into data collection and use.

As part...

Jennifer Kaplan

Consultant
French

Jennifer is a first-year graduate student in the Romance Languages & Literatures program here at Berkeley. She has experience conducting ethnographic fieldwork and is passionate about qualitative research methods.

Text Analysis for Public Health

October 5, 2021

October 5th, 2021 - another day in the global pandemic. Average Joes are busy tweeting about it, politicians give interviews on the latest plans, and newspapers publish article after article on vaccination levels, case counts, and the booster shot. That’s a ton of information. So much, in fact, that it would be pretty nice to have some computer-assisted help to sort through it. Enter stage right: text analysis. Just what is it, and in the midst of COVID-19, how can it be used to advance public health? Text analysis is a family of analytic techniques used to identify patterns and meaning in unstructured text, that is, text that a computer can’t readily understand. In other words, most qualitative data. And there is a lot of that sort of data floating around. We’re talking tweets, Reddit posts, and emails, but also electronic health records (EHRs), books, and even academic research. You’ll probably agree that in that list alone, there’s a lot of valuable data!
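To make "identifying patterns in unstructured text" a little more concrete, here is a minimal sketch of one of the simplest techniques in the family: counting which terms come up most often across a handful of short posts. The sample posts, the tiny stopword list, and the function names are all invented for illustration, and a real project would likely use a dedicated NLP library rather than a hand-rolled tokenizer.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented for illustration only.
posts = [
    "Got my booster shot today!",
    "Case counts are rising again in my county.",
    "Booster appointments are fully booked this week.",
]

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "again", "are", "got", "in", "is", "my", "the", "this", "today"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(documents):
    """Count how often each non-stopword term appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


counts = term_counts(posts)
print(counts.most_common(3))
```

Even this crude count surfaces "booster" as a recurring theme. Other text analysis methods refine the same basic move of turning free text into something countable.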

Ilya Akdemir

Data Science Fellow
School of Law

Ilya is a JSD candidate at UC Berkeley School of Law. His research focuses on natural language processing and machine learning applications that are motivated by both theoretical and practical questions in the legal domain.

Computational Text Analysis Working Group (CTAWG)

The Computational Text Analysis Working Group (CTAWG) features demos, tutorials, and ongoing projects through which we are learning to use an array of computational text analysis approaches including: topic modeling, TF-IDF, dictionary methods, supervised machine learning, cosine similarity scores, words-to-vectors, grammar parsing, regular expressions, and more. Visit the CTAWG website to learn more.

Check the D-Lab Calendar to view and register for...

Brooks Jessup, Ph.D.

Data Science Fellow
History

Brooks received his Ph.D. in History from UC Berkeley and was trained in Data Science at General Assembly. His work applies digital tools and methods to the study of modern cities and urban issues. At D-Lab, he teaches and consults on data analytics, machine learning, geospatial analysis, and natural language processing with Python and SQL.

Adam Anderson, Ph.D.

Research Training Manager; Postdoc Lecturer
Digital Humanities

I’m an interdisciplinary data scientist, with a background in Middle Eastern languages (Hebrew, Arabic, and historical languages like Sumerian, Akkadian, Assyrian and Babylonian). I’ve worked in Syria, Lebanon, Israel, and Turkey with archaeological sites and museums. My technical skills include: translation and data storytelling, data forensics (3D imaging, mapping, modeling), computational linguistics (CTA, NLP, OCR), and network analysis (SNA). My roles on campus include: Research Training Manager of the Computational Social Science Training Program; Postdoc Lecturer...