Python Text Analysis Fundamentals: Parts 1-2

March 28, 2022, 3:00pm to March 30, 2022, 6:00pm

Trying to register, but not affiliated with the UCB campus? If you are from Berkeley Lab (LBL), UCSF, or CZ Biohub, please register via our partner portals here

If you are from the UCB campus, there is no longer a waitlist! After registering above, please fill out the affiliations form if you have not done so at least once before:

Location: Remote via Zoom. Link will be sent on the morning of the event.

Recordings: This D-Lab workshop will be recorded and made available to UC Berkeley participants for a limited time. Your registration for the event indicates your consent to having any images, comments and chat messages included as part of the video recording materials that are made available.

Date & Time: This workshop is a 2-part series running from 3pm-6pm each day:

  • Part 1: Monday, March 28.
  • Part 2: Wednesday, March 30.

Start Time: D-Lab workshops start 10 minutes after the scheduled start time (“Berkeley Time”). We will admit all participants from the waiting room at that time.


This two-part workshop series will prepare participants to move forward with research that uses text analysis, with a special focus on humanities and social science applications.

  • Part 1: Preprocessing Text. How do we standardize and clean text documents? Text data is noisy, so we often need a pipeline that standardizes the data before computational modeling. In the first part of this workshop, we walk through possible steps in this pipeline, using tools from basic Python, NLTK, and spaCy to preprocess and tokenize our data.

  • Part 2: Bag-of-words Representations. How do we convert text into a representation that we can operate on computationally? This requires developing a numerical representation of the text. In this part of the workshop, we study one of the foundational numerical representations of text data: the bag-of-words model. This model relies heavily on word frequencies to characterize text corpora. We build bag-of-words models and their variations (e.g., TF-IDF), and use these representations to perform classification on text.
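
To give a rough sense of what Parts 1 and 2 cover, here is a minimal sketch that uses only the Python standard library. It is not taken from the workshop materials, and it stands in for NLTK and spaCy rather than using them. The toy corpus, the short stop word list, and the function names are all assumptions made up for illustration. The TF-IDF weighting shown (term count divided by document length, times log(N / document frequency)) is one common variant among several.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "Dogs and cats are popular pets.",
]

# A tiny hand-picked stop word list (an assumption; NLTK and spaCy ship fuller ones).
STOP_WORDS = {"the", "on", "and", "are"}


def preprocess(text):
    """Lowercase, keep alphabetic runs as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Map each token to its raw count in one document."""
    return Counter(tokens)


def tf_idf(corpus_tokens):
    """Weight terms by tf * idf, with tf = count / doc length and idf = log(N / df)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


tokenized = [preprocess(doc) for doc in DOCS]
bows = [bag_of_words(tokens) for tokens in tokenized]
weights = tf_idf(tokenized)
```

In this sketch, "cat" appears in two of the three documents, so it receives a lower weight in the first document than "sat", which appears in only one. Because there is no stemming or lemmatization step, "cat" and "cats" are counted as different terms. Normalizing such variants is the kind of step a fuller preprocessing pipeline would add.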

To continue with Text Analysis, sign up for Topic Modeling or Word Embeddings.

  • Part 3: Topic Modeling. How do we identify topics within a corpus of documents? In this part, we study unsupervised learning of text data. Specifically, we use topic models such as Latent Dirichlet Allocation and Non-negative Matrix Factorization to construct “topics” in text from the statistical regularities in the data.

  • Part 4: Word Embeddings. How can we use neural networks to create meaningful representations of words? The bag-of-words model is limited in its ability to characterize text because it does not use word context. In this part, we study word embeddings, which were among the first attempts to use neural networks to develop numerical representations of text that incorporate context. We learn how to use the package gensim to construct and explore word embeddings of text.
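
The limitation that motivates Part 4 can be seen in a few lines of standard-library Python. This is a small illustrative sketch, not workshop code, and the two sentences are made up: they contain the same words in a different order, so a bag-of-words count cannot tell them apart.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


# Hypothetical sentences: same words, opposite meanings.
first = "dog bites man"
second = "man bites dog"

same_representation = bag_of_words(first) == bag_of_words(second)
print(same_representation)  # True: word order is lost
```

Representations that take surrounding words into account, such as the word embeddings covered in Part 4, are one way to recover some of this lost information.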

The first two parts are taught as a joint series. Parts 3 and 4 can be attended "a la carte"; however, prior knowledge of Parts 1 and 2 is assumed.

Prerequisites: D-Lab’s Python Fundamentals introductory series or equivalent knowledge.

Workshop Materials:

Software Requirements: Installation Instructions for Python Anaconda

Is Python not working on your laptop? Attend the workshop anyway; we can provide you with a cloud-based solution until you resolve the problems with your local installation.

Feedback: After completing the workshop, please give us feedback using this form.

Questions? Email: