Twitter Text Analysis: A Friendly Introduction, Part 2

March 7, 2023

The code for this blog post is available in this GitHub repository. You can also follow along in this Colab notebook!

Introduction

This blog post follows up on part 1 of this introduction, in which I showed how to conduct exploratory data analysis of the Sentiment140 data. In this second part, we will dive into Twitter sentiment analysis. Starting with an introduction to Large Language Models, we will go through implementing word embeddings, splitting the data into training and test sets, and finally building a classifier to identify positive and negative sentiments. If you are interested in further exploring NLP techniques, don’t hesitate to sign up for the Python Text Analysis workshop series with D-Lab.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is one of the state-of-the-art pretrained language models. It was first trained on a massive amount of text, which makes it easy to fine-tune on downstream tasks such as sentiment analysis. That is to say, we don’t need to start from scratch; we can build on an existing model and customize it for our own purposes.

We will first install the transformers package provided by Hugging Face, which is the standard way to use large language models in Python. Then, we load the tokenizer corresponding to the BERT “base” model, one of the two variants of BERT (there is also a “large” model).

# Install transformers package
!pip install transformers

# Import and build a BERT-based tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Tokenization

Instead of passing all 1.6 million tweets into our model, which would take far too long to train on, we will use only 1% of the data in this tutorial. It should be noted that the label for positive tweets is changed from 4 to 1; without this change, the labels would not be the consecutive class indices (0 and 1) that the cross-entropy loss expects, which can trigger an error during training.

# Sample 1% of the negative and positive tweets from the original dataset
# In Sentiment140, 0 means negative tweets and 4 means positive tweets
df_model = pd.concat(
    [df.loc[df['label'] == 0].sample(frac=0.01),
     df.loc[df['label'] == 4].sample(frac=0.01)])

# Relabel positive tweets from 4 to 1 so that the labels are 0 and 1
df_model.loc[df_model["label"] == 4, "label"] = 1

Next, we iterate through the tweet column and tokenize each tweet into individual tokens. It should be noted that BERT does not simply convert each word into a single token. Very often, a word is split into multiple subword tokens; for instance, “surfing” may be decomposed into the two pieces “surf” and “##ing”. See this blog post for a detailed introduction to BERT’s tokenizer.
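
To get a feel for this, we can inspect the tokenizer’s output on a made-up example sentence (the exact subword splits depend on the bert-base-cased vocabulary):

# Inspect how the tokenizer splits a sample sentence into subword tokens
sample = "I love surfing at sunrise!"
print(tokenizer.tokenize(sample))
# encode() additionally maps tokens to ids and adds the special [CLS] and [SEP] tokens
print(tokenizer.encode(sample))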

# Store the length of each tweet
token_lens = []

# Iterate through the tweet column and tokenize each tweet
for txt in df_model.tweet:
    tokens = tokenizer.encode(txt)
    token_lens.append(len(tokens))

We can plot the distribution of tweet lengths. It seems that all tweets are shorter than 125 tokens, so we will set MAX_LEN to 125, a parameter required when encoding the tweets for training.

# Plot the distribution of tweet lengths (in tokens)
import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(token_lens)
plt.xlim([0, 200])
plt.xlabel('Token count');
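
Based on this distribution, we can define MAX_LEN. As a rough sketch of where this parameter comes into play, the tokenizer’s encode_plus method can pad or truncate each tweet to the same fixed length (the example text below is just a placeholder):

# Maximum sequence length, chosen from the token-count distribution above
MAX_LEN = 125

# Sketch: encode a single tweet to a fixed length of MAX_LEN tokens
encoding = tokenizer.encode_plus(
    "an example tweet",          # placeholder text
    max_length=MAX_LEN,
    padding='max_length',        # pad shorter tweets up to MAX_LEN
    truncation=True,             # cut longer tweets down to MAX_LEN
    return_attention_mask=True,
    return_tensors='pt',
)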

Train-test split

Next, we will split our dataset into three subsets: training, validation, and test data. Holding out 10% or 20% of the data for testing is the most common choice. Creating a validation set is also necessary: our classifier may perform exceptionally well on the training data but less satisfactorily when evaluated on a separate set of tweets it has never seen.

# Train-test split: 90% training, 5% validation, 5% test
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_model, test_size=0.1, random_state=42)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=42)
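
The model does not consume these data frames directly; in the linked Kaggle notebook, each split is wrapped in a PyTorch Dataset and served in batches through a DataLoader. Below is a minimal sketch of that step (the class and helper names here are illustrative, not the notebook’s exact code):

# Sketch: wrap each split in a PyTorch Dataset and DataLoader
import torch
from torch.utils.data import Dataset, DataLoader

class TweetDataset(Dataset):
    def __init__(self, tweets, labels, tokenizer, max_len):
        self.tweets = tweets
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.tweets)

    def __getitem__(self, idx):
        # Encode one tweet to a fixed length with its attention mask
        encoding = self.tokenizer.encode_plus(
            self.tweets[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long),
        }

def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = TweetDataset(df.tweet.to_numpy(), df.label.to_numpy(), tokenizer, max_len)
    return DataLoader(ds, batch_size=batch_size)

BATCH_SIZE = 16  # one of the commonly recommended batch sizes
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)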

Training

The final step before training is to transform the tokens into word embeddings and create the classifier. We won’t go deep into this part, as we follow the standard procedure for training a machine learning model. Throughout the whole process, we have many training decisions to make, e.g. the size of the embedding and the batch size. All of these decisions influence the training time and, most importantly, the classification accuracy. The code we use here is from this Kaggle post; feel free to take a look at it! Run the “Build the classifier” code chunk to get everything done before training.
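
For orientation, here is a minimal sketch of what such a BERT-based classifier typically looks like: the pretrained encoder followed by dropout and a linear layer mapping to the two sentiment classes. The class and variable names are illustrative rather than the exact code from the Kaggle post.

import torch
import torch.nn as nn
from transformers import BertModel

# Use a GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # Pretrained BERT encoder producing contextual embeddings
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.drop = nn.Dropout(p=0.3)
        # Linear layer mapping BERT's pooled output to the sentiment classes
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.out(self.drop(outputs.pooler_output))

model = SentimentClassifier(n_classes=2).to(device)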

We have a number of additional hyperparameters to specify for training, including:

  • Batch size: A batch size of 16, 32, or 64 is generally recommended. You can think of a batch as a subset of the training data passed through the model at once; after each batch, a loss is calculated and the model’s weights are updated.

  • Number of epochs: This is the number of times the entire dataset is passed through our classifier. The appropriate number of epochs varies depending on the task. We don’t have a huge dataset here, so a small number of epochs should give a good optimization result.

  • Learning rate: This determines the step size our learning algorithm takes at each update. If it is too large, the algorithm might bounce back and forth without hitting the optimum; if it is too small, it may take too many steps to converge.

None of these hyperparameters is fixed; they are typically adjusted to fit the downstream task at hand.

# Set the number of epochs
EPOCHS = 8

# Create an optimizer and a linear learning-rate scheduler
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

# Set up a loss function
loss_fn = nn.CrossEntropyLoss().to(device)

Once everything is set up, we can start training. Click on the “Training” code chunk and pour yourself a cup of coffee; it may take around 40 minutes to run! Our classifier will print both training and validation accuracy for each epoch.
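
Under the hood, that code chunk runs a per-epoch loop along these lines. This is a sketch that assumes train_epoch and eval_model helper functions in the style of the linked Kaggle notebook; the helper names are illustrative.

# Sketch of the per-epoch training and validation loop
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')

    # One pass over the training data: forward, loss, backward, optimizer/scheduler step
    train_acc, train_loss = train_epoch(
        model, train_data_loader, loss_fn, optimizer, device, scheduler, len(df_train))
    print(f'Train loss {train_loss} accuracy {train_acc}')

    # Evaluate on the validation set without updating weights
    val_acc, val_loss = eval_model(
        model, val_data_loader, loss_fn, device, len(df_val))
    print(f'Val   loss {val_loss} accuracy {val_acc}')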

Take a look at the result. We can of course visualize the accuracy values by plotting them together on a line plot, as sketched after the training log below. The training accuracy steadily increases, exactly as we would expect, eventually reaching almost 100% after eight epochs. The validation accuracy is not bad either, but it plateaus at around 81% while the validation loss keeps growing, which suggests our model is overfitting. Considering that the task itself isn’t that complex, the validation accuracy is still good enough. Nevertheless, to improve performance on the validation set, we can always tweak a few things:

  • Increase the size of the dataset (we are merely using 1% of the data).

  • Tweak the learning rate: [5e-5, 4e-5, 3e-5, 2e-5, 1e-4] are commonly used learning rates for the Adam optimizer. 

  • Tweak other training settings, such as the batch size or the number of epochs.

Epoch 1/8
Train loss 0.4857352459265126 accuracy 0.7685416666666667
Val   loss 0.4103033171594143 accuracy 0.81875
Epoch 2/8
Train loss 0.2707135284360912 accuracy 0.8961111111111111
Val   loss 0.581130108833313 accuracy 0.81125
Epoch 3/8
Train loss 0.1603570611783976 accuracy 0.9538888888888889
Val   loss 0.7748090094700456 accuracy 0.81375
Epoch 4/8
Train loss 0.09925389545109485 accuracy 0.9754166666666667
Val   loss 1.0215172763576266 accuracy 0.81375
Epoch 5/8
Train loss 0.059240124417346024 accuracy 0.9866666666666667
Val   loss 1.0887920645647682 accuracy 0.81
Epoch 6/8
Train loss 0.050139884708898734 accuracy 0.9894444444444445
Val   loss 1.232879184919875 accuracy 0.8125
Epoch 7/8
Train loss 0.03146361627515742 accuracy 0.9932638888888888
Val   loss 1.3020428043184802 accuracy 0.81375
Epoch 8/8
Train loss 0.018890714899413675 accuracy 0.9955555555555555
Val   loss 1.392997169571172 accuracy 0.81875
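
If you recorded the per-epoch accuracies while training (for instance, in two lists), the line plot mentioned above might look like this; the variable names are illustrative.

# Sketch: compare training and validation accuracy across epochs
plt.plot(train_accuracies, label='train accuracy')
plt.plot(val_accuracies, label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend();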

Feel free to play with these hyperparameters and see if you end up with a better result! This tutorial has simply walked through several important steps of building a classifier, leaving many more for you to explore on your own. Again, if you enjoyed reading this tutorial and would like to learn more about either Text Analysis or Deep Learning, sign up for our D-Lab workshops!