Introduction
Text analysis techniques, including sentiment analysis, topic modeling, and named entity recognition, are increasingly used to probe patterns in text-based documents such as books and social media posts. This blog post introduces Twitter text analysis, but it does not attempt to cover all of these techniques. The tutorial comes in two parts. In this first post, I give a step-by-step guide to exploring Twitter data with Python and Pandas. In the second post, I will introduce language models, and together we will implement word embeddings, split the data into training and test sets, and finally build a classifier that identifies positive and negative sentiment.
Initial Analysis of Tweets
Setup
The dataset we will be using is Sentiment140, an open-source Twitter dataset that contains 1.6 million tweets and their associated sentiments. Sentiment140 comes with annotated polarity labels: each tweet is labeled positive, negative, or neutral. To analyze the tweets, we’ll use two packages, tweet-preprocessor and wget. Tweet-preprocessor is a library of tools for cleaning tweet text, while wget retrieves data from the web. First, we check whether tweet-preprocessor and wget are installed. Then, we import the packages and libraries that we will use for the analysis.
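A minimal sketch of that install check, assuming tweet-preprocessor is imported under the module name preprocessor (the helper missing_packages is a hypothetical name, not part of either library):

```python
import importlib.util

# Map each pip package name to the module name it is imported under.
# Assumption: tweet-preprocessor installs a module called `preprocessor`.
REQUIRED = {
    "tweet-preprocessor": "preprocessor",
    "wget": "wget",
    "pandas": "pandas",
}


def missing_packages(required):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    # Nothing is installed automatically; we only report what is absent.
    print("Install first: pip install " + " ".join(missing))
else:
    import pandas as pd
    import preprocessor as p
    import wget
```

Checking with find_spec avoids importing a package just to learn whether it exists, so the check stays cheap and has no side effects.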
We then download the data from the web and unzip it locally.
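The download step might look roughly like the sketch below. It uses the standard library's urllib.request as a stand-in for wget so that it runs without extra packages, and the URL is my assumption about where the archive is hosted, so verify it before relying on it. The helper names download and unzip are hypothetical.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumed location of the Sentiment140 archive; check it before use.
DATA_URL = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"


def download(url, zip_path):
    """Fetch url to zip_path unless the file already exists locally."""
    zip_path = Path(zip_path)
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def unzip(zip_path, out_dir):
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Usage (needs network access, so it is left commented out):
# names = unzip(download(DATA_URL, "sentiment140.zip"), "data")
```

Skipping the download when the file is already present makes it safe to re-run the notebook without fetching the archive again.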
Overview of Dataset
We load Sentiment140 into a dataframe and check the length of the dataset: there are 1,600,000 tweets. The dataset is already preprocessed, with emoticons removed.
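As a rough sketch of the loading step: the CSV has no header row, so the column names have to be supplied by hand. The names below match the column descriptions in this post, while the latin-1 encoding and the file name in the comment are assumptions on my part. A one-row in-memory sample stands in for the real file here.

```python
import io

import pandas as pd

COLUMNS = ["label", "twitter_id", "time", "query", "account", "tweet"]


def load_tweets(source):
    """Read a header-less Sentiment140-style CSV into a dataframe."""
    # Assumption: the file is latin-1 encoded rather than UTF-8.
    return pd.read_csv(source, header=None, names=COLUMNS, encoding="latin-1")


# Tiny stand-in for the real file, which would be loaded with something like
# load_tweets("data/training.1600000.processed.noemoticon.csv") (assumed name).
sample = io.BytesIO(
    b'"4","2087","Sat May 16 23:58:44 UTC 2009","lyx",'
    b'"robotickilldozr","Lyx is cool"\n'
)
df = load_tweets(sample)
print(len(df))  # number of tweets loaded
```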
Here's what each column refers to:
- label: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- twitter_id: the ID of the tweet (2087)
- time: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- query: the query (lyx). If there is no query, then this value is NO_QUERY.
- account: the user that tweeted (robotickilldozr)
- tweet: the text of the tweet (Lyx is cool)
Next, we examine the number of unique IDs and accounts. Note that values in the account column are typically not unique, since the same user can appear on many tweets.
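One way to get both counts at once is nunique(). The toy rows below are made up purely for illustration:

```python
import pandas as pd

# Made-up rows: three tweets, two of them from the same account.
df = pd.DataFrame({
    "twitter_id": [1, 2, 3],
    "account": ["alice", "bob", "alice"],
})

# nunique() counts the distinct values in each selected column.
unique_counts = df[["twitter_id", "account"]].nunique()
print(unique_counts)
```

If the number of unique accounts is lower than the number of rows, at least one account appears on more than one tweet.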
The time column contains the day of the week, the month, the day of the month, the time of day, the timezone, and the year. To make later analysis easier, we split this column into six columns using str.split(), one for each of these components. We then check whether the timestamps are normalized by counting the unique timezones in the dataset, and there is only one!
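A small sketch of the split, assuming the timestamp column is named time as in the column list. The six new column names are my own choice, and the second row is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "time": [
        "Sat May 16 23:58:44 UTC 2009",
        "Mon Jun 01 08:15:00 UTC 2009",  # invented second row
    ]
})

# Hypothetical names for the six components of each timestamp.
parts = ["weekday", "month", "day", "clock", "timezone", "year"]

# expand=True returns one column per token instead of a list per row.
split = df["time"].str.split(" ", expand=True)
split.columns = parts
df = df.join(split)

# A single unique timezone means the timestamps are directly comparable.
n_timezones = df["timezone"].nunique()
print(n_timezones)
```

The new columns hold strings, so day and year need an explicit conversion (for example with astype(int)) before any numeric work.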
The next step is to find mentions in tweets. We use a regular expression to match the pattern that mentions usually follow: the @ symbol followed by one or more word characters.
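A minimal version of such a pattern is sketched below. The function name find_mentions is hypothetical, and the pattern is deliberately simple, so it would also match the part after @ in an email address:

```python
import re

# "@" followed by one or more word characters (letters, digits, underscore).
MENTION = re.compile(r"@\w+")


def find_mentions(text):
    """Return every @mention in text, in order of appearance."""
    return MENTION.findall(text)


print(find_mentions("@alice thanks! cc @bob_99"))  # ['@alice', '@bob_99']
```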
The final step of data preparation is to extract the mentions from each tweet and collect them into a single dataframe called "mentions." First, we convert the text to lowercase and then extract the mentions in each tweet. When a tweet contains multiple mentions, we create an additional row for each one. In the mentions dataframe, the index refers to the index in the original dataframe, the mentions column holds one mention from the tweet, and the label column holds the polarity of that tweet. We then take a look at the top five mentions in the dataset.
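One possible way to build such a dataframe uses str.findall() followed by explode(), which needs pandas 0.25 or later. This is a sketch on made-up tweets, not necessarily the exact approach used for the results in this post:

```python
import pandas as pd

# Made-up tweets for illustration only.
df = pd.DataFrame({
    "label": [4, 0, 4],
    "tweet": ["@Alice hi @bob", "no mention here", "thanks @alice"],
})

# Lowercase first so "@Alice" and "@alice" count as the same mention.
found = df["tweet"].str.lower().str.findall(r"@\w+")

mentions = (
    found.explode()          # one row per mention, original index kept
    .dropna()                # tweets without mentions explode to NaN
    .rename("mentions")
    .to_frame()
    .join(df["label"])       # bring in the polarity via the shared index
)

top_five = mentions["mentions"].value_counts().head(5)
print(top_five)
```

Because explode() repeats the original index for every extra mention, joining on the index is enough to attach the right label to each row.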
Looks like Miley Cyrus, whom we could call the person of the year in 2009, was the most frequently mentioned account in the dataset!
Visualization
Before diving into text analysis, we can take a look at the distribution of positive and negative tweets across the days of the week.
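The numbers behind such a chart can be computed with a normalized cross-tabulation. The sketch below assumes a weekday column has already been split out of the timestamp and uses a handful of made-up rows, so its output says nothing about the real dataset:

```python
import pandas as pd

# Made-up rows: four tweets on Monday, two on Tuesday.
df = pd.DataFrame({
    "weekday": ["Mon", "Mon", "Mon", "Mon", "Tue", "Tue"],
    "label": [0, 0, 0, 4, 0, 4],
})

# normalize="index" turns the counts in each weekday row into proportions.
share = pd.crosstab(df["weekday"], df["label"], normalize="index")
share = share.rename(columns={0: "negative", 4: "positive"})
print(share)

# With matplotlib installed, a stacked bar chart is one line away:
# share.plot(kind="bar", stacked=True)
```

Normalizing per weekday matters because the days may have very different tweet volumes, and raw counts would mostly reflect that volume.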
It looks like people were mostly tweeting with negative sentiment on Thursday, with negative tweets taking up almost 75% of all tweets sent out on Thursday. The proportion immediately goes down for Friday. Yay! TGIF.
That wraps up the first part of the tutorial. In the next post, I’ll introduce sentiment analysis, one of the downstream tasks of text analysis often applied to Twitter data.