Twitter data extraction with Selenium

March 1, 2022

Introduction

With online communities and social networks serving as important sites for computational social science research, Twitter has quickly become a popular data source for researchers (Frey et al., 2020; Kušen et al., 2017; Rao et al., 2010; Wu et al., 2021). This blog post will demonstrate one way to extract Twitter data without using the Twitter API. This is especially useful for researchers who are new to exploring the use of Twitter data in their research, who are looking to develop a baseline corpus for a research question they are newly pursuing, or who don’t have the technical resources to support Twitter API use.

There are a few different ways to extract data from Twitter, but Selenium, a Python package used for web scraping and web crawling, has become a popular choice within the research community. For my own work, I use Selenium to collect specific kinds of tweets over time. The package lets me iterate over a specific task so that I can collect enough Twitter data to analyze.

Web Scraping vs Web Crawling

Before I discuss how Selenium works, it’s important to clarify what it means to extract data from websites, especially social media sites. This process of extraction is usually referred to as web scraping. Web scraping is a systematic way to extract information online: it automates our ability to extract specific elements from a webpage. Web crawlers (otherwise known as ‘bots’ or ‘spiders’), by contrast, automate the process of indexing information on web pages. Web crawlers function more like a search engine than a tool for scraping information. Web scraping and crawling are often used together, especially when working with tools like Selenium. In this blog post, we will be specifically focused on web scraping.

Terms of Service Agreements

Additionally, deploying a bot on web pages can at times present legal challenges, depending on the website you are indexing. Before starting any social media-related data extraction or indexing project, be sure to review the terms of service agreement for that particular social media site. If you are interested in using Twitter data, you can read Twitter’s terms of service agreement here.

What is Selenium?

As mentioned earlier, Selenium is a popular Python package used for automating information extraction from websites. Selenium itself offers bindings in several languages, including Python, Java, and C#. Because the Python bindings expose this functionality directly, individuals who don’t have much familiarity with Java or C# can handle web-based data tasks entirely in Python.

There are two major components of Selenium: the web driver and elements. There are various ways you can use Selenium to find the elements you would like to extract; a few popular ones include tag type, class, and name. You can reference a great Python installation guide for Selenium here.
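As a quick preview of those two pieces (the driver setup is covered in the next sections), here is a minimal sketch, assuming ChromeDriver is already installed and on your PATH; the class and name values in the comments are hypothetical placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# The web driver controls a real browser session
driver = webdriver.Chrome()
driver.get("https://example.com")

# Elements can be located in several ways; a few common strategies:
heading = driver.find_element(By.TAG_NAME, "h1")        # by tag type
# driver.find_element(By.CLASS_NAME, "some-class")      # by class (hypothetical class name)
# driver.find_element(By.NAME, "q")                     # by name (hypothetical name attribute)
print(heading.text)

driver.quit()
```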

Download your Driver

In order to access the webpage you’ll want to scrape from while using Python, you’ll need to download the appropriate driver for your browser. I normally use Google Chrome, so I downloaded the Chrome driver from this website. You’ll also need to make sure that you download the version that best matches your current version of Chrome. One easy way to identify your current version of Chrome is to go to your Chrome settings and click the “About Chrome” tab at the bottom of the settings page. There you’ll see your current Chrome version. Keeping the driver version in sync with your current browser version is very important, as it will prevent future breaks in your script.

Using the Driver to Open the Twitter Webpage

Once you have both your driver and Selenium downloaded, you’ll want to import a few packages into your Python script. These packages, listed below, provide the capabilities for using the Chrome driver in Python.
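Here is a minimal sketch of the imports used in the rest of this walkthrough, assuming Selenium 4:

```python
# Core Selenium imports for driving Chrome from Python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
```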

Next, you will want to use the driver.get() function to call a website. For the purposes of accessing Twitter data, you’ll need to call the Twitter website. The service argument within webdriver.Chrome() passes in the path of your web driver; it tells Selenium where to find the driver so it can launch the browser. Next, you will assign the Chrome driver and its service argument to a variable name, which we call “driver” here. We then assign the Twitter website to a variable that we call “url”. Lastly, we use the driver.get() function to open our url.
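Putting those steps together, a minimal sketch looks like the following; the driver path is a placeholder, so point it at wherever you saved ChromeDriver:

```python
# Continues from the imports shown above.

# Point Selenium at the ChromeDriver executable (placeholder path; adjust to your setup)
service = Service("/path/to/chromedriver")

# Launch Chrome through that driver
driver = webdriver.Chrome(service=service)

# Open the Twitter homepage
url = "https://twitter.com"
driver.get(url)
```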

Next, you will want to extract the specific Twitter data you are interested in. In order to do that, you need to be familiar with how data is mapped onto web page elements. To extract an entire tweet, you can use your browser’s developer tools to inspect the page and select the element that corresponds to that specific tweet. The next section discusses this further.

Using XPath and element ID to extract an element on Twitter

Now you can use Selenium to extract data from the HTML. In order to efficiently select the correct data object, we need to identify its relevant HTML element. An HTML element is a component of a webpage that contains data. Most HTML pages are composed of many elements, each of which can carry specific attributes. Each element has an opening tag, content, and a closing tag.

XPath, which stands for XML Path Language, is a syntax used to identify the elements of an XML (or HTML) document. Developer tools can be used to copy the corresponding XPath for an element you would like to extract. In an example tweet from ESPN, the XPath will be used to extract the text of the tweet.

In developer tools, locate the element that corresponds to this text. You’ll see that it has an associated div and class. Right-click on the element and select “Copy XPath”. This is the XPath that we will use to extract the text from this tweet in our script. From the copied path, we only need the element ID for this extraction process. We then append the .text attribute to the variable that we are defining.
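As a rough sketch, the extraction looks something like this; the XPath below is a placeholder, so paste in the one you copied from developer tools:

```python
# Placeholder XPath copied from developer tools; substitute your own
tweet_xpath = '//*[@id="tweet-text-example"]/div[2]/div[1]'

# Find the element and read its visible text
tweet_text = driver.find_element(By.XPATH, tweet_xpath).text
print(tweet_text)
```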

When we run the variable that our line of code is assigned to, the text that is returned matches the text associated with the element we extracted.


Conclusion

The above example outlines a simple way of extracting Twitter data. This process can be iterated over and automated to aid in large-scale extraction of Twitter data. To get the best results from Selenium, a good next step is to familiarize yourself with how HTML elements are structured on web pages. Beyond that, Selenium is a simple and powerful tool for automating data extraction from web pages.

References:

Frey, W. R., Patton, D. U., Gaskell, M. B., & McGregor, K. A. (2020). Artificial Intelligence and Inclusion: Formerly Gang-Involved Youth as Domain Experts for Analyzing Unstructured Twitter Data. Social Science Computer Review, 38(1), 42–56. https://doi.org/10.1177/0894439318788314

Kušen, E., Cascavilla, G., Figl, K., Conti, M., & Strembeck, M. (2017). Identifying Emotions in Social Media: Comparison of Word-Emotion Lexicons. 2017 5th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), 132–137. https://doi.org/10.1109/FiCloudW.2017.75

Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010). Classifying latent user attributes in twitter. Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, 37–44. https://doi.org/10.1145/1871985.1871993

Wu, L., Dodoo, N. A., Wen, T. J., & Ke, L. (2021). Understanding Twitter conversations about artificial intelligence in advertising based on natural language processing. International Journal of Advertising, 0(0), 1–18. https://doi.org/10.1080/02650487.2021.1920218