PoliPy: A Python Library for Scraping and Analyzing Privacy Policies

February 8, 2022

In light of recent scandals involving the misuse and improper handling of personal data by large corporations, advocacy groups and regulators alike have given increased attention to the issue of consumer privacy [e.g., 1, 2, 3, 4, 5]. National and local governments have been enacting privacy legislation that requires companies to minimize the amount of data they collect, deters the collection of sensitive data, limits the purposes for which the data are used, and critically, gives users more transparency into data collection and use. 

As part of my research at the Berkeley Lab for Usable and Experimental Security (BLUES) [6], I examine the data collection and sharing practices of smartphone app developers and evaluate the compliance of these practices with data protection regulations, such as the EU General Data Protection Regulation (GDPR) [7] and the California Consumer Privacy Act (CCPA) [8]. Among other provisions, these regulations require businesses to disclose their information collection practices to consumers and obtain their consent before collecting and sharing personal information. In the context of online privacy, the most common mechanism for achieving this is to have consumers consent to a privacy policy presented by the business before data collection takes place.

Although prior research has repeatedly demonstrated the limitations of privacy notices [e.g., 9, 10, 11, 12, 13], they often remain the sole mechanism through which consumers can glean information about the data collection and sharing practices of online services. For this reason, we have developed PoliPy [14], an open-source Python library and command-line interface that can be used to compile datasets of privacy policies for selected services, evaluate the comprehensibility of their disclosures, and track longitudinal changes in their contents. In this blog post, I will provide an overview of its features.

Components of PoliPy

PoliPy is an actively maintained library developed as a collaborative effort among researchers seeking to understand longitudinal trends in the privacy disclosures of online services. It consists of several modules that can be used either individually or together as a pipeline.

Scraper

PoliPy allows researchers, developers, regulators, and consumers to assemble datasets of privacy notices for their selected services by providing a list of URLs where the notices are hosted. The scraper module scrapes both the static and dynamic page contents of the provided URLs using the requests and selenium Python libraries [15, 16]. Optionally, users can also save screenshots of the pages and parallelize the scraping process over multiple threads to reduce the runtime. By repeatedly running the scraper in the same directory (either manually or using a job scheduler such as cron), PoliPy can efficiently produce longitudinal datasets of privacy policies, downloading data only if a change in the webpage contents has been detected since the last run of the scraper.
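
For instance, a minimal sketch of such a scraping run over several URLs might look like the following. It relies only on the get_policy() and save() calls demonstrated later in this post; the file name policies.txt (one URL per line) is a placeholder:

#!/usr/bin/env python3
# Minimal sketch: scrape every URL listed in policies.txt with PoliPy,
# using only the get_policy()/save() API shown later in this post.
import polipy

with open('policies.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # screenshot=True additionally stores a rendered screenshot of the page.
    result = polipy.get_policy(url, screenshot=True)
    # Re-running this script in the same directory should, per the description
    # above, only store new data when the page contents have changed.
    result.save(output_dir='.')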

Extractor

The scraper described in the previous step obtains and stores the webpage source document, which contains data beyond the actual contents of the policy, such as the website header, navigation menu, and footer, interspersed with HTML tags, CSS styles, and embedded JavaScript code. To facilitate the analysis of the underlying privacy policy, the extractor module parses the webpage source and produces its textual content. Users can also specify other extractors to locate specific information within the policies themselves, such as the date of last modification and contact information. Importantly, PoliPy is designed to handle a variety of source documents and can extract the contents of privacy notices from HTML, PDF, and TXT files.
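
As a rough illustration of working with the extractor output, the sketch below walks the per-policy subdirectories produced by PoliPy and loads the extracted text from the .json files described later in this post. The key under which the text is stored ('policy_text' below) is a hypothetical name I use for illustration, so inspect an actual .json file for the schema your PoliPy version produces:

import json
from pathlib import Path

# Collect the extracted policy text from each date-stamped .json file
# inside the per-policy subdirectories created by PoliPy (here, under '.').
texts = {}
for json_file in Path('.').glob('*/*.json'):
    with json_file.open() as f:
        extracted = json.load(f)
    # NOTE: 'policy_text' is a hypothetical key name used for illustration only.
    texts[(json_file.parent.name, json_file.stem)] = extracted.get('policy_text', '')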

Analyzer

Finally, PoliPy comes equipped with a set of predefined analysis functions that users can directly apply to the parsed texts of privacy policies. These functions evaluate the readability and the complexity of the texts and can compute simple metrics, including word counts and reading times, as well as more complex ones such as the Flesch–Kincaid Grade Level score [17], using the textstat library [18]. By continuously scraping the policies, users of PoliPy can discover how these metrics change over time and as a result of external events, such as the introduction of new data protection regulations or privacy initiatives. 
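
If you want to compute such metrics yourself, the textstat library can also be applied directly to a parsed policy text. A minimal sketch, with a placeholder string standing in for an extracted policy, might look like this:

import textstat

# Placeholder standing in for the extracted text of a privacy policy.
policy_text = (
    "We collect your email address to create your account and to send you "
    "service-related notices. We may share aggregated usage data with partners."
)

word_count = len(policy_text.split())                     # simple word count
grade_level = textstat.flesch_kincaid_grade(policy_text)  # Flesch-Kincaid Grade Level
read_seconds = textstat.reading_time(policy_text)         # estimated reading time in seconds

print(f'{word_count} words, grade level {grade_level}, ~{read_seconds:.0f}s to read')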

Putting Everything Together

To get started with PoliPy, you can easily install it using the pip package installer for Python:

pip install polipy

Once you have installed PoliPy, you can start using it to build your privacy policy datasets. You can use the library through its command-line interface (CLI), for instance:

$ cat policies.txt
https://docs.github.com/en/github/site-policy/github-privacy-statement

$ polipy policies.txt -s

Alternatively, you can also import PoliPy as a library in a Python script:

#!/usr/bin/env python3
import polipy

url = 'https://docs.github.com/en/github/site-policy/github-privacy-statement'

result = polipy.get_policy(url, screenshot=True)
result.save(output_dir='.')

Both of these invocations result in the creation of the following output folder: 

├── docs_github_com_c0eb432555
│   ├── 20220201.html
│   ├── 20220201.png
│   ├── 20220201.json
│   └── 20220201.meta

In this example, the names of the files in the subdirectory correspond to the date the policy was scraped, and the file extensions correspond to the following (a short sketch after this list shows one way to compare these date-stamped snapshots over time):

  • .html contains the (dynamic) source of the webpage where the privacy policy is hosted;

  • .png contains the screenshot of the webpage;

  • .meta contains information such as the URL of the privacy policy and the date of last scraping; and

  • .json contains the content extracted from the privacy policy.
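
Since every scrape is stored under a date-stamped filename, tracking how a policy changes over time mostly amounts to walking these files in chronological order. A rough sketch, reusing the hypothetical 'policy_text' key from the extractor example above, might look like this:

import difflib
import json
from pathlib import Path

# Subdirectory created by PoliPy for a single policy (name taken from the example above).
policy_dir = Path('docs_github_com_c0eb432555')

# Date-stamped .json files sort chronologically because the names are YYYYMMDD.
snapshots = sorted(policy_dir.glob('*.json'))

for older, newer in zip(snapshots, snapshots[1:]):
    # NOTE: 'policy_text' is a hypothetical key name; see the extractor sketch above.
    old_text = json.loads(older.read_text()).get('policy_text', '')
    new_text = json.loads(newer.read_text()).get('policy_text', '')
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    print(f'{older.stem} -> {newer.stem}: {similarity:.0%} similar')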

To explore other features provided by PoliPy, I encourage you to check out the documentation on the project website (https://github.com/blues-lab/polipy). I also encourage you to reach out to me if you have any questions, want to report an issue, propose a new feature, or would like to share your experiences with PoliPy.

To close, I want to wish you the best of luck with your projects and happy data scraping! :)

References

[1] Granville, K., 2018. Facebook and Cambridge Analytica: What you need to know as fallout widens. The New York Times, 19, p.18.

[2] Jyothish R., 2018. The World’s Biggest Biometric Database Keeps Leaking People’s Data. Fast Company.

[3] Tan, R., 2018. Fitness app Polar revealed not only where U.S. military personnel worked, but where they lived. The Washington Post.

[4] Porter, J., 2019. Google fined €50 million for GDPR violation in France. The Verge.

[5] Kang, C., 2019. FTC approves Facebook fine of about $5 billion. The New York Times, 12.

[6] Berkeley Lab for Usable and Experimental Security (BLUES)

[7] EU General Data Protection Regulation (GDPR): Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), OJ 2016 L 119/1.

[8] California Consumer Privacy Act (CCPA), 2018 Cal. Legis. Serv. Ch. 55 (A.B. 375). 

[9] Pollach, I., 2007. What's wrong with online privacy policies?. Communications of the ACM, 50(9), pp.103-108.

[10] Sheng, X. and Cranor, L.F., 2005. An evaluation of the effect of US financial privacy legislation through the analysis of privacy policies. ISJLP, 2, p.943.

[11] McDonald, A.M. and Cranor, L.F., 2008. The cost of reading privacy policies. ISJLP, 4, p.543.

[12] Obar, J.A. and Oeldorf-Hirsch, A., 2020. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. Information, Communication & Society, 23(1), pp.128-147.

[13] Barocas, S. and Nissenbaum, H., 2009, October. On notice: The trouble with notice and consent. In Proceedings of the Engaging Data Forum: The First International Forum on the Application and Management of Personal Electronic Information.

[14] PoliPy: Library for Scraping, Parsing, and Analyzing Privacy Policies

[15] Requests: HTTP for Humans

[16] Selenium with Python

[17] Kincaid, J.P., Fishburne Jr, R.P., Rogers, R.L. and Chissom, B.S., 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.

[18] Textstat: Python Package to Calculate Readability Statistics of a Text Object