In light of recent scandals involving the misuse and improper handling of personal data by large corporations, advocacy groups and regulators alike have given increased attention to the issue of consumer privacy [e.g., 1, 2, 3, 4, 5]. National and local governments have been enacting privacy legislation that requires companies to minimize the amount of data they collect, deters the collection of sensitive data, limits the purposes for which the data are used, and critically, gives users more transparency into data collection and use.
Although prior research has repeatedly demonstrated the limitations of privacy notices [e.g., 9, 10, 11, 12, 13], they often remain the sole mechanism through which consumers can glean information about the data collection and sharing practices of online services. For this reason, we have developed an open-source Python library and command-line interface called PoliPy that can be used to compile datasets of privacy policies for selected services, evaluate the comprehensibility of their disclosures, and track longitudinal changes in their contents. In this blog post, I will provide an overview of its features.
Components of PoliPy
PoliPy is an actively maintained library developed as a collaborative effort by researchers seeking to understand longitudinal trends in the privacy disclosures of online services. It consists of several modules that can be used individually or together as a pipeline.
PoliPy allows researchers, developers, regulators, and consumers to assemble datasets of privacy notices of their selected services by providing a list of URLs where the notices are hosted. The scraper module scrapes both the static and dynamic page contents of the provided URLs using the requests and selenium Python libraries [15, 16]. Optionally, users can also save screenshots of the pages and parallelize the scraping process over multiple threads to reduce the runtime. By repeatedly running the scraper in the same directory (either manually or using a job scheduler such as cron), PoliPy can efficiently produce longitudinal datasets of privacy policies, downloading data only if a change in the webpage contents has been detected since the last execution of the scraper.
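As a rough illustration (this is a minimal sketch, not PoliPy's actual implementation), change detection of this kind can be done by hashing the downloaded contents and comparing the digest against the one stored from the previous run:

```python
import hashlib


def content_changed(page_text, previous_digest):
    # Hash the freshly scraped contents and compare against the digest
    # recorded during the previous run; None means no previous snapshot.
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    return digest != previous_digest
```

On each scheduled run, the scraper would recompute the digest and write a new snapshot only when this check returns True, which keeps the longitudinal dataset free of duplicate copies.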
Finally, PoliPy comes equipped with a set of predefined analysis functions that users can directly apply to the parsed texts of privacy policies. These functions evaluate the readability and complexity of the texts and can compute simple metrics, including word counts and reading times, as well as more complex ones such as the Flesch–Kincaid Grade Level score, using the textstat library. By continuously scraping the policies, users of PoliPy can discover how these metrics change over time and as a result of external events, such as the introduction of new data protection regulations or privacy initiatives.
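To make the simple metrics concrete, here is a stdlib-only sketch of word counting and reading-time estimation (PoliPy itself relies on textstat for the more sophisticated scores, and the 238-words-per-minute reading speed below is an assumed average, not a PoliPy constant):

```python
def word_count(text):
    # Naive whitespace tokenization; textstat performs a more careful lexicon count.
    return len(text.split())


def reading_time_minutes(text, words_per_minute=238):
    # 238 wpm is a commonly cited average adult silent-reading speed (an assumption here).
    return word_count(text) / words_per_minute
```

Running these over successive snapshots of the same policy is what makes the longitudinal comparisons described above possible.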
Putting Everything Together
To get started with PoliPy, you can easily install it using the pip package installer for Python:
$ pip install polipy
You can then use the command-line interface to scrape the policies listed in a file of URLs (one per line), optionally capturing screenshots of the pages with the -s flag:
$ cat policies.txt
$ polipy policies.txt -s
Alternatively, you can also import PoliPy as a library in a Python script:
import polipy

# Hypothetical example URL — replace with the privacy policy page you want to scrape.
url = "https://example.com/privacy"
result = polipy.get_policy(url, screenshot=True)
Both of these invocations result in the creation of the following output folder:
│ ├── 20220201.html
│ ├── 20220201.png
│ ├── 20220201.json
│ └── 20220201.meta
In this example, the names of the files in the subdirectory correspond to the date on which the policy was scraped, and the file extensions correspond to the following:
.html contains the raw contents of the webpage;
.png contains the screenshot of the webpage;
.json contains the parsed text of the privacy policy;
.meta contains metadata about the scraped policy.
To explore other features provided by PoliPy, I encourage you to check out the documentation on the project website (https://github.com/blues-lab/polipy). I also encourage you to reach out if you have any questions, want to report an issue, propose a new feature, or share your experiences with PoliPy.
To close, I want to wish you the best of luck with your projects and happy data scraping! :)
 Granville, K., 2018. Facebook and Cambridge Analytica: What you need to know as fallout widens. The New York Times, 19, p.18.
 Jyothish, R., 2018. The World’s Biggest Biometric Database Keeps Leaking People’s Data. Fast Company.
 Tan, R., 2018. Fitness app Polar revealed not only where U.S. military personnel worked, but where they lived. The Washington Post.
 Porter, J., 2019. Google fined €50 million for GDPR violation in France. The Verge.
 Kang, C., 2019. FTC approves Facebook fine of about $5 billion. The New York Times, 12.
 EU General Data Protection Regulation (GDPR): Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), OJ 2016 L 119/1.
 California Consumer Privacy Act (CCPA), 2018 Cal. Legis. Serv. Ch. 55 (A.B. 375).
 Pollach, I., 2007. What's wrong with online privacy policies?. Communications of the ACM, 50(9), pp.103-108.
 Sheng, X. and Cranor, L.F., 2005. An evaluation of the effect of US financial privacy legislation through the analysis of privacy policies. ISJLP, 2, p.943.
 McDonald, A.M. and Cranor, L.F., 2008. The cost of reading privacy policies. ISJLP, 4, p.543.
 Obar, J.A. and Oeldorf-Hirsch, A., 2020. The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services. Information, Communication & Society, 23(1), pp.128-147.
 Barocas, S. and Nissenbaum, H., 2009, October. On notice: The trouble with notice and consent. In Proceedings of the Engaging Data Forum: The First International Forum on the Application and Management of Personal Electronic Information.
 Selenium with Python
 Kincaid, J.P., Fishburne Jr, R.P., Rogers, R.L. and Chissom, B.S., 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.