Resisting our Data Doppelgangers: A Proposal for Unpacking the Dangers of Data-Driven Fertility Advertising With Data Science Tools

December 7, 2021

Introduction

When Janet Vertesi, a sociologist of technology at Princeton, learned of her pregnancy, she decided to conduct a personal experiment: she hid her pregnancy from the internet for nine months. This meant sharing the news only with close friends and family, using her own personal server while making purchases on Amazon, and even opting to use cash for many of her transactions. During this time Amazon mistook her for a “suspicious customer” (Vertesi 2014, Gray 2014). Recall, too, how Target found out about a teenage girl’s pregnancy before her father did (Hill 2012), or the countless women calling out tech companies for sending them targeted pregnancy and fertility ads after miscarriage and child loss (Brockell 2018, Moss 2019). In these examples we see individuals grappling with their digital selves, recognizing the predictive and often harmful nature of their collected data.

These cases fit into a larger data economy that has become a dominant force within our global economy. Advertisers pay top dollar for access to personal data, especially when it pertains to individuals with unique health identities. The algorithms behind online targeted advertising are optimized to match individuals with the products they are most likely to buy. This practice of tracking online behavioral histories to target ads is called online behavioral advertising (OBA). Websites with ad space (publishers) sell the data behind OBA, along with a combination of proxies indicating sociodemographic and geographic characteristics, which advertisers use to target their desired audiences. Advertisers can run campaigns with their ads at various levels of reach, frequency, and recency. Cost per thousand impressions (CPM), click-through rate (CTR), and cost per action (CPA) are among the most important metrics within this industry. The AdTech ecosystem, in turn, involves agencies, exchanges, networks, and demand-side and supply-side platforms within its elaborate scheme of personalization and targeting. Previous work on dark patterns within social computing has explored the dubious practices of publishers and advertisers alike (Mathur et al. 2019, Nouwens et al. 2020, Gray et al. 2021a, Gray et al. 2021b).

Furthermore, the growth of this industry and the steady innovation in cross-device tracking and prediction modeling have presented notable privacy risks, which are some of the primary motivating forces behind the passing of the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Beyond concerns about their right to privacy, many users suffer from the tyranny of their online behavioral histories. The concerns raised by The Wall Street Journal’s recent investigation into the harmful effects of prolonged Instagram use on young girls extend beyond the ethics of feed curation to the personalization of targeted ad content. A recent study that my colleagues and I at the School of Information conducted illustrated the lengths to which individuals with histories of disordered eating went to mitigate the harmful effects of targeted weight-loss ads (Gak et al. 2021). Through a series of semi-structured interviews, we concluded that targeted weight-loss ads were both physically and emotionally harmful and oftentimes misleading in their promotion of unsubstantiated health claims. Additionally, the combination of the persistent nature of targeted advertising and an overgeneralization of the categories in which communities are situated encourages end users to engage with content that could be harmful to them.

The Data Doppelganger

The data doppelganger serves as a useful metaphor for understanding the complicated social implications of our personal data. Like the original German term, doppelganger, the data doppelganger represents a mirrored version of ourselves. This version, an accumulation of the continuous process of datafication we endure as internet users, is abhorrent in nature. Data doppelgangers can be described as “both heimlich and unheimlich, familiar and strange, and the subject thus has an ambiguous relationship with data, both loving it and hating it at the same time” (Pierlejewski 2020, 470). Pierlejewski (2020) also eerily posits that the data doppelganger represents an entanglement of our physical and digital selves, in much the same way that Donna Haraway (2006) explains the formation of the cyborg self in her 1985 essay, A Cyborg Manifesto.

The goal of this research project is to investigate this entanglement further, especially as it pertains to individuals with stigmatized health identities. Where do we draw the line between the human and the data we produce every day? How do the ubiquitous renderings of such data assemblages complicate how we view ourselves, and what are the possibilities of stratification within this process? In this blog post, I propose an experimental study that will use web scraping and image classification to observe how these targeted ads behave at scale.

A Focus on Infertility

Infertility, like many other invisible ailments, is deeply social. The public’s continued association of infertility with women rather than with both men and women, the positioning of the egg as passive within sexual reproduction (Martin 1991), and the racialization of forced sterilization (Bailey & Peoples 2017) illustrate a long history of body politics within the United States.

My research project aims to investigate the following hypotheses:

a) Users displaying behaviors associated with infertility will be targeted by diet ads significantly more than users who do not display online behaviors signaling infertility

b) Perceived race, geographic location, and socioeconomic status influence the kind of fertility-related ads users receive
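One possible way to check a hypothesis like (a) is to compare the share of ads in a given category served to the experimental personas against the share served to the control personas. The sketch below uses a one-sided two-proportion z-test. The proposal itself does not commit to a specific statistical test, so the choice of test, the function name, and the counts are all assumptions made for illustration. It also treats every observed ad as an independent draw, which is a simplification.

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Return (z, one-sided p-value) for H1: rate in group A > rate in group B."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both groups need at least one observed ad")
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Placeholder counts, NOT real data: (targeted ads seen, total ads seen).
z, p = two_proportion_z_test(hits_a=60, total_a=400, hits_b=30, total_b=400)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

Here group A stands for the experimental personas and group B for the control personas. A small p-value would suggest that the experimental personas see the ad category more often than chance alone would explain.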

Simulating Users

Like Facebook and Google, Twitter uses an ads manager that helps advertisers create custom campaigns for promoting their content to targeted audiences. Promoted tweets are a common site for targeted advertising on the platform. Using Twitter’s own recommendations for campaign construction, I will create 15 “online personas,” or fake Twitter accounts, mimicking the behavior of Twitter users struggling with infertility. These accounts will be my experimental groups. I am focusing on three areas of behavioral targeting as defined by Twitter’s ads manager. Figure 1 illustrates the targeting features used to reach specific audiences. The UI is responsive, giving estimates of reach for each term that advertisers enter.

 

Figure 1. A screenshot of the targeting features

Options within Twitter’s ads manager

Interest-based targeting

Interest-based targeting personalizes ad distribution to users who display certain interests. This feature has limited options, offering only broad categories of interests. I selected the “Family and parenting: babies and toddlers” interest, which has a global audience size of approximately 1.8 million Twitter users.

Keyword targeting

Keyword targeting uses specific terms to target individuals. These keywords represent the terms users have searched, tweeted, or otherwise interacted with. It's important to note that placing a hashtag before a term changes its level of reach, so I opted to include hashtags with terms that are more likely to be used across different audiences. A few key terms I decided to include are the following: #infertility, pregnancy loss, ivf, miscarriage, #ivfjourney and ovulation tracking.

Follower look-alike targeting

Follower look-alike targeting reaches users who follow accounts that display interests relevant to the active campaigns. I chose to follow fertility centers because of their explicit relationship to individuals suffering from infertility. A few fertility centers I selected include Shady Grove Fertility Clinic (@SGFertility), Boston IVF Fertility Clinic (@BostonIVF) and USC Fertility (@USCFertility).

I aim to manually set up these profiles and engage with content to situate these accounts within the relevant audience segments. To adequately test my hypotheses, I will also need a control group. An early prototype of my control group will feature the same age, gender, and geolocation distribution as my experimental groups, and further work is being done to consider entry points for bias. A future iteration of this project will consider doing this automatically and at scale with web crawlers.

Getting The Data: Web Scraping

To analyze targeting patterns, I will need to create a database of labeled images. To begin constructing this database, I will write a script that continuously scrapes the feeds of my user personas. Scrapy is a powerful Python framework for crawling web pages and extracting structured data from their HTML and JSON. Because Scrapy does not execute JavaScript on its own, dynamically rendered feeds may require pairing it with a headless browser. For this project, I will use Scrapy to systematically scrape all div elements containing images and promoted tweets. When inspecting a feed with Chrome developer tools, I noticed that promoted tweets have a unique marker. See Figure 2 for a screenshot of the marker needed for finding promoted tweets on a webpage.
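The filtering step described above can be sketched as follows. To keep the example runnable without extra dependencies, it uses Python's built-in html.parser rather than Scrapy, and it runs on a hand-written HTML snippet rather than a real feed. The marker text "Promoted", the use of article elements as tweet containers, and every name in the code are assumptions made for illustration; the real marker is the one shown in Figure 2.

```python
from html.parser import HTMLParser

# Hypothetical marker text; replace it with the marker found in the real page.
PROMOTED_MARKER = "Promoted"


class PromotedImageParser(HTMLParser):
    """Collect image URLs from <article> blocks that contain the marker text."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # nesting depth inside the current <article>
        self.images = []           # image URLs seen in the current <article>
        self.is_promoted = False   # whether the marker was seen in this block
        self.promoted_images = []  # results collected across the whole page

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                # A new top-level block starts: reset the per-block state.
                self.images, self.is_promoted = [], False
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if self.depth > 0 and data.strip() == PROMOTED_MARKER:
            self.is_promoted = True

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1
            # Keep the images only if the finished block carried the marker.
            if self.depth == 0 and self.is_promoted:
                self.promoted_images.extend(self.images)


def extract_promoted_images(html):
    """Return the image URLs found inside blocks marked as promoted."""
    parser = PromotedImageParser()
    parser.feed(html)
    return parser.promoted_images


# Hand-written stand-in for a scraped feed, not real Twitter markup.
SAMPLE_HTML = """
<div>
  <article><img src="https://example.com/organic.jpg"><span>Hello</span></article>
  <article><img src="https://example.com/ad.jpg"><span>Promoted</span></article>
</div>
"""

print(extract_promoted_images(SAMPLE_HTML))
```

In a Scrapy spider, the same logic would likely live in the parse callback and use Scrapy's selectors instead of a hand-rolled parser.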

Figure 2. A screenshot of the promoted tweet marker within Chrome developer tools.

The Classification Model

The open-source machine learning library PyTorch makes building and training image classification models manageable! Using this package, I will train a Convolutional Neural Network (CNN) as a binary classifier that sorts images into two categories: “fertility related ads” and “non-fertility related ads”. Figure 3 illustrates a simplified diagram of how this model works. CNNs are super useful in that they reduce images into sizes that are efficient for processing while preserving the characteristics that are most pivotal for accurate prediction. The algorithm represents each image as matrices of pixel values and passes them through layers whose weights and biases are adjusted during training, so that the model learns feature patterns and ultimately classifies the input image into its appropriate category.

The convolution layer combines the input image with a feature detector to create an output called a feature map. The feature detector is also called a kernel or filter: a small matrix of weights that is randomly initialized and then learned during training. The kernel slides across the image's matrices of color values, one per channel of the color space (e.g., RGB vs. CMYK), multiplying element by element and summing the results. The resulting feature map is often a smaller and simpler matrix than the input. These calculations are usually complex, so one of the major reasons why I am using PyTorch is that it does most of the computation for me. Furthermore, its companion library torchvision can load images through the Python Imaging Library and convert them into the tensors the model expects.

As this process repeats across successive convolution layers, another layer called the subsampling (or pooling) layer further simplifies each feature map using a down-sampling technique. Lastly, after the feature maps are flattened into a single vector, the fully connected layer is the component of the model that classifies the image between the categories. It learns non-linear combinations of the high-level features retained by the convolutional and pooling layers. In other words, this is where the classification happens.
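To make the convolution and down-sampling steps concrete, here is a small worked example in plain Python. It is only an illustration of the arithmetic, not the code the study will use (PyTorch performs these operations internally). The toy image, the hand-picked kernel, and the function names are all assumptions; in a real CNN the kernel weights are learned rather than chosen by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding); return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it element-wise, then sum.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2):
    """Down-sample by keeping the largest value in each non-overlapping block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ))
        pooled.append(row)
    return pooled


# Toy 5x5 single-channel "image" with a bright vertical stripe in the middle.
image = [[0, 0, 1, 0, 0] for _ in range(5)]

# Hand-picked 2x2 kernel that responds to a dark-to-bright edge, left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
pooled = max_pool2d(feature_map)
print(feature_map)  # 4x4 map: each row is [0, 2, -2, 0]
print(pooled)       # 2x2 map: [[2, 0], [2, 0]]
```

The 5x5 input shrinks to a 4x4 feature map and then to a 2x2 pooled map, while the position of the edge that the kernel responds to is still visible in the output.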

 

Figure 3. A diagram of the convolutional neural network (CNN). Source
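A model of this general shape could be sketched in PyTorch as follows. This is a minimal illustration rather than the study's actual model: the layer sizes, the 64x64 input resolution, the optimizer settings, and the class name AdClassifier are all assumptions, and random tensors stand in for the scraped ad images.

```python
import torch
from torch import nn


class AdClassifier(nn.Module):
    """Small CNN: two convolution/pooling stages, then a fully connected classifier."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # RGB input -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16 -> 32 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # feature maps -> one vector
            nn.Linear(32 * 16 * 16, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),                   # one score per category
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = AdClassifier()

# Random stand-ins for four 64x64 RGB ad images, NOT real data.
batch = torch.randn(4, 3, 64, 64)
# Placeholder labels: 1 = fertility related ad, 0 = non-fertility related ad.
labels = torch.tensor([1, 0, 1, 0])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One training step: forward pass, loss, backward pass, weight update.
logits = model(batch)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(logits.shape)
```

In practice the random batch would be replaced by labeled ad images loaded from the scraped database, and the training step would run over many batches and epochs.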

Conclusion

These proposed methods are a component of a larger mixed-methods study that I am conducting on the logics behind targeted fertility advertising. Finding a way to test whether there is a significant relationship between a user persona displaying behaviors associated with infertility and a high occurrence of fertility ads helps establish a ground truth for the critical work we are doing in understanding the limitations of using data as a predictive analytic for consumer interests. In addition to investigating the dissonance between our physical and digital selves, this work also sheds light on the regulatory limitations of our current definitions of a “protected class” or “sensitive category”. Although there are legal protections put in place by the CCPA or by the Consumer Protection Agency, categories of vulnerability are quite fluid, and advertisers are continuously finding ways to exploit such identities. Ultimately, sociotechnical tools have always been flawed, but the goal of this research is to show that personal data can be a dangerous mirror of our real-life experiences. It often fails to capture vulnerabilities, and as a result its agnostic positioning as “evidence-based” upholds systematic inequalities and promotes an abstracted and reductionist picture of ourselves.

References

Bailey, M., & Peoples, W. (2017). Articulating black feminist health science studies. Catalyst: Feminism, Theory, Technoscience, 3(2), 1–27.

Gak_CHI21_DarkPatterns.pdf. (n.d.). Google Docs. Retrieved November 29, 2021, from https://drive.google.com/file/d/1AETDbMWpWlzPJQtYKPT2oMzWekjNGxzN/view?usp=sharing&usp=embed_facebook

Gray, C. M., Chen, J., Chivukula, S. S., & Qu, L. (2021a). End User Accounts of Dark Patterns as Felt Manipulation. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1–25.

Gray, C. M., Santos, C., Bielova, N., Toth, M., & Clifford, D. (2021b). Dark patterns and the legal requirements of consent banners: An interaction criticism perspective. 1–18.

Gray, S. (2014, April 29). One woman’s attempt to hide her pregnancy from big data—It’s more difficult than you’d expect. Salon. https://www.salon.com/2014/04/28/one_womans_attempt_to_hide_her_pregnancy_from_big_data/

Haraway, D. (2006). A cyborg manifesto: Science, technology, and socialist-feminism in the late 20th century. In The international handbook of virtual learning environments (pp. 117–158). Springer.

Hill, K. (2012, February 16). How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. Forbes. Retrieved November 29, 2021, from https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Martin, E. (1991). The egg and the sperm: How science has constructed a romance based on stereotypical male-female roles. Signs: Journal of Women in Culture and Society, 16(3), 485–501.

Mathur, A., Acar, G., Friedman, M. J., Lucherini, E., Mayer, J., Chetty, M., & Narayanan, A. (2019). Dark patterns at scale: Findings from a crawl of 11K shopping websites. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1–32.

My Experiment Opting Out of Big Data Made Me Look Like a Criminal. (n.d.). Time. Retrieved November 29, 2021, from https://time.com/83200/privacy-internet-big-data-opt-out/

“My World Is Very Dark Right Now”: What It’s Like To Be Targeted By Baby Ads After Miscarriage. (2019, September 29). HuffPost UK. https://www.huffingtonpost.co.uk/entry/women-affected-by-miscarriage-and-infertility-are-being-targeted-with-baby-ads-on-facebook_uk_5d7f7c42e4b00d69059bd88a

Nouwens, M., Liccardi, I., Veale, M., Karger, D., & Kagal, L. (2020). Dark patterns after the GDPR: Scraping consent pop-ups and demonstrating their influence. 1–13.

Perspective | Dear tech companies, I don’t want to see pregnancy ads after my child was stillborn. (n.d.). Washington Post. Retrieved November 29, 2021, from https://www.washingtonpost.com/lifestyle/2018/12/12/dear-tech-companies-i-dont-want-see-pregnancy-ads-after-my-child-was-stillborn/

Pierlejewski, M. (2020). The data-doppelganger and the cyborg-self: Theorising the datafication of education. Pedagogy, Culture & Society, 28(3), 463–475.