Can Virtual Communities be Archived?

February 23, 2021

When future historians try to piece together social life in the twenty-first century, they won’t be combing through faded newspaper clippings or handwritten letters. They’ll be clicking through digital archives that have stored remnants of our real and virtual lives: email collections, tweets, Facebook messages, maybe even Google Calendar entries. 

For the time being, preservation of these records is the burden of the user. If one chooses, they can download their Facebook archive or back up their emails and photos, and maybe choose to share them with a scholar in the future, in the way people donated personal correspondence collections to archives in the past. But digital platforms have created a category of virtual spaces that may be hard to recover: online communities.

Over the past year, I’ve become interested in archiving an online community that exists on Facebook, called Memories of East Pakistan/Bangladesh (“Memories”). Memories is a Facebook group that was created three years ago to serve as a gathering place for members of the East Pakistani diaspora. East Pakistan was an administrative unit that existed in South Asia between 1947 and 1971. It went from being the eastern wing of the country of Pakistan to fighting for independence and becoming the new country of Bangladesh.

Map by Julius Paolo

As I’ve written about before, East Pakistan was a linguistically, religiously, and ethnically diverse region. When Bangladesh was created, many whose identities conflicted with the new Bangla-speaking, Muslim-majority nation-state felt ostracized and left the country.

Prior to the creation of this Facebook group, there was little space for those who had called East Pakistan their home and left to share their experiences and realize they were not alone. Over 3000 members post regularly on the group, sharing their migration stories, recollections of everyday life, and links to food, photos, and music from East Pakistan.

These are primary sources unlike any other in existence. But how will they be saved? This is a conversation I’ve been having with the Memories leadership, campus librarians, and copyright scholars. We believe these records can be stored in a digital archive -- and there is a strong desire for them to be -- but the process will be difficult, particularly because they exist on a corporate social media platform. 

An archive is traditionally a collection of physical primary source materials. At the least, archives are storage centers for these materials, but ideally, they are curated collections that facilitate the discovery of stories by visitors. The UC Berkeley Bancroft Library, for example, hosts curated archives for different aspects of campus life. A digital archive is usually meant to be a collection of digitized materials. Examples of digital archives include the Internet Archive, which has vast collections of web pages, rare books, photographs, and films, or UC Professor Scott Saul’s website “Becoming Richard Pryor”, which hosts primary sources he uncovered while doing research for his book on Pryor.

Themed collections from the digital archive, “Becoming Richard Pryor”

But what I would like to do is create a digital archive of a digital space: essentially a copy of the Facebook group that is available for viewing outside of the social media platform, and one that is ideally organized to help web visitors understand the rich and varied experiences of East Pakistanis. There are very few examples of this type of project.

These types of archives are difficult to create because of the many legal, logistical, ethical, and financial considerations. The first question is, who has the rights to the content posted to the Facebook group? According to Facebook’s Terms of Service, it appears that the rights belong to the poster. In the specific setting of a Facebook group, one is allowed to collect data from the group so long as it is made clear that the user, rather than Facebook, is collecting the data. If we wanted to collect the content from the Memories group, this would require setting up something like a Google form inviting members to give their consent to storing and/or republishing their posts in an archive. It may also require individual follow-up with those who may not have seen or understood the consent form, and we would need to be ready to accept that some people will be difficult to reach or may simply say no.

Even with consent, downloading the content will be cumbersome. For well-founded reasons, most social media companies have made it difficult to download other people’s data at scale from their sites. Some sites, like Twitter, have made an API available for researchers to collect public data on trending topics, tweets, hashtags, and other information once their project is cleared. Facebook has an API available for developers that allows collection of some information from groups, but it only serves data from the last 90 days in accordance with the company’s data retention policies and requires admin privileges. 

Building a full data set -- posts, comments, multimedia such as links, photos, and videos, and metadata such as the date of the post and name of the poster -- may then require a manual effort. This means copy-pasting text, post by post, into a spreadsheet. If we imagine it takes three minutes to manually collect the data from each post, and there are 2000 vital posts we want to preserve, that would be 100 hours of work. Efforts like this are not unheard of: as a D-Lab consultant I’ve worked with researchers who undertook similar projects, albeit for smaller groups, and as an undergraduate RA I did this kind of manual scraping with professors in a multi-year project. But they take resources.
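To make the scale and the shape of that spreadsheet concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the column names (post_date, poster_name, and so on), the consent_given flag, and the helper functions are all hypothetical, not part of any Facebook tool or an agreed archive format. It shows one row per manually collected post, the time estimate from the figures above, and a filter that writes out only posts whose authors have consented.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class PostRecord:
    """One manually collected post (all field names are hypothetical)."""
    post_date: str       # date of the post, e.g. "2020-11-03"
    poster_name: str     # name of the poster
    text: str            # body of the post
    media_links: str     # links, photos, videos; semicolon-separated
    comments: str        # comments copied from the thread
    consent_given: bool  # whether the poster agreed to archiving


def estimate_hours(num_posts: int, minutes_per_post: float) -> float:
    """Total manual collection time in hours."""
    return num_posts * minutes_per_post / 60


def write_consented(records, out) -> int:
    """Write only consented posts as CSV rows; return how many were kept."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(PostRecord)])
    writer.writeheader()
    kept = 0
    for record in records:
        if record.consent_given:
            writer.writerow(asdict(record))
            kept += 1
    return kept


if __name__ == "__main__":
    # The figures used in the text: 2000 posts at three minutes each.
    print(estimate_hours(2000, 3))

    # Placeholder rows, invented purely for illustration.
    sample = [
        PostRecord("2020-11-03", "Example Member", "A placeholder memory.", "", "", True),
        PostRecord("2020-11-04", "Another Member", "Not yet consented.", "", "", False),
    ]
    buffer = io.StringIO()
    print(write_consented(sample, buffer))
    print(buffer.getvalue())
```

Keeping consent as a column in the same sheet is just one possible design. It means a post can be collected once and published only later, if and when its author agrees.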

If the data are successfully scraped, then come the more standard archival concerns. Where will the data be stored, keeping in mind security and accessibility? How will they be organized? Where will they be published? Tried and tested infrastructure like Google Drive, Dropbox, or Box could work for storing, and tools like WordPress, Omeka, or Scalar for presentation. When making these decisions, we will need to consider the ongoing maintenance and preservation of the data set. If we don’t have automatic tools scraping new data, how will we ensure that new posts or comments are being regularly added to our archive? And what digital infrastructure do we trust not to break or be sunset, so that our archive doesn’t suddenly disappear?

Throughout this entire process, we must bear in mind the psychological tension inherent in preserving something we presently understand to be ephemeral and private. Even with consent, it is difficult for anyone to imagine what it means for a Facebook post to be stored indefinitely or be presented to the public. The digital archivist will have a responsibility to develop protocols that honour the intent with which the content was produced, perhaps allowing authors to ask for later deletion from the archive or taking particular care to contextualize the material when posted on a website.

I’m just beginning this project. In my recent consultations with librarians and platform experts, it’s become clear that creating a digital archive of a digital space will be no small undertaking. But while I am considering just one community, there will undoubtedly be many others like mine that are worthy of archiving and want it. What we all need is more infrastructural and institutional support. Is there a responsible way for Facebook to help facilitate these cultural heritage projects? Can there be institutional funding to create and maintain these projects? What would digital infrastructure look like that didn’t rely on corporate mediation for a community to take ownership of its data?

Thank you to Adam Anderson, Stacy Reardon, and Claudia von Vacano for their advice on this project.