Like many people around the world, I have been spending the last few weeks at home pouring hours into Nintendo’s newest hit game, Animal Crossing: New Horizons. In what the New York Times describes as “The Game for the Coronavirus Moment,” players take up the role of a human who moves to a deserted island to start a new town alongside their anthropomorphic animal neighbors. Crafting furniture, fishing, and shopping are among the activities you can engage in, and this idyllic world has proven to be the perfect escape over the last couple of months. COVID-19 has also coincided with me wrapping up my doctoral studies here at Berkeley. As I reflect on my time here, I wanted to take this opportunity to share some of the skills I picked up along the way that helped me become a full-fledged researcher. So, in this post I will illustrate how we can use data science techniques to analyze Animal Crossing! All of the code for this post can be found on my GitHub.
The villagers have been a staple of the Animal Crossing franchise for years. Players naturally have favorites, and some villagers have even become internet sensations. Others are not so lucky. With so many villagers, I thought it would be fun to explore some of their features, and hopefully learn something that helps me keep building my town.
Figure 1: Raymond has become one of the most popular and sought-after villagers. Image Source: Nintendo
In this post, I will highlight some of the data science techniques that I use to acquire, clean, and analyze data. Using both Python and R, I will show you how to web scrape data, visualize the information with exploratory analysis, and apply a common machine learning technique. Data science is empowering because it provides a toolkit for researchers to solve problems at any stage of their research. Although my own research focuses on cybersecurity and privacy law, I wanted to write this blog post to show that data science can be applied across domains. In this case, we can even explore Animal Crossing villagers! Our first step will be to acquire some data about them, which we will scrape from the Animal Crossing wiki page, as seen in Figure 2.
Figure 2: The first five rows of the table that we are going to scrape. Source: Animal Crossing Wiki
To scrape this table, we’re going to use some common Python tools. In particular, we’ll use the Selenium and Beautiful Soup packages. Selenium is a package that allows one to automatically navigate to a website using code alone. The basic idea is that using Python, we can traverse the Internet without ever manually clicking anything. Then, we use Beautiful Soup to parse the underlying HTML of a webpage, where all of the information is stored. We can then scrape the table in Figure 2 and turn it into a pandas dataframe. Figure 3 shows some of this code, written in a Jupyter notebook.
Figure 3: Code to navigate to a webpage and scrape a table from it.
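For readers who want something to copy and run, here is a minimal sketch of the same idea. It is not the code from the figure, only one way the steps could look. The function names fetch_html and table_to_dataframe are hypothetical, the sketch assumes Chrome and a matching driver are installed for Selenium, and the sample HTML uses placeholder rows so that the parsing step can be tried without visiting the wiki.

```python
import pandas as pd
from bs4 import BeautifulSoup


def fetch_html(url):
    """Open a page in a browser through Selenium and return its HTML source."""
    # Imported here so the parsing code below still runs without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def table_to_dataframe(html):
    """Turn the first <table> in an HTML string into a pandas DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; keep the visible text only.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


# Placeholder stand-in for the wiki page (made-up rows, not real villagers).
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Personality</th><th>Species</th></tr>
  <tr><td>Villager A</td><td>Smug</td><td>Cat</td></tr>
  <tr><td>Villager B</td><td>Peppy</td><td>Rabbit</td></tr>
</table>
"""

villagers = table_to_dataframe(SAMPLE_HTML)
print(villagers)
```

On the real page, the HTML returned by fetch_html would be passed to table_to_dataframe instead of SAMPLE_HTML. A page with several tables would need a more specific selector than the first table found.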
Now that we have a dataframe, we can start exploring and visualizing our data. For this part, we will switch over to R to do some plotting with the ggplot2 package, which is part of the tidyverse. The tidyverse is a suite of tools that is invaluable for data science. It includes methods for cleaning, organizing, plotting, and modeling data. It is important to note that as data scientists, we use the best tool available to us regardless of language. I frequently switch between Python and R depending on what I want to do. As I found out, learning to code in one language made learning a second or third programming language much easier because they all share the same computational principles. Being able to flexibly switch between languages, packages, and environments opens up lots of possibilities for budding data scientists.
Figure 4: Sample R Code to produce plots
So, what are some of the things that we can plot? We have information about each villager’s personality, species, birthday, and catchphrase. One thing we might do is visualize how many villagers there are per species. Then, we could shade these plots to break them down further by personality type. Because each personality type corresponds to a particular gender, we can also do a little data manipulation to make a column for gender and shade with that instead. There are lots of possibilities here, even with only a handful of columns to work with.
Figure 5: Barplots for Animal Crossing Villagers
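The counts behind bar plots like these can also be computed directly in pandas. The sketch below is a rough illustration rather than the code used for the figures: the rows are placeholders, and the personality-to-gender mapping is my assumption about the labels in the scraped table, so it should be checked against the actual Personality column.

```python
import pandas as pd

# Placeholder rows; the scraped table has one row per villager.
villagers = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Species": ["Cat", "Cat", "Frog", "Rabbit", "Rabbit"],
    "Personality": ["Smug", "Normal", "Jock", "Peppy", "Peppy"],
})

# Assumed mapping from personality type to gender (labels may differ on the wiki).
PERSONALITY_TO_GENDER = {
    "Cranky": "Male",
    "Jock": "Male",
    "Lazy": "Male",
    "Smug": "Male",
    "Normal": "Female",
    "Peppy": "Female",
    "Snooty": "Female",
    "Sisterly": "Female",
}

# Derive a Gender column from Personality.
villagers["Gender"] = villagers["Personality"].map(PERSONALITY_TO_GENDER)

# Number of villagers per species.
species_counts = villagers["Species"].value_counts()

# The same counts broken down by personality and by gender.
species_by_personality = pd.crosstab(villagers["Species"], villagers["Personality"])
species_by_gender = pd.crosstab(villagers["Species"], villagers["Gender"])

print(species_counts)
print(species_by_personality)
print(species_by_gender)
```

Each crosstab has one row per species and one column per category, which is the shape a stacked bar chart needs. With matplotlib installed, species_by_personality.plot(kind="bar", stacked=True) would draw one.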
Plotting out all of these relationships, we learn some interesting things! Cats, rabbits, and squirrels are among the most common species. Meanwhile, meeting an octopus or cow is relatively rare. Broken down by personality type, different species also have different personality distributions. For example, there are a lot of jock frogs and peppy rabbits. Because the personality types also map to gender, different species also have different gender breakdowns.
Finally, we can even apply some common modeling techniques to this dataset. In this case, I will visualize k-means clusters and Principal Component Analysis (PCA). K-means clustering essentially groups similar data points together. PCA is a common technique for dimensionality reduction, and it essentially condenses the information in our columns into fewer columns (or “components”). So, we first one-hot encode our data to transform the categorical Species, Personality, and Birth Month columns into numerical information. We then find some clusters, and plot them against two principal components. In this case, we don’t get very nicely separated clusters, which makes sense given we only have three features, and they are categorical rather than continuous. Still, the fact that we are able to explore the data this way with a little code is incredible.
Figure 6: K-Means Clusters on PCA Components
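The pipeline described above (one-hot encoding, then k-means, then PCA) can be sketched with pandas and scikit-learn. This is an illustration under assumptions, not the code behind Figure 6: the rows are placeholders, and the choice of three clusters is arbitrary because the post does not say how many were used.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder rows with the three categorical features named in the post.
villagers = pd.DataFrame({
    "Species": ["Cat", "Cat", "Frog", "Frog", "Rabbit", "Rabbit", "Squirrel", "Octopus"],
    "Personality": ["Smug", "Lazy", "Jock", "Cranky", "Peppy", "Normal", "Snooty", "Lazy"],
    "Birth Month": ["January", "March", "March", "June", "June", "October", "October", "January"],
})

# One-hot encode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(villagers[["Species", "Personality", "Birth Month"]]).astype(float)

# Group the encoded rows into clusters (n_clusters=3 is an assumed value).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(encoded)

# Project the encoded rows onto two principal components for plotting.
pca = PCA(n_components=2)
components = pca.fit_transform(encoded)

# One row per villager: its 2D coordinates and its cluster label.
result = pd.DataFrame(components, columns=["PC1", "PC2"])
result["Cluster"] = labels
print(result)
```

A scatter plot of PC1 against PC2, colored by Cluster, gives a picture like Figure 6. Because every encoded column is 0 or 1, many villagers sit at similar distances from one another, which is one reason the clusters do not separate cleanly.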
Using the skills that I learned over the last few years, I was able to acquire and explore this dataset, and learn some things about the villagers who populate my virtual town! Some species are much less common than I thought, and it was fascinating to see that different personalities are not equally distributed. Recruiting a diverse cohort of villagers to my town means I need to balance all of these different features. The fact that I was able to take my research skills and apply them to a hobby shows the flexibility that data science provides. The intellectual framework and techniques cross disciplinary boundaries, and can change the way we approach problems. Thinking in a data-driven way has not only changed the way I approach research, but also how I think about the world more generally.
I’ll end with some reflections about my time learning data science at Berkeley, and some advice for students who are aspiring data scientists. First, data science is ultimately a combination of both “hard” and “soft” skills. Writing code is a technical skill, but framing questions, exploring data, and communicating results to an audience are all necessary too. Learning both of these sets of skills and effectively combining them takes a long time, but the process is rewarding because at the end you truly know what it means to think like a scientist. Second, the definition of what counts as “data” has expanded, and thinking creatively about this fact can open up numerous possibilities. In my own work, I have used text and video data in addition to more typical quantitative measures, and now I see potential sources of data in all sorts of unique places. Think creatively about previous limitations in your own field, and how you can solve those problems with novel data. Third, data science is inherently interdisciplinary and collaborative. There is a vibrant campus-wide community interested in data science, and immersing myself in it was the best thing I did in my time at Berkeley. Take courses, meet graduate students and researchers, and become involved with organizations outside of your department. These connections will help you feel part of a broader community, and expand your intellectual horizons. And finally, remember to always take time for friends and hobbies. Learning data science is a marathon, not a sprint. In addition to learning how to code and do statistics, it is important to take time to explore other interests, and sometimes do nothing at all. Luckily, with Animal Crossing I can do all of these things! Whether you want to use data to explore Animal Crossing, build models that predict public health or economic outcomes, or anything else, I hope this post shows you that the tools and resources are at your fingertips.