Creating the Ultimate Sweet
As Joanne Chang (founder of Flour Bakery in Massachusetts) says, “Make life sweeter… eat dessert first!” This relatable saying has also been attributed to Jacques Torres, Helen Keller, and Erma Bombeck. Apocryphal or not, many of us enjoy a sweet treat, which raises the question: “which candy is the best?” Thankfully, the website FiveThirtyEight created a poll to help us try to answer this question.
The poll data is accessible on Kaggle [1], and contains the names of 85 sweets, as well as attributes of each sweet (e.g. whether it is chocolate-based), the relative sugariness and price of the sweet, and the win percentage of the sweet as voted on by over 8,000 participants. In the poll, participants were shown successive random pairs of candies from the sample and voted on which of the two they preferred. These responses were used to calculate the win percent: the fraction of its matchups that a given candy won.
In this blog post, we’ll use linear regression to fit a model to connect attributes of different sweets to the popularity of those sweets. The code used for this post is available on Github [3]. Let’s download the CSV file and get started!
Loading the data
The poll data has been collated into a data frame, and is stored in a comma-separated values (CSV) file. We will use the pandas package in Python to read in and analyze this tabular data. After downloading the data from [1], let’s read it in:
# Import useful packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Load the data
data = pd.read_csv('candy-data.csv')
Data exploration
Before we start analyzing the data, we want to perform some initial data exploration, in order to get a sense of what format the data is in, what types of data we have to work with, and how complete the data set is. Answering these preliminary questions will inform the types of analysis we can perform on this data set and the inferences we can make from it. To do so, we will use the 'DataFrame.info()' method of the pandas package, which prints a summary of the columns in the data frame, including what type of data each column contains and whether any values are missing.
# Get some basic information about the dataframe
data.info()
Figure 1. Python output summarizing the data frame, showing that “data” contains 13 columns, the number of non-null values in each column (no values are missing), and the data type of each column.
It looks like we have 13 columns, each of which corresponds to either the sweet’s name, some attribute of the sweet (e.g. 'chocolate', 'nougat', 'sugarpercent'), or the popularity of the sweet, as measured by the percentage of matchups each sweet won ('winpercent'). There are no null entries, so we can use all of the rows in the data frame!
Now, let’s see what correlations there are between the different variables, using the seaborn package. We can calculate the correlation coefficients between each pair of numeric columns in the data frame, and plot them with a heat map.
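A minimal sketch of this step might look like the following (the colormap and styling here are arbitrary choices, so the exact look of Figure 2 may differ; 'numeric_only=True', available in recent pandas versions, skips the candy name column):

# Compute pairwise correlation coefficients between the numeric columns
corr = data.corr(numeric_only=True)

# Plot the correlation matrix as an annotated heat map
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()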
Figure 2. Heat map showing the correlations between each of the columns in the data set. The color of each square is set by the correlation coefficient between the columns indicated by the x and y label of the square, according to the color bar on the right of the plot. Each square is annotated with its correlation coefficient.
We can see that there are positive correlations between the 'chocolate', 'peanutyalmondy', and 'bar' attributes and the 'winpercent' attribute. On the other hand, there is a negative correlation between the 'fruity' and 'winpercent' attributes. In other words, a sweet which is fruity is generally less popular than a sweet which is not fruity.
We can also see that there is a strong negative correlation between 'fruity' and 'chocolate' - it looks like few sweets combine fruit and chocolate. The strongest positive correlation is between 'chocolate' and 'winpercent'.
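If you would rather read these relationships off programmatically than from the heat map, here is one quick sketch, reusing the 'corr' matrix computed above:

# Rank each attribute by its correlation with winpercent
print(corr['winpercent'].drop('winpercent').sort_values(ascending=False))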
Now that we have taken a first look at the dataset, we have a sense of what sorts of questions we might be able to answer with this data. For example, we could ask “how likely is a chocolate-based sweet to be in bar form?” or “how likely is a bar to contain both nougat and caramel?” For this post, we will tackle the question of “What would be the most popular sweet?” In order to answer this question, we will need to fit a model to the data.
What does it mean to fit a model to this data?
There are many ways to fit a model to a dataset, depending on the desired output, the desired complexity, the format of the input data, and other factors. In this case, we want to see how different attributes of the sweets impact the popularity of the sweet. In other words, given some input (independent variables, such as sweet attributes), what will the output be (dependent variable, i.e. popularity)? We can also look at this as searching for the function f, where:
Y = f(X̂)
Here, Y is our dependent variable ('winpercent'), and X̂ is a vector of the independent variables consisting of features from the dataset capturing qualities of each candy. Let’s take a closer look at what independent variables we want to include. The columns of the data frame are as follows:
- 'competitorname': the name of the sweet
- 'chocolate', 'fruity', 'caramel', 'peanutyalmondy', 'nougat', 'crispedricewafer', 'hard', 'bar', 'pluribus': binary attributes (1 if the sweet has the attribute, 0 if not)
- 'sugarpercent', 'pricepercent': the sweet’s sugar content and price, as percentiles relative to the rest of the set
- 'winpercent': the percentage of matchups the sweet won
These columns are all possible independent variables (aka inputs or predictors) for our model, but we might not want to use all of them!
Pre-processing the data
I think it is unlikely that the name of a sweet will influence its popularity, so I am going to exclude this attribute from our analysis. This assumption is not entirely safe: a sweet called “Emma’s Horrible Tooth-Rotters” might not be popular even if it is delicious. For now, though, we’ll exclude this attribute. I encourage you to check out the D-Lab’s workshops on Python Web Scraping and Text Analysis if you are interested in how to fold this type of analysis into a model.
We will also need to split our data into training and test subsets. This way, we can train our model on some data, and then evaluate its performance on a new, independent sample. There is no one right way to split the data, but we don’t want a very small training set (as we risk creating an inaccurate model) or a very small testing set (as we risk getting an unreliable estimate of the model’s performance). For now, let’s split the data in half, so that the training set and test set are the same size.
# Make the winpercent column the same format as the pricepercent and sugarpercent columns
data.winpercent = data.winpercent / 100

# Split into training and test data
y = data.winpercent.to_numpy()
X = data[data.columns[1:-1]]

half = int(np.floor(len(y) / 2))
train_X = X[:half]
train_y = y[:half]
test_X = X[half:]
test_y = y[half:]
Creating a model
Let’s start with a simple assumption, that the popularity of a sweet is a linear function of the sweet’s attributes, i.e.
Y = βX̂ + ε
where β is a vector of coefficients that we want to determine, and ε is a constant intercept. We’ll use the 'statsmodels' package to set up a Generalized Least Squares (GLS) model to fit the training data set.
# Add a constant intercept (ε) as an input variable to fit for
train_X = sm.add_constant(train_X)

# Create a generalized least squares model with our training data
model = sm.GLS(train_y, train_X)

# Fit the model
result = model.fit()
Now we have a model! Let’s see what it looks like:
print(result.summary())
Figure 3. Python output of the Generalized Least Squares (GLS) Model Regression. The top half of the output describes the model’s attributes (e.g. method) and performance (e.g. R-squared). The bottom half of the output includes the model’s coefficients for each of the independent variables, as well as their associated t-statistic and p-value.
There’s a lot going on here, but let’s first look at the regression coefficients (labeled “coef” in the second table), which tell us how our model thinks the dependent variable depends on the independent variables. It seems that 'chocolate' is the only predictor with a statistically significant p-value, and it has a large positive coefficient: if a candy is chocolate-based, it is significantly more likely to be popular.
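If you prefer to pull these numbers out of the results object rather than read them off the summary table, statsmodels exposes them directly; a minimal sketch:

# Coefficients, sorted from largest to smallest
print(result.params.sort_values(ascending=False))

# Associated p-values, sorted from most to least significant
print(result.pvalues.sort_values())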
We can evaluate the ability of our model to capture the variance in the data using the R-squared statistic, otherwise known as the coefficient of determination [2]. This metric describes the fraction of the variance in the dependent variable that is captured by our model. A model that perfectly fits every single observation would have an R-squared of 1.0. Our model has an R-squared value of 0.676, which isn’t too bad!
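Concretely, R-squared is computed as

R² = 1 − SS_res / SS_tot

where SS_res is the sum of squared residuals (the differences between the observed and predicted values) and SS_tot is the total sum of squares of the dependent variable about its mean.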
Let’s see how our model does on our test data. The model hasn’t seen this subset of the data before, so we can evaluate its performance.
# Use our model to predict y values given the test x values
predict_y = result.predict(sm.add_constant(test_X))

# Plot the predicted y values against the true y values for the test set,
# and compare them to a straight line
plt.scatter(predict_y, test_y, c="navy")
plt.plot(np.linspace(0, 1), np.linspace(0, 1))
Figure 4. True win percentage for the candies in our testing data set vs. the win percentage predicted for these candies based on our model.
Our predictions are scattered evenly around the 1:1 line, but there is still a lot of variance. The R-squared value for the training set is 0.68, but for the test set it is only 0.2. In other words, the model is only capturing 20% of the variance in the test set. This tells us that our model could be improved. One possible approach would be to try increasing the fraction of the data set used to train the model.
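The out-of-sample R-squared quoted above can be computed directly from the predictions, using the formula from the previous section; a minimal sketch:

# Out-of-sample R-squared: 1 - SS_res / SS_tot
ss_res = np.sum((test_y - predict_y) ** 2)
ss_tot = np.sum((test_y - np.mean(test_y)) ** 2)
print(1 - ss_res / ss_tot)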
Inventing a new sweet
Now, let’s apply our model to a new sweet! My ideal sweet would be something like a Kit Kat (i.e. chocolate-based, with a wafer), but also would contain caramel, so let’s see what our model would make of that. We can create an X vector (i.e. list of attributes) for this new candy, and use our model to predict its Y value (i.e. popularity).
# Make an array with values for my invention, including a constant and
# using the mean sugarpercent and pricepercent values.
# Order: const, chocolate, fruity, caramel, peanutyalmondy, nougat,
# crispedricewafer, hard, bar, pluribus, sugarpercent, pricepercent
invent_X = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
            data.sugarpercent.mean(), data.pricepercent.mean()]

# Now use our model to predict how popular this sweet would be
print(result.predict(invent_X))
This returns 0.59, which tells me that this might not be the most popular sweet.
Creating the ideal sweet
Based on the outputs of our model, we can create the ideal sweet (assuming our model is accurate!). The attributes with positive coefficients are 'chocolate', 'fruity', 'nougat', 'crispedricewafer', 'sugarpercent', 'pricepercent', and the added constant. If we use an input that maximizes all of these attributes, we will get the ideal sweet under our model.
# Make an array with values for my invention, maximizing every attribute
# with a positive coefficient.
# Order: const, chocolate, fruity, caramel, peanutyalmondy, nougat,
# crispedricewafer, hard, bar, pluribus, sugarpercent, pricepercent
invent_X = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]

# Now use our model to predict how popular this sweet would be
print(result.predict(invent_X))
As we expected, this predicts a very high win percent value of 97%. Although this might be an interesting result, it is still important to sanity-check it before we rush off to develop a new candy bar - a sweet with chocolate, fruit, nougat, and wafer that is incredibly sugary and very expensive might not be very popular in real life, nicely highlighting the limitations of our model!
Other Approaches
Our first model does not account for the correlations between the input features. For example, the correlation heat map revealed that a chocolate-based sweet was less likely to contain fruit, and that a bar was less likely to come in multiples. This collinearity between predictors is not handled by our model, and it makes the fitted coefficients less stable and harder to interpret!
A more nuanced model could take these correlations into account, by using other modeling methods like principal component analysis (PCA), ridge regression, or partial least-squares (PLS) regression. We could also have excluded variables which were highly correlated, or combined multiple correlated variables into a single composite variable. I encourage you to try and make some of these changes yourself, and see what other questions you can ask and answer using this data set and your data analysis skills!
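As one illustration of these alternatives, here is a minimal ridge regression sketch using statsmodels’ regularized fitting (the penalty strength alpha=0.1 is an arbitrary, untuned choice here; in practice you would select it by cross-validation):

# Ridge regression shrinks coefficients toward zero, which stabilizes the
# estimates when predictors are correlated. With L1_wt=0 the elastic-net
# penalty reduces to a pure L2 (ridge) penalty.
ridge_result = sm.OLS(train_y, train_X).fit_regularized(
    method="elastic_net", alpha=0.1, L1_wt=0.0)

# Compare the shrunken coefficients to the GLS estimates
print(ridge_result.params)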