On Selection Sunday, a twelve-member NCAA committee kicks off March Madness by picking America’s best college basketball teams. Each year, millions of people build their brackets based on records, school allegiances, favorite colors, and weirdest mascots. The national college basketball event, which pits the top 64 Division I teams in the country against each other in a knockout-style tournament, is one of the largest betting events in sports. Over the tournament’s 67 games (including the First Four play-in round), over $8.5 billion is estimated to be wagered across 40 million bets, both legally and illegally (Odds Shark, 2021).
This year, whether or not you have money on the line, you can use an algorithm to create a bracket that defies the madness and wins you clout with all of your friends. This five-step guide gives you the resources you need to create a bracket with the stats and criteria that matter to you. To follow this guide, you will need some understanding of Python, logistic regression, and basic machine learning.
Step 0: What am I even doing?
This guide lets fans of basketball and statistics get their hands dirty building an algorithm to fill out their brackets. I can tell you with near certainty that your picks for this year’s bracket will not all be right. There are always going to be crazy upsets and late-game meltdowns. That is why we watch the game. However, I do think that this process can help you get a feel for how to use regression to build a heuristic model that tries to predict the outcome of extremely unpredictable events.
So, take your top 64 teams, pick the stats that mean most to you, run some regressions, stir the pot to predict the unpredictable, and enjoy three weeks of chaos, pride, and an orange ball going through a hoop.
Step 1: Let's get started with data
The one-stop shop for sports data is Sports Reference, which has a ton of data on most sports. The website boasts a robust collection of both regular stats, like winning percentage or home and away records, and advanced stats, like strength of schedule. These advanced stats are often important factors that sports statisticians have tested and used to predict game outcomes and team strength. Ironically, the challenge with Sports Reference is that there is too much data! To make your life easier, familiarize yourself with sportsipy. This Python API can help you gather any specific team’s data for a given year. While the documentation is a bit opaque, it can help you navigate through and choose the right data for your teams (sportsipy, 2021).
Here is an example of how to query the API.
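What follows is a minimal sketch of a query, assuming sportsipy is installed (pip install sportsipy) and Sports Reference is reachable. The helper names team_row and fetch_season are mine, not part of the library, and the stat attributes listed are the ones I believe sportsipy exposes for NCAAB teams. Check them against the documentation before relying on them.

```python
# Stats to pull for every team. These attribute names are assumptions;
# confirm them in the sportsipy NCAAB documentation.
STAT_FIELDS = ["win_percentage", "strength_of_schedule", "simple_rating_system"]


def team_row(team, year, fields=STAT_FIELDS):
    """Flatten one team object into a plain dict; missing stats become None."""
    row = {"year": year, "name": getattr(team, "name", None)}
    for field in fields:
        row[field] = getattr(team, field, None)
    return row


def fetch_season(year):
    """Download every Division I team for one season (makes network calls)."""
    # Imported lazily so team_row can be used and tested without the package.
    from sportsipy.ncaab.teams import Teams

    return [team_row(team, year) for team in Teams(year=year)]


# Example (not run here because it hits the network):
# rows = fetch_season(2021)
# print(rows[0])
```

Calling fetch_season once per season and concatenating the rows gives you one table of team stats per year, which is the raw material for the ratios described in the next section.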
What data should I use?
For your regression analysis, it is important to gather a comprehensive comparative dataset, from 1999 to the most recent March Madness tournament in 2021. These 22 years of data will give you plenty of information to work with: each school plays roughly 20-30 regular-season games per year, and each tournament adds more than 60 games. For each year, grab both the tournament results (each game’s result) and the statistics for each team involved.
A Master’s thesis by Fonseca on this topic shows a clean, efficient way to set up this data. Fonseca uses a ratio between stats to analyze in his regression model. For example, a team with a .750 win record facing a .600 win record team will have a win record ratio of 1.25.
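The ratio idea can be sketched in a few lines. This is my own illustration, not Fonseca’s code: the function name stat_ratios and the points_per_game numbers are made up, while the win records are the .750 and .600 from the example above.

```python
def stat_ratios(team_a, team_b, stats):
    """Return {"<stat>_ratio": a / b} for each stat; None if b's value is zero."""
    ratios = {}
    for stat in stats:
        denominator = team_b[stat]
        ratios[f"{stat}_ratio"] = team_a[stat] / denominator if denominator else None
    return ratios


# Illustrative teams: win_pct comes from the text, points_per_game is invented.
team_a = {"win_pct": 0.750, "points_per_game": 78.0}
team_b = {"win_pct": 0.600, "points_per_game": 72.0}

ratios = stat_ratios(team_a, team_b, ["win_pct", "points_per_game"])
print(round(ratios["win_pct_ratio"], 2))  # 1.25
```

Each tournament game then becomes one row of ratios plus a 1 or 0 for whether the first team won.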
Fonseca also included coaching stats, since in college basketball coaches are generally seen as a school’s greatest asset. Other stats that I found important to use in my model were regular-season stats such as ‘Quad 1 wins’ (roughly, wins against top-30 or top-50 teams) and strength of schedule. These are generally accounted for in the seeding, but they are nice to have in your analysis and can be dropped if there is high multicollinearity. After you run the model, you may find, like Fonseca did, that many of your variables end up having no measurable impact on a win. You can drop these statistics, or not: maybe it is the data that is wrong and your gut that is right! (Fonseca, 2017)
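One rough way to spot multicollinearity before you fit anything is to look at pairwise correlations between your ratio columns. The sketch below uses only the standard library. The 0.8 cutoff is a common rule of thumb that I am assuming here, not something from Fonseca, and the sample numbers are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def collinear_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair whose |r| meets the threshold."""
    names = sorted(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Invented ratio columns, one value per game.
columns = {
    "seed_ratio": [0.2, 0.5, 1.0, 2.0, 4.0],
    "sos_ratio": [1.4, 1.2, 1.0, 0.9, 0.7],
    "ft_pct_ratio": [1.0, 0.9, 1.1, 1.0, 1.05],
}
print(collinear_pairs(columns))
```

If two columns come back flagged, try dropping one of them and see whether the remaining coefficients become more stable.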
Step 2: Pick your Pseudo Stats
Here is where you should have fun with your model. If stats truly were king, the model would almost always choose the top-seeded teams. However, college basketball is filled with uncertainty: 12-seeds upset 5-seeds in nine of the past ten tournaments, and historically the Basketball Gods have been kind to ‘blue blood’ teams (UNC, Duke, Kentucky, and Kansas). Are there trends you are finding in your data set?
One pseudo stat that I put into my model is conference tournament results - I call it tournament vengeance. Prior to 2011, ten of 14 conference champs won the tournament, but since then only two of nine have, with the majority exiting in the finals or semifinals. It might be totally wrong, but I am giving extra weight to conference semifinalists this time around. It didn't end up having much of an effect, but I played around with adding it to see if it would give me any interesting upsets in my model.
Step 3: Run a regression model and sift through results
Run logistic regressions on the statistical ratios in your training dataset, holding out a few seasons to test the fit. Which variables are significant predictors of victory? Which variables are surprisingly unimportant?
SPOILER: Seeding matters. Does it matter too much in your model? In almost all cases, you will find that seeding drives your picks more than all other variables. But there is more to seeding than the number itself. See if you can apply the quality-of-wins criteria to better predict a team’s preparedness for the tournament. The NCAA Evaluation Tool (NET) is a mixture of the following criteria (Borzello, 2018):
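To make the regression step concrete, here is a bare-bones logistic regression fitted by gradient descent, written with only the standard library so you can see every moving part. In practice you would likely reach for statsmodels or scikit-learn, which also report significance. Two assumptions of mine: the features are the natural log of each ratio (so an even matchup is 0), and the nine games below are invented toy data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(intercept, weights, x):
    """P(first team wins) for one row of features."""
    return sigmoid(intercept + sum(w * v for w, v in zip(weights, x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimise log-loss with batch gradient descent; returns (intercept, weights)."""
    n, k = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for row, label in zip(X, y):
            err = predict_proba(intercept, weights, row) - label
            grad_b += err
            for j in range(k):
                grad_w[j] += err * row[j]
        intercept -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights


# Toy rows: [log(win_pct ratio), log(seed ratio)]; label 1 means the first team won.
X = [[0.3, -1.0], [0.2, -0.5], [0.1, 0.2], [-0.1, -0.3], [-0.2, 0.6],
     [-0.3, 1.1], [0.05, -0.2], [-0.05, 0.4], [0.15, -0.4]]
y = [1, 1, 0, 1, 0, 0, 1, 0, 0]

intercept, weights = fit_logistic(X, y)
print(intercept, weights)
print(predict_proba(intercept, weights, [0.3, -1.0]))
```

The sign and size of each weight tell you which ratios push the win probability up or down. With real data, compare predictions on the held-out seasons against what actually happened.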
Game results (W,L)
Strength of schedule
Away and neutral wins
Quality of wins (using the quad ranking system)
Scoring margin - OT wins only receive a margin of 1 point
Net offensive and defensive efficiency
All games will be evaluated equally; there is no bonus or penalty for when a game is played within the season.
As part of your regression, try to account for these variables to isolate more precise factors in seeding that can help you estimate good matchups. For example, if your model finds that Quad 1 wins and strength of schedule are strong predictors of tournament wins, then a 5-seed with a lower win rate but a harder schedule and more Quad 1 wins might be favored over a 3-seed with a weaker schedule (Lopez & Matthews, 2014).
It is important to note that this model is not going to be able to predict head-to-head player matchups. A team with a strong center facing a team without one might be able to pull out a win thanks to that mismatch. Here you may want to isolate the impact of certain statistics such as rebounds, assists, or points in the paint to make up for this lack of player specificity. These team attributes will be important when running a prediction algorithm.
Step 4: Plug your regression results into a prediction algorithm
For this step, we are actually going to create a bracket using an existing Git repo. Read the README file and follow the instructions to get started. You will need to create a few data files to feed into your config folder (nga-27, 2021).
You can use the heuristics template to add your weighting predictions for specific games based on your regression model. In the example, the author uses rank ratio to determine his heuristics. Yours will use a more detailed win probability based on your logistic regression model. For example, a ratio of relevant matchup statistics will give you an estimate of the likelihood of Team A beating Team B.
You can use the attributes template to add your weighting prediction for specific teams. For example, if you find from your regression model that winning percentage, road percentage, and free throw percentage are important, load these team-specific statistics into CSVs using the provided templates and add them to your config JSON. My attributes consisted of strength of schedule, rebounds, and scoring margin, among others.
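Turning regression output into a matchup probability might look like the sketch below. It is not the repo’s code: the coefficients are placeholders I made up (swap in your fitted values), the team numbers are invented, and I am assuming log-ratio features with a zero intercept so that the two teams’ probabilities add up to 1. The stats must be positive for the log to work.

```python
import math

# Placeholder coefficients on log(stat ratio); replace with your fitted values.
COEFFS = {"win_pct": 1.8, "rebounds_per_game": 0.9, "seed": -1.2}
INTERCEPT = 0.0  # zero keeps P(A beats B) + P(B beats A) equal to 1


def win_probability(team_a, team_b, coeffs=COEFFS, intercept=INTERCEPT):
    """Estimate P(team_a beats team_b) from the log of each stat ratio."""
    z = intercept
    for stat, weight in coeffs.items():
        z += weight * math.log(team_a[stat] / team_b[stat])
    return 1.0 / (1.0 + math.exp(-z))


# Invented 3-seed versus 5-seed matchup.
team_a = {"seed": 3, "win_pct": 0.800, "rebounds_per_game": 36.0}
team_b = {"seed": 5, "win_pct": 0.700, "rebounds_per_game": 39.0}

p_a = win_probability(team_a, team_b)
p_b = win_probability(team_b, team_a)
print(p_a, p_b)
```

A number like p_a is what you would feed into the heuristics file for that game, in whatever layout the template expects.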
Finally, this program allows you to design your own algorithm using a mixture of heuristics and attributes. Using the algorithm_template, you should be able to design an algorithm that uses heuristics stats that are weighted by attributes to predict wins.
Step 5: Tweak your heuristics, attributes, and have fun!
The GitHub repo has a great feature: it takes in the argument round_num, which allows you to trigger certain algorithms in certain rounds. Say, for example, that in the first round you think the seeding matchup is going to be the only thing that matters. You can have the first round focus only on seeding. If you feel that experience matters in later rounds, you might want to add an attribute about historic team success, favoring storied programs’ nerves of steel to win a championship.
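The round-by-round idea can be sketched like this. Only the name round_num comes from the repo. The function signature, the titles field, and the 0.5 bonus are all hypothetical, so adapt them to whatever the algorithm template actually expects.

```python
TITLE_BONUS = 0.5  # hypothetical weight per past championship


def pick_winner(team_a, team_b, round_num):
    """Return the predicted winner's name (hypothetical signature)."""

    def score(team):
        base = -team["seed"]  # a lower seed number is better
        if round_num == 1:
            return base  # first round: seeding only
        return base + TITLE_BONUS * team["titles"]  # later rounds: add pedigree

    return team_a["name"] if score(team_a) >= score(team_b) else team_b["name"]


# Invented teams: a 2-seed with no titles and a 3-seed with five.
team_a = {"name": "Team A", "seed": 2, "titles": 0}
team_b = {"name": "Team B", "seed": 3, "titles": 5}

print(pick_winner(team_a, team_b, round_num=1))  # Team A
print(pick_winner(team_a, team_b, round_num=4))  # Team B
```

In round 1 the better seed wins outright, while from round 2 onward the 3-seed’s five titles outweigh the one-line seed gap.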
Keep tweaking the inputs and see how things change. You now have full liberty to be a mad scientist in the lab. When you are done you will get a nice image of your bracket, and you may be surprised at your results!
Borzello, J. (2018, August 22). NCAA announces new ranking system instead of RPI. ESPN.Com. https://www.espn.com/mens-college-basketball/story/_/id/24445390/ncaa-an...
Fonseca, J. G. S. S. (2017). MARCH MADNESS PREDICTION USING MACHINE LEARNING TECHNIQUES. NOVA Information Management School. https://run.unl.pt/bitstream/10362/33864/1/TGI0135.pdf
Lopez, M. J., & Matthews, G. (2014). Building an NCAA men’s basketball predictive model and quantifying its success. https://arxiv.org/pdf/1412.0248.pdf
nga-27. (2021, October 31). Bracketology: an NCAA Bracket Creator as a Custom AI/ML Pick Generator Platform. Github.Com. Retrieved March 5, 2022, from https://github.com/nga-27/Bracketology
Sportsipy: A free sports API written for python — sportsipy 0.1.0 documentation. (2021). Readthedocs.Io. https://sportsreference.readthedocs.io/en/stable/
Staff, O. S. (2021, March 19). How Much Is Bet On March Madness: Money Wagered On NCAA Tournament. Odds Shark. https://www.oddsshark.com/ncaab/march-madness/how-much-is-bet#:%7E:text=....