Remember the sports almanac in Back to the Future? Wouldn’t it be great to predict the winner of a sporting event accurately, without a time machine and its complications? That’s what Doorn, a Louisville-based software development company, set out to do using AI during March Madness this year.
Creating statistical algorithms and solutions is nothing new for Doorn. For the last 20 years, the company has helped researchers and healthcare groups solve problems with a portfolio of services. To expand its data science skill set, the Doorn team recently earned a professional certification in data science through Harvard University. For the certification’s final project, the team decided to see whether an Artificial Intelligence (AI) model could be built to accurately predict college basketball winners and point spreads during the NCAA tournament.
At the core of AI is an algorithm development approach known as machine learning. The high-level view of machine learning is this: first, gather as much data as you can reasonably acquire (depending on availability and cost). Next, separate your data into two sets, a training set and a test set. The split typically puts the majority of the data into the training set while leaving a smaller but reasonable number of records in the test set (say, a 90/10 ratio, depending on the amount of data you’ve gathered). Then do analysis, math, math, and more math on your training data set until you have hopefully created a winning prediction model. Finally, try out your shiny, new mathematical model to see how well it predicts the results in your test data set. If it predicts results successfully on both your training and test data sets, you may have a winner.
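The split step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the project: the function name `train_test_split`, the fixed seed, and the stand-in list of game IDs are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.10, seed=42):
    """Shuffle the records and return (training_set, test_set)."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in data: 1,000 game IDs rather than real game records.
games = list(range(1000))
train_games, test_games = train_test_split(games)
print(len(train_games), len(test_games))  # 900 100
```

Shuffling before splitting keeps the test set from being, for example, only the most recent games, and the model never sees the test records until it is time to grade it.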
For this project, data covering nearly 30,000 college basketball games from 2014-2017 for 351 NCAA Division I schools was acquired from Kaggle (generated by Sportradar). The data was then sliced, diced, and analyzed to develop predictive algorithms that fit the training set of games. The model’s first test was against the test set, which included 2,402 games whose results were already known. The model’s predicted winner was correct for 88.7% of the games (pretty nice!) while the point spread predictions were a bit less impressive, with an average prediction error of 6.4 points (below or above a game’s actual point spread). With the first test complete, the team decided to give the preliminary algorithm a second test by using it to try to predict the winners of March Madness. At this stage, the expectation was that the algorithm might pick winners relatively accurately but fail at determining the point spreads. How did it do?
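For readers curious how the two test-set figures above are typically computed, here is a small sketch. It assumes the "average prediction error" is a mean absolute error, which the description suggests but does not state outright. The function names and the four made-up games are hypothetical.

```python
def winner_accuracy(predicted_margins, actual_margins):
    """Fraction of games where the predicted margin has the same sign as the real one."""
    correct = sum(
        1 for p, a in zip(predicted_margins, actual_margins) if (p > 0) == (a > 0)
    )
    return correct / len(actual_margins)

def mean_absolute_error(predicted_margins, actual_margins):
    """Average size of the miss, in points, ignoring whether it was high or low."""
    misses = [abs(p - a) for p, a in zip(predicted_margins, actual_margins)]
    return sum(misses) / len(misses)

# Made-up margins (first team's score minus second team's score) for four games.
predicted = [5.0, -3.0, 10.0, 2.0]
actual = [7, -1, 4, -6]

print(winner_accuracy(predicted, actual))      # 0.75
print(mean_absolute_error(predicted, actual))  # 4.5
```

In this toy data the model gets three of four winners right, and its spread predictions miss by 4.5 points on average.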
Out of the gate the algorithm looked amazing, picking the winners of the first 3 games played in the tournament. The chance of picking 3 games in a row by luck (like flipping a coin and getting heads three times in a row) is 12.5%, so it looked like the algorithm was dialed in. However, as the games went on, parity and tournament surprises outsmarted the algorithm and eventually put the bracket to rest. The algorithm correctly picked just 69.4% of the games in the first rounds of action, but there was a hidden gem. The point spread predictions did pretty well (with a twist). Among games where the algorithm’s point spread prediction differed from the Vegas line by more than its average error of 6.4 points, there was a slight edge: the algorithm beat the Vegas spread 64% of the time (9 wins and 5 losses). This suggests the algorithm could provide a wagering advantage, not by trying to pick every game, but by picking only the ones where it disagrees with Vegas by more than 6.4 points. Who would you rather beat, your spouse’s bracket or Vegas?
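The arithmetic and the selection rule above are simple enough to sketch in Python. This is only an illustration under stated assumptions: the function names, the dictionary fields, and the three sample games are hypothetical, and the filter uses a strict "more than 6.4 points" comparison.

```python
AVERAGE_ERROR = 6.4  # the algorithm's average point spread error, in points

def chance_of_streak(n_games, p_win=0.5):
    """Probability of picking n_games winners in a row when each pick is a coin flip."""
    return p_win ** n_games

def games_worth_betting(games, threshold=AVERAGE_ERROR):
    """Keep only games where the model and Vegas disagree by more than the threshold."""
    return [g for g in games if abs(g["model_spread"] - g["vegas_spread"]) > threshold]

print(chance_of_streak(3))     # 0.125, i.e. 12.5%
print(round(9 / (9 + 5), 3))   # 0.643, the 9-5 record against the spread

# Hypothetical spreads for three games (not real tournament data).
sample_games = [
    {"game": "A vs B", "model_spread": 10.0, "vegas_spread": 2.5},
    {"game": "C vs D", "model_spread": -4.0, "vegas_spread": -1.0},
    {"game": "E vs F", "model_spread": 1.0, "vegas_spread": -6.0},
]
picks = games_worth_betting(sample_games)
print([g["game"] for g in picks])  # ['A vs B', 'E vs F']
```

Only the first and third sample games clear the threshold, so those are the only two the rule would wager on.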
It’s important to note that the results of the project were based on a small sample, and that only a limited, preliminary amount of effort went into developing and refining the algorithm. More work would need to be done before relying on it to predict the future.
How did Harvard staff and class peers receive the project? It earned a 100% score and the following comments:
“A simple but neat project. Great write up and report!”
“Report is excellent.”
Overall, for an algorithm built in short order, it did an agreeable job at making college basketball predictions, and again showed the potential of AI. Doorn plans to use their enhanced data science skills to help clients on existing projects and to generate new software solutions. From healthcare to sports betting, AI can impact and enhance offerings in any industry.
Doorn is a full-service IT company that specializes in developing software solutions for others. For 20 years, Doorn has been partnering with groups across the US to solve problems and support users around the world. Learn more about Doorn at: https://doornsa.com/
Kaggle is a data science community in which users can build and share projects as well as access data sources submitted by the Kaggle community. You can learn more about Kaggle at: https://www.kaggle.com/
Sportradar is the leading global provider of sports data intelligence. The nexus between sports and entertainment, Sportradar serves leagues, news media, consumer platforms and sports betting operators with deep insights and a suite of strategic solutions to help grow their businesses. You can learn more about Sportradar at: https://sportradar.us/