By far the best Machine Learning Model for predicting the NBA Champion
An NBA season is roughly divided into two parts. In the first part, the regular season, each NBA team plays 82 games over about six months. The teams that perform well qualify for the second part, the NBA Playoffs (preceded by the Play-In Tournament). The NBA Playoffs are an elimination tournament of best-of-seven series that crowns the Champion of the season. For an NBA fan, it is the most exciting and fun time of the year. The basketball community discusses who the tournament favorite is, which players will perform especially well, and what surprises there might be. Besides the countless bets fans place on their favorite team, many basketball experts try their hand at predicting the NBA Champion every year.
History shows that predicting the Champion before the start of the NBA Playoffs is almost impossible. Injuries are one reason: as soon as a single key player is injured, the probability of that team being eliminated increases enormously. So predicting the Champion, let alone betting on it, is completely pointless and comes down to pure luck, right?
Not quite! My research showed that some smart people have (almost) managed to predict a Champion using Machine Learning. They used classic NBA statistics such as points per game, assists per game or rebounds per game. Personally, I wanted a better result. To get there, I enriched the dataset with more advanced team statistics, as well as statistics that capture the strength of the individual players on a team.
In this article I give you an overview of how I managed to correctly predict the last 3 NBA Champions using Machine Learning. Since a detailed explanation of every step is not possible here, the article focuses on the exciting data science part, a short evaluation of the model, and an analysis of charts that highlight the most important statistics. If you want to study the code and all results in detail, check out the GitHub Repository.
Where did I get the data?
The source of my data is Basketball-Reference. On this great website you can find all the important statistics about the NBA and its history, which I extracted by automated HTML parsing.
Python libraries: urllib, numpy, pandas, requests, bs4
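To make the HTML parsing step a bit more concrete, here is a minimal sketch of how a statistics table can be turned into records. It is an illustration, not the exact code from the repository: it uses only Python's built-in html.parser instead of bs4 so that it runs without extra packages, and the class TableParser, the function parse_table and the sample table (including its team names and numbers) are made up for this example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table row (<tr>) as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample table; the real pages have many more columns.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>Team A</td><td>53</td><td>29</td></tr>
  <tr><td>Team B</td><td>51</td><td>31</td></tr>
</table>
"""

records = parse_table(SAMPLE_HTML)
for record in records:
    print(record)
```

All values come out as strings, so a real pipeline would still convert the numeric columns before loading them into pandas or a database.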
Where do I store the data?
All data about the teams and their regular seasons, as well as data on all NBA players, was aggregated and loaded into a database. Since the statistics were not fully tracked until the 80s and the data schema changes from season to season, I decided to use a cloud-based MongoDB database (a document-oriented NoSQL database).
Python libraries: urllib, numpy, pandas, pymongo, bs4
How did I define my Machine Learning Model?
The next paragraph is for the data geeks among us. The value the model predicts is calculated from the playoff standings. This calculation makes it possible to capture even subtle differences between playoff performances. The final model is therefore a regression model, which is evaluated using the ranking metric NDCG (Normalized Discounted Cumulative Gain).
Value = Champion Share Score = Win/Max(Win)
Python libraries: matplotlib, numpy, pandas, pymongo, seaborn, shap, sklearn, xgboost
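The target value and the ranking metric can be sketched in a few lines of plain Python. This is a simplified illustration under some assumptions: I read "Win" as a team's playoff wins and "Max(Win)" as the highest number of playoff wins in the same season, and the NDCG variant shown uses a linear gain with a log2 discount, which is one common definition. The function names and the four teams with their numbers are hypothetical.

```python
import math


def champion_share(playoff_wins):
    """Win / Max(Win): scale each team's playoff wins by the season maximum."""
    max_wins = max(playoff_wins.values())
    return {team: wins / max_wins for team, wins in playoff_wins.items()}


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(true_scores, predicted_scores, k):
    """Compare the predicted ordering of teams with the ideal ordering."""
    ranked_by_prediction = sorted(
        true_scores, key=lambda team: predicted_scores[team], reverse=True
    )
    gains = [true_scores[team] for team in ranked_by_prediction[:k]]
    ideal = sorted(true_scores.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical season: playoff wins per team (assumption: 16 wins = champion).
wins = {"Team A": 16, "Team B": 14, "Team C": 9, "Team D": 4}
true_share = champion_share(wins)  # Team A: 16/16 = 1.0, Team B: 14/16 = 0.875

# Hypothetical model output that swaps Team B and Team C in the ranking.
predicted = {"Team A": 0.9, "Team B": 0.6, "Team C": 0.7, "Team D": 0.1}

print(true_share)
print(ndcg_at_k(true_share, predicted, k=4))
```

A perfect ordering gives an NDCG of 1.0, and every swap near the top of the ranking costs more than a swap further down. That is why a ranking metric fits a model that mainly has to get the first few places right.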
What are my results?
To my delight, I managed to predict the last 3 NBA Champions using a tree-based Regressor. To understand the decision-making process of the Regressor, I analyzed it using the SHAP library. The following two charts show the results of this detailed model investigation. They are followed by two additional PowerBI charts that illustrate the overall picture of the results.
- SHAP Importance Plot
- SHAP Summary Dot Plot
- PowerBI Heat Table
- PowerBI Comparison Chart
The SHAP Importance Plot shows which statistics, called features in the Machine Learning context, have the highest influence on the Regressor. The feature Top_3_Conference_True (the team finished among the top 3 of its conference [East/West]) has the greatest influence, i.e. it is the most important. The second most important feature is count_playoff_games (the count of all NBA playoff games played by the players on the team). And so on…
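For readers who wonder where such a ranking comes from: assuming the plot is the usual SHAP bar chart, each feature is ranked by the mean absolute SHAP value across all rows. The sketch below reproduces that calculation in plain Python. The SHAP values in it are invented for illustration and are not the results of my model, and the function name mean_abs_shap is hypothetical.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by their mean absolute SHAP value (largest first)."""
    n_rows = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


features = ["Top_3_Conference_True", "count_playoff_games", "sum_mvp_shares"]

# Invented SHAP values: one row per team-season, one column per feature.
example_shap_values = [
    [0.30, 0.10, -0.05],
    [-0.20, 0.15, 0.05],
    [0.25, -0.20, 0.10],
]

ranking = mean_abs_shap(example_shap_values, features)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Because the absolute value is taken, a feature counts as important whether it pushes predictions up or down. The direction of the effect only becomes visible in the dot plot.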
The SHAP Summary Dot Plot can be seen as a more detailed version of the first plot. The features are still sorted by their impact on the Regressor in descending order, but the focus is now on how the magnitude of each individual feature value affects the prediction.
The Regressor will predict a higher Champion Share Score if…
- Top_3_Conference_True has a high value = 1
[the team finished in the top 3]
- count_playoff_games has a high value
[the team has a roster that has played many playoff games ⇒ playoff experience]
- Top_3_Conference_False has a low value = 0
[the flag for finishing outside the top 3 is not set, i.e. the team finished in the top 3]
- sum_mvp_shares has a high value
[the team has one or more players who have done well in the MVP voting in the past ⇒ elite players on the team]
- SRS, MOV, Offense Four Factors|eFG% and … have a high value
[advanced statistics that show the strength of a team; for more detail click here, search online or send me a message]
- L(1, 3, 6)YP has a high value
[the team has performed well in the playoffs for the past n years]
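Roster features such as count_playoff_games and sum_mvp_shares are aggregated from the statistics of the individual players. The sketch below shows the idea in plain Python. The player records, their field names (team, playoff_games, mvp_shares) and the function aggregate_roster_features are assumptions made for this example and do not reflect the actual schema of my database.

```python
from collections import defaultdict


def aggregate_roster_features(players):
    """Sum career playoff games and MVP shares over each team's roster."""
    totals = defaultdict(
        lambda: {"count_playoff_games": 0, "sum_mvp_shares": 0.0}
    )
    for player in players:
        team = totals[player["team"]]
        team["count_playoff_games"] += player["playoff_games"]
        team["sum_mvp_shares"] += player["mvp_shares"]
    return dict(totals)


# Hypothetical player records (career totals before the season starts).
roster = [
    {"name": "Player 1", "team": "Team A", "playoff_games": 120, "mvp_shares": 0.5},
    {"name": "Player 2", "team": "Team A", "playoff_games": 80, "mvp_shares": 0.25},
    {"name": "Player 3", "team": "Team B", "playoff_games": 10, "mvp_shares": 0.0},
]

team_features = aggregate_roster_features(roster)
for team, values in team_features.items():
    print(team, values)
```

A veteran roster with a former MVP candidate ends up with high values for both features, which is the pattern the Regressor rewards.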
The PowerBI Heat Table (season: 2022) shows the top 10 teams sorted by predicted Champion Share Score. In addition, the 10 most important features are color-coded according to their value. Together, the SHAP Summary Dot Plot and the Heat Table illustrate the nuances that differentiate the Golden State Warriors (predicted and true champion) from the other teams.
The PowerBI Comparison Chart (season: 2022) compares the actual Champion Share Score with the values predicted by the model. Overall, the model seems to predict the Champion Share Score quite well. Still, there are some outliers, such as the Boston Celtics, Miami Heat and Dallas Mavericks. One reason is that the model, as well as its evaluation, is tuned to get the top 5 places right. Another is that the model learns from regular season data and therefore cannot take injuries to very important players into account.
History shows that it is very difficult to predict the NBA Champion after the regular season, because a lot of unexpected things can happen in the playoffs. At first glance, it seemed impossible to make one or even several correct predictions using Machine Learning. However, a large data pool, combined with the idea of evaluating a Regressor using a ranking metric, produced surprisingly good results. The final result: the model correctly predicted all 3 NBA Champions of the last 3 seasons. This track record makes me confident that it will predict next year's NBA Champion correctly as well. In the end, only one question remains.
What is the secret formula to become NBA Champion?
The analysis can be roughly summarized in 4 components of the formula.
- The team must have good team spirit and play a good regular season.
- Both the players and the franchise need playoff experience.
- The team needs one or more superstars (elite players).
[sum_mvp_shares, sum_dpoy_shares, count_all_nba]
- The team should dominate other teams with efficient basketball/shooting.
[SRS, MOV, BLK_opp, Offense Four Factors|eFG%, ≥10, 2P%]