I have been playing basketball since the second grade and have been a Warriors fan since birth, so the NBA and basketball in general have always held a special place in my heart. Over time, statistics and analytics have played an increasingly large role in the world of basketball. With this project, and my NBA Player Position Classifier, I wanted to use my love of basketball to develop and practice new Data Science skills.
This classifier, specifically a Logistic Regression Classifier, is trained on statistics from nearly 19,000 player seasons from 1980-2017. I then test the classifier on statistics from the 2018-19 season, the 2020-21 season (skipping the season interrupted by the pandemic), the 2021-22 season, and the most recent season, 2022-23. Given that All Star voting involves non-expert (fan) voting and some level of subjectivity, hopefully the results will help reveal which players' All Star designation(s) match the data, which players were overlooked, and which players were perhaps overrated.
Logistic Regression is a good fit for this problem because the target is binary (All Star or not), nearly all of the features are numeric, and there are enough features that a multiple-regression approach suits them best.
The NBA tracks almost 50 different statistics for every player in the league. Many statistics are often unknown to most basketball fans, so using only the common statistics will make the most sense for everyone. Here are some basic definitions of the statistics I will be using in my classifier:
The training data comes from Kaggle. The test data comes from Basketball Reference:
Please refer to the Jupyter Notebook Viewer or the .ipynb file to view all the code for the classifier. The notebook with all the data cleaning can be seen here.
The source file contains all the functions used to clean/manipulate the data and DataFrames.
Since many features are used, a distance-based classifier like K-Nearest Neighbors would not be the most straightforward choice: distances become less informative and the model becomes harder to interpret as the number of dimensions grows. With this many numeric features, logistic regression was the logical choice. Additionally, recall was chosen over accuracy as the evaluation metric because I was most interested in capturing all of the All Star players in each season. I also looked at the X highest predicted probabilities for a given season, where X is the number of All Stars selected, to see how well the model could replicate the real-life selection process.
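As a rough illustration of the two evaluation ideas described here, the sketch below computes recall and the "top X" fraction on a made-up six-player season. The helper name `top_x_fraction_correct`, the toy numbers, and the 0.5 decision threshold are assumptions for the example, not code taken from the notebook.

```python
import numpy as np
from sklearn.metrics import recall_score


def top_x_fraction_correct(y_true, proba):
    """Fraction of the X highest-probability players who were real All Stars,
    where X is the number of actual All Stars in the season."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    x = int(y_true.sum())
    # Indices of the X largest predicted probabilities.
    top_idx = np.argsort(proba)[::-1][:x]
    return y_true[top_idx].mean()


# Toy season (made-up numbers): 6 players, 3 real All Stars.
y_true = np.array([1, 1, 1, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])

# Hard predictions at an assumed 0.5 threshold.
y_pred = (proba >= 0.5).astype(int)

season_recall = recall_score(y_true, y_pred)
season_top_x = top_x_fraction_correct(y_true, proba)
```

In this toy case both numbers equal 2/3: one real All Star falls below the threshold, and one non-All Star takes a spot among the three highest probabilities.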
The main hyperparameters in logistic regression involve the type and strength of regularization used. See below for the hyperparameters tested in grid search:
Parameters tested:
{"regr__C":np.logspace(-3,3,7), "regr__penalty":["l1","l2"]}
Best parameters:
{'regr__C': 100.0, 'regr__penalty': 'l2'}
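The `regr__` prefix in the grid suggests a scikit-learn `Pipeline` whose final step is named `regr`. A minimal sketch of such a search is shown below. The synthetic data, the `scale` step name, the `liblinear` solver (picked here because it accepts both `l1` and `l2`), the 5-fold setting, and recall as the scoring function are all assumptions for illustration, not necessarily the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the player-season data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # The step name "regr" matches the "regr__" prefix used in the grid.
    ("regr", LogisticRegression(solver="liblinear")),
])

# Same grid as listed above: 7 values of C times 2 penalties.
param_grid = {
    "regr__C": np.logspace(-3, 3, 7),
    "regr__penalty": ["l1", "l2"],
}

# Select the combination with the best cross-validated recall.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X, y)
best_params = search.best_params_
```

The best combination found on synthetic data will generally differ from the one reported above, since it depends on the data.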
All numeric features were scaled, and the only non-numeric feature, player position, was one-hot encoded. Another version was also trained in which numeric features were scaled with respect to the statistic’s specific season.
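One way to scale statistics with respect to their season, as in the second version, is to z-score each column within its season group. The sketch below uses pandas with made-up rows. The column names `season`, `pts`, and `ast` are hypothetical, and the notebook may implement this step differently.

```python
import pandas as pd

# Made-up player-season rows; column names are hypothetical.
df = pd.DataFrame({
    "season": [2019, 2019, 2019, 2021, 2021, 2021],
    "pts": [10.0, 20.0, 30.0, 12.0, 24.0, 36.0],
    "ast": [2.0, 4.0, 6.0, 1.0, 5.0, 9.0],
})
stat_cols = ["pts", "ast"]

# Z-score each statistic within its own season.
# ddof=0 uses the population standard deviation, as StandardScaler does.
scaled = df.groupby("season")[stat_cols].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

df_v2 = df.copy()
df_v2[stat_cols] = scaled
```

After this step each statistic has mean 0 within every season, so a player is compared with peers from the same season rather than with the whole 1980-2017 pool.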
Below is a table summarizing the performances of the different test sets. V2 references the model that was trained/tested with scaling the numeric features with respect to season.
Classifier | Season | Recall Score | Precision Score | Accuracy | Top Players % Correct |
---|---|---|---|---|---|
Logistic Regression | 2022-2023 | 1 | 0.574 | 0.962 | 0.778 |
Logistic Regression V2 | 2022-2023 | 1 | 0.303 | 0.884 | 0.778 |
Logistic Regression | 2021-2022 | 0.889 | 0.6 | 0.969 | 0.741 |
Logistic Regression | 2020-2021 | 0.963 | 0.553 | 0.955 | 0.741 |
Logistic Regression V2 | 2020-2021 | 1 | 0.397 | 0.916 | 0.741 |
Logistic Regression | 2018-2019 | 0.769 | 0.571 | 0.96 | 0.731 |
Logistic Regression V2 | 2018-2019 | 0.846 | 0.524 | 0.954 | 0.731 |
Logistic Regression V2 | 2021-2022 | 0.926 | 0.309 | 0.904 | 0.704 |
The classifier found most of each season's All Stars, with recall for the base model ranging from 0.769 (2018-19) to 1.0 (2022-23). When the players with the 26-27 highest predicted probabilities are taken to represent the 26-27 players chosen as All Stars, 70%+ of them were actual All Stars.
See below for some discussion of the model's predictions. See the table linked here for a full summary of the 4 NBA seasons used in testing and the results for each player.
There were only 6 players who were voted as All Stars in all 4 seasons (2018-19, 2020-21, 2021-22, and 2022-23) and were deemed “properly rated” in all 4 of them. In this case, “properly rated” means that the All Star voting matched the data and the logistic regression model's predictions.
The “overrated” players are instances where the classifier predicted a player wasn't an All Star when in reality they were. These cases could include players who contribute in ways that don't show up on the stat sheet or players who are particular fan favorites. For instance, in the 2018-19 season, future Hall of Famers and NBA champions Dirk Nowitzki and Dwyane Wade were voted as All Stars, but the classifier thought differently. That season was the last in the NBA for both Nowitzki and Wade, so while their stats may not have been up to par, their legendary careers earned them the designation.
In 2 of the 4 seasons used in testing (2018-19 and 2021-22), Khris Middleton fell under the “overrated” category. Middleton might not have the opportunity to shine in the stats while playing alongside generational talent Giannis Antetokounmpo, which would result in a false negative. Another notable “overrated” All Star is Draymond Green for the 2022 All Star Game. His selection may have had more to do with playing for that season's champions, the Golden State Warriors, alongside one of the best players in the league, Stephen Curry, than with his statistical production.
The “underrated” players are those that the classifier predicted to be all stars, but weren’t voted as all stars in reality. Perhaps the stats of such a player were exceptional, but other aspects like their winning percentages weren’t up to par. See below for the players from the 2018-19, 2020-21, 2021-22, and 2022-23 seasons that were deemed underrated and were never selected as an All Star in any of the 4 seasons:
For 2 seasons in a row, young guard Shai Gilgeous-Alexander was “underrated” in the eyes of the model. His team, the OKC Thunder, had a terrible record in 2021-22 and finished in 14th place in the Western Conference. This lack of winning, and the fact that OKC is a smaller market, may explain why Gilgeous-Alexander was overlooked in All Star voting. In the most recent season (2022-23), he was finally chosen as an All Star, perhaps because his team did somewhat better.
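The labels used in this discussion map onto the cells of a confusion matrix: “overrated” is a false negative, “underrated” is a false positive, and “properly rated” is a true positive. A small sketch of that mapping follows. The function name, the column names, and the "not an All Star" label for true negatives are assumptions for illustration.

```python
import pandas as pd


def rating_label(was_all_star, predicted_all_star):
    """Map an (actual, predicted) pair to the labels used in the discussion."""
    if was_all_star and not predicted_all_star:
        return "overrated"        # false negative
    if predicted_all_star and not was_all_star:
        return "underrated"       # false positive
    if was_all_star and predicted_all_star:
        return "properly rated"   # true positive
    return "not an All Star"      # true negative (label is an assumption)


# Made-up results for four anonymous players.
results = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "all_star": [1, 1, 0, 0],
    "predicted": [1, 0, 1, 0],
})
results["rating"] = [
    rating_label(a, p)
    for a, p in zip(results["all_star"], results["predicted"])
]
```

Grouping such a table by player across the test seasons would give the per-player summaries discussed above, for example players who were underrated in consecutive seasons.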
This personal project was made solely to apply the Python skills I have learned so far and to learn new ones. It is intended for non-commercial use only.