💻 Projects


Below are descriptions of my Data Science projects, along with the skills and technologies used for each. To see the associated code and a more in-depth description of the work, visit each project's GitHub repository.


⚖️ Logistic Regression and Fairness Audit - Civilian Complaints Against the NYPD

GitHub Repository

Project Summary

For my final project in an elective called Fairness and Algorithmic Decision Making, I explored potential inequities in how the Civilian Complaint Review Board (CCRB) investigates civilian complaints made against New York Police Department officers. Using logistic regression, I modeled the CCRB's decision-making process, which determines whether a complaint against an officer is substantiated (leading to repercussions for the officer) or not. I then audited the model by assessing whether it satisfies various fairness metrics, such as demographic parity, when comparing complaint rulings across complainant ethnicities. With this investigation, I hoped to demonstrate the importance of auditing Data Science models and several methods for doing so. Please see the full paper or the repository linked above for more detail on the data, model, and audit performed.
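For a flavor of the approach, here is a minimal sketch of the model plus a demographic parity check. The file and column names (`ccrb_complaints.csv`, `substantiated`, `complainant_ethnicity`) are hypothetical stand-ins for the actual CCRB data, and the feature set is simplified.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical file/column names; the real dataset and features differ.
df = pd.read_csv("ccrb_complaints.csv")
X = pd.get_dummies(df.drop(columns=["substantiated"]))
y = df["substantiated"]  # 1 if the CCRB substantiated the complaint

X_train, X_test, y_train, y_test, eth_train, eth_test = train_test_split(
    X, y, df["complainant_ethnicity"], test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Demographic parity asks whether the rate of predicted substantiation is
# (roughly) equal across complainant ethnicity groups.
preds = pd.Series(model.predict(X_test), index=y_test.index)
print(preds.groupby(eth_test).mean())
```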

Skills

🧠 Logistic Regression

🧠 Feature Engineering

🧠 Hyperparameter Tuning

🧠 Model Evaluation

🧠 Fairness Auditing

🧠 Data Cleaning

Technologies Used

⚙️ Python

⚙️ scikit-learn

⚙️ Pandas

⚙️ Matplotlib

⚙️ Jupyter Notebooks

Key Findings

💡 Performed data wrangling, statistical tests, and a fairness audit to identify and investigate inequities in how allegations against police officers are ruled on based on complainant ethnicity.

💡 Evaluated model fairness by testing different classification thresholds, checking which parity measures each threshold satisfies and calculating model utility at each (see the sketch after this list).

💡 Emphasized the need for fairness audits of Data Science applications, especially when models deal with human data or when the preferred model changes depending on which metrics are prioritized.
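The threshold sweep mentioned above might look something like the following, reusing `model`, `X_test`, `y_test`, and `eth_test` from the earlier sketch; the parity gap and accuracy here are simplified stand-ins for the measures used in the actual audit.

```python
import numpy as np
import pandas as pd

scores = model.predict_proba(X_test)[:, 1]  # predicted P(substantiated)

for t in np.linspace(0.1, 0.9, 9):
    preds = (scores >= t).astype(int)
    # Demographic parity gap: spread of positive-prediction rates across groups.
    rates = pd.Series(preds, index=y_test.index).groupby(eth_test).mean()
    gap = rates.max() - rates.min()
    acc = (preds == y_test.to_numpy()).mean()  # a simple utility proxy
    print(f"threshold={t:.1f}  parity gap={gap:.3f}  accuracy={acc:.3f}")
```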


🏀 K-Nearest Neighbors - Predicting NBA All-Stars

GitHub Repository

Project Summary

Becoming an NBA All-Star has major implications for a player's contract, career, and legacy. Currently, All-Star starters are chosen by a weighted vote: 50% comes from fans, and the remaining 50% is split evenly between current players and selected members of the media (reserves are then picked by head coaches). This process leaves plenty of room for subjectivity, and many deserving players may be overlooked for reasons unrelated to their play on the court. With this project, I wanted to investigate how well player statistics alone could determine who should be an All-Star, identify underrated or overrated players, and perhaps make a case for the NBA to introduce some form of objective, numeric measure into the All-Star voting process.
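As a rough illustration of the pipeline (not the project's exact code), here is a sketch of tuning k with cross-validated grid search; the file name `player_stats.csv` and the stat columns are hypothetical placeholders.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

stats = pd.read_csv("player_stats.csv")  # hypothetical file name
X = stats[["pts", "reb", "ast", "stl", "blk"]]  # hypothetical per-game stats
y = stats["all_star"]  # 1 if the player was selected as an All-Star

# Scale first: KNN is distance-based, so unscaled stats (e.g. points vs.
# blocks per game) would contribute unevenly to the distance metric.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 26)},
    scoring="recall",  # prioritize catching actual All-Stars
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```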

Skills

🧠 K-Nearest Neighbors Classification

🧠 Hyperparameter Tuning

🧠 Model Training - Cross-validation

🧠 Data Cleaning

Technologies Used

⚙️ Python

⚙️ scikit-learn

⚙️ Pandas

⚙️ Matplotlib

⚙️ Jupyter Notebooks

Key Findings

💡 Achieved recall of 74% for the 2021 season and 67% for the 2022 season, capturing a majority of the All-Stars selected in those seasons.

💡 Examined underrated players (false positives: stats said All-Star, voters said no) and overrated players (false negatives: voted in despite their stats) to show potential flaws and inconsistencies in the current voting process (see the sketch below).
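A sketch of how those lists could be pulled from the confusion cases, continuing the tuned `grid` from the previous sketch; `stats_2021`, `X_2021`, and `y_2021` are hypothetical holdout data for a single season (indices assumed aligned).

```python
from sklearn.metrics import recall_score

preds = grid.predict(X_2021)
print("recall:", recall_score(y_2021, preds))

# False positives: the model says All-Star, the voters said no -> underrated.
underrated = stats_2021.loc[(preds == 1) & (y_2021 == 0), "player"]
# False negatives: voted in despite the stats -> overrated.
overrated = stats_2021.loc[(preds == 0) & (y_2021 == 1), "player"]
print("underrated:", underrated.tolist())
print("overrated:", overrated.tolist())
```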


🏀 Data Visualization - Highlighting Cool NBA Facts

GitHub Repository

Project Summary

I created five visualizations illustrating interesting NBA statistics, such as how many NBA players come from each state. View the visualizations at this link.

Skills

🧠 Data Visualization

Technologies Used

⚙️ D3

⚙️ JavaScript

⚙️ HTML