💻 Projects
See below for descriptions of my Data Science projects and the skills and technologies used in each. To see the associated code and a more in-depth description of the work done for each project, navigate to that project's GitHub repository.

⚖️ Logistic Regression and Fairness Audit - Civilian Complaints Against the NYPD
GitHub Repository
Project Summary
For my final project in an elective called Fairness and Algorithmic Decision Making, I explored potential inequities in how the Civilian Complaint Review Board (CCRB) investigates civilian complaints made against New York Police Department officers. Using Logistic Regression, I modeled the CCRB's decision-making process, which determines whether a complaint against an officer is substantiated (leading to repercussions for the officer) or not. I then audited the model by assessing whether it meets various fairness metrics, such as demographic parity, when comparing the ruling on a complaint against the complainant's ethnicity. With this investigation, I hoped to demonstrate the importance of auditing Data Science models and several methods for doing so. Please see the full paper or the repository linked above for more details on the data, model, and audit performed.
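To give a flavor of what the audit step looks like in code, here is a minimal, illustrative sketch (not the project's actual code): it fits a logistic regression on complaint features and compares predicted substantiation rates across complainant ethnicity groups, i.e., a demographic parity check. The file name and column names are hypothetical placeholders.

```python
# Minimal, illustrative demographic parity check for a complaint-outcome classifier.
# The file and column names are placeholders, not the project's actual schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("ccrb_complaints.csv")                    # hypothetical cleaned dataset
X = pd.get_dummies(df.drop(columns=["substantiated"]), drop_first=True)
y = df["substantiated"]                                    # 1 = complaint substantiated

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = pd.Series(model.predict(X_test), index=X_test.index)

# Demographic parity: the rate of predicted substantiation should be similar across groups.
groups = df.loc[X_test.index, "complainant_ethnicity"]
print(preds.groupby(groups).mean())
```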
Skills
🧠 Logistic Regression
🧠 Feature Engineering
🧠 Hyperparameter Tuning
🧠 Model Evaluation
🧠 Fairness Auditing
🧠 Data Cleaning
Technologies Used
⚙️ Python
⚙️ scikit-learn
⚙️ Pandas
⚙️ Matplotlib
⚙️ Jupyter Notebooks
Key Findings
💡 Performed data wrangling, statistical tests, and a fairness audit to identify and investigate inequities in how allegations against police officers are ruled on based on complainant ethnicity.
💡 Evaluated model fairness by testing different decision thresholds against various parity measures and by calculating model utility.
💡 Emphasized the need for fairness audits of Data Science applications, especially when models deal with human data or when model performance can change depending on priorities.

🏀 K-Nearest Neighbors - Predicting NBA All Stars
GitHub Repository
Project Summary
Becoming an NBA All Star has several implications for players' contracts, careers, and legacies. Currently, All Star selection is 50% dependent on fan votes, with the remaining 50% split evenly between current players and selected members of the media. This process leaves a lot of room for subjectivity, and many deserving players may be overlooked for reasons unrelated to their play on the court. With this project, I wanted to investigate how well player statistics alone could determine who should be an All Star, identify underrated or overrated players, and perhaps make a case for the NBA to introduce some form of objective, numeric measure into the All Star voting process.
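As a rough sketch of the modeling approach (cross-validated tuning of k for a K-Nearest Neighbors classifier), the snippet below is illustrative only; the file name and feature columns are placeholders rather than the project's actual pipeline.

```python
# Tune k for a KNN All-Star classifier with cross-validation.
# File and column names are placeholders for illustration only.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

stats = pd.read_csv("player_stats.csv")           # hypothetical per-player season stats
X = stats[["pts", "reb", "ast", "stl", "blk"]]    # assumed feature columns
y = stats["all_star"]                             # 1 if the player was selected as an All Star

# Scale features (KNN is distance-based) and search over k, scoring on recall.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": list(range(3, 31, 2))},
                    scoring="recall", cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)        # best k and its cross-validated recall
```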
Skills
🧠 K-Nearest Neighbors Classification
🧠 Hyperparameter Tuning
🧠 Model Training - Cross-validation
🧠 Data Cleaning
Technologies Used
⚙️ Python
⚙️ scikit-learn
⚙️ Pandas
⚙️ Matplotlib
⚙️ Jupyter Notebooks
Key Findings
💡 Achieved recall of 74% for the 2021 season and 67% for the 2022 season, capturing a majority of the All Stars selected in those seasons.
💡 Examined underrated players (false positives) and overrated players (false negatives) to show potential flaws and inconsistencies in the current voting process.

📊 Data Visualization - Highlighting Cool NBA Facts
GitHub Repository
Project Summary
I created 5 visualizations to illustrate interesting NBA statistics, such as how many NBA players come from each state. View the visualizations at this link.
Skills
🧠 Data visualization
Technologies Used
⚙️ D3
⚙️ JavaScript
⚙️ HTML

🎵 Natural Language Processing and Web Scraping - Has Hip Hop Gotten Worse?
GitHub Repository
Project Summary
This was one of my first personal projects, completed during Summer 2020. It gave me a way to practice my burgeoning Data Science and coding skills while diving into a personal interest of mine: hip-hop and rap music. I sought an objective way of determining whether the quality of hip-hop music has declined over time.
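The snippet below sketches one way such a question could be framed: computing a simple lexical-diversity proxy for each song's lyrics and regressing it against release year. The metric, file name, and column names are assumptions for illustration, not necessarily the approach used in the repository.

```python
# Illustrative only: regress a lexical-diversity proxy for lyric "quality" on release year.
# The metric, file, and column names are assumptions, not the project's exact approach.
import pandas as pd
from scipy import stats

songs = pd.read_csv("hiphop_lyrics.csv")          # hypothetical scraped lyrics dataset

def unique_word_ratio(lyrics: str) -> float:
    """Share of distinct words in a song's lyrics, a crude complexity proxy."""
    words = lyrics.lower().split()
    return len(set(words)) / len(words) if words else 0.0

songs["complexity"] = songs["lyrics"].apply(unique_word_ratio)

# A negative, significant slope would suggest lyrical complexity declining over time.
slope, intercept, r, p, stderr = stats.linregress(songs["year"], songs["complexity"])
print(f"slope per year = {slope:.5f}, p-value = {p:.3g}")
```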
Skills
🧠 Data wrangling (web scraping)
🧠 Data cleaning
🧠 Natural Language Processing
🧠 Linear Regression
🧠 Data visualization
Technologies Used
⚙️ Python
⚙️ Pandas
⚙️ Matplotlib
⚙️ SciPy
⚙️ Jupyter Notebooks

👩🏻‍💻🤖 SQL Interview Helper App
GitHub Repository
Project Summary
The SQL Interview Helper uses an LLM to give users personalized feedback on SQL practice questions. After learning about deep learning and large language models and experimenting with GPT-3.5 as a student in Fall 2022, I revamped my final project in December 2025 to use a recent version of the OpenAI SDK, GPT-5 nano, and a Streamlit user interface.
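Below is a minimal sketch of the app's core loop, assuming the current OpenAI Python SDK and a Streamlit front end. The prompts, the single few-shot example, and the model string are simplified placeholders rather than the app's actual code.

```python
# Minimal sketch: Streamlit UI + OpenAI SDK for personalized SQL feedback.
# Prompts, the few-shot example, and question handling are simplified placeholders.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a SQL interview coach. Given a practice question and a candidate's "
    "query, point out correctness issues and suggest improvements."
)

st.title("SQL Interview Helper")
question = st.text_area("Practice question")
answer = st.text_area("Your SQL answer")

if st.button("Get feedback") and question and answer:
    response = client.chat.completions.create(
        model="gpt-5-nano",  # model named in the project summary; string is an assumption
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            # One illustrative few-shot example (the real app's examples differ).
            {"role": "user", "content": "Question: Select all columns from employees.\nAnswer: SELECT * FROM employees;"},
            {"role": "assistant", "content": "Correct. SELECT * returns every column; consider listing columns explicitly in production queries."},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    st.write(response.choices[0].message.content)
```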
Skills
🧠 Data wrangling (web scraping)
🧠 Prompt engineering
🧠 LLM integration
🧠 Web app building
🧠 Object-oriented programming
Technologies Used
⚙️ OpenAI SDK
⚙️ Streamlit
⚙️ Python
Key Findings
💡 Paired publicly available SQL practice questions with a GPT LLM to give users instant, personalized feedback on their SQL, so they don't have to pay for additional services or set up their own SQL database.
💡 Used few-shot prompting with the OpenAI SDK.
💡 Built the web app with Streamlit and object-oriented programming to provide a simple, seamless user experience.

🏀 Random Forest and XGBoost - Predicting NBA Player Position
GitHub Repository
Project Summary
This was a personal project I did in undergrad as a way to dive deeper into both basketball and data science. It uses player statistics to classify players as a guard, forward, or center with two Machine Learning algorithms: random forest and XGBoost. I went in with the hypothesis that if a player is truly positionless, an ML model would not be able to classify him correctly. Please see the repository linked above for more details on the data, models, and analysis performed.
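For illustration, here is a minimal sketch of comparing the two algorithms with cross-validation; the file name, feature columns, and position encoding are placeholders, not the project's actual pipeline.

```python
# Sketch: classify players into Guard / Forward / Center with two tree ensembles.
# File, feature, and label names are placeholders for illustration only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

stats = pd.read_csv("player_stats.csv")                  # hypothetical season stats
X = stats[["pts", "reb", "ast", "3pa", "blk", "height"]]
y = stats["position"].map({"G": 0, "F": 1, "C": 2})      # encode the three positions

for name, model in [
    ("Random Forest", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("XGBoost", XGBClassifier(n_estimators=300, learning_rate=0.1)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```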
Skills
🧠 Decision Trees
🧠 Feature Engineering
🧠 Hyperparameter Tuning
🧠 Model Evaluation
Technologies Used
⚙️ Python
⚙️ scikit-learn
⚙️ Pandas
⚙️ Jupyter Notebooks
Key Findings
💡 Investigated both XGBoost and Random Forest algorithms to combat overfitting.
💡 Increased accuracy by 30 percentage points (~40% to ~70%) with feature engineering and feature selection.
💡 Found that model accuracy decreases over time and that certain players, like Nikola Jokić and LeBron James, get misclassified consistently, indicating how basketball is evolving into a “positionless” sport.

🧬 Bash/Python/ETL/Regression - Transcriptome-Wide Association Studies for Finding Genes Associated with IBD
GitHub Repository
Project Summary
This was my capstone project, completed with 3 other classmates as part of my [B.S. in Data Science at UC San Diego]({{site.baseurl}}/education). We leveraged Transcriptome-Wide Association Studies to identify genes associated with inflammatory bowel disease (IBD) using various genetic data, bioinformatics tools, and simple linear regression. In the end, we identified 7 genes that are highly associated with IBD. Such techniques have implications for preventative healthcare, where they could be employed to help diagnose individuals before the onset of symptoms or disease.
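As a very rough illustration of the final association step only (regressing disease status on predicted gene expression, one gene at a time), here is a simplified sketch. A real TWAS pipeline involves genotype data, expression-prediction weights, covariate correction, and bioinformatics tooling that this omits, and the file and column names are placeholders.

```python
# Highly simplified sketch of a per-gene association test: regress IBD status on
# predicted expression for each gene. Real TWAS pipelines involve genotype data,
# expression weight models, and covariate correction that are omitted here.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("predicted_expression.csv")    # hypothetical: one column per gene + "ibd"
gene_cols = [c for c in data.columns if c != "ibd"]

results = []
for gene in gene_cols:
    X = sm.add_constant(data[gene])               # intercept plus predicted expression
    fit = sm.OLS(data["ibd"], X).fit()            # simple linear regression, as in the summary
    results.append((gene, fit.params[gene], fit.pvalues[gene]))

assoc = pd.DataFrame(results, columns=["gene", "beta", "p_value"]).sort_values("p_value")
print(assoc.head(10))                             # genes most strongly associated with IBD
```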
Skills
🧠 Genetic Data Wrangling
🧠 Linear Regression
🧠 Data Visualization
Technologies Used
⚙️ Python
⚙️ Bash
⚙️ Docker
Key Findings
💡 Identified 7 genes that are highly associated with IBD.
💡 Techniques and findings have implications for preventative healthcare, where they could be employed to help diagnose individuals before the onset of symptoms or disease.


