Overview
A multi-dataset churn prediction system that goes beyond a single accuracy number: the goal is to understand which features drive churn at the individual customer level. Three industry datasets are used (telecom, SaaS, banking), each with a different class imbalance ratio and feature set. Three classifiers are compared: Logistic Regression (baseline), Random Forest (900 estimators), and XGBoost (2000 estimators). SHAP (SHapley Additive exPlanations) is applied to every trained model to produce feature-importance visualizations that explain the predictions rather than only reporting them. Academic project under Prof. Dr. Ruixiang Tang.
Architecture & Approach
Each dataset goes through a three-stage CLI pipeline: train (python -m src.train), evaluate (python -m src.eval), and explain (python -m src.explain). Training uses an 80/20 stratified split with class_weight='balanced' for LR and RF, and a computed scale_pos_weight for XGBoost. XGBoost is the only model subjected to RandomizedSearchCV (5-fold stratified CV, roc_auc scoring) with dataset-specific search grids: SaaS uses 120 iterations with a wider grid due to small sample size; Telco and BankChurners use 80 iterations. Evaluation produces ROC curves and confusion matrices. The explain step uses TreeExplainer for tree-based models and LinearExplainer for LR, generating both a mean-|SHAP| bar chart (top 15 features) and a beeswarm plot showing per-customer contribution distributions.
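The imbalance handling described above comes down to a few lines of arithmetic. The following is a minimal sketch in plain Python, not the project's actual code: the function names and the toy labels are hypothetical, the split is hand-rolled rather than taken from scikit-learn, and the negative-to-positive ratio used for scale_pos_weight is an assumption (it is the common heuristic, but the write-up does not state the formula). The balanced weights follow the formula scikit-learn documents for class_weight='balanced'.

```python
from collections import Counter
import random


def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx) so each part keeps the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # per-class share of the test set
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count), as in class_weight='balanced'."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def scale_pos_weight(labels):
    """Assumed heuristic: number of negatives divided by number of positives."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# Hypothetical toy labels: 80 retained customers (0), 20 churned (1).
y = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(y)
y_train = [y[i] for i in train_idx]

weights = balanced_class_weights(y_train)  # computed on the training part only
spw = scale_pos_weight(y_train)
```

Both quantities are computed from the training labels only, so the held-out 20% never influences the weighting. With a 4:1 imbalance the minority class ends up weighted four times as heavily as the majority class under either scheme.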
Results & Outcome
XGBoost achieved AUC ≥ 0.85 on the Telco and BankChurners datasets, where strong predictive signals exist (contract type, tenure, transaction patterns). Performance on the SaaS dataset was lower because of its small sample (963 rows) and limited pre-churn behavioral indicators. SHAP analysis revealed that contract type and customer tenure dominate churn risk in the telecom data, while credit limit utilization and transaction frequency are the strongest signals in the banking data.