In this project, I built an Artificial Neural Network (ANN) for binary classification to predict bank customer churn. The model identifies customers likely to leave the bank based on their demographic and account information, so that retention efforts can be targeted proactively.
I selected the feature columns (columns [:, 3:-1]) as input variables and extracted churn status as the binary target variable using iloc[].
I applied Label Encoding to convert the Gender column to numerical format, implemented One-Hot Encoding for the Geography column to handle multiple categories, and used ColumnTransformer to apply different encodings to specific columns.
I built a Sequential ANN with three layers: two hidden layers (6 units each, ReLU activation) and an output layer (1 unit, sigmoid activation). It was compiled with the Adam optimizer and binary cross-entropy loss, and trained for 100 epochs with a batch size of 32.
# Importing the libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
# Import dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values
print("Features (X):") # Table 1
print(X)
print("\nTarget variable (y):") # Table 2
print(y)
# Encoding categorical data (encoding gender column) (Table 3)
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])
print("\nAfter Label Encoding Gender:") # Table 3
print(X)
# Encoding categorical data (One Hot encoding geography column) (Table 4)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print("\nAfter One-Hot Encoding Geography:") # Table 4
print(X)
# Splitting the dataset into Training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Building ANN
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
# Compiling the ANN
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Training ANN (Table 5)
history = ann.fit(X_train, y_train, batch_size=32, epochs=100)
# Display training results
print("\nTraining completed. Final training accuracy:", history.history['accuracy'][-1])
# Predicting results of a single observation
# Note: The input should match the preprocessing (one-hot encoded geography + other features)
sample_prediction = ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]))
print("\nSingle prediction probability:", sample_prediction[0][0])
print("Single prediction (churn if > 0.5):", bool(sample_prediction[0][0] > 0.5))
# Predicting Test set results (Table 6)
y_pred = ann.predict(X_test)
y_pred_binary = (y_pred > 0.5)
print("\nPredictions vs Actual (first 20 samples):") # Table 6
comparison = np.concatenate((y_pred_binary.reshape(len(y_pred_binary), 1),
                             y_test.reshape(len(y_test), 1)), axis=1)
print("Predicted | Actual")
print(comparison[:20])
# Making the Confusion Matrix (Table 7)
cm = confusion_matrix(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)
print("\nConfusion Matrix:") # Table 7
print(cm)
print(f"\nAccuracy Score: {accuracy:.4f}")
# Additional metrics for better evaluation
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_binary))
# Model summary
print("\nModel Architecture:")
ann.summary()
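The sigmoid output unit and binary cross-entropy loss used in the model above can be illustrated with a few lines of NumPy. This is only a minimal sketch of the arithmetic, not what Keras runs internally: the logits and labels are made-up values, and the `eps` clipping constant is an assumption chosen for numerical safety.

```python
import numpy as np


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; eps (an assumed constant) keeps log() away from 0."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


# Hypothetical logits for three customers (illustrative values, not model output)
logits = np.array([0.0, 2.0, -2.0])
labels = np.array([1, 1, 0])  # 1 = churned, 0 = stayed

probs = sigmoid(logits)             # predicted churn probabilities
churn_flags = probs > 0.5           # same 0.5 threshold as in the script
loss = binary_crossentropy(labels, probs)

print("Probabilities:", probs)
print("Churn flags:", churn_flags)
print("Mean binary cross-entropy:", loss)
```

A logit of exactly 0 maps to a probability of 0.5, which the strict `> 0.5` threshold treats as "no churn". Confident, correct predictions contribute little to the loss, while uncertain ones contribute more.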
The main challenge was handling mixed categorical and numerical data types efficiently. Initially, I struggled with applying different encoding methods to different columns simultaneously. After experimenting with various approaches, I discovered ColumnTransformer, which allowed me to apply One-Hot Encoding to the Geography column while passing the remaining columns through unchanged, which simplified the preprocessing pipeline.
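As a rough illustration of that idea, the sketch below applies two different encoders to two columns in a single ColumnTransformer. It is not the project's exact pipeline: it uses OrdinalEncoder in place of LabelEncoder for Gender (LabelEncoder is designed for target labels and does not fit inside a ColumnTransformer), the tiny DataFrame is made-up data, and `sparse_threshold=0` is set only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data with the same kind of mixed columns as the churn dataset
df = pd.DataFrame({
    "CreditScore": [600, 720, 650],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Male", "Female", "Female"],
    "Age": [40, 35, 52],
})

ct = ColumnTransformer(
    transformers=[
        ("geo", OneHotEncoder(), ["Geography"]),   # one column per country
        ("gender", OrdinalEncoder(), ["Gender"]),  # Female -> 0, Male -> 1
    ],
    remainder="passthrough",  # keep CreditScore and Age unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

# Output column order: one-hot Geography, encoded Gender, then passthrough columns
encoded = ct.fit_transform(df)
print(encoded)

# A new raw observation goes through the same fitted transformer
new_customer = pd.DataFrame(
    [{"CreditScore": 600, "Geography": "France", "Gender": "Male", "Age": 40}]
)
encoded_new = ct.transform(new_customer)
print(encoded_new)
```

Reusing the fitted transformer for new observations avoids hand-building the one-hot columns, which is where mistakes in column order tend to creep in.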