Bank Customer Segmentation

📽️ Bank Customer Segmentation

Analysed bank customer data to segment customers using K-Means clustering, with PCA used to reduce dimensionality for visualisation. This approach produced 7 well-defined customer segments based on key financial behaviours, which the bank can use to market tailored products and services to its customers.

💻 Tech Stack:

Python for machine learning model development and comparison

Scikit-learn for Clustering (KMeans) and PCA

Pandas for data manipulation

Seaborn for data visualisation

Matplotlib for data visualisation and model performance analysis

🧪 Data Pipeline:

Load & inspect data: Loaded the dataset using pd.read_csv(), checked for nulls and reviewed data types using .info() and .describe().

Exploratory Analysis: Removed the customer ID column (CUST_ID). Filled missing values in MINIMUM_PAYMENTS and CREDIT_LIMIT with the column mean. Used pair plots and distribution plots to understand feature distributions and detect outliers.

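Feature selection & scaling: Kept the remaining numerical columns (such as PURCHASES, CREDIT_LIMIT and MINIMUM_PAYMENTS) and standardised them with StandardScaler, so that features measured on larger scales do not dominate the distance calculations in K-Means.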

Clustering with KMeans: Applied the Elbow Method to determine the optimal number of clusters and used KMeans to group customers.

Visualisation: Projected the scaled features onto two PCA components and created a scatter plot coloured by cluster label to visualise customer groupings based on spending behaviour.
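The elbow in a WCSS curve can be ambiguous, so a second criterion is sometimes checked alongside it. The sketch below is not part of the original analysis: it swaps in the silhouette score as a complementary check and runs on synthetic data from `make_blobs` as a stand-in for the scaled customer features. The helper name `score_cluster_counts`, the blob settings and the range of k values are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (assumption: 4 blobs, 5 features).
X, _ = make_blobs(n_samples=600, centers=4, n_features=5,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)


def score_cluster_counts(data, k_values, random_state=42):
    """Return {k: (wcss, silhouette)} for each candidate cluster count."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init='k-means++', n_init=10,
                       random_state=random_state)
        labels = model.fit_predict(data)
        # inertia_ is the within-cluster sum of squares used by the Elbow Method
        results[k] = (model.inertia_, silhouette_score(data, labels))
    return results


# Silhouette needs at least 2 clusters, so the range starts at 2.
scores = score_cluster_counts(X_scaled, range(2, 9))
for k, (wcss, sil) in scores.items():
    print(f"k={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")

best_k = max(scores, key=lambda k: scores[k][1])
print("Highest silhouette at k =", best_k)
```

WCSS always falls as k grows, whereas the silhouette score peaks where clusters are both compact and well separated. Reading the two together can help when the elbow alone leaves several candidate values of k.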

📊 Code Snippets & Visualisations:

    # Import the libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score

    # Load the data
    credit_card_df = pd.read_csv('/content/4.+Marketing_data.csv')
    credit_card_df.info()

    # Display descriptive statistics (Table 1)
    credit_card_df.describe()

    # Comments from analysis:
    # - Mean balance is $1564
    # - Balance is updated frequently (BALANCE_FREQUENCY averages ~0.9)
    # - Purchases average is $1000, one-off purchase average is ~$600
    # - Average purchases frequency is around 0.5
    # - Average ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY
    #   and CASH_ADVANCE_FREQUENCY are generally low
    # - Average credit limit ~4500, percent of full payment is 15%,
    #   average tenure is 11 years

    # Check for missing data (Table 2)
    credit_card_df.isnull().sum()

    # Replace the missing 'MINIMUM_PAYMENTS' values with the column mean
    # (assigning back avoids chained in-place fillna, which newer pandas versions warn about)
    credit_card_df['MINIMUM_PAYMENTS'] = credit_card_df['MINIMUM_PAYMENTS'].fillna(
        credit_card_df['MINIMUM_PAYMENTS'].mean())

    # Replace the missing 'CREDIT_LIMIT' values with the column mean
    credit_card_df['CREDIT_LIMIT'] = credit_card_df['CREDIT_LIMIT'].fillna(
        credit_card_df['CREDIT_LIMIT'].mean())

    # Plot to confirm no missing data remains (Figure 1)
    sns.heatmap(credit_card_df.isnull(), yticklabels=False, cbar=False, cmap='Reds')
    plt.show()

    # Check for duplicate entries
    credit_card_df.duplicated().sum()

    # Remove Customer ID
    credit_card_df.drop('CUST_ID', axis=1, inplace=True)

    # Define function to create subplots of distplots with KDE for the columns
    def dist_plots(dataframe):
        fig, ax = plt.subplots(nrows=7, ncols=2, figsize=(15, 30))
        index = 0
        for row in range(7):
            for col in range(2):
                if index < dataframe.shape[1]:  # Guard against running past the last column
                    # Note: sns.distplot is deprecated in recent seaborn releases
                    sns.distplot(dataframe.iloc[:, index], ax=ax[row][col],
                                 kde_kws={'color': 'blue', 'lw': 3, 'label': 'KDE'},
                                 hist_kws={'histtype': 'step', 'lw': 3, 'color': 'green'})
                index += 1
        plt.tight_layout()
        plt.show()

    # Visualise distplots (Figure 2)
    dist_plots(credit_card_df)

    # Analysis comments:
    # - 'BALANCE_FREQUENCY' is close to 1 for most customers (balance updated frequently)
    # - For 'PURCHASES_FREQUENCY', there are two distinct groups of customers
    # - For 'ONEOFF_PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY',
    #   most users don't make one-off or installment purchases frequently
    # - Very small number of customers pay their balance in full ('PRC_FULL_PAYMENT' ~0)
    # - Mean of balance is $1500, credit limit average is around $4500,
    #   most customers have ~11 years tenure

    # Heatmap to visualise correlations (Figure 3)
    correlations = credit_card_df.corr()
    plt.figure(figsize=(20, 20))
    sns.heatmap(correlations, annot=True)
    plt.show()

    # Analysis comments:
    # - 'PURCHASES' is highly correlated with one-off purchases, installment purchases,
    #   purchase transactions, credit limit and payments
    # - Strong positive correlation between 'PURCHASES_FREQUENCY' and
    #   'PURCHASES_INSTALLMENTS_FREQUENCY'

    # Note: The following section appears to be for classification, but
    # X_train, X_test, y_train, y_test are not defined.
    # You may need to add train_test_split and define your features and target variable.
    # Training Model on the Training set
    # classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
    # classifier.fit(X_train, y_train)
    # Evaluating using confusion matrix
    # y_pred = classifier.predict(X_test)
    # cm = confusion_matrix(y_test, y_pred)
    # print(cm)
    # accuracy_score(y_test, y_pred)

    # Display first few rows
    credit_card_df.head()

    # Apply feature scaling
    scaler = StandardScaler()
    credit_card_df_scaled = scaler.fit_transform(credit_card_df)

    # Display scaled data
    print(credit_card_df_scaled)

    # Use Elbow Method to find optimal number of clusters (Figure 4)
    wcss = []
    for i in range(1, 20):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(credit_card_df_scaled)
        wcss.append(kmeans.inertia_)

    plt.plot(range(1, 20), wcss, 'bx-')
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('Score (WCSS)')
    plt.show()

    # Analysis comment:
    # The 4th cluster seems to form the elbow of the curve. However, the values
    # do not reduce linearly until the 8th cluster, so the number of clusters is set to 7.

    # Train K-Means with 7 clusters
    kmeans = KMeans(n_clusters=7, init='k-means++', random_state=42)
    kmeans.fit(credit_card_df_scaled)
    labels = kmeans.labels_

    # Use Principal Component Analysis to reduce dimensionality
    pca = PCA(n_components=2)
    principalComp = pca.fit_transform(credit_card_df_scaled)
    print(principalComp)

    # Create a dataframe with the two components
    pca_df = pd.DataFrame(data=principalComp, columns=['PCA1', 'PCA2'])
    print(pca_df)

    # Concatenate the cluster labels to the dataframe (Table 3)
    pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis=1)
    pca_df.head()

    # Visualise clusters in the space of the first two principal components (Figure 5)
    plt.figure(figsize=(10, 10))
    ax = sns.scatterplot(x='PCA1', y='PCA2', hue='cluster', data=pca_df, palette='tab10')
    plt.title('K-Means clusters projected onto two PCA components')
    plt.show()

Data frame showing bank customer characteristics — **Table 1** Customer Data Frame

Table showing missing data patterns in customer records — **Table 2** Missing Data Analysis

Visual confirmation of complete data after cleaning — **Figure 1** Data Completeness Verification

Distribution plots of customer attributes — **Figure 2** Feature Distribution Analysis

Heatmap showing correlations between features — **Figure 3** Feature Correlation Heatmap

Elbow method plot for optimal cluster determination — **Figure 4** Optimal Cluster Determination

Principal Component Analysis results table — **Table 3** PCA Component Analysis

Visualization of bank customer clusters — **Figure 5** Customer Cluster Visualization

🌟 Key Insights:

High-income earners tend to be low spenders, and low-income earners tend to be high spenders.

Customer spending behaviour is strongly differentiated by frequency of purchases and reliance on cash advances. Some customer groups showed heavy instalment purchases but minimal one-off spending, revealing clear segmentation potential for tailored credit card offers.
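One way to read behaviour off the clusters is to map the K-Means centroids back into the original feature units. The following is a minimal sketch under stated assumptions, not the project's actual profiling step: the data is synthetic, the two artificial groups (`frequent_buyers`, `cash_advance_users`) and their value ranges are invented for illustration, and only two of the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200  # customers per artificial group (assumption)

# Synthetic group 1: buys often, rarely takes cash advances.
frequent_buyers = pd.DataFrame({
    'PURCHASES_FREQUENCY': rng.uniform(0.7, 1.0, n),
    'CASH_ADVANCE_FREQUENCY': rng.uniform(0.0, 0.1, n),
})
# Synthetic group 2: rarely buys, relies on cash advances.
cash_advance_users = pd.DataFrame({
    'PURCHASES_FREQUENCY': rng.uniform(0.0, 0.2, n),
    'CASH_ADVANCE_FREQUENCY': rng.uniform(0.4, 0.8, n),
})
demo_df = pd.concat([frequent_buyers, cash_advance_users], ignore_index=True)

# Scale, then cluster in the scaled space, as in the pipeline above.
scaler = StandardScaler()
scaled = scaler.fit_transform(demo_df)
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10,
                random_state=42).fit(scaled)

# Undo the scaling so each centroid is expressed in the original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=demo_df.columns,
)
centroids['n_customers'] = np.bincount(kmeans.labels_, minlength=2)
print(centroids.round(2))
```

Each row of `centroids` is the average customer of one segment, which makes statements such as "frequent purchasers with little reliance on cash advances" directly checkable against the numbers.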

🧗🏾 Challenge Faced:

Initial visualisations of K-Means clusters were ambiguous due to the high dimensionality of features. Reducing dimensions with PCA made it easier to see meaningful separation, but it required balancing between retaining variance and simplifying complexity. I resolved this by examining explained variance ratios and adjusting the number of components accordingly.
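The trade-off described above can be inspected through the `explained_variance_ratio_` attribute of a fitted `PCA`. This is a hedged sketch on synthetic data rather than the project's own numbers: the feature count (10), the three latent factors and the 90% `threshold` are arbitrary assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 10 correlated features driven by 3 latent factors plus noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.3 * rng.normal(size=(500, 10))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see how variance is distributed.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for i, (ratio, total) in enumerate(
        zip(pca.explained_variance_ratio_, cumulative), start=1):
    print(f"PC{i}: {ratio:.3f} of variance, {total:.3f} cumulative")

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # hypothetical target, not a value from the project
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} of variance: {n_needed}")
```

Two components are convenient for a scatter plot, but the cumulative ratio shows how much of the structure that plot leaves out, which is useful context when clusters appear to overlap in two dimensions.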
